Обсуждение: WAL replay issue from 9.6.8 to 9.6.10

Поиск
Список
Период
Сортировка

WAL replay issue from 9.6.8 to 9.6.10

От
Dave Peticolas
Дата:
Hello, I'm seeing some issues with WAL replay on a test server running 9.6.10 using WAL archived from a 9.6.8 primary server. It reliably PANICs during replay with messages like so:

WARNING:  page 1209270272 of relation base/16422/47496599 does not exist
CONTEXT:  xlog redo at 4810/C84F8A0 for Btree/DELETE: 88 items
PANIC:  WAL contains references to invalid pages

This is after restoring from a snapshot that completed successfully as far as I can tell.

I downgraded the test server to 9.6.8 and I was able to replay WAL past the point where 9.6.10 would PANIC and crash. Curiously, 9.6.8 PANICed the first time, but then when I restarted it was able to replay through.

Should I expect this to just not work until the primary database is upgraded to 9.6.10? I'm a little wary to upgrade the primary at this point until I understand what might be going on.

thanks,
dave

Re: WAL replay issue from 9.6.8 to 9.6.10

От
Andreas Kretschmer
Дата:
On 29 August 2018 06:02:45 CEST, Dave Peticolas <dave@krondo.com> wrote:
>Hello, I'm seeing some issues with WAL replay on a test server running
>9.6.10 using WAL archived from a 9.6.8 primary server. It reliably
>

Can you tell us the wal-level?

Regards, Andreas


--
2ndQuadrant - The PostgreSQL Support Company


Re: WAL replay issue from 9.6.8 to 9.6.10

От
Dave Peticolas
Дата:
On Tue, Aug 28, 2018 at 9:21 PM Andreas Kretschmer <andreas@a-kretschmer.de> wrote:
On 29 August 2018 06:02:45 CEST, Dave Peticolas <dave@krondo.com> wrote:
>Hello, I'm seeing some issues with WAL replay on a test server running
>9.6.10 using WAL archived from a 9.6.8 primary server. It reliably
>

Can you tell us the wal-level?

Yes, it is "hot_standby". 

Re: WAL replay issue from 9.6.8 to 9.6.10

От
Andreas Kretschmer
Дата:
On 29 August 2018 06:26:06 CEST, Dave Peticolas <dave@krondo.com> wrote:
>On Tue, Aug 28, 2018 at 9:21 PM Andreas Kretschmer
><andreas@a-kretschmer.de>
>wrote:
>
>> On 29 August 2018 06:02:45 CEST, Dave Peticolas <dave@krondo.com>
>wrote:
>> >Hello, I'm seeing some issues with WAL replay on a test server
>running
>> >9.6.10 using WAL archived from a 9.6.8 primary server. It reliably
>> >
>>
>> Can you tell us the wal-level?
>>
>
>Yes, it is "hot_standby".

ok, this was to just rule out a problem that we discovered with logical decoding, but this isn't the case here.


Regards, Andreas


--
2ndQuadrant - The PostgreSQL Support Company


Re: WAL replay issue from 9.6.8 to 9.6.10

От
Alexander Kukushkin
Дата:
Hi,


2018-08-29 6:02 GMT+02:00 Dave Peticolas <dave@krondo.com>:
> Hello, I'm seeing some issues with WAL replay on a test server running
> 9.6.10 using WAL archived from a 9.6.8 primary server. It reliably PANICs
> during replay with messages like so:
>
> WARNING:  page 1209270272 of relation base/16422/47496599 does not exist
> CONTEXT:  xlog redo at 4810/C84F8A0 for Btree/DELETE: 88 items
> PANIC:  WAL contains references to invalid pages


it looks like you are hitting pretty much the same problem as I:
https://www.postgresql.org/message-id/flat/153492341830.1368.3936905691758473953%40wrigleys.postgresql.org
The only major difference, you are restoring from the backup, while in
my case the host running replica has crashed.
Also in my case, the primary was already running 9.6.10.


In my case, it also panics during "Btree/DELETE: XYZ items" and page
number of relation is insanely huge.


Re: WAL replay issue from 9.6.8 to 9.6.10

От
Michael Paquier
Дата:
On Wed, Aug 29, 2018 at 08:31:50AM +0200, Alexander Kukushkin wrote:
> 2018-08-29 6:02 GMT+02:00 Dave Peticolas <dave@krondo.com>:
>> Hello, I'm seeing some issues with WAL replay on a test server running
>> 9.6.10 using WAL archived from a 9.6.8 primary server. It reliably PANICs
>> during replay with messages like so:
>>
>> WARNING:  page 1209270272 of relation base/16422/47496599 does not exist
>> CONTEXT:  xlog redo at 4810/C84F8A0 for Btree/DELETE: 88 items
>> PANIC:  WAL contains references to invalid pages
>
>
> it looks like you are hitting pretty much the same problem as I:
> https://www.postgresql.org/message-id/flat/153492341830.1368.3936905691758473953%40wrigleys.postgresql.org
> The only major difference, you are restoring from the backup, while in
> my case the host running replica has crashed.
> Also in my case, the primary was already running 9.6.10.
>
> In my case, it also panics during "Btree/DELETE: XYZ items" and page
> number of relation is insanely huge.

That would be the same problem.  Dave, do you have a background worker
running in parallel or some read-only workload with backends doing
read-only operations on a standby once it has reached a consistent
point?
--
Michael

Вложения

Re: WAL replay issue from 9.6.8 to 9.6.10

От
Dave Peticolas
Дата:
On Wed, Aug 29, 2018 at 4:54 AM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Aug 29, 2018 at 08:31:50AM +0200, Alexander Kukushkin wrote:
> 2018-08-29 6:02 GMT+02:00 Dave Peticolas <dave@krondo.com>:
>> Hello, I'm seeing some issues with WAL replay on a test server running
>> 9.6.10 using WAL archived from a 9.6.8 primary server. It reliably PANICs
>> during replay with messages like so:
>>
>> WARNING:  page 1209270272 of relation base/16422/47496599 does not exist
>> CONTEXT:  xlog redo at 4810/C84F8A0 for Btree/DELETE: 88 items
>> PANIC:  WAL contains references to invalid pages
>
>
> it looks like you are hitting pretty much the same problem as I:
> https://www.postgresql.org/message-id/flat/153492341830.1368.3936905691758473953%40wrigleys.postgresql.org
> The only major difference, you are restoring from the backup, while in
> my case the host running replica has crashed.
> Also in my case, the primary was already running 9.6.10.
>
> In my case, it also panics during "Btree/DELETE: XYZ items" and page
> number of relation is insanely huge.

That would be the same problem.  Dave, do you have a background worker
running in parallel or some read-only workload with backends doing
read-only operations on a standby once it has reached a consistent
point?


Oh, perhaps I do, depending on what you mean by worker. There are a couple of periodic processes that connect to the server to obtain metrics. Is that what is triggering this issue? In my case I could probably suspend them until the replay has reached the desired point.

I have noticed this behavior in the past but prior to 9.6.10 restarting the server would fix the issue. And the replay always seemed to reach a point past which the problem would not re-occur.

dave

Re: WAL replay issue from 9.6.8 to 9.6.10

От
Michael Paquier
Дата:
On Wed, Aug 29, 2018 at 09:15:29AM -0700, Dave Peticolas wrote:
> Oh, perhaps I do, depending on what you mean by worker. There are a couple
> of periodic processes that connect to the server to obtain metrics. Is that
> what is triggering this issue? In my case I could probably suspend them
> until the replay has reached the desired point.

That would be it.  How do you decide when those begin to run and connect
to Postgres.  Do you use pg_isready or similar in a loop for sanity
checks?

> I have noticed this behavior in the past but prior to 9.6.10 restarting the
> server would fix the issue. And the replay always seemed to reach a point
> past which the problem would not re-occur.

You are picking my interest here.  Did you actually see the same
problem?  In 9.6.10 what happens is that I have tightened the consistent
point checks and logic so as inconsistent page issues would actually
show up when they should, and that those become reproducible so as we
can track down any rogue WAL record or inconsistent behavior.
--
Michael

Вложения

Re: WAL replay issue from 9.6.8 to 9.6.10

От
Dave Peticolas
Дата:
On Wed, Aug 29, 2018 at 1:50 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Aug 29, 2018 at 09:15:29AM -0700, Dave Peticolas wrote:
> Oh, perhaps I do, depending on what you mean by worker. There are a couple
> of periodic processes that connect to the server to obtain metrics. Is that
> what is triggering this issue? In my case I could probably suspend them
> until the replay has reached the desired point.

That would be it.  How do you decide when those begin to run and connect
to Postgres.  Do you use pg_isready or similar in a loop for sanity
checks?

I do not, they just try to connect and bail if they cannot.
 
> I have noticed this behavior in the past but prior to 9.6.10 restarting the
> server would fix the issue. And the replay always seemed to reach a point
> past which the problem would not re-occur.

You are picking my interest here.  Did you actually see the same
problem?  In 9.6.10 what happens is that I have tightened the consistent
point checks and logic so as inconsistent page issues would actually
show up when they should, and that those become reproducible so as we
can track down any rogue WAL record or inconsistent behavior.

Yes, I've seen this problem occasionally in the past. I think only in the 9.6 series. But before 9.6.10, if I restarted the server it would start replaying WAL again and typically when it reached the point where it PANICed before, instead it would report a consistent state and allow read-only connections. Sometimes it would then PANIC again after more WAL was replayed. But eventually it would reach a point where it seemed to be able to replay WAL indefinitely without the issue happening.

dave