Обсуждение: WAL replay issue from 9.6.8 to 9.6.10
Hello, I'm seeing some issues with WAL replay on a test server running 9.6.10 using WAL archived from a 9.6.8 primary server. It reliably PANICs during replay with messages like so:
WARNING: page 1209270272 of relation base/16422/47496599 does not exist
CONTEXT: xlog redo at 4810/C84F8A0 for Btree/DELETE: 88 items
PANIC: WAL contains references to invalid pages
This is after restoring from a snapshot that completed successfully as far as I can tell.
I downgraded the test server to 9.6.8 and I was able to replay WAL past the point where 9.6.10 would PANIC and crash. Curiously, 9.6.8 PANICed the first time, but then when I restarted it was able to replay through.
Should I expect this to just not work until the primary database is upgraded to 9.6.10? I'm a little wary to upgrade the primary at this point until I understand what might be going on.
dave
On 29 August 2018 06:02:45 CEST, Dave Peticolas <dave@krondo.com> wrote: >Hello, I'm seeing some issues with WAL replay on a test server running >9.6.10 using WAL archived from a 9.6.8 primary server. It reliably > Can you tell us the wal-level? Regards, Andreas -- 2ndQuadrant - The PostgreSQL Support Company
On Tue, Aug 28, 2018 at 9:21 PM Andreas Kretschmer <andreas@a-kretschmer.de> wrote:
On 29 August 2018 06:02:45 CEST, Dave Peticolas <dave@krondo.com> wrote:
>Hello, I'm seeing some issues with WAL replay on a test server running
>9.6.10 using WAL archived from a 9.6.8 primary server. It reliably
>
Can you tell us the wal-level?
Yes, it is "hot_standby".
On 29 August 2018 06:26:06 CEST, Dave Peticolas <dave@krondo.com> wrote: >On Tue, Aug 28, 2018 at 9:21 PM Andreas Kretschmer ><andreas@a-kretschmer.de> >wrote: > >> On 29 August 2018 06:02:45 CEST, Dave Peticolas <dave@krondo.com> >wrote: >> >Hello, I'm seeing some issues with WAL replay on a test server >running >> >9.6.10 using WAL archived from a 9.6.8 primary server. It reliably >> > >> >> Can you tell us the wal-level? >> > >Yes, it is "hot_standby". ok, this was to just rule out a problem that we discovered with logical decoding, but this isn't the case here. Regards, Andreas -- 2ndQuadrant - The PostgreSQL Support Company
Hi, 2018-08-29 6:02 GMT+02:00 Dave Peticolas <dave@krondo.com>: > Hello, I'm seeing some issues with WAL replay on a test server running > 9.6.10 using WAL archived from a 9.6.8 primary server. It reliably PANICs > during replay with messages like so: > > WARNING: page 1209270272 of relation base/16422/47496599 does not exist > CONTEXT: xlog redo at 4810/C84F8A0 for Btree/DELETE: 88 items > PANIC: WAL contains references to invalid pages it looks like you are hitting pretty much the same problem as I: https://www.postgresql.org/message-id/flat/153492341830.1368.3936905691758473953%40wrigleys.postgresql.org The only major difference, you are restoring from the backup, while in my case the host running replica has crashed. Also in my case, the primary was already running 9.6.10. In my case, it also panics during "Btree/DELETE: XYZ items" and page number of relation is insanely huge.
On Wed, Aug 29, 2018 at 08:31:50AM +0200, Alexander Kukushkin wrote: > 2018-08-29 6:02 GMT+02:00 Dave Peticolas <dave@krondo.com>: >> Hello, I'm seeing some issues with WAL replay on a test server running >> 9.6.10 using WAL archived from a 9.6.8 primary server. It reliably PANICs >> during replay with messages like so: >> >> WARNING: page 1209270272 of relation base/16422/47496599 does not exist >> CONTEXT: xlog redo at 4810/C84F8A0 for Btree/DELETE: 88 items >> PANIC: WAL contains references to invalid pages > > > it looks like you are hitting pretty much the same problem as I: > https://www.postgresql.org/message-id/flat/153492341830.1368.3936905691758473953%40wrigleys.postgresql.org > The only major difference, you are restoring from the backup, while in > my case the host running replica has crashed. > Also in my case, the primary was already running 9.6.10. > > In my case, it also panics during "Btree/DELETE: XYZ items" and page > number of relation is insanely huge. That would be the same problem. Dave, do you have a background worker running in parallel or some read-only workload with backends doing read-only operations on a standby once it has reached a consistent point? -- Michael
Вложения
On Wed, Aug 29, 2018 at 4:54 AM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Aug 29, 2018 at 08:31:50AM +0200, Alexander Kukushkin wrote:
> 2018-08-29 6:02 GMT+02:00 Dave Peticolas <dave@krondo.com>:
>> Hello, I'm seeing some issues with WAL replay on a test server running
>> 9.6.10 using WAL archived from a 9.6.8 primary server. It reliably PANICs
>> during replay with messages like so:
>>
>> WARNING: page 1209270272 of relation base/16422/47496599 does not exist
>> CONTEXT: xlog redo at 4810/C84F8A0 for Btree/DELETE: 88 items
>> PANIC: WAL contains references to invalid pages
>
>
> it looks like you are hitting pretty much the same problem as I:
> https://www.postgresql.org/message-id/flat/153492341830.1368.3936905691758473953%40wrigleys.postgresql.org
> The only major difference, you are restoring from the backup, while in
> my case the host running replica has crashed.
> Also in my case, the primary was already running 9.6.10.
>
> In my case, it also panics during "Btree/DELETE: XYZ items" and page
> number of relation is insanely huge.
That would be the same problem. Dave, do you have a background worker
running in parallel or some read-only workload with backends doing
read-only operations on a standby once it has reached a consistent
point?
Oh, perhaps I do, depending on what you mean by worker. There are a couple of periodic processes that connect to the server to obtain metrics. Is that what is triggering this issue? In my case I could probably suspend them until the replay has reached the desired point.
I have noticed this behavior in the past but prior to 9.6.10 restarting the server would fix the issue. And the replay always seemed to reach a point past which the problem would not re-occur.
dave
On Wed, Aug 29, 2018 at 09:15:29AM -0700, Dave Peticolas wrote: > Oh, perhaps I do, depending on what you mean by worker. There are a couple > of periodic processes that connect to the server to obtain metrics. Is that > what is triggering this issue? In my case I could probably suspend them > until the replay has reached the desired point. That would be it. How do you decide when those begin to run and connect to Postgres. Do you use pg_isready or similar in a loop for sanity checks? > I have noticed this behavior in the past but prior to 9.6.10 restarting the > server would fix the issue. And the replay always seemed to reach a point > past which the problem would not re-occur. You are picking my interest here. Did you actually see the same problem? In 9.6.10 what happens is that I have tightened the consistent point checks and logic so as inconsistent page issues would actually show up when they should, and that those become reproducible so as we can track down any rogue WAL record or inconsistent behavior. -- Michael
Вложения
On Wed, Aug 29, 2018 at 1:50 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Aug 29, 2018 at 09:15:29AM -0700, Dave Peticolas wrote:
> Oh, perhaps I do, depending on what you mean by worker. There are a couple
> of periodic processes that connect to the server to obtain metrics. Is that
> what is triggering this issue? In my case I could probably suspend them
> until the replay has reached the desired point.
That would be it. How do you decide when those begin to run and connect
to Postgres. Do you use pg_isready or similar in a loop for sanity
checks?
I do not, they just try to connect and bail if they cannot.
> I have noticed this behavior in the past but prior to 9.6.10 restarting the
> server would fix the issue. And the replay always seemed to reach a point
> past which the problem would not re-occur.
You are picking my interest here. Did you actually see the same
problem? In 9.6.10 what happens is that I have tightened the consistent
point checks and logic so as inconsistent page issues would actually
show up when they should, and that those become reproducible so as we
can track down any rogue WAL record or inconsistent behavior.
Yes, I've seen this problem occasionally in the past. I think only in the 9.6 series. But before 9.6.10, if I restarted the server it would start replaying WAL again and typically when it reached the point where it PANICed before, instead it would report a consistent state and allow read-only connections. Sometimes it would then PANIC again after more WAL was replayed. But eventually it would reach a point where it seemed to be able to replay WAL indefinitely without the issue happening.
dave