Обсуждение: requested timeline ... does not contain minimum recovery point ...

Поиск
Список
Период
Сортировка

requested timeline ... does not contain minimum recovery point ...

От
Christophe Pettus
Дата:
PostgreSQL 9.6.9, Windows Server 2012 Datacenter (64-bit).

We're trying to diagnose the error:

    requested timeline 105 does not contain minimum recovery point A58/6B109F28 on timeline 103

The error occurs when a WAL-shipping (not streaming) secondary starts up.

These two machines have been part of a stress-test where, repeatedly, the secondary is promoted, the old primary is
rewoundusing pg_rewind, and then attached to the new primary.  This has worked for multiple iterations, but this error
poppedup.  The last cycle was particularly fast: the new primary was only up for about 10 seconds (although it had
completedrecovery) before being shut down again, and pg_rewind applied to it to reconnect it with the promoted
secondary.

--
-- Christophe Pettus
   xof@thebuild.com



Re: requested timeline ... does not contain minimum recovery point...

От
Andres Freund
Дата:
Hi,

On 2018-07-12 10:20:06 -0700, Christophe Pettus wrote:
> PostgreSQL 9.6.9, Windows Server 2012 Datacenter (64-bit).
> 
> We're trying to diagnose the error:
> 
>     requested timeline 105 does not contain minimum recovery point A58/6B109F28 on timeline 103
> 
> The error occurs when a WAL-shipping (not streaming) secondary starts up.
> 
> These two machines have been part of a stress-test where, repeatedly, the secondary is promoted, the old primary is
rewoundusing pg_rewind, and then attached to the new primary.  This has worked for multiple iterations, but this error
poppedup.  The last cycle was particularly fast: the new primary was only up for about 10 seconds (although it had
completedrecovery) before being shut down again, and pg_rewind applied to it to reconnect it with the promoted
secondary.

This needs a lot more information before somebody can reasonably act on
it.

Greetings,

Andres Freund


Re: requested timeline ... does not contain minimum recovery point...

От
Christophe Pettus
Дата:
> On Jul 12, 2018, at 10:29, Andres Freund <andres@anarazel.de> wrote:
>
> This needs a lot more information before somebody can reasonably act on
> it.

Happy to provide, of course!

The two relevant hosts are "Ash" and "Chi".  As mentioned, they've been flipped back and forth repeatedly using
pg_rewind:One will be promoted, the other pg_rewind'd, and then brought up as a WAL shipping secondary to the first.
Thisprocedure has worked repeatedly, but this last instance failed.  The logs from the failure are attached below,
alongwith the output of from pg_controldata for both hosts.  The systems are still in this configuration, so we can
gathermore as required. 

What surprises me about the error is that while the recovery point seems reasonable, it shouldn't be on timeline 103,
buton timeline 105. 


Failover to Ash:
2018-07-10 19:14:22 UTC [2072]: [5153-1] user=,db=,app=,client= LOG:  received promote request
2018-07-10 19:14:22 UTC [2072]: [5154-1] user=,db=,app=,client= LOG:  redo done at A58/4F0822D0
2018-07-10 19:14:22 UTC [2072]: [5155-1] user=,db=,app=,client= LOG:  last completed transaction was at log time
2018-07-1019:13:54.515121+00 
2018-07-10 19:14:23 UTC [2072]: [5156-1] user=,db=,app=,client= LOG:  restored log file "0000006700000A580000004F" from
archive
2018-07-10 19:14:23 UTC [2072]: [5157-1] user=,db=,app=,client= LOG:  selected new timeline ID: 104
2018-07-10 19:14:24 UTC [2072]: [5158-1] user=,db=,app=,client= LOG:  restored log file "00000067.history" from archive

[Chi rewound using pg_rewind against Ash, brought up]

Chi:
2018-07-10 19:26:05 UTC [3260]: [4-1] user=,db=,app=,client= LOG:  restored log file "00000068.history" from archive
2018-07-10 19:26:06 UTC [3260]: [5-1] user=,db=,app=,client= LOG:  redo starts at A58/4E061EF8
2018-07-10 19:26:07 UTC [3260]: [6-1] user=,db=,app=,client= LOG:  restored log file "0000006800000A5800000050" from
archive
2018-07-10 19:26:08 UTC [3260]: [7-1] user=,db=,app=,client= LOG:  restored log file "0000006800000A5800000051" from
archive
2018-07-10 19:26:09 UTC [3260]: [8-1] user=,db=,app=,client= LOG:  restored log file "0000006800000A5800000052" from
archive
2018-07-10 19:26:10 UTC [3260]: [9-1] user=,db=,app=,client= LOG:  restored log file "0000006800000A5800000053" from
archive
2018-07-10 19:26:11 UTC [3260]: [10-1] user=,db=,app=,client= LOG:  restored log file "0000006800000A5800000054" from
archive
2018-07-10 19:26:12 UTC [3260]: [11-1] user=,db=,app=,client= LOG:  restored log file "0000006800000A5800000055" from
archive
2018-07-10 19:26:13 UTC [3260]: [12-1] user=,db=,app=,client= LOG:  restored log file "0000006800000A5800000056" from
archive
2018-07-10 19:26:13 UTC [3260]: [13-1] user=,db=,app=,client= LOG:  restored log file "0000006800000A5800000057" from
archive
2018-07-10 19:26:14 UTC [3260]: [14-1] user=,db=,app=,client= LOG:  restored log file "0000006800000A5800000058" from
archive
2018-07-10 19:26:15 UTC [3260]: [15-1] user=,db=,app=,client= LOG:  restored log file "0000006800000A5800000059" from
archive
2018-07-10 19:26:15 UTC [3260]: [16-1] user=,db=,app=,client= LOG:  restored log file "0000006800000A580000005A" from
archive
2018-07-10 19:26:16 UTC [3260]: [17-1] user=,db=,app=,client= LOG:  restored log file "0000006800000A580000005B" from
archive
2018-07-10 19:26:17 UTC [3260]: [18-1] user=,db=,app=,client= LOG:  restored log file "0000006800000A580000005C" from
archive
2018-07-10 19:27:28 UTC [3260]: [19-1] user=,db=,app=,client= LOG:  restored log file "0000006800000A580000005D" from
archive
2018-07-10 19:27:28 UTC [3260]: [20-1] user=,db=,app=,client= LOG:  consistent recovery state reached at A58/5D01BCA8
2018-07-10 19:27:28 UTC [2564]: [3-1] user=,db=,app=,client= LOG:  database system is ready to accept read only
connections
2018-07-10 19:28:28 UTC [3260]: [21-1] user=,db=,app=,client= LOG:  restored log file "0000006800000A580000005E" from
archive

Chi promoted:
2018-07-10 19:28:37 UTC [3260]: [22-1] user=,db=,app=,client= LOG:  received promote request
2018-07-10 19:28:37 UTC [3260]: [23-1] user=,db=,app=,client= LOG:  redo done at A58/5E0817D0
2018-07-10 19:28:37 UTC [3260]: [24-1] user=,db=,app=,client= LOG:  last completed transaction was at log time
2018-07-1019:28:12.954559+00 
2018-07-10 19:28:37 UTC [3260]: [25-1] user=,db=,app=,client= LOG:  restored log file "0000006800000A580000005E" from
archive
2018-07-10 19:28:37 UTC [3260]: [26-1] user=,db=,app=,client= LOG:  selected new timeline ID: 105
2018-07-10 19:28:38 UTC [3260]: [27-1] user=,db=,app=,client= LOG:  restored log file "00000068.history" from archive
2018-07-10 19:28:38 UTC [3260]: [28-1] user=,db=,app=,client= LOG:  archive recovery complete
2018-07-10 19:28:38 UTC [3260]: [29-1] user=,db=,app=,client= LOG:  MultiXact member wraparound protections are now
enabled
2018-07-10 19:28:38 UTC [5068]: [1-1] user=,db=,app=,client= LOG:  checkpoint starting: force
2018-07-10 19:28:38 UTC [2564]: [4-1] user=,db=,app=,client= LOG:  database system is ready to accept connections

[Ash rewound using pg_rewind against Chi, brought up]

Ash:
2018-07-10 19:40:09 UTC [520]: [3-1] user=,db=,app=,client= LOG:  restored log file "00000069.history" from archive
2018-07-10 19:40:10 UTC [520]: [4-1] user=,db=,app=,client= LOG:  entering standby mode
2018-07-10 19:40:10 UTC [520]: [5-1] user=,db=,app=,client= LOG:  restored log file "00000069.history" from archive
2018-07-10 19:40:12 UTC [520]: [6-1] user=,db=,app=,client= LOG:  restored log file "0000006700000A580000004E" from
archive
2018-07-10 19:40:12 UTC [520]: [7-1] user=,db=,app=,client= FATAL:  requested timeline 105 does not contain minimum
recoverypoint A58/6B109F28 on timeline 103 

--
From Ash:

C:\Windows\system32>pg_controldata -D e:\pgdata
pg_control version number:            960
Catalog version number:               201608131
Database system identifier:           6377726782605679660
Database cluster state:               in archive recovery
pg_control last modified:             7/10/2018 7:28:38 PM
Latest checkpoint location:           A58/4E0689F0
Prior checkpoint location:            A58/51E0EF50
Latest checkpoint's REDO location:    A58/4E061EF8
Latest checkpoint's REDO WAL file:    0000006700000A580000004E
Latest checkpoint's TimeLineID:       103
Latest checkpoint's PrevTimeLineID:   103
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:2428875781
Latest checkpoint's NextOID:          2368444
Latest checkpoint's NextMultiXactId:  2
Latest checkpoint's NextMultiOffset:  3
Latest checkpoint's oldestXID:        2243636521
Latest checkpoint's oldestXID's DB:   16397
Latest checkpoint's oldestActiveXID:  2428782169
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 16396
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint:            7/10/2018 7:12:11 PM
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     A58/6B109F28
Min recovery ending loc's timeline:   103
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    replica
wal_log_hints setting:                on
max_connections setting:              1000
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0

From Chi:

C:\Windows\system32>pg_controldata -D e:\pgdata
pg_control version number:            960
Catalog version number:               201608131
Database system identifier:           6377726782605679660
Database cluster state:               in production
pg_control last modified:             7/11/2018 3:39:21 PM
Latest checkpoint location:           A5A/C205EE68
Prior checkpoint location:            A5A/BF15ED30
Latest checkpoint's REDO location:    A5A/C1102590
Latest checkpoint's REDO WAL file:    0000006900000A5A000000C1
Latest checkpoint's TimeLineID:       105
Latest checkpoint's PrevTimeLineID:   105
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:2431541507
Latest checkpoint's NextOID:          2384828
Latest checkpoint's NextMultiXactId:  2
Latest checkpoint's NextMultiOffset:  3
Latest checkpoint's oldestXID:        2243636521
Latest checkpoint's oldestXID's DB:   16397
Latest checkpoint's oldestActiveXID:  2431541507
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 16396
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint:            7/11/2018 3:38:41 PM
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    replica
wal_log_hints setting:                on
max_connections setting:              1000
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0

--
-- Christophe Pettus
   xof@thebuild.com



Re: requested timeline ... does not contain minimum recovery point...

От
Michael Paquier
Дата:
On Thu, Jul 12, 2018 at 02:26:17PM -0700, Christophe Pettus wrote:
> What surprises me about the error is that while the recovery point
> seems reasonable, it shouldn't be on timeline 103, but on timeline
> 105.

Wild guess: you did not issue a checkpoint on the promoted standby
before running pg_rewind.
--
Michael

Вложения

Re: requested timeline ... does not contain minimum recovery point...

От
Christophe Pettus
Дата:
> On Jul 12, 2018, at 17:52, Michael Paquier <michael@paquier.xyz> wrote:
> Wild guess: you did not issue a checkpoint on the promoted standby
> before running pg_rewind.

I don't believe a manual checkpoint was done on the target (promoted standby, new master), but it did one as usual
duringstartup after the timeline switch: 

> 2018-07-10 19:28:38 UTC [5068]: [1-1] user=,db=,app=,client= LOG:  checkpoint starting: force


The pg_rewind was started about 90 seconds later.

--
-- Christophe Pettus
   xof@thebuild.com



Re: requested timeline ... does not contain minimum recovery point...

От
Christophe Pettus
Дата:
> On Jul 12, 2018, at 19:22, Christophe Pettus <xof@thebuild.com> wrote:
>
>
>> On Jul 12, 2018, at 17:52, Michael Paquier <michael@paquier.xyz> wrote:
>> Wild guess: you did not issue a checkpoint on the promoted standby
>> before running pg_rewind.
>
> I don't believe a manual checkpoint was done on the target (promoted standby, new master), but it did one as usual
duringstartup after the timeline switch: 
>
>> 2018-07-10 19:28:38 UTC [5068]: [1-1] user=,db=,app=,client= LOG:  checkpoint starting: force
>
> The pg_rewind was started about 90 seconds later.

That being said, the pg_rewind output seems to indicate that the old divergence point was still being picked up, rather
thanthe one on timeline 104: 

> servers diverged at WAL position A58/50000000 on timeline 103
> rewinding from last common checkpoint at A58/4E0689F0 on timeline 103

--
-- Christophe Pettus
   xof@thebuild.com



Re: requested timeline ... does not contain minimum recovery point...

От
Andres Freund
Дата:
On 2018-07-12 19:22:50 -0700, Christophe Pettus wrote:
> 
> > On Jul 12, 2018, at 17:52, Michael Paquier <michael@paquier.xyz> wrote:
> > Wild guess: you did not issue a checkpoint on the promoted standby
> > before running pg_rewind.
> 
> I don't believe a manual checkpoint was done on the target (promoted standby, new master), but it did one as usual
duringstartup after the timeline switch:
 
> 
> > 2018-07-10 19:28:38 UTC [5068]: [1-1] user=,db=,app=,client= LOG:  checkpoint starting: force
> 
> 
> The pg_rewind was started about 90 seconds later.

Note that that message doesn't indicate a completed checkpoint, just
that one started. Do you see a "checkpoint complete: wrote ..." message
before the rewind started?

Greetings,

Andres Freund


Re: requested timeline ... does not contain minimum recovery point...

От
Christophe Pettus
Дата:
> On Jul 12, 2018, at 19:54, Andres Freund <andres@anarazel.de> wrote:
> Do you see a "checkpoint complete: wrote ..." message
> before the rewind started?

Checking, but I suspect that's exactly the problem.

This raises a question: Would it make sense for pg_rewind to either force a checkpoint or have a --checkpoint option
alongthe lines of pg_basebackup?  This scenario (pg_rewind being run very quickly after secondary promotion) is not
uncommonwhen there's scripting around the switch-over process. 

--
-- Christophe Pettus
   xof@thebuild.com