Thread: Replication failure, slave requesting old segments
Dear Experts,

I recently set up replication for the first time. It seemed to be working OK in my initial tests, but didn't cope when the slave was down for a longer period. This is all with the Debian stable packages of PostgreSQL 9.6. My replication setup involves a third server, "backup", to and from which WAL files are copied using scp (and essentially never deleted), plus streaming replication in "hot standby" mode.

On the master, I have:

wal_level = replica
archive_mode = on
archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
                   scp %p backup:backup/postgresql/archivedir/%f'

On the slave I have:

standby_mode = 'on'
primary_conninfo = 'user=postgres host=master port=5432'
restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
hot_standby = on

This all seemed to work OK until the slave was shut down for a few days. (During this time the master was also down for a shorter period.) When the slave came back up it successfully copied several WAL files from the backup server until it reached one that hadn't been created yet:

2018-08-10 22:37:38.322 UTC [615] LOG: restored log file "0000000100000007000000CC" from archive
2018-08-10 22:38:02.011 UTC [615] LOG: restored log file "0000000100000007000000CD" from archive
2018-08-10 22:38:29.802 UTC [615] LOG: restored log file "0000000100000007000000CE" from archive
2018-08-10 22:38:55.973 UTC [615] LOG: restored log file "0000000100000007000000CF" from archive
2018-08-10 22:39:24.964 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory

At this point there was a temporary problem due to a firewall rule that blocked the replication connection to the master:

2018-08-10 22:41:35.865 UTC [1289] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

The slave then entered a loop, retrying to scp the next file from the backup archive (and failing) and connecting to the master for streaming replication (and also failing). That is as expected, except for the "invalid magic number":

scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-10 22:41:37.018 UTC [615] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
2018-08-10 22:43:46.936 UTC [1445] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

Note that it's complaining about an invalid magic number in log segment D1, which is the one that has just failed to scp. That looked suspicious to me and I remembered notes in the docs about ensuring that the archive and/or restore commands return an error exit status when they fail; I checked that, and yes, scp does exit(1) when the requested file doesn't exist:

$ scp backup:/egheriugherg /tmp/a
scp: /egheriugherg: No such file or directory
$ echo $?
1

Anyway, the slave continued in its retrying loop as expected, except that each time it re-fetched the previous segment, D0; this seems a bit peculiar:

2018-08-10 22:44:17.796 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-10 22:46:28.727 UTC [1628] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

(Note the "invalid magic number" was only reported the first time.)

Eventually I fixed the networking issue that had prevented the replication connection to the master, and it could start streaming:

2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed

So here is the real problem which worries me: it seems to ask the master for segment D0, which is one that it did successfully restore from the backup earlier on, and the master - not unexpectedly - reports that it has already been removed.

I tried restarting the slave:

2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
2018-08-11 00:12:16.068 UTC [7953] LOG: database system is ready to accept read only connections
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:12:17.057 UTC [7954] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
2018-08-11 00:12:17.242 UTC [8034] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:12:17.242 UTC [8034] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed

Note that again it failed to get D1 from the backup, as expected, then reported "invalid magic number" in that file, and again tried to start streaming from a segment that the master doesn't have.

Eventually I decided to try to cause the D1 file to exist in the backup:

db=> create table waste_space as select * from large_table;

That caused four new WAL segments to be copied to the backup, and restored by the slave:

2018-08-11 00:55:49.741 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
2018-08-11 00:56:12.304 UTC [7954] LOG: restored log file "0000000100000007000000D1" from archive
2018-08-11 00:56:35.481 UTC [7954] LOG: restored log file "0000000100000007000000D2" from archive
2018-08-11 00:56:57.443 UTC [7954] LOG: restored log file "0000000100000007000000D3" from archive
2018-08-11 00:57:21.723 UTC [7954] LOG: restored log file "0000000100000007000000D4" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
2018-08-11 00:57:22.915 UTC [7954] LOG: unexpected pageaddr 7/C7000000 in log segment 0000000100000007000000D5, offset 0
2018-08-11 00:57:23.114 UTC [12348] LOG: started streaming WAL from primary at 7/D5000000 on timeline 1

Note the "unexpected pageaddr" message. Despite that, the streaming started and everything was working correctly. Hurray!

I remain concerned, however, that something bad happened and I'd like to understand it. Can anyone help?

Thanks, Phil.
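A note for readers following along: a quick way to check whether an archived segment such as D0 is a complete, valid WAL file (rather than a recycled or partially copied one) is pg_xlogdump, which ships with PostgreSQL 9.6. This is a minimal sketch, not something from the thread; the local paths are illustrative assumptions:

$ # Copy the suspect segment out of the archive, keeping its name intact
$ # (pg_xlogdump parses the file name to find the starting position).
$ scp backup:backup/postgresql/archivedir/0000000100000007000000D0 /tmp/
$ # A healthy segment prints decoded WAL records up to near the end of
$ # the file; a zero-filled or truncated file fails almost immediately,
$ # e.g. with an "invalid magic number" error.
$ pg_xlogdump /tmp/0000000100000007000000D0 | tail -n 3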
On 08/11/2018 08:18 AM, Phil Endecott wrote:
> Dear Experts,
>
> I recently set up replication for the first time. It seemed to be working OK in my initial tests, but didn't cope when the slave was down for a longer period. This is all with the Debian stable packages of PostgreSQL 9.6. My replication setup involves a third server, "backup", to and from which WAL files are copied using scp (and essentially never deleted), plus streaming replication in "hot standby" mode.
>
> On the master, I have:
>
> wal_level = replica
> archive_mode = on
> archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
>                    scp %p backup:backup/postgresql/archivedir/%f'
>
> On the slave I have:
>
> standby_mode = 'on'
> primary_conninfo = 'user=postgres host=master port=5432'
> restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
> hot_standby = on
>
> This all seemed to work OK until the slave was shut down for a few days. (During this time the master was also down for a shorter period.) When the slave came back up it successfully copied several WAL files from the backup server until it reached one that hadn't been created yet:
>
> 2018-08-10 22:37:38.322 UTC [615] LOG: restored log file "0000000100000007000000CC" from archive
> 2018-08-10 22:38:02.011 UTC [615] LOG: restored log file "0000000100000007000000CD" from archive
> 2018-08-10 22:38:29.802 UTC [615] LOG: restored log file "0000000100000007000000CE" from archive
> 2018-08-10 22:38:55.973 UTC [615] LOG: restored log file "0000000100000007000000CF" from archive
> 2018-08-10 22:39:24.964 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
>
> At this point there was a temporary problem due to a firewall rule that blocked the replication connection to the master:
>
> 2018-08-10 22:41:35.865 UTC [1289] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly
>
> The slave then entered a loop, retrying to scp the next file from the backup archive (and failing) and connecting to the master for streaming replication (and also failing). That is as expected, except for the "invalid magic number":
>
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-10 22:41:37.018 UTC [615] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
> 2018-08-10 22:43:46.936 UTC [1445] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly
>
> Note that it's complaining about an invalid magic number in log segment D1, which is the one that has just failed to scp. That looked suspicious to me and I remembered notes in the docs about ensuring that the archive and/or restore commands return an error exit status when they fail; I checked that, and yes, scp does exit(1) when the requested file doesn't exist:
>
> $ scp backup:/egheriugherg /tmp/a
> scp: /egheriugherg: No such file or directory
> $ echo $?
> 1
>
> Anyway, the slave continued in its retrying loop as expected, except that each time it re-fetched the previous segment, D0; this seems a bit peculiar:
>
> 2018-08-10 22:44:17.796 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-10 22:46:28.727 UTC [1628] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly
>
> (Note the "invalid magic number" was only reported the first time.)
>
> Eventually I fixed the networking issue that had prevented the replication connection to the master, and it could start streaming:
>
> 2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> So here is the real problem which worries me: it seems to ask the master for segment D0, which is one that it did successfully restore from the backup earlier on, and the master - not unexpectedly - reports that it has already been removed.
>
> I tried restarting the slave:
>
> 2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
> 2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
> 2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
> 2018-08-11 00:12:16.068 UTC [7953] LOG: database system is ready to accept read only connections
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:12:17.057 UTC [7954] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
> 2018-08-11 00:12:17.242 UTC [8034] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:12:17.242 UTC [8034] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> Note that again it failed to get D1 from the backup, as expected, then reported "invalid magic number" in that file, and again tried to start streaming from a segment that the master doesn't have.
>
> Eventually I decided to try to cause the D1 file to exist in the backup:
>
> db=> create table waste_space as select * from large_table;
>
> That caused four new WAL segments to be copied to the backup, and restored by the slave:
>
> 2018-08-11 00:55:49.741 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
> 2018-08-11 00:56:12.304 UTC [7954] LOG: restored log file "0000000100000007000000D1" from archive
> 2018-08-11 00:56:35.481 UTC [7954] LOG: restored log file "0000000100000007000000D2" from archive
> 2018-08-11 00:56:57.443 UTC [7954] LOG: restored log file "0000000100000007000000D3" from archive
> 2018-08-11 00:57:21.723 UTC [7954] LOG: restored log file "0000000100000007000000D4" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
> 2018-08-11 00:57:22.915 UTC [7954] LOG: unexpected pageaddr 7/C7000000 in log segment 0000000100000007000000D5, offset 0
> 2018-08-11 00:57:23.114 UTC [12348] LOG: started streaming WAL from primary at 7/D5000000 on timeline 1
>
> Note the "unexpected pageaddr" message. Despite that, the streaming started and everything was working correctly. Hurray!
>
> I remain concerned, however, that something bad happened and I'd like to understand it. Can anyone help?

Looks like the master recycled the WAL's while the slave could not connect. Might want to take a look at replication slots to keep WAL's from being recycled:

https://www.postgresql.org/docs/9.6/static/warm-standby.html#STREAMING-REPLICATION-SLOTS

with the heads up that slots do not limit the space taken up by WAL's.

The "unexpected pageaddr" is probably for the same reason, per the comments in xlogreader.c:

"
	/*
	 * Check that the address on the page agrees with what we expected.
	 * This check typically fails when an old WAL segment is recycled,
	 * and hasn't yet been overwritten with new data yet.
	 */
...
	report_invalid_record(state,
			      "unexpected pageaddr %X/%X in log segment %s, offset %u",
			      (uint32) (hdr->xlp_pageaddr >> 32), (uint32) hdr->xlp_pageaddr,
			      fname, offset);
"

> Thanks,
>
> Phil.

-- 
Adrian Klaver
adrian.klaver@aklaver.com
Hi Adrian,

Adrian Klaver wrote:
> Looks like the master recycled the WAL's while the slave could not connect.

Yes but... why is that a problem? The master is copying the WALs to the backup server using scp, where they remain forever. The slave gets them from there before it starts streaming. So it shouldn't matter if the master recycles them, as the slave should be able to get everything using the combination of scp and then streaming.

Am I missing something about how this sort of replication is supposed to work?

Thanks, Phil.
On 08/11/2018 12:42 PM, Phil Endecott wrote:
> Hi Adrian,
>
> Adrian Klaver wrote:
>> Looks like the master recycled the WAL's while the slave could not connect.
>
> Yes but... why is that a problem? The master is copying the WALs to the backup server using scp, where they remain forever. The slave gets

To me it looks like that did not happen:

2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed

Above, 0000000100000007000000D0 is gone/recycled on the master, and the archived version does not seem to be complete, as the streaming replication is trying to find it.

Below, you kick the master and it coughs up the files to the archive, including *D0 and *D1 on up to *D4, and then the streaming picks up using *D5:

2018-08-11 00:55:49.741 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
2018-08-11 00:56:12.304 UTC [7954] LOG: restored log file "0000000100000007000000D1" from archive
2018-08-11 00:56:35.481 UTC [7954] LOG: restored log file "0000000100000007000000D2" from archive
2018-08-11 00:56:57.443 UTC [7954] LOG: restored log file "0000000100000007000000D3" from archive
2018-08-11 00:57:21.723 UTC [7954] LOG: restored log file "0000000100000007000000D4" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
2018-08-11 00:57:22.915 UTC [7954] LOG: unexpected pageaddr 7/C7000000 in log segment 0000000100000007000000D5, offset 0
2018-08-11 00:57:23.114 UTC [12348] LOG: started streaming WAL from primary at 7/D5000000 on timeline 1

Best guess is the archiving did not work as expected during: "(During this time the master was also down for a shorter period.)"

> them from there before it starts streaming. So it shouldn't matter if the master recycles them, as the slave should be able to get everything using the combination of scp and then streaming.
>
> Am I missing something about how this sort of replication is supposed to work?
>
> Thanks, Phil.

-- 
Adrian Klaver
adrian.klaver@aklaver.com
Greetings,

* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
> Dear Experts,

Since you're asking ...

> I recently set up replication for the first time. It seemed to be working OK in my initial tests, but didn't cope when the slave was down for a longer period. This is all with the Debian stable packages of PostgreSQL 9.6. My replication setup involves a third server, "backup", to and from which WAL files are copied using scp (and essentially never deleted), plus streaming replication in "hot standby" mode.
>
> On the master, I have:
>
> wal_level = replica
> archive_mode = on
> archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
>                    scp %p backup:backup/postgresql/archivedir/%f'

This is really not a sufficient or particularly intelligent archive_command. In general, I'd strongly recommend against trying to roll your own archiving/backup solution for PostgreSQL. In particular, the above is quite expensive, doesn't address the case where a file is only partially copied to the backup server, and doesn't ensure that the file is completely written and fsync'd on the backup server, meaning that a failure will likely result in WAL going missing. There are much better solutions out there which you really should be using instead of trying to build your own. In particular, my preference is pgbackrest (though I'm one of the folks in its development, to be clear), but there are other options such as barman or WAL-E which I believe also address the concerns raised above.

> On the slave I have:
>
> standby_mode = 'on'
> primary_conninfo = 'user=postgres host=master port=5432'
> restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
> hot_standby = on
>
> This all seemed to work OK until the slave was shut down for a few days. (During this time the master was also down for a shorter period.) When the slave came back up it successfully copied several WAL files from the backup server until it reached one that hadn't been created yet:

That all sounds fine.

> 2018-08-10 22:37:38.322 UTC [615] LOG: restored log file "0000000100000007000000CC" from archive
> 2018-08-10 22:38:02.011 UTC [615] LOG: restored log file "0000000100000007000000CD" from archive
> 2018-08-10 22:38:29.802 UTC [615] LOG: restored log file "0000000100000007000000CE" from archive
> 2018-08-10 22:38:55.973 UTC [615] LOG: restored log file "0000000100000007000000CF" from archive
> 2018-08-10 22:39:24.964 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
>
> At this point there was a temporary problem due to a firewall rule that blocked the replication connection to the master:
>
> 2018-08-10 22:41:35.865 UTC [1289] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

Well, that's certainly not good.

> The slave then entered a loop, retrying to scp the next file from the backup archive (and failing) and connecting to the master for streaming replication (and also failing). That is as expected, except for the "invalid magic number":
>
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-10 22:41:37.018 UTC [615] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
> 2018-08-10 22:43:46.936 UTC [1445] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

Note that the invalid magic number error is just LOG-level, and that PostgreSQL will certainly ask for WAL files which don't exist in the archive yet.

> Note that it's complaining about an invalid magic number in log segment D1, which is the one that has just failed to scp. That looked suspicious to me and I remembered notes in the docs about ensuring that the archive and/or restore commands return an error exit status when they fail; I checked that, and yes, scp does exit(1) when the requested file doesn't exist:

Sure, but scp doesn't produce any output when it's successful, so are you sure that an scp wasn't performed after the "no such file or directory" log message, one which perhaps performed a partial copy (see above about *exactly* that concern being mentioned by me, even before getting down to this part, due to using these simple unix commands...)?

> $ scp backup:/egheriugherg /tmp/a
> scp: /egheriugherg: No such file or directory
> $ echo $?
> 1

scp has no idea if it made a copy of a partial file though.

> Anyway, the slave continued in its retrying loop as expected, except that each time it re-fetched the previous segment, D0; this seems a bit peculiar:
>
> 2018-08-10 22:44:17.796 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-10 22:46:28.727 UTC [1628] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

Having it go back to D0 is at least a bit interesting. I wonder if somehow it was partial.. Have you looked at the files on the archive server to make sure they're all 16MB and the same size?

> (Note the "invalid magic number" was only reported the first time.)

Part of the issue here is that there's no logging happening with these scp commands, so we don't really know what happened, exactly. I'm wondering if perhaps the scp had *just* started copying the file from the primary to the backup server when the scp from the backup server to the replica started, and it ended up copying a zero-byte file - though if that were the case then the next time around the file should have been there, unless there was some network hiccup between the primary and the backup server that caused the copy to fail...

> Eventually I fixed the networking issue that had prevented the replication connection to the master, and it could start streaming:
>
> 2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> So here is the real problem which worries me: it seems to ask the master for segment D0, which is one that it did successfully restore from the backup earlier on, and the master - not unexpectedly - reports that it has already been removed.

If D0 has been removed from the primary then it should be on the backup server, right? Are you sure it actually is, and that it's a complete segment? I'm starting to think that you're in exactly the predicament that I mentioned above - maybe D0 was a partial copy to the backup server, and that failed and was returned to PG as failed, but then it tried to run again and the stupid 'test' command returned success because the file exists, even though it wasn't complete...

> I tried restarting the slave:
>
> 2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
> 2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
> 2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
> 2018-08-11 00:12:16.068 UTC [7953] LOG: database system is ready to accept read only connections
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:12:17.057 UTC [7954] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
> 2018-08-11 00:12:17.242 UTC [8034] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:12:17.242 UTC [8034] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> Note that again it failed to get D1 from the backup, as expected, then reported "invalid magic number" in that file, and again tried to start streaming from a segment that the master doesn't have.
>
> Eventually I decided to try to cause the D1 file to exist in the backup:
>
> db=> create table waste_space as select * from large_table;
>
> That caused four new WAL segments to be copied to the backup, and restored by the slave:
>
> 2018-08-11 00:55:49.741 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
> 2018-08-11 00:56:12.304 UTC [7954] LOG: restored log file "0000000100000007000000D1" from archive
> 2018-08-11 00:56:35.481 UTC [7954] LOG: restored log file "0000000100000007000000D2" from archive
> 2018-08-11 00:56:57.443 UTC [7954] LOG: restored log file "0000000100000007000000D3" from archive
> 2018-08-11 00:57:21.723 UTC [7954] LOG: restored log file "0000000100000007000000D4" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
> 2018-08-11 00:57:22.915 UTC [7954] LOG: unexpected pageaddr 7/C7000000 in log segment 0000000100000007000000D5, offset 0
> 2018-08-11 00:57:23.114 UTC [12348] LOG: started streaming WAL from primary at 7/D5000000 on timeline 1
>
> Note the "unexpected pageaddr" message. Despite that, the streaming started and everything was working correctly. Hurray!

So, just to be clear, those LOG-level messages aren't generally an issue, they're just LOGs. We expect to see them.

> I remain concerned, however, that something bad happened and I'd like to understand it. Can anyone help?

Really hard to tell from this. On the one hand, it seems like maybe the primary was done with D0, but hadn't finished with D1 yet, and the replica hadn't made that transition from D0 -> D1, hence why it was still asking for D0.

I'm certainly concerned that there might have been some amount of partial-WAL-file copying happening, but PG is pretty good at dealing with a variety of odd cases, so I tend to doubt that the replica actually ended up being corrupted. But I'll again strongly advocate for using a proven solution which avoids these risks by doing things like copying into temporary files on the backup server and then using an atomic 'mv' to make sure that anyone reading from the archive wouldn't ever see a partially-written file, and, just to double-check, also calculating a hash of every WAL file written into the archive and re-verifying that hash when reading it back out. That's what we do in pgbackrest, at least.

Thanks!

Stephen
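For readers who want to keep a plain ssh-based setup rather than adopt pgbackrest or barman, the temp-file/fsync/atomic-rename scheme Stephen describes can be approximated with a small script on the backup server. This is a hedged sketch, not a vetted tool: the script name and path are made up for illustration, sha256sum is one possible hash choice, and `sync FILE` requires GNU coreutils 8.24 or later (Debian stretch ships 8.26).

$ cat ~/bin/archive-wal.sh
#!/bin/sh
# Receive one WAL segment on stdin and store it atomically.
# Invoked over ssh by archive_command as: archive-wal.sh <segment-name>
set -e
dir=backup/postgresql/archivedir
f="$dir/$1"
if [ -e "$f" ]; then exit 1; fi      # never overwrite an archived segment
tmp="$f.partial.$$"
cat > "$tmp"                         # write the segment under a temp name
sha256sum < "$tmp" > "$tmp.sha256"   # record a hash to re-verify on restore
sync "$tmp" "$tmp.sha256"            # force the file data to disk
mv "$tmp.sha256" "$f.sha256"         # renames are atomic within one
mv "$tmp" "$f"                       #   filesystem: readers never see a
sync "$dir"                          #   partial segment; persist dir entry

The master's archive_command would then be something like 'ssh backup bin/archive-wal.sh %f < %p', and the slave's restore_command could fetch both files and compare the hash before handing the segment to PostgreSQL.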
Hi Stephen,

Stephen Frost wrote:
> * Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
>> archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
>>                    scp %p backup:backup/postgresql/archivedir/%f'
>
> This is really not a sufficient or particularly intelligent archive_command. In general, I'd strongly recommend against trying to roll your own archiving/backup solution for PostgreSQL. In particular, the above is quite expensive, doesn't address the case where a file is only partially copied to the backup server, and doesn't ensure that the file is completely written and fsync'd on the backup server, meaning that a failure will likely result in WAL going missing. There are much better solutions out there which you really should be using instead of trying to build your own. In particular, my preference is pgbackrest (though I'm one of the folks in its development, to be clear), but there are other options such as barman or WAL-E which I believe also address the concerns raised above.

Hmmm, well, I based it on what I read in the documentation:

https://www.postgresql.org/docs/9.6/static/continuous-archiving.html

The expense is not an issue in this case. I would be surprised if scp were to exit(0) after only partially copying a file, but perhaps you know something I don't about that.

I know that rsync creates a temporary file and moves it into place when all the data has been transferred, which scp doesn't do; perhaps that would be safer.

The master was only writing WAL files maybe 3 or 4 times per day at the time, so any scenario that requires its crash to have occurred exactly as it was doing an scp involves a substantial coincidence.

I am limited in what I can run on the backup server.

> Having it go back to D0 is at least a bit interesting. I wonder if somehow it was partial.. Have you looked at the files on the archive server to make sure they're all 16MB and the same size?

Yes, all the files on the backup including D0 were 16 MB. But:

>> 2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
>> 2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
>> 2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088

Are the last two log lines above telling us anything useful? Is that saying that, of the 16 MB (0x1000000 byte) WAL file it restored, it only got as far as byte 0xF956C0 or 0xFFF088? Is that what we would expect? Is it trying to use the streaming connection to get the missing bytes from 0xFFF088 to 0xFFFFFF, the end of the file? Is that just an empty gap at the end of the file due to the next record being too big to fit?

>> db=> create table waste_space as select * from large_table;
>>
>> That caused four new WAL segments to be copied to the backup, and restored by the slave:

Note again that this was sufficient to make it start working again, which I think is important.

> Really hard to tell from this. On the one hand, it seems like maybe the primary was done with D0, but hadn't finished with D1 yet, and the replica hadn't made that transition from D0 -> D1, hence why it was still asking for D0.

Yes. What is actually supposed to happen? What exactly do you mean by "make the transition"?

> I'm certainly concerned that there might have been some amount of partial-WAL-file copying happening, but PG is pretty good at dealing with a variety of odd cases, so I tend to doubt that the replica actually ended up being corrupted. But I'll again strongly advocate for using a proven solution which avoids these risks by doing things like copying into temporary files on the backup server and then using an atomic 'mv' to make sure that anyone reading from the archive wouldn't ever see a partially-written file, and, just to double-check, also calculating a hash of every WAL file written into the archive and re-verifying that hash when reading it back out. That's what we do in pgbackrest, at least.

Noted. But I don't see any evidence that that was actually the root cause here, and it relies on an unlikely timing coincidence. This is the very first time that I've had a replica recover after a significant downtime, and it failed.

Regards, Phil.
Hi Adrian,

Adrian Klaver wrote:
> On 08/11/2018 12:42 PM, Phil Endecott wrote:
>> Hi Adrian,
>>
>> Adrian Klaver wrote:
>>> Looks like the master recycled the WAL's while the slave could not connect.
>>
>> Yes but... why is that a problem? The master is copying the WALs to the backup server using scp, where they remain forever. The slave gets
>
> To me it looks like that did not happen:
>
> 2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> Above, 0000000100000007000000D0 is gone/recycled on the master, and the archived version does not seem to be complete, as the streaming replication is trying to find it.

The files on the backup server were all 16 MB.

> Below, you kick the master and it coughs up the files to the archive, including *D0 and *D1 on up to *D4, and then the streaming picks up using *D5.

When I kicked it, the master wrote D1 to D4 to the backup. It did not change D0 (its modification time on the backup is from before the "kick"). The slave re-read D0, again, as it had been doing throughout this period, and then read D1 to D4.

> Best guess is the archiving did not work as expected during:
>
> "(During this time the master was also down for a shorter period.)"

Around the time the master was down, the WAL segment names were CB and CC. Files CD to CF were written between the master coming up and the slave coming up. The slave had no trouble restoring those segments when it started. The problematic segments D0 and D1 were the ones that were "current" when the slave restarted, at which time the master was up consistently.

Regards, Phil.
Phil Endecott wrote:
> On the master, I have:
>
> wal_level = replica
> archive_mode = on
> archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
>                    scp %p backup:backup/postgresql/archivedir/%f'
>
> On the slave I have:
>
> standby_mode = 'on'
> primary_conninfo = 'user=postgres host=master port=5432'
> restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
> hot_standby = on

> 2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed

I am wondering if I need to set wal_keep_segments to at least 1 or 2 for this to work. I currently have it unset, and I believe the default is 0.

My understanding was that when using archive_command/restore_command to copy WAL segments it would not be necessary to use wal_keep_segments to retain files in pg_xlog on the server; the slave can get everything using a combination of copying files via the restore_command and streaming. But these lines from the log:

2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088

make me think that there is an issue when the slave reaches the end of the copied WAL file. I speculate that the useful content of this WAL segment ends at FFF088, which is followed by an empty gap due to record sizes. But the slave tries to start streaming from this point, D0FFF088, not D1000000. If the master still had a copy of segment D0 then it would be able to stream this gap followed by the real content in the current segment D1.

Does that make any sense at all?

Regards, Phil.
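For concreteness: the setting Phil is considering lives in postgresql.conf on the master, and the 9.6 admin functions can show where the write and replay positions sit relative to segment boundaries. A small illustrative sketch (not from the thread):

# postgresql.conf on the master: keep at least the last completed
# segment in pg_xlog even after it has been archived.
wal_keep_segments = 1

-- On the master: the current write position and its segment name.
db=> SELECT pg_current_xlog_location(),
            pg_xlogfile_name(pg_current_xlog_location());

-- On the slave: how far WAL replay has progressed.
db=> SELECT pg_last_xlog_replay_location();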
On 08/12/2018 12:25 PM, Phil Endecott wrote:
> Hi Adrian,
>
> Adrian Klaver wrote:
>> On 08/11/2018 12:42 PM, Phil Endecott wrote:
>>> Hi Adrian,
>>>
>>> Adrian Klaver wrote:
>>>> Looks like the master recycled the WAL's while the slave could not connect.
>>>
>>> Yes but... why is that a problem? The master is copying the WALs to the backup server using scp, where they remain forever. The slave gets
>>
>> To me it looks like that did not happen:
>>
>> 2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
>> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
>> 2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
>> 2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>>
>> Above, 0000000100000007000000D0 is gone/recycled on the master, and the archived version does not seem to be complete, as the streaming replication is trying to find it.
>
> The files on the backup server were all 16 MB.

WAL files are created/recycled as 16 MB files, which is not the same as saying they are complete for the purposes of restoring. In other words, you could be looking at a 16 MB file full of 0's.

>> Below, you kick the master and it coughs up the files to the archive, including *D0 and *D1 on up to *D4, and then the streaming picks up using *D5.
>
> When I kicked it, the master wrote D1 to D4 to the backup. It did not change D0 (its modification time on the backup is from before the "kick"). The slave re-read D0, again, as it had been doing throughout this period, and then read D1 to D4.

Well, something happened, because the slave could not get all the information it needed from the D0 in the archive and was trying to get it from the master's pg_xlog.

>> Best guess is the archiving did not work as expected during:
>>
>> "(During this time the master was also down for a shorter period.)"
>
> Around the time the master was down, the WAL segment names were CB and CC. Files CD to CF were written between the master coming up and the slave coming up. The slave had no trouble restoring those segments when it started. The problematic segments D0 and D1 were the ones that were "current" when the slave restarted, at which time the master was up consistently.
>
> Regards, Phil.

-- 
Adrian Klaver
adrian.klaver@aklaver.com
On 08/12/2018 12:53 PM, Phil Endecott wrote:
> Phil Endecott wrote:
>> On the master, I have:
>>
>> wal_level = replica
>> archive_mode = on
>> archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
>>                    scp %p backup:backup/postgresql/archivedir/%f'
>>
>> On the slave I have:
>>
>> standby_mode = 'on'
>> primary_conninfo = 'user=postgres host=master port=5432'
>> restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
>> hot_standby = on
>
>> 2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
>> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
>> 2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
>> 2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> I am wondering if I need to set wal_keep_segments to at least 1 or 2 for this to work. I currently have it unset, and I believe the default is 0.

Given that WAL's are only 16 MB I would probably bump it up to be on the safe side, or use:

https://www.postgresql.org/docs/9.6/static/warm-standby.html
26.2.6. Replication Slots

Though the above does not limit storage of WAL's, so a long outage could result in WAL's piling up.

> My understanding was that when using archive_command/restore_command to copy WAL segments it would not be necessary to use wal_keep_segments to retain files in pg_xlog on the server; the slave can get everything using a combination of copying files via the restore_command and streaming. But these lines from the log:
>
> 2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
> 2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
>
> make me think that there is an issue when the slave reaches the end of the copied WAL file. I speculate that the useful content of this WAL segment ends at FFF088, which is followed by an empty gap due to record sizes. But the slave tries to start streaming from this point, D0FFF088, not D1000000. If the master still had a copy of segment D0 then it would be able to stream this gap followed by the real content in the current segment D1.
>
> Does that make any sense at all?
>
> Regards, Phil.

-- 
Adrian Klaver
adrian.klaver@aklaver.com
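To make the slot suggestion concrete: a physical replication slot is created once on the master, and the slave names it in recovery.conf; the master then retains WAL until the slave has streamed it. A minimal sketch, with the slot name standby1 chosen purely for illustration:

-- On the master (run once):
db=> SELECT pg_create_physical_replication_slot('standby1');

# In recovery.conf on the slave, alongside the existing settings:
primary_slot_name = 'standby1'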
Greetings,

* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
> Stephen Frost wrote:
>> * Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
>>> archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
>>>                    scp %p backup:backup/postgresql/archivedir/%f'
>>
>> This is really not a sufficient or particularly intelligent archive_command. In general, I'd strongly recommend against trying to roll your own archiving/backup solution for PostgreSQL. In particular, the above is quite expensive, doesn't address the case where a file is only partially copied to the backup server, and doesn't ensure that the file is completely written and fsync'd on the backup server, meaning that a failure will likely result in WAL going missing. There are much better solutions out there which you really should be using instead of trying to build your own. In particular, my preference is pgbackrest (though I'm one of the folks in its development, to be clear), but there are other options such as barman or WAL-E which I believe also address the concerns raised above.
>
> Hmmm, well, I based it on what I read in the documentation:
>
> https://www.postgresql.org/docs/9.6/static/continuous-archiving.html
>
> The expense is not an issue in this case. I would be surprised if scp were to exit(0) after only partially copying a file, but perhaps you know something I don't about that.

The scp might not, but so what? If a partial file remains on the far side, then a subsequent invocation of that archive_command will come back saying that the file is there. Further, the file isn't going to actually be sync'd to the filesystem on the far side, so even if the scp returns success, a failure on the backup server could result in loss of the WAL file.

> I know that rsync creates a temporary file and moves it into place when all the data has been transferred, which scp doesn't do; perhaps that would be safer.

No, what would be better is using a tool which will actually fsync() the file on the backup server after writing it out there, making sure it's been stored to the filesystem, then perform an atomic mv operation to put it in place, and then fsync() the directory to ensure that change has also been sync'd to disk. Even better is to calculate a hash of the file and reverify that on restore, to make sure that it doesn't end up getting corrupted on the backup server somehow.

> The master was only writing WAL files maybe 3 or 4 times per day at the time, so any scenario that requires its crash to have occurred exactly as it was doing an scp involves a substantial coincidence.

While I get that it's unlikely to be relevant to this particular case, I wouldn't recommend depending on never having that coincidence happen. What I've seen quite a bit of, just so you know, is that things tend to all break down at about the same time - so right when something bad is happening on the primary, the backup server runs into issues too. This might be due to networking breaking down in a particular area all at once, or shared storage having some issue, but it's happened often enough that relying on uncommon things to avoid each other is something I've got a hard time doing these days.

> I am limited in what I can run on the backup server.

You could consider backing up locally using a tool which is built for PostgreSQL and then sync'ing the results of that backup over to the backup server, perhaps. Or point out that you need a backup server where you don't have those restrictions in order to properly have backups. Or use an external service like s3 (or a clone) to back up to.

>> Having it go back to D0 is at least a bit interesting. I wonder if somehow it was partial.. Have you looked at the files on the archive server to make sure they're all 16MB and the same size?
>
> Yes, all the files on the backup including D0 were 16 MB. But:

Ok. Did you see any failures on the sending side when archive_command was being run? PG shouldn't start trying to archive the file until it's done with it, of course, so unless something failed right at the end of the scp and left the file at 16MB but it wasn't actually filled..

>>> 2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
>>> 2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
>>> 2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
>
> Are the last two log lines above telling us anything useful? Is that saying that, of the 16 MB (0x1000000 byte) WAL file it restored, it only got as far as byte 0xF956C0 or 0xFFF088? Is that what we would expect? Is it trying to use the streaming connection to get the missing bytes from 0xFFF088 to 0xFFFFFF, the end of the file? Is that just an empty gap at the end of the file due to the next record being too big to fit?

The short answer is that, yes, the next record was likely too big to fit, but that's what the replica was trying to figure out and couldn't, because D0 was gone from the primary already. One thing which should help this would be to use physical replication slots on the primary, which would keep it from throwing away WAL files until it knows the replica has them, but that runs the risk of ending up with lots of extra WAL on the primary if the replica is gone for a while. If you'd prefer to avoid that then having wal_keep_segments set to '1' would avoid this particular issue as well, I'd expect.

I do wonder if perhaps we should just default to having it as '1' to avoid exactly this case, as it seems like perhaps PG archived D0 and then flipped to D1 and got rid of D0, which is all pretty reasonable, except that a replica trying to catch up is going to end up asking for D0 from the primary because it didn't know if there was anything else that should have been in D0..

>>> db=> create table waste_space as select * from large_table;
>>>
>>> That caused four new WAL segments to be copied to the backup, and restored by the slave:
>
> Note again that this was sufficient to make it start working again, which I think is important.

I agree that it's certainly interesting, though presumably what ends up happening is that the replica finally gets to D1 and is then able to connect to the primary before the primary has finished with D1, and it's able to start tracking along with the primary.

>> Really hard to tell from this. On the one hand, it seems like maybe the primary was done with D0, but hadn't finished with D1 yet, and the replica hadn't made that transition from D0 -> D1, hence why it was still asking for D0.
>
> Yes. What is actually supposed to happen? What exactly do you mean by "make the transition"?

Simply moving along the WAL stream to get to the D1 WAL file. This all more-or-less looks like what's "supposed to happen", except that the lack of additional writes to the primary meant that things appeared to get "stuck" here, and then got "unstuck" once you created some data. Looks like this might be a bit of an edge case that we aren't handling very cleanly.

>> I'm certainly concerned that there might have been some amount of partial-WAL-file copying happening, but PG is pretty good at dealing with a variety of odd cases, so I tend to doubt that the replica actually ended up being corrupted. But I'll again strongly advocate for using a proven solution which avoids these risks by doing things like copying into temporary files on the backup server and then using an atomic 'mv' to make sure that anyone reading from the archive wouldn't ever see a partially-written file, and, just to double-check, also calculating a hash of every WAL file written into the archive and re-verifying that hash when reading it back out. That's what we do in pgbackrest, at least.
>
> Noted. But I don't see any evidence that that was actually the root cause here, and it relies on an unlikely timing coincidence. This is the very first time that I've had a replica recover after a significant downtime, and it failed.

Failed is really over-stating things, isn't it? The replica caught all the way up to D0 just fine; it just couldn't get past that, because the primary had removed D0 already and the replica couldn't actually reconnect to the primary to start streaming again. But it's not like it would have been all *that* far behind the primary, since the primary was still working on D1, and once the primary did write a few things, the replica was able to start streaming again.

If this can be reproduced and then a fix put in place for it, I'm all for that, to be clear, but I don't see this as a particularly big issue or a systemic problem with the way WAL replay and replication work.

Thanks!

Stephen
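If a slot is used, the WAL pile-up risk Stephen mentions can be watched from the master using the 9.6 catalog view and admin functions; a small illustrative query, not from the thread:

-- How much WAL is the master retaining on behalf of each slot?
db=> SELECT slot_name, active,
            pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(),
                                                 restart_lsn)) AS retained_wal
     FROM pg_replication_slots;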
Stephen Frost wrote:
> * Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
>> Stephen Frost wrote:
>>> * Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
>>>> 2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
>>>> 2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
>>>> 2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
>>
>> Are the last two log lines above telling us anything useful? Is that saying that, of the 16 MB (0x1000000 byte) WAL file it restored, it only got as far as byte 0xF956C0 or 0xFFF088? Is that what we would expect? Is it trying to use the streaming connection to get the missing bytes from 0xFFF088 to 0xFFFFFF, the end of the file? Is that just an empty gap at the end of the file due to the next record being too big to fit?
>
> The short answer is that, yes, the next record was likely too big to fit, but that's what the replica was trying to figure out and couldn't, because D0 was gone from the primary already. One thing which should help this would be to use physical replication slots on the primary, which would keep it from throwing away WAL files until it knows the replica has them, but that runs the risk of ending up with lots of extra WAL on the primary if the replica is gone for a while. If you'd prefer to avoid that then having wal_keep_segments set to '1' would avoid this particular issue as well, I'd expect.
>
> I do wonder if perhaps we should just default to having it as '1' to avoid exactly this case, as it seems like perhaps PG archived D0 and then flipped to D1 and got rid of D0, which is all pretty reasonable, except that a replica trying to catch up is going to end up asking for D0 from the primary because it didn't know if there was anything else that should have been in D0..

OK. I think this is perhaps a documentation bug, maybe a missing warning when the master reads its configuration, and maybe (as you say) a bad default value.

Specifically, section 26.2.5 of the docs says:

"If you use streaming replication without file-based continuous archiving, the server might recycle old WAL segments before the standby has received them. If this occurs, the standby will need to be reinitialized from a new base backup. You can avoid this by setting wal_keep_segments to a value large enough to ensure that WAL segments are not recycled too early, or by configuring a replication slot for the standby. If you set up a WAL archive that's accessible from the standby, these solutions are not required, since the standby can always use the archive to catch up provided it retains enough segments."

OR, maybe the WAL reader that processes the files that restore_command fetches could be smart enough to realise that it can skip over the gap at the end?

Anyway. Do others agree that my issue was the result of wal_keep_segments=0?

Regards, Phil.
Greetings,

* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
> OK. I think this is perhaps a documentation bug, maybe a missing warning when the master reads its configuration, and maybe (as you say) a bad default value.

If we consider it to be an issue worthy of a change then we should probably just change the default value, and maybe not even allow it to be set lower than '1'.

> Specifically, section 26.2.5 of the docs says:
>
> "If you use streaming replication without file-based continuous archiving, the server might recycle old WAL segments before the standby has received them. If this occurs, the standby will need to be reinitialized from a new base backup. You can avoid this by setting wal_keep_segments to a value large enough to ensure that WAL segments are not recycled too early, or by configuring a replication slot for the standby. If you set up a WAL archive that's accessible from the standby, these solutions are not required, since the standby can always use the archive to catch up provided it retains enough segments."
>
> OR, maybe the WAL reader that processes the files that restore_command fetches could be smart enough to realise that it can skip over the gap at the end?

That strikes me as a whole lot more complication in something we'd rather not introduce additional complications into without very good reason. Then again, there was just a nearby discussion about how it'd be nice if the backend could realize when a WAL file is complete, and I do think that'll be more of an issue when users start configuring larger WAL files, so perhaps we should figure out a way to handle this..

> Anyway. Do others agree that my issue was the result of wal_keep_segments=0?

Yeah, I've not spent much time actually looking at code around this, so it'd be nice to get:

a) a reproducible case demonstrating it's happening
b) testing to see how long it's been this way
c) if setting wal_keep_segments=1 fixes it
d) perhaps some thought around if we could address this some other way

Thanks!

Stephen
On 08/12/2018 02:56 PM, Phil Endecott wrote:
> Stephen Frost wrote:
>
> Specifically, section 26.2.5 of the docs says:
>
> "If you use streaming replication without file-based continuous archiving, the server might recycle old WAL segments before the standby has received them. If this occurs, the standby will need to be reinitialized from a new base backup. You can avoid this by setting wal_keep_segments to a value large enough to ensure that WAL segments are not recycled too early, or by configuring a replication slot for the standby. If you set up a WAL archive that's accessible from the standby, these solutions are not required, since the standby can always use the archive to catch up provided it retains enough segments."
>
> OR, maybe the WAL reader that processes the files that restore_command fetches could be smart enough to realise that it can skip over the gap at the end?
>
> Anyway. Do others agree that my issue was the result of wal_keep_segments=0?

Only as a sub-issue of the slave losing contact with the master. The basic problem is maintaining two separate operations, archiving and streaming, in sync. If either one, or some combination of the two, loses synchronization, then it is anyone's guess what is appropriate for wal_keep_segments.

-- 
Adrian Klaver
adrian.klaver@aklaver.com
On 08/12/2018 03:54 PM, Stephen Frost wrote:
> Greetings,
>
> * Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
>> OK. I think this is perhaps a documentation bug, maybe a missing warning when the master reads its configuration, and maybe (as you say) a bad default value.
>
> If we consider it to be an issue worthy of a change then we should probably just change the default value, and maybe not even allow it to be set lower than '1'.

I would say leave the default at 0, as it leaves no doubt that you are performing without a net. A setting of '1' implies you are covered, and for a fast-moving cluster, or a slow-moving one with sufficient downtime, that would not be the case. Better to let the end user know this is not a simple problem and that some thought needs to go into configuration.

-- 
Adrian Klaver
adrian.klaver@aklaver.com
Greetings,

* Adrian Klaver (adrian.klaver@aklaver.com) wrote:
> On 08/12/2018 03:54 PM, Stephen Frost wrote:
>> * Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
>>> OK. I think this is perhaps a documentation bug, maybe a missing
>>> warning when the master reads its configuration, and maybe (as you say)
>>> a bad default value.
>>
>> If we consider it to be an issue worthy of a change then we should
>> probably just change the default value, and maybe not even allow it to
>> be set lower than '1'.
>
> I would say leave the default at 0, as it leaves no doubt that you are
> performing without a net. A setting of '1' implies you are covered, and
> for a fast-moving cluster, or a slow-moving one with sufficient
> downtime, that would not be the case. Better to let the end user know
> this is not a simple problem and some thought needs to go into
> configuration.

Uh, this specific case is one where there *is* a 'safety net' though:
archive_command and restore_command were configured and being used, so
I don't buy this argument at all. Maybe we just internally bump
wal_keep_segments to '1' to avoid this specific risk without actually
changing the default or making people change their existing
configurations, but if this is really what's happening then I don't
think the answer is "don't do anything."

Thanks!

Stephen
Adrian Klaver wrote:
> On 08/12/2018 02:56 PM, Phil Endecott wrote:
>> Anyway. Do others agree that my issue was the result of
>> wal_keep_segments=0 ?
>
> Only as a sub-issue of the slave losing contact with the master. The
> basic problem is maintaining two separate operations, archiving and
> streaming, in sync. If either or some combination of both lose
> synchronization then it is anyone's guess on what is appropriate for
> wal_keep_segments.

Really? I thought the intention was that the system should be able to
recover reliably when the slave reconnects after a period of downtime,
subject only to there being sufficient network/CPU/disk bandwidth etc.
for it to eventually catch up.

If that's not true, I think the docs need an even more extensive
overhaul! Suggestion for the paragraph that I quoted before from
26.2.5:

"If you set up a WAL archive that's accessible from the standby, it's
anyone's guess what is appropriate for wal_keep_segments."

Regards, Phil.
Adrian Klaver wrote:
> On 08/12/2018 03:54 PM, Stephen Frost wrote:
>> If we consider it to be an issue worthy of a change then we should
>> probably just change the default value, and maybe not even allow it to
>> be set lower than '1'.
>
> I would say leave the default at 0, as it leaves no doubt that you are
> performing without a net. A setting of '1' implies you are covered, and
> for a fast-moving cluster, or a slow-moving one with sufficient
> downtime, that would not be the case.

Can you explain how it can fail in the case of a "slow moving cluster
with sufficient downtime"? It seems to me that if I have correctly
understood what happened in this case then 0, the default, really
cannot ever work properly when you have enabled WAL archiving plus
streaming.

> Better to let the end user know this is not a simple problem and some
> thought needs to go into configuration.

I certainly agree that this is "not a simple problem", having read
something like 18,000 words of documentation multiple times and,
apparently, still got it wrong in multiple ways.

Regards, Phil.
Greetings,

* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
> Adrian Klaver wrote:
>> I would say leave the default at 0, as it leaves no doubt that you are
>> performing without a net. A setting of '1' implies you are covered, and
>> for a fast-moving cluster, or a slow-moving one with sufficient
>> downtime, that would not be the case.
>
> Can you explain how it can fail in the case of a "slow moving cluster
> with sufficient downtime"?

I'm guessing 'sufficient downtime' here is, basically, 'offline until
the next checkpoint', which isn't actually all *that* much time.

> It seems to me that if I have correctly understood what happened in
> this case then 0, the default, really cannot ever work properly when
> you have enabled WAL archiving plus streaming.

Well, it's not like it'd work without WAL archiving either, though
that's perhaps more obviously true.

Thanks!

Stephen
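[For reference, the settings that govern how soon the master recycles
WAL after a checkpoint can be inspected as below. A sketch: these GUCs
do exist in 9.6, but the values on any given install may differ from
the defaults.]

$ psql -c "SELECT name, setting, unit FROM pg_settings
            WHERE name IN ('checkpoint_timeout', 'min_wal_size',
                           'max_wal_size', 'wal_keep_segments');"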
Greetings,

* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
> Adrian Klaver wrote:
>> On 08/12/2018 02:56 PM, Phil Endecott wrote:
>>> Anyway. Do others agree that my issue was the result of
>>> wal_keep_segments=0 ?
>>
>> Only as a sub-issue of the slave losing contact with the master. The
>> basic problem is maintaining two separate operations, archiving and
>> streaming, in sync. If either or some combination of both lose
>> synchronization then it is anyone's guess on what is appropriate for
>> wal_keep_segments.

Uh, no, having an archive_command and a restore_command configured
should remove exactly that need to worry about what wal_keep_segments
is set to, because anything not on the primary really should be
available through what's been archived, and PG shouldn't have any
trouble figuring that out and working with it.

If all you've got is streaming replication then, sure, you have no idea
what to set wal_keep_segments to because the replica could be offline
for an indeterminate amount of time, but as long as you're keeping
track of all the WAL through archive_command, that shouldn't be an
issue.

> Really? I thought the intention was that the system should be able to
> recover reliably when the slave reconnects after a period of downtime,
> subject only to there being sufficient network/CPU/disk bandwidth etc.
> for it to eventually catch up.

Yes, that's correct, the replica should always be able to catch back up
presuming there's no gaps in the WAL between when the replica failed
and where the primary is at.

Thanks!

Stephen
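[One way to check the "no gaps in the WAL" precondition is to scan the
archive for holes in the segment sequence. A sketch only, assuming
16MB segments on a single timeline and GNU awk (for strtonum); the
paths match the archive_command shown earlier in the thread.]

$ ssh backup 'ls backup/postgresql/archivedir' | sort | gawk '
    /^[0-9A-F]{24}$/ {
      # segment number = (middle 8 hex digits) * 256 + (last 8 hex digits)
      n = strtonum("0x" substr($0, 9, 8)) * 256 \
        + strtonum("0x" substr($0, 17, 8))
      if (prev && n != prev + 1) print "gap before", $0
      prev = n
    }'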
On 08/13/2018 05:39 AM, Stephen Frost wrote:
> Greetings,
>
> * Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
>> Adrian Klaver wrote:
>>> Only as a sub-issue of the slave losing contact with the master. The
>>> basic problem is maintaining two separate operations, archiving and
>>> streaming, in sync. If either or some combination of both lose
>>> synchronization then it is anyone's guess on what is appropriate for
>>> wal_keep_segments.
>
> Uh, no, having an archive_command and a restore_command configured
> should remove exactly that need to worry about what wal_keep_segments
> is set to, because anything not on the primary really should be
> available through what's been archived, and PG shouldn't have any
> trouble figuring that out and working with it.
>
> If all you've got is streaming replication then, sure, you have no idea
> what to set wal_keep_segments to because the replica could be offline
> for an indeterminate amount of time, but as long as you're keeping
> track of all the WAL through archive_command, that shouldn't be an
> issue.

Therein lies the rub. As I stated previously, the bigger issue is
syncing two different operations, archiving and streaming. The OP got
caught short assuming the archiving would handle the situation where
the streaming was down for a period. In his particular setup and for
this particular situation a wal_keep_segments of 1 would have helped. I
do not see this as a default value though, as it depends on too many
variables outside the reach of the database, most notably the success
of the archive command. First, is the command even valid? Two, is the
network link reliable? Three, is there even a network link, or more
than one? Four, is the restore command valid? That is just off the top
of my head; more caffeine and I could come up with more. Saying that
having archiving, streaming and wal_keep_segments=1 has you covered is
misleading. I don't see it as detrimental to performance, but I do see
more posts down the road from folks who are surprised when it does not
cover their case. Personally I think it better to be up front that this
requires more thought, or a third-party solution that has done the
thinking.

>> Really? I thought the intention was that the system should be able to
>> recover reliably when the slave reconnects after a period of downtime,
>> subject only to there being sufficient network/CPU/disk bandwidth etc.
>> for it to eventually catch up.
>
> Yes, that's correct, the replica should always be able to catch back up
> presuming there's no gaps in the WAL between when the replica failed
> and where the primary is at.

--
Adrian Klaver
adrian.klaver@aklaver.com
Greetings,

* Adrian Klaver (adrian.klaver@aklaver.com) wrote:
> Therein lies the rub. As I stated previously, the bigger issue is
> syncing two different operations, archiving and streaming.

That's not correct though; there isn't a big issue regarding syncing of
those two operations.

> The OP got caught short assuming the archiving would handle the
> situation where the streaming was down for a period. In his particular
> setup and for this particular situation a wal_keep_segments of 1 would
> have helped. I do not see this as a default value though, as it depends
> on too many variables outside the reach of the database, most notably
> the success of the archive command.

What's been pointed out here is that even if everything is working
(archive_command, restore_command, etc) there's a possible gap in the
transition from replay-from-archive to streaming replication where the
replica might not be able to start streaming. That's an entirely
*PostgreSQL* issue, as far as I see it, and hasn't got anything to do
with his particular setup, except that he managed to expose the issue.

> First, is the command even valid? Two, is the network link reliable?
> Three, is there even a network link, or more than one? Four, is the
> restore command valid? That is just off the top of my head; more
> caffeine and I could come up with more. Saying that having archiving,
> streaming and wal_keep_segments=1 has you covered is misleading. I
> don't see it as detrimental to performance, but I do see more posts
> down the road from folks who are surprised when it does not cover
> their case. Personally I think it better to be up front that this
> requires more thought, or a third-party solution that has done the
> thinking.

This is all down to "and you should also monitor to make sure things
continue working", which I certainly agree with, but that doesn't mean
we shouldn't fix this issue. This entire side-discussion feels like
it's really off in the weeds.
The next steps which I outlined a while ago seem to still be entirely
appropriate, and we should figure out a way to solve this issue so
that, when everything else is working (archive command, restore
command, replica able to connect to the primary, etc), PG behaves
sanely and is able to catch up with and connect to the primary and
resume streaming.

Thanks!

Stephen
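[On the monitoring point both sides agree on, a couple of hypothetical
checks. The view and function used below do exist in 9.6 (the location
columns were renamed to *_lsn in 10); thresholds and alerting are left
to the reader.]

$ psql -c "SELECT client_addr, state, sent_location, replay_location
             FROM pg_stat_replication;"       # on the master: is the standby
                                              # connected, and how far behind?
$ psql -c "SELECT now() - pg_last_xact_replay_timestamp()
             AS replay_lag;"                  # on the standby: time since the
                                              # last replayed transaction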
On 08/13/2018 05:08 AM, Phil Endecott wrote:
> Adrian Klaver wrote:
>> Only as a sub-issue of the slave losing contact with the master. The
>> basic problem is maintaining two separate operations, archiving and
>> streaming, in sync. If either or some combination of both lose
>> synchronization then it is anyone's guess on what is appropriate for
>> wal_keep_segments.
>
> Really? I thought the intention was that the system should be able to
> recover reliably when the slave reconnects after a period of downtime,
> subject only to there being sufficient network/CPU/disk bandwidth etc.
> for it to eventually catch up.

See also my reply to Stephen earlier. Basically you are trying to
coordinate two different operations. They start from the same source,
pg_xlog (pg_wal in 10+), but arrive on a different time scale and from
different locations. Without sufficient sanity checks it is possible
they diverge enough on one or both paths to render the process
unstable.

> If that's not true, I think the docs need an even more extensive
> overhaul! Suggestion for the paragraph that I quoted before from
> 26.2.5:
>
> "If you set up a WAL archive that's accessible from the standby, it's
> anyone's guess what is appropriate for wal_keep_segments."

I would say that:

"If you set up a WAL archive that's accessible from the standby, these
solutions are not required, since the standby can always use the
archive to catch up provided it retains enough segments."

should be more like:

"If you set up a WAL archive that's accessible from the standby, these
solutions are not required, since the standby can always use the
archive to catch up provided it retains enough segments. *This is
dependent on verification that the archiving is working properly. A
belt and suspenders approach would be to set wal_keep_segments to a
value > 0 in the event archiving is not properly functioning.*"

> Regards, Phil.

--
Adrian Klaver
adrian.klaver@aklaver.com
Greetings,

* Adrian Klaver (adrian.klaver@aklaver.com) wrote:
> On 08/13/2018 05:08 AM, Phil Endecott wrote:
>> Really? I thought the intention was that the system should be able to
>> recover reliably when the slave reconnects after a period of downtime,
>> subject only to there being sufficient network/CPU/disk bandwidth etc.
>> for it to eventually catch up.

That's correct.

> See also my reply to Stephen earlier. Basically you are trying to
> coordinate two different operations. They start from the same source,
> pg_xlog (pg_wal in 10+), but arrive on a different time scale and from
> different locations. Without sufficient sanity checks it is possible
> they diverge enough on one or both paths to render the process
> unstable.

This isn't what's happening. We're not talking about a timeline change
here, or a replica being promoted to be a primary in general. There's
no diverging happening: it's the same consistent WAL stream, just
coming from two different sources, which PG is specifically designed to
handle and should be handling seamlessly.

> I would say that:
>
> "If you set up a WAL archive that's accessible from the standby, these
> solutions are not required, since the standby can always use the
> archive to catch up provided it retains enough segments."
>
> should be more like:
>
> "If you set up a WAL archive that's accessible from the standby, these
> solutions are not required, since the standby can always use the
> archive to catch up provided it retains enough segments. *This is
> dependent on verification that the archiving is working properly. A
> belt and suspenders approach would be to set wal_keep_segments to a
> value > 0 in the event archiving is not properly functioning.*"

I don't think I could disagree more with this additional wording, and I
*really* don't think we should be encouraging people to set a high
wal_keep_segments. The specific case here looks like it just needs to
be set to, exactly, '1', to ensure that the primary hasn't removed the
last WAL file that it archived.

Thanks!

Stephen
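[Concretely, that would be the following line in postgresql.conf on the
master, in the style of the configuration at the top of the thread — a
sketch of the setting under discussion, not a settled recommendation.]

wal_keep_segments = 1    # keep the most recently archived segment so the
                         # standby can re-fetch it when switching from
                         # restore_command to streaming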
Adrian Klaver wrote:
> "If you set up a WAL archive that's accessible from the standby, these
> solutions are not required, since the standby can always use the
> archive to catch up provided it retains enough segments. *This is
> dependent on verification that the archiving is working properly. A
> belt and suspenders approach would be to set wal_keep_segments to a
> value > 0 in the event archiving is not properly functioning.*"

Adrian, I believe the suggestion that my issue was the result of my
archiving process not working is incorrect. The quote above does not
address the requirement for wal_keep_segments to be >= 1 even when
archiving is functioning correctly.

I will continue to monitor this thread in the hope that others will
confirm my understanding, but otherwise I will bow out now (and file a
bug).

Thanks to all.

Phil.