Thread: Replication failure, slave requesting old segments
Dear Experts,

I recently set up replication for the first time. It seemed to be working OK in my initial tests, but didn't cope when the slave was down for a longer period. This is all with the Debian stable packages of PostgreSQL 9.6. My replication setup involves a third server, "backup", to and from which WAL files are copied using scp (and essentially never deleted), plus streaming replication in "hot standby" mode.

On the master, I have:

wal_level = replica
archive_mode = on
archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
                   scp %p backup:backup/postgresql/archivedir/%f'

On the slave I have:

standby_mode = 'on'
primary_conninfo = 'user=postgres host=master port=5432'
restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
hot_standby = on

This all seemed to work OK until the slave was shut down for a few days. (During this time the master was also down for a shorter period.) When the slave came back up it successfully copied several WAL files from the backup server until it reached one that hadn't been created yet:

2018-08-10 22:37:38.322 UTC [615] LOG: restored log file "0000000100000007000000CC" from archive
2018-08-10 22:38:02.011 UTC [615] LOG: restored log file "0000000100000007000000CD" from archive
2018-08-10 22:38:29.802 UTC [615] LOG: restored log file "0000000100000007000000CE" from archive
2018-08-10 22:38:55.973 UTC [615] LOG: restored log file "0000000100000007000000CF" from archive
2018-08-10 22:39:24.964 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory

At this point there was a temporary problem due to a firewall rule that blocked the replication connection to the master:

2018-08-10 22:41:35.865 UTC [1289] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

The slave then entered a loop, retrying to scp the next file from the backup archive (and failing) and connecting to the master for streaming replication (and also failing). That is as expected, except for the "invalid magic number":

scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-10 22:41:37.018 UTC [615] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
2018-08-10 22:43:46.936 UTC [1445] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

Note that it's complaining about an invalid magic number in log segment D1, which is the one that has just failed to scp. That looked suspicious to me and I remembered notes in the docs about ensuring that the archive and/or restore commands return an error exit status when they fail; I checked that, and yes, scp does exit(1) when the requested file doesn't exist:

$ scp backup:/egheriugherg /tmp/a
scp: /egheriugherg: No such file or directory
$ echo $?
1

Anyway, the slave continued in its retrying loop as expected, except that each time it re-fetched the previous segment, D0; this seems a bit peculiar:

2018-08-10 22:44:17.796 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-10 22:46:28.727 UTC [1628] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

(Note the "invalid magic number" was only reported the first time.)

Eventually I fixed the networking issue that had prevented the replication connection to the master, and it could start streaming:

2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed

So here is the real problem which worries me: it seems to ask the master for segment D0, which is one that it did successfully restore from the backup earlier on, and the master - not unexpectedly - reports that it has already been removed.

I tried restarting the slave:

2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
2018-08-11 00:12:16.068 UTC [7953] LOG: database system is ready to accept read only connections
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:12:17.057 UTC [7954] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
2018-08-11 00:12:17.242 UTC [8034] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:12:17.242 UTC [8034] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed

Note that again it failed to get D1 from the backup, as expected, then reported "invalid magic number" in that file, and again tried to start streaming from a segment that the master doesn't have.

Eventually I decided to try to cause the D1 file to exist in the backup:

db=> create table waste_space as select * from large_table;

That caused four new WAL segments to be copied to the backup, and restored by the slave:

2018-08-11 00:55:49.741 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
2018-08-11 00:56:12.304 UTC [7954] LOG: restored log file "0000000100000007000000D1" from archive
2018-08-11 00:56:35.481 UTC [7954] LOG: restored log file "0000000100000007000000D2" from archive
2018-08-11 00:56:57.443 UTC [7954] LOG: restored log file "0000000100000007000000D3" from archive
2018-08-11 00:57:21.723 UTC [7954] LOG: restored log file "0000000100000007000000D4" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
2018-08-11 00:57:22.915 UTC [7954] LOG: unexpected pageaddr 7/C7000000 in log segment 0000000100000007000000D5, offset 0
2018-08-11 00:57:23.114 UTC [12348] LOG: started streaming WAL from primary at 7/D5000000 on timeline 1

Note the "unexpected pageaddr" message. Despite that, the streaming started and everything was working correctly. Hurray!

I remain concerned, however, that something bad happened and I'd like to understand it. Can anyone help?

Thanks, Phil.
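A note for readers following along: a quick way to check whether an archived segment such as D0 is a complete, valid WAL file (rather than a recycled or partially copied one) is pg_xlogdump, which ships with PostgreSQL 9.6. This is a minimal sketch, not something from the thread; the local paths are illustrative assumptions:

$ # Copy the suspect segment out of the archive, keeping its name intact
$ # (pg_xlogdump parses the file name to find the starting position).
$ scp backup:backup/postgresql/archivedir/0000000100000007000000D0 /tmp/
$ # A healthy segment prints decoded WAL records up to near the end of
$ # the file; a zero-filled or truncated file fails almost immediately,
$ # e.g. with an "invalid magic number" error.
$ pg_xlogdump /tmp/0000000100000007000000D0 | tail -n 3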
On 08/11/2018 08:18 AM, Phil Endecott wrote:
> Dear Experts,
>
> I recently set up replication for the first time. It seemed to be working OK in my initial tests, but didn't cope when the slave was down for a longer period. This is all with the Debian stable packages of PostgreSQL 9.6. My replication setup involves a third server, "backup", to and from which WAL files are copied using scp (and essentially never deleted), plus streaming replication in "hot standby" mode.
>
> On the master, I have:
>
> wal_level = replica
> archive_mode = on
> archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
>                    scp %p backup:backup/postgresql/archivedir/%f'
>
> On the slave I have:
>
> standby_mode = 'on'
> primary_conninfo = 'user=postgres host=master port=5432'
> restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
> hot_standby = on
>
> This all seemed to work OK until the slave was shut down for a few days. (During this time the master was also down for a shorter period.) When the slave came back up it successfully copied several WAL files from the backup server until it reached one that hadn't been created yet:
>
> 2018-08-10 22:37:38.322 UTC [615] LOG: restored log file "0000000100000007000000CC" from archive
> 2018-08-10 22:38:02.011 UTC [615] LOG: restored log file "0000000100000007000000CD" from archive
> 2018-08-10 22:38:29.802 UTC [615] LOG: restored log file "0000000100000007000000CE" from archive
> 2018-08-10 22:38:55.973 UTC [615] LOG: restored log file "0000000100000007000000CF" from archive
> 2018-08-10 22:39:24.964 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
>
> At this point there was a temporary problem due to a firewall rule that blocked the replication connection to the master:
>
> 2018-08-10 22:41:35.865 UTC [1289] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly
>
> The slave then entered a loop, retrying to scp the next file from the backup archive (and failing) and connecting to the master for streaming replication (and also failing). That is as expected, except for the "invalid magic number":
>
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-10 22:41:37.018 UTC [615] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
> 2018-08-10 22:43:46.936 UTC [1445] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly
>
> Note that it's complaining about an invalid magic number in log segment D1, which is the one that has just failed to scp. That looked suspicious to me and I remembered notes in the docs about ensuring that the archive and/or restore commands return an error exit status when they fail; I checked that, and yes, scp does exit(1) when the requested file doesn't exist:
>
> $ scp backup:/egheriugherg /tmp/a
> scp: /egheriugherg: No such file or directory
> $ echo $?
> 1
>
> Anyway, the slave continued in its retrying loop as expected, except that each time it re-fetched the previous segment, D0; this seems a bit peculiar:
>
> 2018-08-10 22:44:17.796 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-10 22:46:28.727 UTC [1628] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly
>
> (Note the "invalid magic number" was only reported the first time.)
>
> Eventually I fixed the networking issue that had prevented the replication connection to the master, and it could start streaming:
>
> 2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> So here is the real problem which worries me: it seems to ask the master for segment D0, which is one that it did successfully restore from the backup earlier on, and the master - not unexpectedly - reports that it has already been removed.
>
> I tried restarting the slave:
>
> 2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
> 2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
> 2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
> 2018-08-11 00:12:16.068 UTC [7953] LOG: database system is ready to accept read only connections
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:12:17.057 UTC [7954] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
> 2018-08-11 00:12:17.242 UTC [8034] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:12:17.242 UTC [8034] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> Note that again it failed to get D1 from the backup, as expected, then reported "invalid magic number" in that file, and again tried to start streaming from a segment that the master doesn't have.
>
> Eventually I decided to try to cause the D1 file to exist in the backup:
>
> db=> create table waste_space as select * from large_table;
>
> That caused four new WAL segments to be copied to the backup, and restored by the slave:
>
> 2018-08-11 00:55:49.741 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
> 2018-08-11 00:56:12.304 UTC [7954] LOG: restored log file "0000000100000007000000D1" from archive
> 2018-08-11 00:56:35.481 UTC [7954] LOG: restored log file "0000000100000007000000D2" from archive
> 2018-08-11 00:56:57.443 UTC [7954] LOG: restored log file "0000000100000007000000D3" from archive
> 2018-08-11 00:57:21.723 UTC [7954] LOG: restored log file "0000000100000007000000D4" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
> 2018-08-11 00:57:22.915 UTC [7954] LOG: unexpected pageaddr 7/C7000000 in log segment 0000000100000007000000D5, offset 0
> 2018-08-11 00:57:23.114 UTC [12348] LOG: started streaming WAL from primary at 7/D5000000 on timeline 1
>
> Note the "unexpected pageaddr" message. Despite that, the streaming started and everything was working correctly. Hurray!
>
> I remain concerned, however, that something bad happened and I'd like to understand it. Can anyone help?

Looks like the master recycled the WAL's while the slave could not connect. Might want to take a look at replication slots to keep WAL's from being recycled:

https://www.postgresql.org/docs/9.6/static/warm-standby.html#STREAMING-REPLICATION-SLOTS

with the heads up that slots do not limit the space taken up by WAL's.

The "unexpected pageaddr" is probably for the same reason, per the comments in xlogreader.c:

"
	/*
	 * Check that the address on the page agrees with what we expected.
	 * This check typically fails when an old WAL segment is recycled,
	 * and hasn't yet been overwritten with new data yet.
	 */
...
	report_invalid_record(state,
			      "unexpected pageaddr %X/%X in log segment %s, offset %u",
			      (uint32) (hdr->xlp_pageaddr >> 32), (uint32) hdr->xlp_pageaddr,
			      fname, offset);
"

> Thanks,
>
> Phil.

-- 
Adrian Klaver
adrian.klaver@aklaver.com
Hi Adrian,

Adrian Klaver wrote:
> Looks like the master recycled the WAL's while the slave could not connect.

Yes but... why is that a problem? The master is copying the WALs to the backup server using scp, where they remain forever. The slave gets them from there before it starts streaming. So it shouldn't matter if the master recycles them, as the slave should be able to get everything using the combination of scp and then streaming.

Am I missing something about how this sort of replication is supposed to work?

Thanks, Phil.
On 08/11/2018 12:42 PM, Phil Endecott wrote:
> Hi Adrian,
>
> Adrian Klaver wrote:
>> Looks like the master recycled the WAL's while the slave could not connect.
>
> Yes but... why is that a problem? The master is copying the WALs to the backup server using scp, where they remain forever. The slave gets

To me it looks like that did not happen:

2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed

Above, 0000000100000007000000D0 is gone/recycled on the master, and the archived version does not seem to be complete, as the streaming replication is trying to find it.

Below, you kick the master and it coughs up the files to the archive, including *D0 and *D1 on up to *D4, and then the streaming picks up using *D5:

2018-08-11 00:55:49.741 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
2018-08-11 00:56:12.304 UTC [7954] LOG: restored log file "0000000100000007000000D1" from archive
2018-08-11 00:56:35.481 UTC [7954] LOG: restored log file "0000000100000007000000D2" from archive
2018-08-11 00:56:57.443 UTC [7954] LOG: restored log file "0000000100000007000000D3" from archive
2018-08-11 00:57:21.723 UTC [7954] LOG: restored log file "0000000100000007000000D4" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
2018-08-11 00:57:22.915 UTC [7954] LOG: unexpected pageaddr 7/C7000000 in log segment 0000000100000007000000D5, offset 0
2018-08-11 00:57:23.114 UTC [12348] LOG: started streaming WAL from primary at 7/D5000000 on timeline 1

Best guess is the archiving did not work as expected during: "(During this time the master was also down for a shorter period.)"

> them from there before it starts streaming. So it shouldn't matter if the master recycles them, as the slave should be able to get everything using the combination of scp and then streaming.
>
> Am I missing something about how this sort of replication is supposed to work?
>
> Thanks, Phil.

-- 
Adrian Klaver
adrian.klaver@aklaver.com
Greetings,

* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
> Dear Experts,

Since you're asking ...

> I recently set up replication for the first time. It seemed to be working OK in my initial tests, but didn't cope when the slave was down for a longer period. This is all with the Debian stable packages of PostgreSQL 9.6. My replication setup involves a third server, "backup", to and from which WAL files are copied using scp (and essentially never deleted), plus streaming replication in "hot standby" mode.
>
> On the master, I have:
>
> wal_level = replica
> archive_mode = on
> archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
>                    scp %p backup:backup/postgresql/archivedir/%f'

This is really not a sufficient or particularly intelligent archive_command. In general, I'd strongly recommend against trying to roll your own archiving/backup solution for PostgreSQL. In particular, the above is quite expensive, doesn't address the case where a file is only partially copied to the backup server, and doesn't ensure that the file is completely written and fsync'd on the backup server, meaning that a failure will likely result in WAL going missing. There are much better solutions out there which you really should be using instead of trying to build your own. In particular, my preference is pgbackrest (though I'm one of the folks in its development, to be clear), but there are other options such as barman or WAL-E which I believe also address the concerns raised above.

> On the slave I have:
>
> standby_mode = 'on'
> primary_conninfo = 'user=postgres host=master port=5432'
> restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
> hot_standby = on
>
> This all seemed to work OK until the slave was shut down for a few days. (During this time the master was also down for a shorter period.) When the slave came back up it successfully copied several WAL files from the backup server until it reached one that hadn't been created yet:

That all sounds fine.

> 2018-08-10 22:37:38.322 UTC [615] LOG: restored log file "0000000100000007000000CC" from archive
> 2018-08-10 22:38:02.011 UTC [615] LOG: restored log file "0000000100000007000000CD" from archive
> 2018-08-10 22:38:29.802 UTC [615] LOG: restored log file "0000000100000007000000CE" from archive
> 2018-08-10 22:38:55.973 UTC [615] LOG: restored log file "0000000100000007000000CF" from archive
> 2018-08-10 22:39:24.964 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
>
> At this point there was a temporary problem due to a firewall rule that blocked the replication connection to the master:
>
> 2018-08-10 22:41:35.865 UTC [1289] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

Well, that's certainly not good.

> The slave then entered a loop, retrying to scp the next file from the backup archive (and failing) and connecting to the master for streaming replication (and also failing). That is as expected, except for the "invalid magic number":
>
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-10 22:41:37.018 UTC [615] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
> 2018-08-10 22:43:46.936 UTC [1445] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

Note that the invalid magic number error is just LOG-level, and that PostgreSQL will certainly ask for WAL files which don't exist in the archive yet.

> Note that it's complaining about an invalid magic number in log segment D1, which is the one that has just failed to scp. That looked suspicious to me and I remembered notes in the docs about ensuring that the archive and/or restore commands return an error exit status when they fail; I checked that, and yes, scp does exit(1) when the requested file doesn't exist:

Sure, but scp doesn't produce any output when it's successful, so are you sure that an scp wasn't performed after the "no such file or directory" log message, one which perhaps performed a partial copy (see above about *exactly* that concern being mentioned by me, even before getting down to this part, due to using these simple unix commands...)?

> $ scp backup:/egheriugherg /tmp/a
> scp: /egheriugherg: No such file or directory
> $ echo $?
> 1

scp has no idea if it made a copy of a partial file though.

> Anyway, the slave continued in its retrying loop as expected, except that each time it re-fetched the previous segment, D0; this seems a bit peculiar:
>
> 2018-08-10 22:44:17.796 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-10 22:46:28.727 UTC [1628] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

Having it go back to D0 is at least a bit interesting. I wonder if somehow it was partial.. Have you looked at the files on the archive server to make sure they're all 16MB and the same size?

> (Note the "invalid magic number" was only reported the first time.)

Part of the issue here is that there's no logging happening with these scp commands, so we don't really know what happened, exactly. I'm wondering if perhaps the scp had *just* started copying the file from the primary to the backup server when the scp from the backup server to the replica started, and it ended up copying a zero-byte file - though if that were the case then the next time around the file should have been there, unless there was some network hiccup between the primary and the backup server that caused the copy to fail...

> Eventually I fixed the networking issue that had prevented the replication connection to the master, and it could start streaming:
>
> 2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> So here is the real problem which worries me: it seems to ask the master for segment D0, which is one that it did successfully restore from the backup earlier on, and the master - not unexpectedly - reports that it has already been removed.

If D0 has been removed from the primary then it should be on the backup server, right? Are you sure it actually is, and that it's a complete segment? I'm starting to think that you're in exactly the predicament that I mentioned above - maybe D0 was a partial copy to the backup server, and that failed and was returned to PG as failed, but then it tried to run again and the stupid 'test' command returned success because the file exists, even though it wasn't complete...

> I tried restarting the slave:
>
> 2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
> 2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
> 2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
> 2018-08-11 00:12:16.068 UTC [7953] LOG: database system is ready to accept read only connections
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:12:17.057 UTC [7954] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
> 2018-08-11 00:12:17.242 UTC [8034] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:12:17.242 UTC [8034] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> Note that again it failed to get D1 from the backup, as expected, then reported "invalid magic number" in that file, and again tried to start streaming from a segment that the master doesn't have.
>
> Eventually I decided to try to cause the D1 file to exist in the backup:
>
> db=> create table waste_space as select * from large_table;
>
> That caused four new WAL segments to be copied to the backup, and restored by the slave:
>
> 2018-08-11 00:55:49.741 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
> 2018-08-11 00:56:12.304 UTC [7954] LOG: restored log file "0000000100000007000000D1" from archive
> 2018-08-11 00:56:35.481 UTC [7954] LOG: restored log file "0000000100000007000000D2" from archive
> 2018-08-11 00:56:57.443 UTC [7954] LOG: restored log file "0000000100000007000000D3" from archive
> 2018-08-11 00:57:21.723 UTC [7954] LOG: restored log file "0000000100000007000000D4" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
> 2018-08-11 00:57:22.915 UTC [7954] LOG: unexpected pageaddr 7/C7000000 in log segment 0000000100000007000000D5, offset 0
> 2018-08-11 00:57:23.114 UTC [12348] LOG: started streaming WAL from primary at 7/D5000000 on timeline 1
>
> Note the "unexpected pageaddr" message. Despite that, the streaming started and everything was working correctly. Hurray!

So, just to be clear, those LOG-level messages aren't generally an issue, they're just LOGs. We expect to see them.

> I remain concerned, however, that something bad happened and I'd like to understand it. Can anyone help?

Really hard to tell from this. On the one hand, it seems like maybe the primary was done with D0, but hadn't finished with D1 yet, and the replica hadn't made that transition from D0 -> D1, hence why it was still asking for D0.

I'm certainly concerned that there might have been some amount of partial-WAL-file copying happening, but PG is pretty good at dealing with a variety of odd cases, so I tend to doubt that the replica actually ended up being corrupted. But I'll again strongly advocate for using a proven solution which avoids these risks by doing things like copying into temporary files on the backup server and then using an atomic 'mv' to make sure that anyone reading from the archive wouldn't ever see a partially-written file, and, just to double-check, also calculating a hash of every WAL file written into the archive and re-verifying that hash when reading it back out. That's what we do in pgbackrest, at least.

Thanks!

Stephen
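For readers who want to keep a plain ssh-based setup rather than adopt pgbackrest or barman, the temp-file/fsync/atomic-rename scheme Stephen describes can be approximated with a small script on the backup server. This is a hedged sketch, not a vetted tool: the script name and path are made up for illustration, sha256sum is one possible hash choice, and `sync FILE` requires GNU coreutils 8.24 or later (Debian stretch ships 8.26).

$ cat ~/bin/archive-wal.sh
#!/bin/sh
# Receive one WAL segment on stdin and store it atomically.
# Invoked over ssh by archive_command as: archive-wal.sh <segment-name>
set -e
dir=backup/postgresql/archivedir
f="$dir/$1"
if [ -e "$f" ]; then exit 1; fi      # never overwrite an archived segment
tmp="$f.partial.$$"
cat > "$tmp"                         # write the segment under a temp name
sha256sum < "$tmp" > "$tmp.sha256"   # record a hash to re-verify on restore
sync "$tmp" "$tmp.sha256"            # force the file data to disk
mv "$tmp.sha256" "$f.sha256"         # renames are atomic within one
mv "$tmp" "$f"                       #   filesystem: readers never see a
sync "$dir"                          #   partial segment; persist dir entry

The master's archive_command would then be something like 'ssh backup bin/archive-wal.sh %f < %p', and the slave's restore_command could fetch both files and compare the hash before handing the segment to PostgreSQL.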
Hi Stephen,

Stephen Frost wrote:
> * Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
>> archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
>>                    scp %p backup:backup/postgresql/archivedir/%f'
>
> This is really not a sufficient or particularly intelligent archive_command. In general, I'd strongly recommend against trying to roll your own archiving/backup solution for PostgreSQL. In particular, the above is quite expensive, doesn't address the case where a file is only partially copied to the backup server, and doesn't ensure that the file is completely written and fsync'd on the backup server, meaning that a failure will likely result in WAL going missing. There are much better solutions out there which you really should be using instead of trying to build your own. In particular, my preference is pgbackrest (though I'm one of the folks in its development, to be clear), but there are other options such as barman or WAL-E which I believe also address the concerns raised above.

Hmmm, well, I based it on what I read in the documentation:

https://www.postgresql.org/docs/9.6/static/continuous-archiving.html

The expense is not an issue in this case. I would be surprised if scp were to exit(0) after only partially copying a file, but perhaps you know something I don't about that.

I know that rsync creates a temporary file and moves it into place when all the data has been transferred, which scp doesn't do; perhaps that would be safer.

The master was only writing WAL files maybe 3 or 4 times per day at the time, so any scenario that requires its crash to have occurred exactly as it was doing an scp involves a substantial coincidence.

I am limited in what I can run on the backup server.

> Having it go back to D0 is at least a bit interesting. I wonder if somehow it was partial.. Have you looked at the files on the archive server to make sure they're all 16MB and the same size?

Yes, all the files on the backup including D0 were 16 MB. But:

>> 2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
>> 2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
>> 2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088

Are the last two log lines above telling us anything useful? Is that saying that, of the 16 MB (0x1000000 byte) WAL file it restored, it only got as far as byte 0xF956C0 or 0xFFF088? Is that what we would expect? Is it trying to use the streaming connection to get the missing bytes from 0xFFF088 to 0xFFFFFF, the end of the file? Is that just an empty gap at the end of the file due to the next record being too big to fit?

>> db=> create table waste_space as select * from large_table;
>>
>> That caused four new WAL segments to be copied to the backup, and restored by the slave:

Note again that this was sufficient to make it start working again, which I think is important.

> Really hard to tell from this. On the one hand, it seems like maybe the primary was done with D0, but hadn't finished with D1 yet, and the replica hadn't made that transition from D0 -> D1, hence why it was still asking for D0.

Yes. What is actually supposed to happen? What exactly do you mean by "make the transition"?

> I'm certainly concerned that there might have been some amount of partial-WAL-file copying happening, but PG is pretty good at dealing with a variety of odd cases, so I tend to doubt that the replica actually ended up being corrupted. But I'll again strongly advocate for using a proven solution which avoids these risks by doing things like copying into temporary files on the backup server and then using an atomic 'mv' to make sure that anyone reading from the archive wouldn't ever see a partially-written file, and, just to double-check, also calculating a hash of every WAL file written into the archive and re-verifying that hash when reading it back out. That's what we do in pgbackrest, at least.

Noted. But I don't see any evidence that that was actually the root cause here, and it relies on an unlikely timing coincidence. This is the very first time that I've had a replica recover after a significant downtime, and it failed.

Regards, Phil.
Hi Adrian,

Adrian Klaver wrote:
> On 08/11/2018 12:42 PM, Phil Endecott wrote:
>> Hi Adrian,
>>
>> Adrian Klaver wrote:
>>> Looks like the master recycled the WAL's while the slave could not connect.
>>
>> Yes but... why is that a problem? The master is copying the WALs to the backup server using scp, where they remain forever. The slave gets
>
> To me it looks like that did not happen:
>
> 2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> Above, 0000000100000007000000D0 is gone/recycled on the master, and the archived version does not seem to be complete, as the streaming replication is trying to find it.

The files on the backup server were all 16 MB.

> Below, you kick the master and it coughs up the files to the archive, including *D0 and *D1 on up to *D4, and then the streaming picks up using *D5.

When I kicked it, the master wrote D1 to D4 to the backup. It did not change D0 (its modification time on the backup is from before the "kick"). The slave re-read D0, again, as it had been doing throughout this period, and then read D1 to D4.

> Best guess is the archiving did not work as expected during:
>
> "(During this time the master was also down for a shorter period.)"

Around the time the master was down, the WAL segment names were CB and CC. Files CD to CF were written between the master coming up and the slave coming up. The slave had no trouble restoring those segments when it started. The problematic segments D0 and D1 were the ones that were "current" when the slave restarted, at which time the master was up consistently.

Regards, Phil.
Phil Endecott wrote:
> On the master, I have:
>
> wal_level = replica
> archive_mode = on
> archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
>                    scp %p backup:backup/postgresql/archivedir/%f'
>
> On the slave I have:
>
> standby_mode = 'on'
> primary_conninfo = 'user=postgres host=master port=5432'
> restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
> hot_standby = on

> 2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed

I am wondering if I need to set wal_keep_segments to at least 1 or 2 for this to work. I currently have it unset, and I believe the default is 0.

My understanding was that when using archive_command/restore_command to copy WAL segments it would not be necessary to use wal_keep_segments to retain files in pg_xlog on the server; the slave can get everything using a combination of copying files via the restore_command and streaming. But these lines from the log:

2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088

make me think that there is an issue when the slave reaches the end of the copied WAL file. I speculate that the useful content of this WAL segment ends at FFF088, which is followed by an empty gap due to record sizes. But the slave tries to start streaming from this point, D0FFF088, not D1000000. If the master still had a copy of segment D0 then it would be able to stream this gap followed by the real content in the current segment D1.

Does that make any sense at all?

Regards, Phil.
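For concreteness: the setting Phil is considering lives in postgresql.conf on the master, and the 9.6 admin functions can show where the write and replay positions sit relative to segment boundaries. A small illustrative sketch (not from the thread):

# postgresql.conf on the master: keep at least the last completed
# segment in pg_xlog even after it has been archived.
wal_keep_segments = 1

-- On the master: the current write position and its segment name.
db=> SELECT pg_current_xlog_location(),
            pg_xlogfile_name(pg_current_xlog_location());

-- On the slave: how far WAL replay has progressed.
db=> SELECT pg_last_xlog_replay_location();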
On 08/12/2018 12:25 PM, Phil Endecott wrote:
> Hi Adrian,
>
> Adrian Klaver wrote:
>> On 08/11/2018 12:42 PM, Phil Endecott wrote:
>>> Hi Adrian,
>>>
>>> Adrian Klaver wrote:
>>>> Looks like the master recycled the WAL's while the slave could not connect.
>>>
>>> Yes but... why is that a problem? The master is copying the WALs to the backup server using scp, where they remain forever. The slave gets
>>
>> To me it looks like that did not happen:
>>
>> 2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
>> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
>> 2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
>> 2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>>
>> Above, 0000000100000007000000D0 is gone/recycled on the master, and the archived version does not seem to be complete, as the streaming replication is trying to find it.
>
> The files on the backup server were all 16 MB.

WAL files are created/recycled as 16 MB files, which is not the same as saying they are complete for the purposes of restoring. In other words, you could be looking at a 16 MB file full of 0's.

>> Below, you kick the master and it coughs up the files to the archive, including *D0 and *D1 on up to *D4, and then the streaming picks up using *D5.
>
> When I kicked it, the master wrote D1 to D4 to the backup. It did not change D0 (its modification time on the backup is from before the "kick"). The slave re-read D0, again, as it had been doing throughout this period, and then read D1 to D4.

Well, something happened, because the slave could not get all the information it needed from the D0 in the archive and was trying to get it from the master's pg_xlog.

>> Best guess is the archiving did not work as expected during:
>>
>> "(During this time the master was also down for a shorter period.)"
>
> Around the time the master was down, the WAL segment names were CB and CC. Files CD to CF were written between the master coming up and the slave coming up. The slave had no trouble restoring those segments when it started. The problematic segments D0 and D1 were the ones that were "current" when the slave restarted, at which time the master was up consistently.
>
> Regards, Phil.

-- 
Adrian Klaver
adrian.klaver@aklaver.com
On 08/12/2018 12:53 PM, Phil Endecott wrote:
> Phil Endecott wrote:
>> On the master, I have:
>>
>> wal_level = replica
>> archive_mode = on
>> archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
>>                    scp %p backup:backup/postgresql/archivedir/%f'
>>
>> On the slave I have:
>>
>> standby_mode = 'on'
>> primary_conninfo = 'user=postgres host=master port=5432'
>> restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
>> hot_standby = on
>
>> 2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
>> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
>> 2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
>> 2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> I am wondering if I need to set wal_keep_segments to at least 1 or 2 for this to work. I currently have it unset, and I believe the default is 0.

Given that WAL's are only 16 MB I would probably bump it up to be on the safe side, or use:

https://www.postgresql.org/docs/9.6/static/warm-standby.html
26.2.6. Replication Slots

Though the above does not limit storage of WAL's, so a long outage could result in WAL's piling up.

> My understanding was that when using archive_command/restore_command to copy WAL segments it would not be necessary to use wal_keep_segments to retain files in pg_xlog on the server; the slave can get everything using a combination of copying files via the restore_command and streaming. But these lines from the log:
>
> 2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
> 2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
>
> make me think that there is an issue when the slave reaches the end of the copied WAL file. I speculate that the useful content of this WAL segment ends at FFF088, which is followed by an empty gap due to record sizes. But the slave tries to start streaming from this point, D0FFF088, not D1000000. If the master still had a copy of segment D0 then it would be able to stream this gap followed by the real content in the current segment D1.
>
> Does that make any sense at all?
>
> Regards, Phil.

-- 
Adrian Klaver
adrian.klaver@aklaver.com
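To make the slot suggestion concrete: a physical replication slot is created once on the master, and the slave names it in recovery.conf; the master then retains WAL until the slave has streamed it. A minimal sketch, with the slot name standby1 chosen purely for illustration:

-- On the master (run once):
db=> SELECT pg_create_physical_replication_slot('standby1');

# In recovery.conf on the slave, alongside the existing settings:
primary_slot_name = 'standby1'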
Greetings,

* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
> Stephen Frost wrote:
>> * Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
>>> archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
>>>                    scp %p backup:backup/postgresql/archivedir/%f'
>>
>> This is really not a sufficient or particularly intelligent archive_command. In general, I'd strongly recommend against trying to roll your own archiving/backup solution for PostgreSQL. In particular, the above is quite expensive, doesn't address the case where a file is only partially copied to the backup server, and doesn't ensure that the file is completely written and fsync'd on the backup server, meaning that a failure will likely result in WAL going missing. There are much better solutions out there which you really should be using instead of trying to build your own. In particular, my preference is pgbackrest (though I'm one of the folks in its development, to be clear), but there are other options such as barman or WAL-E which I believe also address the concerns raised above.
>
> Hmmm, well, I based it on what I read in the documentation:
>
> https://www.postgresql.org/docs/9.6/static/continuous-archiving.html
>
> The expense is not an issue in this case. I would be surprised if scp were to exit(0) after only partially copying a file, but perhaps you know something I don't about that.

The scp might not, but so what? If a partial file remains on the far side, then a subsequent invocation of that archive_command will come back saying that the file is there. Further, the file isn't going to actually be sync'd to the filesystem on the far side, so even if the scp returns success, a failure on the backup server could result in loss of the WAL file.

> I know that rsync creates a temporary file and moves it into place when all the data has been transferred, which scp doesn't do; perhaps that would be safer.

No, what would be better is using a tool which will actually fsync() the file on the backup server after writing it out there, making sure it's been stored to the filesystem, then perform an atomic mv operation to put it in place, and then fsync() the directory to ensure that change has also been sync'd to disk. Even better is to calculate a hash of the file and reverify that on restore, to make sure that it doesn't end up getting corrupted on the backup server somehow.

> The master was only writing WAL files maybe 3 or 4 times per day at the time, so any scenario that requires its crash to have occurred exactly as it was doing an scp involves a substantial coincidence.

While I get that it's unlikely to be relevant to this particular case, I wouldn't recommend depending on never having that coincidence happen. What I've seen quite a bit of, just so you know, is that things tend to all break down at about the same time - so right when something bad is happening on the primary, the backup server runs into issues too. This might be due to networking breaking down in a particular area all at once, or shared storage having some issue, but it's happened often enough that relying on uncommon things to avoid each other is something I've got a hard time doing these days.

> I am limited in what I can run on the backup server.

You could consider backing up locally using a tool which is built for PostgreSQL and then sync'ing the results of that backup over to the backup server, perhaps. Or point out that you need a backup server where you don't have those restrictions in order to properly have backups. Or use an external service like s3 (or a clone) to back up to.

>> Having it go back to D0 is at least a bit interesting. I wonder if somehow it was partial.. Have you looked at the files on the archive server to make sure they're all 16MB and the same size?
>
> Yes, all the files on the backup including D0 were 16 MB. But:

Ok. Did you see any failures on the sending side when archive_command was being run? PG shouldn't start trying to archive the file until it's done with it, of course, so unless something failed right at the end of the scp and left the file at 16MB but it wasn't actually filled..

>>> 2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
>>> 2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
>>> 2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
>
> Are the last two log lines above telling us anything useful? Is that saying that, of the 16 MB (0x1000000 byte) WAL file it restored, it only got as far as byte 0xF956C0 or 0xFFF088? Is that what we would expect? Is it trying to use the streaming connection to get the missing bytes from 0xFFF088 to 0xFFFFFF, the end of the file? Is that just an empty gap at the end of the file due to the next record being too big to fit?

The short answer is that, yes, the next record was likely too big to fit, but that's what the replica was trying to figure out and couldn't, because D0 was gone from the primary already. One thing which should help this would be to use physical replication slots on the primary, which would keep it from throwing away WAL files until it knows the replica has them, but that runs the risk of ending up with lots of extra WAL on the primary if the replica is gone for a while. If you'd prefer to avoid that then having wal_keep_segments set to '1' would avoid this particular issue as well, I'd expect.

I do wonder if perhaps we should just default to having it as '1' to avoid exactly this case, as it seems like perhaps PG archived D0 and then flipped to D1 and got rid of D0, which is all pretty reasonable, except that a replica trying to catch up is going to end up asking for D0 from the primary because it didn't know if there was anything else that should have been in D0..

>>> db=> create table waste_space as select * from large_table;
>>>
>>> That caused four new WAL segments to be copied to the backup, and restored by the slave:
>
> Note again that this was sufficient to make it start working again, which I think is important.

I agree that it's certainly interesting, though presumably what ends up happening is that the replica finally gets to D1 and is then able to connect to the primary before the primary has finished with D1, and it's able to start tracking along with the primary.

>> Really hard to tell from this. On the one hand, it seems like maybe the primary was done with D0, but hadn't finished with D1 yet, and the replica hadn't made that transition from D0 -> D1, hence why it was still asking for D0.
>
> Yes. What is actually supposed to happen? What exactly do you mean by "make the transition"?

Simply moving along the WAL stream to get to the D1 WAL file. This all more-or-less looks like what's "supposed to happen", except that the lack of additional writes to the primary meant that things appeared to get "stuck" here, and then got "unstuck" once you created some data. Looks like this might be a bit of an edge case that we aren't handling very cleanly.

>> I'm certainly concerned that there might have been some amount of partial-WAL-file copying happening, but PG is pretty good at dealing with a variety of odd cases, so I tend to doubt that the replica actually ended up being corrupted. But I'll again strongly advocate for using a proven solution which avoids these risks by doing things like copying into temporary files on the backup server and then using an atomic 'mv' to make sure that anyone reading from the archive wouldn't ever see a partially-written file, and, just to double-check, also calculating a hash of every WAL file written into the archive and re-verifying that hash when reading it back out. That's what we do in pgbackrest, at least.
>
> Noted. But I don't see any evidence that that was actually the root cause here, and it relies on an unlikely timing coincidence. This is the very first time that I've had a replica recover after a significant downtime, and it failed.

Failed is really over-stating things, isn't it? The replica caught all the way up to D0 just fine; it just couldn't get past that, because the primary had removed D0 already and the replica couldn't actually reconnect to the primary to start streaming again. But it's not like it would have been all *that* far behind the primary, since the primary was still working on D1, and once the primary did write a few things, the replica was able to start streaming again.

If this can be reproduced and then a fix put in place for it, I'm all for that, to be clear, but I don't see this as a particularly big issue or a systemic problem with the way WAL replay and replication work.

Thanks!

Stephen
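If a slot is used, the WAL pile-up risk Stephen mentions can be watched from the master using the 9.6 catalog view and admin functions; a small illustrative query, not from the thread:

-- How much WAL is the master retaining on behalf of each slot?
db=> SELECT slot_name, active,
            pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(),
                                                 restart_lsn)) AS retained_wal
     FROM pg_replication_slots;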
Stephen Frost wrote:
> * Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
>> Stephen Frost wrote:
>>> * Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
>>>> 2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
>>>> 2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
>>>> 2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
>>
>> Are the last two log lines above telling us anything useful? Is that saying that, of the 16 MB (0x1000000 byte) WAL file it restored, it only got as far as byte 0xF956C0 or 0xFFF088? Is that what we would expect? Is it trying to use the streaming connection to get the missing bytes from 0xFFF088 to 0xFFFFFF, the end of the file? Is that just an empty gap at the end of the file due to the next record being too big to fit?
>
> The short answer is that, yes, the next record was likely too big to fit, but that's what the replica was trying to figure out and couldn't, because D0 was gone from the primary already. One thing which should help this would be to use physical replication slots on the primary, which would keep it from throwing away WAL files until it knows the replica has them, but that runs the risk of ending up with lots of extra WAL on the primary if the replica is gone for a while. If you'd prefer to avoid that then having wal_keep_segments set to '1' would avoid this particular issue as well, I'd expect.
>
> I do wonder if perhaps we should just default to having it as '1' to avoid exactly this case, as it seems like perhaps PG archived D0 and then flipped to D1 and got rid of D0, which is all pretty reasonable, except that a replica trying to catch up is going to end up asking for D0 from the primary because it didn't know if there was anything else that should have been in D0..

OK. I think this is perhaps a documentation bug, maybe a missing warning when the master reads its configuration, and maybe (as you say) a bad default value.

Specifically, section 26.2.5 of the docs says:

"If you use streaming replication without file-based continuous archiving, the server might recycle old WAL segments before the standby has received them. If this occurs, the standby will need to be reinitialized from a new base backup. You can avoid this by setting wal_keep_segments to a value large enough to ensure that WAL segments are not recycled too early, or by configuring a replication slot for the standby. If you set up a WAL archive that's accessible from the standby, these solutions are not required, since the standby can always use the archive to catch up provided it retains enough segments."

OR, maybe the WAL reader that processes the files that restore_command fetches could be smart enough to realise that it can skip over the gap at the end?

Anyway. Do others agree that my issue was the result of wal_keep_segments=0?

Regards, Phil.
Greetings,

* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
> OK. I think this is perhaps a documentation bug, maybe a missing warning when the master reads its configuration, and maybe (as you say) a bad default value.

If we consider it to be an issue worthy of a change then we should probably just change the default value, and maybe not even allow it to be set lower than '1'.

> Specifically, section 26.2.5 of the docs says:
>
> "If you use streaming replication without file-based continuous archiving, the server might recycle old WAL segments before the standby has received them. If this occurs, the standby will need to be reinitialized from a new base backup. You can avoid this by setting wal_keep_segments to a value large enough to ensure that WAL segments are not recycled too early, or by configuring a replication slot for the standby. If you set up a WAL archive that's accessible from the standby, these solutions are not required, since the standby can always use the archive to catch up provided it retains enough segments."
>
> OR, maybe the WAL reader that processes the files that restore_command fetches could be smart enough to realise that it can skip over the gap at the end?

That strikes me as a whole lot more complication in something we'd rather not introduce additional complications into without very good reason. Then again, there was just a nearby discussion about how it'd be nice if the backend could realize when a WAL file is complete, and I do think that'll be more of an issue when users start configuring larger WAL files, so perhaps we should figure out a way to handle this..

> Anyway. Do others agree that my issue was the result of wal_keep_segments=0?

Yeah, I've not spent much time actually looking at code around this, so it'd be nice to get:

a) a reproducible case demonstrating it's happening
b) testing to see how long it's been this way
c) if setting wal_keep_segments=1 fixes it
d) perhaps some thought around if we could address this some other way

Thanks!

Stephen
On 08/12/2018 02:56 PM, Phil Endecott wrote:
> Stephen Frost wrote:
>
> Specifically, section 26.2.5 of the docs says:
>
> "If you use streaming replication without file-based continuous archiving, the server might recycle old WAL segments before the standby has received them. If this occurs, the standby will need to be reinitialized from a new base backup. You can avoid this by setting wal_keep_segments to a value large enough to ensure that WAL segments are not recycled too early, or by configuring a replication slot for the standby. If you set up a WAL archive that's accessible from the standby, these solutions are not required, since the standby can always use the archive to catch up provided it retains enough segments."
>
> OR, maybe the WAL reader that processes the files that restore_command fetches could be smart enough to realise that it can skip over the gap at the end?
>
> Anyway. Do others agree that my issue was the result of wal_keep_segments=0?

Only as a sub-issue of the slave losing contact with the master. The basic problem is maintaining two separate operations, archiving and streaming, in sync. If either one, or some combination of the two, loses synchronization, then it is anyone's guess what is appropriate for wal_keep_segments.

-- 
Adrian Klaver
adrian.klaver@aklaver.com
On 08/12/2018 03:54 PM, Stephen Frost wrote:
> Greetings,
>
> * Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
>> OK. I think this is perhaps a documentation bug, maybe a missing warning when the master reads its configuration, and maybe (as you say) a bad default value.
>
> If we consider it to be an issue worthy of a change then we should probably just change the default value, and maybe not even allow it to be set lower than '1'.

I would say leave the default at 0, as it leaves no doubt that you are performing without a net. A setting of '1' implies you are covered, and for a fast-moving cluster, or a slow-moving one with sufficient downtime, that would not be the case. Better to let the end user know this is not a simple problem and that some thought needs to go into configuration.

-- 
Adrian Klaver
adrian.klaver@aklaver.com
Greetings,

* Adrian Klaver (adrian.klaver@aklaver.com) wrote:
> On 08/12/2018 03:54 PM, Stephen Frost wrote:
>> * Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
>>> OK. I think this is perhaps a documentation bug, maybe a missing
>>> warning when the master reads its configuration, and maybe (as you say)
>>> a bad default value.
>>
>> If we consider it to be an issue worthy of a change then we should
>> probably just change the default value, and maybe not even allow it to
>> be set lower than '1'.
>
> I would say leave the default at 0, as it leaves no doubt that you are
> performing without a net. A setting of '1' implies you are covered, and
> for a fast-moving cluster, or a slow-moving one with sufficient
> downtime, that would not be the case. Better to let the end user know
> this is not a simple problem and some thought needs to go into
> configuration.

Uh, this specific case is one where there *is* a 'safety net' though:
archive_command and restore_command were configured and being used, so
I don't buy this argument at all. Maybe we just internally bump
wal_keep_segments to '1' to avoid this specific risk without actually
changing the default or making people change their existing
configurations, but if this is really what's happening then I don't
think the answer is "don't do anything."

Thanks!

Stephen
Adrian Klaver wrote:
> On 08/12/2018 02:56 PM, Phil Endecott wrote:
>> Anyway. Do others agree that my issue was the result of
>> wal_keep_segments=0 ?
>
> Only as a sub-issue of the slave losing contact with the master. The
> basic problem is maintaining two separate operations, archiving and
> streaming, in sync. If either or some combination of both lose
> synchronization then it is anyone's guess on what is appropriate for
> wal_keep_segments.

Really? I thought the intention was that the system should be able to
recover reliably when the slave reconnects after a period of downtime,
subject only to there being sufficient network/CPU/disk bandwidth etc.
for it to eventually catch up.

If that's not true, I think the docs need an even more extensive
overhaul! Suggestion for the paragraph that I quoted before from
26.2.5:

"If you set up a WAL archive that's accessible from the standby, it's
anyone's guess what is appropriate for wal_keep_segments."

Regards, Phil.
Adrian Klaver wrote:
> On 08/12/2018 03:54 PM, Stephen Frost wrote:
>> If we consider it to be an issue worthy of a change then we should
>> probably just change the default value, and maybe not even allow it to
>> be set lower than '1'.
>
> I would say leave the default at 0, as it leaves no doubt that you are
> performing without a net. A setting of '1' implies you are covered, and
> for a fast-moving cluster, or a slow-moving one with sufficient
> downtime, that would not be the case.

Can you explain how it can fail in the case of a "slow moving cluster
with sufficient downtime"? It seems to me that if I have correctly
understood what happened in this case then 0, the default, really
cannot ever work properly when you have enabled WAL archiving plus
streaming.

> Better to let the end user know this is not a simple problem and some
> thought needs to go into configuration.

I certainly agree that this is "not a simple problem", having read
something like 18,000 words of documentation multiple times and,
apparently, still got it wrong in multiple ways.

Regards, Phil.
Greetings,

* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
> Adrian Klaver wrote:
>> I would say leave the default at 0, as it leaves no doubt that you are
>> performing without a net. A setting of '1' implies you are covered, and
>> for a fast-moving cluster, or a slow-moving one with sufficient
>> downtime, that would not be the case.
>
> Can you explain how it can fail in the case of a "slow moving cluster
> with sufficient downtime"?

I'm guessing 'sufficient downtime' here is, basically, 'offline until
the next checkpoint', which isn't actually all *that* much time.

> It seems to me that if I have correctly understood what happened in
> this case then 0, the default, really cannot ever work properly when
> you have enabled WAL archiving plus streaming.

Well, it's not like it'd work without WAL archiving either, though
that's perhaps more obviously true.

Thanks!

Stephen
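[For reference, the settings that govern how soon the master recycles
WAL after a checkpoint can be inspected as below. A sketch: these GUCs
do exist in 9.6, but the values on any given install may differ from
the defaults.]

$ psql -c "SELECT name, setting, unit FROM pg_settings
            WHERE name IN ('checkpoint_timeout', 'min_wal_size',
                           'max_wal_size', 'wal_keep_segments');"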
Greetings,

* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
> Adrian Klaver wrote:
>> On 08/12/2018 02:56 PM, Phil Endecott wrote:
>>> Anyway. Do others agree that my issue was the result of
>>> wal_keep_segments=0 ?
>>
>> Only as a sub-issue of the slave losing contact with the master. The
>> basic problem is maintaining two separate operations, archiving and
>> streaming, in sync. If either or some combination of both lose
>> synchronization then it is anyone's guess on what is appropriate for
>> wal_keep_segments.

Uh, no, having an archive_command and a restore_command configured
should remove exactly that need to worry about what wal_keep_segments
is set to, because anything not on the primary really should be
available through what's been archived, and PG shouldn't have any
trouble figuring that out and working with it.

If all you've got is streaming replication then, sure, you have no idea
what to set wal_keep_segments to because the replica could be offline
for an indeterminate amount of time, but as long as you're keeping
track of all the WAL through archive_command, that shouldn't be an
issue.

> Really? I thought the intention was that the system should be able to
> recover reliably when the slave reconnects after a period of downtime,
> subject only to there being sufficient network/CPU/disk bandwidth etc.
> for it to eventually catch up.

Yes, that's correct, the replica should always be able to catch back up
presuming there's no gaps in the WAL between when the replica failed
and where the primary is at.

Thanks!

Stephen
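[One way to check the "no gaps in the WAL" precondition is to scan the
archive for holes in the segment sequence. A sketch only, assuming
16MB segments on a single timeline and GNU awk (for strtonum); the
paths match the archive_command shown earlier in the thread.]

$ ssh backup 'ls backup/postgresql/archivedir' | sort | gawk '
    /^[0-9A-F]{24}$/ {
      # segment number = (middle 8 hex digits) * 256 + (last 8 hex digits)
      n = strtonum("0x" substr($0, 9, 8)) * 256 \
        + strtonum("0x" substr($0, 17, 8))
      if (prev && n != prev + 1) print "gap before", $0
      prev = n
    }'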
On 08/13/2018 05:39 AM, Stephen Frost wrote:
> Greetings,
>
> * Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
>> Adrian Klaver wrote:
>>> Only as a sub-issue of the slave losing contact with the master. The
>>> basic problem is maintaining two separate operations, archiving and
>>> streaming, in sync. If either or some combination of both lose
>>> synchronization then it is anyone's guess on what is appropriate for
>>> wal_keep_segments.
>
> Uh, no, having an archive_command and a restore_command configured
> should remove exactly that need to worry about what wal_keep_segments
> is set to, because anything not on the primary really should be
> available through what's been archived, and PG shouldn't have any
> trouble figuring that out and working with it.
>
> If all you've got is streaming replication then, sure, you have no idea
> what to set wal_keep_segments to because the replica could be offline
> for an indeterminate amount of time, but as long as you're keeping
> track of all the WAL through archive_command, that shouldn't be an
> issue.

Therein lies the rub. As I stated previously, the bigger issue is
syncing two different operations, archiving and streaming. The OP got
caught short assuming the archiving would handle the situation where
the streaming was down for a period. In his particular setup and for
this particular situation a wal_keep_segments of 1 would have helped. I
do not see this as a default value though, as it depends on too many
variables outside the reach of the database, most notably the success
of the archive command. First, is the command even valid? Two, is the
network link reliable? Three, is there even a network link, or more
than one? Four, is the restore command valid? That is just off the top
of my head; more caffeine and I could come up with more. Saying that
having archiving, streaming and wal_keep_segments=1 has you covered is
misleading. I don't see it as detrimental to performance, but I do see
more posts down the road from folks who are surprised when it does not
cover their case. Personally I think it better to be up front that this
requires more thought, or a third-party solution that has done the
thinking.

>> Really? I thought the intention was that the system should be able to
>> recover reliably when the slave reconnects after a period of downtime,
>> subject only to there being sufficient network/CPU/disk bandwidth etc.
>> for it to eventually catch up.
>
> Yes, that's correct, the replica should always be able to catch back up
> presuming there's no gaps in the WAL between when the replica failed
> and where the primary is at.

--
Adrian Klaver
adrian.klaver@aklaver.com
Greetings,

* Adrian Klaver (adrian.klaver@aklaver.com) wrote:
> Therein lies the rub. As I stated previously, the bigger issue is
> syncing two different operations, archiving and streaming.

That's not correct though; there isn't a big issue regarding syncing of
those two operations.

> The OP got caught short assuming the archiving would handle the
> situation where the streaming was down for a period. In his particular
> setup and for this particular situation a wal_keep_segments of 1 would
> have helped. I do not see this as a default value though, as it depends
> on too many variables outside the reach of the database, most notably
> the success of the archive command.

What's been pointed out here is that even if everything is working
(archive_command, restore_command, etc) there's a possible gap in the
transition from replay-from-archive to streaming replication where the
replica might not be able to start streaming. That's an entirely
*PostgreSQL* issue, as far as I see it, and hasn't got anything to do
with his particular setup, except that he managed to expose the issue.

> First, is the command even valid? Two, is the network link reliable?
> Three, is there even a network link, or more than one? Four, is the
> restore command valid? That is just off the top of my head; more
> caffeine and I could come up with more. Saying that having archiving,
> streaming and wal_keep_segments=1 has you covered is misleading. I
> don't see it as detrimental to performance, but I do see more posts
> down the road from folks who are surprised when it does not cover
> their case. Personally I think it better to be up front that this
> requires more thought, or a third-party solution that has done the
> thinking.

This is all down to "and you should also monitor to make sure things
continue working", which I certainly agree with, but that doesn't mean
we shouldn't fix this issue. This entire side-discussion feels like
it's really off in the weeds.
The next steps which I outlined a while ago seem to still be entirely
appropriate, and we should figure out a way to solve this issue so
that, when everything else is working (archive command, restore
command, replica able to connect to the primary, etc), PG behaves
sanely and is able to catch up with and connect to the primary and
resume streaming.

Thanks!

Stephen
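[On the monitoring point both sides agree on, a couple of hypothetical
checks. The view and function used below do exist in 9.6 (the location
columns were renamed to *_lsn in 10); thresholds and alerting are left
to the reader.]

$ psql -c "SELECT client_addr, state, sent_location, replay_location
             FROM pg_stat_replication;"       # on the master: is the standby
                                              # connected, and how far behind?
$ psql -c "SELECT now() - pg_last_xact_replay_timestamp()
             AS replay_lag;"                  # on the standby: time since the
                                              # last replayed transaction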
On 08/13/2018 05:08 AM, Phil Endecott wrote:
> Adrian Klaver wrote:
>> Only as a sub-issue of the slave losing contact with the master. The
>> basic problem is maintaining two separate operations, archiving and
>> streaming, in sync. If either or some combination of both lose
>> synchronization then it is anyone's guess on what is appropriate for
>> wal_keep_segments.
>
> Really? I thought the intention was that the system should be able to
> recover reliably when the slave reconnects after a period of downtime,
> subject only to there being sufficient network/CPU/disk bandwidth etc.
> for it to eventually catch up.

See also my reply to Stephen earlier. Basically you are trying to
coordinate two different operations. They start from the same source,
pg_xlog (pg_wal in 10+), but arrive on a different time scale and from
different locations. Without sufficient sanity checks it is possible
they diverge enough on one or both paths to render the process
unstable.

> If that's not true, I think the docs need an even more extensive
> overhaul! Suggestion for the paragraph that I quoted before from
> 26.2.5:
>
> "If you set up a WAL archive that's accessible from the standby, it's
> anyone's guess what is appropriate for wal_keep_segments."

I would say that:

"If you set up a WAL archive that's accessible from the standby, these
solutions are not required, since the standby can always use the
archive to catch up provided it retains enough segments."

should be more like:

"If you set up a WAL archive that's accessible from the standby, these
solutions are not required, since the standby can always use the
archive to catch up provided it retains enough segments. *This is
dependent on verification that the archiving is working properly. A
belt and suspenders approach would be to set wal_keep_segments to a
value > 0 in the event archiving is not properly functioning.*"

> Regards, Phil.

--
Adrian Klaver
adrian.klaver@aklaver.com
Greetings,

* Adrian Klaver (adrian.klaver@aklaver.com) wrote:
> On 08/13/2018 05:08 AM, Phil Endecott wrote:
>> Really? I thought the intention was that the system should be able to
>> recover reliably when the slave reconnects after a period of downtime,
>> subject only to there being sufficient network/CPU/disk bandwidth etc.
>> for it to eventually catch up.

That's correct.

> See also my reply to Stephen earlier. Basically you are trying to
> coordinate two different operations. They start from the same source,
> pg_xlog (pg_wal in 10+), but arrive on a different time scale and from
> different locations. Without sufficient sanity checks it is possible
> they diverge enough on one or both paths to render the process
> unstable.

This isn't what's happening. We're not talking about a timeline change
here, or a replica being promoted to be a primary in general. There's
no diverging happening: it's the same consistent WAL stream, just
coming from two different sources, which PG is specifically designed to
handle and should be handling seamlessly.

> I would say that:
>
> "If you set up a WAL archive that's accessible from the standby, these
> solutions are not required, since the standby can always use the
> archive to catch up provided it retains enough segments."
>
> should be more like:
>
> "If you set up a WAL archive that's accessible from the standby, these
> solutions are not required, since the standby can always use the
> archive to catch up provided it retains enough segments. *This is
> dependent on verification that the archiving is working properly. A
> belt and suspenders approach would be to set wal_keep_segments to a
> value > 0 in the event archiving is not properly functioning.*"

I don't think I could disagree more with this additional wording, and I
*really* don't think we should be encouraging people to set a high
wal_keep_segments. The specific case here looks like it just needs to
be set to, exactly, '1', to ensure that the primary hasn't removed the
last WAL file that it archived.

Thanks!

Stephen
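[Concretely, that would be the following line in postgresql.conf on the
master, in the style of the configuration at the top of the thread — a
sketch of the setting under discussion, not a settled recommendation.]

wal_keep_segments = 1    # keep the most recently archived segment so the
                         # standby can re-fetch it when switching from
                         # restore_command to streaming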
Adrian Klaver wrote:
> "If you set up a WAL archive that's accessible from the standby, these
> solutions are not required, since the standby can always use the
> archive to catch up provided it retains enough segments. *This is
> dependent on verification that the archiving is working properly. A
> belt and suspenders approach would be to set wal_keep_segments to a
> value > 0 in the event archiving is not properly functioning.*"

Adrian, I believe the suggestion that my issue was the result of my
archiving process not working is incorrect. The quote above does not
address the requirement for wal_keep_segments to be >= 1 even when
archiving is functioning correctly.

I will continue to monitor this thread in the hope that others will
confirm my understanding, but otherwise I will bow out now (and file a
bug).

Thanks to all.

Phil.