Re: Replication failure, slave requesting old segments

From Stephen Frost
Subject Re: Replication failure, slave requesting old segments
Date
Msg-id 20180812212953.GU3326@tamriel.snowman.net
In response to Re: Replication failure, slave requesting old segments  ("Phil Endecott" <spam_from_pgsql_lists@chezphil.org>)
Responses Re: Replication failure, slave requesting old segments  ("Phil Endecott" <spam_from_pgsql_lists@chezphil.org>)
List pgsql-general
Greetings,

* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
> Stephen Frost wrote:
> >* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
> >>archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
> >>                   scp %p backup:backup/postgresql/archivedir/%f'
> >
> >This is really not a sufficient or particularly intelligent
> >archive_command.  In general, I'd strongly recommend against trying to
> >roll your own archiving/backup solution for PostgreSQL.  In particular,
> >the above is quite expensive, doesn't address the case where a file is
> >only partially copied to the backup server, and doesn't ensure that the
> >file is completely written and fsync'd on the backup server meaning that
> >a failure will likely result in WAL going missing.  There's much better
> >solutions out there and which you really should be using instead of
> >trying to build your own.  In particular, my preference is pgbackrest
> >(though I'm one of the folks in its development, to be clear), but
> >there's other options such as barman or WAL-E which I believe also
> >address the concerns raised above.
>
> Hmmm, well I based it on what I read in the documentation:
>
>   https://www.postgresql.org/docs/9.6/static/continuous-archiving.html
>
> The expense is not an issue in this case.  I would be surprised if
> scp were to exit(0) after only partially copying a file but perhaps
> you know something I don't about that.

The scp might not, but so what?  If a partial file remains on the far
side, then a subsequent invocation of that archive_command will just see
that the file is already there and fail, so that segment never gets
archived correctly.  Further, the file isn't actually going to be
sync'd to the filesystem on the far side, so even if the scp returns
success, a failure on the backup server could result in loss of the
WAL file.

> I know that rsync creates a temporary file and moves it into place when
> all the data has been transferred, which scp doesn't do; perhaps that
> would be safer.

No, what would be better is using a tool which actually fsync()s the
file on the backup server after writing it out there, making sure it
has reached the filesystem, then performs an atomic mv to put it in
place, and then fsync()s the directory to ensure that rename has also
been sync'd to disk.

Even better is to calculate a hash of the file and to reverify that on
restore, to make sure that it doesn't end up getting corrupted on the
backup server somehow.
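
For example (the segment name below is just for illustration):

    # when archiving: store a checksum next to the segment
    sha256sum 0000000100000007000000D0 > 0000000100000007000000D0.sha256

    # when restoring: refuse to hand the segment back if it doesn't verify
    sha256sum -c 0000000100000007000000D0.sha256 || exit 1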

> The master was only writing WAL files maybe 3 or 4 times per day at
> the time, so any scenario that requires its crash to have occurred
> exactly as it was doing an scp involves a substantial coincidence.

While I get that it's unlikely to be relevant to this particular case, I
wouldn't recommend depending on never having that coincidence happen.
What I've seen quite a bit of, just so you know, is that things tend to
all break down at about the same time: right when something bad is
happening on the primary, the backup server runs into issues too.  That
might be due to networking breaking down in a particular area all at
once, or shared storage having some issue, but it's happened often
enough that relying on unlikely events not to coincide is something I
have a hard time doing these days.

> I am limited in what I can run on the backup server.

You could consider backing up locally using a tool which is built for
PostgreSQL and then sync'ing the results of that backup over to the
backup server, perhaps.  Or point out that you need a backup server
where you don't have those restrictions in order to have proper
backups.  Or use an external service like S3 (or a compatible clone)
to back up to.

> >Having it go back to D0 is at least a bit interesting.  I wonder if
> >somehow it was partial..  Have you looked at the files on the archive
> >server to make sure they're all 16MB and the same size?
>
> Yes, all the files on the backup including D0 were 16 MB.  But:

Ok.  Did you see any failures on the sending side when archive_command
was being run?  PG shouldn't start trying to archive the file until it's
done with it, of course, so unless something failed right at the end of
the scp and left the file at 16MB but it wasn't actually filled..
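
For what it's worth, something along these lines on the backup server
would flag any archived segment that isn't the full size (path taken
from your archive_command):

    # list archived WAL segments which aren't exactly 16MB (16777216 bytes)
    find backup/postgresql/archivedir -type f ! -size 16777216c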

> >>2018-08-11 00:12:15.536 UTC [7954] LOG:  restored log file "0000000100000007000000D0" from archive
> >>2018-08-11 00:12:15.797 UTC [7954] LOG:  redo starts at 7/D0F956C0
> >>2018-08-11 00:12:16.068 UTC [7954] LOG:  consistent recovery state reached at 7/D0FFF088
>
> Are the last two log lines above telling us anything useful?  Is that
> saying that, of the 16 MB (0x1000000 byte) WAL file it restored only as
> far as byte 0xf956c0 or 0xfff088?  Is that what we would expect?  Is it
> trying to use the streaming connection to get the missing bytes from
> FFFF088 to FFFFFFFF?  Is that just an empty gap at the end of the file
> due to the next record being too big to fit?

The short answer is that, yes, the next record was likely too big to
fit, but that's what the replica was trying to figure out and couldn't
because D0 was gone from the primary already.  One thing which should
help this would be to use physical replication slots on the primary,
which would keep it from throwing away WAL files until it knows the
replica has them, but that runs the risk of ending up with lots of extra
WAL on the primary if the replica is gone for a while.  If you'd prefer
to avoid that then having wal_keep_segments set to '1' would avoid this
particular issue as well, I'd expect.
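
For example (the slot name here is just a placeholder):

    # on the primary: a physical slot makes it keep WAL until the replica has received it
    psql -c "SELECT pg_create_physical_replication_slot('replica1');"

    # on the replica (9.6), in recovery.conf:
    #   primary_slot_name = 'replica1'

    # or instead, retain one extra segment on the primary, in postgresql.conf:
    #   wal_keep_segments = 1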

I do wonder if perhaps we should just default wal_keep_segments to '1'
to avoid exactly this case.  It seems like PG archived D0, flipped to
D1, and got rid of D0, which is all pretty reasonable, except that a
replica trying to catch up is going to end up asking for D0 from the
primary because it didn't know if there was anything else that should
have been in D0..

> >>db=> create table waste_space as select * from large_table;
> >>
> >>That caused four new WAL segments to be copied to the backup, and
> >>restored by the slave:
>
> Note again that this was sufficient to make it start working again,
> which I think is important.

I agree that it's certainly interesting, though presumably what ends up
happening is that the replica finally gets to D1, is then able to
connect to the primary before the primary has finished with D1, and is
able to start tracking along with the primary.

> >Really hard to tell from this.  On the one hand, it seems like maybe the
> >primary was done with D0, but hadn't finished with D1 yet, but the
> >replica hadn't made that transistion from D0 -> D1, hence why it was
> >still asking for D0.
>
> Yes.  What is actually supposed to happen?  What exactly do you mean by
> "make the transition"?

Simply moving along the WAL stream to get to the D1 WAL file.  This all
more-or-less looks like what's "supposed to happen" except that the lack
of additional writes to the primary meant that things appeared to get
"stuck" here, and then got "unstuck" once you created some data.  Looks
like this might be a bit of an edge case that we aren't handling very
cleanly.

> >I'm certainly concerned that there might have
> >been some amount of partial-WAL-file copying happening, but PG is pretty
> >good at dealing with a variety of odd cases and so I tend to doubt that
> >the replica actually ended up being corrupted, but I'll again strongly
> >advocate for using a proven solution which avoids these risks by doing
> >things like copying into temporary files on the backup server and then
> >using an atomic 'mv' to make sure that anyone reading from the archive
> >wouldn't ever see a partially-written file, and, just to double-check,
> >also calculate a hash of every WAL file written into the archive and
> >re-verify that hash when reading it back out.  That's what we do in
> >pgbackrest, at least.
>
> Noted.  But I don't see any evidence that that was actually the root
> cause here, and it relies on an unlikely timing coincidence.  This is
> the very first time that I've had a replica recover after a significant
> downtime, and it failed.

Failed is really overstating things, isn't it?  The replica caught all
the way up to D0 just fine; it just couldn't get past that because the
primary had already removed D0, so it couldn't reconnect to the primary
to start streaming again.  But it's not like it would have been all
*that* far behind the primary, since the primary was still working on
D1, and once the primary did write a few things, the replica was able
to start streaming again.

If this can be reproduced and then a fix put in place for it, I'm all
for that, to be clear, but I don't see this as a particularly big issue
or a systemic problem with the way WAL replay and replication work.

Thanks!

Stephen

