Thread: During promotion, new master tries to archive same segment twice


During promotion, new master tries to archive same segment twice

From:
"Phil Endecott"
Date:
Dear Experts,

Here is my latest issue with replication:

I have 3 systems, X, Y and Z.  Initially X is replicated to Y 
and Y is replicated to Z; in each case the replication involves 
log-shipping using archive_command and restore_command (via a 
4th system called "backup") and then streaming.

The plan is to shut down X and to promote Y to be the new master.

I shut down X, and as expected Y complains that it can no longer 
connect to it:

2018-08-15 15:10:58.785 UTC [617] LOG:  replication terminated by primary server
2018-08-15 15:10:58.785 UTC [617] DETAIL:  End of WAL reached on timeline 1 at 7/E9000098.
2018-08-15 15:10:58.785 UTC [617] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory
2018-08-15 15:10:59.928 UTC [354] LOG:  invalid resource manager ID 22 at 7/E9000098
2018-08-15 15:11:00.092 UTC [5856] FATAL:  could not connect to the primary server: FATAL:  the database system is shutting down

Note that in the middle of that, after streaming has failed it 
tries to fetch the next WAL segment, E9, from the backup; this 
fails because this segment doesn't exist.

Then I pg_ctl promote on Y:

2018-08-15 15:12:11.888 UTC [354] LOG:  received promote request
2018-08-15 15:12:11.888 UTC [354] LOG:  redo done at 7/E9000028
2018-08-15 15:12:11.888 UTC [354] LOG:  last completed transaction was at log time 2018-08-15 14:45:05.961153+00
scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory
scp: backup/postgresql/archivedir/00000002.history: No such file or directory
2018-08-15 15:12:13.316 UTC [354] LOG:  selected new timeline ID: 2
2018-08-15 15:12:13.368 UTC [354] FATAL:  could not open file "recovery.conf": Permission denied

So in the process of exiting recovery mode it tries to rename 
recovery.conf to recovery.done, but it fails to do so because 
my recovery.conf was owned by root, not user postgres.  Oooops. 
Perhaps it would be appropriate to check for that when it 
initially reads recovery.conf?  Anyway, I fix the permissions 
and restart it.  It starts in recovery mode and cannot connect 
to the master, as expected.

2018-08-15 15:16:58.061 UTC [6036] LOG:  restored log file "0000000100000007000000E8" from archive
2018-08-15 15:16:58.108 UTC [6036] LOG:  redo starts at 7/E8678300
2018-08-15 15:16:58.731 UTC [6087] LOG:  archive command failed with exit code 1
2018-08-15 15:16:58.731 UTC [6087] DETAIL:  The failed archive command was: ssh backup test ! -f backup/postgresql/Y/0000000100000007000000E8 && scp pg_xlog/0000000100000007000000E8 backup:backup/postgresql/Y/0000000100000007000000E8
scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory
2018-08-15 15:16:58.823 UTC [6036] LOG:  consistent recovery state reached at 7/E9000098
2018-08-15 15:16:58.823 UTC [6036] LOG:  invalid resource manager ID 22 at 7/E9000098
2018-08-15 15:16:58.824 UTC [6035] LOG:  database system is ready to accept read only connections
2018-08-15 15:16:58.855 UTC [6094] FATAL:  could not connect to the primary server: could not connect to server: Connection refused

But note that in the middle of that log fragment it has tried to 
write WAL segment E8 to its (output) WAL archive, which is used to 
replicate to the downstream system Z, and it has found that that 
segment already exists there.  The file that's already there has 
the same checksum as the E8 file in the pg_xlog directory.  
Somehow it has forgotten that it has already archived that segment. 
It keeps retrying this archive command.

I try again to promote it:

2018-08-15 15:17:14.424 UTC [6036] LOG:  received promote request
2018-08-15 15:17:14.424 UTC [6036] LOG:  redo done at 7/E9000028
scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory
scp: backup/postgresql/archivedir/00000002.history: No such file or directory
2018-08-15 15:17:15.832 UTC [6036] LOG:  selected new timeline ID: 2
2018-08-15 15:17:15.884 UTC [6036] LOG:  archive recovery complete
scp: backup/postgresql/archivedir/00000001.history: No such file or directory
2018-08-15 15:17:16.612 UTC [6036] LOG:  MultiXact member wraparound protections are now enabled
2018-08-15 15:17:16.614 UTC [6035] LOG:  database system is ready to accept connections

I can now connect to the db and do read/write operations, but it 
continues to try and fail to re-write segment E8 to its archive.

Eventually I manually delete E8 from the archive; it then 
succeeds at writing it with the same content as before and seems to 
continue to function normally.

Questions:

- Can all of this be blamed on the failure of the first attempt 
to promote due to the wrong permissions on recovery.conf?

- Should my archive_command detect the case where it is asked to 
write the same file again with the same contents, and report 
success in that case?

- Is this a bug?


Thanks,

Phil.




Re: During promotion, new master tries to archive same segment twice

From:
David Steele
Date:
Hi Phil,

On 8/15/18 4:25 PM, Phil Endecott wrote:
> 
> Questions:
> 
> - Can all of this be blamed on the failure of the first attempt 
> to promote due to the wrong permissions on recovery.conf?

Yes, it looks that way.

> - Should my archive_command detect the case where it is asked to 
> write the same file again with the same contents, and report 
> success in that case?

Yes.  pgBackRest has done this for years and it saves a *lot* of headaches.

> - Is this a bug?

I don't think so.  There are a number of cases where the same WAL
segment can be pushed more than once, especially after failures where
Postgres is not sure that the command completed.  The archive command
should handle this gracefully.
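For a local archive directory, the graceful behaviour David describes can be sketched roughly as follows. This is a minimal illustration, not pgBackRest's implementation: the function name, the `ARCHIVE_DIR` variable, and the fsync-before-rename step are all assumptions, and the `sync FILE` form requires GNU coreutils.

```shell
# Hypothetical idempotent archive sketch, to be invoked from
# archive_command as e.g.: 'archive_wal %p %f'.
# Succeeds if the segment is new, or already archived with identical
# contents; fails only when a *different* file occupies that name.
archive_wal() {
    wal_path=$1                        # %p: path to the WAL segment
    wal_name=$2                        # %f: file name only
    dest="$ARCHIVE_DIR/$wal_name"      # ARCHIVE_DIR assumed to exist

    if [ -f "$dest" ]; then
        # Duplicate push: report success only if byte-for-byte identical.
        cmp -s "$wal_path" "$dest"
        return $?
    fi
    # Copy to a temporary name, flush to disk (GNU coreutils sync),
    # then rename atomically into place.
    cp "$wal_path" "$dest.tmp" &&
        sync "$dest.tmp" &&
        mv "$dest.tmp" "$dest"
}
```

The key point is that "file already exists" alone is not an error; only "file exists with different contents" is.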

Regards,
-- 
-David
david@pgmasters.net


Re: During promotion, new master tries to archive same segment twice

From:
Adrian Klaver
Date:
On 08/15/2018 01:25 PM, Phil Endecott wrote:
> Dear Experts,
> 
> Here is my latest issue with replication:
> 
> I have 3 systems, X, Y and Z.  Initially X is replicated to Y
> and Y is replicated to Z; in each case the replication involves
> log-shipping using archive_command and restore_command (via a
> 4th system called "backup") and then streaming.
> 
> The plan is to shut down X and to promote Y to be the new master.
> 
> I shut down X, and as expected Y complains that it can no longer
> connect to it:
> 
> 2018-08-15 15:10:58.785 UTC [617] LOG:  replication terminated by primary server
> 2018-08-15 15:10:58.785 UTC [617] DETAIL:  End of WAL reached on timeline 1 at 7/E9000098.
> 2018-08-15 15:10:58.785 UTC [617] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
> scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory
> 2018-08-15 15:10:59.928 UTC [354] LOG:  invalid resource manager ID 22 at 7/E9000098
> 2018-08-15 15:11:00.092 UTC [5856] FATAL:  could not connect to the primary server: FATAL:  the database system is shutting down
> 
> Note that in the middle of that, after streaming has failed it
> tries to fetch the next WAL segment, E9, from the backup; this
> fails because this segment doesn't exist.
> 
> Then I pg_ctl promote on Y:
> 
> 2018-08-15 15:12:11.888 UTC [354] LOG:  received promote request
> 2018-08-15 15:12:11.888 UTC [354] LOG:  redo done at 7/E9000028
> 2018-08-15 15:12:11.888 UTC [354] LOG:  last completed transaction was at log time 2018-08-15 14:45:05.961153+00
> scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory
> scp: backup/postgresql/archivedir/00000002.history: No such file or directory
> 2018-08-15 15:12:13.316 UTC [354] LOG:  selected new timeline ID: 2
> 2018-08-15 15:12:13.368 UTC [354] FATAL:  could not open file "recovery.conf": Permission denied
> 
> So in the process of exiting recovery mode it tries to rename
> recovery.conf to recovery.done, but it fails to do so because
> my recovery.conf was owned by root, not user postgres.  Oooops.
> Perhaps it would be appropriate to check for that when it
> initially reads recovery.conf?  Anyway, I fix the permissions
> and restart it.  It starts in recovery mode and cannot connect
> to the master, as expected.
> 
> 2018-08-15 15:16:58.061 UTC [6036] LOG:  restored log file "0000000100000007000000E8" from archive
> 2018-08-15 15:16:58.108 UTC [6036] LOG:  redo starts at 7/E8678300
> 2018-08-15 15:16:58.731 UTC [6087] LOG:  archive command failed with exit code 1
> 2018-08-15 15:16:58.731 UTC [6087] DETAIL:  The failed archive command was: ssh backup test ! -f backup/postgresql/Y/0000000100000007000000E8 && scp pg_xlog/0000000100000007000000E8 backup:backup/postgresql/Y/0000000100000007000000E8
> scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory

The above is not clear to me. My best guess:

1) The ssh backup ... is archiving WAL's from Y to Y specific archive on 
'backup'?

2) The scp: backup ... is the restore_command fetching archived WAL's 
that originated on X and were sent to backup/postgresql/archivedir/ ?

3) If 2) is true why is it showing up in the error for the archive command?

Can you show us your actual archive and restore commands?

More below.

> 2018-08-15 15:16:58.823 UTC [6036] LOG:  consistent recovery state reached at 7/E9000098
> 2018-08-15 15:16:58.823 UTC [6036] LOG:  invalid resource manager ID 22 at 7/E9000098
> 2018-08-15 15:16:58.824 UTC [6035] LOG:  database system is ready to accept read only connections
> 2018-08-15 15:16:58.855 UTC [6094] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
> 

> 
> I can now connect to the db and do read/write operations, but it
> continues to try and fail to re-write segment E8 to its archive.
> 
> Eventually I manually delete E8 from the archive; it then
> succeeds at writing it with the same content as before and seems to
> continue to function normally.

Which archive?


> 
> Questions:
> 
> - Can all of this be blamed on the failure of the first attempt
> to promote due to the wrong permissions on recovery.conf?
> 
> - Should my archive_command detect the case where it is asked to
> write the same file again with the same contents, and report
> success in that case?
> 
> - Is this a bug?
> 
> 
> Thanks,
> 
> Phil.
> 
> 
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: During promotion, new master tries to archive same segment twice

From:
"Phil Endecott"
Date:
David Steele wrote:
> On 8/15/18 4:25 PM, Phil Endecott wrote:
>> - Should my archive_command detect the case where it is asked to 
>> write the same file again with the same contents, and report 
>> success in that case?
>
> Yes.

> There are a number of cases where the same WAL
> segment can be pushed more than once, especially after failures where
> Postgres is not sure that the command completed.  The archive command
> should handle this gracefully.


Hmm, OK.  Here's what the current docs say:

Section 25.3.1:

"The archive command should generally be designed to refuse to
overwrite any pre-existing archive file. This is an important
safety feature to preserve the integrity of your archive in case
of administrator error (such as sending the output of two
different servers to the same archive directory).

It is advisable to test your proposed archive command to ensure
that it indeed does not overwrite an existing file, and that it
returns nonzero status in this case."

And section 26.2.9:

"When continuous WAL archiving is used in a standby, there
are two different scenarios: the WAL archive can be shared
between the primary and the standby, or the standby can
have its own WAL archive.  When the standby has its own WAL
archive, set archive_mode to always, and the standby will call
the archive command for every WAL segment it receives, whether
it's by restoring from the archive or by streaming replication.
The shared archive can be handled similarly, but the
archive_command must test if the file being archived exists
already, and if the existing file has identical contents.
This requires more care in the archive_command, as it must be
careful to not overwrite an existing file with different contents,
but return success if the exactly same file is archived twice.
And all that must be done free of race conditions, if two
servers attempt to archive the same file at the same time."


So you're saying that that's wrong, and that I must always
handle the case when the same WAL segment is written twice.

I'll file a bug against the documentation.


> pgBackRest has done this for years and it saves a *lot* of headaches.

The system to which I am sending the WAL files is a rsync.net
account.  I use it because of its reliability, but methods for
transferring files are limited largely to things like scp and
rsync.


Thanks, Phil.






Re: During promotion, new master tries to archive same segment twice

From:
"Phil Endecott"
Date:
Adrian Klaver wrote:
> On 08/15/2018 01:25 PM, Phil Endecott wrote:
>> Dear Experts,
>> 
>> Here is my latest issue with replication:
>> 
>> I have 3 systems, X, Y and Z.  Initially X is replicated to Y
>> and Y is replicated to Z; in each case the replication involves
>> log-shipping using archive_command and restore_command (via a
>> 4th system called "backup") and then streaming.
>> 
>> The plan is to shut down X and to promote Y to be the new master.
>> 
>> I shut down X, and as expected Y complains that it can no longer
>> connect to it:
>> 
>> 2018-08-15 15:10:58.785 UTC [617] LOG:  replication terminated by primary server
>> 2018-08-15 15:10:58.785 UTC [617] DETAIL:  End of WAL reached on timeline 1 at 7/E9000098.
>> 2018-08-15 15:10:58.785 UTC [617] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
>> scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory
>> 2018-08-15 15:10:59.928 UTC [354] LOG:  invalid resource manager ID 22 at 7/E9000098
>> 2018-08-15 15:11:00.092 UTC [5856] FATAL:  could not connect to the primary server: FATAL:  the database system is shutting down
>> 
>> Note that in the middle of that, after streaming has failed it
>> tries to fetch the next WAL segment, E9, from the backup; this
>> fails because this segment doesn't exist.
>> 
>> Then I pg_ctl promote on Y:
>> 
>> 2018-08-15 15:12:11.888 UTC [354] LOG:  received promote request
>> 2018-08-15 15:12:11.888 UTC [354] LOG:  redo done at 7/E9000028
>> 2018-08-15 15:12:11.888 UTC [354] LOG:  last completed transaction was at log time 2018-08-15 14:45:05.961153+00
>> scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory
>> scp: backup/postgresql/archivedir/00000002.history: No such file or directory
>> 2018-08-15 15:12:13.316 UTC [354] LOG:  selected new timeline ID: 2
>> 2018-08-15 15:12:13.368 UTC [354] FATAL:  could not open file "recovery.conf": Permission denied
>> 
>> So in the process of exiting recovery mode it tries to rename
>> recovery.conf to recovery.done, but it fails to do so because
>> my recovery.conf was owned by root, not user postgres.  Oooops.
>> Perhaps it would be appropriate to check for that when it
>> initially reads recovery.conf?  Anyway, I fix the permissions
>> and restart it.  It starts in recovery mode and cannot connect
>> to the master, as expected.
>> 
>> 2018-08-15 15:16:58.061 UTC [6036] LOG:  restored log file "0000000100000007000000E8" from archive
>> 2018-08-15 15:16:58.108 UTC [6036] LOG:  redo starts at 7/E8678300
>> 2018-08-15 15:16:58.731 UTC [6087] LOG:  archive command failed with exit code 1
>> 2018-08-15 15:16:58.731 UTC [6087] DETAIL:  The failed archive command was: ssh backup test ! -f backup/postgresql/Y/0000000100000007000000E8 && scp pg_xlog/0000000100000007000000E8 backup:backup/postgresql/Y/0000000100000007000000E8
>> scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory
>
> The above is not clear to me. My best guess:
>
> 1) The ssh backup ... is archiving WAL's from Y to Y specific archive on 
> 'backup'?

Yes, there are two archive directories on "backup", one called
"archivedir" which is for the X-to-Y replication and a second
called "Y" which is for the Y-to-Z replication.

> 2) The scp: backup ... is the restore_command fetching archived WAL's 
> that originated on X and were sent to backup/postgresql/archivedir/ ?

Yes.

> 3) If 2) is true why is it showing up in the error for the archive command?

It's not part of the error for the archive command; it's just the
next thing in the log file.  Y is (1) trying to write WAL segment
E8 to the Y-to-Z archive and failing because it is already there and
(2) trying to read WAL segment E9 from the X-to-Y archive and failing
because it doesn't exist (and never will).  (1) is the problem, (2) is
expected.

> Can you show us your actual archive and restore commands?

On Y:

restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
archive_command = 'ssh backup test ! -f backup/postgresql/Y/%f && scp %p backup:backup/postgresql/Y/%f'
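Since PostgreSQL may legitimately retry a segment it already pushed, an idempotent variant of an ssh/scp archive_command could compare checksums on a name collision instead of failing outright. This is only a sketch, not from the thread: the function name, the `REMOTE_DIR` override, and the reliance on `md5sum` on the remote side are assumptions.

```shell
# Hypothetical idempotent variant of the archive_command above, invoked
# as e.g.: 'archive_to_backup %p %f'.  On a name collision it reports
# success only when the remote copy is byte-identical, which is exactly
# the duplicate-push case PostgreSQL can produce after a failure.
archive_to_backup() {
    wal_path=$1                              # %p: path to the segment
    wal_name=$2                              # %f: file name only
    dir=${REMOTE_DIR:-backup/postgresql/Y}   # Y-to-Z archive on "backup"

    # Empty output means the remote file does not exist yet.
    remote_sum=$(ssh backup "md5sum '$dir/$wal_name'" 2>/dev/null |
                 cut -d' ' -f1)
    if [ -n "$remote_sum" ]; then
        local_sum=$(md5sum "$wal_path" | cut -d' ' -f1)
        [ "$remote_sum" = "$local_sum" ]     # identical duplicate: succeed
    else
        scp "$wal_path" "backup:$dir/$wal_name"
    fi
}
```

Note this still has the race window the 26.2.9 docs warn about if two servers archive the same name concurrently; it only addresses the single-server retry case.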

>> 2018-08-15 15:16:58.823 UTC [6036] LOG:  consistent recovery state reached at 7/E9000098
>> 2018-08-15 15:16:58.823 UTC [6036] LOG:  invalid resource manager ID 22 at 7/E9000098
>> 2018-08-15 15:16:58.824 UTC [6035] LOG:  database system is ready to accept read only connections
>> 2018-08-15 15:16:58.855 UTC [6094] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
>> 
>
>> 
>> I can now connect to the db and do read/write operations, but it
>> continues to try and fail to re-write segment E8 to its archive.
>> 
>> Eventually I manually delete E8 from the archive; it then
>> succeeds at writing it with the same content as before and seems to
>> continue to function normally.
>
> Which archive?

The Y-to-Z archive, backup:backup/postgresql/Y/...E8.

>> Questions:
>> 
>> - Can all of this be blamed on the failure of the first attempt
>> to promote due to the wrong permissions on recovery.conf?
>> 
>> - Should my archive_command detect the case where it is asked to
>> write the same file again with the same contents, and report
>> success in that case?
>> 
>> - Is this a bug?


Regards, Phil.






Re: During promotion, new master tries to archive same segment twice

From:
David Steele
Date:
On 8/16/18 4:37 AM, Phil Endecott wrote:
> David Steele wrote:
>> On 8/15/18 4:25 PM, Phil Endecott wrote:
>>> - Should my archive_command detect the case where it is asked to
>>> write the same file again with the same contents, and report success
>>> in that case?
>>
>> Yes.
> 
>> There are a number of cases where the same WAL
>> segment can be pushed more than once, especially after failures where
>> Postgres is not sure that the command completed.  The archive command
>> should handle this gracefully.
> 
> Hmm, OK.  Here's what the current docs say:
> 
> Section 25.3.1:
> 
> "The archive command should generally be designed to refuse to
> overwrite any pre-existing archive file. This is an important
> safety feature to preserve the integrity of your archive in case
> of administrator error (such as sending the output of two
> different servers to the same archive directory).
> 
> It is advisable to test your proposed archive command to ensure
> that it indeed does not overwrite an existing file, and that it
> returns nonzero status in this case."
> 
> And section 26.2.9:
> 
> "When continuous WAL archiving is used in a standby, there
> are two different scenarios: the WAL archive can be shared
> between the primary and the standby, or the standby can
> have its own WAL archive.  When the standby has its own WAL
> archive, set archive_mode to always, and the standby will call
> the archive command for every WAL segment it receives, whether
> it's by restoring from the archive or by streaming replication.
> The shared archive can be handled similarly, but the
> archive_command must test if the file being archived exists
> already, and if the existing file has identical contents.
> This requires more care in the archive_command, as it must be
> careful to not overwrite an existing file with different contents,
> but return success if the exactly same file is archived twice.
> And all that must be done free of race conditions, if two
> servers attempt to archive the same file at the same time."
> 
> So you're saying that that's wrong, and that I must always
> handle the case when the same WAL segment is written twice.

Seems like an omission in section 25.3.1 rather than a problem in 26.2.9.

Duplicate WAL is possible in *all* cases.  A trivial example is that
Postgres calls archive_command and it succeeds but an error happens
(e.g. network) right before Postgres is notified.  It will wait a bit
and try the same WAL segment again.

> I'll file a bug against the documentation.

OK.

>> pgBackRest has done this for years and it saves a *lot* of headaches.
> 
> The system to which I am sending the WAL files is a rsync.net
> account.  I use it because of its reliability, but methods for
> transferring files are limited largely to things like scp and
> rsync.

Rsync and scp are not good tools to use for backup because there is no
guarantee of durability, i.e. the file is not synced to disk before
success is returned.  rsync.net may have durability guarantees but you
should verify that with them.

Even so, crafting a safe archive_command using these tools is going to
be very tricky.

Regards,
-- 
-David
david@pgmasters.net


Re: During promotion, new master tries to archive same segment twice

From:
Stephen Frost
Date:
Greetings,

* Phil Endecott (spam_from_pgsql_lists@chezphil.org) wrote:
> David Steele wrote:
> >pgBackRest has done this for years and it saves a *lot* of headaches.
>
> The system to which I am sending the WAL files is a rsync.net
> account.  I use it because of its reliability, but methods for
> transferring files are limited largely to things like scp and
> rsync.

You might consider using a well-tested PG backup tool locally and then
simply rsync that backup over to rsync.net.  Certainly with pgbackrest,
we intend and expect people to use more routine "just copy all the
files" backup methods for backing up the repository.

Basically, you could run pgbackrest to back up to /some/local/path and then
rsync from there over to rsync.net.
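That two-step workflow can be sketched like this; the stanza name, repository path, and destination are hypothetical placeholders, and the rsync step inherits rsync's usual lack of an fsync guarantee on the remote side:

```shell
# Sketch of "back up locally with pgbackrest, then mirror offsite".
# PGBACKREST_REPO and OFFSITE_DEST are illustrative, not real settings.
offsite_backup() {
    repo=${PGBACKREST_REPO:-/var/lib/pgbackrest}       # local repository
    dest=${OFFSITE_DEST:-user@rsync.net:pgbackrest-repo}

    pgbackrest --stanza=main backup &&                 # write local backup
        rsync -a "$repo"/ "$dest"/                     # mirror repo offsite
}
```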

Having an rsync or scp 'storage' option (similar to the s3 one) for
pgbackrest might be interesting..  We discussed having the ability to
start a backup from the PG server at one point but in that discussion
we were thinking pgbackrest would also be installed on the backup
server.  This would be different from that in that the remote side would
only need to support rsync or scp.  You'd have to accept that if the
backup server dies then you lose data though, since I don't believe
there's a way to ask for an fsync() through rsync or scp, which makes it
a much less compelling feature, unless rsync.net guarantees writes
somehow..?

Thanks!

Stephen

Attachments

Re: During promotion, new master tries to archive same segment twice

From:
Adrian Klaver
Date:
On 08/16/2018 01:48 AM, Phil Endecott wrote:
> Adrian Klaver wrote:
>> On 08/15/2018 01:25 PM, Phil Endecott wrote:
>>> Dear Experts,
>>>

>> The above is not clear to me. My best guess:
>>

> It's not part of the error for the archive command; it's just the
> next thing in the log file.  Y is (1) trying to write WAL segment
> E8 to the Y-to-Z archive and failing because it is already there and
> (2) trying to read WAL segment E9 from the X-to-Y archive and failing
> because it doesn't exist (and never will).  (1) is the problem, (2) is
> expected.

Hmm, still not sure why they are both showing up in the same LOG: entry 
as the entry was for the failure of the archive_command.

What is archive_mode set to on Y?


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: During promotion, new master tries to archive same segment twice

From:
"Phil Endecott"
Date:
Adrian Klaver wrote:
> On 08/16/2018 01:48 AM, Phil Endecott wrote:
>> Adrian Klaver wrote:
>>> On 08/15/2018 01:25 PM, Phil Endecott wrote:
>>>> Dear Experts,
>>>>
>
>>> The above is not clear to me. My best guess:
>>>
>
>> It's not part of the error for the archive command; it's just the
>> next thing in the log file.  Y is (1) trying to write WAL segment
>> E8 to the Y-to-Z archive and failing because it is already there and
>> (2) trying to read WAL segment E9 from the X-to-Y archive and failing
>> because it doesn't exist (and never will).  (1) is the problem, (2) is
>> expected.
>
> Hmm, still not sure why they are both showing up in the same LOG: entry 
> as the entry was for the failure of the archive_command.

It's not.  It's just the next thing in the log file.  It seems that
stderr output from scp appears raw in the log file.

> What is archive_mode set to on Y?

archive_mode = always


Regards, Phil.