Re: During promotion, new master tries to archive same segmenttwice

Поиск
Список
Период
Сортировка
От Phil Endecott
Тема Re: During promotion, new master tries to archive same segmenttwice
Дата
Msg-id 1534409295072@dmwebmail.dmwebmail.chezphil.org
обсуждение исходный текст
Ответ на Re: During promotion, new master tries to archive same segment twice  (Adrian Klaver <adrian.klaver@aklaver.com>)
Ответы Re: During promotion, new master tries to archive same segment twice  (Adrian Klaver <adrian.klaver@aklaver.com>)
Список pgsql-general
Adrian Klaver wrote:
> On 08/15/2018 01:25 PM, Phil Endecott wrote:
>> Dear Experts,
>> 
>> Here is my latest issue with replication:
>> 
>> I have 3 systems, X, Y and Z.  Initially X is replicated to Y
>> and Y is replicated to Z; in each case the replication involves
>> log-shipping using archive_command and restore_command (via a
>> 4th system called "backup") and then streaming.
>> 
>> The plan is to shut down X and to promote Y to be the new master.
>> 
>> I shut down X, and as expected Y complains that it con no longer
>> connect to it:
>> 
>> 2018-08-15 15:10:58.785 UTC [617] LOG:  replication terminated by primary server
>> 2018-08-15 15:10:58.785 UTC [617] DETAIL:  End of WAL reached on timeline 1 at 7/E9000098.
>> 2018-08-15 15:10:58.785 UTC [617] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
>> scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory
>> 2018-08-15 15:10:59.928 UTC [354] LOG:  invalid resource manager ID 22 at 7/E9000098
>> 2018-08-15 15:11:00.092 UTC [5856] FATAL:  could not connect to the primary server: FATAL:  the database system is
shuttingdown
 
>> 
>> Note that in the middle of that, after streaming has failed it
>> tries to fetch the next WAL segment, E9, from the backup; this
>> fails because this segment doesn't exist.
>> 
>> Then I pg_ctl promote on Y:
>> 
>> 2018-08-15 15:12:11.888 UTC [354] LOG:  received promote request
>> 2018-08-15 15:12:11.888 UTC [354] LOG:  redo done at 7/E9000028
>> 2018-08-15 15:12:11.888 UTC [354] LOG:  last completed transaction was at log time 2018-08-15 14:45:05.961153+00
>> scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory
>> scp: backup/postgresql/archivedir/00000002.history: No such file or directory
>> 2018-08-15 15:12:13.316 UTC [354] LOG:  selected new timeline ID: 2
>> 2018-08-15 15:12:13.368 UTC [354] FATAL:  could not open file "recovery.conf": Permission denied
>> 
>> So in the process of exiting recovery mode it tries to rename
>> recovery.conf to recovery.done, but it fails to do so because
>> my recovery.conf was owned by root, not user postgres.  Oooops.
>> Perhaps it would be appropriate to check for that when it
>> initially reads recovery.conf?  Anyway, I fix the permissions
>> and restart it.  It starts in recovery mode and cannot connect
>> to the master, as expected.
>> 
>> 2018-08-15 15:16:58.061 UTC [6036] LOG:  restored log file "0000000100000007000000E8" from archive
>> 2018-08-15 15:16:58.108 UTC [6036] LOG:  redo starts at 7/E8678300
>> 2018-08-15 15:16:58.731 UTC [6087] LOG:  archive command failed with exit code 1
>> 2018-08-15 15:16:58.731 UTC [6087] DETAIL:  The failed archive command was: ssh backup test ! -f
backup/postgresql/Y/0000000100000007000000E8&& scp pg_xlog/0000000100000007000000E8
backup:backup/postgresql/Y/0000000100000007000000E8
>> scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory
>
> The above is not clear to me. My best guess:
>
> 1) The ssh backup ... is archiving WAL's from Y to Y specific archive on 
> 'backup'?

Yes, there are two archive directories on "backup", one called
"archivedir" which is for the X-to-Y replication and a second
called "Y" which is for the Y-to-Z replication.

> 2) The scp: backup ... is the restore_command fetching archived WAL's 
> that originated on X and where sent to backup/postgresql/archivedir/ ?

Yes.

> 3) If 2) is true why is it showing up in the error for the archive command?

It's not part of the error for the archive command; it's just the
next thing in the log file.  Y is (1) trying write WAL segment
E8 to the Y-to-Z archive and failing because it is already there and
(2) trying to read WAL segment E9 from the X-to-Y archive and failing
because it doesn't exist (and never will).  (1) is the problem, (2) is
expected.

> Can you show us your actual archive and restore commands?

On Y:

restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
archive_command = 'ssh backup test ! -f backup/postgresql/Y/%f && scp 
%p backup:backup/postgresql/Y/%f'

>> 2018-08-15 15:16:58.823 UTC [6036] LOG:  consistent recovery state reached at 7/E9000098
>> 2018-08-15 15:16:58.823 UTC [6036] LOG:  invalid resource manager ID 22 at 7/E9000098
>> 2018-08-15 15:16:58.824 UTC [6035] LOG:  database system is ready to accept read only connections
>> 2018-08-15 15:16:58.855 UTC [6094] FATAL:  could not connect to the primary server: could not connect to server:
Connectionrefused
 
>> 
>
>> 
>> I can now connect to the db and do read/write operations, but it
>> continues to try and fail to re-write segment E8 to its archive.
>> 
>> Eventually I manually delete E8 from the archive; it then
>> succeeds at writing it with the same content as before and seems to
>> continue to function normally.
>
> Which archive?

The Y-to-Z archive, backup:backup/postgresql/Y/...E8.

>> Questions:
>> 
>> - Can all of this be blamed on the failure of the first attempt
>> to promote due to the wrong permissions on recovery.conf?
>> 
>> - Should my archive_command detect the case where it is asked to
>> write the same file again with the same contents, and report
>> success in that case?
>> 
>> - Is this a bug?


Regards, Phil.






В списке pgsql-general по дате отправления:

Предыдущее
От: "Phil Endecott"
Дата:
Сообщение: Re: During promotion, new master tries to archive same segmenttwice
Следующее
От: Alex Kliukin
Дата:
Сообщение: regex match and special characters