Re: During promotion, new master tries to archive same segmenttwice
От | Phil Endecott |
---|---|
Тема | Re: During promotion, new master tries to archive same segmenttwice |
Дата | |
Msg-id | 1534409295072@dmwebmail.dmwebmail.chezphil.org обсуждение исходный текст |
Ответ на | Re: During promotion, new master tries to archive same segment twice (Adrian Klaver <adrian.klaver@aklaver.com>) |
Ответы |
Re: During promotion, new master tries to archive same segment twice
(Adrian Klaver <adrian.klaver@aklaver.com>)
|
Список | pgsql-general |
Adrian Klaver wrote: > On 08/15/2018 01:25 PM, Phil Endecott wrote: >> Dear Experts, >> >> Here is my latest issue with replication: >> >> I have 3 systems, X, Y and Z. Initially X is replicated to Y >> and Y is replicated to Z; in each case the replication involves >> log-shipping using archive_command and restore_command (via a >> 4th system called "backup") and then streaming. >> >> The plan is to shut down X and to promote Y to be the new master. >> >> I shut down X, and as expected Y complains that it con no longer >> connect to it: >> >> 2018-08-15 15:10:58.785 UTC [617] LOG: replication terminated by primary server >> 2018-08-15 15:10:58.785 UTC [617] DETAIL: End of WAL reached on timeline 1 at 7/E9000098. >> 2018-08-15 15:10:58.785 UTC [617] FATAL: could not send end-of-streaming message to primary: no COPY in progress >> scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory >> 2018-08-15 15:10:59.928 UTC [354] LOG: invalid resource manager ID 22 at 7/E9000098 >> 2018-08-15 15:11:00.092 UTC [5856] FATAL: could not connect to the primary server: FATAL: the database system is shuttingdown >> >> Note that in the middle of that, after streaming has failed it >> tries to fetch the next WAL segment, E9, from the backup; this >> fails because this segment doesn't exist. >> >> Then I pg_ctl promote on Y: >> >> 2018-08-15 15:12:11.888 UTC [354] LOG: received promote request >> 2018-08-15 15:12:11.888 UTC [354] LOG: redo done at 7/E9000028 >> 2018-08-15 15:12:11.888 UTC [354] LOG: last completed transaction was at log time 2018-08-15 14:45:05.961153+00 >> scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory >> scp: backup/postgresql/archivedir/00000002.history: No such file or directory >> 2018-08-15 15:12:13.316 UTC [354] LOG: selected new timeline ID: 2 >> 2018-08-15 15:12:13.368 UTC [354] FATAL: could not open file "recovery.conf": Permission denied >> >> So in the process of exiting recovery mode it tries to rename >> recovery.conf to recovery.done, but it fails to do so because >> my recovery.conf was owned by root, not user postgres. Oooops. >> Perhaps it would be appropriate to check for that when it >> initially reads recovery.conf? Anyway, I fix the permissions >> and restart it. It starts in recovery mode and cannot connect >> to the master, as expected. >> >> 2018-08-15 15:16:58.061 UTC [6036] LOG: restored log file "0000000100000007000000E8" from archive >> 2018-08-15 15:16:58.108 UTC [6036] LOG: redo starts at 7/E8678300 >> 2018-08-15 15:16:58.731 UTC [6087] LOG: archive command failed with exit code 1 >> 2018-08-15 15:16:58.731 UTC [6087] DETAIL: The failed archive command was: ssh backup test ! -f backup/postgresql/Y/0000000100000007000000E8&& scp pg_xlog/0000000100000007000000E8 backup:backup/postgresql/Y/0000000100000007000000E8 >> scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory > > The above is not clear to me. My best guess: > > 1) The ssh backup ... is archiving WAL's from Y to Y specific archive on > 'backup'? Yes, there are two archive directories on "backup", one called "archivedir" which is for the X-to-Y replication and a second called "Y" which is for the Y-to-Z replication. > 2) The scp: backup ... is the restore_command fetching archived WAL's > that originated on X and where sent to backup/postgresql/archivedir/ ? Yes. > 3) If 2) is true why is it showing up in the error for the archive command? It's not part of the error for the archive command; it's just the next thing in the log file. Y is (1) trying write WAL segment E8 to the Y-to-Z archive and failing because it is already there and (2) trying to read WAL segment E9 from the X-to-Y archive and failing because it doesn't exist (and never will). (1) is the problem, (2) is expected. > Can you show us your actual archive and restore commands? On Y: restore_command = 'scp backup:backup/postgresql/archivedir/%f %p' archive_command = 'ssh backup test ! -f backup/postgresql/Y/%f && scp %p backup:backup/postgresql/Y/%f' >> 2018-08-15 15:16:58.823 UTC [6036] LOG: consistent recovery state reached at 7/E9000098 >> 2018-08-15 15:16:58.823 UTC [6036] LOG: invalid resource manager ID 22 at 7/E9000098 >> 2018-08-15 15:16:58.824 UTC [6035] LOG: database system is ready to accept read only connections >> 2018-08-15 15:16:58.855 UTC [6094] FATAL: could not connect to the primary server: could not connect to server: Connectionrefused >> > >> >> I can now connect to the db and do read/write operations, but it >> continues to try and fail to re-write segment E8 to its archive. >> >> Eventually I manually delete E8 from the archive; it then >> succeeds at writing it with the same content as before and seems to >> continue to function normally. > > Which archive? The Y-to-Z archive, backup:backup/postgresql/Y/...E8. >> Questions: >> >> - Can all of this be blamed on the failure of the first attempt >> to promote due to the wrong permissions on recovery.conf? >> >> - Should my archive_command detect the case where it is asked to >> write the same file again with the same contents, and report >> success in that case? >> >> - Is this a bug? Regards, Phil.
В списке pgsql-general по дате отправления:
Предыдущее
От: "Phil Endecott"Дата:
Сообщение: Re: During promotion, new master tries to archive same segmenttwice