Discussion: Unable to start replica after failover
Hi,
We're using asynchronous streaming replication together with WAL archiving and recovery (we've set these parameters: archive_command, restore_command, recovery_target_timeline=latest).
To manage backup and recovery, we're using pg_probackup.
We have a 3-node cluster managed by Patroni on Raft.
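For reference, a configuration along the lines described (archiving via pg_probackup, restore from the archive, recovery to the latest timeline) might look roughly like this; the backup directory and instance name below are made-up placeholders, not our actual settings:

```
# postgresql.conf (illustrative sketch; /backup and node1 are hypothetical)
archive_mode = on
archive_command = 'pg_probackup archive-push -B /backup --instance node1 --wal-file-path=%p --wal-file-name=%f'
restore_command = 'pg_probackup archive-get -B /backup --instance node1 --wal-file-path=%p --wal-file-name=%f'
recovery_target_timeline = 'latest'
```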
While running our cluster, we encounter a recurring problem: after a failover, the former leader can recover neither via the tools provided by PostgreSQL nor via the Patroni tools (pg_rewind).
We've seen a number of different WAL reading errors. For example:
· "could not find previous WAL record at E6F/C2436F50: invalid resource manager ID 139 at E6F/C2436F50"; or
· "could not find previous WAL record at 54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 000000050000054E000000FB, offset 3448832".
There have been some other variations but all such errors have the same cause:
1. At startup, the former leader does not immediately get access to the backup archive. As a result, the recovery process uses the local WAL, which inevitably leads to a split-brain, since after a failover there is, in most cases, some replication lag.
2. The former leader then finally manages to fetch the latest log from the archive, and that log takes priority over the local log, even though recovery with the local log has already taken place.
As a result, we see the following:
"new timeline 3 forked off current database system timeline 2 before current recovery point 54E/FB348150".
When pg_rewind is run, it also uses the log from the archive (the lagging log from the new leader) instead of the partial log with which the former leader has already been started.
Next, pg_rewind returns errors while reading that archived log backwards, looking for the last checkpoint. This is quite reasonable: once a new leader starts, the point of divergence normally ends up in the next timeline, and the previous timeline's archived log does not contain a block with the LSN of the divergence.
As a result, it is also impossible to build a map of the pages that have been changed in order to roll back the changes using pg_rewind.
To avoid the problem, we decided to stop using restore_command. Could you please let us know if there is a better solution to the problem we've described?
At Fri, 29 Jul 2022 15:01:44 +0000, "Lahnov, Igor" <Igor.Lahnov@nexign.com> wrote in

> * "could not find previous WAL record at E6F/C2436F50: invalid resource manager ID 139 at E6F/C2436F50"; or
> * "could not find previous WAL record at 54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 000000050000054E000000FB, offset 3448832".
..
> To avoid the problem, we decided to stop using restore_command. Could you please let us know if there is a better solution to the problem we've described?

Perhaps almost none of us can see what is happening there, since you didn't give us sufficient information on the configuration and the exact steps. But roughly it looks like shuffling/mixing of WAL files among several systems (or WAL archives) with different histories.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
What additional information is needed?
> What additional information is needed?

Usually the server logs and the output of pg_rewind at the time of the trouble are needed as the first step.

> Next, pg_rewind returns errors while reading the log from the backup
> back, looking for the last checkpoint, which is quite reasonable
> because, once a new leader starts, the point of divergence normally
> ends up in the next timeline and the previous timeline's backup log
> does not have a block with the LSN of the divergence.

That sounds like pg_rewind is crap. pg_rewind reads the timeline history files from both servers to find the last timeline up to which the two servers share the same history, then determines the divergence point as the latest LSN the two servers are known to share. Then it overwrites the pages on the former leader that were modified in the *previous* timeline between the common checkpoint and the last (shutdown) checkpoint with the data of the same pages *on the new leader*. No page data from the older timeline is needed. If nothing is going wrong, pg_rewind is not expected to face the situation of:

> could not find previous WAL record at E6F/C2436F50: invalid resource manager ID 139 at E6F/C2436F50
> could not find previous WAL record at 54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 000000050000054E000000FB

which means the WAL files are somehow broken.

> When pg_rewind is run, it also uses the log from the backup (the
> lagging log from the new leader) instead of the partial log with
> which the former leader has already been started.

I don't see how the former leader could lack access to the partial log (the latest WAL file, I suppose). It is essential for pg_rewind to work (since it exists nowhere other than there), and it must be in the pg_wal directory unless someone removed it. Thus, I think we need the exact steps you and your system took, with respect to PostgreSQL, after the failover happened.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
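The divergence search described above can be sketched in Python for illustration. This is a conceptual model of how pg_rewind walks the two servers' timeline history files, not its actual code; the function names are made up:

```python
def parse_history(text):
    """Parse the body of a timeline .history file into a list of
    (timeline_id, switchpoint_lsn) entries. Each non-comment line has
    the form: <parent_tli> <hi>/<lo LSN> <reason...>."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        tli, lsn = line.split()[:2]
        hi, lo = lsn.split('/')
        entries.append((int(tli), (int(hi, 16) << 32) | int(lo, 16)))
    return entries

def divergence_point(src_history, src_tli, tgt_history, tgt_tli):
    """Walk both histories in parallel and return (timeline, lsn) of
    the latest point the two servers are known to share. A timeline's
    own entry has no switchpoint yet, so it extends 'to infinity'."""
    src = parse_history(src_history) + [(src_tli, 2**64)]
    tgt = parse_history(tgt_history) + [(tgt_tli, 2**64)]
    common = None
    for (t1, end1), (t2, end2) in zip(src, tgt):
        if t1 != t2:
            break  # histories fork here
        common = (t1, min(end1, end2))
    return common
```

Fed the history from this thread (new leader on timeline 3 whose history file lists switchpoints 53E/FD65D298 and 54E/FB348118, former leader still on timeline 2), it reports divergence at timeline 2, LSN 54E/FB348118, matching pg_rewind's "servers diverged at" output below.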
We know what the problem is, but don't know how to solve it correctly.

After the failover, the *new leader* was promoted and read the local partial log up to LSN 54E/FB348118:
---
2022-05-23 01:47:52.124 [12088] LOG: record with incorrect prev-link 0/100 at 54E/FB348118
2022-05-23 01:47:52.124 [12088] LOG: redo done at 54E/FB3480F0 system usage: CPU: user: 321.21 s, system: 144.77 s, elapsed: 354977.44 s
rmgr: XLOG len (rec/tot): 42/42, tx: 0, lsn: 54E/FB348118, prev 54E/FB3480F0, desc: END_OF_RECOVERY tli 3; prev tli 2; time 2022-05-23 01:47:53.342100
2022-05-23 01:47:52.900 [12088] LOG: selected new timeline ID: 3
2022-05-23 01:47:53.658 [12084] LOG: database system is ready to accept connections
---

The *previous leader* started after the failure and read the local partial log up to LSN 54E/FB348150:
---
2022-05-23 01:50:12.945 [122891] LOG: redo starts at 54E/F97F4878
2022-05-23 01:50:13.010 [122891] LOG: invalid record length at 54E/FB348150: wanted 24, got 0
2022-05-23 01:50:13.011 [122891] LOG: consistent recovery state reached at 54E/FB348150
2022-05-23 01:50:13.012 [122883] LOG: database system is ready to accept read-only connections
---

Our restore_command on the *previous leader* then restored a partial file from the archive (from the *new leader*):
---
2022-05-23 01:50:14 [123730]: [1-1]: INFO: pg_probackup archive-get WAL file: 000000020000054E000000FB, remote: ssh, threads: 1/1, batch: 1
---

And this file has a higher priority than the original local partial file on the *previous leader*, which leads to the problem:
---
2022-05-23 01:50:14.448 [122891] LOG: new timeline 3 forked off current database system timeline 2 before current recovery point 54E/FB348150
---

pg_rewind then tries to use this file (from the *new leader*) to build a map of pages that have changed since the last checkpoint:
---
2022-05-23 01:51:32,202 INFO: Lock owner: pg51; I am pg01
2022-05-23 01:51:32,209 INFO: Local timeline=2 lsn=54E/FB348150
2022-05-23 01:51:32,252 INFO: master_timeline=3
2022-05-23 01:51:32,254 INFO: master: history=1 53E/FD65D298 no recovery target specified
2 54E/FB348118 no recovery target specified
2022-05-23 01:51:32,499 INFO: running pg_rewind from pg51
2022-05-23 01:51:32,574 INFO: running pg_rewind from dbname=postgres user=postgres host=IP port=5432 target_session_attrs=read-write
2022-05-23 01:51:32,640 INFO: pg_rewind exit code=1
2022-05-23 01:51:32,640 INFO: stdout=
2022-05-23 01:51:32,640 INFO: stderr=pg_rewind: servers diverged at WAL location 54E/FB348118 on timeline 2
pg_rewind: fatal: could not find previous WAL record at 54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 000000020000054E000000FB, offset 3448832
---

We checked this log (from the *new leader*). When reading it forwards, the output is:
---
record with incorrect prev-link 0/100 at 54E/FB348118
---
and when reading it backwards:
---
54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 000000020000054E000000FB, offset 3448832
---

We have now disabled restore_command, but I think that is not the best solution. What is the correct solution in this case?
> Our restore_command on the *previous leader* restores a partial file from the archive (from the *new leader*):
> ---
> 2022-05-23 01:50:14 [123730]: [1-1]: INFO: pg_probackup archive-get WAL file: 000000020000054E000000FB, remote: ssh, threads: 1/1, batch: 1
> ---
Hi,
Yes, the *partial* file from the *new leader* is restored to the *last leader* and renamed to 000000020000054E000000FB, without the *.partial* suffix.
>>Postgres asks for file 000000020000054E000000FB but somehow gets
>>000000020000054E000000FB.partial instead. Why?
Yes, Postgres asks for 000000020000054E000000FB and gets the renamed 000000020000054E000000FB.partial (without the *.partial* suffix).
And, finally, the *last leader* has a local 000000020000054E000000FB.partial, with the segments needed for recovery, and also 000000020000054E000000FB (the renamed 000000020000054E000000FB.partial from the new leader). But 000000020000054E000000FB has a higher priority than 000000020000054E000000FB.partial.
Also, the *new leader* archives a full 000000030000054E000000FB on the new timeline, and this file contains the same data as 000000020000054E000000FB.
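For clarity, those two segment names differ only in their first eight hex digits (the timeline ID); the log and segment parts are identical, which is why both names can refer to the same data. A tiny Python helper (illustrative only, not part of any tool mentioned here) makes the structure explicit:

```python
def parse_wal_name(name):
    """Split a WAL segment file name into (timeline, log, seg, suffix).
    The base name is 24 hex digits: 8 for the timeline ID, 8 for the
    log number, 8 for the segment number; a suffix such as '.partial'
    may be appended."""
    base, _, suffix = name.partition('.')
    if len(base) != 24:
        raise ValueError(f"not a WAL segment name: {name!r}")
    return (int(base[:8], 16), int(base[8:16], 16),
            int(base[16:24], 16), suffix or None)

# 000000020000054E000000FB and 000000030000054E000000FB differ only in
# the timeline component (2 vs 3); log 0x54E and segment 0xFB match.
```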
I can't answer your question. How do you think recovery from the archive should work in this case?
1. the partial file should not be restored at all, in any case
2. the partial file should be restored as a partial and replace the local partial
3. the restore command should return a conflict: the file already exists
As far as I understand from the logs, the last leader does not yet know about the new timeline
and is trying to download the full log from the previous timeline.
It seems there should be a conflict in this case, because the partial file already exists locally, but this does not happen.
Instead, the partial file from the new leader is loaded as a full log.
Do you think it is possible to add a check to the restore command that a partial or full file already exists?
Or is disabling the restore command a possible solution in this case?
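As an illustration of option 3 above, a restore_command wrapper could refuse to fetch a segment that already exists locally (fully or as .partial). This is only a sketch under our assumptions: the paths, instance name, and the delegation to pg_probackup are hypothetical, and whether such a check is actually safe is exactly the question being asked here.

```python
#!/usr/bin/env python3
"""Sketch of a restore_command wrapper: fail with a conflict when the
requested WAL segment (or its .partial version) already exists locally,
so recovery falls back to the local copy instead of the archived one.
Paths and the archive tool invocation are hypothetical."""
import os
import subprocess
import sys

def restore(wal_name, dest_path, pg_wal='pg_wal'):
    local = os.path.join(pg_wal, wal_name)
    if os.path.exists(local) or os.path.exists(local + '.partial'):
        # Conflict: a local copy exists. Return non-zero so PostgreSQL
        # treats the segment as "not found in the archive" and uses
        # the file already present in pg_wal.
        return 1
    # Otherwise delegate to the real archive tool (pg_probackup here,
    # with made-up -B/--instance values).
    return subprocess.call(['pg_probackup', 'archive-get',
                            '-B', '/backup', '--instance', 'node1',
                            '--wal-file-name=' + wal_name,
                            '--wal-file-path=' + dest_path])

if __name__ == '__main__':
    sys.exit(restore(sys.argv[1], sys.argv[2]))
```

Configured as e.g. `restore_command = '/usr/local/bin/restore_wrapper.py %f %p'`, this would have returned a conflict for 000000020000054E000000FB in the scenario above, because 000000020000054E000000FB.partial was already present locally.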