Обсуждение: BUG #8543: Standby recovery use incorrect timeline to determine WAL length
BUG #8543: Standby recovery use incorrect timeline to determine WAL length
От
alfonso.vicente@logos.com.uy
Дата:
The following bug has been logged on the website: Bug reference: 8543 Logged by: Alfonso Vicente Email address: alfonso.vicente@logos.com.uy PostgreSQL version: 9.1.3 Operating system: OpenSUSE 11 Description: Environment: PostgreSQL 9.1.3 / OpenSUSE 11 The use case is a switchover and switchback test, using a standby cluster with Streaming Replication. We have 2 datacenters: main-site and backup-site The cluster in main-site has timeline ID 1, and we perform the switchover to backup-site without any trouble. When we execute the pg_ctl promote, the cluster sets the new timeline ID 2 (OK). But... we don't stop the cluster in main-site. So, the cluster in main-site continue to archive logs in /archlogs directory, and archives the WAL 00000001000000FE00000002 of 16MB. Immediately, we perform a backup in the backup-site, to prepare for switchback. The pg_stop_backup() force a WAL switch, and generate a WAL 00000002000000FE00000002 of 2MB. Then, the archive_command script perform the log shipping to the main-site and store the 00000002000000FE00000002 in the same /archlogs directory When we execute the pg_ctl promote in the main-site, the postgresql.log show the following error: 2013-10-19 22:39:59 UYST LOG: database system was shut down in recovery at 2013-10-19 21:47:07 UYST sh: /home/postgres/9.1/backups/archlogs/00000002.history.gz: No existe el fichero o el directorio 2013-10-19 22:39:59 UYST LOG: starting archive recovery 2013-10-19 22:39:59 UYST FATAL: archive file "00000002000000FE00000002" has wrong size: 2293760 instead of 16777216 2013-10-19 22:39:59 UYST LOG: startup process (PID 10636) exited with exit code 1 2013-10-19 22:39:59 UYST LOG: aborting startup due to startup process failure The WAL 00000002000000FE00000002 is of 2293760 bytes, this is not an error. Why PostgreSQL think WAL 00000002000000FE00000002 should be of 16MB ? ... I think the cause can be the WAL 00000001000000FE00000002 (timeline ID 1) stored in the /archlogs directory If we drop all the timeline ID 1 archived WALs, the failback executes succesfully There is an easy workaround, but please, confirm if this is a bug. The documentation says: "The default behavior of recovery is to recover along the same timeline that was current when the base backup was taken". The base backup in backup-site was timeline ID 2 Thanks in advance
Re: BUG #8543: Standby recovery use incorrect timeline to determine WAL length
От
Michael Paquier
Дата:
On Mon, Oct 21, 2013 at 11:31 PM, <alfonso.vicente@logos.com.uy> wrote: > PostgreSQL version: 9.1.3 This version is outdated, and 9.1.9 even contains important security fixes, so you should update to the latest minor release of 9.1 which is 9.1.10. > But... we don't stop the cluster in main-site. So, the cluster in main-site > continue to archive logs in /archlogs directory, and archives the WAL > 00000001000000FE00000002 of 16MB. This is fine, the old and new WAL files are not going to overlap because of their different timeline ID. > Immediately, we perform a backup in the backup-site, to prepare for > switchback. The pg_stop_backup() force a WAL switch, and generate a WAL > 00000002000000FE00000002 of 2MB. Then, the archive_command script perform > the log shipping to the main-site and store the 00000002000000FE00000002 in > the same /archlogs directory As far as I recall, the size of the WAL segment archived after pg_stop_backup should be 16MB. Isn't this a problem? > When we execute the pg_ctl promote in the main-site, the postgresql.log show > the following error: > sh: /home/postgres/9.1/backups/archlogs/00000002.history.gz: No existe el > fichero o el directorio > 2013-10-19 22:39:59 UYST LOG: starting archive recovery > 2013-10-19 22:39:59 UYST FATAL: archive file "00000002000000FE00000002" > has wrong size: 2293760 instead of 16777216 > > If we drop all the timeline ID 1 archived WALs, the failback executes > succesfully 9.1.8 and 9.1.10 include a couple fixes related to timeline handling of WAL files and detection of a database consistency point. Hence I would recommend trying once again with a latest version. > The documentation says: "The default behavior of recovery is to recover along > the same timeline that was current when the base backup was taken". The base > backup in backup-site was timeline ID 2 True. -- Michael