Discussion: are WAL file segment boundaries a point of consistency?
We use logshipping replication, and have recently noticed a nasty bug where, in certain very rare cases, the primary archive_command program will fail to send the WAL file to the standby but report good return code 0 to postgresql. In such cases, if the standby then triggers its termination of recovery mode, it will come up in normal accessible mode but missing the log records from that last WAL file.

This is a bug in our code which we will fix, but I am wondering if it means there is a possibility of worse than missing some updates. I.e. could it result in this was-standby cluster now having a corrupt database (e.g. an index entry with no matching heap slot or something like that - or worse)?

I think the question is whether the end of a WAL file is a point of consistency, like the timestamp you can specify in the recovery.conf for a point-in-time recovery. Or does postgresql xlogger just chop each WAL segment at the physical page boundary?

Cheers, John Lumby
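For readers hitting the same failure mode, here is a minimal sketch of an archive step that reports success only after verifying the copy. It is not the poster's actual program: the function name `archive_segment`, the checksum comparison, and the idea of copying into a local or mounted archive directory are all assumptions made for illustration.

```python
import hashlib
import os
import shutil


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_segment(src_path, dest_dir):
    """Copy one WAL segment into dest_dir (hypothetical archive location).

    Returns 0 only when the archived copy is verified identical to the
    source, and 1 on any failure, so the caller never reports a false
    success.
    """
    dest_path = os.path.join(dest_dir, os.path.basename(src_path))
    tmp_path = dest_path + ".tmp"
    try:
        if os.path.exists(dest_path):
            # Never overwrite an archived segment; succeed only if identical.
            return 0 if sha256_of(dest_path) == sha256_of(src_path) else 1
        # Copy under a temporary name so a partial file is never mistaken
        # for a complete segment.
        shutil.copyfile(src_path, tmp_path)
        if sha256_of(tmp_path) != sha256_of(src_path):
            os.remove(tmp_path)
            return 1
        os.replace(tmp_path, dest_path)
        return 0
    except OSError:
        # Missing source, unreachable destination, full disk, and so on.
        return 1
```

A script wrapping this would end with `sys.exit(archive_segment(sys.argv[1], sys.argv[2]))` and could be wired in with something like `archive_command = 'python3 /path/to/archive_wal.py %p /mnt/standby_archive'`, where the script path and archive directory are placeholders. A nonzero exit status tells the server the segment was not archived, so it keeps the file and retries.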
I would look at WAL as a sequence of commits within timelines rather than as a sequence of files. With either recovery_target_time or recovery_target_xid you can specify the point of consistency you want to reach.
Cheers,
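To make the recovery-target suggestion above concrete, the following sketch renders a recovery.conf with a single target. The parameter names `restore_command`, `recovery_target_time` and `recovery_target_xid` are real recovery.conf settings; the helper `render_recovery_conf` and the example path are hypothetical and only illustrate the idea.

```python
def render_recovery_conf(restore_command, target_time=None, target_xid=None):
    """Build recovery.conf text with at most one recovery target.

    Hypothetical helper: a target is given either as a timestamp or as a
    transaction ID, never both.
    """
    if target_time is not None and target_xid is not None:
        raise ValueError("specify only one of target_time or target_xid")

    def quote(value):
        # recovery.conf embeds a single quote by doubling it.
        return "'" + str(value).replace("'", "''") + "'"

    lines = ["restore_command = " + quote(restore_command)]
    if target_time is not None:
        lines.append("recovery_target_time = " + quote(target_time))
    if target_xid is not None:
        lines.append("recovery_target_xid = " + quote(target_xid))
    return "\n".join(lines) + "\n"


# Example: stop replay at a chosen timestamp (path is a placeholder).
example_conf = render_recovery_conf(
    "cp /mnt/standby_archive/%f %p",
    target_time="2013-09-06 13:00:00",
)
```

With no target set, recovery simply replays all the WAL it can obtain, which is the situation described in the original question.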
On Fri, Sep 6, 2013 at 1:26 PM, John Lumby <johnlumby@hotmail.com> wrote:
> We use logshipping replication, and have recently noticed a nasty bug
> where, in certain very rare cases, the primary archive_command program
> will fail to send the WAL file to the standby but report good return code 0 to postgresql.
> In such cases, if the standby then triggers its termination of recovery mode,
> it will come up in normal accessible mode but missing the log records from that last WAL file.
>
> This is a bug in our code which we will fix, but I am wondering if it means there is a possibility
> of worse than missing some updates. I.e. could it result in this was-standby cluster now having
> a corrupt database (e.g. an index entry with no matching heap slot or something like that - or worse)?

As long as the standby reached consistency in the first place, it should not lose it due to this issue. Once consistency is reached, changes to the data files are driven only by replay of the WAL records, and those should only take the database from one consistent state to another.

Where you risk corruption is if the problem occurred while you were taking the base backup. Then some of the base files that were copied might already contain data from the "future", but that future cannot be reached because recovery stops early due to the lost file. The database should detect this situation and refuse to start, forcing you to retake the base backup or use an earlier one. But there were known bugs in this general area, some fixed in 9.2.3.

Cheers,
Jeff