Corner-case bugs in ThisStartUpID accounting

From: Tom Lane
I've been studying a log file that Rao Kumar sent me, showing a 7.3.2
system having some serious difficulties :-(.  I believe it exposes an
error in the startup-id logic in xlog.c.  The system is crashing
regularly due to apparent data corruption in heavily-used tables.
Now watch this:

2003-05-19 18:04:18 PANIC:  open of /usr/local/pgsql/data/pg_clog/0000 failed: No such file or directory
2003-05-19 18:04:18 LOG:  server process (pid 72392) was terminated by signal 6
2003-05-19 18:04:18 LOG:  terminating any other active server processes
2003-05-19 18:04:18 LOG:  all server processes terminated; reinitializing shared memory and semaphores
2003-05-19 18:04:18 LOG:  database system was interrupted at 2003-05-19 18:04:16 EDT
2003-05-19 18:04:18 LOG:  checkpoint record is at 9/54ACD800
2003-05-19 18:04:18 LOG:  redo record is at 9/54ACD800; undo record is at 0/0; shutdown TRUE
2003-05-19 18:04:18 LOG:  next transaction id: 2762375; next oid: 4292578
2003-05-19 18:04:18 LOG:  database system was not properly shut down; automatic recovery in progress
2003-05-19 18:04:18 LOG:  redo starts at 9/54ACD840
2003-05-19 18:04:18 LOG:  ReadRecord: unexpected pageaddr 9/4BB2A000 in log file 9, segment 84, offset 11706368
2003-05-19 18:04:18 LOG:  redo done at 9/54B27F90
2003-05-19 18:04:21 LOG:  database system is ready
2003-05-19 18:04:33 PANIC:  open of /usr/local/pgsql/data/pg_clog/0000 failed: No such file or directory
2003-05-19 18:04:33 LOG:  server process (pid 72401) was terminated by signal 6
2003-05-19 18:04:33 LOG:  terminating any other active server processes
2003-05-19 18:04:33 LOG:  all server processes terminated; reinitializing shared memory and semaphores
2003-05-19 18:04:33 LOG:  database system was interrupted at 2003-05-19 18:04:21 EDT
2003-05-19 18:04:33 LOG:  checkpoint record is at 9/54B2A010
2003-05-19 18:04:33 LOG:  redo record is at 9/54B2A010; undo record is at 0/0; shutdown TRUE
2003-05-19 18:04:33 LOG:  next transaction id: 2762386; next oid: 4300770
2003-05-19 18:04:33 LOG:  database system was not properly shut down; automatic recovery in progress
2003-05-19 18:04:33 LOG:  redo starts at 9/54B2A050
2003-05-19 18:04:34 LOG:  ReadRecord: record with zero length at 9/54CBFFC4
2003-05-19 18:04:34 LOG:  redo done at 9/54CBFFA0
2003-05-19 18:04:36 LOG:  database system is ready
2003-05-19 18:04:36 PANIC:  open of /usr/local/pgsql/data/pg_clog/0000 failed: No such file or directory
2003-05-19 18:04:37 LOG:  server process (pid 72410) was terminated by signal 6
2003-05-19 18:04:37 LOG:  terminating any other active server processes
2003-05-19 18:04:37 LOG:  all server processes terminated; reinitializing shared memory and semaphores
2003-05-19 18:04:37 LOG:  database system was interrupted at 2003-05-19 18:04:36 EDT
2003-05-19 18:04:37 LOG:  ReadRecord: out-of-sequence SUI 43 (after 44) in log file 9, segment 84, offset 13369344
2003-05-19 18:04:37 LOG:  invalid primary checkpoint record
2003-05-19 18:04:37 LOG:  using previous checkpoint record at 9/54B2A010
2003-05-19 18:04:37 LOG:  redo record is at 9/54B2A010; undo record is at 0/0; shutdown TRUE
2003-05-19 18:04:37 LOG:  next transaction id: 2762386; next oid: 4300770
2003-05-19 18:04:37 LOG:  database system was not properly shut down; automatic recovery in progress
2003-05-19 18:04:37 LOG:  redo starts at 9/54B2A050
2003-05-19 18:04:37 LOG:  ReadRecord: out-of-sequence SUI 43 (after 44) in log file 9, segment 84, offset 13369344
2003-05-19 18:04:37 LOG:  redo done at 9/54CBFFA0
2003-05-19 18:04:39 LOG:  database system is ready

My interpretation of this trace is:

1. The crash restart at 18:04:18 recovered from the prior checkpoint (at
9/54ACD800), which contained SUI 43.  It ran forward through additional
WAL entries, redoing as needed, then wrote a fresh checkpoint record
(at 9/54B2A010), incremented its SUI to 44, and opened for business.

2. In the next few seconds the DB was able to do enough work to fill one
or more XLOG pages, so that there were some pages with xlp_sui = 44 in
their headers.  It did not live long enough to write another checkpoint
though.

3. The crash at 18:04:33 occurs with WAL filled almost but not quite to
a page boundary (note "redo done at 9/54CBFFA0").  The redo starts at
9/54B2A010 which contains SUI 43.  It rolls forward to the end of the
log, writes another checkpoint record, increments its SUI to 44, and
is open for business.  However, this checkpoint record crosses a WAL
page boundary, and the xlp_sui recorded in the second page's header
will be 43.  Also, we have failed to advance StartUpID --- it is now
the same as in the previous cycle of life, which I do not think is the
intended behavior.

4. Now we are screwed, because any attempt to re-read this checkpoint
record will detect that the first half is on an XLOG page with xlp_sui
44 while the second half is on a page with xlp_sui 43.  xlog.c will
quite properly suspect that the second page is stale, reject the
checkpoint, and fall back to the prior checkpoint --- as indeed we
see it doing in the crash at 18:04:37.  The cycle will then repeat
*exactly*, right down to the recreation of the broken checkpoint record.
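
To make point 4 concrete, here is a toy model of the continuation-page
check (this is *not* the actual xlog.c code; apart from xlp_sui, the
types and names are simplified stand-ins invented for illustration):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    typedef uint32_t StartUpID;

    /* stand-in for the relevant part of the XLOG page header */
    typedef struct ToyPageHeader
    {
        StartUpID   xlp_sui;    /* SUI stamped when the page was written */
    } ToyPageHeader;

    /*
     * ReadRecord-style rule: while scanning forward, a page whose SUI is
     * lower than the previous page's must contain stale leftover data.
     */
    static bool
    page_sui_ok(StartUpID prev_sui, StartUpID this_sui)
    {
        return this_sui >= prev_sui;
    }

    int
    main(void)
    {
        ToyPageHeader first = {44};  /* page where the checkpoint starts */
        ToyPageHeader cont = {43};   /* continuation page, stamped before
                                      * the SUI was bumped */

        if (!page_sui_ok(first.xlp_sui, cont.xlp_sui))
            printf("out-of-sequence SUI %u (after %u): checkpoint rejected\n",
                   (unsigned) cont.xlp_sui, (unsigned) first.xlp_sui);
        return 0;
    }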

Subsequent entries in the log indeed show several cycles of reversion
to the checkpoint at 9/54B2A010.  The only way to get out of the loop
is for the system to stay up long enough to perform a normal checkpoint
that will write a new not-broken primary checkpoint record.

But it gets worse :-(.  What actually happened, a few crashes later,
was that the system stayed up long enough to get a pretty fair amount of
work done.  Not long enough for a checkpoint, but long enough to write
out quite a few data pages that have WAL LSNs beyond the broken
checkpoint.  After the next crash and restart, we start getting
XLogFlush complaints about those pages:

2003-05-19 18:06:01 PANIC:  open of /usr/local/pgsql/data/pg_clog/06E7 failed: No such file or directory
2003-05-19 18:06:01 LOG:  server process (pid 72485) was terminated by signal 6
2003-05-19 18:06:01 LOG:  terminating any other active server processes
2003-05-19 18:06:01 LOG:  all server processes terminated; reinitializing shared memory and semaphores
2003-05-19 18:06:01 LOG:  database system was interrupted at 2003-05-19 18:05:41 EDT
2003-05-19 18:06:01 LOG:  ReadRecord: out-of-sequence SUI 43 (after 44) in log file 9, segment 84, offset 13369344
2003-05-19 18:06:01 LOG:  invalid primary checkpoint record
2003-05-19 18:06:01 LOG:  using previous checkpoint record at 9/54B2A010
2003-05-19 18:06:01 LOG:  redo record is at 9/54B2A010; undo record is at 0/0; shutdown TRUE
2003-05-19 18:06:01 LOG:  next transaction id: 2762386; next oid: 4300770
2003-05-19 18:06:01 LOG:  database system was not properly shut down; automatic recovery in progress
2003-05-19 18:06:01 LOG:  redo starts at 9/54B2A050
2003-05-19 18:06:01 LOG:  ReadRecord: out-of-sequence SUI 43 (after 44) in log file 9, segment 84, offset 13369344
2003-05-19 18:06:01 LOG:  redo done at 9/54CBFFA0
2003-05-19 18:06:03 LOG:  database system is ready
2003-05-19 18:06:07 ERROR:  XLogFlush: request 9/552E847C is not satisfied --- flushed only to 9/54D18E18

This is disastrous because it means we have abandoned a bunch of WAL
entries that correspond to data page updates that are already down to
disk.  We now have no guarantee of data consistency, since we can't be
sure that all actions described by the lost WAL entries are down to
disk.
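
The message comes from the sanity check at the bottom of XLogFlush().
Roughly (simplified from the 7.3-era xlog.c; the surrounding locking and
write logic is elided):

    /*
     * We tried to flush WAL up through `record`, the LSN found on a data
     * page.  If the flushed position still lags the request, the WAL
     * that described this page's updates no longer exists --- which is
     * exactly the disaster case described above.
     */
    if (XLByteLT(LogwrtResult.Flush, record))
        elog(ERROR,
             "XLogFlush: request %X/%X is not satisfied --- flushed only to %X/%X",
             record.xlogid, record.xrecoff,
             LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);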

To prevent this problem in future we need to take steps to ensure that
ThisStartUpID can never go backwards.  I think that it is correct for
WAL replay to be executed with the same startupID that was current when
the WAL entries were being made (Vadim, can you confirm that that is the
required behavior?).  I propose the following changes:

1. It is still correct to adopt ThisStartUpID from the checkpoint we are
restarting from as we begin to roll forward.  However, redo for a
checkpoint record must adopt the ThisStartUpID seen in the checkpoint;
else we fail to handle the case where we are rolling forward over a
crash-restart checkpoint with a new SUI.
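
Something along these lines in xlog_redo() would do it (a sketch only;
the names are from 7.3 xlog.c, but the exact surrounding code is assumed
rather than quoted):

    static void
    xlog_redo(XLogRecPtr lsn, XLogRecord *record)
    {
        uint8       info = record->xl_info & ~XLR_INFO_MASK;

        if (info == XLOG_CHECKPOINT_SHUTDOWN ||
            info == XLOG_CHECKPOINT_ONLINE)
        {
            CheckPoint  checkPoint;

            memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
            /*
             * Adopt the SUI saved in the checkpoint record, so that
             * rolling forward over a crash-restart checkpoint picks up
             * its new startup ID.
             */
            ThisStartUpID = checkPoint.ThisStartUpID;
            /* ... existing nextXid/nextOid bookkeeping ... */
        }
    }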

2. ThisStartUpID must be advanced to its new value *before* we write the
end-of-recovery checkpoint, rather than after as it is now.  This
ensures that the new checkpoint will have the current startup ID
(preventing the reversion-to-a-previously-used-SUI behavior seen above)
and that if it chances to cross an XLOG page boundary we won't have an
out-of-sequence SUI in the continuation page header.
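
In other words (a sketch against the end-of-recovery path in
StartupXLOG(); where exactly the increment lives today is elided, this
only shows the proposed ordering):

    if (InRecovery)
    {
        /*
         * Advance to the new startup ID *before* writing the recovery-
         * completion checkpoint, so that the checkpoint record --- and
         * any XLOG page it spills onto --- is stamped with the current
         * SUI.
         */
        ThisStartUpID++;            /* but see change 3 for where the new
                                     * value really ought to come from */
        CreateCheckPoint(true, true);
    }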

3. When we select a new StartUpID after finishing rollforward, I think
we ought to use one plus the value shown in pg_control's copy of the
latest checkpoint, rather than relying on what we extracted from the
last checkpoint we were able to read from XLOG.  This should prevent
any possibility of StartUpID going backwards or not incrementing because
we lost a checkpoint record in XLOG.
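
That is, instead of incrementing whatever SUI we picked up from the
checkpoint we replayed, do (sketch; names from 7.3 xlog.c):

    /*
     * pg_control's checkPointCopy reflects the latest checkpoint written,
     * even if that record later proves unreadable in XLOG, so this value
     * can never take the SUI backwards.
     */
    ThisStartUpID = ControlFile->checkPointCopy.ThisStartUpID + 1;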

4. The bit in StartupXLOG where it initializes a new XLOG page header
with 
       if (InRecovery)
           Insert->currpage->xlp_sui = ThisStartUpID;
       else
           Insert->currpage->xlp_sui = ThisStartUpID + 1;

now scares the heck out of me, not least because it is using a startup
id adopted from a checkpoint that might not be the last one.  What I
would suggest we do is simply dike out that whole chunk of code (lines
2676-2706 in CVS tip) and not try to be smart about
almost-at-end-of-page cases.  It would be better to handle the
advance-to-new-page case in the normal flow of control when we are
actually ready to write the first new WAL entry; by that point we should
know the correct SUI to put into the page header.  I'm not totally sure
about this though.  Is there some reason why StartupXLOG needs to handle
end-of-page as a special case?
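
For comparison, the normal-flow page initialization already stamps the
current SUI; abbreviated from AdvanceXLInsertBuffer() (7.3-era code,
some header fields elided):

    /* initialize the header of the page we are advancing onto */
    NewPage->xlp_magic = XLOG_PAGE_MAGIC;
    NewPage->xlp_info = 0;
    NewPage->xlp_sui = ThisStartUpID;   /* correct by the time we get
                                         * here, given changes 1-3 above */

Once the special case is gone, this path would be the only place new
page headers acquire an SUI, which seems much harder to get wrong.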

I would really like to fix this for 7.3.3, so please comment ASAP.
        regards, tom lane