Обсуждение: BUG #13459: Replaying WAL logs can hang on startup

Поиск
Список
Период
Сортировка

BUG #13459: Replaying WAL logs can hang on startup

От
chris+postgresql@qwirx.com
Дата:
The following bug has been logged on the website:

Bug reference:      13459
Logged by:          Chris Wilson
Email address:      chris+postgresql@qwirx.com
PostgreSQL version: 9.4.1
Operating system:   Linux 2.6.32-220.30.1.el6.x86_64
Description:

Dear developers,

I restored a standby from a pg_basebackup made from the master the previous
morning. As a result, it had a lot of WAL logs to catch up on.

At one point it hangs while restoring logs. It normally takes a few seconds
to process a 16 MB WAL segment, but on this one, it was "recovering
0000000100000CEC00000025" for 7 minutes now with no log output at all.

I restarted it at least once and it reached the same point and hung again. I
attached gdb and saw that it was hung here:

(gdb) bt
#0  0x0000003f358e1433 in __select_nocancel () from /lib64/libc.so.6
#1  0x00000000008be748 in pg_usleep ()
#2  0x000000000074ef9f in ForgetRelationFsyncRequests ()
#3  0x000000000074d0b0 in mdunlink ()
#4  0x000000000074faa5 in smgrdounlink ()
#5  0x00000000004d5b3a in xact_redo_commit_internal ()
#6  0x00000000004d5c36 in xact_redo_commit ()
#7  0x00000000004d5f37 in xact_redo ()
#8  0x00000000004eb36a in StartupXLOG ()
#9  0x00000000006eac29 in StartupProcessMain ()
#10 0x00000000004f9eeb in AuxiliaryProcessMain ()
#11 0x00000000006e9ea1 in StartChildProcess ()
#12 0x00000000006e50ab in PostmasterMain ()
#13 0x000000000064e693 in main ()

Browsing the code, I saw this interesting comment in
ForgetRelationFsyncRequests:

void
ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
{
        if (pendingOpsTable)
        {
                /* standalone backend or startup process: fsync state is
local */
                RememberFsyncRequest(rnode, forknum,
FORGET_RELATION_FSYNC);
        }
        else if (IsUnderPostmaster)
        {
                /*
                 * Notify the checkpointer about it.  If we fail to queue
the cancel
                 * message, we have to sleep and try again ... ugly, but
hopefully
                 * won't happen often.
                 *
                 * XXX should we CHECK_FOR_INTERRUPTS in this loop?
Escaping with an
                 * error would leave the no-longer-used file still present
on disk,
                 * which would be bad, so I'm inclined to assume that the
checkpointer
                 * will always empty the queue soon.
                 */
                while (!ForwardFsyncRequest(rnode, forknum,
FORGET_RELATION_FSYNC))
                        pg_usleep(10000L);      /* 10 msec seems a good
number */

                /*
                 * Note we don't wait for the checkpointer to actually
absorb the
                 * cancel message; see mdsync() for the implications.
                 */
        }
}

My guess is that ForwardFsyncRequest() is continually returning false, and
this code is stuck forever. I noticed that it says that "I'm inclined to
assume that the checkpointer
will always empty the queue soon", but there is no checkpointer running
during recovery, is there?

Although I managed to recover the system by restoring from a different
backup the next day, bypassing this logfile, I think I still have the copy
that hangs on this logfile so I should be able to conduct further
investigations if required. Please let me know.

Cheers, Chris.

Re: BUG #13459: Replaying WAL logs can hang on startup

От
Tom Lane
Дата:
chris+postgresql@qwirx.com writes:
> I restored a standby from a pg_basebackup made from the master the previous
> morning. As a result, it had a lot of WAL logs to catch up on.
> At one point it hangs while restoring logs. It normally takes a few seconds
> to process a 16 MB WAL segment, but on this one, it was "recovering
> 0000000100000CEC00000025" for 7 minutes now with no log output at all.

Hmm ...

> My guess is that ForwardFsyncRequest() is continually returning false, and
> this code is stuck forever. I noticed that it says that "I'm inclined to
> assume that the checkpointer
> will always empty the queue soon", but there is no checkpointer running
> during recovery, is there?

There is supposed to be one once we have reached a consistent state; see
SetForwardFsyncRequests().  AFAICS it should be impossible to reach the
wait you're seeing unless the startup process's local pendingOpsTable has
been removed by SetForwardFsyncRequests(), and the caller of that should
have pinged the postmaster to start up a checkpointer.

If you can repro this easily, please look to see whether there's a
checkpointer, and if not, what state the postmaster is in (pmState,
CheckpointerPID, Shutdown, FatalError, RecoveryError might be
interesting).  If there is a checkpointer, then that's what to be
looking at.

            regards, tom lane