Various bugs if segment containing redo pointer does not exist

From: Andres Freund
Hi,

I investigated a crashing postgres instance. Turns out the original issue was
operator error. But in the process I found a few postgres issues. The scenario
is basically that the redo LSN and checkpoint LSN are in separate segments, and
that, for whatever reason, the file containing the redo LSN is missing.
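
To make that concrete: an LSN maps to a WAL segment file name with simple
arithmetic.  Here is a minimal standalone sketch of that mapping (my own
reimplementation, not the macros from xlog_internal.h; it assumes the default
16MB segment size, and the LSNs in main() are made up for illustration):

#include <stdio.h>
#include <stdint.h>

/* Assumed defaults: 16MB WAL segments. */
#define WAL_SEGMENT_SIZE    (16 * 1024 * 1024)
#define SEGMENTS_PER_XLOGID ((uint64_t) 0x100000000ULL / WAL_SEGMENT_SIZE)

/* Print the pg_wal file name that contains the given LSN. */
static void
print_segment_for_lsn(uint32_t tli, uint64_t lsn)
{
    uint64_t segno = lsn / WAL_SEGMENT_SIZE;

    printf("%X/%08X -> %08X%08X%08X\n",
           (uint32_t) (lsn >> 32), (uint32_t) lsn,
           tli,
           (uint32_t) (segno / SEGMENTS_PER_XLOGID),
           (uint32_t) (segno % SEGMENTS_PER_XLOGID));
}

int
main(void)
{
    /* made-up example LSNs: a redo pointer and a slightly later checkpoint */
    print_segment_for_lsn(1, 0x95F05570);   /* -> 000000010000000000000095 */
    print_segment_for_lsn(1, 0x96000028);   /* -> 000000010000000000000096 */
    return 0;
}

So a redo pointer and a checkpoint record that are only a little apart can
already live in different segment files; losing just the older of the two is
enough to get into this scenario.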

I'd expect a PANIC in that situation. But, as it turns out, no such luck.

I've looked at 15 and HEAD so far.

1) For some reason I haven't yet debugged, the ReadRecord(PANIC) in
   PerformWalRecovery() doesn't PANIC, but instead just returns NULL

   We *do* output a DEBUG message, but well, that's insufficient.


2) On HEAD, we then segfault, because the cross check for XLOG_CHECKPOINT_REDO
   causes a null pointer dereference. Which I guess is good; we shouldn't have
   gotten there without a record.


3) On 15, with or without assertions, we decide that "redo is not
   required". Gulp.


4) On 15, with assertions enabled, we fail with an assertion in the startup
   process, in FinishWalRecovery()->XLogPrefetcherBeginRead()->XLogBeginRead()
   Assert(!XLogRecPtrIsInvalid(RecPtr))


5) On 15, with optimizations enabled, we don't just crash - it gets scarier.

   First, the startup process actually creates a bogus WAL segment:

#1  0x000055f53b2725ff in XLogFileInitInternal (path=0x7ffccb7b3360 "pg_wal/000000010000000000000000",
    added=0x7ffccb7b335f, logtli=1, logsegno=1099511627776)
    at /home/andres/src/postgresql-15/src/backend/access/transam/xlog.c:2936
#2  PreallocXlogFiles (endptr=endptr@entry=0, tli=tli@entry=1)
    at /home/andres/src/postgresql-15/src/backend/access/transam/xlog.c:3433
#3  0x000055f53b277e00 in PreallocXlogFiles (tli=1, endptr=0)
    at /home/andres/src/postgresql-15/src/backend/access/transam/xlog.c:3425
#4  StartupXLOG () at /home/andres/src/postgresql-15/src/backend/access/transam/xlog.c:5517
#5  0x000055f53b4a385e in StartupProcessMain () at /home/andres/src/postgresql-15/src/backend/postmaster/startup.c:282
#6  0x000055f53b49860b in AuxiliaryProcessMain (auxtype=auxtype@entry=StartupProcess)
    at /home/andres/src/postgresql-15/src/backend/postmaster/auxprocess.c:141
#7  0x000055f53b4a2eed in StartChildProcess (type=StartupProcess)
    at /home/andres/src/postgresql-15/src/backend/postmaster/postmaster.c:5432
#8  PostmasterMain (argc=argc@entry=39, argv=argv@entry=0x55f53d5095d0)
    at /home/andres/src/postgresql-15/src/backend/postmaster/postmaster.c:1473
#9  0x000055f53b1d1bff in main (argc=39, argv=0x55f53d5095d0)
    at /home/andres/src/postgresql-15/src/backend/main/main.c:202

    Note the endptr=0 and pg_wal/000000010000000000000000 path.

    With a normal log level, one wouldn't learn anything about this.


    Then the *checkpointer* segfaults, trying to write the end-of-recovery
    checkpoint:

#0  __memcpy_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:344
#1  0x000056102ebe84a2 in memcpy (__len=26, __src=0x56102fbe7b48, __dest=<optimized out>)
    at /usr/include/x86_64-linux-gnu/bits/string_fortified.h:29
#2  CopyXLogRecordToWAL (tli=1, EndPos=160, StartPos=40, rdata=0x56102f31ffb0 <hdr_rdt>, isLogSwitch=false, write_len=114)
    at /home/andres/src/postgresql-15/src/backend/access/transam/xlog.c:1229
#3  XLogInsertRecord (rdata=<optimized out>, fpw_lsn=<optimized out>, flags=<optimized out>, num_fpi=0, topxid_included=<optimized out>)
    at /home/andres/src/postgresql-15/src/backend/access/transam/xlog.c:861
#4  0x000056102ebf12cb in XLogInsert (rmid=rmid@entry=0 '\000', info=info@entry=0 '\000')
    at /home/andres/src/postgresql-15/src/backend/access/transam/xloginsert.c:492
#5  0x000056102ebea92e in CreateCheckPoint (flags=102)
    at /home/andres/src/postgresql-15/src/backend/access/transam/xlog.c:6583
#6  0x000056102ee0e552 in CheckpointerMain ()
    at /home/andres/src/postgresql-15/src/backend/postmaster/checkpointer.c:455
#7  0x000056102ee0c5f9 in AuxiliaryProcessMain (auxtype=auxtype@entry=CheckpointerProcess)
    at /home/andres/src/postgresql-15/src/backend/postmaster/auxprocess.c:153
#8  0x000056102ee12c38 in StartChildProcess (type=CheckpointerProcess)
    at /home/andres/src/postgresql-15/src/backend/postmaster/postmaster.c:5432
#9  0x000056102ee16d54 in PostmasterMain (argc=argc@entry=35, argv=argv@entry=0x56102fb6e5d0)
    at /home/andres/src/postgresql-15/src/backend/postmaster/postmaster.c:1466
#10 0x000056102eb45bff in main (argc=35, argv=0x56102fb6e5d0)
    at /home/andres/src/postgresql-15/src/backend/main/main.c:202

    The immediate cause of the crash is that GetXLogBuffer() is called with
    ptr=40, which makes it think it can use the cached-page fast path, because
    cachedPage is still zero.

    Which in turn is because the startup process happily initialized
    XLogCtl->Insert.{CurrBytePos,PrevBytePos} with 0s, even though it
    initialized RedoRecPtr with the valid redo pointer.

    The checkpointer then actually ends up resetting the valid RedoRecPtr with
    bogus content as part of CreateCheckPoint(), due to the bogus CurrBytePos.
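
    To see why those numbers come out the way they do, here is a small
    standalone reimplementation of the usable-byte-position-to-LSN arithmetic
    (my own simplified sketch of the idea behind XLogBytePosToRecPtr(), not
    the actual code; it assumes 8kB pages, 16MB segments, and the 40/24 byte
    long/short page headers).  Byte position 0 maps to LSN 40, i.e. just past
    the long page header of segment 000000010000000000000000, which is
    consistent with the StartPos=40/EndPos=160 in the backtrace above:

#include <stdio.h>
#include <stdint.h>

/* Assumed defaults: 8kB WAL pages, 16MB segments. */
#define XLOG_BLCKSZ            8192
#define WAL_SEGMENT_SIZE       (16 * 1024 * 1024)
#define SIZEOF_LONG_PAGE_HDR   40   /* first page of a segment */
#define SIZEOF_SHORT_PAGE_HDR  24   /* all other pages */

#define USABLE_PER_PAGE    (XLOG_BLCKSZ - SIZEOF_SHORT_PAGE_HDR)
#define USABLE_PER_SEGMENT \
    ((WAL_SEGMENT_SIZE / XLOG_BLCKSZ) * USABLE_PER_PAGE - \
     (SIZEOF_LONG_PAGE_HDR - SIZEOF_SHORT_PAGE_HDR))

/*
 * Convert a "usable byte position" (what CurrBytePos counts: WAL bytes
 * excluding page headers) back to an LSN.
 */
static uint64_t
bytepos_to_lsn(uint64_t bytepos)
{
    uint64_t fullsegs = bytepos / USABLE_PER_SEGMENT;
    uint64_t bytesleft = bytepos % USABLE_PER_SEGMENT;
    uint64_t seg_offset;

    if (bytesleft < XLOG_BLCKSZ - SIZEOF_LONG_PAGE_HDR)
        seg_offset = bytesleft + SIZEOF_LONG_PAGE_HDR;  /* on first page */
    else
    {
        bytesleft -= XLOG_BLCKSZ - SIZEOF_LONG_PAGE_HDR;
        seg_offset = XLOG_BLCKSZ
            + (bytesleft / USABLE_PER_PAGE) * XLOG_BLCKSZ
            + bytesleft % USABLE_PER_PAGE + SIZEOF_SHORT_PAGE_HDR;
    }

    return fullsegs * WAL_SEGMENT_SIZE + seg_offset;
}

int
main(void)
{
    /* CurrBytePos wrongly left at 0: the next record starts at LSN 40 ... */
    printf("bytepos 0   -> LSN %llu\n", (unsigned long long) bytepos_to_lsn(0));
    /* ... and a 114-byte record, maxaligned to 120 bytes, ends at LSN 160. */
    printf("bytepos 120 -> LSN %llu\n", (unsigned long long) bytepos_to_lsn(120));
    return 0;
}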


6) On 15, with optimizations enabled, we don't just crash; once the bogus
   segment from 5) exists, we can't even get back to the prior crashes
   anymore, because *now* we PANIC:

2023-10-23 16:15:07.208 PDT [2554457][startup][:0][] DEBUG:  could not open file "pg_wal/000000010000000000000095": No such file or directory
2023-10-23 16:15:07.208 PDT [2554457][startup][:0][] LOG:  redo is not required
2023-10-23 16:15:07.208 PDT [2554457][startup][:0][] PANIC:  invalid magic number 0000 in log segment 000000010000000000000000, offset 0
2023-10-23 16:15:07.211 PDT [2554453][postmaster][:0][] LOG:  startup process (PID 2554457) was terminated by signal 6: Aborted

(rr) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0)
    at ./nptl/pthread_kill.c:44
#1  0x00007f7f6abc915f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007f7f6ab7b472 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007f7f6ab654b2 in __GI_abort () at ./stdlib/abort.c:79
#4  0x00005566b5983339 in errfinish (filename=<optimized out>, lineno=<optimized out>, funcname=<optimized out>)
    at /home/andres/src/postgresql-15/src/backend/utils/error/elog.c:675
#5  0x00005566b555630c in ReadRecord (xlogprefetcher=0x5566b60fd7c8, emode=emode@entry=23,
    fetching_ckpt=fetching_ckpt@entry=false, replayTLI=replayTLI@entry=1)
    at /home/andres/src/postgresql-15/src/backend/access/transam/xlogrecovery.c:3082
#6  0x00005566b5557ea0 in FinishWalRecovery ()
    at /home/andres/src/postgresql-15/src/backend/access/transam/xlogrecovery.c:1454
#7  0x00005566b554ac1c in StartupXLOG () at /home/andres/src/postgresql-15/src/backend/access/transam/xlog.c:5309
#8  0x00005566b577685e in StartupProcessMain () at /home/andres/src/postgresql-15/src/backend/postmaster/startup.c:282
#9  0x00005566b576b60b in AuxiliaryProcessMain (auxtype=auxtype@entry=StartupProcess)
    at /home/andres/src/postgresql-15/src/backend/postmaster/auxprocess.c:141
#10 0x00005566b5775eed in StartChildProcess (type=StartupProcess)
    at /home/andres/src/postgresql-15/src/backend/postmaster/postmaster.c:5432
#11 PostmasterMain (argc=argc@entry=39, argv=argv@entry=0x5566b60fb5d0)
    at /home/andres/src/postgresql-15/src/backend/postmaster/postmaster.c:1473
#12 0x00005566b54a4bff in main (argc=39, argv=0x5566b60fb5d0)
    at /home/andres/src/postgresql-15/src/backend/main/main.c:202

  Of course 000000010000000000000000 contains bogus data; it's not a valid
  WAL file. Here we *do* actually PANIC, presumably because the file actually
  exists.
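
  That "invalid magic number 0000" is exactly what a zero-filled, merely
  preallocated segment looks like: the first two bytes of the page header,
  xlp_magic, are still zero.  A trivial standalone checker makes that visible
  (it assumes only that the page header starts with a native-endian uint16
  magic field):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/*
 * Print the xlp_magic of the first WAL page of a segment file.  A zero-filled,
 * merely preallocated segment will report 0000.
 */
int
main(int argc, char **argv)
{
    uint8_t  buf[2];
    uint16_t magic;
    FILE    *f;

    if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL)
    {
        fprintf(stderr, "usage: %s path/to/segment\n", argv[0]);
        return 1;
    }
    if (fread(buf, 1, sizeof(buf), f) != sizeof(buf))
    {
        fprintf(stderr, "could not read page header\n");
        fclose(f);
        return 1;
    }
    memcpy(&magic, buf, sizeof(magic));
    printf("xlp_magic at offset 0: %04X\n", magic);
    fclose(f);
    return 0;
}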


Of course most of this is downstream from the issue of not PANICing in 1). But
it feels like it showcases a serious lack of error checking in StartupXLOG(),
CreateCheckPoint() etc.


Greetings,

Andres Freund



Re: Various bugs if segment containing redo pointer does not exist

From: Andres Freund
Hi,

On 2023-10-23 16:21:45 -0700, Andres Freund wrote:
> 1) For some reason I haven't yet debugged, the ReadRecord(PANIC) in
>    PerformWalRecovery() doesn't PANIC, but instead just returns NULL
>
>    We *do* output a DEBUG message, but well, that's insufficient.

The DEBUG message comes from this backtrace:

#0  XLogFileReadAnyTLI (segno=6476, emode=13, source=XLOG_FROM_PG_WAL)
    at /home/andres/src/postgresql/src/backend/access/transam/xlogrecovery.c:4291
#1  0x000055d7b3949db0 in WaitForWALToBecomeAvailable (RecPtr=108649259008, randAccess=true, fetching_ckpt=false,
    tliRecPtr=108665421680, replayTLI=1, replayLSN=108665421680, nonblocking=false)
    at /home/andres/src/postgresql/src/backend/access/transam/xlogrecovery.c:3697
#2  0x000055d7b39494ff in XLogPageRead (xlogreader=0x55d7b472c470, targetPagePtr=108649250816, reqLen=8192,
    targetRecPtr=108665421680, readBuf=0x55d7b47ba5d8 "\024\321\005")
    at /home/andres/src/postgresql/src/backend/access/transam/xlogrecovery.c:3278
#3  0x000055d7b3941bb1 in ReadPageInternal (state=0x55d7b472c470, pageptr=108665413632, reqLen=8072)
    at /home/andres/src/postgresql/src/backend/access/transam/xlogreader.c:1014
#4  0x000055d7b3940f43 in XLogDecodeNextRecord (state=0x55d7b472c470, nonblocking=false)
    at /home/andres/src/postgresql/src/backend/access/transam/xlogreader.c:571
#5  0x000055d7b3941a41 in XLogReadAhead (state=0x55d7b472c470, nonblocking=false)
    at /home/andres/src/postgresql/src/backend/access/transam/xlogreader.c:947
#6  0x000055d7b393f5fa in XLogPrefetcherNextBlock (pgsr_private=94384934340072, lsn=0x55d7b47cfeb8)
    at /home/andres/src/postgresql/src/backend/access/transam/xlogprefetcher.c:496
#7  0x000055d7b393efcd in lrq_prefetch (lrq=0x55d7b47cfe88)
    at /home/andres/src/postgresql/src/backend/access/transam/xlogprefetcher.c:256
#8  0x000055d7b393f190 in lrq_complete_lsn (lrq=0x55d7b47cfe88, lsn=0)
    at /home/andres/src/postgresql/src/backend/access/transam/xlogprefetcher.c:294
#9  0x000055d7b39401ba in XLogPrefetcherReadRecord (prefetcher=0x55d7b47bc5e8, errmsg=0x7ffc23505920)
    at /home/andres/src/postgresql/src/backend/access/transam/xlogprefetcher.c:1041
#10 0x000055d7b3948ff8 in ReadRecord (xlogprefetcher=0x55d7b47bc5e8, emode=23, fetching_ckpt=false, replayTLI=1)
    at /home/andres/src/postgresql/src/backend/access/transam/xlogrecovery.c:3078
#11 0x000055d7b3946749 in PerformWalRecovery ()
    at /home/andres/src/postgresql/src/backend/access/transam/xlogrecovery.c:1640


The source of the emode=13 (DEBUG2) is that it's hardcoded in
WaitForWALToBecomeAvailable(). I guess the error ought to come from
XLogPageRead(), but all that happens there is this:

            case XLREAD_FAIL:
                if (readFile >= 0)
                    close(readFile);
                readFile = -1;
                readLen = 0;
                readSource = XLOG_FROM_ANY;
                return XLREAD_FAIL;

which *does* error out for some other failures:
            errno = save_errno;
            ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
                    (errcode_for_file_access(),
                     errmsg("could not read from WAL segment %s, LSN %X/%X, offset %u: %m",
                            fname, LSN_FORMAT_ARGS(targetPagePtr),
                            readOff)));

but not for a file that couldn't be opened. That wouldn't have to be due to
ENOENT; it could also be EACCES...
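
To illustrate the distinction (a toy example, not PostgreSQL code; the path in
main() is hypothetical): ENOENT plausibly just means "not available from this
source yet", but any other open() failure means something is wrong with a file
that is supposed to be there, and silently treating it like a missing file
hides that:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Classify an open() failure: "missing" vs "genuinely broken". */
static void
try_open_segment(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0)
    {
        if (errno == ENOENT)
            printf("%s: missing (maybe try another source?)\n", path);
        else
            printf("%s: cannot open: %s (this deserves a loud error)\n",
                   path, strerror(errno));
        return;
    }
    printf("%s: opened fine\n", path);
    close(fd);
}

int
main(void)
{
    try_open_segment("pg_wal/000000010000000000000095");
    return 0;
}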


xlogreader has undergone a fair number of changes in the last few releases. As
of now, I can't really tell who is responsible for reporting what kind of
error.

    /*
     * Data input callback
     *
     * This callback shall read at least reqLen valid bytes of the xlog page
     * starting at targetPagePtr, and store them in readBuf.  The callback
     * shall return the number of bytes read (never more than XLOG_BLCKSZ), or
     * -1 on failure.  The callback shall sleep, if necessary, to wait for the
     * requested bytes to become available.  The callback will not be invoked
     * again for the same page unless more than the returned number of bytes
     * are needed.
     *
     * targetRecPtr is the position of the WAL record we're reading.  Usually
     * it is equal to targetPagePtr + reqLen, but sometimes xlogreader needs
     * to read and verify the page or segment header, before it reads the
     * actual WAL record it's interested in.  In that case, targetRecPtr can
     * be used to determine which timeline to read the page from.
     *
     * The callback shall set ->seg.ws_tli to the TLI of the file the page was
     * read from.
     */
    XLogPageReadCB page_read;

    /*
     * Callback to open the specified WAL segment for reading.  ->seg.ws_file
     * shall be set to the file descriptor of the opened segment.  In case of
     * failure, an error shall be raised by the callback and it shall not
     * return.
     *
     * "nextSegNo" is the number of the segment to be opened.
     *
     * "tli_p" is an input/output argument. WALRead() uses it to pass the
     * timeline in which the new segment should be found, but the callback can
     * use it to return the TLI that it actually opened.
     */
    WALSegmentOpenCB segment_open;

My reading of this is that page_read isn't ever supposed to error out - yet we
do that all over the place (e.g. WALDumpReadPage(), XLogPageRead()) - and that
segment_open is supposed to error out - but we don't even use segment_open in
xlogrecovery.c.  Who is supposed to error out then?  And why is it OK for the
read callbacks to error out directly, if they're supposed to return -1 in case
of failure?
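
For my own understanding, here is the contract the page_read comment seems to
prescribe, restated as a toy model (this is not the real xlogreader API; all
the names below are made up): the callback merely returns -1 and passes an
error string up, and the caller alone picks the severity:

#include <stdio.h>

/*
 * Toy model: page_read-style callback returns -1 and fills in an error
 * string; the reader just propagates the failure; the caller decides how
 * loudly to complain (PANIC during crash recovery, DEBUG2 while waiting for
 * WAL to arrive, etc.).
 */
typedef int (*page_read_cb)(long pageptr, char *buf, const char **errmsg);

static int
read_record(page_read_cb page_read, long pageptr, const char **errmsg)
{
    char page[8192];

    if (page_read(pageptr, page, errmsg) < 0)
        return -1;              /* no error raised here, just propagated */
    /* ... in a real reader, the record would be decoded from "page" ... */
    return 0;
}

/* A callback whose segment is missing: report via errmsg, don't error out. */
static int
missing_segment_cb(long pageptr, char *buf, const char **errmsg)
{
    (void) pageptr;
    (void) buf;
    *errmsg = "could not open segment containing requested page";
    return -1;
}

int
main(void)
{
    const char *errmsg = NULL;

    if (read_record(missing_segment_cb, 0, &errmsg) < 0)
    {
        /* the *caller* chooses the severity; here we just print it */
        fprintf(stderr, "reader failed: %s\n", errmsg);
    }
    return 0;
}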

Somewhat of an aside: It also seems "The callback shall sleep, if necessary,
to wait for the requested bytes to become available." is outdated, given that
we *explicitly* don't do so in some cases and support that via
XLREAD_WOULDBLOCK?


I dug through recent changes, expecting to find the problem. But uh, no. I
reproduced this in 9.4, and I think the behaviour might have been introduced
in 9.3 (didn't have a build of that around, hence didn't test that), as part
of:

commit abf5c5c9a4f
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   2013-02-22 11:43:04 +0200

    If recovery.conf is created after "pg_ctl stop -m i", do crash recovery.

Before that commit, XLogPageRead() called XLogFileReadAnyTLI() with the
"incoming" emode, which during crash recovery would have meant a PANIC in this
case. After that commit, the call to XLogFileReadAnyTLI() was moved to
WaitForWALToBecomeAvailable(), but the emode was changed to be unconditionally
DEBUG2:

                /*
                 * Try to restore the file from archive, or read an existing
                 * file from pg_xlog.
                 */
                readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
                        currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
                                         currentSource);

In 9.4 the server actually came up OK after encountering the problem
(destroying data in the process!). I'm not sure where we started to bogusly
initialize shared memory...

Greetings,

Andres Freund



Re: Various bugs if segment containing redo pointer does not exist

From: Robert Haas
On Mon, Oct 23, 2023 at 8:43 PM Andres Freund <andres@anarazel.de> wrote:
> The source of the emode=13=DEBUG2 is that that's hardcoded in
> WaitForWALToBecomeAvailable(). I guess the error ought to come from
> XLogPageRead(), but all that happens is this:
>
>                         case XLREAD_FAIL:
>                                 if (readFile >= 0)
>                                         close(readFile);
>                                 readFile = -1;
>                                 readLen = 0;
>                                 readSource = XLOG_FROM_ANY;
>                                 return XLREAD_FAIL;

I've been under the impression that the guiding principle here is that
we shouldn't error out upon hitting a condition that should only cause
us to switch sources. I think WaitForWALToBecomeAvailable() is
supposed to set things up so that XLogPageRead()'s call to pg_pread()
will succeed. If it says it can't, then XLogPageRead() is only obliged
to pass that information up to the caller, who can decide to wait
longer for the data to show up, or give up, or whatever it wants to
do. On the other hand, if WaitForWALToBecomeAvailable() says that it's
fine to go ahead and call pg_pread() and pg_pread() then fails, then
that means that we've got a problem with the WAL file other than it
just not being available yet, like it's the wrong length or there was
an I/O error, and those are reportable errors. Said differently, in
the former case, the WAL is not broken, merely not currently
available; in the latter case, it's broken.
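
Restating that principle as a toy sketch (not actual PostgreSQL code; the
function names below are merely stand-ins for WaitForWALToBecomeAvailable()
and pg_pread()):

#include <stdio.h>
#include <stdbool.h>

/*
 * "Not available yet" flows back to the caller as a soft failure; "available
 * but unreadable/short" is treated as broken WAL and reported loudly.
 */
enum read_result { READ_OK, READ_NOT_AVAILABLE, READ_BROKEN };

/* Stand-in for a WaitForWALToBecomeAvailable-like check. */
static bool
wal_available(long pageptr)
{
    (void) pageptr;
    return true;
}

/* Stand-in for a pg_pread-like call; simulates a short read. */
static int
read_page(long pageptr, char *buf, int len)
{
    (void) pageptr;
    (void) buf;
    (void) len;
    return 100;
}

static enum read_result
xlog_page_read(long pageptr, char *buf, int len)
{
    if (!wal_available(pageptr))
        return READ_NOT_AVAILABLE;   /* caller may wait, switch source, give up */

    if (read_page(pageptr, buf, len) != len)
    {
        /* WAL is supposed to be here but isn't readable: report it loudly */
        fprintf(stderr, "ERROR: could not read WAL page at %ld\n", pageptr);
        return READ_BROKEN;
    }
    return READ_OK;
}

int
main(void)
{
    char buf[8192];

    switch (xlog_page_read(0, buf, (int) sizeof(buf)))
    {
        case READ_OK:            puts("ok"); break;
        case READ_NOT_AVAILABLE: puts("not available yet"); break;
        case READ_BROKEN:        puts("broken WAL"); break;
    }
    return 0;
}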

The legibility and maintainability of this code are certainly not
great. The xlogreader mechanism is extremely useful, but maybe we
should have done more cleanup of the underlying mechanism first. It's
a giant ball of spaghetti code that is challenging to understand and
almost impossible to validate thoroughly (as you just discovered).

--
Robert Haas
EDB: http://www.enterprisedb.com