Обсуждение: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

Поиск
Список
Период
Сортировка

v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

От
Justin Pryzby
Дата:
I hit this assertion while pg_restoring data into a v16 instance.
postgresql16-server-16-alpha_20230417_PGDG.rhel7.x86_64

wal_level=minimal and pg_dump --single-transaction both seem to be
required to hit the issue.

$ /usr/pgsql-16/bin/postgres -D ./pg16test -c maintenance_work_mem=1GB -c max_wal_size=16GB -c wal_level=minimal -c
max_wal_senders=0-c port=5678 -c logging_collector=no &
 

$ time sudo -u postgres /usr/pgsql-16/bin/pg_restore -d postgres -p 5678 --single-transaction --no-tablespace
./curtables

TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

Core was generated by `postgres: postgres postgres [local] COMMIT                                    '.
Program terminated with signal 6, Aborted.
#0  0x00007f28b8bd5387 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install postgresql16-server-16-alpha_20230417_PGDG.rhel7.x86_64
(gdb) bt
#0  0x00007f28b8bd5387 in raise () from /lib64/libc.so.6
#1  0x00007f28b8bd6a78 in abort () from /lib64/libc.so.6
#2  0x00000000009bc8c9 in ExceptionalCondition (conditionName=conditionName@entry=0xa373e1 "size > SizeOfXLogRecord",
fileName=fileName@entry=0xa31b13"xlog.c", lineNumber=lineNumber@entry=1055) at assert.c:66
 
#3  0x000000000057b049 in ReserveXLogInsertLocation (PrevPtr=0x2e3d750, EndPos=<synthetic pointer>, StartPos=<synthetic
pointer>,size=24) at xlog.c:1055
 
#4  XLogInsertRecord (rdata=rdata@entry=0xf187a0 <hdr_rdt>, fpw_lsn=fpw_lsn@entry=0, flags=<optimized out>,
num_fpi=num_fpi@entry=0,topxid_included=topxid_included@entry=false) at xlog.c:844
 
#5  0x000000000058210c in XLogInsert (rmid=rmid@entry=0 '\000', info=info@entry=176 '\260') at xloginsert.c:510
#6  0x0000000000582b09 in log_newpage_range (rel=rel@entry=0x2e1f628, forknum=forknum@entry=FSM_FORKNUM,
startblk=startblk@entry=0,endblk=endblk@entry=3, page_std=page_std@entry=false) at xloginsert.c:1317
 
#7  0x00000000005d02f9 in smgrDoPendingSyncs (isCommit=isCommit@entry=true,
isParallelWorker=isParallelWorker@entry=false)at storage.c:837
 
#8  0x0000000000571637 in CommitTransaction () at xact.c:2225
#9  0x0000000000572b25 in CommitTransactionCommand () at xact.c:3201
#10 0x000000000086afc7 in finish_xact_command () at postgres.c:2782
#11 0x000000000086d7e1 in exec_simple_query (query_string=0x2dec4f8 "COMMIT") at postgres.c:1307



Justin Pryzby <pryzby@telsasoft.com> writes:
> I hit this assertion while pg_restoring data into a v16 instance.
> postgresql16-server-16-alpha_20230417_PGDG.rhel7.x86_64

> wal_level=minimal and pg_dump --single-transaction both seem to be
> required to hit the issue.

Hmm.  I wonder if log_newpages() is confused here:

    XLogEnsureRecordSpace(XLR_MAX_BLOCK_ID - 1, 0);

Why is XLR_MAX_BLOCK_ID - 1 enough, rather than XLR_MAX_BLOCK_ID?

            regards, tom lane



Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

От
Matthias van de Meent
Дата:
On Mon, 17 Apr 2023 at 17:53, Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> I hit this assertion while pg_restoring data into a v16 instance.
> postgresql16-server-16-alpha_20230417_PGDG.rhel7.x86_64
>
> wal_level=minimal and pg_dump --single-transaction both seem to be
> required to hit the issue.
>
> $ /usr/pgsql-16/bin/postgres -D ./pg16test -c maintenance_work_mem=1GB -c max_wal_size=16GB -c wal_level=minimal -c
max_wal_senders=0-c port=5678 -c logging_collector=no &
 
>
> $ time sudo -u postgres /usr/pgsql-16/bin/pg_restore -d postgres -p 5678 --single-transaction --no-tablespace
./curtables
>
> TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564
>
> Core was generated by `postgres: postgres postgres [local] COMMIT                                    '.
> Program terminated with signal 6, Aborted.
> #0  0x00007f28b8bd5387 in raise () from /lib64/libc.so.6
> Missing separate debuginfos, use: debuginfo-install postgresql16-server-16-alpha_20230417_PGDG.rhel7.x86_64
> (gdb) bt
> #0  0x00007f28b8bd5387 in raise () from /lib64/libc.so.6
> #1  0x00007f28b8bd6a78 in abort () from /lib64/libc.so.6
> #2  0x00000000009bc8c9 in ExceptionalCondition (conditionName=conditionName@entry=0xa373e1 "size > SizeOfXLogRecord",
fileName=fileName@entry=0xa31b13"xlog.c", lineNumber=lineNumber@entry=1055) at assert.c:66
 
> #3  0x000000000057b049 in ReserveXLogInsertLocation (PrevPtr=0x2e3d750, EndPos=<synthetic pointer>,
StartPos=<syntheticpointer>, size=24) at xlog.c:1055
 
> #4  XLogInsertRecord (rdata=rdata@entry=0xf187a0 <hdr_rdt>, fpw_lsn=fpw_lsn@entry=0, flags=<optimized out>,
num_fpi=num_fpi@entry=0,topxid_included=topxid_included@entry=false) at xlog.c:844
 
> #5  0x000000000058210c in XLogInsert (rmid=rmid@entry=0 '\000', info=info@entry=176 '\260') at xloginsert.c:510
> #6  0x0000000000582b09 in log_newpage_range (rel=rel@entry=0x2e1f628, forknum=forknum@entry=FSM_FORKNUM,
startblk=startblk@entry=0,endblk=endblk@entry=3, page_std=page_std@entry=false) at xloginsert.c:1317
 


Looking at log_newpage_range, it seems like we're always trying to log
a record if startblk < endblk; but don't register the PageIsNew()
buffers in the range. That means that if the last buffers in the range
are new, this can result in no buffers being registered in the last
iteration of the main loop (if the number of non-new buffers in the
range is 0 (mod 32)).

A change like attached should fix the issue; or alternatively we could
force log the last (new) buffer when we detect this edge case.

Kind regards,

Matthias van de Meent

Вложения
I wrote:
> Hmm.  I wonder if log_newpages() is confused here:
>     XLogEnsureRecordSpace(XLR_MAX_BLOCK_ID - 1, 0);

Oh, no, it's simpler than that: log_newpage_range is trying to
log zero page images, and ReserveXLogInsertLocation doesn't
like that because every WAL record should contain some data.
Will fix, thanks for report.

            regards, tom lane



Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> Looking at log_newpage_range, it seems like we're always trying to log
> a record if startblk < endblk; but don't register the PageIsNew()
> buffers in the range. That means that if the last buffers in the range
> are new, this can result in no buffers being registered in the last
> iteration of the main loop (if the number of non-new buffers in the
> range is 0 (mod 32)).

Yeah, I just came to the same conclusion.  One thing I don't understand
yet: log_newpage_range is old (it looks like this back to v12), and
that Assert is older, so why doesn't this reproduce further back?
Maybe the state where all the pages are new didn't happen before?
Is that telling us there's a bug somewhere else?  Seems like a job
for git bisect.

To be clear: log_newpage_range is certainly broken, and your fix looks
appropriate.  I'm just wondering what else we need to learn here.

            regards, tom lane



I wrote:
> Yeah, I just came to the same conclusion.  One thing I don't understand
> yet: log_newpage_range is old (it looks like this back to v12), and
> that Assert is older, so why doesn't this reproduce further back?
> Maybe the state where all the pages are new didn't happen before?

Bingo: bisecting shows the failure started at

commit 3d6a98457d8e21d85bed86cfd3e1d1df1b260721
Author: Andres Freund <andres@anarazel.de>
Date:   Wed Apr 5 08:19:39 2023 -0700

    Don't initialize page in {vm,fsm}_extend(), not needed

So previously, log_newpage_range could only have failed in very
unlikely circumstances, whereas now it's not hard to hit when
committing a table creation.  I wonder what other bugs may be
lurking.

I'll patch it back to v12 anyway, since that function is
clearly wrong in isolation.

            regards, tom lane



Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

От
Andres Freund
Дата:
Hi,

On 2023-04-17 12:13:41 -0400, Tom Lane wrote:
> Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> > Looking at log_newpage_range, it seems like we're always trying to log
> > a record if startblk < endblk; but don't register the PageIsNew()
> > buffers in the range. That means that if the last buffers in the range
> > are new, this can result in no buffers being registered in the last
> > iteration of the main loop (if the number of non-new buffers in the
> > range is 0 (mod 32)).
> 
> Yeah, I just came to the same conclusion.  One thing I don't understand
> yet: log_newpage_range is old (it looks like this back to v12), and
> that Assert is older, so why doesn't this reproduce further back?
> Maybe the state where all the pages are new didn't happen before?
> Is that telling us there's a bug somewhere else?  Seems like a job
> for git bisect.

One plausible explanation is that bulk relation extension has made it more
likely to encounter this scenario. We had some bulk extension code before, but
it was triggered purely based on contention - quite unlikely in simple test
scenarios - but now we also bulk extend if we know that we'll insert multiple
pages (when coming from heap_multi_insert(), with sufficient data).

Greetings,

Andres Freund



Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

От
Andres Freund
Дата:
Hi,

On 2023-04-17 13:50:30 -0400, Tom Lane wrote:
> I wrote:
> > Yeah, I just came to the same conclusion.  One thing I don't understand
> > yet: log_newpage_range is old (it looks like this back to v12), and
> > that Assert is older, so why doesn't this reproduce further back?
> > Maybe the state where all the pages are new didn't happen before?
> 
> Bingo: bisecting shows the failure started at
> 
> commit 3d6a98457d8e21d85bed86cfd3e1d1df1b260721
> Author: Andres Freund <andres@anarazel.de>
> Date:   Wed Apr 5 08:19:39 2023 -0700
> 
>     Don't initialize page in {vm,fsm}_extend(), not needed
> 
> So previously, log_newpage_range could only have failed in very
> unlikely circumstances, whereas now it's not hard to hit when
> committing a table creation.  I wonder what other bugs may be
> lurking.

Oh, interesting. We haven't initialized the extra pages added by
RelationAddExtraBlocks() (in <= 15) for quite a while now, so I'm a bit
surprised it causes more issues for the VM / FSM. I guess it's that it's quite
common in real workloads to contend on the extension lock and add extra
blocks, but not in simple single-threaded tests?

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> On 2023-04-17 13:50:30 -0400, Tom Lane wrote:
>> So previously, log_newpage_range could only have failed in very
>> unlikely circumstances, whereas now it's not hard to hit when
>> committing a table creation.  I wonder what other bugs may be
>> lurking.

> Oh, interesting. We haven't initialized the extra pages added by
> RelationAddExtraBlocks() (in <= 15) for quite a while now, so I'm a bit
> surprised it causes more issues for the VM / FSM. I guess it's that it's quite
> common in real workloads to contend on the extension lock and add extra
> blocks, but not in simple single-threaded tests?

I haven't tried hard to run it to ground, but maybe log_newpage_range
isn't used in that code path?  Seems like we'd have detected this
before now if the case were reachable without any crash involved.

            regards, tom lane



Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

От
Justin Pryzby
Дата:
On Mon, Apr 17, 2023 at 01:50:30PM -0400, Tom Lane wrote:
> I wrote:
> > Yeah, I just came to the same conclusion.  One thing I don't understand
> > yet: log_newpage_range is old (it looks like this back to v12), and
> > that Assert is older, so why doesn't this reproduce further back?
> > Maybe the state where all the pages are new didn't happen before?
> 
> Bingo: bisecting shows the failure started at

Just curious: what "test" did you use to bisect with ?



Justin Pryzby <pryzby@telsasoft.com> writes:
> On Mon, Apr 17, 2023 at 01:50:30PM -0400, Tom Lane wrote:
>> Bingo: bisecting shows the failure started at

> Just curious: what "test" did you use to bisect with ?

The test case I used looked like

start postmaster with -c wal_level=minimal -c max_wal_senders=0
make installcheck-parallel
psql -d regression -c "do 'begin for i in 1..1000 loop execute ''create table lots''||i||'' as select * from onek'';
endloop; end';" 
pg_dump -Fc -Z0 regression >~/regression.dump
createdb r2
pg_restore -d r2 --single-transaction --no-tablespace ~/regression.dump

Dumping the regression database as-is didn't reproduce it for me,
but after I added a bunch more tables it did reproduce.

(I added the -Z0 bit after some of the bisection test points hit the
interval where somebody had broken pg_dump's compression features.
It didn't seem relevant to the problem so I just disabled that.)

            regards, tom lane