Re: BUG #18259: Assertion in ExtendBufferedRelLocal() fails after no-space-left condition

From: tender wang
Subject: Re: BUG #18259: Assertion in ExtendBufferedRelLocal() fails after no-space-left condition
Msg-id: CAHewXNnayN3NM1HfaOCejk=sGfSva6ZDArWxKiTxL7PdDHRtMw@mail.gmail.com
In response to: Re: BUG #18259: Assertion in ExtendBufferedRelLocal() fails after no-space-left condition  (tender wang <tndrwang@gmail.com>)
Responses: Re: BUG #18259: Assertion in ExtendBufferedRelLocal() fails after no-space-left condition  (Alexander Lakhin <exclusion@gmail.com>)
List: pgsql-bugs
I tried to analyze the issue, and I found that it might be caused by this commit:
commit dad50f677c42de207168a3f08982ba23c9fc6720
       bufmgr: Acquire and clean victim buffer separately

Before commit dad50f677, LocalBufferAlloc() performed the following steps:
/*
 * it's all ours now.
*/
bufHdr->tag = newTag;
buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR);
buf_state |= BM_TAG_VALID;
buf_state &= ~BUF_USAGECOUNT_MASK;
buf_state += BUF_USAGECOUNT_ONE;

After dad50f677, GetLocalVictimBuffer() no longer performs these operations, which leads to the failure I reported.
In my reported case:
f 3
(gdb) p /x buf_state
$1 = 0x1000000

In GetLocalVictimBuffer(), the state never gets a chance to be reset with: buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);

I tried to fix this in the attached patch, following the logic of LocalBufferAlloc(), but I don't fully understand all the details of bufmgr.
Any thoughts?

tender wang <tndrwang@gmail.com> wrote on Tue, Dec 26, 2023 at 18:51:
Thanks for the report. I can reproduce your reported bug on master. I also hit another assertion failure when running the SQL below:

psql (17devel)
Type "help" for help.

postgres=# CREATE UNLOGGED TABLE filler(a int, b text STORAGE plain);
CREATE TABLE
postgres=# INSERT INTO filler SELECT g, repeat('x', 1000) FROM generate_series(1,
postgres(# 50000) g;
INSERT 0 50000
postgres=# CREATE TEMP TABLE tbl(a int);
CREATE TABLE
postgres=# INSERT INTO tbl SELECT g FROM generate_series(1, 200000) g;
ERROR:  could not extend file "base/5/t3_16389": No space left on device
HINT:  Check free disk space.
postgres=# DROP TABLE filler;
DROP TABLE
postgres=# INSERT INTO tbl SELECT g FROM generate_series(1, 200000) g;
INSERT 0 200000
postgres=# INSERT INTO tbl SELECT g FROM generate_series(1, 200000) g;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded. 
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f9d3d8b1859 in __GI_abort () at abort.c:79
#2  0x000055f83501c868 in ExceptionalCondition (conditionName=0x55f8351fcb78 "!(buf_state & (BM_VALID | BM_TAG_VALID | BM_DIRTY | BM_JUST_DIRTIED))", fileName=0x55f8351fca4b "localbuf.c",
    lineNumber=402) at assert.c:66
#3  0x000055f834df05ab in ExtendBufferedRelLocal (bmr=..., fork=MAIN_FORKNUM, flags=8, extend_by=1, extend_upto=4294967295, buffers=0x7ffff3ed1530, extended_by=0x7ffff3ed13fc)
    at localbuf.c:402
#4  0x000055f834de7a0a in ExtendBufferedRelCommon (bmr=..., fork=MAIN_FORKNUM, strategy=0x0, flags=8, extend_by=1, extend_upto=4294967295, buffers=0x7ffff3ed1530, extended_by=0x7ffff3ed14dc)
    at bufmgr.c:1828
#5  0x000055f834de6393 in ExtendBufferedRelBy (bmr=..., fork=MAIN_FORKNUM, strategy=0x0, flags=8, extend_by=1, buffers=0x7ffff3ed1530, extended_by=0x7ffff3ed14dc) at bufmgr.c:889
#6  0x000055f83492a240 in RelationAddBlocks (relation=0x7f9d325a7648, bistate=0x0, num_pages=1, use_fsm=true, did_unlock=0x7ffff3ed168d) at hio.c:342
#7  0x000055f83492ab67 in RelationGetBufferForTuple (relation=0x7f9d325a7648, len=32, otherBuffer=0, options=0, bistate=0x0, vmbuffer=0x7ffff3ed1714, vmbuffer_other=0x0, num_pages=1)
    at hio.c:768
#8  0x000055f834910840 in heap_insert (relation=0x7f9d325a7648, tup=0x55f83786e898, cid=0, options=0, bistate=0x0) at heapam.c:1853
#9  0x000055f834920cc0 in heapam_tuple_insert (relation=0x7f9d325a7648, slot=0x55f83786e808, cid=0, options=0, bistate=0x0) at heapam_handler.c:252
#10 0x000055f834bd582a in table_tuple_insert (rel=0x7f9d325a7648, slot=0x55f83786e808, cid=0, options=0, bistate=0x0) at ../../../src/include/access/tableam.h:1400
#11 0x000055f834bd7859 in ExecInsert (context=0x7ffff3ed1970, resultRelInfo=0x55f836fe5ed0, slot=0x55f83786e808, canSetTag=true, inserted_tuple=0x0, insert_destrel=0x0)
    at nodeModifyTable.c:1133
#12 0x000055f834bdbbae in ExecModifyTable (pstate=0x55f836fe5cc0) at nodeModifyTable.c:3806
#13 0x000055f834b9a6cb in ExecProcNodeFirst (node=0x55f836fe5cc0) at execProcnode.c:464
#14 0x000055f834b8db69 in ExecProcNode (node=0x55f836fe5cc0) at ../../../src/include/executor/executor.h:273
#15 0x000055f834b9096f in ExecutePlan (estate=0x55f836fe5a30, planstate=0x55f836fe5cc0, use_parallel_mode=false, operation=CMD_INSERT, sendTuples=false, numberTuples=0,
    direction=ForwardScanDirection, dest=0x55f836ff4378, execute_once=true) at execMain.c:1670
#16 0x000055f834b8e20f in standard_ExecutorRun (queryDesc=0x55f836f35a20, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:365
#17 0x000055f834b8e033 in ExecutorRun (queryDesc=0x55f836f35a20, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:309
#18 0x000055f834e3f27a in ProcessQuery (plan=0x55f836ff4218, sourceText=0x55f836f0b4b0 "INSERT INTO tbl SELECT g FROM generate_series(1, 200000) g;", params=0x0, queryEnv=0x0,
    dest=0x55f836ff4378, qc=0x7ffff3ed1dd0) at pquery.c:160
#19 0x000055f834e40d99 in PortalRunMulti (portal=0x55f836f86a00, isTopLevel=true, setHoldSnapshot=false, dest=0x55f836ff4378, altdest=0x55f836ff4378, qc=0x7ffff3ed1dd0) at pquery.c:1277
#20 0x000055f834e402bf in PortalRun (portal=0x55f836f86a00, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x55f836ff4378, altdest=0x55f836ff4378, qc=0x7ffff3ed1dd0)
    at pquery.c:791
#21 0x000055f834e39478 in exec_simple_query (query_string=0x55f836f0b4b0 "INSERT INTO tbl SELECT g FROM generate_series(1, 200000) g;") at postgres.c:1273
#22 0x000055f834e3e105 in PostgresMain (dbname=0x55f836f42870 "postgres", username=0x55f836f42858 "gpadmin") at postgres.c:4653
#23 0x000055f834d63393 in BackendRun (port=0x55f836f39fd0) at postmaster.c:4422
#24 0x000055f834d62a4c in BackendStartup (port=0x55f836f39fd0) at postmaster.c:4101
#25 0x000055f834d5f358 in ServerLoop () at postmaster.c:1769
#26 0x000055f834d5ec7e in PostmasterMain (argc=3, argv=0x55f836f05b80) at postmaster.c:1468
#27 0x000055f834c1525d in main (argc=3, argv=0x55f836f05b80) at main.c:198

PG Bug reporting form <noreply@postgresql.org> wrote on Tue, Dec 26, 2023 at 17:32:
The following bug has been logged on the website:

Bug reference:      18259
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 16.1
Operating system:   Ubuntu 22.04
Description:       

The following script:
mkdir /tmp/100m
sudo mount -t tmpfs -o size=100M tmpfs /tmp/100m
export PGDATA=/tmp/100m/tmpdb

initdb
pg_ctl -l server.log start

cat << 'EOF' | psql
CREATE UNLOGGED TABLE filler(a int, b text STORAGE plain);
INSERT INTO filler SELECT g, repeat('x', 1000) FROM generate_series(1,
50000) g;
CREATE TEMP TABLE tbl(a int);
INSERT INTO tbl SELECT g FROM generate_series(1, 200000) g;
INSERT INTO tbl SELECT g FROM generate_series(1, 200000) g;
DROP TABLE filler;
INSERT INTO tbl SELECT g from generate_series(1, 200000) g;
EOF

triggers an assertion failure following "no space left" errors:
...
CREATE TABLE
ERROR:  could not extend file "base/5/t3_16391": No space left on device
HINT:  Check free disk space.
ERROR:  could not extend file "base/5/t3_16391": No space left on device
HINT:  Check free disk space.
DROP TABLE
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost
TRAP: failed Assert("buf_state & BM_TAG_VALID"), File: "localbuf.c", Line: 390, PID: 25978

The call stack of the failure is:
ExtendBufferedRelLocal at localbuf.c:391:4
ExtendBufferedRelCommon at bufmgr.c:1801:17
ExtendBufferedRelBy at bufmgr.c:862:9
RelationAddBlocks at hio.c:342:16
RelationGetBufferForTuple at hio.c:768:11
heap_insert at heapam.c:1862:11
heapam_tuple_insert at heapam_handler.c:253:2
table_tuple_insert at tableam.h:1402:1
ExecInsert at nodeModifyTable.c:1138:21
ExecModifyTable at nodeModifyTable.c:3810:12
ExecProcNodeFirst at execProcnode.c:465:1
ExecProcNode at executor.h:274:1
ExecutePlan at execMain.c:1670:10
standard_ExecutorRun at execMain.c:365:3
ExecutorRun at execMain.c:310:1
ProcessQuery at pquery.c:165:5
PortalRunMulti at pquery.c:1277:5
PortalRun at pquery.c:795:5
exec_simple_query at postgres.c:1274:10
PostgresMain at postgres.c:4641:27
ExitPostmaster at postmaster.c:5047:1
BackendStartup at postmaster.c:4196:5
ServerLoop at postmaster.c:1788:6
PostmasterMain at postmaster.c:1466:11

The first bad commit for this anomaly is 31966b15 (and exactly that commit
added the Assert).

With debug logging added in this code within ExtendBufferedRelLocal():
        if (found)
        {
            BufferDesc *existing_hdr = GetLocalBufferDescriptor(hresult->id);
            uint32      buf_state;

            UnpinLocalBuffer(BufferDescriptorGetBuffer(victim_buf_hdr));

            existing_hdr = GetLocalBufferDescriptor(hresult->id);
            PinLocalBuffer(existing_hdr, false);
            buffers[i] = BufferDescriptorGetBuffer(existing_hdr);

            buf_state = pg_atomic_read_u32(&existing_hdr->state);
            Assert(buf_state & BM_TAG_VALID);
            Assert(!(buf_state & BM_DIRTY));
            buf_state &= BM_VALID;
            pg_atomic_unlocked_write_u32(&existing_hdr->state, buf_state);
...
I see that this branch is reached for the second INSERT (which fails with the
NOSPC error) with existing_hdr->state == 0x2040000, but for the third INSERT
I observe state == 0x0.
