Обсуждение: PANIC: wrong buffer passed to visibilitymap_clear
Hi,
I recently find this problem while testing PG14 with sysbench.
Then I look through the emails from pgsql-hackers and find a previous similary bug which is https://www.postgresql.org/message-id/flat/2247102.1618008027%40sss.pgh.pa.us. But the bugfix commit(34f581c39e97e2ea237255cf75cccebccc02d477) is already patched to PG14.
The following is the stack of coredump.
#0 0x00007f2ba9bfa277 in raise () from /lib64/libc.so.6
#1 0x00007f2ba9bfb968 in abort () from /lib64/libc.so.6
#2 0x00000000009416bb in errfinish () at elog.c:717
#3 0x000000000049518b in visibilitymap_clear (rel=<optimized out>, heapBlk=<optimized out>, buf=<optimized out>, flags=<optimized out>, polar_record=<optimized out>) at visibilitymap.c:142
#4 0x000000000054c2df in heap_update () at heapam.c:3948
#5 0x0000000000555538 in heapam_tuple_update (relation=0x7f2b909cfef8, otid=0x7fff930c577a, slot=0x2612528, cid=0, snapshot=<optimized out>, crosscheck=0x0, wait=true, tmfd=0x7fff930c5690, lockmode=0x7fff930c5684, update_indexes=0x7fff930c5681)
at heapam_handler.c:327
#6 0x00000000006e04f3 in table_tuple_update (update_indexes=0x7fff930c5681, lockmode=0x7fff930c5684, tmfd=0x7fff930c5690, wait=true, crosscheck=<optimized out>, snapshot=<optimized out>, cid=<optimized out>, slot=0x2612528, otid=0x26126d0,
rel=0x7f2b909cfef8) at ../../../src/include/access/tableam.h:1509
#7 ExecUpdate () at nodeModifyTable.c:1785
#8 0x00000000006e0f2a in ExecModifyTable () at nodeModifyTable.c:2592
#9 0x00000000006b909c in ExecProcNode (node=0x24d4eb0) at ../../../src/include/executor/executor.h:257
#10 ExecutePlan (execute_once=<optimized out>, dest=0xaa52e0 <donothingDR>, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_UPDATE, use_parallel_mode=<optimized out>, planstate=0x24d4eb0, estate=0x24d4c08)
at execMain.c:1553
#11 standard_ExecutorRun () at execMain.c:363
#12 0x00007f2babd893dd in pgss_ExecutorRun (queryDesc=0x25dcab8, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at pg_stat_statements.c:1018
#13 0x00007f2babd816fa in explain_ExecutorRun (queryDesc=0x25dcab8, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at auto_explain.c:334
#14 0x0000000000828ac8 in ProcessQuery (plan=0x25c9e88, sourceText=0x25dc9a8 "UPDATE sbtest8 SET k=k+1 WHERE id=$1", params=0x25dca28, queryEnv=0x0, dest=<optimized out>, qc=0x7fff930c5e40) at pquery.c:160
#15 0x00000000008294d8 in PortalRunMulti (portal=portal@entry=0x25697c8, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=0xaa52e0 <donothingDR>, dest@entry=0x24a6568, altdest=0xaa52e0 <donothingDR>,
altdest@entry=0x24a6568, qc=qc@entry=0x7fff930c5e40) at pquery.c:1277
#16 0x0000000000829929 in PortalRun () at pquery.c:797
#17 0x0000000000826f57 in exec_execute_message (max_rows=9223372036854775807, portal_name=0x24a6158 "") at postgres.c:2306
#18 PostgresMain () at postgres.c:4826
#19 0x00000000007a1e8a in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4910
#20 BackendStartup (port=<optimized out>) at postmaster.c:4621
#21 ServerLoop () at postmaster.c:1823
#22 0x00000000007a2c4b in PostmasterMain () at postmaster.c:1488
#23 0x000000000050c5d0 in main (argc=3, argv=0x24a0f50) at main.c:209I'm wondering whether there's another code path to lead this problem happened. Since, I take a deep dig via gdb which turns out that newbuffer is not euqal to buffer. In other words, the function RelationGetBufferForTuple must have been called just now.
Besides, why didn't we re-check the flag after RelationGetBufferForTuple was called?
But I'm confused about the heap_update and RelationGetBufferForTuple functions which are too long to understand for me. Can anyone give me a hand?
--
Best regards,
rogers.ww
On 7/22/22 10:22, 王伟(学弈) wrote: > Hi, > I recently find this problem while testing PG14 with sysbench. > Then I look through the emails from pgsql-hackers and find a previous > similary bug which > is https://www.postgresql.org/message-id/flat/2247102.1618008027%40sss.pgh.pa.us > <https://www.postgresql.org/message-id/flat/2247102.1618008027%40sss.pgh.pa.us>. > But the bugfix commit(34f581c39e97e2ea237255cf75cccebccc02d477) is > already patched to PG14. > Which PG14 version / commit is this, exactly? What sysbench parameters did you use, how likely is hitting the issue? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 7/22/22 18:06, Tomas Vondra wrote:
> Which PG14 version / commit is this, exactly? What sysbench parameters
> did you use, how likely is hitting the issue?
> did you use, how likely is hitting the issue?
PG_VERSION is '14beta2'.
The head commit id is 'e1c1c30f635390b6a3ae4993e8cac213a33e6e3f'.
I have run these sysbench commands for couple of days, but only two times to hit the issue.
These sysbench commands are:
prepare:
sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=* --threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=* --pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01" ./src/lua/oltp_insert.lua prepare
parallel execution:
sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=* --threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=* --pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01" ./src/lua/oltp_insert.lua run
sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=* --threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=*--pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01" ./src/lua/oltp_delete.lua run
sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=* --threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=* --pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01" ./src/lua/oltp_read_write.lua run
sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=* --threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=* --pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01" ./src/lua/oltp_update_index.lua run
--
regards,
rogers.ww
------------------------------------------------------------------
发件人:Tomas Vondra<tomas.vondra@enterprisedb.com>
日 期:2022年07月22日 18:06:21
收件人:王伟(学弈)<rogers.ww@alibaba-inc.com>; pgsql-hackers<pgsql-hackers@lists.postgresql.org>
主 题:Re: PANIC: wrong buffer passed to visibilitymap_clear
On 7/22/22 10:22, 王伟(学弈) wrote:
> Hi,
> I recently find this problem while testing PG14 with sysbench.
> Then I look through the emails from pgsql-hackers and find a previous
> similary bug which
> is https://www.postgresql.org/message-id/flat/2247102.1618008027%40sss.pgh.pa.us
> <https://www.postgresql.org/message-id/flat/2247102.1618008027%40sss.pgh.pa.us>.
> But the bugfix commit(34f581c39e97e2ea237255cf75cccebccc02d477) is
> already patched to PG14.
>
Which PG14 version / commit is this, exactly? What sysbench parameters
did you use, how likely is hitting the issue?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7/22/22 14:17, 王伟(学弈) wrote: > On 7/22/22 18:06, Tomas Vondra wrote: >> Which PG14 version / commit is this, exactly? What sysbench parameters >> did you use, how likely is hitting the issue? > PG_VERSION is '14beta2'. > The head commit id is 'e1c1c30f635390b6a3ae4993e8cac213a33e6e3f'. Why not current REL_14_STABLE? 14beta2 is pretty old, and while I haven't checked, perhaps this was already fixed since then. > I have run these sysbench commands for couple of days, but only two times to hit the issue. > These sysbench commands are: > prepare: > sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=* > --threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=* --pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01" > ./src/lua/oltp_insert.lua prepare > parallel execution: > sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=* > --threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=* > --pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01" > ./src/lua/oltp_insert.lua run > sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=* > --threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=*--pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01" > ./src/lua/oltp_delete.lua run > sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=* > --threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=* > --pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01" ./src/lua/oltp_read_write.lua run > sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=* > --threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=* > --pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01" ./src/lua/oltp_update_index.lua run > Thanks. Not sure I'll be able to do such long sysbench runs, though. Can you try reproducing this with current REL_14_STABLE? I wonder if dumping the WAL (using pg_waldump) might tell us more about what happened. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jul 22, 2022 at 1:22 AM 王伟(学弈) <rogers.ww@alibaba-inc.com> wrote: > I recently find this problem while testing PG14 with sysbench. The line numbers from your stack trace don't match up with REL_14_STABLE. Is this actually a fork of Postgres 14? (Oh, looks like it's an old beta release.) > Then I look through the emails from pgsql-hackers and find a previous similary bug which is https://www.postgresql.org/message-id/flat/2247102.1618008027%40sss.pgh.pa.us.But the bugfix commit(34f581c39e97e2ea237255cf75cccebccc02d477)is already patched to PG14. It does seem possible that there is another similar bug somewhere -- another case where we were protected by the fact that VACUUM acquired a full cleanup lock (not just an exclusive buffer lock) during its second heap pass. That changed in Postgres 14 (commit 8523492d4e). But I really don't know -- almost anything is possible. > I'm wondering whether there's another code path to lead this problem happened. Since, I take a deep dig via gdb which turnsout that newbuffer is not euqal to buffer. In other words, the function RelationGetBufferForTuple must have been calledjust now. > Besides, why didn't we re-check the flag after RelationGetBufferForTuple was called? Recheck what flag? And at what point? It's not easy to figure this out from your stack trace, because of the line number issues. It would also be helpful if you told us about the specific table involved. Though the important thing (the essential thing) is to test today's REL_14_STABLE. There have been *lots* of bug fixes since Postgres 14 beta2 was current. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > It would also be helpful if you told us about the specific table > involved. Though the important thing (the essential thing) is to test > today's REL_14_STABLE. There have been *lots* of bug fixes since > Postgres 14 beta2 was current. Yeah. To be blunt, you're wasting your time and ours by testing a year-old beta version. The odds are respectable that the problem is already fixed. Even if it's not, the version skew between what you are looking at and what we are looking at creates lots of confusion. regards, tom lane