Thread: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker

BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker

From: PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18031
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 16beta2
Operating system:   Ubuntu 22.04
Description:

The following script:
numclients=100
for ((c=1;c<=$numclients;c++)); do
createdb db$c
done

for ((i=1;i<=50;i++)); do
  echo "ITERATION $i"
  for ((c=1;c<=$numclients;c++)); do
    echo "SELECT format('CREATE TABLE t1_%s_$i (t TEXT);', g) FROM
generate_series(1,10) g \\gexec" | psql db$c >psql1_$c.out 2>&1 &
    echo "SELECT format('CREATE TABLE t2_%s_$i (t TEXT);', g) FROM
generate_series(1,10) g \\gexec" | psql db$c >psql2_$c.out 2>&1 &
    echo "SELECT 'VACUUM FULL pg_class' FROM generate_series(1,10) g
\\gexec" | psql db$c >psql3_$c.out 2>&1 &
  done
  wait
  grep -C1 'signal 11' server.log && break;
done

when executed with the custom settings:
fsync = off
max_connections = 1000
deadlock_timeout = 100ms
min_parallel_table_scan_size = 1kB

leads to a server crash:
...
ITERATION 15
2023-07-19 17:01:46.542 MSK [1449673] STATEMENT:  VACUUM FULL pg_class
2023-07-19 17:01:46.617 MSK [1434014] LOG:  background worker "parallel
worker" (PID 1450094) was terminated by signal 11: Segmentation fault
2023-07-19 17:01:46.617 MSK [1434014] DETAIL:  Failed process was running:
VACUUM FULL pg_class

With the following stack trace:
Core was generated by `postgres: parallel worker for PID 1449148  waiting
                        '.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Section `.reg-xstate/1450094' in core file too small.
#0  RemoveFromWaitQueue (proc=0x7fbae3910400, hashcode=1106083722) at
lock.c:1874
1874            LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);
(gdb) bt
#0  RemoveFromWaitQueue (proc=0x7fbae3910400, hashcode=1106083722) at
lock.c:1874
#1  0x000055dcfc60f595 in LockErrorCleanup () at proc.c:735
#2  0x000055dcfc618757 in ProcessInterrupts () at postgres.c:3207
#3  0x000055dcfc60fbe1 in ProcSleep
(locallock=locallock@entry=0x55dcfe964380, 
    lockMethodTable=lockMethodTable@entry=0x55dcfca92800
<default_lockmethod>) at proc.c:1295
#4  0x000055dcfc5ffe0b in WaitOnLock
(locallock=locallock@entry=0x55dcfe964380, 
    owner=owner@entry=0x55dcfe978ed8) at lock.c:1818
#5  0x000055dcfc601ba5 in LockAcquireExtended
(locktag=locktag@entry=0x7ffd32e28110, 
    lockmode=lockmode@entry=5, sessionLock=sessionLock@entry=false,
dontWait=dontWait@entry=false, 
    reportMemoryError=reportMemoryError@entry=true,
locallockp=locallockp@entry=0x0) at lock.c:1082
#6  0x000055dcfc601c67 in LockAcquire (locktag=locktag@entry=0x7ffd32e28110,
lockmode=lockmode@entry=5, 
    sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false) at
lock.c:740
#7  0x000055dcfc5fe85f in XactLockTableWait (xid=xid@entry=43588,
rel=rel@entry=0x7fbad92f4348, 
    ctid=ctid@entry=0x55dcfe9f74b4, oper=oper@entry=XLTW_InsertIndexUnique)
at lmgr.c:702
#8  0x000055dcfc2720d2 in heapam_index_build_range_scan
(heapRelation=0x7fbad92f4348, 
    indexRelation=0x7fbad92f90c8, indexInfo=0x55dcfe9f7138,
allow_sync=<optimized out>, anyvisible=false, 
    progress=false, start_blockno=0, numblocks=4294967295,
callback=0x55dcfc29aba6 <_bt_build_callback>, 
    callback_state=0x7ffd32e28610, scan=0x55dcfe9f7458) at
heapam_handler.c:1496
#9  0x000055dcfc29af07 in table_index_build_scan (scan=<optimized out>,
callback_state=0x7ffd32e28610, 
    callback=0x55dcfc29aba6 <_bt_build_callback>, progress=false,
allow_sync=true, 
    index_info=0x55dcfe9f7138, index_rel=<optimized out>,
table_rel=<optimized out>)
    at ../../../../src/include/access/tableam.h:1781
#10 _bt_parallel_scan_and_sort (btspool=btspool@entry=0x55dcfe9f7d08,
btspool2=btspool2@entry=0x0, 
    btshared=btshared@entry=0x7fbad93122a0,
sharedsort=sharedsort@entry=0x7fbad9312240, 
    sharedsort2=sharedsort2@entry=0x0, sortmem=<optimized out>,
progress=false) at nbtsort.c:1985
#11 0x000055dcfc29d343 in _bt_parallel_build_main (seg=0x55dcfe979608,
toc=0x7fbad9312000) at nbtsort.c:1888
#12 0x000055dcfc2c64a6 in ParallelWorkerMain (main_arg=<optimized out>) at
parallel.c:1520
#13 0x000055dcfc56faf0 in StartBackgroundWorker () at bgworker.c:861
#14 0x000055dcfc57631a in do_start_bgworker (rw=rw@entry=0x55dcfe969ad0) at
postmaster.c:5762
#15 0x000055dcfc576496 in maybe_start_bgworkers () at postmaster.c:5986
#16 0x000055dcfc576dbf in process_pm_pmsignal () at postmaster.c:5149
#17 0x000055dcfc57736b in ServerLoop () at postmaster.c:1770
#18 0x000055dcfc5788ef in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x55dcfe938620)
    at postmaster.c:1463
#19 0x000055dcfc4964f5 in main (argc=3, argv=0x55dcfe938620) at main.c:198

(I also got another stack trace:
(gdb) bt
#0  RemoveFromWaitQueue (proc=0x7f763dde44c0, hashcode=1514899366) at
lock.c:1874
#1  0x0000559984e4e595 in LockErrorCleanup () at proc.c:735
#2  0x0000559984b125cc in AbortTransaction () at xact.c:2735
#3  0x0000559984b1307d in AbortCurrentTransaction () at xact.c:3338
#4  0x0000559984e5b68a in PostgresMain (dbname=<optimized out>,
username=0x55998640c6d8 "law")
    at postgres.c:4354
#5  0x0000559984db3183 in BackendRun (port=port@entry=0x559986445a60) at
postmaster.c:4461
#6  0x0000559984db623a in BackendStartup (port=port@entry=0x559986445a60) at
postmaster.c:4189
#7  0x0000559984db63d8 in ServerLoop () at postmaster.c:1779
#8  0x0000559984db78ef in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x55998640a620)
    at postmaster.c:1463
#9  0x0000559984cd54f5 in main (argc=3, argv=0x55998640a620) at main.c:198
)

server.log also contains:
2023-07-19 17:01:46.542 MSK [1449148] ERROR:  deadlock detected
2023-07-19 17:01:46.542 MSK [1449148] DETAIL:  Process 1449148 waits for
ShareLock on transaction 43588; blocked by process 1449150.
    Process 1449150 waits for RowExclusiveLock on relation 1259 of database
16385; blocked by process 1449148.
    Process 1449148: VACUUM FULL pg_class
    Process 1449150: CREATE TABLE t2_2_15 (t TEXT);
2023-07-19 17:01:46.542 MSK [1449148] HINT:  See server log for query
details.
2023-07-19 17:01:46.542 MSK [1449148] CONTEXT:  while checking uniqueness of
tuple (28,39) in relation "pg_class"
2023-07-19 17:01:46.542 MSK [1449148] STATEMENT:  VACUUM FULL pg_class

It looks like in LockErrorCleanup():
    if (!dlist_node_is_detached(&MyProc->links))
    {
        /* We could not have been granted the lock yet */
        RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
    }
MyProc->links turned out not to be detached, while MyProc->waitLock was already NULL.
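
As a minimal standalone illustration of that failure mode (simplified,
hypothetical types, not PostgreSQL's actual structures), a routine that
clears its wait-lock pointer on the first call will dereference NULL if a
cleanup path enters it a second time:

/*
 * Standalone sketch of the observed failure (hypothetical names, not
 * PostgreSQL code): the second "remove from wait queue" call dereferences
 * a pointer that the first call has already set to NULL.
 */
#include <stddef.h>
#include <stdio.h>

typedef struct Lock { int lockmethodid; } Lock;
typedef struct Proc { Lock *waitLock; } Proc;

static void
remove_from_wait_queue(Proc *proc)
{
    Lock *waitLock = proc->waitLock;
    int   lockmethodid = waitLock->lockmethodid;   /* faults when waitLock == NULL */

    printf("removed from wait queue of lock method %d\n", lockmethodid);
    proc->waitLock = NULL;                         /* cleared by the first removal */
}

int
main(void)
{
    Lock lock = { 1 };
    Proc proc = { &lock };

    remove_from_wait_queue(&proc);   /* first removal, as in the deadlock handler */
    remove_from_wait_queue(&proc);   /* second removal during error cleanup: SIGSEGV */
    return 0;
}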

`git bisect` for this anomaly points at 5764f611e.


Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker

From: Masahiko Sawada
Date:
Hi,

On Thu, Jul 20, 2023 at 12:21 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      18031
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 16beta2
> Operating system:   Ubuntu 22.04
> Description:
>
> The following script:
> numclients=100
> for ((c=1;c<=$numclients;c++)); do
> createdb db$c
> done
>
> for ((i=1;i<=50;i++)); do
>   echo "ITERATION $i"
>   for ((c=1;c<=$numclients;c++)); do
>     echo "SELECT format('CREATE TABLE t1_%s_$i (t TEXT);', g) FROM
> generate_series(1,10) g \\gexec" | psql db$c >psql1_$c.out 2>&1 &
>     echo "SELECT format('CREATE TABLE t2_%s_$i (t TEXT);', g) FROM
> generate_series(1,10) g \\gexec" | psql db$c >psql2_$c.out 2>&1 &
>     echo "SELECT 'VACUUM FULL pg_class' FROM generate_series(1,10) g
> \\gexec" | psql db$c >psql3_$c.out 2>&1 &
>   done
>   wait
>   grep -C1 'signal 11' server.log && break;
> done
>
> when executed with the custom settings:
> fsync = off
> max_connections = 1000
> deadlock_timeout = 100ms
> min_parallel_table_scan_size = 1kB
>
> leads to a server crash:
Thank you for reporting!

I've reproduced the issue in my environment with the provided script.
The crashed process is actually not a parallel vacuum worker but a
parallel worker for rebuilding the index. The scenario seems to be that
when a deadlock is detected, the process removes itself from the wait queue
by RemoveFromWaitQueue() (called by CheckDeadLock()), and then
RemoveFromWaitQueue() is called again by LockErrorCleanup() while
aborting the transaction. With commit 5764f611e, in
RemoveFromWaitQueue() we remove the process from the wait queue using
dclist_delete_from():

    /* Remove proc from lock's wait queue */
    dclist_delete_from(&waitLock->waitProcs, &proc->links);
:
    /* Clean up the proc's own state, and pass it the ok/fail signal */
    proc->waitLock = NULL;
    proc->waitProcLock = NULL;
    proc->waitStatus = PROC_WAIT_STATUS_ERROR;

 However, since dclist_delete_from() doesn't clear proc->links, in
LockErrorCleanup(), dlist_node_is_detached() still returns false:

    if (!dlist_node_is_detached(&MyProc->links))
    {
        /* We could not have been granted the lock yet */
        RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
    }

leading to calling RemoveFromWaitQueue() again. I think we should use
dclist_delete_from_thoroughly() instead. With the attached patch, the
issue doesn't happen in my environment.
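
As a rough, self-contained illustration of why the detached check misfires
(simplified list code with hypothetical names, not PostgreSQL's ilist.h
implementation): a plain unlink leaves the node's prev/next pointers aimed
at its former neighbours, so a "detached?" test keyed on those pointers
still answers no, whereas the "thorough" variant additionally nulls them out:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct node { struct node *prev, *next; } node;

static void
list_delete(node *n)                 /* analogous to dclist_delete_from() */
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    /* n->prev and n->next still point at the old neighbours */
}

static void
list_delete_thoroughly(node *n)      /* analogous to dclist_delete_from_thoroughly() */
{
    list_delete(n);
    n->prev = n->next = NULL;        /* mark the node as detached */
}

static bool
node_is_detached(const node *n)      /* analogous to dlist_node_is_detached() */
{
    return n->next == NULL;
}

int
main(void)
{
    node head, a;

    /* one-element circular list: head <-> a */
    head.prev = head.next = &a;
    a.prev = a.next = &head;

    list_delete(&a);
    printf("after plain delete:    detached = %d\n", node_is_detached(&a)); /* 0 */

    /* re-link and delete again, this time thoroughly */
    head.prev = head.next = &a;
    a.prev = a.next = &head;
    list_delete_thoroughly(&a);
    printf("after thorough delete: detached = %d\n", node_is_detached(&a)); /* 1 */
    return 0;
}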

Another thing I noticed is that the Assert(waitLock) in
RemoveFromWaitQueue() is useless actually, since we access *waitLock
before that:

void
RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
{
    LOCK       *waitLock = proc->waitLock;
    PROCLOCK   *proclock = proc->waitProcLock;
    LOCKMODE    lockmode = proc->waitLockMode;
    LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);

    /* Make sure proc is waiting */
    Assert(proc->waitStatus == PROC_WAIT_STATUS_WAITING);
    Assert(proc->links.next != NULL);
    Assert(waitLock);
    Assert(!dclist_is_empty(&waitLock->waitProcs));
    Assert(0 < lockmethodid && lockmethodid < lengthof(LockMethods));

I think we should fix it as well. This fix is also included in the
attached patch.
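
The fix itself is in the attached patch rather than inlined here; as a
hedged sketch of one way that Assert could be made meaningful, the
LOCK_LOCKMETHOD() dereference can simply be postponed until after waitLock
has been asserted non-NULL (an illustration built from the code quoted
above, not necessarily the exact shape of the attached patch):

void
RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
{
    LOCK       *waitLock = proc->waitLock;
    PROCLOCK   *proclock = proc->waitProcLock;
    LOCKMODE    lockmode = proc->waitLockMode;
    LOCKMETHODID lockmethodid;

    /* Make sure proc is waiting */
    Assert(proc->waitStatus == PROC_WAIT_STATUS_WAITING);
    Assert(proc->links.next != NULL);
    Assert(waitLock);
    Assert(!dclist_is_empty(&waitLock->waitProcs));

    /* dereference waitLock only after it has been checked */
    lockmethodid = LOCK_LOCKMETHOD(*waitLock);
    Assert(0 < lockmethodid && lockmethodid < lengthof(LockMethods));

    /* ... rest of the function unchanged ... */
}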

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments

Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker

From: Alexander Lakhin
Date:
Hello Sawada-san,

21.07.2023 11:01, Masahiko Sawada wrote:
> ...
> leading to calling RemoveFromWaitQueue() again. I think we should use
> dclist_delete_from_thoroughly() instead. With the attached patch, the
> issue doesn't happen in my environment.

Thank you for the analysis and the patch! It fixes the issue for me too.

> Another thing I noticed is that the Assert(waitLock) in
> RemoveFromWaitQueue() is useless actually, since we access *waitLock
> before that:

Yeah, I also noticed that, and a simple grep-based script with a little
eyeballing discovered one similar Assert usage:

void
llvm_split_symbol_name(const char *name, char **modname, char **funcname)
{
...
         *funcname = rindex(name, '.');
         (*funcname)++;            /* jump over . */

         *modname = pnstrdup(name + strlen("pgextern."),
                             *funcname - name - strlen("pgextern.") - 1);
         Assert(funcname);

Maybe it's worth fixing it in passing.
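
Presumably the intended check there was on the result of rindex(), and
before it is dereferenced; a sketch of what that might look like (an
illustration only, not a patch from this thread):

         *funcname = rindex(name, '.');
         Assert(*funcname);        /* the name is expected to contain a '.' */
         (*funcname)++;            /* jump over . */

         *modname = pnstrdup(name + strlen("pgextern."),
                             *funcname - name - strlen("pgextern.") - 1);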

Best regards,
Alexander



Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker

From: Masahiko Sawada
Date:
On Sat, Jul 22, 2023 at 7:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Sawada-san,
>
> 21.07.2023 11:01, Masahiko Sawada wrote:
> > ...
> > leading to calling RemoveFromWaitQueue() again. I think we should use
> > dclist_delete_from_thoroughly() instead. With the attached patch, the
> > issue doesn't happen in my environment.
>
> Thank you for the analysis and the patch! It fixes the issue for me too.

Thank you for the confirmation. I found this issue is the same as one
that has already been reported[1].

>
> > Another thing I noticed is that the Assert(waitLock) in
> > RemoveFromWaitQueue() is useless actually, since we access *waitLock
> > before that:
>
> Yeah, I also noticed that, and a simple grep-based script with a little
> eyeballing discovered one similar Assert usage:
>
> void
> llvm_split_symbol_name(const char *name, char **modname, char **funcname)
> {
> ...
>          *funcname = rindex(name, '.');
>          (*funcname)++;            /* jump over . */
>
>          *modname = pnstrdup(name + strlen("pgextern."),
>                              *funcname - name - strlen("pgextern.") - 1);
>          Assert(funcname);
>
> Maybe it's worth fixing it in passing.

Yeah, but it should be discussed on another thread.

[1] https://www.postgresql.org/message-id/ZKy4AdrLEfbqrxGJ%40telsasoft.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker

From: Andres Freund
Date:
Hi,

Thanks for finding / debugging this issue!

On 2023-07-21 17:01:11 +0900, Masahiko Sawada wrote:
> I've reproduced the issue in my environment with the provided script.
> The crashed process is actually not a parallel vacuum worker but a
> parallel worker for rebuilding the index. The scenario seems to be that
> when a deadlock is detected, the process removes itself from the wait queue
> by RemoveFromWaitQueue() (called by CheckDeadLock()), and then
> RemoveFromWaitQueue() is called again by LockErrorCleanup() while
> aborting the transaction. With commit 5764f611e, in
> RemoveFromWaitQueue() we remove the process from the wait queue using
> dclist_delete_from():
> 
>     /* Remove proc from lock's wait queue */
>     dclist_delete_from(&waitLock->waitProcs, &proc->links);
> :
>     /* Clean up the proc's own state, and pass it the ok/fail signal */
>     proc->waitLock = NULL;
>     proc->waitProcLock = NULL;
>     proc->waitStatus = PROC_WAIT_STATUS_ERROR;
> 
>  However, since dclist_delete_from() doesn't clear proc->links, in
> LockErrorCleanup(), dlist_node_is_detached() still returns false:
> 
>     if (!dlist_node_is_detached(&MyProc->links))
>     {
>         /* We could not have been granted the lock yet */
>         RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
>     }

Indeed :(


> leading to calling RemoveFromWaitQueue() again. I think we should use
> dclist_delete_from_thoroughly() instead. With the attached patch, the
> issue doesn't happen in my environment.

Yep. Do you want to push that fix, or should I?


> Another thing I noticed is that the Assert(waitLock) in
> RemoveFromWaitQueue() is useless actually, since we access *waitLock
> before that:
> 
> I think we should fix it as well. This fix is also included in the
> attached patch.

Don't really have an opinion on that. It's been this way for longer, afaict.


Greetings,

Andres Freund



Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker

From: Masahiko Sawada
Date:
Hi,

On Tue, Jul 25, 2023 at 12:39 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> Thanks for finding / debugging this issue!
>
> On 2023-07-21 17:01:11 +0900, Masahiko Sawada wrote:
> > I've reproduced the issue in my environment with the provided script.
> > The crashed process is actually not a parallel vacuum worker but a
> > parallel worker for rebuilding the index. The scenario seems to be that
> > when a deadlock is detected, the process removes itself from the wait queue
> > by RemoveFromWaitQueue() (called by CheckDeadLock()), and then
> > RemoveFromWaitQueue() is called again by LockErrorCleanup() while
> > aborting the transaction. With commit 5764f611e, in
> > RemoveFromWaitQueue() we remove the process from the wait queue using
> > dclist_delete_from():
> >
> >     /* Remove proc from lock's wait queue */
> >     dclist_delete_from(&waitLock->waitProcs, &proc->links);
> > :
> >     /* Clean up the proc's own state, and pass it the ok/fail signal */
> >     proc->waitLock = NULL;
> >     proc->waitProcLock = NULL;
> >     proc->waitStatus = PROC_WAIT_STATUS_ERROR;
> >
> >  However, since dclist_delete_from() doesn't clear proc->links, in
> > LockErrorCleanup(), dlist_node_is_detached() still returns false:
> >
> >     if (!dlist_node_is_detached(&MyProc->links))
> >     {
> >         /* We could not have been granted the lock yet */
> >         RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
> >     }
>
> Indeed :(
>
>
> > leading to calling RemoveFromWaitQueue() again. I think we should use
> > dclist_delete_from_thoroughly() instead. With the attached patch, the
> > issue doesn't happen in my environment.
>
> Yep. Do you want to push that fix, or should I?

Thank you for your confirmation. I can push the fix if you're okay.
I've attached the patch.

>
>
> > Another thing I noticed is that the Assert(waitLock) in
> > RemoveFromWaitQueue() is useless actually, since we access *waitLock
> > before that:
> >
> > I think we should fix it as well. This fix is also included in the
> > attached patch.
>
> Don't really have an opinion on that. It's been this way for longer, afaict.

True. I would leave this part alone. We can fix it in a separate
commit later if necessary.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments

Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker

From: Andres Freund
Date:
On 2023-07-25 16:35:54 +0900, Masahiko Sawada wrote:
> Thank you for your confirmation. I can push the fix if you're okay.
> I've attached the patch.

LGTM.

Thanks!



Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker

From: Masahiko Sawada
Date:
On Wed, Jul 26, 2023 at 1:27 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-07-25 16:35:54 +0900, Masahiko Sawada wrote:
> > Thank you for your confirmation. I can push the fix if you're okay.
> > I've attached the patch.
>
> LGTM.

Thank you for reviewing it! Pushed, and marked as resolved on the open
items page.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com