Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker

Поиск
Список
Период
Сортировка
От Masahiko Sawada
Тема Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
Дата
Msg-id CAD21AoDs7vzK7NErse7xTruqY-FXmM+3K00SdBtMcQhiRNkoeQ@mail.gmail.com
обсуждение исходный текст
Ответ на BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker  (PG Bug reporting form <noreply@postgresql.org>)
Ответы Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker  (Alexander Lakhin <exclusion@gmail.com>)
Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker  (Andres Freund <andres@anarazel.de>)
Список pgsql-bugs
Hi,

On Thu, Jul 20, 2023 at 12:21 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      18031
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 16beta2
> Operating system:   Ubuntu 22.04
> Description:
>
> The following script:
> numclients=100
> for ((c=1;c<=$numclients;c++)); do
> createdb db$c
> done
>
> for ((i=1;i<=50;i++)); do
>   echo "ITERATION $i"
>   for ((c=1;c<=$numclients;c++)); do
>     echo "SELECT format('CREATE TABLE t1_%s_$i (t TEXT);', g) FROM
> generate_series(1,10) g \\gexec" | psql db$c >psql1_$c.out 2>&1 &
>     echo "SELECT format('CREATE TABLE t2_%s_$i (t TEXT);', g) FROM
> generate_series(1,10) g \\gexec" | psql db$c >psql2_$c.out 2>&1 &
>     echo "SELECT 'VACUUM FULL pg_class' FROM generate_series(1,10) g
> \\gexec" | psql db$c >psql3_$c.out 2>&1 &
>   done
>   wait
>   grep -C1 'signal 11' server.log && break;
> done
>
> when executed with the custom settings:
> fsync = off
> max_connections = 1000
> deadlock_timeout = 100ms
> min_parallel_table_scan_size = 1kB
>
> Leads to a server crash:
Thank you for reporting!

I've reproduced the issue in my environment with the provided script.
The crashed process is not a parallel vacuum worker actually but a
parallel worker for rebuilding the index. The scenario seems that when
detecting a deadlock, the process removes itself from the wait queue
by RemoveFromWaitQueue() (called by CheckDeadLock()), and then
RemoveFromWaitQueue() is called again by LockErrorCleanup() while
aborting the transaction. With commit 5764f611e, in
RemoveFromWaitQueue() we remove the process from the wait queue using
dclist_delete_from():

    /* Remove proc from lock's wait queue */
    dclist_delete_from(&waitLock->waitProcs, &proc->links);
:
    /* Clean up the proc's own state, and pass it the ok/fail signal */
    proc->waitLock = NULL;
    proc->waitProcLock = NULL;
    proc->waitStatus = PROC_WAIT_STATUS_ERROR;

 However, since dclist_delete_from() doesn't clear proc->links, in
LockErrorCleanup(), dlist_node_is_detached() still returns false:

    if (!dlist_node_is_detached(&MyProc->links))
    {
        /* We could not have been granted the lock yet */
        RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
    }

leading to calling RemoveFromWaitQueue() again. I think we should use
dclist_delete_from_thoroughly() instead. With the attached patch, the
issue doesn't happen in my environment.

Another thing I noticed is that the Assert(waitLock) in
RemoveFromWaitQueue() is useless actually, since we access *waitLock
before that:

void
RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
{
    LOCK       *waitLock = proc->waitLock;
    PROCLOCK   *proclock = proc->waitProcLock;
    LOCKMODE    lockmode = proc->waitLockMode;
    LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);

    /* Make sure proc is waiting */
    Assert(proc->waitStatus == PROC_WAIT_STATUS_WAITING);
    Assert(proc->links.next != NULL);
    Assert(waitLock);
    Assert(!dclist_is_empty(&waitLock->waitProcs));
    Assert(0 < lockmethodid && lockmethodid < lengthof(LockMethods));

I think we should fix it as well. This fix is also included in the
attached patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Вложения

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: pg_basebackup: errors on macOS on directories with ".DS_Store" files
Следующее
От: vignesh C
Дата:
Сообщение: Re: BUG #18027: Logical replication taking forever