Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker

From	Masahiko Sawada
Subject	Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
Date	July 21, 2023, 11:01:11
Msg-id	CAD21AoDs7vzK7NErse7xTruqY-FXmM+3K00SdBtMcQhiRNkoeQ@mail.gmail.com
In response to	BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker (PG Bug reporting form <noreply@postgresql.org>)
Responses	Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker (Alexander Lakhin <exclusion@gmail.com>) Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker (Andres Freund <andres@anarazel.de>)
List	pgsql-bugs

Hi,

On Thu, Jul 20, 2023 at 12:21 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      18031
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 16beta2
> Operating system:   Ubuntu 22.04
> Description:
>
> The following script:
> numclients=100
> for ((c=1;c<=$numclients;c++)); do
> createdb db$c
> done
>
> for ((i=1;i<=50;i++)); do
>   echo "ITERATION $i"
>   for ((c=1;c<=$numclients;c++)); do
>     echo "SELECT format('CREATE TABLE t1_%s_$i (t TEXT);', g) FROM
> generate_series(1,10) g \\gexec" | psql db$c >psql1_$c.out 2>&1 &
>     echo "SELECT format('CREATE TABLE t2_%s_$i (t TEXT);', g) FROM
> generate_series(1,10) g \\gexec" | psql db$c >psql2_$c.out 2>&1 &
>     echo "SELECT 'VACUUM FULL pg_class' FROM generate_series(1,10) g
> \\gexec" | psql db$c >psql3_$c.out 2>&1 &
>   done
>   wait
>   grep -C1 'signal 11' server.log && break;
> done
>
> when executed with the custom settings:
> fsync = off
> max_connections = 1000
> deadlock_timeout = 100ms
> min_parallel_table_scan_size = 1kB
>
> Leads to a server crash:
Thank you for reporting!

I've reproduced the issue in my environment with the provided script.
The crashed process is actually not a parallel vacuum worker but a
parallel worker rebuilding an index. The scenario seems to be that, on
detecting a deadlock, the process removes itself from the wait queue
via RemoveFromWaitQueue() (called by CheckDeadLock()), and then
RemoveFromWaitQueue() is called again by LockErrorCleanup() while
aborting the transaction. Since commit 5764f611e,
RemoveFromWaitQueue() removes the process from the wait queue using
dclist_delete_from():

    /* Remove proc from lock's wait queue */
    dclist_delete_from(&waitLock->waitProcs, &proc->links);
:
    /* Clean up the proc's own state, and pass it the ok/fail signal */
    proc->waitLock = NULL;
    proc->waitProcLock = NULL;
    proc->waitStatus = PROC_WAIT_STATUS_ERROR;

However, since dclist_delete_from() doesn't clear proc->links,
dlist_node_is_detached() still returns false in LockErrorCleanup():

    if (!dlist_node_is_detached(&MyProc->links))
    {
        /* We could not have been granted the lock yet */
        RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
    }

which leads to RemoveFromWaitQueue() being called a second time. I think
we should use dclist_delete_from_thoroughly() instead. With the attached
patch, the issue no longer occurs in my environment.
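
To illustrate the difference between the two deletion variants, here is a
minimal standalone sketch. It is not PostgreSQL code: the types and
functions (node, list, list_delete, list_delete_thoroughly,
node_is_detached, cleanup_would_remove_again) are simplified, hypothetical
stand-ins for dlist_node, dclist_head, dclist_delete_from(),
dclist_delete_from_thoroughly() and dlist_node_is_detached(), and the
"cleanup" check only mimics the test done in LockErrorCleanup().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for dlist_node and dclist_head (assumption). */
typedef struct node { struct node *prev, *next; } node;
typedef struct list { node head; size_t count; } list;

static void list_init(list *l)
{
    l->head.prev = l->head.next = &l->head;   /* empty circular list */
    l->count = 0;
}

static void list_push_tail(list *l, node *n)
{
    n->next = &l->head;
    n->prev = l->head.prev;
    n->prev->next = n;
    l->head.prev = n;
    l->count++;
}

/* Models dclist_delete_from(): unlinks n but leaves its pointers stale. */
static void list_delete(list *l, node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    l->count--;
}

/* Models dclist_delete_from_thoroughly(): also resets n's pointers. */
static void list_delete_thoroughly(list *l, node *n)
{
    list_delete(l, n);
    n->prev = n->next = NULL;
}

/* Models dlist_node_is_detached(): a detached node has NULL pointers. */
static bool node_is_detached(const node *n)
{
    return n->next == NULL;
}

/*
 * Dequeue a node once (as the deadlock path does), then report whether a
 * later cleanup check would still consider it queued and remove it again.
 */
static bool cleanup_would_remove_again(bool thorough)
{
    list queue;
    node proc_links = {NULL, NULL};

    list_init(&queue);
    list_push_tail(&queue, &proc_links);

    if (thorough)
        list_delete_thoroughly(&queue, &proc_links);
    else
        list_delete(&queue, &proc_links);

    return !node_is_detached(&proc_links);
}
```

In this model, cleanup_would_remove_again(false) reports that the node still
looks attached after the plain delete, while cleanup_would_remove_again(true)
reports that it is detached, so the second removal would be skipped.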

Another thing I noticed is that the Assert(waitLock) in
RemoveFromWaitQueue() is actually useless, since we already dereference
waitLock before reaching it:

void
RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
{
    LOCK       *waitLock = proc->waitLock;
    PROCLOCK   *proclock = proc->waitProcLock;
    LOCKMODE    lockmode = proc->waitLockMode;
    LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);

    /* Make sure proc is waiting */
    Assert(proc->waitStatus == PROC_WAIT_STATUS_WAITING);
    Assert(proc->links.next != NULL);
    Assert(waitLock);
    Assert(!dclist_is_empty(&waitLock->waitProcs));
    Assert(0 < lockmethodid && lockmethodid < lengthof(LockMethods));

I think we should fix it as well. This fix is also included in the
attached patch.
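
As a rough illustration of the ordering only (this is not the attached
patch, and DemoLock, DemoProc and demo_lock_method are hypothetical names),
the idea is to check the pointer before the first dereference rather than
in the variable initializers:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for LOCK and PGPROC. */
typedef struct DemoLock { int method_id; } DemoLock;
typedef struct DemoProc { DemoLock *waitLock; } DemoProc;

static int demo_lock_method(const DemoProc *proc)
{
    DemoLock *waitLock = proc->waitLock;
    int       method_id;

    /* Check the pointer first, so the assertion can actually fire. */
    assert(waitLock != NULL);

    /* Only now is it safe to dereference waitLock. */
    method_id = waitLock->method_id;
    return method_id;
}
```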

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments

remove_proc_from_wait_queue_thorougly.patch
