Обсуждение: Logic problem in SerializeSnapshot()

Поиск
Список
Период
Сортировка

Logic problem in SerializeSnapshot()

От
Rushabh Lathia
Дата:
Hi All,

During the testing of parallel query (with force_parallel_mode = regress),
noticed random server crash with the below stack:

#0  0x0000003fc84896d5 in memcpy () from /lib64/libc.so.6
#1  0x0000000000a36867 in SerializeSnapshot (snapshot=0x1e49f40, start_address=0x7f391e9ec728 <Address 0x7f391e9ec728 out of bounds>) at snapmgr.c:1523
#2  0x0000000000522a20 in InitializeParallelDSM (pcxt=0x1e49ce0) at parallel.c:330
#3  0x00000000006dd256 in ExecInitParallelPlan (planstate=0x1f012b0, estate=0x1f00be8, nworkers=1) at execParallel.c:398
#4  0x00000000006f8abb in ExecGather (node=0x1f00d00) at nodeGather.c:160
#5  0x00000000006de42e in ExecProcNode (node=0x1f00d00) at execProcnode.c:516
#6  0x00000000006da4fd in ExecutePlan (estate=0x1f00be8, planstate=0x1f00d00, use_parallel_mode=1 '\001', operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=0,
    direction=ForwardScanDirection, dest=0x1e5e118) at execMain.c:1633

So started looking into SerializeSnapshot() and with code reading I found that
we ignore copying the SubXID array if it has overflowed, unless the snapshot
was taken during recovey, and for this we mark the serialized_snapshot->subxcnt
to 0. But later while copying the SubXID array we check then condition based on
snapshot->subxcnt. We should check serialized_snapshot->subxcnt rather then
snapshot->subxcnt.

I tried hard to come up with individual test but somehow I was unable to
create testcase.

PFA patch to fix the issue.


regards,
Rushabh Lathia
Вложения

Re: Logic problem in SerializeSnapshot()

От
Robert Haas
Дата:
On Tue, Mar 1, 2016 at 12:34 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> During the testing of parallel query (with force_parallel_mode = regress),
> noticed random server crash with the below stack:
>
> #0  0x0000003fc84896d5 in memcpy () from /lib64/libc.so.6
> #1  0x0000000000a36867 in SerializeSnapshot (snapshot=0x1e49f40,
> start_address=0x7f391e9ec728 <Address 0x7f391e9ec728 out of bounds>) at
> snapmgr.c:1523
> #2  0x0000000000522a20 in InitializeParallelDSM (pcxt=0x1e49ce0) at
> parallel.c:330
> #3  0x00000000006dd256 in ExecInitParallelPlan (planstate=0x1f012b0,
> estate=0x1f00be8, nworkers=1) at execParallel.c:398
> #4  0x00000000006f8abb in ExecGather (node=0x1f00d00) at nodeGather.c:160
> #5  0x00000000006de42e in ExecProcNode (node=0x1f00d00) at
> execProcnode.c:516
> #6  0x00000000006da4fd in ExecutePlan (estate=0x1f00be8,
> planstate=0x1f00d00, use_parallel_mode=1 '\001', operation=CMD_SELECT,
> sendTuples=1 '\001', numberTuples=0,
>     direction=ForwardScanDirection, dest=0x1e5e118) at execMain.c:1633
>
> So started looking into SerializeSnapshot() and with code reading I found
> that
> we ignore copying the SubXID array if it has overflowed, unless the snapshot
> was taken during recovey, and for this we mark the
> serialized_snapshot->subxcnt
> to 0. But later while copying the SubXID array we check then condition based
> on
> snapshot->subxcnt. We should check serialized_snapshot->subxcnt rather then
> snapshot->subxcnt.
>
> I tried hard to come up with individual test but somehow I was unable to
> create testcase.
>
> PFA patch to fix the issue.

Thanks, that looks right to me.  Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company