Re: BUG #18377: Assert false in "partdesc->nparts >= pinfo->nparts", fileName="execPartition.c", lineNumber=1943

From: Tender Wang
Subject: Re: BUG #18377: Assert false in "partdesc->nparts >= pinfo->nparts", fileName="execPartition.c", lineNumber=1943
Date:
Msg-id CAHewXNkaKgVmT+OkVA9UHrEYm+b8J6o_8+-84Qey6V5tM-+z9A@mail.gmail.com
In reply to: Re: BUG #18377: Assert false in "partdesc->nparts >= pinfo->nparts", fileName="execPartition.c", lineNumber=1943  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses: Re: BUG #18377: Assert false in "partdesc->nparts >= pinfo->nparts", fileName="execPartition.c", lineNumber=1943  (Tender Wang <tndrwang@gmail.com>)
Re: BUG #18377: Assert false in "partdesc->nparts >= pinfo->nparts", fileName="execPartition.c", lineNumber=1943  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List: pgsql-bugs


Alvaro Herrera <alvherre@alvh.no-ip.org> wrote on Tue, Apr 9, 2024 at 01:57:
On 2024-Mar-05, PG Bug reporting form wrote:

> #2  0x0000000000b8748d in ExceptionalCondition (conditionName=0xd25358
> "partdesc->nparts >= pinfo->nparts", fileName=0xd24cfc "execPartition.c",
> lineNumber=1943) at assert.c:66
> #3  0x0000000000748bf1 in CreatePartitionPruneState (planstate=0x1898ad0,
> pruneinfo=0x1884188) at execPartition.c:1943
> #4  0x00000000007488cb in ExecInitPartitionPruning (planstate=0x1898ad0,
> n_total_subplans=2, pruneinfo=0x1884188,
> initially_valid_subplans=0x7ffdca29f7d0) at execPartition.c:1803

I had been digging into this crash in late March and seeing if I could
find a reliable fix, but it seems devilish and had to put it aside.  The
problem is that DETACH CONCURRENTLY does a wait for snapshots to
disappear before doing the next detach phase; but since pgbench is using
prepared mode, the wait is already long done by the time EXECUTE wants
to run the plan.  Now, we have relcache invalidations at the point where
the wait ends, and those relcache invalidations should in turn cause the
prepared plan to be invalidated, so we would get a new plan that
excludes the partition being detached.  But this doesn't happen for some
reason that I haven't yet been able to understand.

Still trying to find a proper fix.  In the meantime, not using prepared
plans should serve to work around the problem.

--
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"The ability of users to misuse tools is, of course, legendary" (David Steele)
https://postgr.es/m/11b38a96-6ded-4668-b772-40f992132797@pgmasters.net



I have been analyzing this crash over the past few days, and I added a lot of debug output to the code.
Eventually I found an execution sequence that triggers this assert, and I can reproduce the crash
with gdb instead of pgbench.

For example:
./psql postgres  # session1, which will do the detach; start it first
In another terminal, start gdb (call it gdb1):
   gdb -p session1_pid
   b ATExecDetachPartition

In session1, run: alter table p detach partition p1 concurrently;
Now session1 is stopped by gdb.

In gdb1, step with next (n) until the first transaction reaches its call to CommitTransactionCommand().
We stop at CommitTransactionCommand().

Start another session, session2, to do the select.
Run: prepare p1 as select * from p where a = $1;

In a new terminal, start gdb (call it gdb2):
    gdb -p session2_pid
    b exec_simple_query
In session2, run: execute p1(1);
Now session2 is stopped by gdb.

In gdb2, step into PortalRunUtility() and stop right after it gets a snapshot.
For session2, the transaction that updated pg_inherits is not yet committed.
Switch to gdb1 and continue stepping with next up to the call to DetachPartitionFinalize().
Because session2 has not yet taken a lock on relation p, gdb1 can get past WaitForLockersMultiple().

Now switch to gdb2 and continue. If we set a breakpoint at find_inheritance_children_extended(),
we get a tuple whose inhdetachpending is true but whose xmin is in progress according to session2's snapshot.
By that function's logic, this tuple is added to the output, so in the end we get two partitions.
After returning from add_base_rels_to_query() in query_planner(), switch to gdb1.
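
The decision described above can be sketched as a small Python model. This is only an
illustration under stated assumptions, not the code in find_inheritance_children_extended():
the classes, the function names, the xid values, and the second partition "p2" are all
hypothetical. The rule it models is the one from the paragraph above: a detach-pending
child is omitted only when the xmin of its pg_inherits tuple is no longer in progress for
the caller's snapshot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Toy MVCC snapshot: the xids that were still in progress when it was taken."""
    in_progress: frozenset


@dataclass(frozen=True)
class InheritsTuple:
    """Toy pg_inherits row: child name, xid that wrote the row, detach-pending flag."""
    child: str
    xmin: int
    detach_pending: bool


def visible_children(tuples, snap, omit_detached=True):
    """Return the children a scan using `snap` would report (hypothetical model)."""
    children = []
    for tup in tuples:
        if tup.detach_pending and omit_detached:
            # Omit the child only if the detaching transaction is already
            # finished as far as this snapshot is concerned.
            if tup.xmin not in snap.in_progress:
                continue
        children.append(tup.child)
    return children


DETACH_XID = 100  # hypothetical xid of the first DETACH CONCURRENTLY transaction
rows = [
    InheritsTuple("p1", DETACH_XID, True),  # partition being detached
    InheritsTuple("p2", 90, False),         # hypothetical second partition
]

# session2 took its snapshot before the first detach transaction committed.
print(visible_children(rows, Snapshot(frozenset({DETACH_XID}))))  # ['p1', 'p2']
# A snapshot taken after that commit no longer sees p1.
print(visible_children(rows, Snapshot(frozenset())))              # ['p2']
```

With session2's older snapshot the planner still counts p1, which is how the plan ends up
with two partitions.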

In gdb1, enter DetachPartitionFinalize() and call RemoveInheritance() to remove the tuple.
Then issue "continue" to let the detach finish the rest of its work.

Now switch to gdb2 and set a breakpoint at RelationCacheInvalidateEntry(). Continue gdb2, and it
stops at RelationCacheInvalidateEntry(), where we can see the relcache entry for p being cleared.
The backtrace is attached at the end of this email.

On entering ExecInitAppend(), part_prune_info is not null, so we enter CreatePartitionPruneState().
We call find_inheritance_children_extended() again to get the partdesc, but by now gdb1 has finished
DetachPartitionFinalize() and the detach has committed. So we get only one tuple, and nparts is 1.

Finally, we trigger the Assert: (partdesc->nparts >= pinfo->nparts).
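
To make the ordering easier to follow, here is a toy Python replay of the sequence. It is
a sketch under stated assumptions, not PostgreSQL code: the function names, the dict that
stands in for the prune info, and the second partition "p2" are hypothetical. It only
models the two counts that the assertion compares: the number of partitions seen at plan
time and the number seen when the executor rebuilds the descriptor.

```python
def plan_query(catalog_children):
    """Planning step: remember how many partitions were seen (models pinfo->nparts)."""
    return {"nparts": len(catalog_children)}


def create_partition_prune_state(pinfo, catalog_children):
    """Executor startup: re-read the children and apply the check from the report."""
    partdesc_nparts = len(catalog_children)  # models partdesc->nparts
    if not partdesc_nparts >= pinfo["nparts"]:
        raise AssertionError("partdesc->nparts >= pinfo->nparts")
    return partdesc_nparts


def replay():
    """Replay plan -> detach finalize -> executor startup; return the failed check."""
    catalog = ["p1", "p2"]       # what session2's older snapshot still shows
    pinfo = plan_query(catalog)  # the plan is built with two partitions
    catalog.remove("p1")         # the detach finalizes and commits in session1
    try:
        create_partition_prune_state(pinfo, catalog)
    except AssertionError as exc:
        return str(exc)
    return None


print(replay())  # prints: partdesc->nparts >= pinfo->nparts
```

The check only fails because the removal happens between planning and executor startup.
If the partition count grows or stays the same in that window, the check passes.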


--
Tender Wang
OpenPie:  https://en.openpie.com/
