Re: BUG #18377: Assert false in "partdesc->nparts >= pinfo->nparts", fileName="execPartition.c", lineNumber=1943

From Tender Wang
Subject Re: BUG #18377: Assert false in "partdesc->nparts >= pinfo->nparts", fileName="execPartition.c", lineNumber=1943
Date
Msg-id CAHewXNnpxy6rMNvBGZpTdgLosNTpEmZOzth6_m57kcU3kE4kTA@mail.gmail.com
In reply to Re: BUG #18377: Assert false in "partdesc->nparts >= pinfo->nparts", fileName="execPartition.c", lineNumber=1943  (Tender Wang <tndrwang@gmail.com>)
List pgsql-bugs


On Thu, Apr 18, 2024 at 20:13, Tender Wang <tndrwang@gmail.com> wrote:


On Tue, Apr 9, 2024 at 01:57, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2024-Mar-05, PG Bug reporting form wrote:

> #2  0x0000000000b8748d in ExceptionalCondition (conditionName=0xd25358
> "partdesc->nparts >= pinfo->nparts", fileName=0xd24cfc "execPartition.c",
> lineNumber=1943) at assert.c:66
> #3  0x0000000000748bf1 in CreatePartitionPruneState (planstate=0x1898ad0,
> pruneinfo=0x1884188) at execPartition.c:1943
> #4  0x00000000007488cb in ExecInitPartitionPruning (planstate=0x1898ad0,
> n_total_subplans=2, pruneinfo=0x1884188,
> initially_valid_subplans=0x7ffdca29f7d0) at execPartition.c:1803

I had been digging into this crash in late March and seeing if I could
find a reliable fix, but it seems devilish and had to put it aside.  The
problem is that DETACH CONCURRENTLY does a wait for snapshots to
disappear before doing the next detach phase; but since pgbench is using
prepared mode, the wait is already long done by the time EXECUTE wants
to run the plan.  Now, we have relcache invalidations at the point where
the wait ends, and those relcache invalidations should in turn cause the
prepared plan to be invalidated, so we would get a new plan that
excludes the partition being detached.  But this doesn't happen for some
reason that I haven't yet been able to understand.

Still trying to find a proper fix.  In the meantime, not using prepared
plans should serve to work around the problem.

--
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"The ability of users to misuse tools is, of course, legendary" (David Steele)
https://postgr.es/m/11b38a96-6ded-4668-b772-40f992132797@pgmasters.net



I have been analyzing this crash over the past few days, and I added a lot of debug output to the code.
Eventually I found a code execution sequence that triggers this assert, and I can reproduce
the crash with gdb instead of pgbench.

For example:
./psql postgres  # session1, which will run the detach; start it first
In another terminal, start gdb (call it gdb1):
   gdb -p session1_pid
   b ATExecDetachPartition

In session1, enter: alter table p detach partition p1 concurrently;
Now session1 is stalled by gdb.

In the gdb1 terminal, step with next (n) until the first transaction calls CommitTransactionCommand(),
and stop at CommitTransactionCommand().

Start another session, session2, to run the select.
Enter: prepare p1 as select * from p where a = $1;

Open a new terminal and start gdb (call it gdb2):
    gdb -p session2_pid
    b exec_simple_query
In session2, enter: execute p1(1);
Now session2 is stalled by gdb.

In the gdb2 terminal, step into PortalRunUtility() and stop right after it gets a snapshot.
From the point of view of session2's snapshot, the transaction that updated pg_inherits has not committed.
Switch to the gdb1 terminal and keep stepping until the call to DetachPartitionFinalize().
Because session2 has not yet taken a lock on relation p, gdb1 can get past WaitForLockersMultiple().

Now switch to gdb2 and continue. If we set a breakpoint at find_inheritance_children_extended(),
we get a tuple whose inhdetachpending is true but whose xmin is still in progress for session2's snapshot.
So, following the logic there, this tuple is added to the output, and we end up with two partitions.
After returning from add_base_rels_to_query() in query_planner(), switch to gdb1.
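
The visibility rule in this step can be pictured with a small C model. This is only a sketch and not the PostgreSQL source: ToyInheritsRow, toy_count_children() and their fields are hypothetical names invented for illustration, and the real find_inheritance_children_extended() works on pg_inherits heap tuples and the active snapshot rather than on booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one pg_inherits row. */
typedef struct ToyInheritsRow
{
    const char *child;            /* partition name */
    bool        inhdetachpending; /* marked by the first detach transaction */
    bool        xmin_in_progress; /* does the reader's snapshot still see the
                                   * marking transaction as running? */
} ToyInheritsRow;

/*
 * Count the children a reader would include.  A detach-pending row is
 * skipped only when the caller asks to omit detached partitions AND the
 * reader's snapshot already sees the marking transaction as finished.
 */
static int
toy_count_children(const ToyInheritsRow *rows, size_t nrows, bool omit_detached)
{
    int         n = 0;

    assert(rows != NULL || nrows == 0);

    for (size_t i = 0; i < nrows; i++)
    {
        if (rows[i].inhdetachpending && omit_detached &&
            !rows[i].xmin_in_progress)
            continue;           /* treated as already detached */
        n++;
    }
    return n;
}
```

With p1 marked pending and its xmin still in progress for the reader, the model counts both partitions, which matches the two partitions seen at plan time in this step.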

In gdb1, enter DetachPartitionFinalize() and call RemoveInheritance() to remove the tuple.
Then issue "continue" to let the rest of the detach finish.

Now switch to gdb2 and set a breakpoint at RelationCacheInvalidateEntry(). Continue gdb2; it
stops at RelationCacheInvalidateEntry(), and we can see that the relcache entry for p is cleared.
The backtrace is attached at the end of this email.

On entering ExecInitAppend(), part_prune_info is not null, so we enter CreatePartitionPruneState().
We enter find_inheritance_children_extended() again to build the partdesc, but by now gdb1 has finished
DetachPartitionFinalize() and the detach has committed. So we get only one tuple, and nparts is 1.

Finally, we will trigger the Assert: (partdesc->nparts >= pinfo->nparts).


--
Tender Wang
OpenPie:  https://en.openpie.com/

Sorry, in my last email I forgot to include the backtrace of the RelationCacheInvalidateEntry() call during the planner phase.

I found a self-contradictory comment in CreatePartitionPruneState():

/* For data reading, executor always omits detached partitions */
if (estate->es_partition_directory == NULL)
    estate->es_partition_directory =
        CreatePartitionDirectory(estate->es_query_cxt, false);

Shouldn't it say "does not omit", if I understand correctly? We pass false to the function.


I wonder whether we could rewrite the logic of CreatePartitionPruneState() as below:
if (partdesc->nparts == pinfo->nparts)
{
    /* no new partition and no detached partition */
}
else if (partdesc->nparts > pinfo->nparts)
{
    /* new partition */
}
else
{
    /* detached partition */
}
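
As a compilable illustration of the three-way split above, here is a minimal sketch in C. The enum ToyPartChange and the function toy_classify() are hypothetical names made up for this example; nothing like them exists in execPartition.c, and the sketch only compares the two counts.

```c
#include <assert.h>

/* Hypothetical labels for the three branches of the proposal. */
typedef enum ToyPartChange
{
    TOY_PARTS_UNCHANGED,        /* same count at plan time and at execution */
    TOY_PARTS_ADDED,            /* more partitions now than when planned */
    TOY_PARTS_DETACHED          /* fewer partitions now than when planned */
} ToyPartChange;

/*
 * Classify the relation between the partition count seen by the executor
 * (partdesc_nparts) and the count recorded by the planner (pinfo_nparts).
 */
static ToyPartChange
toy_classify(int partdesc_nparts, int pinfo_nparts)
{
    assert(partdesc_nparts >= 0 && pinfo_nparts >= 0);

    if (partdesc_nparts == pinfo_nparts)
        return TOY_PARTS_UNCHANGED;
    if (partdesc_nparts > pinfo_nparts)
        return TOY_PARTS_ADDED;
    return TOY_PARTS_DETACHED;
}
```

For the scenario in the previous email, toy_classify(1, 2) lands in the detached branch. One caveat with comparing counts alone: equal counts would also result if one partition was attached and another detached in between, so a real fix would probably have to compare partition OIDs as well.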

I haven't figured out a fix for the scenario I described in my last email.
--
Tender Wang
OpenPie:  https://en.openpie.com/
