Thread: BUG #18377: Assert false in "partdesc->nparts >= pinfo->nparts", fileName="execPartition.c", lineNumber=1943

The following bug has been logged on the website:

Bug reference:      18377
Logged by:          yajun Hu
Email address:      1026592243@qq.com
PostgreSQL version: 14.11
Operating system:   CentOS7 with kernel version 5.10
Description:

I have reproduced this problem in REL_14_11 and the latest master branch
(393b5599e5177e456cdce500039813629d370b38).
The steps to reproduce are as follows.
1. ./configure  --enable-debug --enable-depend --enable-cassert CFLAGS=-O0
2. make -j; make install -j; initdb -D ./primary; pg_ctl -D ./primary -l
logfile start
3. alter system set plan_cache_mode to 'force_generic_plan' ; select
pg_reload_conf();
4. create table p( a int,b int) partition by range(a);create table p1
partition of p for values from (0) to (1);create table p2 partition of p for
values from (1) to (2);
5. use pgbench with the following SQL scripts
1.sql
SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
\if :gotlock
        alter table p detach partition p1 concurrently ;
        alter table p attach partition p1 for values from (0) to (1);
\endif
2.sql
\set id random(0,1)
select * from p where a = :id
pgbench --no-vacuum --client=5 --transactions=1000000 -f 1.sql -f 2.sql -h
127.0.0.1 -M prepared -p 5432

In my testing, a core dump appears within 10 seconds, and the stack is as
follows:
(gdb) bt
#0  0x00007effe61aa277 in raise () from /lib64/libc.so.6
#1  0x00007effe61ab968 in abort () from /lib64/libc.so.6
#2  0x0000000000b8748d in ExceptionalCondition (conditionName=0xd25358
"partdesc->nparts >= pinfo->nparts", fileName=0xd24cfc "execPartition.c",
lineNumber=1943) at assert.c:66
#3  0x0000000000748bf1 in CreatePartitionPruneState (planstate=0x1898ad0,
pruneinfo=0x1884188) at execPartition.c:1943
#4  0x00000000007488cb in ExecInitPartitionPruning (planstate=0x1898ad0,
n_total_subplans=2, pruneinfo=0x1884188,
initially_valid_subplans=0x7ffdca29f7d0) at execPartition.c:1803
#5  0x000000000076171d in ExecInitAppend (node=0x17cb5e0, estate=0x1898870,
eflags=32) at nodeAppend.c:146
#6  0x00000000007499af in ExecInitNode (node=0x17cb5e0, estate=0x1898870,
eflags=32) at execProcnode.c:182
#7  0x000000000073f514 in InitPlan (queryDesc=0x1880f58, eflags=32) at
execMain.c:969
#8  0x000000000073e3ea in standard_ExecutorStart (queryDesc=0x1880f58,
eflags=32) at execMain.c:267
#9  0x000000000073e15f in ExecutorStart (queryDesc=0x1880f58, eflags=0) at
execMain.c:146
#10 0x00000000009cab67 in PortalStart (portal=0x181b000, params=0x1880ec8,
eflags=0, snapshot=0x0) at pquery.c:517
#11 0x00000000009c5e49 in exec_bind_message (input_message=0x7ffdca29fd20)
at postgres.c:2028
#12 0x00000000009c940e in PostgresMain (dbname=0x17d8e88 "postgres",
username=0x17d8e68 "postgres") at postgres.c:4723
#13 0x00000000008f8024 in BackendRun (port=0x17cd640) at postmaster.c:4477
#14 0x00000000008f77c1 in BackendStartup (port=0x17cd640) at
postmaster.c:4153
#15 0x00000000008f4256 in ServerLoop () at postmaster.c:1771
#16 0x00000000008f3c40 in PostmasterMain (argc=3, argv=0x179b680) at
postmaster.c:1470
#17 0x00000000007be309 in main (argc=3, argv=0x179b680) at main.c:198
(gdb) f 3
#3  0x0000000000748bf1 in CreatePartitionPruneState (planstate=0x1898ad0,
pruneinfo=0x1884188) at execPartition.c:1943
1943                            Assert(partdesc->nparts >= pinfo->nparts);
(gdb) p partdesc->nparts
$1 = 1
(gdb) p pinfo->nparts
$2 = 2

I tried to analyze this problem and found the following:
1. In a release (non-assert) build, the ERROR "could not match partition child
tables to plan elements" is raised instead (see the paraphrased excerpt below).
2. When the Assert fails, the plan contains 2 partitions, but the concurrent
detach has already committed by then, so only 1 partition is visible during
execution, which violates the design assumption.
3. Perhaps DETACH CONCURRENTLY should have a more appropriate waiting
mechanism, or planning should exclude such partitions in this scenario.
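
For reference, both of these come from the partition-matching code around the
failing Assert in CreatePartitionPruneState(); roughly (paraphrased, not
verbatim source):

    /* paraphrased from CreatePartitionPruneState() in execPartition.c */
    Assert(partdesc->nparts >= pinfo->nparts);  /* fires only in assert-enabled builds */

    /*
     * ... loop matching the plan-time partitions (pinfo->relid_map) against
     * the partitions currently in partdesc->oids; entries found only in the
     * partdesc are treated as newly attached and get no subplan ...
     */

    if (pd_idx != pinfo->nparts)    /* a plan-time partition has disappeared */
        elog(ERROR, "could not match partition child tables to plan elements");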


On Tue, 5 Mar 2024 at 20:17, PG Bug reporting form
<noreply@postgresql.org> wrote:
> 1943                            Assert(partdesc->nparts >= pinfo->nparts);
> (gdb) p partdesc->nparts
> $1 = 1
> (gdb) p pinfo->nparts
> $2 = 2

The following comment says this shouldn't happen, but apparently it can:

* Because we request detached partitions to be included, and
* detaching waits for old transactions, it is safe to assume that
* no partitions have disappeared since this query was planned.

Added Álvaro as he knows this area best.

David



On 2024-Mar-06, David Rowley wrote:

> On Tue, 5 Mar 2024 at 20:17, PG Bug reporting form
> <noreply@postgresql.org> wrote:
> > 1943                            Assert(partdesc->nparts >= pinfo->nparts);
> > (gdb) p partdesc->nparts
> > $1 = 1
> > (gdb) p pinfo->nparts
> > $2 = 2
> 
> The following comment says this shouldn't happen, but apparently it can:
> 
> * Because we request detached partitions to be included, and
> * detaching waits for old transactions, it is safe to assume that
> * no partitions have disappeared since this query was planned.
> 
> Added Álvaro as he knows this area best.

Oh, ugh, apologies, I hadn't noticed this report.  I'll have a look ...

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



On 2024-Mar-05, PG Bug reporting form wrote:

> #2  0x0000000000b8748d in ExceptionalCondition (conditionName=0xd25358
> "partdesc->nparts >= pinfo->nparts", fileName=0xd24cfc "execPartition.c",
> lineNumber=1943) at assert.c:66
> #3  0x0000000000748bf1 in CreatePartitionPruneState (planstate=0x1898ad0,
> pruneinfo=0x1884188) at execPartition.c:1943
> #4  0x00000000007488cb in ExecInitPartitionPruning (planstate=0x1898ad0,
> n_total_subplans=2, pruneinfo=0x1884188,
> initially_valid_subplans=0x7ffdca29f7d0) at execPartition.c:1803

I had been digging into this crash in late March and seeing if I could
find a reliable fix, but it seems devilish and had to put it aside.  The
problem is that DETACH CONCURRENTLY does a wait for snapshots to
disappear before doing the next detach phase; but since pgbench is using
prepared mode, the wait is already long done by the time EXECUTE wants
to run the plan.  Now, we have relcache invalidations at the point where
the wait ends, and those relcache invalidations should in turn cause the
prepared plan to be invalidated, so we would get a new plan that
excludes the partition being detached.  But this doesn't happen for some
reason that I haven't yet been able to understand.

Still trying to find a proper fix.  In the meantime, not using prepared
plans should serve to work around the problem.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"The ability of users to misuse tools is, of course, legendary" (David Steele)
https://postgr.es/m/11b38a96-6ded-4668-b772-40f992132797@pgmasters.net





Alvaro Herrera <alvherre@alvh.no-ip.org> wrote on Tue, Apr 9, 2024 at 01:57:
> Still trying to find a proper fix.  In the meantime, not using prepared
> plans should serve to work around the problem.



I have been analyzing this crash over the past few days and added a lot of debug logging to the code.
I finally found an execution sequence that triggers this assert, and the crash can be
reproduced with gdb instead of pgbench.

For example:
./psql postgres  # session1, which will do the detach; start it first
In another terminal, start gdb (call it gdb1):
   gdb -p session1_pid
   b ATExecDetachPartition

In session1, run: alter table p detach partition p1 concurrently;
Session1 is now stalled by gdb.

In the gdb1 terminal, single-step (e.g. with "n") until the first transaction is about to
call CommitTransactionCommand(), and stop there.

Start another session, session2, to do the select:
   prepare p1 as select * from p where a = $1;

Start a new terminal and attach gdb (call it gdb2):
    gdb -p session2_pid
    b exec_simple_query
In session2, run: execute p1(1);
Session2 is now stalled by gdb.

In the gdb2 terminal, step into PortalRunUtility() and stop right after a snapshot has been taken.
For session2's snapshot, the transaction that updated pg_inherits is not yet committed.
Switch to the gdb1 terminal and keep stepping until DetachPartitionFinalize() is about to be called.
Because session2 has not yet acquired the lock on relation p, gdb1 can get past WaitForLockersMultiple().

Now switch to gdb2 and continue the work.  With a breakpoint on find_inheritance_children_extended(),
we get a pg_inherits tuple whose inhdetachpending is true but whose xmin is still in progress
according to session2's snapshot, so the tuple is added to the output and we end up with two partitions.
After returning from add_base_rels_to_query() in query_planner(), switch to gdb1.

In gdb1, enter DetachPartitionFinalize(), which calls RemoveInheritance() to remove the tuple.
Then issue "continue" to let the rest of the detach finish.

Now switch to gdb2 and set a breakpoint at RelationCacheInvalidateEntry().
Continue; we stop at RelationCacheInvalidateEntry() and can see that the
relcache entry for p is invalidated.
The backtrace is attached at the end of this email.

When ExecInitAppend() runs, part_prune_info is not NULL, so we enter
CreatePartitionPruneState().  There we call find_inheritance_children_extended()
again to build the partdesc, but by now gdb1 has finished DetachPartitionFinalize()
and the detach has committed, so we get only one tuple and nparts is 1.

Finally, the Assert (partdesc->nparts >= pinfo->nparts) fails.


--
Tender Wang
OpenPie:  https://en.openpie.com/


Tender Wang <tndrwang@gmail.com> wrote on Thu, Apr 18, 2024 at 20:13:


> Now switch to gdb2 and set a breakpoint at RelationCacheInvalidateEntry().
> Continue; we stop at RelationCacheInvalidateEntry() and can see that the
> relcache entry for p is invalidated.
> The backtrace is attached at the end of this email.

Sorry, in my last email I forgot to include the backtrace that shows RelationCacheInvalidateEntry() being called during the planning phase.

I found a self-contradictory comment in CreatePartitionPruneState():

	/* For data reading, executor always omits detached partitions */
	if (estate->es_partition_directory == NULL)
		estate->es_partition_directory =
			CreatePartitionDirectory(estate->es_query_cxt, false);

Shouldn't it say "does not omit", if I haven't misunderstood?  We pass false to the function.
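
If so, a corrected wording might be something like this (just my guess at the
wording, keeping the code itself unchanged):

	/*
	 * For data reading, the executor must include detached partitions,
	 * hence omit_detached = false here.
	 */
	if (estate->es_partition_directory == NULL)
		estate->es_partition_directory =
			CreatePartitionDirectory(estate->es_query_cxt, false);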


I wonder whether we could rewrite the logic of CreatePartitionPruneState() along these lines:

	if (partdesc->nparts == pinfo->nparts)
	{
		/* no new partitions and no detached partitions */
	}
	else if (partdesc->nparts > pinfo->nparts)
	{
		/* new partitions have appeared */
	}
	else
	{
		/* a partition has been detached */
	}

I haven't figured out a fix for the scenario I described in my last email.
--
Tender Wang
OpenPie:  https://en.openpie.com/
On 2024-Apr-18, Tender Wang wrote:

> Now switch to gdb2 and set a breakpoint at RelationCacheInvalidateEntry().
> Continue; we stop at RelationCacheInvalidateEntry() and can see that the
> relcache entry for p is invalidated.
> The backtrace is attached at the end of this email.

Here is where I think the problem occurs -- I mean, surely
PlanCacheRelCallback marking the plan as ->is_valid=false should cause
the prepared query to be replanned, and at that point the replan would
see that the partition is no more.  So by the time we get to this:

> When ExecInitAppend() runs, part_prune_info is not NULL, so we enter
> CreatePartitionPruneState().  There we call find_inheritance_children_extended()
> again to build the partdesc, but by now gdb1 has finished DetachPartitionFinalize()
> and the detach has committed, so we get only one tuple and nparts is 1.

we have made a new plan, one whose planner partition descriptor has only
one partition, so it'd match what ExecInitAppend sees.

Evidently there's something I'm missing in how this plancache
invalidation works.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/





Alvaro Herrera <alvherre@alvh.no-ip.org> wrote on Tue, May 14, 2024 at 00:55:
> On 2024-Apr-18, Tender Wang wrote:
>
> > Now switch to gdb2 and set a breakpoint at RelationCacheInvalidateEntry().
> > Continue; we stop at RelationCacheInvalidateEntry() and can see that the
> > relcache entry for p is invalidated.
> > The backtrace is attached at the end of this email.

This happens in deconstruct_jointree() (called from query_planner()), after the
planner has already obtained the partition information (nparts is 2).  After
RelationCacheInvalidateEntry(), the partition information is re-read during
executor initialization (nparts is 1).

> Here is where I think the problem occurs -- I mean, surely
> PlanCacheRelCallback marking the plan as ->is_valid=false should cause
> the prepared query to be replanned, and at that point the replan would
> see that the partition is no more.  So by the time we get to this:
>
> > When ExecInitAppend() runs, part_prune_info is not NULL, so we enter
> > CreatePartitionPruneState().  There we call find_inheritance_children_extended()
> > again to build the partdesc, but by now gdb1 has finished DetachPartitionFinalize()
> > and the detach has committed, so we get only one tuple and nparts is 1.
>
> we have made a new plan, one whose planner partition descriptor has only
> one partition, so it'd match what ExecInitAppend sees.
>
> Evidently there's something I'm missing in how this plancache
> invalidation works.

Hmm, I don't think this issue is closely related to plancache invalidation.

In the scenario I constructed in [1], there is no cached plan, so a new plan is created.

After session1 (doing the detach) has updated pg_inherits, and before its first
transaction commits, session2 (doing the execute) takes its snapshot.
Session1 reaches WaitForLockersMultiple() before session2 acquires the lock on the parent relation.

When session2 fetches the partition information during planning, the pg_inherits tuple
that session1 updated is handled by this code in find_inheritance_children_extended():

------
xmin = HeapTupleHeaderGetXmin(inheritsTuple->t_data);
snap = GetActiveSnapshot();

if (!XidInMVCCSnapshot(xmin, snap))
	continue;		/* pending detach is visible to our snapshot: omit the partition */
------

Since that xmin is still in progress for session2's snapshot, the partition is not
omitted, and the planner ends up with 2 partitions.

Session1 then continues the detach: since it has already passed WaitForLockersMultiple(), it
calls RemoveInheritance() to delete the tuple and queues an invalidation message for the
parent relation.  Session2 subsequently hits RelationCacheInvalidateEntry() in
deconstruct_jointree(), so its cached partition information is cleared.

When ExecInitAppend() is called, the partition information is fetched again.  Session1 has
finished its work by now, so session2 sees only 1 pg_inherits tuple.

Finally, we hit the assert.


Two key points:
1.  the updated pg_inherits tuple is visible to the SELECT;
2.  the detaching session can get past WaitForLockersMultiple().

DETACH PARTITION CONCURRENTLY uses two transactions.
I have a question: can two transactions really guarantee consistency here?
--
Tender Wang
OpenPie:  https://en.openpie.com/
On 2024-May-13, Alvaro Herrera wrote:

> Evidently there's something I'm missing in how this plancache
> invalidation works.

Actually, what I was missing is that I had forgotten how
PartitionDirectory is actually used.  In the original code I wrote for
DETACH CONCURRENTLY for 2ndQPostgres a few years ago, I set things up so
that both planner and executor used the same PartitionDesc for a
partitioned table.  But when this was reworked for real Postgres by
Robert to implement concurrent ATTACH, he instead introduced
PartitionDirectory which keeps a PartitionDesc unchanged -- but planner
and executor use different PartitionDirectories.  I probably reworked
some of DETACH CONCURRENTLY to cope with that, but evidently must have
missed this scenario.

So the problem is that we read the partition descriptor during
re-planning with N partitions, and then if we receive the corresponding
invalidation message (at executor time) exactly before
CreatePartitionPruneState() creates its PartitionDirectory, then we read
N-1 partitions, and everything explodes.  (If the invalidation is
received at some other time, then the plan is remade before
ExecInitAppend by plancache, and everything works correctly.)

So one fix for this problem seems to be to patch
CreatePartitionPruneState() to accept the possibility that one partition
has been detached concurrently, and construct the planmap/partmap arrays
with a marking that the partition in question has been pruned away.


With that fix in place, I was now getting errors from
RelationBuildPartitionDesc() that some partition in the partdesc had
NULL relpartbound.  If I understand correctly, the problem here is that
we have a snapshot that still sees that partition as being in the
hierarchy, but in reality its relpartbound has been nullified by the
concurrent detach.  I can fix this symptom by doing
AcceptInvalidationMessages() and starting back at the top, where the list
of partitions is obtained by find_inheritance_children_extended().
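
Roughly, the retry I have in mind has this shape (a simplified sketch with
argument lists elided, not the actual code in the attached patch):

    /* sketch only: tolerate one concurrent detach while building the partdesc */
    bool        retried = false;

retry:
    inhoids = find_inheritance_children_extended(... /* include detached */ ...);

    foreach(lc, inhoids)
    {
        /* ... fetch the child's pg_class tuple ... */
        boundDatum = SysCacheGetAttr(RELOID, tuple,
                                     Anum_pg_class_relpartbound,
                                     &isnull);
        if (isnull)
        {
            /*
             * relpartbound was nulled out by a concurrent detach; absorb the
             * pending invalidations and rebuild the child list, but only once.
             */
            if (retried)
                elog(ERROR, "partition is missing its partition bound");
            AcceptInvalidationMessages();
            retried = true;
            goto retry;
        }
        /* ... build the bound spec for this partition as usual ... */
    }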

With these two things in place, the pgbench scripts run without errors.
That's in patch 0001, which deserves a commit message that distills the
above.


I'm not convinced that this fix is correct or complete, but hey, it's
progress.  Here are some things I'm worried about:

1. This code assumes that only a single partition can be in this "detach
concurrently" mode.  This sounds correct, because the operation needs a
share update exclusive lock on the table, and we only allow one
partition to be in pending-detach state.  However, what if we have a
very slow process on the query side, and two processes manage to
detach two different partitions in the meantime?  I think this isn't possible
because of the wait-for-snapshots phase, but I need to convince myself
that this is the right explanation.

2. The new code in CreatePartitionPruneState assumes that if nparts
decreases, then it must be a detach, and if nparts increases, it must be
an attach.  Can these two things happen together in a way that we see
that the number of partitions remains the same, so we don't actually try
to construct planmap/partmap arrays by matching their OIDs?  I think the
only way to handle a possible problem here would be to verify the OIDs
every time we construct a partition descriptor.  I assume (without
checking) this would come with a performance cost, not sure.

3. The new stanza in CreatePartitionPruneState() should have more
sanity-checks that the two arrays we scan really match.

4. I made RelationBuildPartitionDesc retry only once.  I think this
should be enough, because a single AcceptInvalidationMessages should
process everything that we need to read to bring us up to current
reality, considering that DETACH CONCURRENTLY would need to wait for our
snapshot.

I would like to spend some more time on this, but I'm afraid I'll have
to put it off for a little while.

PS -- yes, the self-contradictory comment in CreatePartitionPruneState()
is bogus.  Patch 0002 fixes that.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least
an order of magnitude."                              (Brian Kernighan)

Attachments


Alvaro Herrera <alvherre@alvh.no-ip.org> wrote on Tue, May 21, 2024 at 00:16:
> With these two things in place, the pgbench scripts run without errors.
> That's in patch 0001, which deserves a commit message that distills the
> above.

I have tested this patch locally and did not encounter any failures.
I will take time to look at the patch in detail and consider the issues you mentioned.
--
Tender Wang
OpenPie:  https://en.openpie.com/
On 2024-May-22, Tender Wang wrote:

> I have tested this patch locally and did not encounter any failures.
> I will take time to look at the patch in detail and consider the issues you
> mentioned.

Thank you.  In the meantime,

> Alvaro Herrera <alvherre@alvh.no-ip.org> wrote on Tue, May 21, 2024 at 00:16:

> > 2. The new code in CreatePartitionPruneState assumes that if nparts
> > decreases, then it must be a detach, and if nparts increases, it must be
> > an attach.  Can these two things happen together in a way that we see
> > that the number of partitions remains the same, so we don't actually try
> > to construct planmap/partmap arrays by matching their OIDs?  I think the
> > only way to handle a possible problem here would be to verify the OIDs
> > every time we construct a partition descriptor.  I assume (without
> > checking) this would come with a performance cost, not sure.

I modified the scripts slightly so that two partitions would be
detached, and lo and behold -- the case where we have one new partition
appearing and one partition disappearing concurrently can indeed happen.
So the two nparts values are identical, but the OID arrays don't
match.  I attach the scripts I used to test.

I think in order to fix this we would have to compare the OID arrays
each time through CreatePartitionPruneState, so that we can mark as
"pruned" (value -1) any partition that's not on either of the partdescs.
Having to compare the whole arrays each and every time might not be
great, but I don't see any backpatchable alternative at the moment.
Going forward, we could avoid the hit by having something like a
generation counter for the partitioned table (which is incremented for
each attach and detach), but of course that's not backpatchable.
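
To make that concrete, the matching I have in mind is roughly like this (an
illustrative sketch only, not what the attached patch does verbatim; it just
runs a naive linear search of pinfo->relid_map for each entry of
partdesc->oids):

    for (pp_idx = 0; pp_idx < partdesc->nparts; pp_idx++)
    {
        int     plan_idx = -1;
        int     j;

        /* find this partition's position in the plan-time OID array */
        for (j = 0; j < pinfo->nparts; j++)
        {
            if (pinfo->relid_map[j] == partdesc->oids[pp_idx])
            {
                plan_idx = j;
                break;
            }
        }

        if (plan_idx >= 0)
        {
            /* partition was known at plan time: keep its mappings */
            pprune->subplan_map[pp_idx] = pinfo->subplan_map[plan_idx];
            pprune->subpart_map[pp_idx] = pinfo->subpart_map[plan_idx];
        }
        else
        {
            /* attached after planning: nothing to scan for it */
            pprune->subplan_map[pp_idx] = -1;
            pprune->subpart_map[pp_idx] = -1;
        }
    }

    /*
     * Plan-time partitions whose OIDs no longer appear in partdesc->oids were
     * detached concurrently; their subplans are simply never chosen, i.e.
     * they behave as if pruned.
     */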


PS: the pg_advisory_unlock() calls are necessary, because otherwise the
session that first succeeds in the try-lock call retains the lock for
the whole duration of the pgbench script, so the other sessions always
skip the "\if :gotlock" block.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"I'm impressed how quickly you are fixing this obscure issue. I came from 
MS SQL and it would be hard for me to put into words how much of a better job
you all are doing on [PostgreSQL]."
 Steve Midgley, http://archives.postgresql.org/pgsql-sql/2008-08/msg00000.php

Attachments