Обсуждение: BUG #18238: Cross-partitition MERGE/UPDATE with delete-preventing trigger leads to incorrect memory access

Поиск
Список
Период
Сортировка

BUG #18238: Cross-partitition MERGE/UPDATE with delete-preventing trigger leads to incorrect memory access

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      18238
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 16.1
Operating system:   Ubuntu 22.04
Description:

When the following query:
CREATE TABLE t (a int) PARTITION BY LIST (a);
CREATE TABLE tp1 PARTITION OF t FOR VALUES IN (1);
CREATE TABLE tp2 PARTITION OF t FOR VALUES IN (2);
INSERT INTO t VALUES (1);

CREATE FUNCTION tf() RETURNS TRIGGER LANGUAGE plpgsql AS
  $$ BEGIN RETURN NULL; END; $$;

CREATE TRIGGER tr BEFORE DELETE ON t
  FOR EACH ROW EXECUTE PROCEDURE tf();

MERGE INTO t USING t st ON TRUE WHEN MATCHED THEN UPDATE SET a = 2;

executed under Valgrind, an incorrect memory access detected:
2023-12-09 13:31:37.178 UTC|law|regression|65744207.29a672|LOG:  statement:
MERGE INTO t USING t st ON TRUE
        WHEN MATCHED THEN UPDATE SET a = 2;
==00:00:00:08.207 2729586== Conditional jump or move depends on
uninitialised value(s)
==00:00:00:08.207 2729586==    at 0x43DB3A: ItemPointerIsValid
(itemptr.h:85)
==00:00:00:08.207 2729586==    by 0x43DB3A: ItemPointerGetOffsetNumber
(itemptr.h:126)
==00:00:00:08.207 2729586==    by 0x43DB3A:
ItemPointerIndicatesMovedPartitions (itemptr.h:200)
==00:00:00:08.207 2729586==    by 0x43DB3A: ExecMergeMatched
(nodeModifyTable.c:3051)
==00:00:00:08.207 2729586==    by 0x43DE41: ExecMerge
(nodeModifyTable.c:2762)
==00:00:00:08.207 2729586==    by 0x43E846: ExecModifyTable
(nodeModifyTable.c:3850)
==00:00:00:08.207 2729586==    by 0x4106FE: ExecProcNodeFirst
(execProcnode.c:464)
==00:00:00:08.207 2729586==    by 0x408CC1: ExecProcNode (executor.h:273)
==00:00:00:08.207 2729586==    by 0x408CC1: ExecutePlan (execMain.c:1670)
==00:00:00:08.207 2729586==    by 0x408E84: standard_ExecutorRun
(execMain.c:365)
==00:00:00:08.207 2729586==    by 0x408F5E: ExecutorRun (execMain.c:309)
==00:00:00:08.207 2729586==    by 0x5F032E: ProcessQuery (pquery.c:160)
==00:00:00:08.207 2729586==    by 0x5F0EBD: PortalRunMulti (pquery.c:1277)
==00:00:00:08.207 2729586==    by 0x5F1457: PortalRun (pquery.c:791)
==00:00:00:08.207 2729586==    by 0x5ED5F4: exec_simple_query
(postgres.c:1274)
==00:00:00:08.207 2729586==    by 0x5EF502: PostgresMain (postgres.c:4637)
==00:00:00:08.207 2729586==  Uninitialised value was created by a stack
allocation
==00:00:00:08.207 2729586==    at 0x3D8663: GetTupleForTrigger
(trigger.c:3283)
==00:00:00:08.207 2729586==

Without Valgrind, I saw a failed assertion:
#5  0x00005609515b766e in ExceptionalCondition
(conditionName=conditionName@entry=0x56095161309a
"ItemPointerIsValid(pointer)", fileName=fileName@entry=0x56095163dc18
"../../../src/include/storage/itemptr.h", lineNumber=lineNumber@entry=126)
at assert.c:66

and
ERROR:  could not open file "base/16384/16393.1" (target block 3081245700):
previous segment is only 1 blocks

Reproduced on REL_15_STABLE .. master.
First bad commit for this anomaly is 9321c79c8.


On Sun, Dec 10, 2023 at 1:10 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      18238
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 16.1
> Operating system:   Ubuntu 22.04
> Description:
>
> When the following query:
> CREATE TABLE t (a int) PARTITION BY LIST (a);
> CREATE TABLE tp1 PARTITION OF t FOR VALUES IN (1);
> CREATE TABLE tp2 PARTITION OF t FOR VALUES IN (2);
> INSERT INTO t VALUES (1);
>
> CREATE FUNCTION tf() RETURNS TRIGGER LANGUAGE plpgsql AS
>   $$ BEGIN RETURN NULL; END; $$;
>
> CREATE TRIGGER tr BEFORE DELETE ON t
>   FOR EACH ROW EXECUTE PROCEDURE tf();
>
> MERGE INTO t USING t st ON TRUE WHEN MATCHED THEN UPDATE SET a = 2;
>

--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1828,10 +1828,10 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
                 * additional rechecking, and might end up executing a different
                 * action entirely).
                 */
-               if (context->relaction != NULL)
-                       return false;
-               else if (TupIsNull(epqslot))
+               if (TupIsNull(epqslot))
                        return true;
+               else if (context->relaction != NULL)
+                       return false;
                else
                {
                        /* Fetch the most recent version of old tuple. */

seems to work.
but now the new command tag is MERGE 1. should be MERGE 0.



On Sun, Dec 10, 2023 at 5:55 PM jian he <jian.universality@gmail.com> wrote:
>
> On Sun, Dec 10, 2023 at 1:10 AM PG Bug reporting form
> <noreply@postgresql.org> wrote:
> >
> > The following bug has been logged on the website:
> >
> > Bug reference:      18238
> > Logged by:          Alexander Lakhin
> > Email address:      exclusion@gmail.com
> > PostgreSQL version: 16.1
> > Operating system:   Ubuntu 22.04
> > Description:
> >
> > When the following query:
> > CREATE TABLE t (a int) PARTITION BY LIST (a);
> > CREATE TABLE tp1 PARTITION OF t FOR VALUES IN (1);
> > CREATE TABLE tp2 PARTITION OF t FOR VALUES IN (2);
> > INSERT INTO t VALUES (1);
> >
> > CREATE FUNCTION tf() RETURNS TRIGGER LANGUAGE plpgsql AS
> >   $$ BEGIN RETURN NULL; END; $$;
> >
> > CREATE TRIGGER tr BEFORE DELETE ON t
> >   FOR EACH ROW EXECUTE PROCEDURE tf();
> >
> > MERGE INTO t USING t st ON TRUE WHEN MATCHED THEN UPDATE SET a = 2;
> >
>
> --- a/src/backend/executor/nodeModifyTable.c
> +++ b/src/backend/executor/nodeModifyTable.c
> @@ -1828,10 +1828,10 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
>                  * additional rechecking, and might end up executing a different
>                  * action entirely).
>                  */
> -               if (context->relaction != NULL)
> -                       return false;
> -               else if (TupIsNull(epqslot))
> +               if (TupIsNull(epqslot))
>                         return true;
> +               else if (context->relaction != NULL)
> +                       return false;
>                 else
>                 {
>                         /* Fetch the most recent version of old tuple. */
>
> seems to work.
> but now the new command tag is MERGE 1. should be MERGE 0.

This should be ok. Now the command tag is MERGE 0.
But we may need extra words to document this.

diff --git a/src/backend/executor/nodeModifyTable.c
b/src/backend/executor/nodeModifyTable.c
index b16fbe9e..854b1ed4 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1828,10 +1828,10 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
                 * additional rechecking, and might end up executing a different
                 * action entirely).
                 */
-               if (context->relaction != NULL)
-                       return false;
-               else if (TupIsNull(epqslot))
+               if (TupIsNull(epqslot))
                        return true;
+               else if (context->relaction != NULL)
+                       return false;
                else
                {
                        /* Fetch the most recent version of old tuple. */
@@ -2909,9 +2909,12 @@ lmerge_matched:
                                 */
                                if (updateCxt.crossPartUpdate)
                                {
-                                       mtstate->mt_merge_updated += 1;
-                                       if (canSetTag)
-                                               (estate->es_processed)++;
+                                       if (context->cpUpdateReturningSlot)
+                                       {
+                                               mtstate->mt_merge_updated += 1;
+                                               if (canSetTag)
+
(estate->es_processed)++;
+                                       }
                                        return true;
                                }



On Sun, 10 Dec 2023 at 12:20, jian he <jian.universality@gmail.com> wrote:
>
> > On Sun, Dec 10, 2023 at 1:10 AM PG Bug reporting form
> > <noreply@postgresql.org> wrote:
> > >
> > > CREATE TABLE t (a int) PARTITION BY LIST (a);
> > > CREATE TABLE tp1 PARTITION OF t FOR VALUES IN (1);
> > > CREATE TABLE tp2 PARTITION OF t FOR VALUES IN (2);
> > > INSERT INTO t VALUES (1);
> > >
> > > CREATE FUNCTION tf() RETURNS TRIGGER LANGUAGE plpgsql AS
> > >   $$ BEGIN RETURN NULL; END; $$;
> > >
> > > CREATE TRIGGER tr BEFORE DELETE ON t
> > >   FOR EACH ROW EXECUTE PROCEDURE tf();
> > >
> > > MERGE INTO t USING t st ON TRUE WHEN MATCHED THEN UPDATE SET a = 2;
> > >

Ah, yes. Nice catch! The cross-partition MERGE code is evidently still
not right.


> > --- a/src/backend/executor/nodeModifyTable.c
> > +++ b/src/backend/executor/nodeModifyTable.c
> > @@ -1828,10 +1828,10 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
> >                  * additional rechecking, and might end up executing a different
> >                  * action entirely).
> >                  */
> > -               if (context->relaction != NULL)
> > -                       return false;
> > -               else if (TupIsNull(epqslot))
> > +               if (TupIsNull(epqslot))
> >                         return true;
> > +               else if (context->relaction != NULL)
> > +                       return false;
> >                 else
> >                 {
> >                         /* Fetch the most recent version of old tuple. */
> >
> > seems to work.

No, that's not right. Doing that breaks the isolation tests, because
it causes it to fail to spot concurrent updates. What this code needs
to do is differentiate between a delete blocked by a concurrent
update, and a delete blocked by triggers.

The easiest way to do that is to have ExecDelete() return the
TM_Result status to ExecCrossPartitionUpdate(). I think
ExecCrossPartitionUpdate() should then return the TM_Result status to
ExecUpdateAct(), so that it can return the correct status, rather than
just assuming TM_Updated on failure, though possibly that's not
strictly necessary.


> > but now the new command tag is MERGE 1. should be MERGE 0.
>
> @@ -2909,9 +2909,12 @@ lmerge_matched:
>                                  */
>                                 if (updateCxt.crossPartUpdate)
>                                 {
> -                                       mtstate->mt_merge_updated += 1;
> -                                       if (canSetTag)
> -                                               (estate->es_processed)++;
> +                                       if (context->cpUpdateReturningSlot)
> +                                       {
> +                                               mtstate->mt_merge_updated += 1;
> +                                               if (canSetTag)
> +
> (estate->es_processed)++;
> +                                       }
>                                         return true;
>                                 }
>

No, that's not right. cpUpdateReturningSlot will currently always be
NULL in a MERGE, so that'll cause it to never update the command tag.

The simplest fix is for ExecMergeMatched() to pass canSetTag to
ExecUpdateAct(), so that it updates the command tag, if it does a
cross-partition update. That makes that code path in
ExecMergeMatched() more consistent with ExecUpdate().

So I think we need something like the attached.

Regards,
Dean

Вложения

On Mon, Dec 11, 2023 at 8:53 PM Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
The easiest way to do that is to have ExecDelete() return the
TM_Result status to ExecCrossPartitionUpdate(). I think
ExecCrossPartitionUpdate() should then return the TM_Result status to
ExecUpdateAct(), so that it can return the correct status, rather than
just assuming TM_Updated on failure, though possibly that's not
strictly necessary.

The simplest fix is for ExecMergeMatched() to pass canSetTag to
ExecUpdateAct(), so that it updates the command tag, if it does a
cross-partition update. That makes that code path in
ExecMergeMatched() more consistent with ExecUpdate().

So I think we need something like the attached.

I think this is the right way to go.  +1.

BTW, while testing this patch, I encountered some confusion regarding
cross-partition update.  As we know, cross-partition update works by
first deleting the old tuple from the current partition.  So if we have
BEFORE ROW DELETE triggers that suppress the delete, the update would be
suppressed.  For in-partition update, there is no such problem.  For
instance (based on Alexander's query):

CREATE TABLE t (a int) PARTITION BY RANGE (a);
CREATE TABLE tp1 PARTITION OF t FOR VALUES FROM (0) TO (10);
CREATE TABLE tp2 PARTITION OF t FOR VALUES FROM (10) TO (20);

INSERT INTO t VALUES (0), (5);    -- into tp1
INSERT INTO t VALUES (10), (15);  -- into tp2

CREATE FUNCTION tf() RETURNS TRIGGER LANGUAGE plpgsql AS
  $$ BEGIN RETURN NULL; END; $$;

CREATE TRIGGER tr BEFORE DELETE ON t
  FOR EACH ROW EXECUTE PROCEDURE tf();

-- update suppressed by the BEFORE ROW DELETE trigger
# UPDATE t SET a = 15 WHERE a = 0;
UPDATE 0

-- update performed successfully
# UPDATE t SET a = 5 WHERE a = 0;
UPDATE 1

Does this match the expected behavior?

Thanks
Richard
On Tue, 12 Dec 2023 at 09:50, Richard Guo <guofenglinux@gmail.com> wrote:
>
> On Mon, Dec 11, 2023 at 8:53 PM Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>>
>> So I think we need something like the attached.
>
> I think this is the right way to go.  +1.
>

Thanks for looking!

> BTW, while testing this patch, I encountered some confusion regarding
> cross-partition update.  As we know, cross-partition update works by
> first deleting the old tuple from the current partition.  So if we have
> BEFORE ROW DELETE triggers that suppress the delete, the update would be
> suppressed.  For in-partition update, there is no such problem.
>
> Does this match the expected behavior?
>

Yes, that's the intended behaviour. I think it's probably sufficiently
well-covered by the docs here:

https://www.postgresql.org/docs/current/trigger-definition.html#:~:text=If%20an%20UPDATE,the%20destination%20partition.

  If an UPDATE on a partitioned table causes a row to move to another
  partition, it will be performed as a DELETE from the original partition
  followed by an INSERT into the new partition. In this case, all
  row-level BEFORE UPDATE triggers and all row-level BEFORE DELETE triggers
  are fired on the original partition. Then all row-level BEFORE INSERT
  triggers are fired on the destination partition.

and then a couple of paragraphs further down, it mentions how a
row-level BEFORE trigger can return NULL to cause an operation to be
skipped.

So a BEFORE UPDATE trigger can block any kind of update, including a
cross-partition update, whereas a BEFORE DELETE trigger can prevent
rows changing partitions, while allowing other kinds of updates. That
might be quite handy under some circumstances, but it would also block
deletes, so it may not be ideal for all cases.

Anyway, that's what it's supposed to do.

Regards,
Dean




On Tue, Dec 12, 2023 at 9:16 PM Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
On Tue, 12 Dec 2023 at 09:50, Richard Guo <guofenglinux@gmail.com> wrote:
> BTW, while testing this patch, I encountered some confusion regarding
> cross-partition update.  As we know, cross-partition update works by
> first deleting the old tuple from the current partition.  So if we have
> BEFORE ROW DELETE triggers that suppress the delete, the update would be
> suppressed.  For in-partition update, there is no such problem.
>
> Does this match the expected behavior?

Yes, that's the intended behaviour. I think it's probably sufficiently
well-covered by the docs here:

https://www.postgresql.org/docs/current/trigger-definition.html#:~:text=If%20an%20UPDATE,the%20destination%20partition.

  If an UPDATE on a partitioned table causes a row to move to another
  partition, it will be performed as a DELETE from the original partition
  followed by an INSERT into the new partition. In this case, all
  row-level BEFORE UPDATE triggers and all row-level BEFORE DELETE triggers
  are fired on the original partition. Then all row-level BEFORE INSERT
  triggers are fired on the destination partition.

and then a couple of paragraphs further down, it mentions how a
row-level BEFORE trigger can return NULL to cause an operation to be
skipped.

So a BEFORE UPDATE trigger can block any kind of update, including a
cross-partition update, whereas a BEFORE DELETE trigger can prevent
rows changing partitions, while allowing other kinds of updates. That
might be quite handy under some circumstances, but it would also block
deletes, so it may not be ideal for all cases.

Anyway, that's what it's supposed to do.

Ah, I see.  Thanks for the detailed explanation!

Thanks
Richard
sorry for the noise.
generally I only do regress tests, I should do regress and isolation tests.
your patch is correct, especially ExecCrossPartitionUpdate canSetTag
logic is very intuitive to me.

The test example is so good. Maybe it takes a bit of time for people
to comprehend. (i think)

+-- as above, but blocked by BEFORE DELETE ROW trigger
+BEGIN;
+CREATE FUNCTION trig_fn() RETURNS trigger LANGUAGE plpgsql AS
+  $$ BEGIN RETURN NULL; END; $$;
+CREATE TRIGGER del_trig BEFORE DELETE ON pa_target
+  FOR EACH ROW EXECUTE PROCEDURE trig_fn();
+DO $$
+DECLARE
+  result integer;
+BEGIN
+MERGE INTO pa_target t
+  USING pa_source s
+  ON t.tid = s.sid
+  WHEN MATCHED THEN
+    UPDATE SET tid = tid + 1, balance = balance + delta, val = val ||
' updated by merge'
+  WHEN NOT MATCHED THEN
+    INSERT VALUES (sid, delta, 'inserted by merge');
+GET DIAGNOSTICS result := ROW_COUNT;
+RAISE NOTICE 'ROW_COUNT = %', result;
+END;
+$$;
+SELECT * FROM pa_target ORDER BY tid;
+ROLLBACK;

in
+SELECT * FROM pa_target ORDER BY tid;
I would change to

SELECT  pt.*, pg_get_expr(pc.relpartbound, pt.tableoid, TRUE)
FROM    pa_target  pt
JOIN    pg_class   pc ON pc.oid = pt.tableoid
ORDER BY val,tid;

that would make the result more understandable, maybe i am over engineering.
Anyway, that's my personal preference.



On Wed, 13 Dec 2023 at 03:36, jian he <jian.universality@gmail.com> wrote:
>
> your patch is correct, especially ExecCrossPartitionUpdate canSetTag
> logic is very intuitive to me.
>

Thanks for checking. Patch pushed and back-patched.

> +SELECT * FROM pa_target ORDER BY tid;
> I would change to
>
> SELECT  pt.*, pg_get_expr(pc.relpartbound, pt.tableoid, TRUE)
> FROM    pa_target  pt
> JOIN    pg_class   pc ON pc.oid = pt.tableoid
> ORDER BY val,tid;
>
> that would make the result more understandable, maybe i am over engineering.
> Anyway, that's my personal preference.

I decided not to do that, just to keep it simpler, and consistent with
the preceding tests.

Regards,
Dean