Thread: race condition in pg_class


race condition in pg_class

From:
Smolkin Grigory
Date:
Hello, hackers!
We are running PG13.10 and recently we have encountered what appears to be a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and some other catalog-writer, possibly ANALYZE.
The problem is that after successfully creating an index on a relation (which previously didn't have any indexes), its pg_class.relhasindex remains set to "false", which is illegal, I think.
Index was built using the following statement:
ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);

PG_CLASS:
# select ctid,oid,xmin,xmax, * from pg_class where oid = 3566558198;
-[ RECORD 1 ]-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ctid                | (12,49)
oid                 | 3566558198
xmin                | 1202298791
xmax                | 0
relname             | example
relnamespace        | 16479
reltype             | 3566558200
reloftype           | 0
relowner            | 16386
relam               | 2
relfilenode         | 3566558198
reltablespace       | 0
relpages            | 152251
reltuples           | 1.1565544e+07
relallvisible       | 127837
reltoastrelid       | 3566558203
relhasindex         | f
relisshared         | f
relpersistence      | p
relkind             | r
relnatts            | 12
relchecks           | 0
relhasrules         | f
relhastriggers      | f
relhassubclass      | f
relrowsecurity      | f
relforcerowsecurity | f
relispopulated      | t
relreplident        | d
relispartition      | f
relrewrite          | 0
relfrozenxid        | 1201647807
relminmxid          | 1
relacl              |
reloptions          |
relpartbound        |

PG_INDEX:
# select ctid,xmin,xmax,indexrelid::regclass,indrelid::regclass, * from pg_index where indexrelid = 3569625749;
-[ RECORD 1 ]--+---------------------------------------------
ctid           | (3,30)
xmin           | 1202295045
xmax           | 0
indexrelid     | "example_pkey"
indrelid       | "example"
indexrelid     | 3569625749
indrelid       | 3566558198
indnatts       | 1
indnkeyatts    | 1
indisunique    | t
indisprimary   | t
indisexclusion | f
indimmediate   | t
indisclustered | f
indisvalid     | t
indcheckxmin   | f
indisready     | t
indislive      | t
indisreplident | f
indkey         | 1
indcollation   | 0
indclass       | 3124
indoption      | 0
indexprs       |
indpred        |

Looking into the WAL via waldump gave us the following picture (the full waldump output is attached):

tx: 1202295045, lsn: AAB1/D38378D0, prev AAB1/D3837208, desc: FPI , blkref #0: rel 1663/16387/3569625749 blk 0 FPW
tx: 1202298790, lsn: AAB1/D3912EC0, prev AAB1/D3912E80, desc: NEW_CID rel 1663/16387/1259; tid 6/24; cmin: 0, cmax: 4294967295, combo: 4294967295
tx: 1202298790, lsn: AAB1/D3927580, prev AAB1/D3926988, desc: COMMIT 2023-10-04 22:41:23.863979 UTC
tx: 1202298791, lsn: AAB1/D393C230, prev AAB1/D393C1F0, desc: HOT_UPDATE off 24 xmax 1202298791 flags 0x20 ; new off 45 xmax 0, blkref #0: rel 1663/16387/1259 blk 6
tx: 1202298791, lsn: AAB1/D394ADA0, prev AAB1/D394AD60, desc: UPDATE off 45 xmax 1202298791 flags 0x00 ; new off 28 xmax 0, blkref #0: rel 1663/16387/1259 blk 5, blkref #1: rel 1663/16387/1259 blk 6
tx: 1202298791, lsn: AAB1/D3961088, prev AAB1/D3961048, desc: NEW_CID rel 1663/16387/1259; tid 12/49; cmin: 50, cmax: 4294967295, combo: 4294967295
tx: 1202295045, lsn: AAB1/D3962E78, prev AAB1/D3962E28, desc: INPLACE off 24, blkref #0: rel 1663/16387/1259 blk 6
tx: 1202295045, lsn: AAB1/D39632A0, prev AAB1/D3963250, desc: COMMIT 2023-10-04 22:41:23.878565 UTC
tx: 1202298791, lsn: AAB1/D3973420, prev AAB1/D39733D0, desc: COMMIT 2023-10-04 22:41:23.884951 UTC

1202295045 is the CREATE INDEX statement.
1202298790 and 1202298791 are some other concurrent operations; unfortunately, I wasn't able to determine what they are.

So it looks like 1202295045 updated tuple (6,24) in pg_class in place, at a point when that tuple's xmax had already been set by 1202298791 and a new tuple had already been created at (12,49).
So after 1202298791 committed, that inplace update was effectively lost.
If we do an inclusive PITR with (recovery_target_xid = 1202295045), we can see the following picture (notice relhasindex and xmax):

# select ctid,oid, xmin,xmax,relhasindex,cmin,cmax from pg_class where oid = 3566558198;
-[ RECORD 1 ]-----------
ctid        | (6,24)
oid         | 3566558198
xmin        | 1202298790
xmax        | 1202298791
relhasindex | t
cmin        | 0
cmax        | 0

I've tried to reproduce this scenario with CREATE INDEX and various concurrent statements, but no luck.
Attached is the full waldump output for the relevant WAL segment.
Attachments

Re: race condition in pg_class

From:
"Andrey M. Borodin"
Date:

> On 25 Oct 2023, at 13:39, Smolkin Grigory <smallkeen@gmail.com> wrote:
>
> We are running PG13.10 and recently we have encountered what appears to be a bug due to some race condition between
> ALTER TABLE ... ADD CONSTRAINT and some other catalog-writer, possibly ANALYZE.

> I've tried to reproduce this scenario with CREATE INDEX and various concurrent statements, but no luck.

Maybe it would be possible to reproduce by modifying the tests for concurrent index creation, for example by adding
“ANALYZE” here [0].
Keep in mind that for easier reproduction it would make sense to increase the transaction count radically.


Best regards, Andrey Borodin.


[0] https://github.com/postgres/postgres/blob/master/contrib/amcheck/t/002_cic.pl#L34




Re: race condition in pg_class

From:
Tom Lane
Date:
Smolkin Grigory <smallkeen@gmail.com> writes:
> We are running PG13.10 and recently we have encountered what appears to be
> a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> some other catalog-writer, possibly ANALYZE.
> The problem is that after successfully creating index on relation (which
> previously didn't have any indexes), its pg_class.relhasindex remains set to
> "false", which is illegal, I think.
> Index was built using the following statement:
> ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);

ALTER TABLE ADD CONSTRAINT would certainly have taken
AccessExclusiveLock on the "example" table, which should be sufficient
to prevent anything else from touching its pg_class row.  The only
mechanism I can think of that might bypass that is a manual UPDATE on
pg_class, which would just manipulate the row as a row without concern
for associated relation-level locks.  Any chance that somebody was
doing something like that?

            regards, tom lane



Re: race condition in pg_class

From:
Smolkin Grigory
Date:
> ALTER TABLE ADD CONSTRAINT would certainly have taken
> AccessExclusiveLock on the "example" table, which should be sufficient
> to prevent anything else from touching its pg_class row.  The only
> mechanism I can think of that might bypass that is a manual UPDATE on
> pg_class, which would just manipulate the row as a row without concern
> for associated relation-level locks.  Any chance that somebody was
> doing something like that?

No chance. Our infrastructure doesn't do that, and users simply don't have the privileges to mess with pg_catalog.

On Wed, Oct 25, 2023 at 21:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Smolkin Grigory <smallkeen@gmail.com> writes:
> We are running PG13.10 and recently we have encountered what appears to be
> a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> some other catalog-writer, possibly ANALYZE.
> The problem is that after successfully creating index on relation (which
> previously didn't have any indexes), its pg_class.relhasindex remains set to
> "false", which is illegal, I think.
> Index was built using the following statement:
> ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);

ALTER TABLE ADD CONSTRAINT would certainly have taken
AccessExclusiveLock on the "example" table, which should be sufficient
to prevent anything else from touching its pg_class row.  The only
mechanism I can think of that might bypass that is a manual UPDATE on
pg_class, which would just manipulate the row as a row without concern
for associated relation-level locks.  Any chance that somebody was
doing something like that?

                        regards, tom lane

Re: race condition in pg_class

From:
Noah Misch
Date:
On Wed, Oct 25, 2023 at 01:39:41PM +0300, Smolkin Grigory wrote:
> We are running PG13.10 and recently we have encountered what appears to be
> a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> some other catalog-writer, possibly ANALYZE.
> The problem is that after successfully creating index on relation (which
> previously didn't have any indexes), its pg_class.relhasindex remains set to
> "false", which is illegal, I think.
> Index was built using the following statement:
> ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);

This is going to be a problem with any operation that does a transactional
pg_class update without taking a lock that conflicts with ShareLock.  GRANT
doesn't lock the table at all, so I can reproduce this in v17 as follows:

== session 1
create table t (c int);
begin;
grant select on t to public;

== session 2
alter table t add primary key (c);

== back in session 1
commit;
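
A quick way to see the symptom afterward (a sketch; it assumes the table "t" and
the default index name "t_pkey" from the steps above):

== in either session, after the commit
-- relhasindex is expected to still read "f" if the race hits as described
select relhasindex from pg_class where oid = 't'::regclass;
-- while the primary key index does exist
select indexrelid::regclass from pg_index where indrelid = 't'::regclass;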


We'll likely need to change how we maintain relhasindex or perhaps take a lock
in GRANT.

> Looking into the WAL via waldump gave us the following picture (full
> waldump output is attached):

> 1202295045 - create index statement
> 1202298790 and 1202298791 are some other concurrent operations,
> unfortunately I wasn't able to determine what they are

Can you explore that as follows?

- PITR to just before the COMMIT record.
- Save all rows of pg_class.
- PITR to just after the COMMIT record.
- Save all rows of pg_class.
- Diff the two sets of saved rows.
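
For example, the rows could be captured at each stop from psql like this (a
sketch; the file names are only placeholders):

\copy (select * from pg_class order by oid) to 'pg_class_before.csv' csv header
-- ... run the second PITR, then ...
\copy (select * from pg_class order by oid) to 'pg_class_after.csv' csv header
-- and diff the two files outside the database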

Which columns changed?  The evidence you've shown would be consistent with a
transaction doing GRANT or REVOKE on dozens of tables.  If the changed column
is something other than relacl, that would be great to know.

On the off-chance it's relevant, what extensions do you have (\dx in psql)?



Re: race condition in pg_class

From:
Smolkin Grigory
Date:
> This is going to be a problem with any operation that does a transactional
> pg_class update without taking a lock that conflicts with ShareLock.  GRANT
> doesn't lock the table at all, so I can reproduce this in v17 as follows:
>
> == session 1
> create table t (c int);
> begin;
> grant select on t to public;
>
> == session 2
> alter table t add primary key (c);
>
> == back in session 1
> commit;
>
>
> We'll likely need to change how we maintain relhasindex or perhaps take a lock
> in GRANT.

Oh, that explains it. Thank you very much.

> Can you explore that as follows?
>
>- PITR to just before the COMMIT record.
>- Save all rows of pg_class.
>- PITR to just after the COMMIT record.
>- Save all rows of pg_class.
>- Diff the two sets of saved rows.

Sure, but it will take some time; it's a large DB with lots of WAL segments to apply.

> extensions

      extname       | extversion
--------------------+------------
 plpgsql            | 1.0
 pg_stat_statements | 1.8
 pg_buffercache     | 1.3
 pgstattuple        | 1.5

Re: race condition in pg_class

From:
Noah Misch
Date:
On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
> On Wed, Oct 25, 2023 at 01:39:41PM +0300, Smolkin Grigory wrote:
> > We are running PG13.10 and recently we have encountered what appears to be
> > a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> > some other catalog-writer, possibly ANALYZE.
> > The problem is that after successfully creating index on relation (which
> > previously didn't have any indexes), its pg_class.relhasindex remains set to
> > "false", which is illegal, I think.

It's damaging.  The table will behave like it has no indexes.  If something
adds an index later, old indexes will reappear, corrupt, having not received
updates during the relhasindex=false era.  ("pg_amcheck --heapallindexed" can
detect this.)
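
A rough catalog-level check for tables already in that state (just a sketch, not
part of the patches discussed here; relhasindex=false with rows in pg_index
should never happen in a healthy catalog):

select c.oid::regclass
from pg_class c
where not c.relhasindex
  and exists (select 1 from pg_index i where i.indrelid = c.oid);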

> > Index was built using the following statement:
> > ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);
> 
> This is going to be a problem with any operation that does a transactional
> pg_class update without taking a lock that conflicts with ShareLock.  GRANT
> doesn't lock the table at all, so I can reproduce this in v17 as follows:
> 
> == session 1
> create table t (c int);
> begin;
> grant select on t to public;
> 
> == session 2
> alter table t add primary key (c);
> 
> == back in session 1
> commit;
> 
> 
> We'll likely need to change how we maintain relhasindex or perhaps take a lock
> in GRANT.

The main choice is accepting more DDL blocking vs. accepting inefficient
relcache builds.  Options I'm seeing:

=== "more DDL blocking" option family

B1. Take ShareUpdateExclusiveLock in GRANT, REVOKE, and anything that makes
    transactional pg_class updates without holding some stronger lock.  New
    asserts could catch future commands failing to do this.

B2. Take some shorter-lived lock around pg_class tuple formation, such that
    GRANT blocks CREATE INDEX, but two CREATE INDEX don't block each other.
    Anything performing a transactional update of a pg_class row would acquire
    the lock in exclusive mode before fetching the old tuple and hold it till
    end of transaction.  relhasindex=true in-place updates would acquire it
    the same way, but they would release it after the inplace update.  I
    expect a new heavyweight lock type, e.g. LOCKTAG_RELATION_DEFINITION, with
    the same key as LOCKTAG_RELATION.  This has less blocking than the
    previous option, but it's more complicated to explain to both users and
    developers.

B3. Have CREATE INDEX do an EvalPlanQual()-style thing to update all successor
    tuple versions.  Like the previous option, this would require new locking,
    but the new lock would not need to persist till end of xact.  It would be
    even more complicated to explain to users and developers.  (If this is
    promising enough to warrant more detail, let me know.)

B4. Use transactional updates to set relhasindex=true.  Two CREATE INDEX
    commands on the same table would block each other.  If we did it the way
    most DDL does today, they'd get "tuple concurrently updated" failures
    after the blocking ends.

=== "inefficient relcache builds" option family

R1. Ignore relhasindex; possibly remove it in v17.  Relcache builds et
    al. will issue more superfluous queries.

R2. As a weird variant of the previous option, keep relhasindex and make all
    transactional updates of pg_class set relhasindex=true pessimistically.
    (VACUUM will set it back to false.)

=== other

O1. This is another case where the sometimes-discussed "pg_class_nt" for
    nontransactional columns would help.  I'm ruling that out as too hard to
    back-patch.


Are there other options important to consider?  I currently like (B1) the
most, followed closely by (R1) and (B2).  A key unknown is the prevalence of
index-free tables.  Low prevalence would argue in favor of (R1).  In my
limited experience, they've been rare.  That said, I assume relcache builds
happen a lot more than GRANTs, so it's harder to bound the damage from (R1)
compared to the damage from (B1).  Thoughts on this decision?

Thanks,
nm



Re: race condition in pg_class

From:
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
>> We'll likely need to change how we maintain relhasindex or perhaps take a lock
>> in GRANT.

> The main choice is accepting more DDL blocking vs. accepting inefficient
> relcache builds.  Options I'm seeing:

It looks to me like you're only thinking about relhasindex, but it
seems to me that any call of heap_inplace_update brings some
risk of this kind.  Excluding the bootstrap-mode-only usage in
create_toast_table, I see four callers:

* index_update_stats updating a pg_class tuple's
  relhasindex, relpages, reltuples, relallvisible

* vac_update_relstats updating a pg_class tuple's
  relpages, reltuples, relallvisible, relhasindex, relhasrules,
  relhastriggers, relfrozenxid, relminmxid

* vac_update_datfrozenxid updating a pg_database tuple's
  datfrozenxid, datminmxid

* dropdb updating a pg_database tuple's datconnlimit

So we have just as much of a problem with GRANTs on databases
as GRANTs on relations.  Also, it looks like we can lose
knowledge of the presence of rules and triggers, which seems
nearly as bad as forgetting about indexes.  The rest of these
updates might not be correctness-critical, although I wonder
how bollixed things could get if we forget an advancement of
relfrozenxid or datfrozenxid (especially if the calling
transaction goes on to make other changes that assume that
the update happened).

BTW, vac_update_datfrozenxid believes (correctly I think) that
it cannot use the syscache copy of a tuple as the basis for in-place
update, because syscache will have detoasted any toastable fields.
These other callers are ignoring that, which seems like it should
result in heap_inplace_update failing with "wrong tuple length".
I wonder how come we're not seeing reports of that from the field.

I'm inclined to propose that heap_inplace_update should check to
make sure that it's operating on the latest version of the tuple
(including, I guess, waiting for an uncommitted update?) and throw
error if not.  I think this is what your B3 option is, but maybe
I misinterpreted.  It might be better to throw error immediately
instead of waiting to see if the other updater commits.

            regards, tom lane



Re: race condition in pg_class

From:
Noah Misch
Date:
On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
> >> We'll likely need to change how we maintain relhasindex or perhaps take a lock
> >> in GRANT.
> 
> > The main choice is accepting more DDL blocking vs. accepting inefficient
> > relcache builds.  Options I'm seeing:
> 
> It looks to me like you're only thinking about relhasindex, but it
> seems to me that any call of heap_inplace_update brings some
> risk of this kind.  Excluding the bootstrap-mode-only usage in
> create_toast_table, I see four callers:
> 
> * index_update_stats updating a pg_class tuple's
>   relhasindex, relpages, reltuples, relallvisible
> 
> * vac_update_relstats updating a pg_class tuple's
>   relpages, reltuples, relallvisible, relhasindex, relhasrules,
>   relhastriggers, relfrozenxid, relminmxid
> 
> * vac_update_datfrozenxid updating a pg_database tuple's
>   datfrozenxid, datminmxid
> 
> * dropdb updating a pg_database tuple's datconnlimit
> 
> So we have just as much of a problem with GRANTs on databases
> as GRANTs on relations.  Also, it looks like we can lose
> knowledge of the presence of rules and triggers, which seems
> nearly as bad as forgetting about indexes.  The rest of these
> updates might not be correctness-critical, although I wonder
> how bollixed things could get if we forget an advancement of
> relfrozenxid or datfrozenxid (especially if the calling
> transaction goes on to make other changes that assume that
> the update happened).

Thanks for researching that.  Let's treat frozenxid stuff as critical; I
wouldn't want to advance XID limits based on a datfrozenxid that later gets
rolled back.  I agree relhasrules and relhastriggers are also critical.  The
"inefficient relcache builds" option family can't solve cases like
relfrozenxid and datconnlimit, so that leaves us with the "more DDL blocking"
option family.

> BTW, vac_update_datfrozenxid believes (correctly I think) that
> it cannot use the syscache copy of a tuple as the basis for in-place
> update, because syscache will have detoasted any toastable fields.
> These other callers are ignoring that, which seems like it should
> result in heap_inplace_update failing with "wrong tuple length".
> I wonder how come we're not seeing reports of that from the field.

Good question.  Perhaps we'll need some test cases that exercise each inplace
update against a row having a toast pointer.  It's too easy to go a long time
without encountering those in the field.

> I'm inclined to propose that heap_inplace_update should check to
> make sure that it's operating on the latest version of the tuple
> (including, I guess, waiting for an uncommitted update?) and throw
> error if not.  I think this is what your B3 option is, but maybe
> I misinterpreted.  It might be better to throw error immediately
> instead of waiting to see if the other updater commits.

That's perhaps closer to B2.  To be pedantic, B3 was about not failing or
waiting for GRANT to commit but instead inplace-updating every member of the
update chain.  For B2, I was thinking we don't need to error.  There are two
problematic orders of events.  The easy one is heap_inplace_update() mutating
a tuple that already has an xmax.  That's the one in the test case upthread,
and detecting it is trivial.  The harder one is heap_inplace_update() mutating
a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().
I anticipate a new locktag per catalog that can receive inplace updates,
i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.  Here's a
walk-through for the pg_database case.  GRANT will use the following sequence
of events:

- acquire LOCKTAG_DATABASE_DEFINITION in exclusive mode
- fetch latest pg_database tuple
- heap_update()
- COMMIT, releasing LOCKTAG_DATABASE_DEFINITION

vac_update_datfrozenxid() sequence of events:

- acquire LOCKTAG_DATABASE_DEFINITION in exclusive mode
- (now, all GRANTs on the given database have committed or aborted)
- fetch latest pg_database tuple
- heap_inplace_update()
- release LOCKTAG_DATABASE_DEFINITION, even if xact not ending
- continue with other steps, e.g. vac_truncate_clog()

How does that compare to what you envisioned?  vac_update_datfrozenxid() could
further use xmax as a best-efforts thing to catch conflict with manual UPDATE
statements, but it wouldn't solve the case where the UPDATE had fetched the
tuple but not yet heap_update()'d it.



Re: race condition in pg_class

From:
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
>> I'm inclined to propose that heap_inplace_update should check to
>> make sure that it's operating on the latest version of the tuple
>> (including, I guess, waiting for an uncommitted update?) and throw
>> error if not.  I think this is what your B3 option is, but maybe
>> I misinterpreted.  It might be better to throw error immediately
>> instead of waiting to see if the other updater commits.

> That's perhaps closer to B2.  To be pedantic, B3 was about not failing or
> waiting for GRANT to commit but instead inplace-updating every member of the
> update chain.  For B2, I was thinking we don't need to error.  There are two
> problematic orders of events.  The easy one is heap_inplace_update() mutating
> a tuple that already has an xmax.  That's the one in the test case upthread,
> and detecting it is trivial.  The harder one is heap_inplace_update() mutating
> a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().

Ugh ... you're right, what I was imagining would not catch that last case.

> I anticipate a new locktag per catalog that can receive inplace updates,
> i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.

We could perhaps make this work by using the existing tuple-lock
infrastructure, rather than inventing new locktags (a choice that
spills to a lot of places including clients that examine pg_locks).

I would prefer though to find a solution that only depends on making
heap_inplace_update protect itself, without high-level cooperation
from the possibly-interfering updater.  This is basically because
I'm still afraid that we're defining the problem too narrowly.
For one thing, I have nearly zero confidence that GRANT et al are
the only problematic source of conflicting transactional updates.
For another, I'm worried that some extension may be using
heap_inplace_update against a catalog we're not considering here.
I'd also like to find a solution that fixes the case of a conflicting
manual UPDATE (although certainly that's a stretch goal we may not be
able to reach).

I wonder if there's a way for heap_inplace_update to mark the tuple
header as just-updated in a way that regular heap_update could
recognize.  (For standard catalog updates, we'd then end up erroring
in simple_heap_update, which I think is fine.)  We can't update xmin,
because the VACUUM callers don't have an XID; but maybe there's some
other way?  I'm speculating about putting a funny value into xmax,
or something like that, and having heap_update check that what it
sees in xmax matches what was in the tuple the update started with.

Or we could try to get rid of in-place updates, but that seems like
a mighty big lift.  All of the existing callers, except maybe
the johnny-come-lately dropdb usage, have solid documented reasons
to do it that way.

            regards, tom lane



Re: race condition in pg_class

From:
Noah Misch
Date:
On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
> >> I'm inclined to propose that heap_inplace_update should check to
> >> make sure that it's operating on the latest version of the tuple
> >> (including, I guess, waiting for an uncommitted update?) and throw
> >> error if not.  I think this is what your B3 option is, but maybe
> >> I misinterpreted.  It might be better to throw error immediately
> >> instead of waiting to see if the other updater commits.
> 
> > That's perhaps closer to B2.  To be pedantic, B3 was about not failing or
> > waiting for GRANT to commit but instead inplace-updating every member of the
> > update chain.  For B2, I was thinking we don't need to error.  There are two
> > problematic orders of events.  The easy one is heap_inplace_update() mutating
> > a tuple that already has an xmax.  That's the one in the test case upthread,
> > and detecting it is trivial.  The harder one is heap_inplace_update() mutating
> > a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().
> 
> Ugh ... you're right, what I was imagining would not catch that last case.
> 
> > I anticipate a new locktag per catalog that can receive inplace updates,
> > i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.
> 
> We could perhaps make this work by using the existing tuple-lock
> infrastructure, rather than inventing new locktags (a choice that
> spills to a lot of places including clients that examine pg_locks).

That could be okay.  It would be weird to reuse a short-term lock like that
one as something held till end of transaction.  But the alternative of new
locktags ain't perfect, as you say.

> I would prefer though to find a solution that only depends on making
> heap_inplace_update protect itself, without high-level cooperation
> from the possibly-interfering updater.  This is basically because
> I'm still afraid that we're defining the problem too narrowly.
> For one thing, I have nearly zero confidence that GRANT et al are
> the only problematic source of conflicting transactional updates.

Likewise here, but I have fair confidence that an assertion would flush out
the rest.  heap_inplace_update() would assert that the backend holds one of
the acceptable locks.  It could even be an elog; heap_inplace_update() can
tolerate that cost.

> For another, I'm worried that some extension may be using
> heap_inplace_update against a catalog we're not considering here.

A pgxn search finds "citus" using heap_inplace_update().

> I'd also like to find a solution that fixes the case of a conflicting
> manual UPDATE (although certainly that's a stretch goal we may not be
> able to reach).

It would be nice.

> I wonder if there's a way for heap_inplace_update to mark the tuple
> header as just-updated in a way that regular heap_update could
> recognize.  (For standard catalog updates, we'd then end up erroring
> in simple_heap_update, which I think is fine.)  We can't update xmin,
> because the VACUUM callers don't have an XID; but maybe there's some
> other way?  I'm speculating about putting a funny value into xmax,
> or something like that, and having heap_update check that what it
> sees in xmax matches what was in the tuple the update started with.

Hmmm.  Achieving it without an XID would be the real trick.  (With an XID, we
could use xl_heap_lock like heap_update() does.)  Thinking out loud, what if
heap_inplace_update() sets HEAP_XMAX_INVALID and xmax =
TransactionIdAdvance(xmax)?  Or change t_ctid in a similar way.  Then regular
heap_update() could complain if the field changed vs. last seen value.  This
feels like something to regret later in terms of limiting our ability to
harness those fields for more-valuable ends or compact them away in a future
page format.  I can't pinpoint a specific loss, so the idea might have legs.
Nontransactional data in separate tables or in new metapages smells like the
right long-term state.  A project wanting to reuse the tuple header bits could
introduce such storage to unblock its own bit reuse.

> Or we could try to get rid of in-place updates, but that seems like
> a mighty big lift.  All of the existing callers, except maybe
> the johnny-come-lately dropdb usage, have solid documented reasons
> to do it that way.

Yes, removing that smells problematic.



Re: race condition in pg_class

From:
Noah Misch
Date:
I prototyped two ways, one with a special t_ctid and one with LockTuple().

On Fri, Oct 27, 2023 at 04:26:12PM -0700, Noah Misch wrote:
> On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> > Noah Misch <noah@leadboat.com> writes:
> > > On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:

> > > I anticipate a new locktag per catalog that can receive inplace updates,
> > > i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.
> > 
> > We could perhaps make this work by using the existing tuple-lock
> > infrastructure, rather than inventing new locktags (a choice that
> > spills to a lot of places including clients that examine pg_locks).
> 
> That could be okay.  It would be weird to reuse a short-term lock like that
> one as something held till end of transaction.  But the alternative of new
> locktags ain't perfect, as you say.

That worked.

> > I would prefer though to find a solution that only depends on making
> > heap_inplace_update protect itself, without high-level cooperation
> > from the possibly-interfering updater.  This is basically because
> > I'm still afraid that we're defining the problem too narrowly.
> > For one thing, I have nearly zero confidence that GRANT et al are
> > the only problematic source of conflicting transactional updates.
> 
> Likewise here, but I have fair confidence that an assertion would flush out
> the rest.  heap_inplace_update() would assert that the backend holds one of
> the acceptable locks.  It could even be an elog; heap_inplace_update() can
> tolerate that cost.

That check would fall in both heap_inplace_update() and heap_update().  After
all, a heap_inplace_update() check won't detect an omission in GRANT.

> > For another, I'm worried that some extension may be using
> > heap_inplace_update against a catalog we're not considering here.
> 
> A pgxn search finds "citus" using heap_inplace_update().
> 
> > I'd also like to find a solution that fixes the case of a conflicting
> > manual UPDATE (although certainly that's a stretch goal we may not be
> > able to reach).
> 
> It would be nice.

I expect most approaches could get there by having ExecModifyTable() arrange
for the expected locking or other actions.  That's analogous to how
heap_update() takes care of sinval even for a manual UPDATE.

> > I wonder if there's a way for heap_inplace_update to mark the tuple
> > header as just-updated in a way that regular heap_update could
> > recognize.  (For standard catalog updates, we'd then end up erroring
> > in simple_heap_update, which I think is fine.)  We can't update xmin,
> > because the VACUUM callers don't have an XID; but maybe there's some
> > other way?  I'm speculating about putting a funny value into xmax,
> > or something like that, and having heap_update check that what it
> > sees in xmax matches what was in the tuple the update started with.
> 
> Hmmm.  Achieving it without an XID would be the real trick.  (With an XID, we
> could use xl_heap_lock like heap_update() does.)  Thinking out loud, what if
> heap_inplace_update() sets HEAP_XMAX_INVALID and xmax =
> TransactionIdAdvance(xmax)?  Or change t_ctid in a similar way.  Then regular
> heap_update() could complain if the field changed vs. last seen value.  This
> feels like something to regret later in terms of limiting our ability to
> harness those fields for more-valuable ends or compact them away in a future
> page format.  I can't pinpoint a specific loss, so the idea might have legs.
> Nontransactional data in separate tables or in new metapages smells like the
> right long-term state.  A project wanting to reuse the tuple header bits could
> introduce such storage to unblock its own bit reuse.

heap_update() does not have the pre-modification xmax today, so I used t_ctid.
heap_modify_tuple() preserves t_ctid, so heap_update() already has the
pre-modification t_ctid in key cases.  For details of how the prototype uses
t_ctid, see comment at "#define InplaceCanaryOffsetNumber".  The prototype
doesn't prevent corruption in the following scenario, because the aborted
ALTER TABLE RENAME overwrites the special t_ctid:

  == session 1
  drop table t;
  create table t (c int);
  begin;
  -- in gdb, set breakpoint on heap_modify_tuple
  grant select on t to public;

  == session 2
  alter table t add primary key (c);
  begin; alter table t rename to t2; rollback;

  == back in session 1
  -- release breakpoint
  -- want error (would get it w/o the begin;alter;rollback)
  commit;

I'm missing how to mark the tuple in a fashion accessible to a second
heap_update() after a rolled-back heap_update().  The mark needs enough bits
"N" so it's implausible for 2^N inplace updates to happen between GRANT
fetching the old tuple and GRANT completing heap_update().  Looking for bits
that could persist across a rolled-back heap_update(), we have 3 in t_ctid, 2
in t_infomask2, and 0 in xmax.  I definitely don't want to paint us into a
corner by spending the t_infomask2 bits on this.  Even if I did, 2^(3+2)=32
wouldn't clearly be enough inplace updates.

Is there a way to salvage the goal of fixing the bug without modifying code
like ExecGrant_common()?  If not, I'm inclined to pick from one of the
following designs:

- Acquire ShareUpdateExclusiveLock in GRANT ((B1) from previous list).  It
  does make GRANT more intrusive; e.g. GRANT will cancel autovacuum.  I'm
  leaning toward this one for two reasons.  First, it doesn't slow
  heap_update() outside of assert builds.  Second, it makes the GRANT
  experience more like the rest our DDL, in that concurrent DDL will make
  GRANT block, not fail.

- GRANT passes to heapam the fixed-size portion of the pre-modification tuple.
  heap_update() compares those bytes to the oldtup in shared buffers to see if
  an inplace update happened.  (HEAD could get the bytes from a new
  heap_update() parameter, while back branches would need a different passing
  approach.)

- LockTuple(), as seen in its attached prototype.  I like this least at the
  moment, because it changes pg_locks content without having a clear advantage
  over the previous option.  Also, the prototype has enough FIXME markers that
  I expect this to get hairy before it's done.

I might change my preference after further prototypes.  Does anyone have a
strong preference between those?  Briefly, I did consider these additional
alternatives:

- Just accept the yet-rarer chance of corruption from this message's test
  procedure.

- Hold a buffer lock long enough to solve things.

- Remember the tuples where we overwrote a special t_ctid, and reverse the
  overwrite during abort processing.  But I/O in the abort path sounds bad.

Thanks,
nm

Attachments

Re: race condition in pg_class

From:
Noah Misch
Date:
I'm attaching patches implementing the LockTuple() design.  It turns out we
don't just lose inplace updates.  We also overwrite unrelated tuples,
reproduced at inplace.spec.  Good starting points are README.tuplock and the
heap_inplace_update_scan() header comment.

On Wed, Nov 01, 2023 at 08:09:15PM -0700, Noah Misch wrote:
> On Fri, Oct 27, 2023 at 04:26:12PM -0700, Noah Misch wrote:
> > On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> > > We could perhaps make this work by using the existing tuple-lock
> > > infrastructure, rather than inventing new locktags (a choice that
> > > spills to a lot of places including clients that examine pg_locks).

> > > I'd also like to find a solution that fixes the case of a conflicting
> > > manual UPDATE (although certainly that's a stretch goal we may not be
> > > able to reach).

I implemented that; search for ri_needLockTagTuple.

> - GRANT passes to heapam the fixed-size portion of the pre-modification tuple.
>   heap_update() compares those bytes to the oldtup in shared buffers to see if
>   an inplace update happened.  (HEAD could get the bytes from a new
>   heap_update() parameter, while back branches would need a different passing
>   approach.)

This could have been fine, but ...

> - LockTuple(), as seen in its attached prototype.  I like this least at the
>   moment, because it changes pg_locks content without having a clear advantage
>   over the previous option.

... I settled on the LockTuple() design for these reasons:

- Solves more conflicts by waiting, instead of by ERROR or by retry loops.
- Extensions wanting inplace updates don't have a big disadvantage over core
  code inplace updates.
- One could use this to stop "tuple concurrently updated" for pg_class rows,
  by using SearchSysCacheLocked1() for all pg_class DDL and making that
  function wait for any existing xmax like inplace_xmax_lock() does.  I don't
  expect to write that, but it's a nice option to have.
- pg_locks shows the new lock acquisitions.

Separable, nontrivial things not fixed in the attached patch stack:

- Inplace update uses transactional CacheInvalidateHeapTuple().  ROLLBACK of
  CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
  still seen in inplace-inval.spec.  CacheInvalidateRelmap() does this right.

- AtEOXact_Inval(true) is outside the RecordTransactionCommit() critical
  section, but it is critical.  We must not commit transactional DDL without
  other backends receiving an inval.  (When the inplace inval becomes
  nontransactional, it will face the same threat.)

- Trouble is possible, I bet, if the system crashes between the inplace-update
  memcpy() and XLogInsert().  See the new XXX comment below the memcpy().
  Might solve this by inplace update setting DELAY_CHKPT, writing WAL, and
  finally issuing memcpy() into the buffer.

- [consequences limited to transient failure] Since a PROC_IN_VACUUM backend's
  xmin does not stop pruning, an MVCC scan in that backend can find zero
  tuples when one is live.  This is like what all backends got in the days of
  SnapshotNow catalog scans.  See the pgbench test suite addition.  (Perhaps
  the fix is to make VACUUM do its MVCC scans outside of PROC_IN_VACUUM,
  setting that flag later and unsetting it earlier.)

If you find decisions in this thread's patches are tied to any of those such
that I should not separate those, let's discuss that.  Topics in the patches
that I feel are most fruitful for debate:

- This makes inplace update block if the tuple has an updater.  It's like one
  GRANT blocking another, except an inplace updater won't get "ERROR:  tuple
  concurrently updated" like one of the GRANTs would.  I had implemented
  versions that avoided this blocking by mutating each tuple in the updated
  tuple chain.  That worked, but it had corner cases bad for maintainability,
  listed in the inplace_xmax_lock() header comment.  I'd rather accept the
  blocking, so hackers can rule out those corner cases.  A long-running GRANT
  already hurts VACUUM progress more just by keeping an XID running.

- Pre-checks could make heap_inplace_update_cancel() calls rarer.  Avoiding
  one of those avoids an exclusive buffer lock, and it avoids waiting on
  concurrent heap_update() if any.  We'd pre-check the syscache tuple.
  EventTriggerOnLogin() does it that way, because the code was already in that
  form.  I expect only vac_update_datfrozenxid() concludes !dirty enough to
  matter.  I didn't bother with the optimization, but it would be simple.

- If any citus extension user feels like porting its heap_inplace_update()
  call to this, I'd value hearing about your experience.

- I paid more than my usual attention to test coverage, considering the patch
  stack's intensity compared to most back-patch bug fixes.

I've kept all the above topics brief; feel free to ask for more details.

Thanks,
nm

Attachments

Re: race condition in pg_class

From:
Michael Paquier
Date:
On Sun, May 12, 2024 at 04:29:23PM -0700, Noah Misch wrote:
> I'm attaching patches implementing the LockTuple() design.  It turns out we
> don't just lose inplace updates.  We also overwrite unrelated tuples,
> reproduced at inplace.spec.  Good starting points are README.tuplock and the
> heap_inplace_update_scan() header comment.

About inplace050-tests-inj-v1.patch.

+    /* Check if blocked_pid is in injection_wait(). */
+    proc = BackendPidGetProc(blocked_pid);
+    if (proc == NULL)
+        PG_RETURN_BOOL(false);    /* session gone: definitely unblocked */
+    wait_event =
+        pgstat_get_wait_event(UINT32_ACCESS_ONCE(proc->wait_event_info));
+    if (wait_event && strncmp("INJECTION_POINT(",
+                              wait_event,
+                              strlen("INJECTION_POINT(")) == 0)
+        PG_RETURN_BOOL(true);

Hmm.  I am not sure that this is the right interface for the job
because this is not only related to injection points but to the
monitoring of one or more wait events when running a permutation
step.  Perhaps this is something that should be linked to the spec
files with some property area listing the wait events we're expected
to wait on instead when running a step that we know will wait?
--
Michael

Attachments

Re: race condition in pg_class

From:
Noah Misch
Date:
On Mon, May 13, 2024 at 04:59:59PM +0900, Michael Paquier wrote:
> About inplace050-tests-inj-v1.patch.
> 
> +    /* Check if blocked_pid is in injection_wait(). */
> +    proc = BackendPidGetProc(blocked_pid);
> +    if (proc == NULL)
> +        PG_RETURN_BOOL(false);    /* session gone: definitely unblocked */
> +    wait_event =
> +        pgstat_get_wait_event(UINT32_ACCESS_ONCE(proc->wait_event_info));
> +    if (wait_event && strncmp("INJECTION_POINT(",
> +                              wait_event,
> +                              strlen("INJECTION_POINT(")) == 0)
> +        PG_RETURN_BOOL(true);
> 
> Hmm.  I am not sure that this is the right interface for the job
> because this is not only related to injection points but to the
> monitoring of one or more wait events when running a permutation
> step.

Could you say more about that?  Permutation steps don't monitor wait events
today.  This patch would be the first instance of that.

> Perhaps this is something that should be linked to the spec
> files with some property area listing the wait events we're expected
> to wait on instead when running a step that we know will wait?

The spec syntax doesn't distinguish contention types at all.  The isolation
tester's needs are limited to distinguishing:

  (a) process is waiting on another test session
  (b) process is waiting on automatic background activity (autovacuum, mainly)

Automatic background activity doesn't make a process enter or leave
injection_wait(), so all injection point wait events fall in (a).  (The tester
ignores (b), since those clear up without intervention.  Failing to ignore
them, as the tester did long ago, made output unstable.)



Re: race condition in pg_class

From:
Robert Haas
Date:
On Sun, May 12, 2024 at 7:29 PM Noah Misch <noah@leadboat.com> wrote:
> - [consequences limited to transient failure] Since a PROC_IN_VACUUM backend's
>   xmin does not stop pruning, an MVCC scan in that backend can find zero
>   tuples when one is live.  This is like what all backends got in the days of
>   SnapshotNow catalog scans.  See the pgbench test suite addition.  (Perhaps
>   the fix is to make VACUUM do its MVCC scans outside of PROC_IN_VACUUM,
>   setting that flag later and unsetting it earlier.)

Are you saying that this is a problem already, or that the patch
causes it to start happening? If it's the former, that's horrible. If
it's the latter, I'd say that is a fatal defect.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: race condition in pg_class

From:
Noah Misch
Date:
On Mon, May 13, 2024 at 03:53:08PM -0400, Robert Haas wrote:
> On Sun, May 12, 2024 at 7:29 PM Noah Misch <noah@leadboat.com> wrote:
> > - [consequences limited to transient failure] Since a PROC_IN_VACUUM backend's
> >   xmin does not stop pruning, an MVCC scan in that backend can find zero
> >   tuples when one is live.  This is like what all backends got in the days of
> >   SnapshotNow catalog scans.  See the pgbench test suite addition.  (Perhaps
> >   the fix is to make VACUUM do its MVCC scans outside of PROC_IN_VACUUM,
> >   setting that flag later and unsetting it earlier.)
> 
> Are you saying that this is a problem already, or that the patch
> causes it to start happening? If it's the former, that's horrible.

The former.



Re: race condition in pg_class

From:
Noah Misch
Date:
On Sun, May 12, 2024 at 04:29:23PM -0700, Noah Misch wrote:
> I'm attaching patches implementing the LockTuple() design.

Starting 2024-06-10, I plan to push the first seven of the ten patches:

inplace005-UNEXPECTEDPASS-tap-meson-v1.patch
inplace010-tests-v1.patch
inplace040-waitfuncs-v1.patch
inplace050-tests-inj-v1.patch
inplace060-nodeModifyTable-comments-v1.patch
  Those five just deal in tests, test infrastructure, and comments.
inplace070-rel-locks-missing-v1.patch
  Main risk is new DDL deadlocks.
inplace080-catcache-detoast-inplace-stale-v1.patch
  If it fails to fix the bug it targets, I expect it's a no-op rather than
  breaking things.

I'll leave the last three of the ten needing review.  Those three are beyond
my skill to self-certify.



Re: race condition in pg_class

From:
Robert Haas
Date:
On Wed, Jun 5, 2024 at 2:17 PM Noah Misch <noah@leadboat.com> wrote:
> Starting 2024-06-10, I plan to push the first seven of the ten patches:
>
> inplace005-UNEXPECTEDPASS-tap-meson-v1.patch
> inplace010-tests-v1.patch
> inplace040-waitfuncs-v1.patch
> inplace050-tests-inj-v1.patch
> inplace060-nodeModifyTable-comments-v1.patch
>   Those five just deal in tests, test infrastructure, and comments.
> inplace070-rel-locks-missing-v1.patch
>   Main risk is new DDL deadlocks.
> inplace080-catcache-detoast-inplace-stale-v1.patch
>   If it fails to fix the bug it targets, I expect it's a no-op rather than
>   breaking things.
>
> I'll leave the last three of the ten needing review.  Those three are beyond
> my skill to self-certify.

It's not this patch set's fault, but I'm not very pleased to see that
the injection point wait events have been shoehorned into the
"Extension" category - which they are not - instead of being a new
wait_event_type. That would have avoided the ugly wait-event naming
pattern, inconsistent with everything else, introduced by
inplace050-tests-inj-v1.patch.

I think that the comments and commit messages in this patch set could,
in some places, use improvement. For instance,
inplace060-nodeModifyTable-comments-v1.patch reflows a bunch of
comments, which makes it hard to see what actually changed, and the
commit message doesn't tell you, either. A good bit of it seems to be
changing "a view" to "a view INSTEAD OF trigger" or "a view having an
INSTEAD OF trigger," but the reasoning behind that change is not
spelled out anywhere. The reader is left to guess what the other case
is and why the same principles don't apply to it. I don't doubt that
the new comments are more correct than the old ones, but I expect
future patch authors to have difficulty maintaining that state of
affairs.

Similarly, inplace070-rel-locks-missing-v1.patch adds no comments.
IMHO, the commit message also isn't very informative. It disclaims
knowledge of what bug it's fixing, while at the same time leaving the
reader to figure out for themselves how the behavior has changed.
Consequently, I expect writing the release notes for a release
including this patch to be difficult: "We added some locks that block
... something ... in some circumstances ... to prevent ... something."
It's not really the job of the release note author to fill in those
blanks, but rather of the patch author or committer. I don't want to
overburden the act of fixing bugs, but I just feel like more
explanation is needed here. When I see for example that we're adding a
lock acquisition to the end of heap_create(), I can't help but wonder
if it's really true that we don't take a lock on a just-created
relation today. I'm certainly under the impression that we lock
newly-created, uncommitted relations, and a quick test seems to
confirm that. I don't quite know what's going on here, but evidently
this call is guarding against something more subtle than a categorical
failure to lock a relation on creation, so I think there should be a
comment explaining what that thing is.

It's also quite surprising that SetRelationHasSubclass() says "take X
lock before calling" and 2 of 4 callers just don't. I guess that's how
it is. But shouldn't we then have an assertion inside that function to
guard against future mistakes? If the reason why we failed to add this
initially is discernible from the commit messages that introduced the
bug, it would be nice to mention what it seems to have been; if not,
it would at least be nice to mention the offending commit(s). I'm also
a bit worried that this is going to cause deadlocks, but I suppose if
it does, that's still better than the status quo.

IsInplaceUpdateOid's header comment says IsInplaceUpdateRelation
instead of IsInplaceUpdateOid.

inplace080-catcache-detoast-inplace-stale-v1.patch seems like another
place where spelling out the rationale in more detail would be helpful
to future readers; for instance, the commit message says that
PgDatabaseToastTable is the only one affected, but it doesn't say why
the others are not, or why this one is. The lengthy comment in
CatalogCacheCreateEntry is also difficult to correlate with the code
which follows. I can't guess whether the two cases called out in the
comment always needed to be handled and were handled save only for
in-place updates, and thus the comment changes were simply taking the
opportunity to elaborate on the existing comments; or whether one of
those cases is preexisting and the other arises from the desire to
handle inplace updates. It could be helpful to mention relevant
identifiers from the code in the comment text e.g.
"systable_recheck_tuple detects ordinary updates by noting changes to
the tuple's visibility information, while the equalTuple() case
detects inplace updates."

IMHO, this patch set underscores the desirability of removing in-place
update altogether. That sounds difficult and not back-patchable, but I
can't classify what this patch set does as anything better than grotty
hacks to work around serious design deficiencies. That is not a vote
against these patches: I see no better way forward. Nonetheless, I
dislike the lack of better options.

I have done only cursory review of the last two patches and don't feel
I'm in a place to certify them, at least not now.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: race condition in pg_class

From:
Michael Paquier
Date:
On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> It's not this patch set's fault, but I'm not very pleased to see that
> the injection point wait events have been shoehorned into the
> "Extension" category - which they are not - instead of being a new
> wait_event_type. That would have avoided the ugly wait-event naming
> pattern, inconsistent with everything else, introduced by
> inplace050-tests-inj-v1.patch.

I'm not sure I agree with that.  The set of core backend APIs supporting
injection points has nothing to do with wait events.  The library
attached to one or more injection points *may* decide to use a wait
event, like the wait/wakeup calls in modules/injection_points do,
but that's entirely optional.  These rely on custom wait events,
plugged into the Extension category, as the code being run is itself in
an extension.  I am not arguing against the point that it may be
interesting to plug in custom wait event categories, but the current
design of wait events makes that much harder than what core is
currently able to handle, and I am not sure that this brings much in
the end as long as the wait event strings can be customized.

I've voiced upthread concerns over the naming enforced by the patch
and the way it plugs the namings into the isolation functions, by the
way.
--
Michael

Attachments

Re: race condition in pg_class

From:
Robert Haas
Date:
On Thu, Jun 6, 2024 at 7:20 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> > It's not this patch set's fault, but I'm not very pleased to see that
> > the injection point wait events have been shoehorned into the
> > "Extension" category - which they are not - instead of being a new
> > wait_event_type. That would have avoided the ugly wait-event naming
> > pattern, inconsistent with everything else, introduced by
> > inplace050-tests-inj-v1.patch.
>
> Not sure to agree with that.  The set of core backend APIs supporting
> injection points have nothing to do with wait events.  The library
> attached to one or more injection points *may* decide to use a wait
> event like what the wait/wakeup calls in modules/injection_points do,
> but that's entirely optional.  These rely on custom wait events,
> plugged into the Extension category as the code run is itself in an
> extension.  I am not arguing against the point that it may be
> interesting to plug in custom wait event categories, but the current
> design of wait events makes that much harder than what core is
> currently able to handle, and I am not sure that this brings much at
> the end as long as the wait event strings can be customized.
>
> I've voiced upthread concerns over the naming enforced by the patch
> and the way it plugs the namings into the isolation functions, by the
> way.

I think the core code should provide an "Injection Point" wait event
type and let extensions add specific wait events there, just like you
did for "Extension". Then this ugly naming would go away. As I see it,
"Extension" is only supposed to be used as a catch-all when we have no
other information, but here we do. If we refuse to use the
wait_event_type field to categorize waits, then people are going to
have to find some other way to get that data into the system, as Noah
has done.

--
Robert Haas
EDB: http://www.enterprisedb.com