Discussion: race condition in pg_class
Hello, hackers!
We are running PG13.10 and recently encountered what appears to be a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and some other catalog-writer, possibly ANALYZE.
The problem is that after successfully creating an index on a relation (which previously didn't have any indexes), its pg_class.relhasindex remains set to "false", which is illegal, I think.
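(As an aside, this kind of inconsistency can be surfaced with a catalog query along the following lines; an illustrative check, not part of the original report:)

-- find relations whose pg_class.relhasindex flag disagrees with pg_index
select c.oid::regclass
from pg_class c
where not c.relhasindex
  and exists (select 1 from pg_index i where i.indrelid = c.oid);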
Index was built using the following statement:
ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);
PG_CLASS:
# select ctid,oid,xmin,xmax, * from pg_class where oid = 3566558198;
-[ RECORD 1 ]-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ctid | (12,49)
oid | 3566558198
xmin | 1202298791
xmax | 0
relname | example
relnamespace | 16479
reltype | 3566558200
reloftype | 0
relowner | 16386
relam | 2
relfilenode | 3566558198
reltablespace | 0
relpages | 152251
reltuples | 1.1565544e+07
relallvisible | 127837
reltoastrelid | 3566558203
relhasindex | f
relisshared | f
relpersistence | p
relkind | r
relnatts | 12
relchecks | 0
relhasrules | f
relhastriggers | f
relhassubclass | f
relrowsecurity | f
relforcerowsecurity | f
relispopulated | t
relreplident | d
relispartition | f
relrewrite | 0
relfrozenxid | 1201647807
relminmxid | 1
relacl |
reloptions |
relpartbound |
PG_INDEX:
# select ctid,xmin,xmax,indexrelid::regclass,indrelid::regclass, * from pg_index where indexrelid = 3569625749;
-[ RECORD 1 ]--+---------------------------------------------
ctid | (3,30)
xmin | 1202295045
xmax | 0
indexrelid | "example_pkey"
indrelid | "example"
indexrelid | 3569625749
indrelid | 3566558198
indnatts | 1
indnkeyatts | 1
indisunique | t
indisprimary | t
indisexclusion | f
indimmediate | t
indisclustered | f
indisvalid | t
indcheckxmin | f
indisready | t
indislive | t
indisreplident | f
indkey | 1
indcollation | 0
indclass | 3124
indoption | 0
indexprs |
indpred |
Looking into the WAL via waldump gives us the following picture (full waldump output is attached):
tx: 1202295045, lsn: AAB1/D38378D0, prev AAB1/D3837208, desc: FPI , blkref #0: rel 1663/16387/3569625749 blk 0 FPW
tx: 1202298790, lsn: AAB1/D3912EC0, prev AAB1/D3912E80, desc: NEW_CID rel 1663/16387/1259; tid 6/24; cmin: 0, cmax: 4294967295, combo: 4294967295
tx: 1202298790, lsn: AAB1/D3927580, prev AAB1/D3926988, desc: COMMIT 2023-10-04 22:41:23.863979 UTC
tx: 1202298791, lsn: AAB1/D393C230, prev AAB1/D393C1F0, desc: HOT_UPDATE off 24 xmax 1202298791 flags 0x20 ; new off 45 xmax 0, blkref #0: rel 1663/16387/1259 blk 6
tx: 1202298791, lsn: AAB1/D394ADA0, prev AAB1/D394AD60, desc: UPDATE off 45 xmax 1202298791 flags 0x00 ; new off 28 xmax 0, blkref #0: rel 1663/16387/1259 blk 5, blkref #1: rel 1663/16387/1259 blk 6
tx: 1202298791, lsn: AAB1/D3961088, prev AAB1/D3961048, desc: NEW_CID rel 1663/16387/1259; tid 12/49; cmin: 50, cmax: 4294967295, combo: 4294967295
tx: 1202295045, lsn: AAB1/D3962E78, prev AAB1/D3962E28, desc: INPLACE off 24, blkref #0: rel 1663/16387/1259 blk 6
tx: 1202295045, lsn: AAB1/D39632A0, prev AAB1/D3963250, desc: COMMIT 2023-10-04 22:41:23.878565 UTC
tx: 1202298791, lsn: AAB1/D3973420, prev AAB1/D39733D0, desc: COMMIT 2023-10-04 22:41:23.884951 UTC
1202295045 - create index statement
1202298790 and 1202298791 are some other concurrent operations; unfortunately I wasn't able to determine what they are.
So it looks like 1202295045 updated tuple (6,24) in pg_class INPLACE, but at that point its xmax had already been set by 1202298791 and a new tuple had been created at (12,49).
So after 1202298791 was committed, that inplace update was effectively lost.
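(For anyone retracing this analysis: the records above come from pg_waldump. An invocation along these lines, with a placeholder segment name, pulls out the records touching pg_class, whose relfilenode is 1259:)

pg_waldump 000000010000AAB1000000D3 | grep '1663/16387/1259'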
If we do an inclusive PITR with (recovery_target_xid = 1202295045), we can see the following picture (notice relhasindex and xmax):
# select ctid,oid, xmin,xmax,relhasindex,cmin,cmax from pg_class where oid = 3566558198;
-[ RECORD 1 ]-----------
ctid | (6,24)
oid | 3566558198
xmin | 1202298790
xmax | 1202298791
relhasindex | t
cmin | 0
cmax | 0
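(The PITR used settings along these lines; an illustrative PG13-style configuration, with a placeholder restore_command and a recovery.signal file present in the data directory:)

restore_command = 'cp /wal_archive/%f %p'
recovery_target_xid = '1202295045'
recovery_target_inclusive = on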
I've tried to reproduce this scenario with CREATE INDEX and various concurrent statements, but no luck.
Attached full waldump output for the relevant WAL segment.
> On 25 Oct 2023, at 13:39, Smolkin Grigory <smallkeen@gmail.com> wrote:
>
> We are running PG13.10 and recently encountered what appears to be a bug
> due to some race condition between ALTER TABLE ... ADD CONSTRAINT and some
> other catalog-writer, possibly ANALYZE.
> I've tried to reproduce this scenario with CREATE INDEX and various
> concurrent statements, but no luck.

Maybe it would be possible to reproduce by modifying the tests for concurrent index creation. For example, add "ANALYZE" here [0]. Keep in mind that for easier reproduction it would make sense to increase the transaction count radically.

Best regards, Andrey Borodin.

[0] https://github.com/postgres/postgres/blob/master/contrib/amcheck/t/002_cic.pl#L34
Smolkin Grigory <smallkeen@gmail.com> writes:
> We are running PG13.10 and recently encountered what appears to be
> a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> some other catalog-writer, possibly ANALYZE.
> The problem is that after successfully creating an index on a relation (which
> previously didn't have any indexes), its pg_class.relhasindex remains set to
> "false", which is illegal, I think.
> Index was built using the following statement:
> ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);

ALTER TABLE ADD CONSTRAINT would certainly have taken AccessExclusiveLock on the "example" table, which should be sufficient to prevent anything else from touching its pg_class row. The only mechanism I can think of that might bypass that is a manual UPDATE on pg_class, which would just manipulate the row as a row without concern for associated relation-level locks. Any chance that somebody was doing something like that?

			regards, tom lane
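(For clarity, the kind of manual catalog change being asked about here is plain DML against pg_class, which a superuser can issue directly; an illustrative example, not something from the report:)

update pg_class set relhasindex = false where oid = 'example'::regclass;

Such a statement updates the row like any user-table row and takes no lock on the "example" relation itself, which is why it can slip past AccessExclusiveLock.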
> ALTER TABLE ADD CONSTRAINT would certainly have taken
> AccessExclusiveLock on the "example" table, which should be sufficient
> to prevent anything else from touching its pg_class row. The only
> mechanism I can think of that might bypass that is a manual UPDATE on
> pg_class, which would just manipulate the row as a row without concern
> for associated relation-level locks. Any chance that somebody was
> doing something like that?
No chance. Our infrastructure doesn't do that, and users simply don't have the privileges to mess with pg_catalog.
On Wed, Oct 25, 2023 at 01:39:41PM +0300, Smolkin Grigory wrote:
> We are running PG13.10 and recently encountered what appears to be
> a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> some other catalog-writer, possibly ANALYZE.
> The problem is that after successfully creating an index on a relation (which
> previously didn't have any indexes), its pg_class.relhasindex remains set to
> "false", which is illegal, I think.
> Index was built using the following statement:
> ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);

This is going to be a problem with any operation that does a transactional pg_class update without taking a lock that conflicts with ShareLock. GRANT doesn't lock the table at all, so I can reproduce this in v17 as follows:

== session 1
create table t (c int);
begin;
grant select on t to public;

== session 2
alter table t add primary key (c);

== back in session 1
commit;

We'll likely need to change how we maintain relhasindex or perhaps take a lock in GRANT.

> Looking into the WAL via waldump gives us the following picture (full
> waldump output is attached):

> 1202295045 - create index statement
> 1202298790 and 1202298791 are some other concurrent operations;
> unfortunately I wasn't able to determine what they are

Can you explore that as follows?

- PITR to just before the COMMIT record.
- Save all rows of pg_class.
- PITR to just after the COMMIT record.
- Save all rows of pg_class.
- Diff the two sets of saved rows.

Which columns changed? The evidence you've shown would be consistent with a transaction doing GRANT or REVOKE on dozens of tables. If the changed column is something other than relacl, that would be great to know. On the off-chance it's relevant, what extensions do you have (\dx in psql)?
> This is going to be a problem with any operation that does a transactional
> pg_class update without taking a lock that conflicts with ShareLock. GRANT
> doesn't lock the table at all, so I can reproduce this in v17 as follows:
>
> == session 1
> create table t (c int);
> begin;
> grant select on t to public;
>
> == session 2
> alter table t add primary key (c);
>
> == back in session 1
> commit;
>
>
> We'll likely need to change how we maintain relhasindex or perhaps take a lock
> in GRANT.
Oh, that explains it. Thank you very much.
> Can you explore that as follows?
>
>- PITR to just before the COMMIT record.
>- Save all rows of pg_class.
>- PITR to just after the COMMIT record.
>- Save all rows of pg_class.
>- Diff the two sets of saved rows.
Sure, but it will take some time; it's a large db with lots of WAL segments to apply.
> extensions
extname | extversion
--------------------+------------
plpgsql | 1.0
pg_stat_statements | 1.8
pg_buffercache | 1.3
pgstattuple | 1.5
On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
> On Wed, Oct 25, 2023 at 01:39:41PM +0300, Smolkin Grigory wrote:
> > We are running PG13.10 and recently encountered what appears to be
> > a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> > some other catalog-writer, possibly ANALYZE.
> > The problem is that after successfully creating an index on a relation (which
> > previously didn't have any indexes), its pg_class.relhasindex remains set to
> > "false", which is illegal, I think.

It's damaging. The table will behave like it has no indexes. If something adds an index later, old indexes will reappear, corrupt, having not received updates during the relhasindex=false era. ("pg_amcheck --heapallindexed" can detect this.)

> > Index was built using the following statement:
> > ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);
>
> This is going to be a problem with any operation that does a transactional
> pg_class update without taking a lock that conflicts with ShareLock. GRANT
> doesn't lock the table at all, so I can reproduce this in v17 as follows:
>
> == session 1
> create table t (c int);
> begin;
> grant select on t to public;
>
> == session 2
> alter table t add primary key (c);
>
> == back in session 1
> commit;
>
> We'll likely need to change how we maintain relhasindex or perhaps take a lock
> in GRANT.

The main choice is accepting more DDL blocking vs. accepting inefficient relcache builds. Options I'm seeing:

=== "more DDL blocking" option family

B1. Take ShareUpdateExclusiveLock in GRANT, REVOKE, and anything that makes transactional pg_class updates without holding some stronger lock. New asserts could catch future commands failing to do this.

B2. Take some shorter-lived lock around pg_class tuple formation, such that GRANT blocks CREATE INDEX, but two CREATE INDEX don't block each other. Anything performing a transactional update of a pg_class row would acquire the lock in exclusive mode before fetching the old tuple and hold it till end of transaction. relhasindex=true in-place updates would acquire it the same way, but they would release it after the inplace update. I expect a new heavyweight lock type, e.g. LOCKTAG_RELATION_DEFINITION, with the same key as LOCKTAG_RELATION. This has less blocking than the previous option, but it's more complicated to explain to both users and developers.

B3. Have CREATE INDEX do an EvalPlanQual()-style thing to update all successor tuple versions. Like the previous option, this would require new locking, but the new lock would not need to persist till end of xact. It would be even more complicated to explain to users and developers. (If this is promising enough to warrant more detail, let me know.)

B4. Use transactional updates to set relhasindex=true. Two CREATE INDEX commands on the same table would block each other. If we did it the way most DDL does today, they'd get "tuple concurrently updated" failures after the blocking ends.

=== "inefficient relcache builds" option family

R1. Ignore relhasindex; possibly remove it in v17. Relcache builds et al. will issue more superfluous queries.

R2. As a weird variant of the previous option, keep relhasindex and make all transactional updates of pg_class set relhasindex=true pessimistically. (VACUUM will set it back to false.)

=== other

O1. This is another case where the sometimes-discussed "pg_class_nt" for nontransactional columns would help. I'm ruling that out as too hard to back-patch.

Are there other options important to consider? I currently like (B1) the most, followed closely by (R1) and (B2). A key unknown is the prevalence of index-free tables. Low prevalence would argue in favor of (R1). In my limited experience, they've been rare. That said, I assume relcache builds happen a lot more than GRANTs, so it's harder to bound the damage from (R1) compared to the damage from (B1). Thoughts on this decision?

Thanks,
nm
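(To make (B1) concrete, the GRANT-side change might look like the sketch below; a minimal illustration assuming ExecGrant_Relation() as the call site, not a finished patch:)

/*
 * Sketch of (B1): before fetching the pg_class row that GRANT will
 * update transactionally, take a self-conflicting lock that also
 * conflicts with ShareLock.  CREATE INDEX (ShareLock on the table)
 * then serializes against GRANT, and two concurrent GRANTs block
 * rather than failing with "tuple concurrently updated".
 */
LockRelationOid(relOid, ShareUpdateExclusiveLock);
tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relOid));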
Noah Misch <noah@leadboat.com> writes: > On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote: >> We'll likely need to change how we maintain relhasindex or perhaps take a lock >> in GRANT. > The main choice is accepting more DDL blocking vs. accepting inefficient > relcache builds. Options I'm seeing: It looks to me like you're only thinking about relhasindex, but it seems to me that any call of heap_inplace_update brings some risk of this kind. Excluding the bootstrap-mode-only usage in create_toast_table, I see four callers: * index_update_stats updating a pg_class tuple's relhasindex, relpages, reltuples, relallvisible * vac_update_relstats updating a pg_class tuple's relpages, reltuples, relallvisible, relhasindex, relhasrules, relhastriggers, relfrozenxid, relminmxid * vac_update_datfrozenxid updating a pg_database tuple's datfrozenxid, datminmxid * dropdb updating a pg_database tuple's datconnlimit So we have just as much of a problem with GRANTs on databases as GRANTs on relations. Also, it looks like we can lose knowledge of the presence of rules and triggers, which seems nearly as bad as forgetting about indexes. The rest of these updates might not be correctness-critical, although I wonder how bollixed things could get if we forget an advancement of relfrozenxid or datfrozenxid (especially if the calling transaction goes on to make other changes that assume that the update happened). BTW, vac_update_datfrozenxid believes (correctly I think) that it cannot use the syscache copy of a tuple as the basis for in-place update, because syscache will have detoasted any toastable fields. These other callers are ignoring that, which seems like it should result in heap_inplace_update failing with "wrong tuple length". I wonder how come we're not seeing reports of that from the field. I'm inclined to propose that heap_inplace_update should check to make sure that it's operating on the latest version of the tuple (including, I guess, waiting for an uncommitted update?) and throw error if not. I think this is what your B3 option is, but maybe I misinterpreted. It might be better to throw error immediately instead of waiting to see if the other updater commits. regards, tom lane
On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
> >> We'll likely need to change how we maintain relhasindex or perhaps take a lock
> >> in GRANT.
>
> > The main choice is accepting more DDL blocking vs. accepting inefficient
> > relcache builds. Options I'm seeing:
>
> It looks to me like you're only thinking about relhasindex, but it
> seems to me that any call of heap_inplace_update brings some
> risk of this kind. Excluding the bootstrap-mode-only usage in
> create_toast_table, I see four callers:
>
> * index_update_stats updating a pg_class tuple's
>   relhasindex, relpages, reltuples, relallvisible
>
> * vac_update_relstats updating a pg_class tuple's
>   relpages, reltuples, relallvisible, relhasindex, relhasrules,
>   relhastriggers, relfrozenxid, relminmxid
>
> * vac_update_datfrozenxid updating a pg_database tuple's
>   datfrozenxid, datminmxid
>
> * dropdb updating a pg_database tuple's datconnlimit
>
> So we have just as much of a problem with GRANTs on databases
> as GRANTs on relations. Also, it looks like we can lose
> knowledge of the presence of rules and triggers, which seems
> nearly as bad as forgetting about indexes. The rest of these
> updates might not be correctness-critical, although I wonder
> how bollixed things could get if we forget an advancement of
> relfrozenxid or datfrozenxid (especially if the calling
> transaction goes on to make other changes that assume that
> the update happened).

Thanks for researching that. Let's treat frozenxid stuff as critical; I wouldn't want to advance XID limits based on a datfrozenxid that later gets rolled back. I agree relhasrules and relhastriggers are also critical. The "inefficient relcache builds" option family can't solve cases like relfrozenxid and datconnlimit, so that leaves us with the "more DDL blocking" option family.

> BTW, vac_update_datfrozenxid believes (correctly I think) that
> it cannot use the syscache copy of a tuple as the basis for in-place
> update, because syscache will have detoasted any toastable fields.
> These other callers are ignoring that, which seems like it should
> result in heap_inplace_update failing with "wrong tuple length".
> I wonder how come we're not seeing reports of that from the field.

Good question. Perhaps we'll need some test cases that exercise each inplace update against a row having a toast pointer. It's too easy to go a long time without encountering those in the field.

> I'm inclined to propose that heap_inplace_update should check to
> make sure that it's operating on the latest version of the tuple
> (including, I guess, waiting for an uncommitted update?) and throw
> error if not. I think this is what your B3 option is, but maybe
> I misinterpreted. It might be better to throw error immediately
> instead of waiting to see if the other updater commits.

That's perhaps closer to B2. To be pedantic, B3 was about not failing or waiting for GRANT to commit but instead inplace-updating every member of the update chain. For B2, I was thinking we don't need to error. There are two problematic orders of events. The easy one is heap_inplace_update() mutating a tuple that already has an xmax. That's the one in the test case upthread, and detecting it is trivial. The harder one is heap_inplace_update() mutating a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().

I anticipate a new locktag per catalog that can receive inplace updates, i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION. Here's a walk-through for the pg_database case. GRANT will use the following sequence of events:

- acquire LOCKTAG_DATABASE_DEFINITION in exclusive mode
- fetch latest pg_database tuple
- heap_update()
- COMMIT, releasing LOCKTAG_DATABASE_DEFINITION

vac_update_datfrozenxid() sequence of events:

- acquire LOCKTAG_DATABASE_DEFINITION in exclusive mode
- (now, all GRANTs on the given database have committed or aborted)
- fetch latest pg_database tuple
- heap_inplace_update()
- release LOCKTAG_DATABASE_DEFINITION, even if xact not ending
- continue with other steps, e.g. vac_truncate_clog()

How does that compare to what you envisioned? vac_update_datfrozenxid() could further use xmax as a best-efforts thing to catch conflict with manual UPDATE statements, but it wouldn't solve the case where the UPDATE had fetched the tuple but not yet heap_update()'d it.
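(In code form, both sequences might look like the sketch below. SET_LOCKTAG_DATABASE_DEFINITION is a hypothetical macro; no such locktag exists in core today, while LockAcquire() and LockRelease() are the existing lmgr entry points:)

LOCKTAG		tag;

SET_LOCKTAG_DATABASE_DEFINITION(tag, db_id);	/* hypothetical macro */
(void) LockAcquire(&tag, ExclusiveLock, false, false);
/* fetch the latest pg_database tuple, then heap_update() for GRANT,
 * or heap_inplace_update() for vac_update_datfrozenxid() */
LockRelease(&tag, ExclusiveLock, false);	/* inplace updater releases early;
						 * GRANT would hold till COMMIT */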
Noah Misch <noah@leadboat.com> writes:
> On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
>> I'm inclined to propose that heap_inplace_update should check to
>> make sure that it's operating on the latest version of the tuple
>> (including, I guess, waiting for an uncommitted update?) and throw
>> error if not. I think this is what your B3 option is, but maybe
>> I misinterpreted. It might be better to throw error immediately
>> instead of waiting to see if the other updater commits.

> That's perhaps closer to B2. To be pedantic, B3 was about not failing or
> waiting for GRANT to commit but instead inplace-updating every member of the
> update chain. For B2, I was thinking we don't need to error. There are two
> problematic orders of events. The easy one is heap_inplace_update() mutating
> a tuple that already has an xmax. That's the one in the test case upthread,
> and detecting it is trivial. The harder one is heap_inplace_update() mutating
> a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().

Ugh ... you're right, what I was imagining would not catch that last case.

> I anticipate a new locktag per catalog that can receive inplace updates,
> i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.

We could perhaps make this work by using the existing tuple-lock infrastructure, rather than inventing new locktags (a choice that spills to a lot of places including clients that examine pg_locks).

I would prefer though to find a solution that only depends on making heap_inplace_update protect itself, without high-level cooperation from the possibly-interfering updater. This is basically because I'm still afraid that we're defining the problem too narrowly. For one thing, I have nearly zero confidence that GRANT et al are the only problematic source of conflicting transactional updates. For another, I'm worried that some extension may be using heap_inplace_update against a catalog we're not considering here. I'd also like to find a solution that fixes the case of a conflicting manual UPDATE (although certainly that's a stretch goal we may not be able to reach).

I wonder if there's a way for heap_inplace_update to mark the tuple header as just-updated in a way that regular heap_update could recognize. (For standard catalog updates, we'd then end up erroring in simple_heap_update, which I think is fine.) We can't update xmin, because the VACUUM callers don't have an XID; but maybe there's some other way? I'm speculating about putting a funny value into xmax, or something like that, and having heap_update check that what it sees in xmax matches what was in the tuple the update started with.

Or we could try to get rid of in-place updates, but that seems like a mighty big lift. All of the existing callers, except maybe the johnny-come-lately dropdb usage, have solid documented reasons to do it that way.

			regards, tom lane
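(The existing tuple-lock infrastructure referred to here is the heavyweight-lock API in storage/lmgr.h, keyed by relation and TID. A minimal sketch of how an inplace updater could use it, assuming plain ExclusiveLock as the mode; an illustration rather than a settled design:)

LockTuple(relation, &oldtup->t_self, ExclusiveLock);
/* re-fetch the latest tuple version, apply and WAL-log the inplace
 * change; a cooperating transactional updater such as GRANT would
 * take the same tuple lock before fetching its old tuple */
UnlockTuple(relation, &oldtup->t_self, ExclusiveLock);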
On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
> >> I'm inclined to propose that heap_inplace_update should check to
> >> make sure that it's operating on the latest version of the tuple
> >> (including, I guess, waiting for an uncommitted update?) and throw
> >> error if not. I think this is what your B3 option is, but maybe
> >> I misinterpreted. It might be better to throw error immediately
> >> instead of waiting to see if the other updater commits.
>
> > That's perhaps closer to B2. To be pedantic, B3 was about not failing or
> > waiting for GRANT to commit but instead inplace-updating every member of the
> > update chain. For B2, I was thinking we don't need to error. There are two
> > problematic orders of events. The easy one is heap_inplace_update() mutating
> > a tuple that already has an xmax. That's the one in the test case upthread,
> > and detecting it is trivial. The harder one is heap_inplace_update() mutating
> > a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().
>
> Ugh ... you're right, what I was imagining would not catch that last case.
>
> > I anticipate a new locktag per catalog that can receive inplace updates,
> > i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.
>
> We could perhaps make this work by using the existing tuple-lock
> infrastructure, rather than inventing new locktags (a choice that
> spills to a lot of places including clients that examine pg_locks).

That could be okay. It would be weird to reuse a short-term lock like that one as something held till end of transaction. But the alternative of new locktags ain't perfect, as you say.

> I would prefer though to find a solution that only depends on making
> heap_inplace_update protect itself, without high-level cooperation
> from the possibly-interfering updater. This is basically because
> I'm still afraid that we're defining the problem too narrowly.
> For one thing, I have nearly zero confidence that GRANT et al are
> the only problematic source of conflicting transactional updates.

Likewise here, but I have fair confidence that an assertion would flush out the rest. heap_inplace_update() would assert that the backend holds one of the acceptable locks. It could even be an elog; heap_inplace_update() can tolerate that cost.

> For another, I'm worried that some extension may be using
> heap_inplace_update against a catalog we're not considering here.

A pgxn search finds "citus" using heap_inplace_update().

> I'd also like to find a solution that fixes the case of a conflicting
> manual UPDATE (although certainly that's a stretch goal we may not be
> able to reach).

It would be nice.

> I wonder if there's a way for heap_inplace_update to mark the tuple
> header as just-updated in a way that regular heap_update could
> recognize. (For standard catalog updates, we'd then end up erroring
> in simple_heap_update, which I think is fine.) We can't update xmin,
> because the VACUUM callers don't have an XID; but maybe there's some
> other way? I'm speculating about putting a funny value into xmax,
> or something like that, and having heap_update check that what it
> sees in xmax matches what was in the tuple the update started with.

Hmmm. Achieving it without an XID would be the real trick. (With an XID, we could use xl_heap_lock like heap_update() does.) Thinking out loud, what if heap_inplace_update() sets HEAP_XMAX_INVALID and xmax = TransactionIdAdvance(xmax)? Or change t_ctid in a similar way. Then regular heap_update() could complain if the field changed vs. last seen value. This feels like something to regret later in terms of limiting our ability to harness those fields for more-valuable ends or compact them away in a future page format. I can't pinpoint a specific loss, so the idea might have legs.

Nontransactional data in separate tables or in new metapages smells like the right long-term state. A project wanting to reuse the tuple header bits could introduce such storage to unblock its own bit reuse.

> Or we could try to get rid of in-place updates, but that seems like
> a mighty big lift. All of the existing callers, except maybe
> the johnny-come-lately dropdb usage, have solid documented reasons
> to do it that way.

Yes, removing that smells problematic.
I prototyped two ways, one with a special t_ctid and one with LockTuple().

On Fri, Oct 27, 2023 at 04:26:12PM -0700, Noah Misch wrote:
> On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> > Noah Misch <noah@leadboat.com> writes:
> > > On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
> > > I anticipate a new locktag per catalog that can receive inplace updates,
> > > i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.
> >
> > We could perhaps make this work by using the existing tuple-lock
> > infrastructure, rather than inventing new locktags (a choice that
> > spills to a lot of places including clients that examine pg_locks).
>
> That could be okay. It would be weird to reuse a short-term lock like that
> one as something held till end of transaction. But the alternative of new
> locktags ain't perfect, as you say.

That worked.

> > I would prefer though to find a solution that only depends on making
> > heap_inplace_update protect itself, without high-level cooperation
> > from the possibly-interfering updater. This is basically because
> > I'm still afraid that we're defining the problem too narrowly.
> > For one thing, I have nearly zero confidence that GRANT et al are
> > the only problematic source of conflicting transactional updates.
>
> Likewise here, but I have fair confidence that an assertion would flush out
> the rest. heap_inplace_update() would assert that the backend holds one of
> the acceptable locks. It could even be an elog; heap_inplace_update() can
> tolerate that cost.

That check would fall in both heap_inplace_update() and heap_update(). After all, a heap_inplace_update() check won't detect an omission in GRANT.

> > For another, I'm worried that some extension may be using
> > heap_inplace_update against a catalog we're not considering here.
>
> A pgxn search finds "citus" using heap_inplace_update().
>
> > I'd also like to find a solution that fixes the case of a conflicting
> > manual UPDATE (although certainly that's a stretch goal we may not be
> > able to reach).
>
> It would be nice.

I expect most approaches could get there by having ExecModifyTable() arrange for the expected locking or other actions. That's analogous to how heap_update() takes care of sinval even for a manual UPDATE.

> > I wonder if there's a way for heap_inplace_update to mark the tuple
> > header as just-updated in a way that regular heap_update could
> > recognize. (For standard catalog updates, we'd then end up erroring
> > in simple_heap_update, which I think is fine.) We can't update xmin,
> > because the VACUUM callers don't have an XID; but maybe there's some
> > other way? I'm speculating about putting a funny value into xmax,
> > or something like that, and having heap_update check that what it
> > sees in xmax matches what was in the tuple the update started with.
>
> Hmmm. Achieving it without an XID would be the real trick. (With an XID, we
> could use xl_heap_lock like heap_update() does.) Thinking out loud, what if
> heap_inplace_update() sets HEAP_XMAX_INVALID and xmax =
> TransactionIdAdvance(xmax)? Or change t_ctid in a similar way. Then regular
> heap_update() could complain if the field changed vs. last seen value. This
> feels like something to regret later in terms of limiting our ability to
> harness those fields for more-valuable ends or compact them away in a future
> page format. I can't pinpoint a specific loss, so the idea might have legs.
>
> Nontransactional data in separate tables or in new metapages smells like the
> right long-term state. A project wanting to reuse the tuple header bits could
> introduce such storage to unblock its own bit reuse.

heap_update() does not have the pre-modification xmax today, so I used t_ctid. heap_modify_tuple() preserves t_ctid, so heap_update() already has the pre-modification t_ctid in key cases. For details of how the prototype uses t_ctid, see comment at "#define InplaceCanaryOffsetNumber". The prototype doesn't prevent corruption in the following scenario, because the aborted ALTER TABLE RENAME overwrites the special t_ctid:

== session 1
drop table t;
create table t (c int);
begin;
-- in gdb, set breakpoint on heap_modify_tuple
grant select on t to public;

== session 2
alter table t add primary key (c);
begin;
alter table t rename to t2;
rollback;

== back in session 1
-- release breakpoint
-- want error (would get it w/o the begin;alter;rollback)
commit;

I'm missing how to mark the tuple in a fashion accessible to a second heap_update() after a rolled-back heap_update(). The mark needs enough bits "N" so it's implausible for 2^N inplace updates to happen between GRANT fetching the old tuple and GRANT completing heap_update(). Looking for bits that could persist across a rolled-back heap_update(), we have 3 in t_ctid, 2 in t_infomask2, and 0 in xmax. I definitely don't want to paint us into a corner by spending the t_infomask2 bits on this. Even if I did, 2^(3+2)=32 wouldn't clearly be enough inplace updates.

Is there a way to salvage the goal of fixing the bug without modifying code like ExecGrant_common()? If not, I'm inclined to pick from one of the following designs:

- Acquire ShareUpdateExclusiveLock in GRANT ((B1) from previous list). It does
  make GRANT more intrusive; e.g. GRANT will cancel autovacuum. I'm leaning
  toward this one for two reasons. First, it doesn't slow heap_update()
  outside of assert builds. Second, it makes the GRANT experience more like
  the rest of our DDL, in that concurrent DDL will make GRANT block, not fail.

- GRANT passes to heapam the fixed-size portion of the pre-modification tuple.
  heap_update() compares those bytes to the oldtup in shared buffers to see if
  an inplace update happened. (HEAD could get the bytes from a new
  heap_update() parameter, while back branches would need a different passing
  approach.)

- LockTuple(), as seen in its attached prototype. I like this least at the
  moment, because it changes pg_locks content without having a clear advantage
  over the previous option. Also, the prototype has enough FIXME markers that
  I expect this to get hairy before it's done. I might change my preference
  after further prototypes.

Does anyone have a strong preference between those? Briefly, I did consider these additional alternatives:

- Just accept the yet-rarer chance of corruption from this message's test
  procedure.
- Hold a buffer lock long enough to solve things.
- Remember the tuples where we overwrote a special t_ctid, and reverse the
  overwrite during abort processing. But I/O in the abort path sounds bad.

Thanks,
nm
I'm attaching patches implementing the LockTuple() design. It turns out we don't just lose inplace updates. We also overwrite unrelated tuples, reproduced at inplace.spec. Good starting points are README.tuplock and the heap_inplace_update_scan() header comment.

On Wed, Nov 01, 2023 at 08:09:15PM -0700, Noah Misch wrote:
> On Fri, Oct 27, 2023 at 04:26:12PM -0700, Noah Misch wrote:
> > On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> > > We could perhaps make this work by using the existing tuple-lock
> > > infrastructure, rather than inventing new locktags (a choice that
> > > spills to a lot of places including clients that examine pg_locks).

> > > I'd also like to find a solution that fixes the case of a conflicting
> > > manual UPDATE (although certainly that's a stretch goal we may not be
> > > able to reach).

I implemented that; search for ri_needLockTagTuple.

> - GRANT passes to heapam the fixed-size portion of the pre-modification tuple.
>   heap_update() compares those bytes to the oldtup in shared buffers to see if
>   an inplace update happened. (HEAD could get the bytes from a new
>   heap_update() parameter, while back branches would need a different passing
>   approach.)

This could have been fine, but ...

> - LockTuple(), as seen in its attached prototype. I like this least at the
>   moment, because it changes pg_locks content without having a clear advantage
>   over the previous option.

... I settled on the LockTuple() design for these reasons:

- Solves more conflicts by waiting, instead of by ERROR or by retry loops.
- Extensions wanting inplace updates don't have a big disadvantage over core
  code inplace updates.
- One could use this to stop "tuple concurrently updated" for pg_class rows,
  by using SearchSysCacheLocked1() for all pg_class DDL and making that
  function wait for any existing xmax like inplace_xmax_lock() does. I don't
  expect to write that, but it's a nice option to have.
- pg_locks shows the new lock acquisitions.

Separable, nontrivial things not fixed in the attached patch stack:

- Inplace update uses transactional CacheInvalidateHeapTuple(). ROLLBACK of
  CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
  still seen in inplace-inval.spec. CacheInvalidateRelmap() does this right.

- AtEOXact_Inval(true) is outside the RecordTransactionCommit() critical
  section, but it is critical. We must not commit transactional DDL without
  other backends receiving an inval. (When the inplace inval becomes
  nontransactional, it will face the same threat.)

- Trouble is possible, I bet, if the system crashes between the
  inplace-update memcpy() and XLogInsert(). See the new XXX comment below the
  memcpy(). Might solve this by inplace update setting DELAY_CHKPT, writing
  WAL, and finally issuing memcpy() into the buffer.

- [consequences limited to transient failure] Since a PROC_IN_VACUUM backend's
  xmin does not stop pruning, an MVCC scan in that backend can find zero
  tuples when one is live. This is like what all backends got in the days of
  SnapshotNow catalog scans. See the pgbench test suite addition. (Perhaps
  the fix is to make VACUUM do its MVCC scans outside of PROC_IN_VACUUM,
  setting that flag later and unsetting it earlier.)

If you find decisions in this thread's patches are tied to any of those such that I should not separate those, let's discuss that.

Topics in the patches that I feel are most fruitful for debate:

- This makes inplace update block if the tuple has an updater. It's like one
  GRANT blocking another, except an inplace updater won't get "ERROR: tuple
  concurrently updated" like one of the GRANTs would. I had implemented
  versions that avoided this blocking by mutating each tuple in the updated
  tuple chain. That worked, but it had corner cases bad for maintainability,
  listed in the inplace_xmax_lock() header comment. I'd rather accept the
  blocking, so hackers can rule out those corner cases. A long-running GRANT
  already hurts VACUUM progress more just by keeping an XID running.

- Pre-checks could make heap_inplace_update_cancel() calls rarer. Avoiding
  one of those avoids an exclusive buffer lock, and it avoids waiting on
  concurrent heap_update() if any. We'd pre-check the syscache tuple.
  EventTriggerOnLogin() does it that way, because the code was already in
  that form. I expect only vac_update_datfrozenxid() concludes !dirty enough
  to matter. I didn't bother with the optimization, but it would be simple.

- If any citus extension user feels like porting its heap_inplace_update()
  call to this, I'd value hearing about your experience.

- I paid more than my usual attention to test coverage, considering the patch
  stack's intensity compared to most back-patch bug fixes.

I've kept all the above topics brief; feel free to ask for more details.

Thanks,
nm
Attachments:
- inplace005-UNEXPECTEDPASS-tap-meson-v1.patch
- inplace010-tests-v1.patch
- inplace040-waitfuncs-v1.patch
- inplace050-tests-inj-v1.patch
- inplace060-nodeModifyTable-comments-v1.patch
- inplace070-rel-locks-missing-v1.patch
- inplace080-catcache-detoast-inplace-stale-v1.patch
- inplace090-LOCKTAG_TUPLE-eoxact-v1.patch
- inplace110-successors-v1.patch
- inplace120-locktag-v1.patch
On Sun, May 12, 2024 at 04:29:23PM -0700, Noah Misch wrote:
> I'm attaching patches implementing the LockTuple() design. It turns out we
> don't just lose inplace updates. We also overwrite unrelated tuples,
> reproduced at inplace.spec. Good starting points are README.tuplock and the
> heap_inplace_update_scan() header comment.

About inplace050-tests-inj-v1.patch.

+	/* Check if blocked_pid is in injection_wait(). */
+	proc = BackendPidGetProc(blocked_pid);
+	if (proc == NULL)
+		PG_RETURN_BOOL(false);	/* session gone: definitely unblocked */
+	wait_event =
+		pgstat_get_wait_event(UINT32_ACCESS_ONCE(proc->wait_event_info));
+	if (wait_event && strncmp("INJECTION_POINT(",
+				  wait_event,
+				  strlen("INJECTION_POINT(")) == 0)
+		PG_RETURN_BOOL(true);

Hmm. I am not sure that this is the right interface for the job, because this is not only related to injection points but to the monitoring of one or more wait events when running a permutation step. Perhaps this is something that should be linked to the spec files with some property area listing the wait events we're expected to wait on instead when running a step that we know will wait?
--
Michael
On Mon, May 13, 2024 at 04:59:59PM +0900, Michael Paquier wrote:
> About inplace050-tests-inj-v1.patch.
>
> +	/* Check if blocked_pid is in injection_wait(). */
> +	proc = BackendPidGetProc(blocked_pid);
> +	if (proc == NULL)
> +		PG_RETURN_BOOL(false);	/* session gone: definitely unblocked */
> +	wait_event =
> +		pgstat_get_wait_event(UINT32_ACCESS_ONCE(proc->wait_event_info));
> +	if (wait_event && strncmp("INJECTION_POINT(",
> +				  wait_event,
> +				  strlen("INJECTION_POINT(")) == 0)
> +		PG_RETURN_BOOL(true);
>
> Hmm. I am not sure that this is the right interface for the job,
> because this is not only related to injection points but to the
> monitoring of one or more wait events when running a permutation
> step.

Could you say more about that? Permutation steps don't monitor wait events today. This patch would be the first instance of that.

> Perhaps this is something that should be linked to the spec
> files with some property area listing the wait events we're expected
> to wait on instead when running a step that we know will wait?

The spec syntax doesn't distinguish contention types at all. The isolation tester's needs are limited to distinguishing:

(a) process is waiting on another test session
(b) process is waiting on automatic background activity (autovacuum, mainly)

Automatic background activity doesn't make a process enter or leave injection_wait(), so all injection point wait events fall in (a). (The tester ignores (b), since those clear up without intervention. Failing to ignore them, as the tester did long ago, made output unstable.)
On Sun, May 12, 2024 at 7:29 PM Noah Misch <noah@leadboat.com> wrote:
> - [consequences limited to transient failure] Since a PROC_IN_VACUUM backend's
>   xmin does not stop pruning, an MVCC scan in that backend can find zero
>   tuples when one is live. This is like what all backends got in the days of
>   SnapshotNow catalog scans. See the pgbench test suite addition. (Perhaps
>   the fix is to make VACUUM do its MVCC scans outside of PROC_IN_VACUUM,
>   setting that flag later and unsetting it earlier.)

Are you saying that this is a problem already, or that the patch causes it to start happening? If it's the former, that's horrible. If it's the latter, I'd say that is a fatal defect.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, May 13, 2024 at 03:53:08PM -0400, Robert Haas wrote:
> On Sun, May 12, 2024 at 7:29 PM Noah Misch <noah@leadboat.com> wrote:
> > - [consequences limited to transient failure] Since a PROC_IN_VACUUM backend's
> >   xmin does not stop pruning, an MVCC scan in that backend can find zero
> >   tuples when one is live. This is like what all backends got in the days of
> >   SnapshotNow catalog scans. See the pgbench test suite addition. (Perhaps
> >   the fix is to make VACUUM do its MVCC scans outside of PROC_IN_VACUUM,
> >   setting that flag later and unsetting it earlier.)
>
> Are you saying that this is a problem already, or that the patch
> causes it to start happening? If it's the former, that's horrible.

The former.
On Sun, May 12, 2024 at 04:29:23PM -0700, Noah Misch wrote:
> I'm attaching patches implementing the LockTuple() design.

Starting 2024-06-10, I plan to push the first seven of the ten patches:

inplace005-UNEXPECTEDPASS-tap-meson-v1.patch
inplace010-tests-v1.patch
inplace040-waitfuncs-v1.patch
inplace050-tests-inj-v1.patch
inplace060-nodeModifyTable-comments-v1.patch
  Those five just deal in tests, test infrastructure, and comments.
inplace070-rel-locks-missing-v1.patch
  Main risk is new DDL deadlocks.
inplace080-catcache-detoast-inplace-stale-v1.patch
  If it fails to fix the bug it targets, I expect it's a no-op rather than
  breaking things.

I'll leave the last three of the ten needing review. Those three are beyond my skill to self-certify.
On Wed, Jun 5, 2024 at 2:17 PM Noah Misch <noah@leadboat.com> wrote:
> Starting 2024-06-10, I plan to push the first seven of the ten patches:
>
> inplace005-UNEXPECTEDPASS-tap-meson-v1.patch
> inplace010-tests-v1.patch
> inplace040-waitfuncs-v1.patch
> inplace050-tests-inj-v1.patch
> inplace060-nodeModifyTable-comments-v1.patch
>   Those five just deal in tests, test infrastructure, and comments.
> inplace070-rel-locks-missing-v1.patch
>   Main risk is new DDL deadlocks.
> inplace080-catcache-detoast-inplace-stale-v1.patch
>   If it fails to fix the bug it targets, I expect it's a no-op rather than
>   breaking things.
>
> I'll leave the last three of the ten needing review. Those three are beyond
> my skill to self-certify.

It's not this patch set's fault, but I'm not very pleased to see that the injection point wait events have been shoehorned into the "Extension" category - which they are not - instead of being a new wait_event_type. That would have avoided the ugly wait-event naming pattern, inconsistent with everything else, introduced by inplace050-tests-inj-v1.patch.

I think that the comments and commit messages in this patch set could, in some places, use improvement. For instance, inplace060-nodeModifyTable-comments-v1.patch reflows a bunch of comments, which makes it hard to see what actually changed, and the commit message doesn't tell you, either. A good bit of it seems to be changing "a view" to "a view INSTEAD OF trigger" or "a view having an INSTEAD OF trigger," but the reasoning behind that change is not spelled out anywhere. The reader is left to guess what the other case is and why the same principles don't apply to it. I don't doubt that the new comments are more correct than the old ones, but I expect future patch authors to have difficulty maintaining that state of affairs.

Similarly, inplace070-rel-locks-missing-v1.patch adds no comments. IMHO, the commit message also isn't very informative. It disclaims knowledge of what bug it's fixing, while at the same time leaving the reader to figure out for themselves how the behavior has changed. Consequently, I expect writing the release notes for a release including this patch to be difficult: "We added some locks that block ... something ... in some circumstances ... to prevent ... something." It's not really the job of the release note author to fill in those blanks, but rather of the patch author or committer. I don't want to overburden the act of fixing bugs, but I just feel like more explanation is needed here.

When I see for example that we're adding a lock acquisition to the end of heap_create(), I can't help but wonder if it's really true that we don't take a lock on a just-created relation today. I'm certainly under the impression that we lock newly-created, uncommitted relations, and a quick test seems to confirm that. I don't quite know whether that happens, but evidently this call is guarding against something more subtle than a categorical failure to lock a relation on creation, so I think there should be a comment explaining what that thing is.

It's also quite surprising that SetRelationHasSubclass() says "take X lock before calling" and 2 of 4 callers just don't. I guess that's how it is. But shouldn't we then have an assertion inside that function to guard against future mistakes? If the reason why we failed to add this initially is discernible from the commit messages that introduced the bug, it would be nice to mention what it seems to have been; if not, it would at least be nice to mention the offending commit(s).

I'm also a bit worried that this is going to cause deadlocks, but I suppose if it does, that's still better than the status quo.

IsInplaceUpdateOid's header comment says IsInplaceUpdateRelation instead of IsInplaceUpdateOid.

inplace080-catcache-detoast-inplace-stale-v1.patch seems like another place where spelling out the rationale in more detail would be helpful to future readers; for instance, the commit message says that PgDatabaseToastTable is the only one affected, but it doesn't say why the others are not, or why this one is. The lengthy comment in CatalogCacheCreateEntry is also difficult to correlate with the code which follows. I can't guess whether the two cases called out in the comment always needed to be handled and were handled save only for in-place updates, and thus the comment changes were simply taking the opportunity to elaborate on the existing comments; or whether one of those cases is preexisting and the other arises from the desire to handle inplace updates. It could be helpful to mention relevant identifiers from the code in the comment text, e.g. "systable_recheck_tuple detects ordinary updates by noting changes to the tuple's visibility information, while the equalTuple() case detects inplace updates."

IMHO, this patch set underscores the desirability of removing in-place update altogether. That sounds difficult and not back-patchable, but I can't classify what this patch set does as anything better than grotty hacks to work around serious design deficiencies. That is not a vote against these patches: I see no better way forward. Nonetheless, I dislike the lack of better options.

I have done only cursory review of the last two patches and don't feel I'm in a place to certify them, at least not now.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> It's not this patch set's fault, but I'm not very pleased to see that
> the injection point wait events have been shoehorned into the
> "Extension" category - which they are not - instead of being a new
> wait_event_type. That would have avoided the ugly wait-event naming
> pattern, inconsistent with everything else, introduced by
> inplace050-tests-inj-v1.patch.

I'm not sure I agree with that. The set of core backend APIs supporting injection points has nothing to do with wait events. The library attached to one or more injection points *may* decide to use a wait event like what the wait/wakeup calls in modules/injection_points do, but that's entirely optional. These rely on custom wait events, plugged into the Extension category as the code run is itself in an extension. I am not arguing against the point that it may be interesting to plug in custom wait event categories, but the current design of wait events makes that much harder than what core is currently able to handle, and I am not sure that this brings much at the end as long as the wait event strings can be customized.

I've voiced upthread concerns over the naming enforced by the patch and the way it plugs the namings into the isolation functions, by the way.
--
Michael
On Thu, Jun 6, 2024 at 7:20 PM Michael Paquier <michael@paquier.xyz> wrote:
> I'm not sure I agree with that. The set of core backend APIs supporting
> injection points has nothing to do with wait events. The library
> attached to one or more injection points *may* decide to use a wait
> event like what the wait/wakeup calls in modules/injection_points do,
> but that's entirely optional. These rely on custom wait events,
> plugged into the Extension category as the code run is itself in an
> extension. I am not arguing against the point that it may be
> interesting to plug in custom wait event categories, but the current
> design of wait events makes that much harder than what core is
> currently able to handle, and I am not sure that this brings much at
> the end as long as the wait event strings can be customized.
>
> I've voiced upthread concerns over the naming enforced by the patch
> and the way it plugs the namings into the isolation functions, by the
> way.

I think the core code should provide an "Injection Point" wait event type and let extensions add specific wait events there, just like you did for "Extension". Then this ugly naming would go away. As I see it, "Extension" is only supposed to be used as a catch-all when we have no other information, but here we do. If we refuse to use the wait_event_type field to categorize waits, then people are going to have to find some other way to get that data into the system, as Noah has done.

--
Robert Haas
EDB: http://www.enterprisedb.com