Обсуждение: interval_ops shall stop using btequalimage (deduplication)
The btequalimage() header comment says: * Generic "equalimage" support function. * * B-Tree operator classes whose equality function could safely be replaced by * datum_image_eq() in all cases can use this as their "equalimage" support * function. interval_ops, however, recognizes equal-but-distinguishable values: create temp table t (c interval); insert into t values ('1d'::interval), ('24h'); table t; select distinct c from t; The CREATE INDEX of the following test: begin; create table t (c interval); insert into t select x from generate_series(1,500), (values ('1 year 1 month'::interval), ('1 year 30 days')) t(x); select distinct c from t; create index ti on t (c); rollback; Fails with: 2498151 2023-10-10 05:06:46.177 GMT DEBUG: building index "ti" on table "t" serially 2498151 2023-10-10 05:06:46.178 GMT DEBUG: index "ti" can safely use deduplication TRAP: failed Assert("!itup_key->allequalimage || keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright)"), File: "nbtutils.c",Line: 2443, PID: 2498151 I've also caught btree posting lists where one TID refers to a '1d' heap tuple, while another TID refers to a '24h' heap tuple. amcheck complains. Index-only scans can return the '1d' bits where the actual tuple had the '24h' bits. Are there other consequences to highlight in the release notes? The back-branch patch is larger, to fix things without initdb. Hence, I'm attaching patches for HEAD and for v16 (trivial to merge back from there). I glanced at the other opfamilies permitting deduplication, and they look okay: [local] test=*# select amproc, amproclefttype = amprocrighttype as l_eq_r, array_agg(array[opfname, amproclefttype::regtype::text])from pg_amproc join pg_opfamily f on amprocfamily = f.oid where amprocnum = 4 and opfmethod= 403 group by 1,2; ─[ RECORD 1 ]─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── amproc │ btequalimage l_eq_r │ t array_agg │ {{bit_ops,bit},{bool_ops,boolean},{bytea_ops,bytea},{char_ops,"\"char\""},{datetime_ops,date},{datetime_ops,"timestamp withouttime zone"},{datetime_ops,"timestamp with time zone"},{network_ops,inet},{integer_ops,smallint},{integer_ops,integer},{integer_ops,bigint},{interval_ops,interval},{macaddr_ops,macaddr},{oid_ops,oid},{oidvector_ops,oidvector},{time_ops,"time withouttime zone"},{timetz_ops,"time with time zone"},{varbit_ops,"bit varying"},{text_pattern_ops,text},{bpchar_pattern_ops,character},{money_ops,money},{tid_ops,tid},{uuid_ops,uuid},{pg_lsn_ops,pg_lsn},{macaddr8_ops,macaddr8},{enum_ops,anyenum},{xid8_ops,xid8}} ─[ RECORD 2 ]─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── amproc │ btvarstrequalimage l_eq_r │ t array_agg │ {{bpchar_ops,character},{text_ops,text},{text_ops,name}} Thanks, nm
Вложения
On Tue, Oct 10, 2023 at 6:33 PM Noah Misch <noah@leadboat.com> wrote: > interval_ops, however, recognizes equal-but-distinguishable values: > Fails with: > > 2498151 2023-10-10 05:06:46.177 GMT DEBUG: building index "ti" on table "t" serially > 2498151 2023-10-10 05:06:46.178 GMT DEBUG: index "ti" can safely use deduplication > TRAP: failed Assert("!itup_key->allequalimage || keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright)"), File:"nbtutils.c", Line: 2443, PID: 2498151 Nice catch. Out of curiosity, how did you figure this out? Did it just occur to you that interval_ops had a behavior that made deduplication unsafe? Or did the problem come to your attention after running amcheck on a customer database? Or was it something else? > I've also caught btree posting lists where one TID refers to a '1d' heap > tuple, while another TID refers to a '24h' heap tuple. amcheck complains. > Index-only scans can return the '1d' bits where the actual tuple had the '24h' > bits. Are there other consequences to highlight in the release notes? Nothing else comes to mind right now. I should think about posting list splits some more tomorrow, though -- those are tricky. > The back-branch patch is larger, to fix things without initdb. Hence, I'm > attaching patches for HEAD and for v16 (trivial to merge back from there). > I glanced at the other opfamilies permitting deduplication, and they look okay: Due to the way that nbtsplitloc.c deals with duplicates, I'd expect the same assertion failure with any index where a single leaf page is filled with opclass-wise duplicates with more than one distinct representation/output -- the details beyond that shouldn't matter. I was happy with how easy it was to make this assertion fail (with a known broken numeric_ops opclass) while testing/developing deduplication. I'm a little surprised that it took this long to notice the interval_ops issue. Do we really need to change the catalog contents when backpatching? -- Peter Geoghegan
On Tue, Oct 10, 2023 at 08:12:36PM -0700, Peter Geoghegan wrote: > On Tue, Oct 10, 2023 at 6:33 PM Noah Misch <noah@leadboat.com> wrote: > > interval_ops, however, recognizes equal-but-distinguishable values: > > > Fails with: > > > > 2498151 2023-10-10 05:06:46.177 GMT DEBUG: building index "ti" on table "t" serially > > 2498151 2023-10-10 05:06:46.178 GMT DEBUG: index "ti" can safely use deduplication > > TRAP: failed Assert("!itup_key->allequalimage || keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright)"), File:"nbtutils.c", Line: 2443, PID: 2498151 > > Nice catch. > > Out of curiosity, how did you figure this out? Did it just occur to > you that interval_ops had a behavior that made deduplication unsafe? > Or did the problem come to your attention after running amcheck on a > customer database? Or was it something else? My friend got an amcheck "lacks matching index tuple" failure, and they asked me about it. I ran into the assertion failure while reproducing things. > I'm a little surprised that it took this long to notice > the interval_ops issue. Agreed. I don't usually store interval values in tables, and I'm not sure I've ever indexed one. Who knows. > Do we really need to change the catalog contents when backpatching? Not really. I think we usually do. On the other hand, unlike some past cases, there's no functional need for the catalog changes. The catalog changes just get a bit of efficiency. No strong preference here.
On Tue, Oct 10, 2023 at 8:29 PM Noah Misch <noah@leadboat.com> wrote: > My friend got an amcheck "lacks matching index tuple" failure, and they asked > me about it. I ran into the assertion failure while reproducing things. Reminds me of the time that amcheck found a bug in the default btree opclass for PostGIS's geography type. The opclass wasn't really intended to be used for indexing. The issue with the opclass (which violated transitive consistency) would probably never have been detected were it not for the tendency of PostGIS users to accidentally create useless B-Tree indexes on geography columns. Users sometimes omitted "using gist", without necessarily noticing that the index was practically useless. > > Do we really need to change the catalog contents when backpatching? > > Not really. I think we usually do. On the other hand, unlike some past > cases, there's no functional need for the catalog changes. The catalog > changes just get a bit of efficiency. No strong preference here. I'll defer to you on this question, then. I don't see any reason to delay committing your fix. The issue that you've highlighted is exactly the kind of issue that I anticipated might happen at some point. This seems straightforward. -- Peter Geoghegan
On Tue, Oct 10, 2023 at 8:51 PM Peter Geoghegan <pg@bowt.ie> wrote: > I don't see any reason to delay committing your fix. The issue that > you've highlighted is exactly the kind of issue that I anticipated > might happen at some point. This seems straightforward. BTW, we don't need to recommend the heapallindexed option in the release notes. Calling bt_check_index() will reliably indicate that corruption is present when called against existing interval_ops indexes once your fix is in. Simply having an index whose metapage's allequalimage field is spuriously set to true will be recognized as corruption right away. Obviously, this will be no less true with an existing interval_ops index that happens to be completely empty. -- Peter Geoghegan
On Tue, Oct 10, 2023 at 09:35:45PM -0700, Peter Geoghegan wrote: > On Tue, Oct 10, 2023 at 8:51 PM Peter Geoghegan <pg@bowt.ie> wrote: > > I don't see any reason to delay committing your fix. The issue that > > you've highlighted is exactly the kind of issue that I anticipated > > might happen at some point. This seems straightforward. > > BTW, we don't need to recommend the heapallindexed option in the > release notes. Calling bt_check_index() will reliably indicate that > corruption is present when called against existing interval_ops > indexes once your fix is in. Simply having an index whose metapage's > allequalimage field is spuriously set to true will be recognized as > corruption right away. Obviously, this will be no less true with an > existing interval_ops index that happens to be completely empty. Interesting. So, >99% of interval-type indexes, even ones WITH (deduplicate_items=off), will get amcheck failures. The <1% of exceptions might include indexes having allequalimage=off due to an additional column, e.g. a two-column (interval, numeric) index. If interval indexes are common enough and "pg_amcheck --heapallindexed" failures from $SUBJECT are relatively rare, that could argue for giving amcheck a special case. Specifically, downgrade its "metapage incorrectly indicates that deduplication is safe" from ERROR to WARNING for interval_ops only. Without that special case (i.e. with the v1 patch), the release notes should probably resemble, "After updating, run REINDEX on all indexes having an interval-type column." There's little point in recommending pg_amcheck if >99% will fail. I'm inclined to bet that interval-type indexes are rare, so I lean against adding the amcheck special case. It's not a strong preference. Other opinions? If users want to reduce their exposure now, they could do "ALTER INDEX ... SET (deduplicate_items = off)" and then REINDEX any indexes already failing "pg_amcheck --heapallindexed". allequalimage will remain wrong but have no ill effects beyond making amcheck fail. Another REINDEX after the update would let amcheck pass.
On Wed, Oct 11, 2023 at 11:38 AM Noah Misch <noah@leadboat.com> wrote: > Interesting. So, >99% of interval-type indexes, even ones WITH > (deduplicate_items=off), will get amcheck failures. The <1% of exceptions > might include indexes having allequalimage=off due to an additional column, > e.g. a two-column (interval, numeric) index. If interval indexes are common > enough and "pg_amcheck --heapallindexed" failures from $SUBJECT are relatively > rare, that could argue for giving amcheck a special case. Specifically, > downgrade its "metapage incorrectly indicates that deduplication is safe" from > ERROR to WARNING for interval_ops only. I am not aware of any user actually running "deduplicate_items = off" in production, for any index. It was added purely as a defensive thing -- not because I anticipated any real need to disable deduplication. Deduplication was optimized for being enabled by default. Anything is possible, of course, but it's hard to give too much weight to cases where two very unlikely things happen to intersect. (Plus "deduplicate_items = off" doesn't really do that much; more on that below.) > Without that special case (i.e. with > the v1 patch), the release notes should probably resemble, "After updating, > run REINDEX on all indexes having an interval-type column." +1 > There's little > point in recommending pg_amcheck if >99% will fail. I'm inclined to bet that > interval-type indexes are rare, so I lean against adding the amcheck special > case. It's not a strong preference. Other opinions? Well, there will only be one known reason why anybody will ever see this test fail (barring a near-miraculous coincidence where "generic corruption" somehow gets passed our most basic sanity tests, only to fail this metapage field check a bit later on). Even if you pessimistically assume that similar problems remain undiscovered in some other opfamily, this particular check isn't going to be the check that detects the other problems -- you really would need heapallindexed verification for that. In short, this metapage check is only effective retrospectively, once we recognize and fix problems in an opclass. Clearly there will be exactly one case like that post-fix (interval_ops is at least the only affected core code opfamily), so why not point that out directly with a HINT? A HINT could go a long way towards putting the problem in context, without really adding a special case, and without any real question of users being misled. > If users want to reduce their exposure now, they could do "ALTER INDEX ... SET > (deduplicate_items = off)" and then REINDEX any indexes already failing > "pg_amcheck --heapallindexed". allequalimage will remain wrong but have no > ill effects beyond making amcheck fail. Another REINDEX after the update > would let amcheck pass. Even when "deduplicate_items = off", that just means that the nbtree code won't apply further deduplication passes going forward (until such time as deduplication is reenabled). It doesn't really mean that this problem can't exist. OTOH it's easy to detect affected indexes using SQL. So this is one case where telling users to REINDEX really does seem like the best thing (as opposed to something we say because we're too lazy to come up with nuanced, practical guidance). -- Peter Geoghegan
On Wed, Oct 11, 2023 at 01:00:44PM -0700, Peter Geoghegan wrote: > On Wed, Oct 11, 2023 at 11:38 AM Noah Misch <noah@leadboat.com> wrote: > > Interesting. So, >99% of interval-type indexes, even ones WITH > > (deduplicate_items=off), will get amcheck failures. The <1% of exceptions > > might include indexes having allequalimage=off due to an additional column, > > e.g. a two-column (interval, numeric) index. If interval indexes are common > > enough and "pg_amcheck --heapallindexed" failures from $SUBJECT are relatively > > rare, that could argue for giving amcheck a special case. Specifically, > > downgrade its "metapage incorrectly indicates that deduplication is safe" from > > ERROR to WARNING for interval_ops only. > > I am not aware of any user actually running "deduplicate_items = off" > in production, for any index. It was added purely as a defensive thing > -- not because I anticipated any real need to disable deduplication. > Deduplication was optimized for being enabled by default. Sure. Low-importance background information: deduplicate_items=off got on my radar while I was wondering if ALTER INDEX ... SET (deduplicate_items=off) would clear allequalimage. If it had, we could have advised people to use ALTER INDEX, then rebuild only those indexes still failing "pg_amcheck --heapallindexed". ALTER INDEX doesn't do that, ruling out that idea. > > Without that special case (i.e. with > > the v1 patch), the release notes should probably resemble, "After updating, > > run REINDEX on all indexes having an interval-type column." > > +1 > > > There's little > > point in recommending pg_amcheck if >99% will fail. I'm inclined to bet that > > interval-type indexes are rare, so I lean against adding the amcheck special > > case. It's not a strong preference. Other opinions? > exactly one case like that post-fix (interval_ops is at least the only > affected core code opfamily), so why not point that out directly with > a HINT? A HINT could go a long way towards putting the problem in > context, without really adding a special case, and without any real > question of users being misled. Works for me. Added.
Вложения
On Thu, Oct 12, 2023 at 4:10 PM Noah Misch <noah@leadboat.com> wrote: > > exactly one case like that post-fix (interval_ops is at least the only > > affected core code opfamily), so why not point that out directly with > > a HINT? A HINT could go a long way towards putting the problem in > > context, without really adding a special case, and without any real > > question of users being misled. > > Works for me. Added. Looks good. Thanks! -- Peter Geoghegan
> I've also caught btree posting lists where one TID refers to a '1d' heap
> tuple, while another TID refers to a '24h' heap tuple. amcheck complains.
Index-only scans can return the '1d' bits where the actual tuple had the '24h'
bits.
postgres=# explain (analyze, buffers) select c::text, count(c) from t group by c::text;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=49.27..49.29 rows=1 width=40) (actual time=3.329..3.333 rows=2 loops=1)
Group Key: (c)::text
Batches: 1 Memory Usage: 24kB
Buffers: shared hit=6
-> Index Only Scan using ti on t (cost=0.28..44.27 rows=1000 width=48) (actual time=0.107..2.269 rows=1000 loops=1)
Heap Fetches: 0
Buffers: shared hit=6
Planning:
Buffers: shared hit=4
Planning Time: 0.319 ms
Execution Time: 3.432 ms
(11 rows)
postgres=# select c::text, count(c) from t group by c::text;
c | count
----------------+-------
1 year 1 mon | 500
1 year 30 days | 500
(2 rows)
> tuple, while another TID refers to a '24h' heap tuple. amcheck complains.
Index-only scans can return the '1d' bits where the actual tuple had the '24h'
bits.
Have a build without the patch. I can't reproduce amcheck complaints in release mode
where all these statements succeed.
create table t (c interval);
insert into t select x from generate_series(1,500), (values ('1 year 1 month'::interval), ('1 year 30 days')) t(x);
select distinct c from t;
create index ti on t (c);
insert into t select x from generate_series(1,500), (values ('1 year 1 month'::interval), ('1 year 30 days')) t(x);
select distinct c from t;
create index ti on t (c);
select bt_index_check('ti'::regclass);
On following query, the query seems to return the right result sets,
index-only scan doesn't seem to be mislead by the misuse of btequalimage
postgres=# vacuum (INDEX_CLEANUP on) t;
VACUUMpostgres=# explain (analyze, buffers) select c::text, count(c) from t group by c::text;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=49.27..49.29 rows=1 width=40) (actual time=3.329..3.333 rows=2 loops=1)
Group Key: (c)::text
Batches: 1 Memory Usage: 24kB
Buffers: shared hit=6
-> Index Only Scan using ti on t (cost=0.28..44.27 rows=1000 width=48) (actual time=0.107..2.269 rows=1000 loops=1)
Heap Fetches: 0
Buffers: shared hit=6
Planning:
Buffers: shared hit=4
Planning Time: 0.319 ms
Execution Time: 3.432 ms
(11 rows)
postgres=# select c::text, count(c) from t group by c::text;
c | count
----------------+-------
1 year 1 mon | 500
1 year 30 days | 500
(2 rows)
> * Generic "equalimage" support function.
> *
> * B-Tree operator classes whose equality function could safely be replaced by
> * datum_image_eq() in all cases can use this as their "equalimage" support
> * function.
> *
> * B-Tree operator classes whose equality function could safely be replaced by
> * datum_image_eq() in all cases can use this as their "equalimage" support
> * function.
It seems to me that as long as a data type has a deterministic sort but not necessarily be equalimage,
it should be able to support deduplication. e.g for interval type, we add a byte wise tie breaker
after '24h' and '1day' are compared equally. In the btree, '24h' and '1day' are still adjacent,
'1day' is always sorted before '24h' in a btree page, can we do dedup for each value
without problem?
The assertion will still be triggered as it's testing btequalimage, but I'll defer it for now.
Wanted to know if the above idea sounds sane first...
Donghang Lin
(ServiceNow)
On Thu, Oct 12, 2023 at 4:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Oct 12, 2023 at 4:10 PM Noah Misch <noah@leadboat.com> wrote:
> > exactly one case like that post-fix (interval_ops is at least the only
> > affected core code opfamily), so why not point that out directly with
> > a HINT? A HINT could go a long way towards putting the problem in
> > context, without really adding a special case, and without any real
> > question of users being misled.
>
> Works for me. Added.
Looks good. Thanks!
--
Peter Geoghegan
On Mon, Oct 16, 2023 at 11:21:20PM -0700, Donghang Lin wrote: > > I've also caught btree posting lists where one TID refers to a '1d' heap > > tuple, while another TID refers to a '24h' heap tuple. amcheck complains. > Index-only scans can return the '1d' bits where the actual tuple had the > '24h' > bits. > > Have a build without the patch. I can't reproduce amcheck complaints in > release mode > where all these statements succeed. The queries I shared don't create the problematic structure, just an assertion failure. > > * Generic "equalimage" support function. > > * > > * B-Tree operator classes whose equality function could safely be > replaced by > > * datum_image_eq() in all cases can use this as their "equalimage" support > > * function. > It seems to me that as long as a data type has a deterministic sort but not > necessarily be equalimage, > it should be able to support deduplication. e.g for interval type, we add > a byte wise tie breaker > after '24h' and '1day' are compared equally. In the btree, '24h' and '1day' > are still adjacent, > '1day' is always sorted before '24h' in a btree page, can we do dedup for > each value > without problem? Yes. I'm not aware of correctness obstacles arising if one did that.
On Mon, Oct 16, 2023 at 11:21 PM Donghang Lin <donghanglin@gmail.com> wrote: > It seems to me that as long as a data type has a deterministic sort but not necessarily be equalimage, > it should be able to support deduplication. e.g for interval type, we add a byte wise tie breaker > after '24h' and '1day' are compared equally. In the btree, '24h' and '1day' are still adjacent, > '1day' is always sorted before '24h' in a btree page, can we do dedup for each value > without problem? > The assertion will still be triggered as it's testing btequalimage, but I'll defer it for now. > Wanted to know if the above idea sounds sane first... It's hard to give any one reason why this won't work. I'm sure that with enough effort some scheme like this could work. It's just that there are significant practical problems that at least make it seem unappealing as a project. This has been discussed before: https://www.postgresql.org/message-id/flat/CAH2-WzkZkSC7G%2Bv1WwXGo0emh8E-rByw%3DxSpBUoavk7PTjwF2Q%40mail.gmail.com#4e98cba0d76c8c8c0bf67ae0d4652903 -- Peter Geoghegan