Обсуждение: interval_ops shall stop using btequalimage (deduplication)

Поиск
Список
Период
Сортировка

interval_ops shall stop using btequalimage (deduplication)

От
Noah Misch
Дата:
The btequalimage() header comment says:

 * Generic "equalimage" support function.
 *
 * B-Tree operator classes whose equality function could safely be replaced by
 * datum_image_eq() in all cases can use this as their "equalimage" support
 * function.

interval_ops, however, recognizes equal-but-distinguishable values:

  create temp table t (c interval);
  insert into t values ('1d'::interval), ('24h');
  table t;
  select distinct c from t;

The CREATE INDEX of the following test:

  begin;
  create table t (c interval);
  insert into t select x from generate_series(1,500), (values ('1 year 1 month'::interval), ('1 year 30 days')) t(x);
  select distinct c from t;
  create index ti on t (c);
  rollback;

Fails with:

  2498151 2023-10-10 05:06:46.177 GMT DEBUG:  building index "ti" on table "t" serially
  2498151 2023-10-10 05:06:46.178 GMT DEBUG:  index "ti" can safely use deduplication
  TRAP: failed Assert("!itup_key->allequalimage || keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright)"), File:
"nbtutils.c",Line: 2443, PID: 2498151 

I've also caught btree posting lists where one TID refers to a '1d' heap
tuple, while another TID refers to a '24h' heap tuple.  amcheck complains.
Index-only scans can return the '1d' bits where the actual tuple had the '24h'
bits.  Are there other consequences to highlight in the release notes?  The
back-branch patch is larger, to fix things without initdb.  Hence, I'm
attaching patches for HEAD and for v16 (trivial to merge back from there).  I
glanced at the other opfamilies permitting deduplication, and they look okay:

[local] test=*# select amproc, amproclefttype = amprocrighttype as l_eq_r, array_agg(array[opfname,
amproclefttype::regtype::text])from pg_amproc join pg_opfamily f on amprocfamily = f.oid where amprocnum = 4 and
opfmethod= 403 group by 1,2; 
─[ RECORD 1
]───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
amproc    │ btequalimage
l_eq_r    │ t
array_agg │
{{bit_ops,bit},{bool_ops,boolean},{bytea_ops,bytea},{char_ops,"\"char\""},{datetime_ops,date},{datetime_ops,"timestamp
withouttime zone"},{datetime_ops,"timestamp with time
zone"},{network_ops,inet},{integer_ops,smallint},{integer_ops,integer},{integer_ops,bigint},{interval_ops,interval},{macaddr_ops,macaddr},{oid_ops,oid},{oidvector_ops,oidvector},{time_ops,"time
withouttime zone"},{timetz_ops,"time with time zone"},{varbit_ops,"bit
varying"},{text_pattern_ops,text},{bpchar_pattern_ops,character},{money_ops,money},{tid_ops,tid},{uuid_ops,uuid},{pg_lsn_ops,pg_lsn},{macaddr8_ops,macaddr8},{enum_ops,anyenum},{xid8_ops,xid8}}
─[ RECORD 2
]───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
amproc    │ btvarstrequalimage
l_eq_r    │ t
array_agg │ {{bpchar_ops,character},{text_ops,text},{text_ops,name}}

Thanks,
nm

Вложения

Re: interval_ops shall stop using btequalimage (deduplication)

От
Peter Geoghegan
Дата:
On Tue, Oct 10, 2023 at 6:33 PM Noah Misch <noah@leadboat.com> wrote:
> interval_ops, however, recognizes equal-but-distinguishable values:

> Fails with:
>
>   2498151 2023-10-10 05:06:46.177 GMT DEBUG:  building index "ti" on table "t" serially
>   2498151 2023-10-10 05:06:46.178 GMT DEBUG:  index "ti" can safely use deduplication
>   TRAP: failed Assert("!itup_key->allequalimage || keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright)"),
File:"nbtutils.c", Line: 2443, PID: 2498151 

Nice catch.

Out of curiosity, how did you figure this out? Did it just occur to
you that interval_ops had a behavior that made deduplication unsafe?
Or did the problem come to your attention after running amcheck on a
customer database? Or was it something else?

> I've also caught btree posting lists where one TID refers to a '1d' heap
> tuple, while another TID refers to a '24h' heap tuple.  amcheck complains.
> Index-only scans can return the '1d' bits where the actual tuple had the '24h'
> bits.  Are there other consequences to highlight in the release notes?

Nothing else comes to mind right now. I should think about posting
list splits some more tomorrow, though -- those are tricky.

> The back-branch patch is larger, to fix things without initdb.  Hence, I'm
> attaching patches for HEAD and for v16 (trivial to merge back from there).

> I glanced at the other opfamilies permitting deduplication, and they look okay:

Due to the way that nbtsplitloc.c deals with duplicates, I'd expect
the same assertion failure with any index where a single leaf page is
filled with opclass-wise duplicates with more than one distinct
representation/output -- the details beyond that shouldn't matter. I
was happy with how easy it was to make this assertion fail (with a
known broken numeric_ops opclass) while testing/developing
deduplication. I'm a little surprised that it took this long to notice
the interval_ops issue.

Do we really need to change the catalog contents when backpatching?

--
Peter Geoghegan



Re: interval_ops shall stop using btequalimage (deduplication)

От
Noah Misch
Дата:
On Tue, Oct 10, 2023 at 08:12:36PM -0700, Peter Geoghegan wrote:
> On Tue, Oct 10, 2023 at 6:33 PM Noah Misch <noah@leadboat.com> wrote:
> > interval_ops, however, recognizes equal-but-distinguishable values:
> 
> > Fails with:
> >
> >   2498151 2023-10-10 05:06:46.177 GMT DEBUG:  building index "ti" on table "t" serially
> >   2498151 2023-10-10 05:06:46.178 GMT DEBUG:  index "ti" can safely use deduplication
> >   TRAP: failed Assert("!itup_key->allequalimage || keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright)"),
File:"nbtutils.c", Line: 2443, PID: 2498151
 
> 
> Nice catch.
> 
> Out of curiosity, how did you figure this out? Did it just occur to
> you that interval_ops had a behavior that made deduplication unsafe?
> Or did the problem come to your attention after running amcheck on a
> customer database? Or was it something else?

My friend got an amcheck "lacks matching index tuple" failure, and they asked
me about it.  I ran into the assertion failure while reproducing things.

> I'm a little surprised that it took this long to notice
> the interval_ops issue.

Agreed.  I don't usually store interval values in tables, and I'm not sure
I've ever indexed one.  Who knows.

> Do we really need to change the catalog contents when backpatching?

Not really.  I think we usually do.  On the other hand, unlike some past
cases, there's no functional need for the catalog changes.  The catalog
changes just get a bit of efficiency.  No strong preference here.



Re: interval_ops shall stop using btequalimage (deduplication)

От
Peter Geoghegan
Дата:
On Tue, Oct 10, 2023 at 8:29 PM Noah Misch <noah@leadboat.com> wrote:
> My friend got an amcheck "lacks matching index tuple" failure, and they asked
> me about it.  I ran into the assertion failure while reproducing things.

Reminds me of the time that amcheck found a bug in the default btree
opclass for PostGIS's geography type.

The opclass wasn't really intended to be used for indexing. The issue
with the opclass (which violated transitive consistency) would
probably never have been detected were it not for the tendency of
PostGIS users to accidentally create useless B-Tree indexes on
geography columns. Users sometimes omitted "using gist", without
necessarily noticing that the index was practically useless.

> > Do we really need to change the catalog contents when backpatching?
>
> Not really.  I think we usually do.  On the other hand, unlike some past
> cases, there's no functional need for the catalog changes.  The catalog
> changes just get a bit of efficiency.  No strong preference here.

I'll defer to you on this question, then.

I don't see any reason to delay committing your fix. The issue that
you've highlighted is exactly the kind of issue that I anticipated
might happen at some point. This seems straightforward.

--
Peter Geoghegan



Re: interval_ops shall stop using btequalimage (deduplication)

От
Peter Geoghegan
Дата:
On Tue, Oct 10, 2023 at 8:51 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I don't see any reason to delay committing your fix. The issue that
> you've highlighted is exactly the kind of issue that I anticipated
> might happen at some point. This seems straightforward.

BTW, we don't need to recommend the heapallindexed option in the
release notes. Calling bt_check_index() will reliably indicate that
corruption is present when called against existing interval_ops
indexes once your fix is in. Simply having an index whose metapage's
allequalimage field is spuriously set to true will be recognized as
corruption right away. Obviously, this will be no less true with an
existing interval_ops index that happens to be completely empty.

--
Peter Geoghegan



Re: interval_ops shall stop using btequalimage (deduplication)

От
Noah Misch
Дата:
On Tue, Oct 10, 2023 at 09:35:45PM -0700, Peter Geoghegan wrote:
> On Tue, Oct 10, 2023 at 8:51 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I don't see any reason to delay committing your fix. The issue that
> > you've highlighted is exactly the kind of issue that I anticipated
> > might happen at some point. This seems straightforward.
> 
> BTW, we don't need to recommend the heapallindexed option in the
> release notes. Calling bt_check_index() will reliably indicate that
> corruption is present when called against existing interval_ops
> indexes once your fix is in. Simply having an index whose metapage's
> allequalimage field is spuriously set to true will be recognized as
> corruption right away. Obviously, this will be no less true with an
> existing interval_ops index that happens to be completely empty.

Interesting.  So, >99% of interval-type indexes, even ones WITH
(deduplicate_items=off), will get amcheck failures.  The <1% of exceptions
might include indexes having allequalimage=off due to an additional column,
e.g. a two-column (interval, numeric) index.  If interval indexes are common
enough and "pg_amcheck --heapallindexed" failures from $SUBJECT are relatively
rare, that could argue for giving amcheck a special case.  Specifically,
downgrade its "metapage incorrectly indicates that deduplication is safe" from
ERROR to WARNING for interval_ops only.  Without that special case (i.e. with
the v1 patch), the release notes should probably resemble, "After updating,
run REINDEX on all indexes having an interval-type column."  There's little
point in recommending pg_amcheck if >99% will fail.  I'm inclined to bet that
interval-type indexes are rare, so I lean against adding the amcheck special
case.  It's not a strong preference.  Other opinions?

If users want to reduce their exposure now, they could do "ALTER INDEX ... SET
(deduplicate_items = off)" and then REINDEX any indexes already failing
"pg_amcheck --heapallindexed".  allequalimage will remain wrong but have no
ill effects beyond making amcheck fail.  Another REINDEX after the update
would let amcheck pass.



Re: interval_ops shall stop using btequalimage (deduplication)

От
Peter Geoghegan
Дата:
On Wed, Oct 11, 2023 at 11:38 AM Noah Misch <noah@leadboat.com> wrote:
> Interesting.  So, >99% of interval-type indexes, even ones WITH
> (deduplicate_items=off), will get amcheck failures.  The <1% of exceptions
> might include indexes having allequalimage=off due to an additional column,
> e.g. a two-column (interval, numeric) index.  If interval indexes are common
> enough and "pg_amcheck --heapallindexed" failures from $SUBJECT are relatively
> rare, that could argue for giving amcheck a special case.  Specifically,
> downgrade its "metapage incorrectly indicates that deduplication is safe" from
> ERROR to WARNING for interval_ops only.

I am not aware of any user actually running "deduplicate_items = off"
in production, for any index. It was added purely as a defensive thing
-- not because I anticipated any real need to disable deduplication.
Deduplication was optimized for being enabled by default.

Anything is possible, of course, but it's hard to give too much weight
to cases where two very unlikely things happen to intersect. (Plus
"deduplicate_items = off" doesn't really do that much; more on that
below.)

> Without that special case (i.e. with
> the v1 patch), the release notes should probably resemble, "After updating,
> run REINDEX on all indexes having an interval-type column."

+1

> There's little
> point in recommending pg_amcheck if >99% will fail.  I'm inclined to bet that
> interval-type indexes are rare, so I lean against adding the amcheck special
> case.  It's not a strong preference.  Other opinions?

Well, there will only be one known reason why anybody will ever see
this test fail (barring a near-miraculous coincidence where "generic
corruption" somehow gets passed our most basic sanity tests, only to
fail this metapage field check a bit later on). Even if you
pessimistically assume that similar problems remain undiscovered in
some other opfamily, this particular check isn't going to be the check
that detects the other problems -- you really would need
heapallindexed verification for that.

In short, this metapage check is only effective retrospectively, once
we recognize and fix problems in an opclass. Clearly there will be
exactly one case like that post-fix (interval_ops is at least the only
affected core code opfamily), so why not point that out directly with
a HINT? A HINT could go a long way towards putting the problem in
context, without really adding a special case, and without any real
question of users being misled.

> If users want to reduce their exposure now, they could do "ALTER INDEX ... SET
> (deduplicate_items = off)" and then REINDEX any indexes already failing
> "pg_amcheck --heapallindexed".  allequalimage will remain wrong but have no
> ill effects beyond making amcheck fail.  Another REINDEX after the update
> would let amcheck pass.

Even when "deduplicate_items = off", that just means that the nbtree
code won't apply further deduplication passes going forward (until
such time as deduplication is reenabled). It doesn't really mean that
this problem can't exist. OTOH it's easy to detect affected indexes
using SQL. So this is one case where telling users to REINDEX really
does seem like the best thing (as opposed to something we say because
we're too lazy to come up with nuanced, practical guidance).

--
Peter Geoghegan



Re: interval_ops shall stop using btequalimage (deduplication)

От
Noah Misch
Дата:
On Wed, Oct 11, 2023 at 01:00:44PM -0700, Peter Geoghegan wrote:
> On Wed, Oct 11, 2023 at 11:38 AM Noah Misch <noah@leadboat.com> wrote:
> > Interesting.  So, >99% of interval-type indexes, even ones WITH
> > (deduplicate_items=off), will get amcheck failures.  The <1% of exceptions
> > might include indexes having allequalimage=off due to an additional column,
> > e.g. a two-column (interval, numeric) index.  If interval indexes are common
> > enough and "pg_amcheck --heapallindexed" failures from $SUBJECT are relatively
> > rare, that could argue for giving amcheck a special case.  Specifically,
> > downgrade its "metapage incorrectly indicates that deduplication is safe" from
> > ERROR to WARNING for interval_ops only.
> 
> I am not aware of any user actually running "deduplicate_items = off"
> in production, for any index. It was added purely as a defensive thing
> -- not because I anticipated any real need to disable deduplication.
> Deduplication was optimized for being enabled by default.

Sure.  Low-importance background information: deduplicate_items=off got on my
radar while I was wondering if ALTER INDEX ... SET (deduplicate_items=off)
would clear allequalimage.  If it had, we could have advised people to use
ALTER INDEX, then rebuild only those indexes still failing "pg_amcheck
--heapallindexed".  ALTER INDEX doesn't do that, ruling out that idea.

> > Without that special case (i.e. with
> > the v1 patch), the release notes should probably resemble, "After updating,
> > run REINDEX on all indexes having an interval-type column."
> 
> +1
> 
> > There's little
> > point in recommending pg_amcheck if >99% will fail.  I'm inclined to bet that
> > interval-type indexes are rare, so I lean against adding the amcheck special
> > case.  It's not a strong preference.  Other opinions?

> exactly one case like that post-fix (interval_ops is at least the only
> affected core code opfamily), so why not point that out directly with
> a HINT? A HINT could go a long way towards putting the problem in
> context, without really adding a special case, and without any real
> question of users being misled.

Works for me.  Added.

Вложения

Re: interval_ops shall stop using btequalimage (deduplication)

От
Peter Geoghegan
Дата:
On Thu, Oct 12, 2023 at 4:10 PM Noah Misch <noah@leadboat.com> wrote:
> > exactly one case like that post-fix (interval_ops is at least the only
> > affected core code opfamily), so why not point that out directly with
> > a HINT? A HINT could go a long way towards putting the problem in
> > context, without really adding a special case, and without any real
> > question of users being misled.
>
> Works for me.  Added.

Looks good. Thanks!

--
Peter Geoghegan



Re: interval_ops shall stop using btequalimage (deduplication)

От
Donghang Lin
Дата:
> I've also caught btree posting lists where one TID refers to a '1d' heap
> tuple, while another TID refers to a '24h' heap tuple.  amcheck complains.
Index-only scans can return the '1d' bits where the actual tuple had the '24h'
bits.

Have a build without the patch. I can't reproduce amcheck complaints in release mode
where all these statements succeed. 
  create table t (c interval);
  insert into t select x from generate_series(1,500), (values ('1 year 1 month'::interval), ('1 year 30 days')) t(x);
  select distinct c from t;
  create index ti on t (c);
  select bt_index_check('ti'::regclass);
On following query, the query seems to return the right result sets, 
index-only scan doesn't seem to be mislead by the misuse of btequalimage 

  postgres=# vacuum (INDEX_CLEANUP on) t;
VACUUM
postgres=# explain (analyze, buffers) select c::text, count(c) from t group by c::text;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=49.27..49.29 rows=1 width=40) (actual time=3.329..3.333 rows=2 loops=1)
   Group Key: (c)::text
   Batches: 1  Memory Usage: 24kB
   Buffers: shared hit=6
   ->  Index Only Scan using ti on t  (cost=0.28..44.27 rows=1000 width=48) (actual time=0.107..2.269 rows=1000 loops=1)
         Heap Fetches: 0
         Buffers: shared hit=6
 Planning:
   Buffers: shared hit=4
 Planning Time: 0.319 ms
 Execution Time: 3.432 ms
(11 rows)

postgres=# select c::text, count(c) from t group by c::text;
       c        | count
----------------+-------
 1 year 1 mon   |   500
 1 year 30 days |   500
(2 rows)

>  * Generic "equalimage" support function.
> *
> * B-Tree operator classes whose equality function could safely be replaced by
> * datum_image_eq() in all cases can use this as their "equalimage" support
> * function.
It seems to me that as long as a data type has a deterministic sort but not necessarily be equalimage,
it should be able to support deduplication.  e.g for interval type, we add a byte wise tie breaker 
after '24h' and '1day' are compared equally. In the btree, '24h' and '1day' are still adjacent, 
'1day' is always sorted before '24h' in a btree page, can we do dedup for each value 
without problem? 
The assertion will still be triggered as it's testing btequalimage, but I'll defer it for now. 
Wanted to know if the above idea sounds sane first...


Donghang Lin
(ServiceNow)


On Thu, Oct 12, 2023 at 4:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Oct 12, 2023 at 4:10 PM Noah Misch <noah@leadboat.com> wrote:
> > exactly one case like that post-fix (interval_ops is at least the only
> > affected core code opfamily), so why not point that out directly with
> > a HINT? A HINT could go a long way towards putting the problem in
> > context, without really adding a special case, and without any real
> > question of users being misled.
>
> Works for me.  Added.

Looks good. Thanks!

--
Peter Geoghegan


Re: interval_ops shall stop using btequalimage (deduplication)

От
Noah Misch
Дата:
On Mon, Oct 16, 2023 at 11:21:20PM -0700, Donghang Lin wrote:
> > I've also caught btree posting lists where one TID refers to a '1d' heap
> > tuple, while another TID refers to a '24h' heap tuple.  amcheck complains.
> Index-only scans can return the '1d' bits where the actual tuple had the
> '24h'
> bits.
> 
> Have a build without the patch. I can't reproduce amcheck complaints in
> release mode
> where all these statements succeed.

The queries I shared don't create the problematic structure, just an assertion
failure.

> >  * Generic "equalimage" support function.
> > *
> > * B-Tree operator classes whose equality function could safely be
> replaced by
> > * datum_image_eq() in all cases can use this as their "equalimage" support
> > * function.
> It seems to me that as long as a data type has a deterministic sort but not
> necessarily be equalimage,
> it should be able to support deduplication.  e.g for interval type, we add
> a byte wise tie breaker
> after '24h' and '1day' are compared equally. In the btree, '24h' and '1day'
> are still adjacent,
> '1day' is always sorted before '24h' in a btree page, can we do dedup for
> each value
> without problem?

Yes.  I'm not aware of correctness obstacles arising if one did that.



Re: interval_ops shall stop using btequalimage (deduplication)

От
Peter Geoghegan
Дата:
On Mon, Oct 16, 2023 at 11:21 PM Donghang Lin <donghanglin@gmail.com> wrote:
> It seems to me that as long as a data type has a deterministic sort but not necessarily be equalimage,
> it should be able to support deduplication.  e.g for interval type, we add a byte wise tie breaker
> after '24h' and '1day' are compared equally. In the btree, '24h' and '1day' are still adjacent,
> '1day' is always sorted before '24h' in a btree page, can we do dedup for each value
> without problem?
> The assertion will still be triggered as it's testing btequalimage, but I'll defer it for now.
> Wanted to know if the above idea sounds sane first...

 It's hard to give any one reason why this won't work. I'm sure that
with enough effort some scheme like this could work. It's just that
there are significant practical problems that at least make it seem
unappealing as a project. This has been discussed before:


https://www.postgresql.org/message-id/flat/CAH2-WzkZkSC7G%2Bv1WwXGo0emh8E-rByw%3DxSpBUoavk7PTjwF2Q%40mail.gmail.com#4e98cba0d76c8c8c0bf67ae0d4652903

--
Peter Geoghegan