Обсуждение: BUG #17855: Uninitialised memory used when the name type value processed in binary mode of Memoize

Поиск
Список
Период
Сортировка

BUG #17855: Uninitialised memory used when the name type value processed in binary mode of Memoize

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      17855
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 15.2
Operating system:   Ubuntu 22.04
Description:

When the following query executed under valgrind:
SET enable_seqscan TO off;

CREATE TABLE t (n name);
INSERT INTO t VALUES('name'), ('name');
CREATE INDEX t_idx ON t(n);
ANALYZE t;

SELECT COUNT(*) FROM t t1 INNER JOIN t t2 ON t1.n >= t2.n;

An incorrect memory access detected:
==00:00:00:06.975 2318810== Use of uninitialised value of size 8
==00:00:00:06.975 2318810==    at 0x4348B9: memoize_insert_hash_internal
(simplehash.h:633)
==00:00:00:06.975 2318810==    by 0x43515F: memoize_insert
(simplehash.h:763)
==00:00:00:06.975 2318810==    by 0x43515F: cache_lookup
(nodeMemoize.c:519)
==00:00:00:06.975 2318810==    by 0x435832: ExecMemoize
(nodeMemoize.c:705)
==00:00:00:06.975 2318810==    by 0x40EA14: ExecProcNodeFirst
(execProcnode.c:464)
==00:00:00:06.975 2318810==    by 0x43F32B: ExecProcNode (executor.h:259)
==00:00:00:06.975 2318810==    by 0x43F32B: ExecNestLoop
(nodeNestloop.c:160)
==00:00:00:06.975 2318810==    by 0x40EA14: ExecProcNodeFirst
(execProcnode.c:464)
==00:00:00:06.975 2318810==    by 0x41A9BF: ExecProcNode (executor.h:259)
==00:00:00:06.975 2318810==    by 0x41A9BF: fetch_input_tuple
(nodeAgg.c:563)
==00:00:00:06.975 2318810==    by 0x41E2D4: agg_retrieve_direct
(nodeAgg.c:2346)
==00:00:00:06.975 2318810==    by 0x41E5AA: ExecAgg (nodeAgg.c:2161)
==00:00:00:06.975 2318810==    by 0x40EA14: ExecProcNodeFirst
(execProcnode.c:464)
==00:00:00:06.975 2318810==    by 0x407008: ExecProcNode (executor.h:259)
==00:00:00:06.975 2318810==    by 0x407008: ExecutePlan (execMain.c:1636)
==00:00:00:06.975 2318810==    by 0x4071E8: standard_ExecutorRun
(execMain.c:363)
==00:00:00:06.975 2318810==  Uninitialised value was created by a heap
allocation
==00:00:00:06.975 2318810==    at 0x74394C: palloc (mcxt.c:1093)
==00:00:00:06.975 2318810==    by 0x262F07: btrescan (nbtree.c:427)
==00:00:00:06.975 2318810==    by 0x2511F6: index_rescan (indexam.c:314)
==00:00:00:06.975 2318810==    by 0x42F4E7: IndexOnlyNext
(nodeIndexonlyscan.c:111)
==00:00:00:06.975 2318810==    by 0x411CCD: ExecScanFetch (execScan.c:133)
==00:00:00:06.975 2318810==    by 0x411D68: ExecScan (execScan.c:182)
==00:00:00:06.975 2318810==    by 0x42F25F: ExecIndexOnlyScan
(nodeIndexonlyscan.c:315)
==00:00:00:06.975 2318810==    by 0x40EA14: ExecProcNodeFirst
(execProcnode.c:464)
==00:00:00:06.975 2318810==    by 0x43EEB8: ExecProcNode (executor.h:259)
==00:00:00:06.975 2318810==    by 0x43EEB8: ExecNestLoop
(nodeNestloop.c:109)
==00:00:00:06.975 2318810==    by 0x40EA14: ExecProcNodeFirst
(execProcnode.c:464)
==00:00:00:06.975 2318810==    by 0x41A9BF: ExecProcNode (executor.h:259)
==00:00:00:06.975 2318810==    by 0x41A9BF: fetch_input_tuple
(nodeAgg.c:563)
==00:00:00:06.975 2318810==    by 0x41E2D4: agg_retrieve_direct
(nodeAgg.c:2346)
==00:00:00:06.975 2318810==

EXPLAIN shows:
 Aggregate  (cost=21.42..21.43 rows=1 width=8)
   ->  Nested Loop  (cost=0.27..21.42 rows=1 width=0)
         ->  Index Only Scan using t_idx on t t1  (cost=0.13..12.16 rows=2
width=64)
         ->  Memoize  (cost=0.14..6.16 rows=1 width=64)
               Cache Key: t1.n
               Cache Mode: binary
               ->  Index Only Scan using t_idx on t t2  (cost=0.13..6.15
rows=1 width=64)
                     Index Cond: (n <= t1.n)

If I understand the issue and nodeMemoize.c correctly, here
prepare_probe_slot() fills pslot->tts_values[] with a result of
ExecEvalExpr(),
which has length of 'name', but later MemoizeHash_hash() gets
attrs[]->attlen
(64) and passes it to datum_image_hash() for calculating a hash over this
number of bytes.

Reproduced starting from e502150f7.


On Tue, 21 Mar 2023 at 10:27, PG Bug reporting form
<noreply@postgresql.org> wrote:
> SET enable_seqscan TO off;
>
> CREATE TABLE t (n name);
> INSERT INTO t VALUES('name'), ('name');
> CREATE INDEX t_idx ON t(n);
> ANALYZE t;
>
> SELECT COUNT(*) FROM t t1 INNER JOIN t t2 ON t1.n >= t2.n;
>
> An incorrect memory access detected:
> ==00:00:00:06.975 2318810== Use of uninitialised value of size 8

This is very strange.  I tracked this down to the fact that the
TupleDesc for the heap and the index don't match.

postgres=# select attrelid::regclass,attname,attlen from pg_attribute
where attnum=1 and attrelid in ('t'::regclass,'t_idx'::regclass);
 attrelid | attname | attlen
----------+---------+--------
 t        | n       |     64
 t_idx    | n       |     -2

This seems to be due to pg_opclass defining that cstring should be the
opckeytype for btree index for the name type:

postgres=# select *,opckeytype::regtype from pg_opclass where
opcintype='name'::regtype and opcdefault and opcmethod=403;
  oid  | opcmethod | opcname  | opcnamespace | opcowner | opcfamily |
opcintype | opcdefault | opckeytype | opckeytype
-------+-----------+----------+--------------+----------+-----------+-----------+------------+------------+------------
 10028 |       403 | name_ops |           11 |       10 |      1994 |
      19 | t          |       2275 | cstring
(1 row)

The problem valgrind finds is due to index_deform_tuple_internal(), if
I put a breakpoint in indextuple.c:536 after the values[attnum] =
fetchatt(thisatt, tp + off); then output the stored Datum in my
debugger, I see:

((char *) values[attnum])[0]
110 'n'
((char *) values[attnum])[1]
97 'a'
((char *) values[attnum])[2]
109 'm'
((char *) values[attnum])[3]
101 'e'
((char *) values[attnum])[4]
0 '\0'
...
((char *) values[attnum])[7]
0 '\0'
((char *) values[attnum])[8]
-51 'Í'

Note the value at element 8 (I'd expect) should be 0.  In
nodeMemoize.c when we call datum_image_hash(), we'll hash the full 64
bytes of the name type and when we access element 8, we read
uninitialised memory.

The reason this only seems to happen when Memoize is in binary mode is
because we use datum_image_hash() which will hash the full 64 bytes of
the name Datum.  When in logical mode and in hash joins, we'll use
hashname(), which calls strlen() on the input name Datum and only
hashes the bytes as if the name were a cstring:

return hash_any((unsigned char *) key, strlen(key));

I really don't know if I made a wrong assumption about the safety of
using datum_image_hash(). It really surprises me that it's unsafe to
not access the full 64 bytes of a name Datum.

David



 On Wed, 22 Mar 2023 at 19:18, David Rowley <dgrowleyml@gmail.com> wrote:
> The reason this only seems to happen when Memoize is in binary mode is
> because we use datum_image_hash() which will hash the full 64 bytes of
> the name Datum.  When in logical mode and in hash joins, we'll use
> hashname(), which calls strlen() on the input name Datum and only
> hashes the bytes as if the name were a string:

A relevant comment is in StoreIndexTuple():

/*
* Note: we must use the tupdesc supplied by the AM in index_deform_tuple,
* not the slot's tupdesc, in case the latter has different datatypes
* (this happens for btree name_ops in particular).  They'd better have
* the same number of columns though, as well as being datatype-compatible
* which is something we can't so easily check.
*/

I'm just not really certain if we can say name is
"datatype-compatible" with cstring or not.  It seems that namehash,
namecmp, nameout etc are all coded so that they can accept cstrings as
inputs. It's just not going to be safe for anything that wants to
access all of the NAMEDATALEN bytes.

Another bug I've found relating to this is:

CREATE TABLE t (n name);
INSERT INTO t VALUES('name'), ('name');
CREATE INDEX t_idx ON t(n);
ANALYZE t;

set enable_seqscan=0;
set enable_indexscan=0;

postgres=# select record_image_eq(row('name'::name), row(t.n)) from t
order by n;
 record_image_eq
-----------------
 f <- incorrect result
 f <- incorrect result
(2 rows)

set enable_indexscan=1;
set enable_indexonlyscan=0;

postgres=# select record_image_eq(row('name'::name), row(t.n)) from t
order by n;
 record_image_eq
-----------------
 t <- correct result
 t <- correct result
(2 rows)

It just seems to me that this whole reading one type from an index and
assuming it's some other different time is going to be a source of
bugs, even if we were to fix the Memoize and record_image_eq() bugs.

I tried and failed to find problems with the usage of datum_image_eq()
used in ri_KeysEqual(). This is because to get an UPDATE to perform an
Index Only Scan would require that the index had the ctid column. It
does not seem impossible that we might want to special case that,
since the index does contain the heap ctid, after all.  If we did
that, we'd subtly break foreign key checks when performing an
UPDATE/DELETE on the referenced table when the foreign key column is
of type "name".

Right now, I don't have any bright ideas on how we might fix this.
The only thing I can think of right now is to adjust StoreIndexTuple()
to check for mismatching data types and if it finds one, convert with
the index type's output function and import with the heap type's input
function.  Sounds like painful overhead not to mention how to manage
the memory allocations so it does not leak any copied Datum values.

David



Re: BUG #17855: Uninitialised memory used when the name type value processed in binary mode of Memoize

От
Peter Geoghegan
Дата:
On Wed, Mar 22, 2023 at 8:11 PM David Rowley <dgrowleyml@gmail.com> wrote:
> A relevant comment is in StoreIndexTuple():
>
> /*
> * Note: we must use the tupdesc supplied by the AM in index_deform_tuple,
> * not the slot's tupdesc, in case the latter has different datatypes
> * (this happens for btree name_ops in particular).  They'd better have
> * the same number of columns though, as well as being datatype-compatible
> * which is something we can't so easily check.
> */
>
> I'm just not really certain if we can say name is
> "datatype-compatible" with cstring or not.  It seems that namehash,
> namecmp, nameout etc are all coded so that they can accept cstrings as
> inputs. It's just not going to be safe for anything that wants to
> access all of the NAMEDATALEN bytes.

I doubt that there is a clear answer to that question.

Have you seen the comments about the cstring/name_ops hack mentioning
a SIGSEGV in btrescan()? Those were added around the time index-only
scans first went in.

--
Peter Geoghegan



On Thu, 23 Mar 2023 at 16:25, Peter Geoghegan <pg@bowt.ie> wrote:
> Have you seen the comments about the cstring/name_ops hack mentioning
> a SIGSEGV in btrescan()? Those were added around the time index-only
> scans first went in.

I'd not seen it. That's a bit disappointing. Is all this just to work
around not having to store the full 64 bytes for a name in indexes?

Seems it there are a few hacks that try to make this work.  I wonder
if we should just invent new hacks in the form of a new version of
datum_image_eq that accepts a pointer to a FormData_pg_attribute
instead of typByVal and typLen then just special case NAMEOID types to
always compare as cstrings. Same for datum_image_hash().

David



Re: BUG #17855: Uninitialised memory used when the name type value processed in binary mode of Memoize

От
Peter Geoghegan
Дата:
On Wed, Mar 22, 2023 at 9:01 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Thu, 23 Mar 2023 at 16:25, Peter Geoghegan <pg@bowt.ie> wrote:
> > Have you seen the comments about the cstring/name_ops hack mentioning
> > a SIGSEGV in btrescan()? Those were added around the time index-only
> > scans first went in.
>
> I'd not seen it. That's a bit disappointing. Is all this just to work
> around not having to store the full 64 bytes for a name in indexes?

Yes, that's the whole point of the hack.

> Seems it there are a few hacks that try to make this work.  I wonder
> if we should just invent new hacks in the form of a new version of
> datum_image_eq that accepts a pointer to a FormData_pg_attribute
> instead of typByVal and typLen then just special case NAMEOID types to
> always compare as cstrings. Same for datum_image_hash().

The only reason why datum_image_eq() has support for cstring is B-Tree
deduplication; cstring support was added to allow nbtdedup.c to deal
with name/cstring indexes sensibly. So the fact that datum_image_eq()
can compare cstrings in the first place is in fact related to the same
general issue with name_ops.

--
Peter Geoghegan



On Thu, 23 Mar 2023 at 17:01, David Rowley <dgrowleyml@gmail.com> wrote:
> Seems it there are a few hacks that try to make this work.  I wonder
> if we should just invent new hacks in the form of a new version of
> datum_image_eq that accepts a pointer to a FormData_pg_attribute
> instead of typByVal and typLen then just special case NAMEOID types to
> always compare as cstrings. Same for datum_image_hash().

Here are patches to implement versions of datum_image_eq() and
datum_image_hash() which accept FormData_pg_attribute pointers instead
of the byval and len parameters.

The 0002 patch just removes the now unused functions.

David

Вложения
On Thu, 23 Mar 2023 at 16:10, David Rowley <dgrowleyml@gmail.com> wrote:
> Right now, I don't have any bright ideas on how we might fix this.
> The only thing I can think of right now is to adjust StoreIndexTuple()
> to check for mismatching data types and if it finds one, convert with
> the index type's output function and import with the heap type's input
> function.  Sounds like painful overhead not to mention how to manage
> the memory allocations so it does not leak any copied Datum values.

I had a bit of an offline chat about this bug to Andres.  We talked
about the method described in the above paragraph as a means to fix
this.

We talked about the possibilities of bugs being a bit more widespread.
For example, anywhere that does datumCopy() (i.e. lots of places) on a
name type could read too many bytes.

I've drafted up a patch which adds some code to nodeIndexonlyscan.c.
During ExecInitIndexOnlyScan() to looks for any name types in the
scan's tupleDesc and if it finds some, it marks the position of each
of these in a new array in IndexOnlyScanState.  When we call
StoreIndexTuple(), if that array has any elements, we spin through it
and convert the cstring names to Names correctly padded out to
64-bytes.

The attached is just a draft so far. It'll need more comments to
document what the code is all about. I don't want to spend too much
time on it if this isn't going to be the final solution.

I'd be happy to hear from anyone who has any thoughts on this as a fix
for the issue.

David

Вложения

Re: BUG #17855: Uninitialised memory used when the name type value processed in binary mode of Memoize

От
Alexander Lakhin
Дата:
Hi David,

25.03.2023 08:32, David Rowley wrote:
> The attached is just a draft so far. It'll need more comments to
> document what the code is all about. I don't want to spend too much
> time on it if this isn't going to be the final solution.

I've stumbled upon this issue one more time. With a query like this:
CREATE TABLE t(id integer, node name);
CREATE INDEX t_id_node_idx ON t(id, node);
INSERT INTO t VALUES (1, 'node1');

EXPLAIN SELECT array_agg(node ORDER BY node) AS node_list FROM t GROUP BY id;
SELECT array_agg(node ORDER BY node) AS node_list FROM t GROUP BY id;

I get (on REL_16_STABLE):
                                       QUERY PLAN
--------------------------------------------------------------------------------------
  GroupAggregate  (cost=0.15..67.65 rows=200 width=36)
    Group Key: id
    ->  Index Only Scan using t_id_node_idx on t (cost=0.15..60.90 rows=850 width=68)

(Note that this time the error is triggered without the Memoize node.)

==00:00:00:04.385 3941088== Uninitialised byte(s) found during client check request
==00:00:00:04.385 3941088==    at 0x1F0CF7: printtup (printtup.c:349)
==00:00:00:04.385 3941088==    by 0x410521: ExecutePlan (execMain.c:1701)
==00:00:00:04.385 3941088==    by 0x4106B3: standard_ExecutorRun (execMain.c:365)
==00:00:00:04.385 3941088==    by 0x41078D: ExecutorRun (execMain.c:309)
==00:00:00:04.385 3941088==    by 0x5FAC9A: PortalRunSelect (pquery.c:924)
==00:00:00:04.385 3941088==    by 0x5FC63B: PortalRun (pquery.c:768)
==00:00:00:04.385 3941088==    by 0x5F85F3: exec_simple_query (postgres.c:1274)
==00:00:00:04.385 3941088==    by 0x5FA593: PostgresMain (postgres.c:4637)
==00:00:00:04.385 3941088==    by 0x5514A2: BackendRun (postmaster.c:4464)
==00:00:00:04.385 3941088==    by 0x554658: BackendStartup (postmaster.c:4192)
==00:00:00:04.385 3941088==    by 0x5547F6: ServerLoop (postmaster.c:1782)
==00:00:00:04.385 3941088==    by 0x555D26: PostmasterMain (postmaster.c:1466)
==00:00:00:04.385 3941088==  Address 0x1121ecdc is 36 bytes inside a block of size 88 client-defined
==00:00:00:04.385 3941088==    at 0x77EDA5: palloc0 (mcxt.c:1282)
==00:00:00:04.385 3941088==    by 0x62B6BC: construct_md_array (arrayfuncs.c:3528)
==00:00:00:04.385 3941088==    by 0x631013: makeMdArrayResult (arrayfuncs.c:5427)
==00:00:00:04.385 3941088==    by 0x6242AF: array_agg_finalfn (array_userfuncs.c:858)
==00:00:00:04.385 3941088==    by 0x4247F5: finalize_aggregate (nodeAgg.c:1120)
==00:00:00:04.385 3941088==    by 0x42620A: finalize_aggregates (nodeAgg.c:1361)
==00:00:00:04.385 3941088==    by 0x426E71: agg_retrieve_direct (nodeAgg.c:2520)
==00:00:00:04.385 3941088==    by 0x4272B9: ExecAgg (nodeAgg.c:2180)
==00:00:00:04.385 3941088==    by 0x417EE1: ExecProcNodeFirst (execProcnode.c:464)
==00:00:00:04.385 3941088==    by 0x4104F0: ExecProcNode (executor.h:273)
==00:00:00:04.385 3941088==    by 0x4104F0: ExecutePlan (execMain.c:1670)
==00:00:00:04.385 3941088==    by 0x4106B3: standard_ExecutorRun (execMain.c:365)
==00:00:00:04.385 3941088==    by 0x41078D: ExecutorRun (execMain.c:309)
==00:00:00:04.385 3941088==

With your last patch applied I see no this valgrind complaint.

Maybe it makes sense to register the proposed patch on the commitfest at
least to keep it in sight?

Best regards,
Alexander Lakhin



On Sat, 9 Sept 2023 at 20:00, Alexander Lakhin <exclusion@gmail.com> wrote:
> I've stumbled upon this issue one more time. With a query like this:
> CREATE TABLE t(id integer, node name);
> CREATE INDEX t_id_node_idx ON t(id, node);
> INSERT INTO t VALUES (1, 'node1');
>
> (Note that this time the error is triggered without the Memoize node.)

Yeah, it's really not a Memoize bug. It's an Index Only Scan bug.
I've added Robert to get his views.

> Maybe it makes sense to register the proposed patch on the commitfest at
> least to keep it in sight?

I've attached another patch which uses another method to fix this, per
an idea from Andres Freund.  I'd class it as a hack, but I don't have
any better ideas aside from the mammoth task of making name variable
length.  Indexes on name typed columns simply don't store all 64 bytes
of the name, so it's not safe to have code that assumes a name Datum
points to 64 bytes. The patch makes it so such a Datum *will* point to
64 bytes.  I've tried to do this as cheaply as possible by saving the
indexes to name columns in a new array in IndexOnlyScanState.  That
should make the overhead very small when indexes don't contain any
name-typed columns.

David

Вложения
Hello David.

24.04.2024 05:25, David Rowley wrote:
> On Sat, 9 Sept 2023 at 20:00, Alexander Lakhin <exclusion@gmail.com> wrote:
>> I've stumbled upon this issue one more time. With a query like this:
>> CREATE TABLE t(id integer, node name);
>> CREATE INDEX t_id_node_idx ON t(id, node);
>> INSERT INTO t VALUES (1, 'node1');
>>
>> (Note that this time the error is triggered without the Memoize node.)
> Yeah, it's really not a Memoize bug. It's an Index Only Scan bug.
> I've added Robert to get his views.
>
>> Maybe it makes sense to register the proposed patch on the commitfest at
>> least to keep it in sight?
> I've attached another patch which uses another method to fix this, per
> an idea from Andres Freund.  I'd class it as a hack, but I don't have
> any better ideas aside from the mammoth task of making name variable
> length.  Indexes on name typed columns simply don't store all 64 bytes
> of the name, so it's not safe to have code that assumes a name Datum
> points to 64 bytes. The patch makes it so such a Datum *will* point to
> 64 bytes.  I've tried to do this as cheaply as possible by saving the
> indexes to name columns in a new array in IndexOnlyScanState.  That
> should make the overhead very small when indexes don't contain any
> name-typed columns.

I've tested this patch with an additional #include in nodeIndexonlyscan.c
(for namestrcpy), it works, but I'm unsure about the check for NAMEOID —
it seems like the fix could break for a type based on name. Say, with:
CREATE DOMAIN named AS name;
CREATE TABLE t(id integer, node named);
CREATE INDEX t_id_node_idx ON t(id, node);
INSERT INTO t VALUES (1, 'node1');

SELECT array_agg(node ORDER BY node) AS node_list FROM t GROUP BY id;

I still get the Valgrind complaint.

Best regards,
Alexander



David Rowley <dgrowleyml@gmail.com> writes:
> I've attached another patch which uses another method to fix this, per
> an idea from Andres Freund.  I'd class it as a hack, but I don't have
> any better ideas aside from the mammoth task of making name variable
> length.  Indexes on name typed columns simply don't store all 64 bytes
> of the name, so it's not safe to have code that assumes a name Datum
> points to 64 bytes. The patch makes it so such a Datum *will* point to
> 64 bytes.  I've tried to do this as cheaply as possible by saving the
> indexes to name columns in a new array in IndexOnlyScanState.  That
> should make the overhead very small when indexes don't contain any
> name-typed columns.

I realize that this is just a draft patch, but the near-total lack
of comments is still disappointing.  Even at the draft stage,
good comments are important to make it reviewable.

I do not like testing "atttypid == NAMEOID" one bit.  As noted,
that will fail on domains over name.  It will also result in
unnecessary work when reading non-btree indexes that contain
name, since I don't think anything else has the same hack
that btree name_ops does (grep for cstring in pg_opclass.dat).
I think the correct thing is to see whether the index opclass
for the column is btree name_ops.  We don't seem to have an
oid_symbol macro for that, but it shouldn't be hard to add,
at least in HEAD.

            regards, tom lane



On Fri, 26 Apr 2024 at 03:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I do not like testing "atttypid == NAMEOID" one bit.  As noted,
> that will fail on domains over name.  It will also result in
> unnecessary work when reading non-btree indexes that contain
> name, since I don't think anything else has the same hack
> that btree name_ops does (grep for cstring in pg_opclass.dat).
> I think the correct thing is to see whether the index opclass
> for the column is btree name_ops.  We don't seem to have an
> oid_symbol macro for that, but it shouldn't be hard to add,
> at least in HEAD.

On looking at the relcache code, I don't see anywhere that we store
the opclass Oid in RelationData. IndexSupportInitialize() only records
the opcfamily and opcintype. Assuming we can't add new fields to
RelationData in the backbranches, is there a reason why we can't check
for rd_opfamily of TEXT_BTREE_FAM_OID and rd_opcintype of NAMEOID?

I've attached an updated patch that does it that way and also did a
round of commenting the code.

David

Вложения
Hello David,

29.04.2024 03:54, David Rowley wrote:
> On Fri, 26 Apr 2024 at 03:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I do not like testing "atttypid == NAMEOID" one bit.  As noted,
>> that will fail on domains over name.  It will also result in
>> unnecessary work when reading non-btree indexes that contain
>> name, since I don't think anything else has the same hack
>> that btree name_ops does (grep for cstring in pg_opclass.dat).
>> I think the correct thing is to see whether the index opclass
>> for the column is btree name_ops.  We don't seem to have an
>> oid_symbol macro for that, but it shouldn't be hard to add,
>> at least in HEAD.
> On looking at the relcache code, I don't see anywhere that we store
> the opclass Oid in RelationData. IndexSupportInitialize() only records
> the opcfamily and opcintype. Assuming we can't add new fields to
> RelationData in the backbranches, is there a reason why we can't check
> for rd_opfamily of TEXT_BTREE_FAM_OID and rd_opcintype of NAMEOID?
>
> I've attached an updated patch that does it that way and also did a
> round of commenting the code.

With the v3 patch applied, `make check` fails for me under Valgrind:
ok 66        + create_index                            33585 ms
not ok 67    + create_index_spgist                     14446 ms
# (test process exited with exit code 2)

==00:00:04:57.033 4152513== Invalid read of size 4
==00:00:04:57.033 4152513==    at 0x456499: ExecInitIndexOnlyScan (nodeIndexonlyscan.c:657)
==00:00:04:57.033 4152513==    by 0x435EE6: ExecInitNode (execProcnode.c:225)
==00:00:04:57.033 4152513==    by 0x459780: ExecInitLimit (nodeLimit.c:477)
...
==00:00:04:57.033 4152513==  Address 0x73263b4 is 612 bytes inside a recently re-allocated block of size 1,024 alloc'd
==00:00:04:57.033 4152513==    at 0x4848899: malloc (vg_replace_malloc.c:381)
==00:00:04:57.033 4152513==    by 0x7A7DFC: AllocSetContextCreateInternal (aset.c:444)
==00:00:04:57.033 4152513==    by 0x775644: RelationInitIndexAccessInfo (relcache.c:1481)
==00:00:04:57.033 4152513==    by 0x775E4B: RelationBuildDesc (relcache.c:1207)
...
==00:00:04:57.033 4152513==
==00:00:04:57.033 4152513== Exit program on first error (--exit-on-first-error=yes)
2024-04-29 12:32:14.965 UTC postmaster[4109162] LOG:  server process (PID 4152513) exited with exit code 1
2024-04-29 12:32:14.965 UTC postmaster[4109162] DETAIL:  Failed process was running: SELECT p, dist FROM 
quad_point_tbl_ord_seq1 ORDER BY p <-> '0,0' LIMIT 10;

The quad_point_tbl_ord_seq1 index is defined as
CREATE INDEX ON quad_point_tbl_ord_seq1 USING spgist(p) INCLUDE(dist);

Best regards,
Alexander



Alexander Lakhin <exclusion@gmail.com> writes:
> 29.04.2024 03:54, David Rowley wrote:
>> On looking at the relcache code, I don't see anywhere that we store
>> the opclass Oid in RelationData. IndexSupportInitialize() only records
>> the opcfamily and opcintype. Assuming we can't add new fields to
>> RelationData in the backbranches, is there a reason why we can't check
>> for rd_opfamily of TEXT_BTREE_FAM_OID and rd_opcintype of NAMEOID?

That should be all right, but ...

> With the v3 patch applied, `make check` fails for me under Valgrind:

... yeah, because those arrays are only of length indnkeyatts,
while the loop is trying to iterate over indnatts columns.

One bit of research that needs to be done is whether btree will
truncate an "include"'d column of type name.  I think it will not,
because that behavior is driven by the opclass and there isn't one
for an included column, but it wouldn't hurt to check.  If so,
just restricting these setup loops to consider only indnkeyatts
columns should fix this.

Something that's bothering me more is expressed in this comment
near the head of ExecInitIndexOnlyScan:

     * Build the scan tuple type using the indextlist generated by the
     * planner.  We use this, rather than the index's physical tuple
     * descriptor, because the latter contains storage column types not the
     * types of the original datums.  (It's the AM's responsibility to return
     * suitable data anyway.)

Per this, the system structure ought to be that btree makes this mess
and btree should clean it up.  Now maybe there's an argument that it's
better to do it in nodeIndexonlyscan.c in case any other index AM
wants to make a similar optimization ... but that seems unlikely to me
since "name" isn't really a general-purpose data type, and we don't
use anything but btree for the system catalogs.  Besides, this code
wouldn't work anyway because it's specific to a btree opfamily.
So I think we ought to look to see if there's someplace in nbtree
where this responsibility could be shoved.

On the other hand, looking at that comment again, I wonder if we could
drive this off the physical tuple descriptor containing atttypid
CSTRINGOID.  I don't say that's better than the v3 logic, but it's an
alternative to consider, especially if it'd let us avoid writing this
in a btree-specific way.  Maybe "storage type is cstring and
rd_opcintype is name" would be a sufficient test that would preserve
some independence from btree?

            regards, tom lane



On Tue, 30 Apr 2024 at 02:28, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Alexander Lakhin <exclusion@gmail.com> writes:
> > With the v3 patch applied, `make check` fails for me under Valgrind:
>
> ... yeah, because those arrays are only of length indnkeyatts,
> while the loop is trying to iterate over indnatts columns.

I forgot about INCLUDE.  Thanks.

> One bit of research that needs to be done is whether btree will
> truncate an "include"'d column of type name.  I think it will not,
> because that behavior is driven by the opclass and there isn't one
> for an included column, but it wouldn't hurt to check.  If so,
> just restricting these setup loops to consider only indnkeyatts
> columns should fix this.

I did that research in the form of setting a breakpoint in
index_deform_tuple() and verified the tup->t_info & 0xFFF grows by
64-bytes for each extra name column I INCLUDE.  Also verified those
extra bytes are all zero'd

> On the other hand, looking at that comment again, I wonder if we could
> drive this off the physical tuple descriptor containing atttypid
> CSTRINGOID.  I don't say that's better than the v3 logic, but it's an
> alternative to consider, especially if it'd let us avoid writing this
> in a btree-specific way.  Maybe "storage type is cstring and
> rd_opcintype is name" would be a sufficient test that would preserve
> some independence from btree?

I think this is a good idea.  It seems reasonable that some other AM
might want to do the same thing. Also, it allows me to choose a better
name for these new IndexOnlyScanState fields. ioss_NameCStringAttNums
and ioss_NameCStringCount seems more on point. However, I couldn't
really decide which way around to have Name and CString.

I added a regression test to index_including.sql to give this some coverage.

David

Вложения
David Rowley <dgrowleyml@gmail.com> writes:
> I think this is a good idea.  It seems reasonable that some other AM
> might want to do the same thing. Also, it allows me to choose a better
> name for these new IndexOnlyScanState fields. ioss_NameCStringAttNums
> and ioss_NameCStringCount seems more on point. However, I couldn't
> really decide which way around to have Name and CString.

This version LGTM.  The field names are fine, or at least I don't see
a reason to prefer the other way.  One trivial nit: should we make
the array be AttrNumber* instead of int*?  The space saving would be
negligible (it being very unlikely that there's ever more than one
entry), but AttrNumber is a more specific description of what it is.

            regards, tom lane



Hello David,

30.04.2024 03:48, David Rowley wrote:
> On Tue, 30 Apr 2024 at 02:28, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> One bit of research that needs to be done is whether btree will
>> truncate an "include"'d column of type name.  I think it will not,
>> because that behavior is driven by the opclass and there isn't one
>> for an included column, but it wouldn't hurt to check.  If so,
>> just restricting these setup loops to consider only indnkeyatts
>> columns should fix this.
> I did that research in the form of setting a breakpoint in
> index_deform_tuple() and verified the tup->t_info & 0xFFF grows by
> 64-bytes for each extra name column I INCLUDE.  Also verified those
> extra bytes are all zero'd

Yes, I saw the same with pageinspect.

I could not find cases (I've tested also ranges and arrays based on the
name type) where the v4 behaves incorrectly, so I think it solves the
problem.

As a side note, perhaps this addition:
+#include "catalog/pg_opfamily_d.h"
is not needed for the current implementation.

Thank you for the fix!

Best regards,
Alexander



On Wed, 1 May 2024 at 02:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This version LGTM.  The field names are fine, or at least I don't see
> a reason to prefer the other way.  One trivial nit: should we make
> the array be AttrNumber* instead of int*?  The space saving would be
> negligible (it being very unlikely that there's ever more than one
> entry), but AttrNumber is a more specific description of what it is.

I've pushed this after adjusting the int array to become an AttrNumber array.

I toyed with the idea of having the loop iterator in
ExecInitIndexOnlyScan() an AttrNumber type, but in the end went with a
cast. I felt this way was more robust against any possible future
widening of the AttrNumber typedef.  I added the cast to prevent any
compilers from warning about truncation.  This can't happen since
indnkeyatts is int16, but I don't know if all compilers would be able
to figure that out, so added a cast.

Thank you to both of you for reviewing this.

David