Обсуждение: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

Поиск
Список
Период
Сортировка

BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      16865
Logged by:          Dimitri Nüscheler
Email address:      dimitri.nuescheler@gmail.com
PostgreSQL version: 13.1
Operating system:   Debian
Description:

Hello

I'm writing a small search engine application that also supports negation
(exclude results that contain terms starting with string).

After upgrading from PostgreSQL 12.5 to 13.1 the negation within a tsquery
when matched to a tsvector using the @@ operator no longer works properly,
so I started bisecting the commit history and tracked it down to this commit
(see the query and expected and actual result further below):

> commit 2f2007fbb255be178aca586780967f43885203a7 (HEAD, refs/bisect/bad)
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Fri Jul 24 15:26:51 2020 -0400
>
>    Fix assorted bugs by changing TS_execute's callback API to ternary
logic.
> ...

I'm still working on creating a reproducible test-case without having to
share company data. I'm also trying to understand the code as a fun
exercise.

I can at least share some of the queries and result data:

DDL:
CREATE TABLE IF NOT EXISTS sherlock_catalog (
uri varchar,
description varchar NOT NULL,
metadata varchar NOT NULL,
textsearch tsvector GENERATED ALWAYS AS (to_tsvector('english',
sherlock_normalize(uri || ' ' || description || ' ' || metadata))) STORED,
last_seen timestamptz,
PRIMARY KEY (uri)
);
CREATE OR REPLACE FUNCTION sherlock_normalize(str varchar) RETURNS varchar
AS $$
BEGIN
    RETURN lower(regexp_replace(regexp_replace(regexp_replace(str,
'[^a-zA-Z0-9]+', ' ', 'g'),'([a-z])([A-Z])','\1
\2','g'),'([A-Z][A-Z0-9])([a-z])','\1 \2','g'));
END;
$$
LANGUAGE plpgsql IMMUTABLE RETURNS NULL ON NULL INPUT;

Query:
SELECT * FROM sherlock_catalog WHERE textsearch @@ '''full'':* &
''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinued%' LIMIT 2;

Result: 2 rows

Expected result: 0 rows, because the textsearch vector contains:
'discontinu':86

Plan:
 Limit  (cost=130.61..12789.24 rows=1 width=136)
   ->  Bitmap Heap Scan on sherlock_catalog  (cost=130.61..12789.24 rows=1
width=136)
         Recheck Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)
         Filter: ((metadata)::text ~~ '%Discontinued%'::text)
         ->  Bitmap Index Scan on sherlock_catalog_textsearch
(cost=0.00..130.61 rows=3548 width=0)
               Index Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)

The generated plan is structurally the same for PostgreSQL 12.5 respectively
versions before that commit, but if I alter the plan using (SET enable_*scan
 = off) so that the planner comes up with a sequential scan, I will get the
expected results.

Some other results (count):

Count of the same query:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
 count 
-------
  4962
(1 Zeile)


Negation, but not a prefix search:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & !''discontinu'''::tsquery AND metadata LIKE
'%Discontinu%';
 count 
-------
     0
(1 Zeile)


Without negation:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & ''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
 count 
-------
 13127
(1 Zeile)

Without the "discontinu" term at all

user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':*'::tsquery AND metadata LIKE '%Discontinu%';
 count 
-------
 13127
(1 Zeile)


So it seems like the negated query manages to filter out some data, but not
all - as if it failed to recheck and definitely determine the TS_YES
respectively TS_NO answer from an in-precise TS_MAYBE answer from an
unprecise index-based answer (without position information?)? if I even
understand this remotely correctly, I'm new to this.
I'll try to find out more and to prepare shareable data that reproduces the
problem, but I also wonder if I manage to dive into the code and understand
something about it :-)

Kind Regards
Dimitri Nüscheler


=?UTF-8?Q?Dimitri_N=C3=BCscheler?= <dimitri.nuescheler@gmail.com> writes:
> Meanwhile I managed to anonymize the data. I put it in this archive
> https://www.violetsky.ch/postgres-issue-anonymized.tar.gz

Thanks.  I've reproduced the issue and it boils down to doing the
wrong thing when there are GIN_MAYBE values in the input data for
checkcondition_gin, specifically when the bitmap holding the original
GIN index results has become lossy.  We correctly get a TS_MAYBE
result out of the calculation in TS_execute_recurse, but then
TS_execute figures it can throw that detail away and just return
TS_YES.  I think that this problem existed before 2f2007fbb, but
we were masking it by always forcing rechecks.

Anyway, this seems to put the final nail in the coffin of the idea
that it's sufficient for TS_execute to return a boolean.  At least
the tsginidx.c callers really need to see the 3-way result.  Hence
I propose the attached patch.  It fixes the given test case, but
I wonder if you can try it and see if you see any remaining problems.

I wasted quite a bit of time trying to devise a test case that is
short enough to be reasonable to include in our regression tests,
without success ... you need quite a bit of data to make the GIN
bitmap become lossy.  So no test case here, but I'm not super
comfortable with that.

            regards, tom lane

diff --git a/src/backend/utils/adt/tsginidx.c b/src/backend/utils/adt/tsginidx.c
index 6c913baaba..8eed376708 100644
--- a/src/backend/utils/adt/tsginidx.c
+++ b/src/backend/utils/adt/tsginidx.c
@@ -246,10 +246,22 @@ gin_tsquery_consistent(PG_FUNCTION_ARGS)
         gcv.map_item_operand = (int *) (extra_data[0]);
         gcv.need_recheck = recheck;

-        res = TS_execute(GETQUERY(query),
-                         &gcv,
-                         TS_EXEC_PHRASE_NO_POS,
-                         checkcondition_gin);
+        switch (TS_execute_ternary(GETQUERY(query),
+                                   &gcv,
+                                   TS_EXEC_PHRASE_NO_POS,
+                                   checkcondition_gin))
+        {
+            case TS_NO:
+                res = false;
+                break;
+            case TS_YES:
+                res = true;
+                break;
+            case TS_MAYBE:
+                res = true;
+                *recheck = true;
+                break;
+        }
     }

     PG_RETURN_BOOL(res);
@@ -284,11 +296,12 @@ gin_tsquery_triconsistent(PG_FUNCTION_ARGS)
         gcv.map_item_operand = (int *) (extra_data[0]);
         gcv.need_recheck = &recheck;

-        if (TS_execute(GETQUERY(query),
-                       &gcv,
-                       TS_EXEC_PHRASE_NO_POS,
-                       checkcondition_gin))
-            res = recheck ? GIN_MAYBE : GIN_TRUE;
+        res = TS_execute_ternary(GETQUERY(query),
+                                 &gcv,
+                                 TS_EXEC_PHRASE_NO_POS,
+                                 checkcondition_gin);
+        if (res == GIN_TRUE && recheck)
+            res = GIN_MAYBE;
     }

     PG_RETURN_GIN_TERNARY_VALUE(res);
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 2939fb5c21..9236ebcc8f 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -1854,6 +1854,18 @@ TS_execute(QueryItem *curitem, void *arg, uint32 flags,
     return TS_execute_recurse(curitem, arg, flags, chkcond) != TS_NO;
 }

+/*
+ * Evaluate tsquery boolean expression.
+ *
+ * This is the same as TS_execute except that TS_MAYBE is returned as-is.
+ */
+TSTernaryValue
+TS_execute_ternary(QueryItem *curitem, void *arg, uint32 flags,
+                   TSExecuteCallback chkcond)
+{
+    return TS_execute_recurse(curitem, arg, flags, chkcond);
+}
+
 /*
  * TS_execute recursion for operators above any phrase operator.  Here we do
  * not need to worry about lexeme positions.  As soon as we hit an OP_PHRASE
diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h
index 69a9ba8524..4266560151 100644
--- a/src/include/tsearch/ts_utils.h
+++ b/src/include/tsearch/ts_utils.h
@@ -199,6 +199,9 @@ typedef TSTernaryValue (*TSExecuteCallback) (void *arg, QueryOperand *val,

 extern bool TS_execute(QueryItem *curitem, void *arg, uint32 flags,
                        TSExecuteCallback chkcond);
+extern TSTernaryValue TS_execute_ternary(QueryItem *curitem, void *arg,
+                                         uint32 flags,
+                                         TSExecuteCallback chkcond);
 extern bool tsquery_requires_match(QueryItem *curitem);

 /*

I wrote:
> Anyway, this seems to put the final nail in the coffin of the idea
> that it's sufficient for TS_execute to return a boolean.  At least
> the tsginidx.c callers really need to see the 3-way result.  Hence
> I propose the attached patch.  It fixes the given test case, but
> I wonder if you can try it and see if you see any remaining problems.

After further reflection, it seems like now that we've done that,
we should go all the way and replace the out-of-band recheck result
from checkcondition_gin with a MAYBE result, as attached.

AFAICT the previous patch didn't leave any user-visible bug,
but there is an inefficiency in using the out-of-band signaling.
Consider a query like

    (a & b:B) | c

and suppose that some index entry has "b" and "c" but not "a".
With the code as it stands, when checkcondition_gin checks for a
match to "b:B" it will find one, and then because of the weight
restriction it will return TS_MAYBE, and also set need_recheck.
Then TS_execute_recurse will combine the TS_MAYBE with the TS_NO
for "a" to produce TS_NO, and finally combine that with TS_YES
for "c" to get TS_YES.  This is a correct result: the row
unquestionably matches the query.  But since we set the
need_recheck flag, we'll force a heap visit and recheck anyway.
That's bogus, and we don't need to accept it anymore now that
the data return path is fully ternary-clean.  Returning TS_MAYBE
is sufficient to do all the right things, so I propose the
attached v2.

            regards, tom lane

diff --git a/src/backend/utils/adt/tsginidx.c b/src/backend/utils/adt/tsginidx.c
index 6c913baaba..78de531eb8 100644
--- a/src/backend/utils/adt/tsginidx.c
+++ b/src/backend/utils/adt/tsginidx.c
@@ -175,7 +175,6 @@ typedef struct
     QueryItem  *first_item;
     GinTernaryValue *check;
     int           *map_item_operand;
-    bool       *need_recheck;
 } GinChkVal;

 /*
@@ -184,27 +183,24 @@ typedef struct
 static TSTernaryValue
 checkcondition_gin(void *checkval, QueryOperand *val, ExecPhraseData *data)
 {
+    GinTernaryValue result;
     GinChkVal  *gcv = (GinChkVal *) checkval;
     int            j;

-    /*
-     * if any val requiring a weight is used or caller needs position
-     * information then set recheck flag
-     */
-    if (val->weight != 0 || data != NULL)
-        *(gcv->need_recheck) = true;
-
     /* convert item's number to corresponding entry's (operand's) number */
     j = gcv->map_item_operand[((QueryItem *) val) - gcv->first_item];

+    /* determine presence of current entry in indexed value */
+    result = gcv->check[j];
+
     /*
-     * return presence of current entry in indexed value; but TRUE becomes
-     * MAYBE in the presence of a query requiring recheck
+     * if any val requiring a weight is used or caller needs position
+     * information then we must recheck, so replace TRUE with MAYBE.
      */
-    if (gcv->check[j] == GIN_TRUE)
+    if (result == GIN_TRUE)
     {
         if (val->weight != 0 || data != NULL)
-            return TS_MAYBE;
+            result = GIN_MAYBE;
     }

     /*
@@ -212,7 +208,7 @@ checkcondition_gin(void *checkval, QueryOperand *val, ExecPhraseData *data)
      * assignments.  We could use a switch statement to map the values if that
      * ever stops being true, but it seems unlikely to happen.
      */
-    return (TSTernaryValue) gcv->check[j];
+    return (TSTernaryValue) result;
 }

 Datum
@@ -244,12 +240,23 @@ gin_tsquery_consistent(PG_FUNCTION_ARGS)
                          "sizes of GinTernaryValue and bool are not equal");
         gcv.check = (GinTernaryValue *) check;
         gcv.map_item_operand = (int *) (extra_data[0]);
-        gcv.need_recheck = recheck;

-        res = TS_execute(GETQUERY(query),
-                         &gcv,
-                         TS_EXEC_PHRASE_NO_POS,
-                         checkcondition_gin);
+        switch (TS_execute_ternary(GETQUERY(query),
+                                   &gcv,
+                                   TS_EXEC_PHRASE_NO_POS,
+                                   checkcondition_gin))
+        {
+            case TS_NO:
+                res = false;
+                break;
+            case TS_YES:
+                res = true;
+                break;
+            case TS_MAYBE:
+                res = true;
+                *recheck = true;
+                break;
+        }
     }

     PG_RETURN_BOOL(res);
@@ -266,10 +273,6 @@ gin_tsquery_triconsistent(PG_FUNCTION_ARGS)
     /* int32    nkeys = PG_GETARG_INT32(3); */
     Pointer    *extra_data = (Pointer *) PG_GETARG_POINTER(4);
     GinTernaryValue res = GIN_FALSE;
-    bool        recheck;
-
-    /* Initially assume query doesn't require recheck */
-    recheck = false;

     if (query->size > 0)
     {
@@ -282,13 +285,11 @@ gin_tsquery_triconsistent(PG_FUNCTION_ARGS)
         gcv.first_item = GETQUERY(query);
         gcv.check = check;
         gcv.map_item_operand = (int *) (extra_data[0]);
-        gcv.need_recheck = &recheck;

-        if (TS_execute(GETQUERY(query),
-                       &gcv,
-                       TS_EXEC_PHRASE_NO_POS,
-                       checkcondition_gin))
-            res = recheck ? GIN_MAYBE : GIN_TRUE;
+        res = TS_execute_ternary(GETQUERY(query),
+                                 &gcv,
+                                 TS_EXEC_PHRASE_NO_POS,
+                                 checkcondition_gin);
     }

     PG_RETURN_GIN_TERNARY_VALUE(res);
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 2939fb5c21..9236ebcc8f 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -1854,6 +1854,18 @@ TS_execute(QueryItem *curitem, void *arg, uint32 flags,
     return TS_execute_recurse(curitem, arg, flags, chkcond) != TS_NO;
 }

+/*
+ * Evaluate tsquery boolean expression.
+ *
+ * This is the same as TS_execute except that TS_MAYBE is returned as-is.
+ */
+TSTernaryValue
+TS_execute_ternary(QueryItem *curitem, void *arg, uint32 flags,
+                   TSExecuteCallback chkcond)
+{
+    return TS_execute_recurse(curitem, arg, flags, chkcond);
+}
+
 /*
  * TS_execute recursion for operators above any phrase operator.  Here we do
  * not need to worry about lexeme positions.  As soon as we hit an OP_PHRASE
diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h
index 69a9ba8524..4266560151 100644
--- a/src/include/tsearch/ts_utils.h
+++ b/src/include/tsearch/ts_utils.h
@@ -199,6 +199,9 @@ typedef TSTernaryValue (*TSExecuteCallback) (void *arg, QueryOperand *val,

 extern bool TS_execute(QueryItem *curitem, void *arg, uint32 flags,
                        TSExecuteCallback chkcond);
+extern TSTernaryValue TS_execute_ternary(QueryItem *curitem, void *arg,
+                                         uint32 flags,
+                                         TSExecuteCallback chkcond);
 extern bool tsquery_requires_match(QueryItem *curitem);

 /*

Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

От
Dimitri Nüscheler
Дата:
Hi Tom

Wow, this is going fast. I applied the second patch and it fixes my original test case on my private data.
Thank you also for the explanations. I'm happy it's possible to get insight into PostgreSQL interna so quickly :-)

Let me know if there's more I can do (testing particular cases or patch variants etc).

Kind regards
Dimitri

Am Di., 16. Feb. 2021 um 01:27 Uhr schrieb Tom Lane <tgl@sss.pgh.pa.us>:
I wrote:
> Anyway, this seems to put the final nail in the coffin of the idea
> that it's sufficient for TS_execute to return a boolean.  At least
> the tsginidx.c callers really need to see the 3-way result.  Hence
> I propose the attached patch.  It fixes the given test case, but
> I wonder if you can try it and see if you see any remaining problems.

After further reflection, it seems like now that we've done that,
we should go all the way and replace the out-of-band recheck result
from checkcondition_gin with a MAYBE result, as attached.

AFAICT the previous patch didn't leave any user-visible bug,
but there is an inefficiency in using the out-of-band signaling.
Consider a query like

        (a & b:B) | c

and suppose that some index entry has "b" and "c" but not "a".
With the code as it stands, when checkcondition_gin checks for a
match to "b:B" it will find one, and then because of the weight
restriction it will return TS_MAYBE, and also set need_recheck.
Then TS_execute_recurse will combine the TS_MAYBE with the TS_NO
for "a" to produce TS_NO, and finally combine that with TS_YES
for "c" to get TS_YES.  This is a correct result: the row
unquestionably matches the query.  But since we set the
need_recheck flag, we'll force a heap visit and recheck anyway.
That's bogus, and we don't need to accept it anymore now that
the data return path is fully ternary-clean.  Returning TS_MAYBE
is sufficient to do all the right things, so I propose the
attached v2.

                        regards, tom lane

=?UTF-8?Q?Dimitri_N=C3=BCscheler?= <dimitri.nuescheler@gmail.com> writes:
> Wow, this is going fast. I applied the second patch and it fixes my
> original test case on my private data.

Pushed, thanks for testing!

            regards, tom lane