Обсуждение: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term
BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term
От
PG Bug reporting form
Дата:
The following bug has been logged on the website: Bug reference: 16865 Logged by: Dimitri Nüscheler Email address: dimitri.nuescheler@gmail.com PostgreSQL version: 13.1 Operating system: Debian Description: Hello I'm writing a small search engine application that also supports negation (exclude results that contain terms starting with string). After upgrading from PostgreSQL 12.5 to 13.1 the negation within a tsquery when matched to a tsvector using the @@ operator no longer works properly, so I started bisecting the commit history and tracked it down to this commit (see the query and expected and actual result further below): > commit 2f2007fbb255be178aca586780967f43885203a7 (HEAD, refs/bisect/bad) > Author: Tom Lane <tgl@sss.pgh.pa.us> > Date: Fri Jul 24 15:26:51 2020 -0400 > > Fix assorted bugs by changing TS_execute's callback API to ternary logic. > ... I'm still working on creating a reproducible test-case without having to share company data. I'm also trying to understand the code as a fun exercise. I can at least share some of the queries and result data: DDL: CREATE TABLE IF NOT EXISTS sherlock_catalog ( uri varchar, description varchar NOT NULL, metadata varchar NOT NULL, textsearch tsvector GENERATED ALWAYS AS (to_tsvector('english', sherlock_normalize(uri || ' ' || description || ' ' || metadata))) STORED, last_seen timestamptz, PRIMARY KEY (uri) ); CREATE OR REPLACE FUNCTION sherlock_normalize(str varchar) RETURNS varchar AS $$ BEGIN RETURN lower(regexp_replace(regexp_replace(regexp_replace(str, '[^a-zA-Z0-9]+', ' ', 'g'),'([a-z])([A-Z])','\1 \2','g'),'([A-Z][A-Z0-9])([a-z])','\1 \2','g')); END; $$ LANGUAGE plpgsql IMMUTABLE RETURNS NULL ON NULL INPUT; Query: SELECT * FROM sherlock_catalog WHERE textsearch @@ '''full'':* & ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE '%Discontinued%' LIMIT 2; Result: 2 rows Expected result: 0 rows, because the textsearch vector contains: 'discontinu':86 Plan: Limit (cost=130.61..12789.24 rows=1 width=136) -> Bitmap Heap Scan on sherlock_catalog (cost=130.61..12789.24 rows=1 width=136) Recheck Cond: (textsearch @@ '''full'':* & ''suppli'':* & !''discontinu'':*'::tsquery) Filter: ((metadata)::text ~~ '%Discontinued%'::text) -> Bitmap Index Scan on sherlock_catalog_textsearch (cost=0.00..130.61 rows=3548 width=0) Index Cond: (textsearch @@ '''full'':* & ''suppli'':* & !''discontinu'':*'::tsquery) The generated plan is structurally the same for PostgreSQL 12.5 respectively versions before that commit, but if I alter the plan using (SET enable_*scan = off) so that the planner comes up with a sequential scan, I will get the expected results. Some other results (count): Count of the same query: user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':* & ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE '%Discontinu%'; count ------- 4962 (1 Zeile) Negation, but not a prefix search: user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':* & ''suppli'':* & !''discontinu'''::tsquery AND metadata LIKE '%Discontinu%'; count ------- 0 (1 Zeile) Without negation: user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':* & ''suppli'':* & ''discontinu'':*'::tsquery AND metadata LIKE '%Discontinu%'; count ------- 13127 (1 Zeile) Without the "discontinu" term at all user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':* & ''suppli'':*'::tsquery AND metadata LIKE '%Discontinu%'; count ------- 13127 (1 Zeile) So it seems like the negated query manages to filter out some data, but not all - as if it failed to recheck and definitely determine the TS_YES respectively TS_NO answer from an in-precise TS_MAYBE answer from an unprecise index-based answer (without position information?)? if I even understand this remotely correctly, I'm new to this. I'll try to find out more and to prepare shareable data that reproduces the problem, but I also wonder if I manage to dive into the code and understand something about it :-) Kind Regards Dimitri Nüscheler
=?UTF-8?Q?Dimitri_N=C3=BCscheler?= <dimitri.nuescheler@gmail.com> writes: > Meanwhile I managed to anonymize the data. I put it in this archive > https://www.violetsky.ch/postgres-issue-anonymized.tar.gz Thanks. I've reproduced the issue and it boils down to doing the wrong thing when there are GIN_MAYBE values in the input data for checkcondition_gin, specifically when the bitmap holding the original GIN index results has become lossy. We correctly get a TS_MAYBE result out of the calculation in TS_execute_recurse, but then TS_execute figures it can throw that detail away and just return TS_YES. I think that this problem existed before 2f2007fbb, but we were masking it by always forcing rechecks. Anyway, this seems to put the final nail in the coffin of the idea that it's sufficient for TS_execute to return a boolean. At least the tsginidx.c callers really need to see the 3-way result. Hence I propose the attached patch. It fixes the given test case, but I wonder if you can try it and see if you see any remaining problems. I wasted quite a bit of time trying to devise a test case that is short enough to be reasonable to include in our regression tests, without success ... you need quite a bit of data to make the GIN bitmap become lossy. So no test case here, but I'm not super comfortable with that. regards, tom lane diff --git a/src/backend/utils/adt/tsginidx.c b/src/backend/utils/adt/tsginidx.c index 6c913baaba..8eed376708 100644 --- a/src/backend/utils/adt/tsginidx.c +++ b/src/backend/utils/adt/tsginidx.c @@ -246,10 +246,22 @@ gin_tsquery_consistent(PG_FUNCTION_ARGS) gcv.map_item_operand = (int *) (extra_data[0]); gcv.need_recheck = recheck; - res = TS_execute(GETQUERY(query), - &gcv, - TS_EXEC_PHRASE_NO_POS, - checkcondition_gin); + switch (TS_execute_ternary(GETQUERY(query), + &gcv, + TS_EXEC_PHRASE_NO_POS, + checkcondition_gin)) + { + case TS_NO: + res = false; + break; + case TS_YES: + res = true; + break; + case TS_MAYBE: + res = true; + *recheck = true; + break; + } } PG_RETURN_BOOL(res); @@ -284,11 +296,12 @@ gin_tsquery_triconsistent(PG_FUNCTION_ARGS) gcv.map_item_operand = (int *) (extra_data[0]); gcv.need_recheck = &recheck; - if (TS_execute(GETQUERY(query), - &gcv, - TS_EXEC_PHRASE_NO_POS, - checkcondition_gin)) - res = recheck ? GIN_MAYBE : GIN_TRUE; + res = TS_execute_ternary(GETQUERY(query), + &gcv, + TS_EXEC_PHRASE_NO_POS, + checkcondition_gin); + if (res == GIN_TRUE && recheck) + res = GIN_MAYBE; } PG_RETURN_GIN_TERNARY_VALUE(res); diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c index 2939fb5c21..9236ebcc8f 100644 --- a/src/backend/utils/adt/tsvector_op.c +++ b/src/backend/utils/adt/tsvector_op.c @@ -1854,6 +1854,18 @@ TS_execute(QueryItem *curitem, void *arg, uint32 flags, return TS_execute_recurse(curitem, arg, flags, chkcond) != TS_NO; } +/* + * Evaluate tsquery boolean expression. + * + * This is the same as TS_execute except that TS_MAYBE is returned as-is. + */ +TSTernaryValue +TS_execute_ternary(QueryItem *curitem, void *arg, uint32 flags, + TSExecuteCallback chkcond) +{ + return TS_execute_recurse(curitem, arg, flags, chkcond); +} + /* * TS_execute recursion for operators above any phrase operator. Here we do * not need to worry about lexeme positions. As soon as we hit an OP_PHRASE diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h index 69a9ba8524..4266560151 100644 --- a/src/include/tsearch/ts_utils.h +++ b/src/include/tsearch/ts_utils.h @@ -199,6 +199,9 @@ typedef TSTernaryValue (*TSExecuteCallback) (void *arg, QueryOperand *val, extern bool TS_execute(QueryItem *curitem, void *arg, uint32 flags, TSExecuteCallback chkcond); +extern TSTernaryValue TS_execute_ternary(QueryItem *curitem, void *arg, + uint32 flags, + TSExecuteCallback chkcond); extern bool tsquery_requires_match(QueryItem *curitem); /*
I wrote: > Anyway, this seems to put the final nail in the coffin of the idea > that it's sufficient for TS_execute to return a boolean. At least > the tsginidx.c callers really need to see the 3-way result. Hence > I propose the attached patch. It fixes the given test case, but > I wonder if you can try it and see if you see any remaining problems. After further reflection, it seems like now that we've done that, we should go all the way and replace the out-of-band recheck result from checkcondition_gin with a MAYBE result, as attached. AFAICT the previous patch didn't leave any user-visible bug, but there is an inefficiency in using the out-of-band signaling. Consider a query like (a & b:B) | c and suppose that some index entry has "b" and "c" but not "a". With the code as it stands, when checkcondition_gin checks for a match to "b:B" it will find one, and then because of the weight restriction it will return TS_MAYBE, and also set need_recheck. Then TS_execute_recurse will combine the TS_MAYBE with the TS_NO for "a" to produce TS_NO, and finally combine that with TS_YES for "c" to get TS_YES. This is a correct result: the row unquestionably matches the query. But since we set the need_recheck flag, we'll force a heap visit and recheck anyway. That's bogus, and we don't need to accept it anymore now that the data return path is fully ternary-clean. Returning TS_MAYBE is sufficient to do all the right things, so I propose the attached v2. regards, tom lane diff --git a/src/backend/utils/adt/tsginidx.c b/src/backend/utils/adt/tsginidx.c index 6c913baaba..78de531eb8 100644 --- a/src/backend/utils/adt/tsginidx.c +++ b/src/backend/utils/adt/tsginidx.c @@ -175,7 +175,6 @@ typedef struct QueryItem *first_item; GinTernaryValue *check; int *map_item_operand; - bool *need_recheck; } GinChkVal; /* @@ -184,27 +183,24 @@ typedef struct static TSTernaryValue checkcondition_gin(void *checkval, QueryOperand *val, ExecPhraseData *data) { + GinTernaryValue result; GinChkVal *gcv = (GinChkVal *) checkval; int j; - /* - * if any val requiring a weight is used or caller needs position - * information then set recheck flag - */ - if (val->weight != 0 || data != NULL) - *(gcv->need_recheck) = true; - /* convert item's number to corresponding entry's (operand's) number */ j = gcv->map_item_operand[((QueryItem *) val) - gcv->first_item]; + /* determine presence of current entry in indexed value */ + result = gcv->check[j]; + /* - * return presence of current entry in indexed value; but TRUE becomes - * MAYBE in the presence of a query requiring recheck + * if any val requiring a weight is used or caller needs position + * information then we must recheck, so replace TRUE with MAYBE. */ - if (gcv->check[j] == GIN_TRUE) + if (result == GIN_TRUE) { if (val->weight != 0 || data != NULL) - return TS_MAYBE; + result = GIN_MAYBE; } /* @@ -212,7 +208,7 @@ checkcondition_gin(void *checkval, QueryOperand *val, ExecPhraseData *data) * assignments. We could use a switch statement to map the values if that * ever stops being true, but it seems unlikely to happen. */ - return (TSTernaryValue) gcv->check[j]; + return (TSTernaryValue) result; } Datum @@ -244,12 +240,23 @@ gin_tsquery_consistent(PG_FUNCTION_ARGS) "sizes of GinTernaryValue and bool are not equal"); gcv.check = (GinTernaryValue *) check; gcv.map_item_operand = (int *) (extra_data[0]); - gcv.need_recheck = recheck; - res = TS_execute(GETQUERY(query), - &gcv, - TS_EXEC_PHRASE_NO_POS, - checkcondition_gin); + switch (TS_execute_ternary(GETQUERY(query), + &gcv, + TS_EXEC_PHRASE_NO_POS, + checkcondition_gin)) + { + case TS_NO: + res = false; + break; + case TS_YES: + res = true; + break; + case TS_MAYBE: + res = true; + *recheck = true; + break; + } } PG_RETURN_BOOL(res); @@ -266,10 +273,6 @@ gin_tsquery_triconsistent(PG_FUNCTION_ARGS) /* int32 nkeys = PG_GETARG_INT32(3); */ Pointer *extra_data = (Pointer *) PG_GETARG_POINTER(4); GinTernaryValue res = GIN_FALSE; - bool recheck; - - /* Initially assume query doesn't require recheck */ - recheck = false; if (query->size > 0) { @@ -282,13 +285,11 @@ gin_tsquery_triconsistent(PG_FUNCTION_ARGS) gcv.first_item = GETQUERY(query); gcv.check = check; gcv.map_item_operand = (int *) (extra_data[0]); - gcv.need_recheck = &recheck; - if (TS_execute(GETQUERY(query), - &gcv, - TS_EXEC_PHRASE_NO_POS, - checkcondition_gin)) - res = recheck ? GIN_MAYBE : GIN_TRUE; + res = TS_execute_ternary(GETQUERY(query), + &gcv, + TS_EXEC_PHRASE_NO_POS, + checkcondition_gin); } PG_RETURN_GIN_TERNARY_VALUE(res); diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c index 2939fb5c21..9236ebcc8f 100644 --- a/src/backend/utils/adt/tsvector_op.c +++ b/src/backend/utils/adt/tsvector_op.c @@ -1854,6 +1854,18 @@ TS_execute(QueryItem *curitem, void *arg, uint32 flags, return TS_execute_recurse(curitem, arg, flags, chkcond) != TS_NO; } +/* + * Evaluate tsquery boolean expression. + * + * This is the same as TS_execute except that TS_MAYBE is returned as-is. + */ +TSTernaryValue +TS_execute_ternary(QueryItem *curitem, void *arg, uint32 flags, + TSExecuteCallback chkcond) +{ + return TS_execute_recurse(curitem, arg, flags, chkcond); +} + /* * TS_execute recursion for operators above any phrase operator. Here we do * not need to worry about lexeme positions. As soon as we hit an OP_PHRASE diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h index 69a9ba8524..4266560151 100644 --- a/src/include/tsearch/ts_utils.h +++ b/src/include/tsearch/ts_utils.h @@ -199,6 +199,9 @@ typedef TSTernaryValue (*TSExecuteCallback) (void *arg, QueryOperand *val, extern bool TS_execute(QueryItem *curitem, void *arg, uint32 flags, TSExecuteCallback chkcond); +extern TSTernaryValue TS_execute_ternary(QueryItem *curitem, void *arg, + uint32 flags, + TSExecuteCallback chkcond); extern bool tsquery_requires_match(QueryItem *curitem); /*
Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term
От
Dimitri Nüscheler
Дата:
Hi Tom
Wow, this is going fast. I applied the second patch and it fixes my original test case on my private data.
Thank you also for the explanations. I'm happy it's possible to get insight into PostgreSQL interna so quickly :-)
Let me know if there's more I can do (testing particular cases or patch variants etc).
Kind regards
Dimitri
Am Di., 16. Feb. 2021 um 01:27 Uhr schrieb Tom Lane <tgl@sss.pgh.pa.us>:
I wrote:
> Anyway, this seems to put the final nail in the coffin of the idea
> that it's sufficient for TS_execute to return a boolean. At least
> the tsginidx.c callers really need to see the 3-way result. Hence
> I propose the attached patch. It fixes the given test case, but
> I wonder if you can try it and see if you see any remaining problems.
After further reflection, it seems like now that we've done that,
we should go all the way and replace the out-of-band recheck result
from checkcondition_gin with a MAYBE result, as attached.
AFAICT the previous patch didn't leave any user-visible bug,
but there is an inefficiency in using the out-of-band signaling.
Consider a query like
(a & b:B) | c
and suppose that some index entry has "b" and "c" but not "a".
With the code as it stands, when checkcondition_gin checks for a
match to "b:B" it will find one, and then because of the weight
restriction it will return TS_MAYBE, and also set need_recheck.
Then TS_execute_recurse will combine the TS_MAYBE with the TS_NO
for "a" to produce TS_NO, and finally combine that with TS_YES
for "c" to get TS_YES. This is a correct result: the row
unquestionably matches the query. But since we set the
need_recheck flag, we'll force a heap visit and recheck anyway.
That's bogus, and we don't need to accept it anymore now that
the data return path is fully ternary-clean. Returning TS_MAYBE
is sufficient to do all the right things, so I propose the
attached v2.
regards, tom lane
=?UTF-8?Q?Dimitri_N=C3=BCscheler?= <dimitri.nuescheler@gmail.com> writes: > Wow, this is going fast. I applied the second patch and it fixes my > original test case on my private data. Pushed, thanks for testing! regards, tom lane