Discussion: Text search lexer's handling of hyphens and negatives


Text search lexer's handling of hyphens and negatives

From: raylu
Date:
(I sent a similar message before subscribing to the list but it hasn't
gone through yet, so sorry if you see a duplicate of this...)

We've been happily using pgsql to store user-generated documents for a
while now. We also wanted to be able to search the documents so we
tossed the document contents into a tsvector and did a pretty
straightforward contents @@ phraseto_tsquery('simple', 'the query').

Our users have a lot of things named like ABC-DEF-GHI so that sort of
hyphenated name appears in their documents fairly often.
to_tsvector('simple', 'ABC-DEF-GHI') @@ phraseto_tsquery('simple',
'ABC-DEF-GHI') works without issue.

Sometimes, these hyphenated names have numbers in them like
UVW-789-XYZ. Still no problem with to_tsvector/phraseto_tsquery.

Sometimes, users can only remember the last bit of the name. So they'd
like to find the document with ABC-DEF-GHI in it by searching for
'DEF-GHI'. Since to_tsvector('simple', 'ABC-DEF-GHI') is
'abc-def-ghi':1 'abc':2 'def':3 'ghi':4
we search for to_tsquery('simple', 'def <-> ghi') instead of using
phraseto_tsquery. This works, but you can probably see where this is
going.
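Concretely, the partial-match workaround looks like this (output as described above):

```sql
-- The 'simple' config splits ABC-DEF-GHI into the whole token plus its parts:
SELECT to_tsvector('simple', 'ABC-DEF-GHI');
-- 'abc-def-ghi':1 'abc':2 'def':3 'ghi':4

-- 'def' and 'ghi' sit at adjacent positions, so a phrase query matches:
SELECT to_tsvector('simple', 'ABC-DEF-GHI')
    @@ to_tsquery('simple', 'def <-> ghi');  -- true
```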

to_tsvector('simple', 'UVW-789-XYZ') is
'uvw':1 '-789':2 'xyz':3
because -789 is a negative integer. If we turn the query '789-XYZ'
into the tsquery as before, we get to_tsquery('simple', '789 <-> xyz')
which doesn't match it.
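The failing case can be reproduced the same way:

```sql
SELECT to_tsvector('simple', 'UVW-789-XYZ');
-- 'uvw':1 '-789':2 'xyz':3   (the lexer emits the signed integer '-789')

SELECT to_tsvector('simple', 'UVW-789-XYZ')
    @@ to_tsquery('simple', '789 <-> xyz');  -- false: there is no lexeme '789'
```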

Are we missing something here? Is there either a way to
1. generate tsvectors without this special (negative) integer behavior or
2. generate tsqueries in a more intelligent way?



Re: Text search lexer's handling of hyphens and negatives

From: Alan Hodgson
Date:
On Tue, 2019-10-15 at 14:51 -0700, raylu wrote:
Are we missing something here? Is there either a way to
1. generate tsvectors without this special (negative) integer behavior or
2. generate tsqueries in a more intelligent way?


My company has found the pg_trgm extension to be more useful for partial text searches than the full text functions. I don't know specifically how it might help with your hyphens, but it would be worth testing. The docs actually suggest using them in conjunction in some cases.
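For instance, a trigram index supporting substring search can be sketched like this (the `docs` table and `content` column are illustrative names, not from the thread):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN trigram index over the lowercased body:
CREATE INDEX docs_content_trgm_idx
    ON docs USING gin (lower(content) gin_trgm_ops);

-- This LIKE query can then be served by the index above:
SELECT id FROM docs WHERE lower(content) LIKE '%789-xyz%';
```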

Re: Text search lexer's handling of hyphens and negatives

From: raylu
Date:
On Tue, Oct 15, 2019 at 3:35 PM Alan Hodgson <ahodgson@lists.simkin.ca> wrote:
> My company has found the pg_trgm extension to be more useful for partial text
> searches than the full text functions. I don't know specifically how it might
> help with your hyphens but it would be worth testing. The docs actually
> suggest using them in conjunction in some cases.

We actually do use pg_trgm already for the names/titles of things.
Indexing the content with a trigram index and then doing
LOWER(content) LIKE '%789-xyz%' would certainly work, but
1. we'd have to do a little bit of finagling if we wanted to match on
word boundaries (don't match '6789-xyza' in the above example)
2. trigram indexes are pretty huge for long documents, which is why we
currently only use them for names/titles
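The word-boundary finagling in point 1 could be handled with a regex match instead of LIKE, since PostgreSQL's `\y` escape matches a word boundary and `gin_trgm_ops` indexes also accelerate the regex operators (a sketch, with the same hypothetical table/column names):

```sql
-- '\y' requires a word boundary on each side, so 'UVW-789-XYZ' matches
-- but '6789-xyza' does not (the hyphen is a non-word character, the
-- surrounding digits/letters are word characters).
SELECT id FROM docs WHERE content ~* '\y789-xyz\y';
```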

We may give up and just use pg_trgm for contents if nothing else works
out but it feels like the text search lexer is _so_ close to what we
want.



Re: Text search lexer's handling of hyphens and negatives

From: "Daniel Verite"
Date:
    raylu wrote:

> to_tsvector('simple', 'UVW-789-XYZ') is
> 'uvw':1 '-789':2 'xyz':3
> because -789 is a negative integer. If we turn the query '789-XYZ'
> into the tsquery as before, we get to_tsquery('simple', '789 <-> xyz')
> which doesn't match it.
>
> Are we missing something here? Is there either a way to
> 1. generate tsvectors without this special (negative) integer behavior or

As an ad-hoc solution, you could add a dictionary that turns a negative
integer into its positive counterpart. There's a dictionary in contrib that
can be used as a starting point:
https://www.postgresql.org/docs/current/dict-int.html

It's a matter of ~10 lines of C code to add an "abs" parameter to
that dictionary that would, when true, produce "789" as a lexeme
when fed "-789" as input.
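For illustration, wiring such a patched dictionary into a text search configuration might look like the following. Note that the "abs" parameter is the proposed addition described above, not an option the stock contrib dictionary accepts:

```sql
CREATE EXTENSION dict_int;

-- Hypothetical: "abs" is the ~10-line patch's parameter, not stock dict_int.
ALTER TEXT SEARCH DICTIONARY intdict (ABS = true);

-- Route signed-integer tokens through the dictionary in a copy of 'simple':
CREATE TEXT SEARCH CONFIGURATION simple_abs (COPY = simple);
ALTER TEXT SEARCH CONFIGURATION simple_abs
    ALTER MAPPING FOR int WITH intdict;

-- With the patched dictionary, to_tsvector('simple_abs', 'UVW-789-XYZ')
-- would yield 'uvw':1 '789':2 'xyz':3, and '789 <-> xyz' would match.
```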


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: Text search lexer's handling of hyphens and negatives

From: Alan Hodgson
Date:
On Tue, 2019-10-15 at 20:34 -0700, raylu wrote:
On Tue, Oct 15, 2019 at 3:35 PM Alan Hodgson <ahodgson@lists.simkin.ca> wrote:
My company has found the pg_trgm extension to be more useful for partial text searches than the full text functions. I don't know specifically how it might help with your hyphens, but it would be worth testing. The docs actually suggest using them in conjunction in some cases.

We actually do use pg_trgm already for the names/titles of things.
Indexing the content with a trigram index and then doing
LOWER(content) LIKE '%789-xyz%' would certainly work, but
1. we'd have to do a little bit of finagling if we wanted to match on
word boundaries (don't match '6789-xyza' in the above example)
2. trigram indexes are pretty huge for long documents, which is why we
currently only use them for names/titles

We may give up and just use pg_trgm for contents if nothing else works
out but it feels like the text search lexer is _so_ close to what we
want.


Maybe you could have a trigger pull out those specific hyphenated references into a separate column when the document is added or updated, and store/index those separately?
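A minimal sketch of that trigger idea, again with illustrative names (`docs`, `content`): the extracted references are kept as plain lowercased text in their own column, so a trigram index over just that small column stays cheap even for long documents, and the full-text lexer never touches them.

```sql
ALTER TABLE docs ADD COLUMN hyphen_refs text;

CREATE FUNCTION docs_collect_refs() RETURNS trigger AS $$
BEGIN
  -- Collect every hyphenated token (e.g. UVW-789-XYZ) from the body.
  -- The non-capturing group makes regexp_matches return the whole match.
  SELECT lower(coalesce(string_agg(m[1], ' '), ''))
    INTO NEW.hyphen_refs
    FROM regexp_matches(NEW.content, '\w+(?:-\w+)+', 'g') AS m;
  RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER docs_refs_trg
  BEFORE INSERT OR UPDATE OF content ON docs
  FOR EACH ROW EXECUTE FUNCTION docs_collect_refs();

-- Search the small column with a trigram index / LIKE:
-- SELECT id FROM docs WHERE hyphen_refs LIKE '%789-xyz%';
```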