Re: full text search and hyphens in uuid

Поиск
Список
Период
Сортировка
От Martin Norbäck Olivers
Тема Re: full text search and hyphens in uuid
Дата
Msg-id CALoTC6sk1_8-r6e-9-t2oAe79yX94iG=AOkSH2nHRnxgDrHGFg@mail.gmail.com
обсуждение исходный текст
Ответ на full text search and hyphens in uuid  (Martin Norbäck Olivers <martin@norpan.org>)
Список pgsql-sql
Hi! Thanks for answering.

My use case for doing this is that I have uuids embedded within the text data (it's JSON data actually) and I just index to_tsvector('simple', json_column).

And I want to search for the uuids sometimes, and it's not predetermined which json keys contain them. But it does seem like it's not possible to change the to_tsvector lexer, so I guess I will have to extract the uuids when inserting the data and index them separately.

Regards,
Martin

On Sat, Oct 28, 2023 at 5:48 PM Steve Midgley <science@misuse.org> wrote:
On Fri, Oct 27, 2023 at 4:49 AM Martin Norbäck Olivers <martin@norpan.org> wrote:
Hi!
I have a problem with full text search and uuids in the text which I index using to_tsvector . I have uuids in my text and most of the time, it works well because they are lexed as words so I can just search for the parts of the uuid.

The problem is an uuid like this:
select to_tsvector('simple','0232710f-8545-59eb-abcd-47aa57184361')

Which gives this result
'-59':3 '-8545':2 '0232710f':1 '47aa57184361':7 'abcd':6 'eb':5 'eb-abcd-47aa57184361':4

So, I found dict_int and asked it to remove the minus signs

create extension dict_int;
ALTER TEXT SEARCH DICTIONARY intdict (MAXLEN = 12, absval = true);
alter text search configuration simple alter mapping for int, uint with intdict

 and now I get this result instead:
'0232710f':1 '47aa57184361':7 '59':3 '8545':2 'abcd':6 'eb':5 'eb-abcd-47aa57184361':4

which is slightly better, but still not good enough because there is no token 59eb. It's being split into 59 and eb.

Is there any way to change this behaviour of the tsvector lexer? Do I have to write my own tsvector or is there a way to "turn off" integer handling in the lexer?

Regards,
Martin

I don't understand your use case for doing this, but it seems like you could use something other than ts_vector to break apart your uuids, and then index them? It seems like ts_vector is primarily used to find things that are near to other things via their vector signatures (at least that's my understanding). But doing vector component math on segments of a UUID seems meaningless since the UUID is mostly random?

So couldn't you break your UUID into separate fields, or barring that into a jsonb or array field that contains the components, and then just index that computed field? Maybe that could even be achieved in a view, if you don't want to alter your core table? 

Obviously all this could be insensible, if I'm not following the purpose of your use of ts_vector..
Best,
Steve 


--
Martin Norbäck Olivers
IT-konsult, Masara AB
Telefon: +46 703 22 70 12
Kärrhöksvägen 4
656 72 Skattkärr

В списке pgsql-sql по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: full text search and hyphens in uuid
Следующее
От: Sanjay Minni
Дата:
Сообщение: DEFAULT in update & prepared statements