Re: full text search and hyphens in uuid

Поиск

Список

Период

Сортировка

От	Martin Norbäck Olivers
Тема	Re: full text search and hyphens in uuid
Дата	29 октября 2023 г. 19:31:07
Msg-id	CALoTC6sk1_8-r6e-9-t2oAe79yX94iG=AOkSH2nHRnxgDrHGFg@mail.gmail.com обсуждение исходный текст
Ответ на	full text search and hyphens in uuid (Martin Norbäck Olivers <martin@norpan.org>)
Список	pgsql-sql

Дерево обсуждения

Hi! Thanks for answering.

My use case for doing this is that I have uuids embedded within the text data (it's JSON data actually) and I just index to_tsvector('simple', json_column).

And I want to search for the uuids sometimes, and it's not predetermined which json keys contain them. But it does seem like it's not possible to change the to_tsvector lexer, so I guess I will have to extract the uuids when inserting the data and index them separately.

Regards,
Martin

On Sat, Oct 28, 2023 at 5:48 PM Steve Midgley <science@misuse.org> wrote:

On Fri, Oct 27, 2023 at 4:49 AM Martin Norbäck Olivers <martin@norpan.org> wrote:
Hi!
I have a problem with full text search and uuids in the text which I index using to_tsvector . I have uuids in my text and most of the time, it works well because they are lexed as words so I can just search for the parts of the uuid.

The problem is an uuid like this:
select to_tsvector('simple','0232710f-8545-59eb-abcd-47aa57184361')

Which gives this result
'-59':3 '-8545':2 '0232710f':1 '47aa57184361':7 'abcd':6 'eb':5 'eb-abcd-47aa57184361':4

So, I found dict_int and asked it to remove the minus signs

create extension dict_int;
ALTER TEXT SEARCH DICTIONARY intdict (MAXLEN = 12, absval = true);
alter text search configuration simple alter mapping for int, uint with intdict

and now I get this result instead:
'0232710f':1 '47aa57184361':7 '59':3 '8545':2 'abcd':6 'eb':5 'eb-abcd-47aa57184361':4

which is slightly better, but still not good enough because there is no token 59eb. It's being split into 59 and eb.

Is there any way to change this behaviour of the tsvector lexer? Do I have to write my own tsvector or is there a way to "turn off" integer handling in the lexer?

Regards,
Martin

I don't understand your use case for doing this, but it seems like you could use something other than ts_vector to break apart your uuids, and then index them? It seems like ts_vector is primarily used to find things that are near to other things via their vector signatures (at least that's my understanding). But doing vector component math on segments of a UUID seems meaningless since the UUID is mostly random?

So couldn't you break your UUID into separate fields, or barring that into a jsonb or array field that contains the components, and then just index that computed field? Maybe that could even be achieved in a view, if you don't want to alter your core table?

Obviously all this could be insensible, if I'm not following the purpose of your use of ts_vector..
Best,
Steve

Martin Norbäck Olivers

IT-konsult, Masara AB

Telefon: +46 703 22 70 12

E-post: martin@norpan.org

Kärrhöksvägen 4

656 72 Skattkärr

В списке pgsql-sql по дате отправления:

Предыдущее

От: Tom Lane
Дата: 28 октября 2023 г., 05:05:14
Сообщение: Re: full text search and hyphens in uuid

Следующее

От: Sanjay Minni
Дата: 08 ноября 2023 г., 19:02:16
Сообщение: DEFAULT in update & prepared statements

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: full text search and hyphens in uuid

Предыдущее

Следующее