Re: BUG #18149: Incorrect lexeme for english token "proxy"

Поиск
Список
Период
Сортировка
От Patrick Peralta
Тема Re: BUG #18149: Incorrect lexeme for english token "proxy"
Дата
Msg-id CADV9oGz+JRYg2n1X2Jzknr-2YSAjcGT9FJPY0wiMNBnV+ZJUgw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: BUG #18149: Incorrect lexeme for english token "proxy"  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-bugs
So the difficulty is that (a) the default TS parser doesn't break down
this multiply-hyphenated word quite the way you'd hoped

I see, thanks for pointing this out. Now I understand that this isn't a lexeme issue; rather it's a side effect of how the parser handles hyphens.

I tried this out with underscores which produces a better result:

# select * from ts_debug('english', 'CLOUD_PROXY_SEP19_T1_254__1695167380256');
   alias   |       description        |     token     |  dictionaries  |  dictionary  |     lexemes
-----------+--------------------------+---------------+----------------+--------------+-----------------
 asciiword | Word, all ASCII          | CLOUD         | {english_stem} | english_stem | {cloud}
 blank     | Space symbols            | _             | {}             |              |
 asciiword | Word, all ASCII          | PROXY         | {english_stem} | english_stem | {proxi}
 blank     | Space symbols            | _             | {}             |              |
 numword   | Word, letters and digits | SEP19         | {simple}       | simple       | {sep19}
 blank     | Space symbols            | _             | {}             |              |
 numword   | Word, letters and digits | T1            | {simple}       | simple       | {t1}
 blank     | Space symbols            | _             | {}             |              |
 uint      | Unsigned integer         | 254           | {simple}       | simple       | {254}
 blank     | Space symbols            | __            | {}             |              |
 uint      | Unsigned integer         | 1695167380256 | {simple}       | simple       | {1695167380256}
(11 rows)

# SELECT to_tsvector('english', 'CLOUD_PROXY__SEP19_T1-254__1695167380256') @@ to_tsquery('english','cloud_proxy:*');
 ?column?
----------
 t

If that data format is being imposed on you

These are freeform fields so we are not imposing a data format. However I wonder if we can do some data "massaging" to deal with hyphens. Writing our own parser is definitely overkill for this edge condition. 

Really appreciate this insight, it was very helpful! We can close this ticket.

Thanks,
Patrick


On Sat, Oct 7, 2023 at 12:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Patrick Peralta <pperalta@gmail.com> writes:
> However I ran into an anomaly with this query:

> # SELECT to_tsvector('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256')
> @@ to_tsquery('english','cloud-proxy:*');
>  ?column?
> ----------
>  f
> (1 row)

Hmm.  Investigating that a bit:

regression=# select * from ts_debug('english', 'cloud-proxy');
      alias      |           description           |    token    |  dictionaries  |  dictionary  |    lexemes   
-----------------+---------------------------------+-------------+----------------+--------------+---------------
 asciihword      | Hyphenated word, all ASCII      | cloud-proxy | {english_stem} | english_stem | {cloud-proxi}
 hword_asciipart | Hyphenated word part, all ASCII | cloud       | {english_stem} | english_stem | {cloud}
 blank           | Space symbols                   | -           | {}             |              |
 hword_asciipart | Hyphenated word part, all ASCII | proxy       | {english_stem} | english_stem | {proxi}
(4 rows)

regression=# select * from ts_debug('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256');
      alias      |               description                |        token         |  dictionaries  |  dictionary  |        lexemes         
-----------------+------------------------------------------+----------------------+----------------+--------------+------------------------
 numhword        | Hyphenated word, letters and digits      | CLOUD-PROXY-SEP19-T1 | {simple}       | simple       | {cloud-proxy-sep19-t1}
 hword_asciipart | Hyphenated word part, all ASCII          | CLOUD                | {english_stem} | english_stem | {cloud}
 blank           | Space symbols                            | -                    | {}             |              |
 hword_asciipart | Hyphenated word part, all ASCII          | PROXY                | {english_stem} | english_stem | {proxi}
 blank           | Space symbols                            | -                    | {}             |              |
 hword_numpart   | Hyphenated word part, letters and digits | SEP19                | {simple}       | simple       | {sep19}
 blank           | Space symbols                            | -                    | {}             |              |
 hword_numpart   | Hyphenated word part, letters and digits | T1                   | {simple}       | simple       | {t1}
 blank           | Space symbols                            | -                    | {}             |              |
 uint            | Unsigned integer                         | 254                  | {simple}       | simple       | {254}
 blank           | Space symbols                            | -                    | {}             |              |
 int             | Signed integer                           | -1695167380256       | {simple}       | simple       | {-1695167380256}
(12 rows)

So the difficulty is that (a) the default TS parser doesn't break down
this multiply-hyphenated word quite the way you'd hoped, and (b) fragments
classified as numhword aren't passed through the english_stem dictionary
at all.  Also, (c) I'm doubtful that the snowball stemmer would have
converted cloud-proxy-sep19-t1 to cloud-proxi-sep19-t1; but it didn't get
the chance anyway.

While (b) would be easy to address with a custom TS configuration,
(a) and (c) can't be fixed without getting your hands dirty in
C code.  Is there any chance of adjusting the notation you're dealing
with here?  I get sane-looking results from, for example,

regression=# select to_tsvector('english', 'CLOUD-PROXY--SEP19-T1-254--1695167380256');
                                         to_tsvector                                         
----------------------------------------------------------------------------------------------
 '-1695167380256':8 '254':7 'cloud':2 'cloud-proxi':1 'proxi':3 'sep19':5 'sep19-t1':4 't1':6
(1 row)

If that data format is being imposed on you then I'm not seeing a good
solution without custom C code.  I'd be inclined to try to make the
parser generate all of "cloud-proxy-sep19-t1", "cloud-proxy-sep19",
"cloud-proxy" from this input, but a custom TS parser is kind of a
high bar to clear.

                        regards, tom lane

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: BUG #18149: Incorrect lexeme for english token "proxy"
Следующее
От: Richard Guo
Дата:
Сообщение: Re: BUG #17540: Prepared statement: PG switches to a generic query plan which is consistently much slower