Re: Stack overflow issue

From: Tom Lane
Subject: Re: Stack overflow issue
Date:
Msg-id: 3802215.1661900226@sss.pgh.pa.us
In reply to: Re: Stack overflow issue  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Stack overflow issue  (Richard Guo <guofenglinux@gmail.com>)
List: pgsql-hackers
I wrote:
> The upstream recommendation, which seems pretty sane to me, is to
> simply reject any string exceeding some threshold length as not
> possibly being a word.  Apparently it's common to use thresholds
> as small as 64 bytes, but in the attached I used 1000 bytes.

On further thought: that coding treats anything longer than 1000
bytes as a stopword, but maybe we should just accept it unmodified.
The manual says "A Snowball dictionary recognizes everything, whether
or not it is able to simplify the word".  While "recognizes" formally
includes the case of "recognizes as a stopword", people might find
this behavior surprising.  We could alternatively do it as attached,
which accepts overlength words but does nothing to them except
case-fold.  This is closer to the pre-patch behavior, but gives up
the opportunity to avoid useless downstream processing of long words.

            regards, tom lane

diff --git a/src/backend/snowball/dict_snowball.c b/src/backend/snowball/dict_snowball.c
index 68c9213f69..1d5dfff5a0 100644
--- a/src/backend/snowball/dict_snowball.c
+++ b/src/backend/snowball/dict_snowball.c
@@ -275,8 +275,24 @@ dsnowball_lexize(PG_FUNCTION_ARGS)
     char       *txt = lowerstr_with_len(in, len);
     TSLexeme   *res = palloc0(sizeof(TSLexeme) * 2);

-    if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
+    /*
+     * Do not pass strings exceeding 1000 bytes to the stemmer, as they're
+     * surely not words in any human language.  This restriction avoids
+     * wasting cycles on stuff like base64-encoded data, and it protects us
+     * against possible inefficiency or misbehavior in the stemmer.  (For
+     * example, the Turkish stemmer has an indefinite recursion, so it can
+     * crash on long-enough strings.)  However, Snowball dictionaries are
+     * defined to recognize all strings, so we can't reject the string as an
+     * unknown word.
+     */
+    if (len > 1000)
+    {
+        /* return the lexeme lowercased, but otherwise unmodified */
+        res->lexeme = txt;
+    }
+    else if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
     {
+        /* empty or stopword, so report as stopword */
         pfree(txt);
     }
     else
