Re: Unicode normalization SQL functions

Поиск

Список

Период

Сортировка

От	Daniel Verite
Тема	Re: Unicode normalization SQL functions
Дата	17 февраля 2020 г. 22:08:00
Msg-id	2c5e8df9-43b8-41fa-88e6-286e8634f00a@manitou-mail.org обсуждение исходный текст
Ответ на	Re: Unicode normalization SQL functions (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Ответы	Re: Unicode normalization SQL functions (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Список	pgsql-hackers

Дерево обсуждения

  Hi,

I've checked the v3 patch against the results of the normalization
done by ICU [1] on my test data again, and they're identical
(as they were with v1; v2 had the bug discussed upthread, now fixed).

Concerning execution speed, there's an excessive CPU usage when
normalizing into NFC or NFKC. Looking at pre-existing code, it looks
like recompose_code() in unicode_norm.c looping over the
UnicodeDecompMain array might be very costly.

Another point is that the ICU-based implementation appears
to be significantly faster in all cases, which makes me wonder
why ICU builds should not just use ICU instead of the PG-core
implementation.
To illustrate this, here are the execution times reported by psql for
the queries below exercising the normalization code, both with the
functions provided by the patch and with the equivalent functions
implemented with ICU.
The dataset is ~10 million unique short strings
extracted from real data, and the number is a median execution time in
millisecs, for 10 successive runs with query parallelism off
(stddev in parentheses).

 operation  |     core       |    icu
------------+--------------+-----------
 nfc check  | 4398 (20)    | 3088 (27)
 nfc conv   | 771502 (414) | 5503 (19)
 nfd check  | 4510 (10)    | 2898 (8)
 nfd conv   | 9102 (1)       | 5569 (6)
 nfkc check | 4825 (51)    | 3273 (4)
 nfkc conv  | 772240 (340) | 5763 (8)
 nfkd check | 4794 (4)       | 3170 (39)
 nfkd conv  | 9229 (4)       | 5824 (9)

The queries:

check w/core:
  select count(*) from words where w is $NORM normalized;

conversion w/core:
  select sum(length(normalize(w, $NORM))) from words;

check w/icu:
  select count(*) from words where icu_is_normalized(w, '$NORM');

conversion w/icu:
  select sum(length(icu_normalize(w, '$NORM'))) from words;


[1] https://github.com/dverite/icu_ext/blob/master/icu_normalize.c

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

В списке pgsql-hackers по дате отправления:

Предыдущее

От: "Haumacher, Bernhard"
Дата: 17 февраля 2020 г., 21:01:38
Сообщение: Re: Error on failed COMMIT

Следующее

От: "Daniel Verite"
Дата: 17 февраля 2020 г., 22:14:03
Сообщение: Re: Unicode normalization SQL functions

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Unicode normalization SQL functions

Предыдущее

Следующее