Re: Unicode normalization SQL functions

Поиск
Список
Период
Сортировка
От Daniel Verite
Тема Re: Unicode normalization SQL functions
Дата
Msg-id 2c5e8df9-43b8-41fa-88e6-286e8634f00a@manitou-mail.org
обсуждение исходный текст
Ответ на Re: Unicode normalization SQL functions  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Ответы Re: Unicode normalization SQL functions  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Список pgsql-hackers
  Hi,

I've checked the v3 patch against the results of the normalization
done by ICU [1] on my test data again, and they're identical
(as they were with v1; v2 had the bug discussed upthread, now fixed).

Concerning execution speed, there's an excessive CPU usage when
normalizing into NFC or NFKC. Looking at pre-existing code, it looks
like recompose_code() in unicode_norm.c looping over the
UnicodeDecompMain array might be very costly.

Another point is that the ICU-based implementation appears
to be significantly faster in all cases, which makes me wonder
why ICU builds should not just use ICU instead of the PG-core
implementation.
To illustrate this, here are the execution times reported by psql for
the queries below exercising the normalization code, both with the
functions provided by the patch and with the equivalent functions
implemented with ICU.
The dataset is ~10 million unique short strings
extracted from real data, and the number is a median execution time in
millisecs, for 10 successive runs with query parallelism off
(stddev in parentheses).

 operation  |     core       |    icu
------------+--------------+-----------
 nfc check  | 4398 (20)    | 3088 (27)
 nfc conv   | 771502 (414) | 5503 (19)
 nfd check  | 4510 (10)    | 2898 (8)
 nfd conv   | 9102 (1)       | 5569 (6)
 nfkc check | 4825 (51)    | 3273 (4)
 nfkc conv  | 772240 (340) | 5763 (8)
 nfkd check | 4794 (4)       | 3170 (39)
 nfkd conv  | 9229 (4)       | 5824 (9)

The queries:

check w/core:
  select count(*) from words where w is $NORM normalized;

conversion w/core:
  select sum(length(normalize(w, $NORM))) from words;

check w/icu:
  select count(*) from words where icu_is_normalized(w, '$NORM');

conversion w/icu:
  select sum(length(icu_normalize(w, '$NORM'))) from words;


[1] https://github.com/dverite/icu_ext/blob/master/icu_normalize.c

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Haumacher, Bernhard"
Дата:
Сообщение: Re: Error on failed COMMIT
Следующее
От: "Daniel Verite"
Дата:
Сообщение: Re: Unicode normalization SQL functions