Re: BUG #15548: Unaccent does not remove combining diacritical characters

From	Hugh Ranalli
Subject	Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date	15 December 2018, 21:08:00
Msg-id	CAAhbUMN1n=ZVns-OeCbaVRYPS0oj7tTnmJrzw7Az-op4DHC+JA@mail.gmail.com
In reply to	Re: BUG #15548: Unaccent does not remove combining diacritical characters (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: BUG #15548: Unaccent does not remove combining diacritical characters (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-bugs

On Fri, 14 Dec 2018 at 17:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Hugh Ranalli <hugh@whtc.ca> writes:
> Cool. Please add it to the current CF so we don't forget about it:
> https://commitfest.postgresql.org/21/

Done.

> Me too -- seems like that bears looking into. Perhaps the script's
> results are platform dependent -- what were you testing on?

I'm on Linux Mint 17, which is based on Ubuntu 14.04, but I don't think that's it. The program's decisions come from the two data files: the Unicode data set and the Latin-ASCII transliteration file. The script uses the Unicode general categories (ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html#General%20Category) to identify letters (and now combining marks) and, if they fall within the defined ranges, performs a substitution. It then uses the transliteration file to find rules for particular character substitutions (for example, that file seems to handle the copyright symbol substitution). I don't see anything platform dependent in there.
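To make the category-plus-range logic concrete, here is a minimal sketch. It is not the code from generate_unaccent_rules: it uses Python's built-in unicodedata module as a stand-in for parsing the Unicode data file, and the range values and helper names (is_plain_letter, strip_marks) are assumptions chosen for illustration.

```python
import unicodedata

# Illustrative ranges only; the real script defines its own PLAIN_LETTER_RANGES.
PLAIN_LETTER_RANGES = ((ord("a"), ord("z")), (ord("A"), ord("Z")))


def is_plain_letter(ch):
    """True if the codepoint falls inside one of the configured ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in PLAIN_LETTER_RANGES)


def is_letter(ch):
    """General categories starting with 'L' are letters (Lu, Ll, ...)."""
    return unicodedata.category(ch).startswith("L")


def is_mark(ch):
    """General categories starting with 'M' are marks (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def strip_marks(ch):
    """Return the base letter of a letter-with-marks, or None if no rule applies."""
    decomposed = unicodedata.normalize("NFD", ch)
    if len(decomposed) < 2:
        return None  # nothing to strip
    base, marks = decomposed[0], decomposed[1:]
    if is_letter(base) and is_plain_letter(base) and all(is_mark(m) for m in marks):
        return base
    return None
```

With these assumptions, strip_marks("é") yields "e", while a character with no canonical decomposition, such as "ø", yields None and would have to be covered by a transliteration rule instead.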

Looking more closely, I also see that the script isn't generating ligatures, even though it should. The program has the code to generate them, but none of the ligatures fall within the ranges defined in PLAIN_LETTER_RANGES, so they are skipped.

This could probably be handled by adding the ligature ranges to the defined ranges. Symbol types could be added to the types it looks at, and perhaps the codepoint ranges collapsed into one list, as the IDs are unique across all categories. I don't think we'd want to just rely on ranges, as that could include control characters, punctuation, etc.
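A rough sketch of what that might look like, with the ranges collapsed into one list and the category check kept alongside it. The range values, the category prefixes, and the names CODEPOINT_RANGES, is_candidate and expand_ligature are all hypothetical; the compatibility decompositions come from Python's unicodedata module rather than from the script's own parser.

```python
import unicodedata

# Hypothetical single list of ranges; values are examples, not the script's.
CODEPOINT_RANGES = (
    (0x0041, 0x005A),  # A-Z
    (0x0061, 0x007A),  # a-z
    (0x0300, 0x036F),  # Combining Diacritical Marks block
    (0xFB00, 0xFB06),  # Latin ligatures in Alphabetic Presentation Forms
)

# Letters, marks and (as proposed) symbols.
WANTED_CATEGORY_PREFIXES = ("L", "M", "S")


def in_ranges(codepoint):
    """True if the codepoint lies in any configured range."""
    return any(lo <= codepoint <= hi for lo, hi in CODEPOINT_RANGES)


def is_candidate(ch):
    """Require both a wanted category and a configured range, so that control
    characters or punctuation inside a range are still rejected."""
    wanted = unicodedata.category(ch).startswith(WANTED_CATEGORY_PREFIXES)
    return wanted and in_ranges(ord(ch))


def expand_ligature(ch):
    """Expand a compatibility ligature into its component letters, else None."""
    decomposition = unicodedata.decomposition(ch)
    if not decomposition.startswith("<compat>"):
        return None
    parts = decomposition.split()[1:]
    if len(parts) < 2:
        return None  # a single-character mapping is not a ligature
    return "".join(chr(int(part, 16)) for part in parts)
```

Under these assumptions, "ﬁ" (U+FB01) passes is_candidate and expands to "fi", whereas "!" is rejected by the category check even if a range were widened to include it.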

There are a number of other characters that appear in unaccent.rules that aren't generated by the script. I've attached a diff of the output of generate_unaccent_rules (using the version before my changes, to simplify matters) and unaccent.rules. Unfortunately, I don't know how to interpret most of these characters.
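One way to make such a diff easier to read is to print the codepoint, general category and Unicode name of each affected source character. The sketch below assumes a unified diff in which each rule line holds the source character first, followed by whitespace and the replacement; the function names are made up for this example.

```python
import unicodedata


def describe(ch):
    """Format one character as 'U+XXXX <category> <name>'."""
    name = unicodedata.name(ch, "<no name>")
    return f"U+{ord(ch):04X} {unicodedata.category(ch)} {name}"


def describe_diff(diff_text):
    """Describe the source characters on added/removed lines of a unified diff."""
    described = []
    for line in diff_text.splitlines():
        # Skip file headers, hunk headers and context lines.
        if line.startswith(("+++", "---")) or not line.startswith(("+", "-")):
            continue
        fields = line[1:].split()
        if not fields:
            continue
        source = fields[0]  # assumed: first field is the character being replaced
        described.append((line[0], [describe(c) for c in source]))
    return described
```

For a line such as "+©	(C)" this reports U+00A9 with category So (Symbol, other), which shows directly why a letter-only category filter would not produce that rule.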

I suppose it's valid to ask if changing © to (C) is even something an "unaccent" function should do. Given that it's in the existing rules file, should it be supported as an existing behaviour?

Sorry for more questions than answers. ;-)

Attachments

unaccent.diff
