Re: BUG #15548: Unaccent does not remove combining diacritical characters

From Hugh Ranalli
Subject Re: BUG #15548: Unaccent does not remove combining diacritical characters
Msg-id CAAhbUMN1n=ZVns-OeCbaVRYPS0oj7tTnmJrzw7Az-op4DHC+JA@mail.gmail.com
In response to Re: BUG #15548: Unaccent does not remove combining diacritical characters  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: BUG #15548: Unaccent does not remove combining diacritical characters  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs


On Fri, 14 Dec 2018 at 17:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hugh Ranalli <hugh@whtc.ca> writes:
> Cool.  Please add it to the current CF so we don't forget about it:
> https://commitfest.postgresql.org/21/

Done.

> Me too -- seems like that bears looking into.  Perhaps the script's
> results are platform dependent -- what were you testing on?

I'm on Linux Mint 17, which is based on Ubuntu 14.04. But I don't think that's it. The script's decisions come from its two data files: the Unicode data set and the Latin-ASCII transliteration file. The script uses general categories (ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html#General%20Category) to identify letters (and now combining marks) and, if they are in range, performs a substitution. It then uses the transliteration file to find rules for particular character substitutions (for example, that file seems to handle the copyright symbol substitution). I don't see anything platform dependent in there.
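
To make the category-and-range logic described above concrete, here is a minimal sketch. It is not the code from generate_unaccent_rules: it assumes Python's standard unicodedata module in place of parsing the Unicode data file, the range values are illustrative, and the names COMBINING_MARK_RANGES, in_ranges, base_letter and rule_for are hypothetical. The transliteration-file step is not shown.

```python
import unicodedata

# Hypothetical stand-ins for the script's range tables; values are illustrative.
PLAIN_LETTER_RANGES = ((ord("a"), ord("z")), (ord("A"), ord("Z")))
COMBINING_MARK_RANGES = ((0x0300, 0x036F),)  # Combining Diacritical Marks block


def in_ranges(ch, ranges):
    """True if the code point of ch falls inside any (low, high) range."""
    return any(lo <= ord(ch) <= hi for lo, hi in ranges)


def base_letter(ch):
    """Follow canonical decompositions down to a plain letter, or return None."""
    fields = unicodedata.decomposition(ch).split()
    if not fields or fields[0].startswith("<"):
        return None  # no decomposition, or a compatibility one such as <compat>
    first, *rest = (chr(int(f, 16)) for f in fields)
    # Everything after the first component must be a mark (category M*).
    if not all(unicodedata.category(c).startswith("M") for c in rest):
        return None
    if in_ranges(first, PLAIN_LETTER_RANGES):
        return first
    return base_letter(first)  # e.g. a letter carrying two stacked marks


def rule_for(ch):
    """Return the replacement text for ch, or None if no rule is generated."""
    category = unicodedata.category(ch)
    if category.startswith("M") and in_ranges(ch, COMBINING_MARK_RANGES):
        return ""  # a combining mark is simply removed
    if category.startswith("L"):
        return base_letter(ch)
    return None  # symbols such as U+00A9 would need the transliteration file
```

Nothing in this sketch depends on the operating system: the result is determined by the Unicode data and the range tables alone, which is the point being made above.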

Looking more closely, I also see that the script isn't generating ligatures, even though it should: the code to generate them is there, but none of the ligatures fall within the ranges defined in PLAIN_LETTER_RANGES, so they are skipped.
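
The effect can be illustrated with a short sketch. It assumes that the range test is applied to the ligature's own code point, which is my reading of the behaviour rather than something taken from the script's source; the range values and the helper names in_ranges and ligature_parts are hypothetical.

```python
import unicodedata

# Hypothetical stand-in for the script's PLAIN_LETTER_RANGES (ASCII letters only).
PLAIN_LETTER_RANGES = ((ord("a"), ord("z")), (ord("A"), ord("Z")))


def in_ranges(cp, ranges=PLAIN_LETTER_RANGES):
    """True if code point cp falls inside any (low, high) range."""
    return any(lo <= cp <= hi for lo, hi in ranges)


def ligature_parts(ch):
    """Return the plain letters a <compat> ligature expands to, or None."""
    fields = unicodedata.decomposition(ch).split()
    # A ligature needs the <compat> tag plus at least two component code points.
    if len(fields) < 3 or fields[0] != "<compat>":
        return None
    parts = "".join(chr(int(f, 16)) for f in fields[1:])
    return parts if all(in_ranges(ord(c)) for c in parts) else None


lig = "\ufb01"  # LATIN SMALL LIGATURE FI
expansion = ligature_parts(lig)          # the expansion itself is available
skipped = not in_ranges(ord(lig))        # but U+FB01 lies outside the ranges
```

Under that assumption the expansion logic works, yet the ligature never reaches it because its code point is rejected by the range check first.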

This could probably be handled by adding the ligature ranges to the defined ranges. Symbol types could be added to the types it looks at, and perhaps the codepoint ranges collapsed into one list, as the IDs are unique across all categories. I don't think we'd want to just rely on ranges, as that could include control characters, punctuation, etc. 
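
A rough sketch of what that might look like, with a single list of code point ranges combined with a category check. The ranges, the accepted category prefixes, and the names CODEPOINT_RANGES, ACCEPTED_CATEGORIES and is_candidate are all illustrative assumptions, not a proposed patch.

```python
import unicodedata

# Hypothetical single list of code point ranges; values are illustrative only.
CODEPOINT_RANGES = (
    (0x0000, 0x00FF),  # Basic Latin and Latin-1 Supplement, deliberately broad
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0xFB00, 0xFB06),  # Latin ligatures
)

# General category prefixes to consider: letters, marks and symbols.
ACCEPTED_CATEGORIES = ("L", "M", "S")


def is_candidate(ch):
    """True if ch is in a listed range and has an accepted general category."""
    cp = ord(ch)
    in_range = any(lo <= cp <= hi for lo, hi in CODEPOINT_RANGES)
    return in_range and unicodedata.category(ch).startswith(ACCEPTED_CATEGORIES)
```

The first range is deliberately broad to show why the category check is still needed: a newline (category Cc) and an exclamation mark (category Po) both fall inside it and are rejected, while the copyright sign (category So) and the fi ligature (category Ll) are accepted.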

There are a number of other characters that appear in unaccent.rules that aren't generated by the script. I've attached a diff of the output of generate_unaccent_rules (using the version before my changes, to simplify matters) and unaccent.rules. Unfortunately, I don't know how to interpret most of these characters.

I suppose it's valid to ask if changing © to (C) is even something an "unaccent" function should do. Given that it's in the existing rules file, should it be supported as an existing behaviour?

Sorry for more questions than answers. ;-)
