Re: [HACKERS] Extra Vietnamese unaccent rules

From: Thomas Munro
Subject: Re: [HACKERS] Extra Vietnamese unaccent rules
Msg-id: CAEepm=2o7gmoZaG+t0NJ=xUkLUBqMzg_s1aojm2CX7fkrnXoHA@mail.gmail.com
In response to: [HACKERS] Extra Vietnamese unaccent rules (Nguyen Le Hoang Kha <nlhkha@gmail.com>)
Responses: Re: [HACKERS] Extra Vietnamese unaccent rules (Michael Paquier <michael.paquier@gmail.com>)
           Re: [HACKERS] Extra Vietnamese unaccent rules (Kha Nguyen <nlhkha@gmail.com>)
List: pgsql-hackers

On Sat, May 27, 2017 at 9:09 AM, Kha Nguyen <nlhkha@gmail.com> wrote:
> Could you explain to me what this line means:
> "
> 1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4
> "
>
> If you could give me an example of adding a rule for the "recursive" case,
> I can do the rest. I am not familiar with this unaccent format generation yet.

So contrib/unaccent/generate_unaccent_rules.py is a Python script that
takes UnicodeData.txt, a list of information about all Unicode
codepoints available at a URL that is shown in a comment, and
generates unaccent.rules.  The idea was to avoid having to change it
manually every time someone finds characters that should be in there
(as you have just done!) by doing it systematically.

Unicode has two ways to represent characters with accents: either with
composed codepoints like "é", or with decomposed sequences where you say
"e" and then a combining acute accent (U+0301).  The field "00E2 0301"
is the decomposed form of the character in the line you quoted: "â"
(U+00E2) followed by a combining acute accent (U+0301).  Our job here
is to identify the basic letter
that each composed character contains, by analysing the decomposed
field that you see in that line.  I failed to realise that characters
with TWO accents are described as a composed character with ONE accent
plus another accent.
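
For illustration, here is a small Python sketch of how the quoted
UnicodeData.txt line breaks down.  It is not taken from
generate_unaccent_rules.py: the function name parse_unicode_data_line
is made up for this example, and it assumes the semicolon-separated
layout of UnicodeData.txt in which the sixth field holds the
decomposition.

```python
import unicodedata

# The line quoted above, joined back onto a single line.
LINE = ("1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;"
        "Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4")


def parse_unicode_data_line(line):
    """Hypothetical helper: split one UnicodeData.txt record into the
    pieces that matter for unaccenting."""
    fields = line.split(";")
    codepoint = int(fields[0], 16)   # field 0: the codepoint, in hex
    name = fields[1]                 # field 1: the character name
    category = fields[2]             # field 2: general category, "Ll" = lowercase letter
    # Field 5: the decomposition, a space-separated list of hex codepoints.
    # Compatibility decompositions carry a leading "<tag>"; this sketch
    # simply drops the tag rather than treating such records specially.
    combining_ids = [int(part, 16)
                     for part in fields[5].split()
                     if not part.startswith("<")]
    return codepoint, name, category, combining_ids


codepoint, name, category, combining_ids = parse_unicode_data_line(LINE)

# combining_ids[0] is U+00E2, which is itself an accented letter, and
# combining_ids[1] is U+0301, the second accent.
for cp in combining_ids:
    print("U+%04X %s" % (cp, unicodedata.name(chr(cp))))
```

The first element of the decomposition is "â", not "a", which is
exactly the case described above: a letter with two accents is recorded
as a letter with one accent plus another accent.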

You don't have to worry about decoding that line, it's all done in
that Python script.  The problem is just in the function
is_letter_with_marks().  Instead of just checking if combining_ids[0]
is a plain letter, it looks like it should also check if
combining_ids[0] itself is a letter with marks.  Also get_plain_letter
would need to be able to recurse to extract the "a".
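
A rough sketch of what that recursive check could look like follows.
It is an illustration under assumptions, not the actual script: it uses
Python's standard unicodedata module instead of parsing
UnicodeData.txt, and the helpers decomposition_ids, is_plain_letter and
is_mark, along with their exact definitions, are invented here.  Only
the names is_letter_with_marks and get_plain_letter are borrowed from
the script.

```python
import unicodedata


def decomposition_ids(ch):
    """Hypothetical helper: canonical decomposition of ch as a list of
    codepoints, or [] if there is none."""
    field = unicodedata.decomposition(ch)
    # Compatibility decompositions start with a "<tag>"; ignore them here.
    if not field or field.startswith("<"):
        return []
    return [int(part, 16) for part in field.split()]


def is_plain_letter(ch):
    # Assumption for this sketch: a plain letter is any letter without a
    # canonical decomposition.  The real script may define this differently.
    return unicodedata.category(ch).startswith("L") and not decomposition_ids(ch)


def is_mark(ch):
    return unicodedata.category(ch).startswith("M")


def is_letter_with_marks(ch):
    """True if ch decomposes to a base followed by marks, where the base
    is a plain letter OR is itself a letter with marks (the recursive case)."""
    ids = decomposition_ids(ch)
    if len(ids) < 2:
        return False
    base = chr(ids[0])
    if not all(is_mark(chr(cp)) for cp in ids[1:]):
        return False
    return is_plain_letter(base) or is_letter_with_marks(base)


def get_plain_letter(ch):
    """Peel off marks, recursing through the base, until a plain letter remains."""
    if is_plain_letter(ch):
        return ch
    if not is_letter_with_marks(ch):
        raise ValueError("not a letter with marks: %r" % ch)
    return get_plain_letter(chr(decomposition_ids(ch)[0]))


# U+1EA5 decomposes to U+00E2 + U+0301, and U+00E2 to U+0061 + U+0302,
# so two levels of recursion are needed to reach "a".
print(get_plain_letter("\u1ea5"))
```

With a non-recursive check, U+1EA5 would be rejected because its first
decomposition element, U+00E2, is not a plain letter.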

I hope that helps!

--
Thomas Munro
http://www.enterprisedb.com


