Re: [HACKERS] Extra Vietnamese unaccent rules

From Michael Paquier
Subject Re: [HACKERS] Extra Vietnamese unaccent rules
Date
Msg-id CAB7nPqTVYiaCqWgviJU1mM4LazGbuZ2sxp7qBAPNCHzSZAF4Dw@mail.gmail.com
In response to Re: [HACKERS] Extra Vietnamese unaccent rules  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: [HACKERS] Extra Vietnamese unaccent rules  (Dang Minh Huong <kakalot49@gmail.com>)
List pgsql-hackers
On Fri, May 26, 2017 at 5:48 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Unicode has two ways to represent characters with accents: either with
> composed codepoints like "é" or decomposed codepoints where you say
> "e" and then "´".  The field "00E2 0301" is the decomposed form of
> that character above.  Our job here is to identify the basic letter
> that each composed character contains, by analysing the decomposed
> field that you see in that line.  I failed to realise that characters
> with TWO accents are described as a composed character with ONE accent
> plus another accent.

Doesn't that depend on the normalization form you are working with?
With a full canonical decomposition, it seems to me that a character
with two accents can just as well be decomposed into one base character
followed by two combining accents (NFKC does a canonical decomposition
in one of its steps).
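
To illustrate the difference, here is a minimal sketch using Python's
standard unicodedata module (not the unaccent script itself). It compares
the one-step decomposition field from UnicodeData.txt with the full
canonical decomposition (NFD) for U+1EA5, the character whose field is
"00E2 0301". The variable names are only illustrative.

```python
import unicodedata

ch = "\u1ea5"  # LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE

# One-step mapping, exactly as stored in the UnicodeData.txt
# decomposition field: a composed letter plus one combining accent.
one_step = unicodedata.decomposition(ch)

# Full canonical decomposition (NFD): applied recursively, so the result
# is the base letter followed by both combining accents.
full = [f"{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]

print(one_step)  # 00E2 0301
print(full)      # ['0061', '0302', '0301']
```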

> You don't have to worry about decoding that line, it's all done in
> that Python script.  The problem is just in the function
> is_letter_with_marks().  Instead of just checking if combining_ids[0]
> is a plain letter, it looks like it should also check if
> combining_ids[0] itself is a letter with marks.  Also get_plain_letter
> would need to be able to recurse to extract the "a".
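
A rough sketch of the recursion described above, assuming the
decomposition field is read through Python's unicodedata module rather
than through the script's own table. The function names mirror the ones
mentioned in the quote, but the bodies and the helper decomposition_parts()
are hypothetical, not the script's actual code.

```python
import string
import unicodedata

def decomposition_parts(ch):
    """Characters in ch's canonical decomposition field, or [] if none."""
    field = unicodedata.decomposition(ch)
    # Compatibility mappings carry a "<tag>" prefix; skip them here.
    if not field or field.startswith("<"):
        return []
    return [chr(int(cp, 16)) for cp in field.split()]

def is_plain_letter(ch):
    return ch in string.ascii_letters

def is_mark(ch):
    return unicodedata.category(ch) in ("Mn", "Me", "Mc")

def is_letter_with_marks(ch):
    parts = decomposition_parts(ch)
    if len(parts) < 2:
        return False
    base, marks = parts[0], parts[1:]
    # The first element may itself be a letter with marks, so recurse.
    return (is_plain_letter(base) or is_letter_with_marks(base)) and \
        all(is_mark(m) for m in marks)

def get_plain_letter(ch):
    """Strip marks level by level until a plain ASCII letter remains."""
    if is_plain_letter(ch):
        return ch
    if is_letter_with_marks(ch):
        return get_plain_letter(decomposition_parts(ch)[0])
    raise ValueError("not a letter with marks: %r" % ch)

print(get_plain_letter("\u1ea5"))  # a
```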

Actually, given the recent work on unicode_norm_table.h, which
transposes UnicodeData.txt into user-friendly tables, shouldn't the
Python script in unaccent/ be replaced by something that works on this
table? This does a canonical decomposition but keeps only the first
characters with a combining class of 0. So we have basic APIs able to
look at UnicodeData.txt and let the caller make decisions based on the
result returned.
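
As a sketch of that idea, the snippet below uses Python's unicodedata
module as a stand-in for the C tables. It is not the
unicode_norm_table.h API, and strip_to_class_zero() is a hypothetical
name. It decomposes canonically and keeps only the characters whose
combining class is 0.

```python
import unicodedata

def strip_to_class_zero(text):
    """Canonically decompose text, then drop every character whose
    canonical combining class is non-zero (i.e. the combining accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)

print(strip_to_class_zero("\u1ea5"))     # a
print(strip_to_class_zero("Vi\u1ec7t"))  # Viet
```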
--
Michael