PostgreSQL fails to convert decomposed utf-8 to other encodings

From: Craig Ringer
Subject: PostgreSQL fails to convert decomposed utf-8 to other encodings
Msg-id: 53E179E1.3060404@2ndquadrant.com
Replies: Re: PostgreSQL fails to convert decomposed utf-8 to other encodings  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-bugs

There's a bug in encoding conversions from utf-8 to other encodings that
results in corrupt output if decomposed utf-8 is used.

PostgreSQL doesn't normalize utf-8 to precomposed form before
converting, so decomposed sequences (a base character followed by one or
more combining marks) are not handled correctly.

Take á:

regress=> -- Decomposed - 'a' then 'acute'
regress=> SELECT E'\u0061\u0301';
 ?column?
----------
 á
(1 row)

regress=> -- Precomposed - 'a-acute'
regress=> SELECT E'\u00E1';
 ?column?
----------
 á
(1 row)


regress=> SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');
ERROR:  character with byte sequence 0xcc 0x81 in encoding "UTF8" has no
equivalent in encoding "LATIN1"

regress=> SELECT convert_to(E'\u00E1', 'iso-8859-1');
 convert_to
------------
 \xe1
(1 row)
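
To see where the 0xcc 0x81 in the error comes from, it can help to look
at the raw bytes of the two forms outside the database. A minimal
sketch in Python (not PostgreSQL code; the variable names are only for
illustration):

```python
# Two spellings of the same visible character.
decomposed = "a\u0301"   # 'a' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e1"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE

# UTF-8 bytes of each form, printed as space-separated hex.
print(decomposed.encode("utf-8").hex(" "))
print(precomposed.encode("utf-8").hex(" "))
```

The decomposed form ends in the two bytes cc 81, the combining accent
on its own, which is the sequence the error message complains about.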


This affects input from the client too:

regress=> SELECT convert_to('á', 'iso-8859-1');
ERROR:  character with byte sequence 0xcc 0x81 in encoding "UTF8" has no
equivalent in encoding "LATIN1"

regress=> SELECT convert_to('á', 'iso-8859-1');
 convert_to
------------
 \xe1
(1 row)


... yes, that looks like the same function producing different results
on identical input. You might not be able to reproduce with copy and
paste from this mail if your client normalizes UTF-8, but you'll be able
to by printing the decomposed character to your terminal as an escape
string, then copying and pasting from there.


We probably should have been normalizing decomposed sequences to
precomposed as part of utf-8 validation wherever 'text' input occurs,
but it's too late for that now, as DBs in the wild will already contain
decomposed chars. Instead, conversion functions need to normalize
decomposed chars to precomposed before converting from utf-8 to another
encoding.
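
As a rough illustration of the idea (compose first, then convert), here
is a sketch in Python rather than in PostgreSQL's conversion code. The
helper name to_latin1 is hypothetical, and NFC is assumed to be the
normalization form meant by "precomposed":

```python
import unicodedata


def to_latin1(text: str) -> bytes:
    """Compose combining sequences (NFC), then convert to ISO-8859-1."""
    composed = unicodedata.normalize("NFC", text)
    return composed.encode("iso-8859-1")


# Both spellings of a-acute now convert to the same single byte.
print(to_latin1("a\u0301"))
print(to_latin1("\u00e1"))
```

Sequences with no precomposed equivalent (for example 'x' followed by a
combining acute) are left decomposed by NFC and would still fail to
convert, so this only helps where the target encoding has the composed
character.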

Comments?

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
