Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding
From | Sergey Burladyan |
---|---|
Subject | Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding |
Date | |
Msg-id | 200803200633.03865.eshkinkot@gmail.com |
In reply to | Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding ("Heikki Linnakangas" <heikki@enterprisedb.com>) |
Responses |
Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding
("Heikki Linnakangas" <heikki@enterprisedb.com>)
|
List | pgsql-bugs |
Thursday 20 March 2008 01:16:34 Heikki Linnakangas:

Thanks for the answer, Heikki!

> You'd need to modify the mic->ISO-8859-5 translation table as well, for
> converting in the other direction.

Oops, I had not thought about that %)

> Here's a patch that does the conversion in the other direction as well.
> As I'm not too familiar with cyrillic, can you double-check that this
> works? I tested it using the convert() function between different
> encodings, and it seems ok to me.

Yes, I tested it with a function like this, and it works now :)

    create or replace function test_convert() returns setof record as $$
    declare
        --- russian alphabet, 33 upper and 33 lower letters in utf-8 encoding
        r bytea default
    E'\320\260\320\261\320\262\320\263\320\264\320\265\321\221\320\266\320\267\320\270\320\271\320\272\320\273\320\274\320\275\320\276\320\277\321\200\321\201\321\202\321\203\321\204\321\205\321\206\321\207\321\210\321\211\321\212\321\213\321\214\321\215\321\216\321\217\320\220\320\221\320\222\320\223\320\224\320\225\320\201\320\226\320\227\320\230\320\231\320\232\320\233\320\234\320\235\320\236\320\237\320\240\320\241\320\242\320\243\320\244\320\245\320\246\320\247\320\250\320\251\320\252\320\253\320\254\320\255\320\256\320\257';
        s bytea; --- converted to result
        t bytea; --- converted back result
        res record;
    begin
        raise notice 'russian ABC: "%"', encode(r, 'escape');

        s := convert(r, 'utf-8', 'iso-8859-5');
        t := convert(s, 'iso-8859-5', 'windows-1251');
        t := convert(t, 'windows-1251', 'utf-8');
        if t != r then
            raise exception 'iso-8859-5, windows-1251 | t != r';
        end if;
        res := row('iso-8859-5, windows-1251'::text,
                   encode(
                       convert(convert(s, 'iso-8859-5', 'windows-1251'), 'windows-1251', 'utf-8')
                       , 'escape')::text
               );
        return next res;

    [...skip...]

    seb=# select * from test_convert() as (conv text, res text);
    NOTICE:  russian ABC: "абвгдеёжз..."
                conv            |    res
    ----------------------------+-----------
     iso-8859-5, windows-1251   | абвгдеёжз...
     iso-8859-5, windows-866    | абвгдеёжз...
     iso-8859-5, koi8-r         | абвгдеёжз...
     iso-8859-5, iso-8859-5     | абвгдеёжз...
     windows-866, windows-1251  | абвгдеёжз...
     windows-866, iso-8859-5    | абвгдеёжз...
     windows-866, koi8-r        | абвгдеёжз...
     windows-866, windows-866   | абвгдеёжз...
     windows-1251, windows-866  | абвгдеёжз...
     windows-1251, iso-8859-5   | абвгдеёжз...
     windows-1251, koi8-r       | абвгдеёжз...
     windows-1251, windows-1251 | абвгдеёжз...
     koi8-r, windows-866        | абвгдеёжз...
     koi8-r, iso-8859-5         | абвгдеёжз...
     koi8-r, windows-1251       | абвгдеёжз...
     koi8-r, koi8-r             | абвгдеёжз...
    (16 rows)

> Hmm. We use KOI8-R (or rather, MULE_INTERNAL with KOI8-R ) as an
> intermediate encoding, because there's no direct conversion table
> between ISO-8859-5 and the other cyrillic encodings. Ideally there would
> be. Another possibility would be to use UTF-8 as the intermediate
> encoding; that'd probably be much slower, but UTF-8 should have all the
> characters needed.

I think UTF-8 is too complex for translating one 8-bit charset to another
8-bit charset, but the other solution is many, many translation tables...
A hard question %)

> Is there any other characters like "YO" that are missing, that exist in
> all the encodings?

If we are talking about alphabet letters, the answer is no, only "YO" was
missing. If we are talking about any character, there is 'NO-BREAK SPACE'
(U+00A0): it exists in 1251, 866, koi8-r and iso, but I do not think it is
widely used...

> Looking at the character set table for KOI8-R, it
> looks like the "YO" is in an odd place in the table, compared to all
> other cyrillic characters. Perhaps that's why it was missed.

Yes, I understand. Russian character sets have always been a challenge for
all programmers :) There are at least five of them, and they are all
different.

Thanks for the patch, Heikki!

---
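The round-trip check described in the message can also be sketched outside the database. The following is a minimal Python illustration, not the PostgreSQL implementation: it pivots through Unicode (the alternative Heikki mentions) rather than through MULE_INTERNAL with KOI8-R, and it relies on Python's built-in codecs instead of PostgreSQL's conversion tables. The names `convert`, `round_trip_failures`, and `ENCODINGS` are hypothetical and chosen only for this example.

```python
# Russian alphabet: 33 lower-case and 33 upper-case letters, including "yo".
LOWER = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
UPPER = LOWER.upper()
ALPHABET = LOWER + UPPER

# Python codec names for the four 8-bit Cyrillic encodings from the thread
# (ISO-8859-5, windows-866, windows-1251, KOI8-R).
ENCODINGS = ["iso8859_5", "cp866", "cp1251", "koi8_r"]


def convert(data: bytes, src: str, dst: str) -> bytes:
    """Convert bytes from one 8-bit encoding to another via a Unicode pivot."""
    return data.decode(src).encode(dst)


def round_trip_failures(text: str) -> list:
    """Return the (src, dst) pairs for which src -> dst -> src loses data."""
    failures = []
    for src in ENCODINGS:
        original = text.encode(src)
        for dst in ENCODINGS:
            there = convert(original, src, dst)
            back = convert(there, dst, src)
            if back != original:
                failures.append((src, dst))
    return failures


if __name__ == "__main__":
    # Full alphabet, including the "YO" letters that were missing in 8.3.
    print(round_trip_failures(ALPHABET))
    # NO-BREAK SPACE (U+00A0), the other shared character named above.
    print(round_trip_failures("\u00a0"))
```

An empty list from `round_trip_failures` means every one of the 16 encoding pairs survived the round trip, which mirrors the 16 rows returned by `test_convert()`.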