Обсуждение: UTF-8
Hello! I'm using PgSQL for a 3 years for web applications, but not only. But the main problem is in encoding. My web applications are used by international (mostly 3 languages: latvian (LATIN7), english and russian). The best (mostly) solution is to use UTF-8, but there are a lot of problems. The main problem is in searching (tsearch2, lower, upper, e.c.) and sorting. I would be a glad to hear your solutions, experience in web application with multi languages (searching with indexing, sorting and others problems with multi byte encoding). For developers: what are your future plans about UTF-8 in Postgres? thanx Martins
On Fri, Oct 06, 2006 at 12:44:43PM +0300, Martins Mihailovs wrote: > I would be a glad to hear your solutions, experience in web application > with multi languages (searching with indexing, sorting and others > problems with multi byte encoding). > > For developers: what are your future plans about UTF-8 in Postgres? Currnetly postgresql using the OS to do sorting, upper/lower conversions, etc. Which means you stuck with whatever your OS does, which may or may not be any good. The fix is easy, make postgresql use some kind of system independant locale database to do these transformations. We just need someone to write it.... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Вложения
Martijn van Oosterhout wrote: > On Fri, Oct 06, 2006 at 12:44:43PM +0300, Martins Mihailovs wrote: >> I would be a glad to hear your solutions, experience in web application >> with multi languages (searching with indexing, sorting and others >> problems with multi byte encoding). >> >> For developers: what are your future plans about UTF-8 in Postgres? > > Currnetly postgresql using the OS to do sorting, upper/lower > conversions, etc. Which means you stuck with whatever your OS does, > which may or may not be any good. > > The fix is easy, make postgresql use some kind of system independant > locale database to do these transformations. We just need someone to > write it.... > There are some misunderstood. Im using Linux 2.6.16.4, postgresql 8.1.4, (there are one of locale: lv_LV.utf8, for Latvian language). But if I want do "lower", then with standard latin symbols all is ok, but with others special symbols (like umlaut in Germany) there is problems, and sorting is going not like alphabet but like latin alphabet and specials symbols after. :( For sorting there are solution function nls_to_string(), witch is using C library and system locales for converting strings to binary. But it dont work for search (like, tsearch2, e.c.) I dont know what to do, but function lower() is not working for special symbols (LATIN 7). And I tried in windows - the same... (as I know windows knows how to converts latvian symbols lower, upper... in UTF-8)...
On Tue, Oct 10, 2006 at 11:49:06AM +0300, Martins Mihailovs wrote: > There are some misunderstood. Im using Linux 2.6.16.4, postgresql 8.1.4, > (there are one of locale: lv_LV.utf8, for Latvian language). But if I > want do "lower", then with standard latin symbols all is ok, but with > others special symbols (like umlaut in Germany) there is problems, and > sorting is going not like alphabet but like latin alphabet and specials > symbols after. :( You don't say what your encoding is. If it not UTF-8, that's your problem... Check the lc_collate, lc_ctype and encoding settings and make sure they're compatable with eachother. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Вложения
2006/10/12, Martijn van Oosterhout <kleptog@svana.org>: > On Tue, Oct 10, 2006 at 11:49:06AM +0300, Martins Mihailovs wrote: > > There are some misunderstood. Im using Linux 2.6.16.4, postgresql 8.1.4, > > (there are one of locale: lv_LV.utf8, for Latvian language). But if I > > want do "lower", then with standard latin symbols all is ok, but with > > others special symbols (like umlaut in Germany) there is problems, and > > sorting is going not like alphabet but like latin alphabet and specials > > symbols after. :( > > You don't say what your encoding is. If it not UTF-8, that's your > problem... Doesn't lv_LV.utf8 mean he *did* say what his encoding is? t.n.a.
On Thu, Oct 12, 2006 at 11:09:53PM +0200, Tomi NA wrote: > 2006/10/12, Martijn van Oosterhout <kleptog@svana.org>: > >On Tue, Oct 10, 2006 at 11:49:06AM +0300, Martins Mihailovs wrote: > >> There are some misunderstood. Im using Linux 2.6.16.4, postgresql 8.1.4, > >> (there are one of locale: lv_LV.utf8, for Latvian language). But if I > >> want do "lower", then with standard latin symbols all is ok, but with > >> others special symbols (like umlaut in Germany) there is problems, and > >> sorting is going not like alphabet but like latin alphabet and specials > >> symbols after. :( > > > >You don't say what your encoding is. If it not UTF-8, that's your > >problem... > > Doesn't lv_LV.utf8 mean he *did* say what his encoding is? Not really. It says the encoding the system *expects*. However, if he actually created his database with LATIN1 encoding, it would explain the problems he's having. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Вложения
2006/10/13, Martijn van Oosterhout <kleptog@svana.org>: > On Thu, Oct 12, 2006 at 11:09:53PM +0200, Tomi NA wrote: > > 2006/10/12, Martijn van Oosterhout <kleptog@svana.org>: > > >On Tue, Oct 10, 2006 at 11:49:06AM +0300, Martins Mihailovs wrote: > > >> There are some misunderstood. Im using Linux 2.6.16.4, postgresql 8.1.4, > > >> (there are one of locale: lv_LV.utf8, for Latvian language). But if I > > >> want do "lower", then with standard latin symbols all is ok, but with > > >> others special symbols (like umlaut in Germany) there is problems, and > > >> sorting is going not like alphabet but like latin alphabet and specials > > >> symbols after. :( > > > > > >You don't say what your encoding is. If it not UTF-8, that's your > > >problem... > > > > Doesn't lv_LV.utf8 mean he *did* say what his encoding is? > > Not really. It says the encoding the system *expects*. However, if he > actually created his database with LATIN1 encoding, it would explain > the problems he's having. This is a reoccurring topic on the list: sure, it's possible to misconfigure pg so that uppercase/lowercase/ilike/tsearch2/order don't work with a single letter outside of the English alphabet, but the problem Martins seems to be facing is one we've seen here before (myself being one of those affected). There's no way Martins can set up pg - UTF or no UTF - so that collation and case insensitivity-based functions work in both Latvian an Russian. Because I have the same problem with Croatian, German and Italian, I've limited my use of pg to projects targeted at LAN or intranet environments: others are probably switching to mysql or firebird altogether as it's easier to work with just one RDBMS than two. t.n.a.
On Fri, Oct 13, 2006 at 03:40:17PM +0200, Tomi NA wrote: > This is a reoccurring topic on the list: sure, it's possible to > misconfigure pg so that uppercase/lowercase/ilike/tsearch2/order don't > work with a single letter outside of the English alphabet, but the > problem Martins seems to be facing is one we've seen here before > (myself being one of those affected). There's no way Martins can set > up pg - UTF or no UTF - so that collation and case insensitivity-based > functions work in both Latvian an Russian. While sorting for multiple languages simultaneously is an issue, that's not the problem here. Linux/GLibc *does* support correct sorting for all language/charset combinations, and that's what he's using. Just for the hell of it I setup lv_LV.utf8 on my laptop and verifed that it sorts just fine: $ LC_ALL=lv_LV.utf8 sort < /tmp/test2 | recode utf8..latin1 a á B d F ü Z $ (hope it looks ok on your screen, the source file is utf8 but the recode is for display). Similarly, upper/lower are also supported, although postgresql doesn't take advantage of the system support in that case. So yes, PostgreSQL does have some issues in this area, but for the platform he is using sorting *should* work. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Вложения
2006/10/13, Martijn van Oosterhout <kleptog@svana.org>: > While sorting for multiple languages simultaneously is an issue, that's > not the problem here. Linux/GLibc *does* support correct sorting for > all language/charset combinations, and that's what he's using. Just for > the hell of it I setup lv_LV.utf8 on my laptop and verifed that it > sorts just fine: ... > Similarly, upper/lower are also supported, although postgresql doesn't > take advantage of the system support in that case. I think this is the crux of the problem. Not supporting uppercase and lowercase makes an e.g. generic people search dialog not malfunction: searching for "Müller" will not find him if he is stored as "OTTO MÜLLER" in the database. Certainly not if I have to make sure the search finds one "Zvonimir Šimić" stored as "ZVONIMIR ŠIMIĆ". Whats more, if the user gives up on the integrated search and tries to list all the people in such a database ordered by their last names, he probably won't find Šimić because the user expects him to be between S and T, not after 'Z' (where he ends up beacuse the letter code of 'Š' is greater than that of 'Z'). As for Martins' problem, he needs to support 2 non-english languages which means he's stuck with the same problem of one language being semi functional. As an aside, why doesn't pg take advantage ot the underlying system's support of upper/lower case? t.n.a.
"Tomi NA" <hefest@gmail.com> writes: > 2006/10/13, Martijn van Oosterhout <kleptog@svana.org>: >> Similarly, upper/lower are also supported, although postgresql doesn't >> take advantage of the system support in that case. > I think this is the crux of the problem. If it were true, then it might be ... regards, tom lane
On Fri, Oct 13, 2006 at 12:04:02PM -0400, Tom Lane wrote: > "Tomi NA" <hefest@gmail.com> writes: > > 2006/10/13, Martijn van Oosterhout <kleptog@svana.org>: > >> Similarly, upper/lower are also supported, although postgresql doesn't > >> take advantage of the system support in that case. > > > I think this is the crux of the problem. > > If it were true, then it might be ... Eh? Here's the declaration of pg_toupper: unsigned char pg_toupper(unsigned char ch); Characters havn't fitted in an unsigned char in a very long time. It's obviously bogus for any multibyte encoding (the code even says so). For such encodings you could use the system's towupper() (ANSI C/Unix98) which will work on any unicode char. To make this work, pg_strupper() will have to convert each character to Unicode, run towupper() and convert back to the encoding. I imagine that'll get rejected for being inefficient, but really don't see any other way. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Вложения
Martijn van Oosterhout <kleptog@svana.org> writes: > Characters havn't fitted in an unsigned char in a very long time. It's > obviously bogus for any multibyte encoding (the code even says so). For > such encodings you could use the system's towupper() (ANSI C/Unix98) > which will work on any unicode char. http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/adt/oracle_compat.c?rev=1.67 * If the system provides the needed functions for wide-character manipulation * (which are all standardized by C99), then we implement upper/lower/initcap * using wide-character functions. Otherwise we use the traditional <ctype.h> * functions, which of course will not work as desired in multibyte character * sets. Note that in either case we are effectively assuming that the * database character encoding matches the encoding implied by LC_CTYPE. regards, tom lane
Martijn van Oosterhout wrote: > On Thu, Oct 12, 2006 at 11:09:53PM +0200, Tomi NA wrote: >> 2006/10/12, Martijn van Oosterhout <kleptog@svana.org>: >>> On Tue, Oct 10, 2006 at 11:49:06AM +0300, Martins Mihailovs wrote: >>>> There are some misunderstood. Im using Linux 2.6.16.4, postgresql 8.1.4, >>>> (there are one of locale: lv_LV.utf8, for Latvian language). But if I >>>> want do "lower", then with standard latin symbols all is ok, but with >>>> others special symbols (like umlaut in Germany) there is problems, and >>>> sorting is going not like alphabet but like latin alphabet and specials >>>> symbols after. :( >>> You don't say what your encoding is. If it not UTF-8, that's your >>> problem... >> Doesn't lv_LV.utf8 mean he *did* say what his encoding is? > > Not really. It says the encoding the system *expects*. However, if he > actually created his database with LATIN1 encoding, it would explain > the problems he's having. > > Have a nice day, of course DB is width UNICODE....