Обсуждение: UTF-8

Поиск
Список
Период
Сортировка

UTF-8

От
Martins Mihailovs
Дата:
Hello!

I'm using PgSQL for a 3 years for web applications, but not only. But
the main problem is in encoding. My web applications are used by
international (mostly 3 languages: latvian (LATIN7), english and
russian). The best (mostly) solution is to use UTF-8, but there are a
lot of problems. The main problem is in searching (tsearch2, lower,
upper, e.c.) and sorting.

I would be a glad to hear your solutions, experience in web application
with multi languages (searching with indexing, sorting and others
problems with multi byte encoding).

For developers: what are your future plans about UTF-8 in Postgres?

thanx
Martins

Re: UTF-8

От
Martijn van Oosterhout
Дата:
On Fri, Oct 06, 2006 at 12:44:43PM +0300, Martins Mihailovs wrote:
> I would be a glad to hear your solutions, experience in web application
> with multi languages (searching with indexing, sorting and others
> problems with multi byte encoding).
>
> For developers: what are your future plans about UTF-8 in Postgres?

Currnetly postgresql using the OS to do sorting, upper/lower
conversions, etc. Which means you stuck with whatever your OS does,
which may or may not be any good.

The fix is easy, make postgresql use some kind of system independant
locale database to do these transformations. We just need someone to
write it....

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Вложения

Re: UTF-8

От
Martins Mihailovs
Дата:
Martijn van Oosterhout wrote:
> On Fri, Oct 06, 2006 at 12:44:43PM +0300, Martins Mihailovs wrote:
>> I would be a glad to hear your solutions, experience in web application
>> with multi languages (searching with indexing, sorting and others
>> problems with multi byte encoding).
>>
>> For developers: what are your future plans about UTF-8 in Postgres?
>
> Currnetly postgresql using the OS to do sorting, upper/lower
> conversions, etc. Which means you stuck with whatever your OS does,
> which may or may not be any good.
>
> The fix is easy, make postgresql use some kind of system independant
> locale database to do these transformations. We just need someone to
> write it....
>


There are some misunderstood. Im using Linux 2.6.16.4, postgresql 8.1.4,
(there are one of locale:   lv_LV.utf8, for Latvian language). But if I
want do "lower", then with standard latin symbols all is ok, but with
others special symbols (like umlaut in Germany) there is problems, and
sorting is going not like alphabet but like latin alphabet and specials
symbols after. :(

For sorting there are solution function nls_to_string(), witch is using
C library and system locales for converting strings to binary. But it
dont work for search (like, tsearch2, e.c.)

I dont know what to do, but function lower() is not working for special
symbols (LATIN 7).

And I tried in windows - the same... (as I know windows knows how to
converts latvian symbols lower, upper... in UTF-8)...





Re: UTF-8

От
Martijn van Oosterhout
Дата:
On Tue, Oct 10, 2006 at 11:49:06AM +0300, Martins Mihailovs wrote:
> There are some misunderstood. Im using Linux 2.6.16.4, postgresql 8.1.4,
> (there are one of locale:   lv_LV.utf8, for Latvian language). But if I
> want do "lower", then with standard latin symbols all is ok, but with
> others special symbols (like umlaut in Germany) there is problems, and
> sorting is going not like alphabet but like latin alphabet and specials
> symbols after. :(

You don't say what your encoding is. If it not UTF-8, that's your
problem...

Check the lc_collate, lc_ctype and encoding settings and make sure
they're compatable with eachother.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Вложения

Re: UTF-8

От
"Tomi NA"
Дата:
2006/10/12, Martijn van Oosterhout <kleptog@svana.org>:
> On Tue, Oct 10, 2006 at 11:49:06AM +0300, Martins Mihailovs wrote:
> > There are some misunderstood. Im using Linux 2.6.16.4, postgresql 8.1.4,
> > (there are one of locale:   lv_LV.utf8, for Latvian language). But if I
> > want do "lower", then with standard latin symbols all is ok, but with
> > others special symbols (like umlaut in Germany) there is problems, and
> > sorting is going not like alphabet but like latin alphabet and specials
> > symbols after. :(
>
> You don't say what your encoding is. If it not UTF-8, that's your
> problem...

Doesn't lv_LV.utf8 mean he *did* say what his encoding is?

t.n.a.

Re: UTF-8

От
Martijn van Oosterhout
Дата:
On Thu, Oct 12, 2006 at 11:09:53PM +0200, Tomi NA wrote:
> 2006/10/12, Martijn van Oosterhout <kleptog@svana.org>:
> >On Tue, Oct 10, 2006 at 11:49:06AM +0300, Martins Mihailovs wrote:
> >> There are some misunderstood. Im using Linux 2.6.16.4, postgresql 8.1.4,
> >> (there are one of locale:   lv_LV.utf8, for Latvian language). But if I
> >> want do "lower", then with standard latin symbols all is ok, but with
> >> others special symbols (like umlaut in Germany) there is problems, and
> >> sorting is going not like alphabet but like latin alphabet and specials
> >> symbols after. :(
> >
> >You don't say what your encoding is. If it not UTF-8, that's your
> >problem...
>
> Doesn't lv_LV.utf8 mean he *did* say what his encoding is?

Not really. It says the encoding the system *expects*. However, if he
actually created his database with LATIN1 encoding, it would explain
the problems he's having.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Вложения

Re: UTF-8

От
"Tomi NA"
Дата:
2006/10/13, Martijn van Oosterhout <kleptog@svana.org>:
> On Thu, Oct 12, 2006 at 11:09:53PM +0200, Tomi NA wrote:
> > 2006/10/12, Martijn van Oosterhout <kleptog@svana.org>:
> > >On Tue, Oct 10, 2006 at 11:49:06AM +0300, Martins Mihailovs wrote:
> > >> There are some misunderstood. Im using Linux 2.6.16.4, postgresql 8.1.4,
> > >> (there are one of locale:   lv_LV.utf8, for Latvian language). But if I
> > >> want do "lower", then with standard latin symbols all is ok, but with
> > >> others special symbols (like umlaut in Germany) there is problems, and
> > >> sorting is going not like alphabet but like latin alphabet and specials
> > >> symbols after. :(
> > >
> > >You don't say what your encoding is. If it not UTF-8, that's your
> > >problem...
> >
> > Doesn't lv_LV.utf8 mean he *did* say what his encoding is?
>
> Not really. It says the encoding the system *expects*. However, if he
> actually created his database with LATIN1 encoding, it would explain
> the problems he's having.

This is a reoccurring topic on the list: sure, it's possible to
misconfigure pg so that uppercase/lowercase/ilike/tsearch2/order don't
work with a single letter outside of the English alphabet, but the
problem Martins seems to be facing is one we've seen here before
(myself being one of those affected). There's no way Martins can set
up pg - UTF or no UTF - so that collation and case insensitivity-based
functions work in both Latvian an Russian.
Because I have the same problem with Croatian, German and Italian,
I've limited my use of pg to projects targeted at LAN or intranet
environments: others are probably switching to mysql or firebird
altogether as it's easier to work with just one RDBMS than two.

t.n.a.

Re: UTF-8

От
Martijn van Oosterhout
Дата:
On Fri, Oct 13, 2006 at 03:40:17PM +0200, Tomi NA wrote:
> This is a reoccurring topic on the list: sure, it's possible to
> misconfigure pg so that uppercase/lowercase/ilike/tsearch2/order don't
> work with a single letter outside of the English alphabet, but the
> problem Martins seems to be facing is one we've seen here before
> (myself being one of those affected). There's no way Martins can set
> up pg - UTF or no UTF - so that collation and case insensitivity-based
> functions work in both Latvian an Russian.

While sorting for multiple languages simultaneously is an issue, that's
not the problem here. Linux/GLibc *does* support correct sorting for
all language/charset combinations, and that's what he's using. Just for
the hell of it I setup lv_LV.utf8 on my laptop and verifed that it
sorts just fine:

$ LC_ALL=lv_LV.utf8 sort < /tmp/test2 | recode utf8..latin1
a
á
B
d
F
ü
Z
$
 (hope it looks ok on your screen, the source file is utf8 but the
recode is for display).

Similarly, upper/lower are also supported, although postgresql doesn't
take advantage of the system support in that case.

So yes, PostgreSQL does have some issues in this area, but for the
platform he is using sorting *should* work.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Вложения

Re: UTF-8

От
"Tomi NA"
Дата:
2006/10/13, Martijn van Oosterhout <kleptog@svana.org>:

> While sorting for multiple languages simultaneously is an issue, that's
> not the problem here. Linux/GLibc *does* support correct sorting for
> all language/charset combinations, and that's what he's using. Just for
> the hell of it I setup lv_LV.utf8 on my laptop and verifed that it
> sorts just fine:
...
> Similarly, upper/lower are also supported, although postgresql doesn't
> take advantage of the system support in that case.

I think this is the crux of the problem. Not supporting uppercase and
lowercase makes an e.g. generic people search dialog not malfunction:
searching for "Müller" will not find him if he is stored as "OTTO
MÜLLER" in the database. Certainly not if I have to make sure the
search finds one "Zvonimir Šimić" stored as "ZVONIMIR ŠIMIĆ". Whats
more, if the user gives up on the integrated search and tries to list
all the people in such a database ordered by their last names, he
probably won't find Šimić because the user expects him to be between S
and T, not after 'Z' (where he ends up beacuse the letter code of 'Š'
is greater than that of 'Z').
As for Martins' problem, he needs to support 2 non-english languages
which means he's stuck with the same problem of one language being
semi functional.

As an aside, why doesn't pg take advantage ot the underlying system's
support of upper/lower case?

t.n.a.

Re: UTF-8

От
Tom Lane
Дата:
"Tomi NA" <hefest@gmail.com> writes:
> 2006/10/13, Martijn van Oosterhout <kleptog@svana.org>:
>> Similarly, upper/lower are also supported, although postgresql doesn't
>> take advantage of the system support in that case.

> I think this is the crux of the problem.

If it were true, then it might be ...

            regards, tom lane

Re: UTF-8

От
Martijn van Oosterhout
Дата:
On Fri, Oct 13, 2006 at 12:04:02PM -0400, Tom Lane wrote:
> "Tomi NA" <hefest@gmail.com> writes:
> > 2006/10/13, Martijn van Oosterhout <kleptog@svana.org>:
> >> Similarly, upper/lower are also supported, although postgresql doesn't
> >> take advantage of the system support in that case.
>
> > I think this is the crux of the problem.
>
> If it were true, then it might be ...

Eh? Here's the declaration of pg_toupper:

unsigned char pg_toupper(unsigned char ch);

Characters havn't fitted in an unsigned char in a very long time. It's
obviously bogus for any multibyte encoding (the code even says so). For
such encodings you could use the system's towupper() (ANSI C/Unix98)
which will work on any unicode char.

To make this work, pg_strupper() will have to convert each character to
Unicode, run towupper() and convert back to the encoding. I imagine
that'll get rejected for being inefficient, but really don't see any
other way.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Вложения

Re: UTF-8

От
Tom Lane
Дата:
Martijn van Oosterhout <kleptog@svana.org> writes:
> Characters havn't fitted in an unsigned char in a very long time. It's
> obviously bogus for any multibyte encoding (the code even says so). For
> such encodings you could use the system's towupper() (ANSI C/Unix98)
> which will work on any unicode char.

http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/adt/oracle_compat.c?rev=1.67

 * If the system provides the needed functions for wide-character manipulation
 * (which are all standardized by C99), then we implement upper/lower/initcap
 * using wide-character functions.  Otherwise we use the traditional <ctype.h>
 * functions, which of course will not work as desired in multibyte character
 * sets.  Note that in either case we are effectively assuming that the
 * database character encoding matches the encoding implied by LC_CTYPE.

            regards, tom lane

Re: UTF-8

От
Martins Mihailovs
Дата:
Martijn van Oosterhout wrote:
> On Thu, Oct 12, 2006 at 11:09:53PM +0200, Tomi NA wrote:
>> 2006/10/12, Martijn van Oosterhout <kleptog@svana.org>:
>>> On Tue, Oct 10, 2006 at 11:49:06AM +0300, Martins Mihailovs wrote:
>>>> There are some misunderstood. Im using Linux 2.6.16.4, postgresql 8.1.4,
>>>> (there are one of locale:   lv_LV.utf8, for Latvian language). But if I
>>>> want do "lower", then with standard latin symbols all is ok, but with
>>>> others special symbols (like umlaut in Germany) there is problems, and
>>>> sorting is going not like alphabet but like latin alphabet and specials
>>>> symbols after. :(
>>> You don't say what your encoding is. If it not UTF-8, that's your
>>> problem...
>> Doesn't lv_LV.utf8 mean he *did* say what his encoding is?
>
> Not really. It says the encoding the system *expects*. However, if he
> actually created his database with LATIN1 encoding, it would explain
> the problems he's having.
>
> Have a nice day,


of course DB is width UNICODE....