Обсуждение: russian case-insensitive regexp search not working

Поиск
Список
Период
Сортировка

russian case-insensitive regexp search not working

От
alexander lunyov
Дата:
Hello, friends.

OS FreeBSD 6.2, Postgresql 8.2.4

Postgresql does not search case-insensitive russian regexp unicode
patterns. Postgres is working under user pgsql with login class (in
/etc/login.conf):

postgres:\
     :lang=ru_RU.UTF-8:\
     :setenv=LC_COLLATE=C:\
     :tc=default:

In .profile of postgres user:

LANG=ru_RU.UTF-8
export LANG
CHARSET=UTF-8
export CHARSET

Then, database:

db=> \encoding
UTF8

Case insensitive search for low-cased pattern show nothing:

db=> select street from people where street ~* 'зелен';
  street
--------
(0 rows)

While there are records, but they are with first capital character:

db=> select street from people where street ~* 'Зелен';
      street
----------------
  Зеленая
  Зеленоградская
(2 rows)

Search for english values work fine, russian values not. Why could it be?

--
alexander lunyov
lan@startatom.ru


Re: russian case-insensitive regexp search not working

От
alexander lunyov
Дата:
No, ILIKE also does case-sensitive search.

I found this bug report:
http://archives.postgresql.org/pgsql-bugs/2006-09/msg00065.php

Is it about this issue? And will it be fixed someday?

Sergey Levchenko wrote:
> Just use: select street from people where street ILIKE 'зелен%';
>
> select with case-insensitive regexp does no work right now!
>
> On 09/07/07, alexander lunyov <lan@startatom.ru> wrote:
>> Hello, friends.
>>
>> OS FreeBSD 6.2, Postgresql 8.2.4
>>
>> Postgresql does not search case-insensitive russian regexp unicode
>> patterns. Postgres is working under user pgsql with login class (in
>> /etc/login.conf):
>>
>> postgres:\
>>      :lang=ru_RU.UTF-8:\
>>      :setenv=LC_COLLATE=C:\
>>      :tc=default:
>>
>> In .profile of postgres user:
>>
>> LANG=ru_RU.UTF-8
>> export LANG
>> CHARSET=UTF-8
>> export CHARSET
>>
>> Then, database:
>>
>> db=> \encoding
>> UTF8
>>
>> Case insensitive search for low-cased pattern show nothing:
>>
>> db=> select street from people where street ~* 'зелен';
>>   street
>> --------
>> (0 rows)
>>
>> While there are records, but they are with first capital character:
>>
>> db=> select street from people where street ~* 'Зелен';
>>       street
>> ----------------
>>   Зеленая
>>   Зеленоградская
>> (2 rows)
>>
>> Search for english values work fine, russian values not. Why could it be?
>>
>> --
>> alexander lunyov
>> lan@startatom.ru
>>
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 2: Don't 'kill -9' the postmaster
>>

--
alexander lunyov
lan@startatom.ru


Re: russian case-insensitive regexp search not working

От
Karsten Hilbert
Дата:
On Mon, Jul 09, 2007 at 04:00:01PM +0400, alexander lunyov wrote:

> I found this bug report:
> http://archives.postgresql.org/pgsql-bugs/2006-09/msg00065.php
>
> Is it about this issue?
Yes.

> And will it be fixed someday?
Likely. In the meantime lower() can come to the rescue.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: russian case-insensitive regexp search not working

От
Tom Lane
Дата:
Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:
> On Mon, Jul 09, 2007 at 04:00:01PM +0400, alexander lunyov wrote:
>> I found this bug report:
>> http://archives.postgresql.org/pgsql-bugs/2006-09/msg00065.php
>>
>> Is it about this issue?
> Yes.

>> And will it be fixed someday?
> Likely. In the meantime lower() can come to the rescue.

It's not nearly as easy to fix as the reporter of that bug imagines,
because we don't know the relationship between pg_wchar representation
and the wchar_t representation assumed by the <wctype.h> functions.

In the short term it should work if you can use a single-byte encoding
(koi8r I suppose) and a locale setting compatible with that.

            regards, tom lane

Re: russian case-insensitive regexp search not working

От
Karsten Hilbert
Дата:
On Mon, Jul 09, 2007 at 09:50:42AM -0400, Tom Lane wrote:

> > On Mon, Jul 09, 2007 at 04:00:01PM +0400, alexander lunyov wrote:
> >> I found this bug report:
> >> http://archives.postgresql.org/pgsql-bugs/2006-09/msg00065.php
> >>
> >> Is it about this issue?
> > Yes.
>
> >> And will it be fixed someday?
> > Likely. In the meantime lower() can come to the rescue.
>
> It's not nearly as easy to fix as the reporter of that bug imagines,
> because we don't know the relationship between pg_wchar representation
> and the wchar_t representation assumed by the <wctype.h> functions.

Just to clarify: lower() on both sides of a comparison
should still work as expected on multibyte encodings ? It's
been suggested here before.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: russian case-insensitive regexp search not working

От
alexander lunyov
Дата:
Karsten Hilbert wrote:
> Just to clarify: lower() on both sides of a comparison
> should still work as expected on multibyte encodings ? It's
> been suggested here before.

lower() on both sides also does not working in my case, it still search
for case-sensitive data. String in this example have first char
capitalized, and result is the same. Seems that lower() can't lower
multibyte character.


db=> select lower('Зелен');
  lower
-------
  Зелен
(1 запись)


--
alexander lunyov
lan@startatom.ru


Re: russian case-insensitive regexp search not working

От
Karsten Hilbert
Дата:
On Tue, Jul 10, 2007 at 08:40:24AM +0400, alexander lunyov wrote:

>> Just to clarify: lower() on both sides of a comparison
>> should still work as expected on multibyte encodings ? It's
>> been suggested here before.
>
> lower() on both sides also does not working in my case, it still search for
> case-sensitive data. String in this example have first char capitalized,
> and result is the same. Seems that lower() can't lower multibyte character.
>
> db=> select lower('Зелен');
Well, no,

    select my_string where lower(my_string) ~ lower(search_fragment);

Does that help ?

(~ does work for eg. German in my experience)

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: russian case-insensitive regexp search not working

От
alexander lunyov
Дата:
Karsten Hilbert wrote:
>>> Just to clarify: lower() on both sides of a comparison
>>> should still work as expected on multibyte encodings ? It's
>>> been suggested here before.
>> lower() on both sides also does not working in my case, it still search for
>> case-sensitive data. String in this example have first char capitalized,
>> and result is the same. Seems that lower() can't lower multibyte character.
>>
>> db=> select lower('Зелен');
> Well, no,

    With this i just wanted to say that lower() doesn't work at all on
russian unicode characters, even in "select lower('String')" 'String'
don't become lowercase, and further it does not work in more complex
select statement.

>
>     select my_string where lower(my_string) ~ lower(search_fragment);
>
> Does that help ?
>
> (~ does work for eg. German in my experience)

     No, for russian unicode strings it is not working.
     I searched pgsql-patches@ list and found there this thread:
     http://archives.postgresql.org/pgsql-patches/2007-06/msg00021.php
     I wrote Andrew (he didn't answer yet) about whether this patch can
help with my problem.

P.S.: if this issue is a known bug (as we talked earlier), then how long
will it take to fix it? I know little about postgresql development
process, maybe you know it little better?

--
alexander lunyov
lan@startatom.ru




Re: russian case-insensitive regexp search not working

От
Tom Lane
Дата:
alexander lunyov <lan@startatom.ru> writes:
>     With this i just wanted to say that lower() doesn't work at all on
> russian unicode characters,

In that case you're using the wrong locale (ie, not russian unicode).
Check "show lc_ctype".

Or [ checks back in thread... ] maybe you're using the wrong operating
system.  Not so long ago FreeBSD didn't have Unicode locale support at
all; I'm not sure if 6.2 has that problem but it is worth checking.
Does it work for you to do case-insensitive russian comparisons in
"grep", for instance?

            regards, tom lane

Re: russian case-insensitive regexp search not working

От
alexander lunyov
Дата:
Tom Lane wrote:
> alexander lunyov <lan@startatom.ru> writes:
>>     With this i just wanted to say that lower() doesn't work at all on
>> russian unicode characters,
>
> In that case you're using the wrong locale (ie, not russian unicode).
> Check "show lc_ctype".

db=>  SHOW LC_CTYPE;
  lc_ctype
----------
  C
(1 запись)

db=>  SHOW LC_COLLATE;
  lc_collate
------------
  C
(1 запись)

Where can i change this? Trying to SET this parameters gives error
"parameter "lc_collate" cannot be changed"

> Or [ checks back in thread... ] maybe you're using the wrong operating
> system.  Not so long ago FreeBSD didn't have Unicode locale support at
> all; I'm not sure if 6.2 has that problem but it is worth checking.
> Does it work for you to do case-insensitive russian comparisons in
> "grep", for instance?

I put to textfile 3 russian strings with different case of first char
and grep'ed them all:

# cat > textfile
Зеленая
Зеленодольская
зеленая
# grep -i зелен *
textfile:Зеленая
textfile:Зеленодольская
textfile:зеленая

So i think system is fine about unicode.

--
alexander lunyov
lan@startatom.ru


Re: russian case-insensitive regexp search not working

От
Oleg Bartunov
Дата:
alexander,

lc_ctype and lc_collate can be changed only at initdb !
You need to read localization chapter
http://www.postgresql.org/docs/current/static/charset.html


Oleg
On Thu, 12 Jul 2007, alexander lunyov wrote:

> Tom Lane wrote:
>> alexander lunyov <lan@startatom.ru> writes:
>>>     With this i just wanted to say that lower() doesn't work at all on
>>> russian unicode characters,
>>
>> In that case you're using the wrong locale (ie, not russian unicode).
>> Check "show lc_ctype".
>
> db=>  SHOW LC_CTYPE;
> lc_ctype
> ----------
> C
> (1 запись)
>
> db=>  SHOW LC_COLLATE;
> lc_collate
> ------------
> C
> (1 запись)
>
> Where can i change this? Trying to SET this parameters gives error "parameter
> "lc_collate" cannot be changed"
>
>> Or [ checks back in thread... ] maybe you're using the wrong operating
>> system.  Not so long ago FreeBSD didn't have Unicode locale support at
>> all; I'm not sure if 6.2 has that problem but it is worth checking.
>> Does it work for you to do case-insensitive russian comparisons in
>> "grep", for instance?
>
> I put to textfile 3 russian strings with different case of first char and
> grep'ed them all:
>
> # cat > textfile
> Зеленая
> Зеленодольская
> зеленая
> # grep -i зелен *
> textfile:Зеленая
> textfile:Зеленодольская
> textfile:зеленая
>
> So i think system is fine about unicode.
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: russian case-insensitive regexp search not working

От
alexander lunyov
Дата:
Oleg Bartunov wrote:
> alexander,
>
> lc_ctype and lc_collate can be changed only at initdb !
> You need to read localization chapter
> http://www.postgresql.org/docs/current/static/charset.html


Yes, i knew about this, but i thought maybe somehow it can be changed
onthefly.

... (10 minutes later)

Yes, now when initdb done with --locale=ru_RU.UTF-8,
lower('RussianString') gives me 'russianstring', though, case-insensiive
regexp still not working. I guess i'll stick with lower() ~ lower()
construction.

And thanks everybody who replied!

>
>
> Oleg
> On Thu, 12 Jul 2007, alexander lunyov wrote:
>
>> Tom Lane wrote:
>>> alexander lunyov <lan@startatom.ru> writes:
>>>>     With this i just wanted to say that lower() doesn't work at all on
>>>> russian unicode characters,
>>>
>>> In that case you're using the wrong locale (ie, not russian unicode).
>>> Check "show lc_ctype".
>>
>> db=>  SHOW LC_CTYPE;
>> lc_ctype
>> ----------
>> C
>> (1 запись)
>>
>> db=>  SHOW LC_COLLATE;
>> lc_collate
>> ------------
>> C
>> (1 запись)
>>
>> Where can i change this? Trying to SET this parameters gives error
>> "parameter "lc_collate" cannot be changed"
>>
>>> Or [ checks back in thread... ] maybe you're using the wrong operating
>>> system.  Not so long ago FreeBSD didn't have Unicode locale support at
>>> all; I'm not sure if 6.2 has that problem but it is worth checking.
>>> Does it work for you to do case-insensitive russian comparisons in
>>> "grep", for instance?
>>
>> I put to textfile 3 russian strings with different case of first char
>> and grep'ed them all:
>>
>> # cat > textfile
>> Зеленая
>> Зеленодольская
>> зеленая
>> # grep -i зелен *
>> textfile:Зеленая
>> textfile:Зеленодольская
>> textfile:зеленая
>>
>> So i think system is fine about unicode.
>>
>>
>
>     Regards,
>         Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83

--
alexander lunyov
lan@startatom.ru