Обсуждение: russian case-insensitive regexp search not working
Hello, friends. OS FreeBSD 6.2, Postgresql 8.2.4 Postgresql does not search case-insensitive russian regexp unicode patterns. Postgres is working under user pgsql with login class (in /etc/login.conf): postgres:\ :lang=ru_RU.UTF-8:\ :setenv=LC_COLLATE=C:\ :tc=default: In .profile of postgres user: LANG=ru_RU.UTF-8 export LANG CHARSET=UTF-8 export CHARSET Then, database: db=> \encoding UTF8 Case insensitive search for low-cased pattern show nothing: db=> select street from people where street ~* 'зелен'; street -------- (0 rows) While there are records, but they are with first capital character: db=> select street from people where street ~* 'Зелен'; street ---------------- Зеленая Зеленоградская (2 rows) Search for english values work fine, russian values not. Why could it be? -- alexander lunyov lan@startatom.ru
No, ILIKE also does case-sensitive search. I found this bug report: http://archives.postgresql.org/pgsql-bugs/2006-09/msg00065.php Is it about this issue? And will it be fixed someday? Sergey Levchenko wrote: > Just use: select street from people where street ILIKE 'зелен%'; > > select with case-insensitive regexp does no work right now! > > On 09/07/07, alexander lunyov <lan@startatom.ru> wrote: >> Hello, friends. >> >> OS FreeBSD 6.2, Postgresql 8.2.4 >> >> Postgresql does not search case-insensitive russian regexp unicode >> patterns. Postgres is working under user pgsql with login class (in >> /etc/login.conf): >> >> postgres:\ >> :lang=ru_RU.UTF-8:\ >> :setenv=LC_COLLATE=C:\ >> :tc=default: >> >> In .profile of postgres user: >> >> LANG=ru_RU.UTF-8 >> export LANG >> CHARSET=UTF-8 >> export CHARSET >> >> Then, database: >> >> db=> \encoding >> UTF8 >> >> Case insensitive search for low-cased pattern show nothing: >> >> db=> select street from people where street ~* 'зелен'; >> street >> -------- >> (0 rows) >> >> While there are records, but they are with first capital character: >> >> db=> select street from people where street ~* 'Зелен'; >> street >> ---------------- >> Зеленая >> Зеленоградская >> (2 rows) >> >> Search for english values work fine, russian values not. Why could it be? >> >> -- >> alexander lunyov >> lan@startatom.ru >> >> >> ---------------------------(end of broadcast)--------------------------- >> TIP 2: Don't 'kill -9' the postmaster >> -- alexander lunyov lan@startatom.ru
On Mon, Jul 09, 2007 at 04:00:01PM +0400, alexander lunyov wrote: > I found this bug report: > http://archives.postgresql.org/pgsql-bugs/2006-09/msg00065.php > > Is it about this issue? Yes. > And will it be fixed someday? Likely. In the meantime lower() can come to the rescue. Karsten -- GPG key ID E4071346 @ wwwkeys.pgp.net E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
Karsten Hilbert <Karsten.Hilbert@gmx.net> writes: > On Mon, Jul 09, 2007 at 04:00:01PM +0400, alexander lunyov wrote: >> I found this bug report: >> http://archives.postgresql.org/pgsql-bugs/2006-09/msg00065.php >> >> Is it about this issue? > Yes. >> And will it be fixed someday? > Likely. In the meantime lower() can come to the rescue. It's not nearly as easy to fix as the reporter of that bug imagines, because we don't know the relationship between pg_wchar representation and the wchar_t representation assumed by the <wctype.h> functions. In the short term it should work if you can use a single-byte encoding (koi8r I suppose) and a locale setting compatible with that. regards, tom lane
On Mon, Jul 09, 2007 at 09:50:42AM -0400, Tom Lane wrote: > > On Mon, Jul 09, 2007 at 04:00:01PM +0400, alexander lunyov wrote: > >> I found this bug report: > >> http://archives.postgresql.org/pgsql-bugs/2006-09/msg00065.php > >> > >> Is it about this issue? > > Yes. > > >> And will it be fixed someday? > > Likely. In the meantime lower() can come to the rescue. > > It's not nearly as easy to fix as the reporter of that bug imagines, > because we don't know the relationship between pg_wchar representation > and the wchar_t representation assumed by the <wctype.h> functions. Just to clarify: lower() on both sides of a comparison should still work as expected on multibyte encodings ? It's been suggested here before. Karsten -- GPG key ID E4071346 @ wwwkeys.pgp.net E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
Karsten Hilbert wrote: > Just to clarify: lower() on both sides of a comparison > should still work as expected on multibyte encodings ? It's > been suggested here before. lower() on both sides also does not working in my case, it still search for case-sensitive data. String in this example have first char capitalized, and result is the same. Seems that lower() can't lower multibyte character. db=> select lower('Зелен'); lower ------- Зелен (1 запись) -- alexander lunyov lan@startatom.ru
On Tue, Jul 10, 2007 at 08:40:24AM +0400, alexander lunyov wrote: >> Just to clarify: lower() on both sides of a comparison >> should still work as expected on multibyte encodings ? It's >> been suggested here before. > > lower() on both sides also does not working in my case, it still search for > case-sensitive data. String in this example have first char capitalized, > and result is the same. Seems that lower() can't lower multibyte character. > > db=> select lower('Зелен'); Well, no, select my_string where lower(my_string) ~ lower(search_fragment); Does that help ? (~ does work for eg. German in my experience) Karsten -- GPG key ID E4071346 @ wwwkeys.pgp.net E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
Karsten Hilbert wrote: >>> Just to clarify: lower() on both sides of a comparison >>> should still work as expected on multibyte encodings ? It's >>> been suggested here before. >> lower() on both sides also does not working in my case, it still search for >> case-sensitive data. String in this example have first char capitalized, >> and result is the same. Seems that lower() can't lower multibyte character. >> >> db=> select lower('Зелен'); > Well, no, With this i just wanted to say that lower() doesn't work at all on russian unicode characters, even in "select lower('String')" 'String' don't become lowercase, and further it does not work in more complex select statement. > > select my_string where lower(my_string) ~ lower(search_fragment); > > Does that help ? > > (~ does work for eg. German in my experience) No, for russian unicode strings it is not working. I searched pgsql-patches@ list and found there this thread: http://archives.postgresql.org/pgsql-patches/2007-06/msg00021.php I wrote Andrew (he didn't answer yet) about whether this patch can help with my problem. P.S.: if this issue is a known bug (as we talked earlier), then how long will it take to fix it? I know little about postgresql development process, maybe you know it little better? -- alexander lunyov lan@startatom.ru
alexander lunyov <lan@startatom.ru> writes: > With this i just wanted to say that lower() doesn't work at all on > russian unicode characters, In that case you're using the wrong locale (ie, not russian unicode). Check "show lc_ctype". Or [ checks back in thread... ] maybe you're using the wrong operating system. Not so long ago FreeBSD didn't have Unicode locale support at all; I'm not sure if 6.2 has that problem but it is worth checking. Does it work for you to do case-insensitive russian comparisons in "grep", for instance? regards, tom lane
Tom Lane wrote: > alexander lunyov <lan@startatom.ru> writes: >> With this i just wanted to say that lower() doesn't work at all on >> russian unicode characters, > > In that case you're using the wrong locale (ie, not russian unicode). > Check "show lc_ctype". db=> SHOW LC_CTYPE; lc_ctype ---------- C (1 запись) db=> SHOW LC_COLLATE; lc_collate ------------ C (1 запись) Where can i change this? Trying to SET this parameters gives error "parameter "lc_collate" cannot be changed" > Or [ checks back in thread... ] maybe you're using the wrong operating > system. Not so long ago FreeBSD didn't have Unicode locale support at > all; I'm not sure if 6.2 has that problem but it is worth checking. > Does it work for you to do case-insensitive russian comparisons in > "grep", for instance? I put to textfile 3 russian strings with different case of first char and grep'ed them all: # cat > textfile Зеленая Зеленодольская зеленая # grep -i зелен * textfile:Зеленая textfile:Зеленодольская textfile:зеленая So i think system is fine about unicode. -- alexander lunyov lan@startatom.ru
alexander, lc_ctype and lc_collate can be changed only at initdb ! You need to read localization chapter http://www.postgresql.org/docs/current/static/charset.html Oleg On Thu, 12 Jul 2007, alexander lunyov wrote: > Tom Lane wrote: >> alexander lunyov <lan@startatom.ru> writes: >>> With this i just wanted to say that lower() doesn't work at all on >>> russian unicode characters, >> >> In that case you're using the wrong locale (ie, not russian unicode). >> Check "show lc_ctype". > > db=> SHOW LC_CTYPE; > lc_ctype > ---------- > C > (1 запись) > > db=> SHOW LC_COLLATE; > lc_collate > ------------ > C > (1 запись) > > Where can i change this? Trying to SET this parameters gives error "parameter > "lc_collate" cannot be changed" > >> Or [ checks back in thread... ] maybe you're using the wrong operating >> system. Not so long ago FreeBSD didn't have Unicode locale support at >> all; I'm not sure if 6.2 has that problem but it is worth checking. >> Does it work for you to do case-insensitive russian comparisons in >> "grep", for instance? > > I put to textfile 3 russian strings with different case of first char and > grep'ed them all: > > # cat > textfile > Зеленая > Зеленодольская > зеленая > # grep -i зелен * > textfile:Зеленая > textfile:Зеленодольская > textfile:зеленая > > So i think system is fine about unicode. > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov wrote: > alexander, > > lc_ctype and lc_collate can be changed only at initdb ! > You need to read localization chapter > http://www.postgresql.org/docs/current/static/charset.html Yes, i knew about this, but i thought maybe somehow it can be changed onthefly. ... (10 minutes later) Yes, now when initdb done with --locale=ru_RU.UTF-8, lower('RussianString') gives me 'russianstring', though, case-insensiive regexp still not working. I guess i'll stick with lower() ~ lower() construction. And thanks everybody who replied! > > > Oleg > On Thu, 12 Jul 2007, alexander lunyov wrote: > >> Tom Lane wrote: >>> alexander lunyov <lan@startatom.ru> writes: >>>> With this i just wanted to say that lower() doesn't work at all on >>>> russian unicode characters, >>> >>> In that case you're using the wrong locale (ie, not russian unicode). >>> Check "show lc_ctype". >> >> db=> SHOW LC_CTYPE; >> lc_ctype >> ---------- >> C >> (1 запись) >> >> db=> SHOW LC_COLLATE; >> lc_collate >> ------------ >> C >> (1 запись) >> >> Where can i change this? Trying to SET this parameters gives error >> "parameter "lc_collate" cannot be changed" >> >>> Or [ checks back in thread... ] maybe you're using the wrong operating >>> system. Not so long ago FreeBSD didn't have Unicode locale support at >>> all; I'm not sure if 6.2 has that problem but it is worth checking. >>> Does it work for you to do case-insensitive russian comparisons in >>> "grep", for instance? >> >> I put to textfile 3 russian strings with different case of first char >> and grep'ed them all: >> >> # cat > textfile >> Зеленая >> Зеленодольская >> зеленая >> # grep -i зелен * >> textfile:Зеленая >> textfile:Зеленодольская >> textfile:зеленая >> >> So i think system is fine about unicode. >> >> > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > Sternberg Astronomical Institute, Moscow University, Russia > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(495)939-16-83, +007(495)939-23-83 -- alexander lunyov lan@startatom.ru