Обсуждение: another locale problem
While playing with --enable-locale and the like expressions I came across this phenomenon ('somestring' is in cyrillic - last 64 codes from 0xa0 to 0xff) SELECT key FROM t WHERE key ~* 'somestring' returns the correct answers, properly matching upper/lower case characters - this with --enable-locale, --with-mb=WIN and commenting out the USE_LOCALE restrictions in gram.y. Explain shows Sequential scan, as expected... However SELECT key FROM t WHERE key ~* '^somestring' returns no tuples and explain says it will use the index on key. Why is this? If 'somestring' is ASCII characters, explain always gives sequential scan and in both cases returns the proper results. I am willing to do some testing if anyone has ideas how to patch the regex code (I hate it :-). Regards, Daniel
> While playing with --enable-locale and the like expressions I came across this > phenomenon ('somestring' is in cyrillic - last 64 codes from 0xa0 to 0xff) > > SELECT key FROM t WHERE key ~* 'somestring' > > returns the correct answers, properly matching upper/lower case characters - > this with --enable-locale, --with-mb=WIN and commenting out the USE_LOCALE > restrictions in gram.y. Explain shows Sequential scan, as expected... > > However > > SELECT key FROM t WHERE key ~* '^somestring' > > returns no tuples and explain says it will use the index on key. Why is this? That's strange. It should be seq scan in this case? > If 'somestring' is ASCII characters, explain always gives sequential scan and > in both cases returns the proper results. > > I am willing to do some testing if anyone has ideas how to patch the regex > code (I hate it :-). Can you test following case: SELECT key FROM t WHERE key ~* '^Xsomestring' where X is one of an ASCII character. --- Tatsuo Ishii
>>>Tatsuo Ishii said:> > SELECT key FROM t WHERE key ~* '^somestring'> > > > returns no tuples and explain says it will usethe index on key. Why is th is?> > That's strange. It should be seq scan in this case? I forgot to mention, that if 'somestring' is all uppercase, everything works (the key field in the table is all uppercase). It still says index scan will be used. To summarize the problem. If key contains (equivalent cyrillic letters) 'ABC', 'ABCD', 'DAB' and 'ABX' and the query is: SELECT key FROM t WHERE key ~* '^AB'; index scan will be used and the correct tuples ('ABC', 'ABCD' and 'ABX') will be returned. If the query is SELECT key FROM t WHERE key ~* '^ab'; index scan will be used and no tuples will be returned. With the query SELECT key FROM t WHERE key ~* 'ab'; sequential scan will be used and the correct tuples will be returned (all of the above). > Can you test following case:> > SELECT key FROM t WHERE key ~* '^Xsomestring'> > where X is one of an ASCII character. Explain says it will use sequential scan and if I insert proper key in the table it will be returned. Daniel
> > > SELECT key FROM t WHERE key ~* '^somestring' > > > > > > returns no tuples and explain says it will use the index on key. Why is th > is? > > > > That's strange. It should be seq scan in this case? > >I forgot to mention, that if 'somestring' is all uppercase, everything works >(the key field in the table is all uppercase). It still says index scan will >be used. Hmm... I am totally confused by this. Sould be a bug. >To summarize the problem. If key contains (equivalent cyrillic letters) 'ABC', >'ABCD', 'DAB' and 'ABX' and the query is: > >SELECT key FROM t WHERE key ~* '^AB'; > >index scan will be used and the correct tuples ('ABC', 'ABCD' and 'ABX') will >be returned. If the query is > >SELECT key FROM t WHERE key ~* '^ab'; > >index scan will be used and no tuples will be returned. With the query Should be bug. >SELECT key FROM t WHERE key ~* 'ab'; > >sequential scan will be used and the correct tuples will be returned (all of >the above). Seems correct result. > > Can you test following case: > > > > SELECT key FROM t WHERE key ~* '^Xsomestring' > > > > where X is one of an ASCII character. > >Explain says it will use sequential scan and if I insert proper key in the >table it will be returned. Expected result. >From line 5388 of parser/gram.y: (strcmp(opname,"~*") == 0 && isalpha(n->val.val.str[pos]))) I suspect isalpha() does not return true if n->val.val.str[pos] is a non ascii char. Probably that's why SELECT key FROM t WHERE key ~* '^somestring' does not work. If its argument is a cyrillic char and cyrillic locale enabled then isalpha() should return true. Can you check this? Or: (strcmp(opname,"~*") == 0 && isalpha((unsigned char)n->val.val.str[pos]))) works for you? -- Tatsuo Ishii
>>>Tatsuo Ishii said: [...]> >From line 5388 of parser/gram.y:> (strcmp(opname,"~*") == 0 && is alpha(n->val.val.str[pos])))>> I suspect isalpha() does not return true if n->val.val.str[pos] is a> non ascii char. Probablythat's why [...] I have tried the following program under BSD/OS 4.0.1: --- cut here --- #include <string.h> #include <locale.h> #include <rune.h> #include <errno.h> main() { int i; printf ("%s\n", setlocale(LC_CTYPE, "CP1251")); printf ("--- strcoll ---\n"); printf ("%d\n", strcoll("àáâ", "àáã")); printf ("%d\n", strcoll("àáâ", "àáâ")); printf ("%d\n", strcoll("àáâ", "àáà")); printf ("--- isalpha ---\n"); i = 'ë'; printf ("%c, %d\n", i, i); printf ("%d\n", isalpha('ë')); printf ("%d\n", isalpha('a')); } --- cut here --- Where the cyrillic letter is the 'L' lowercase in cyrillic, the other are first characters from the alphabet.... The strcoll work always. This is because BSD/OS does NOT support the LC_COLLATE and strcoll is effectively strcmp (confirmed by the libc sources). There are two cases: 1. cc -o test test.c Returns 0 for isalpha() on cyrillic characers. 2. cc -funsigned-char -o test test.c Returns 1 for isalpha() on cyrillic characters! If I substitute the 'ë' character constant with it's code (235), isalpha() returns 1. BSD/OS 4.0.1 uses gcc 2.7.2.1. Apparently somewhere in the code 'char' is used (I did use both --enable-locale and --with-mb=WIN). I am still searching for it, but the regex code is not very readable... :-( Daniel
Ok. Now I almost certain that: (strcmp(opname,"~*") == 0 && isalpha((unsigned char)n->val.val.str[pos]))) should work for you. Can you confirm this? -- Tatsuo Ishii
On Fri, 11 Jun 1999, Tatsuo Ishii wrote: > > Ok. Now I almost certain that: > > (strcmp(opname,"~*") > == 0 && isalpha((unsigned char)n->val.val.str[pos]))) > > should work for you. Can you confirm this? > -- > Tatsuo Ishii > It did for me , BSD/OS is veeeery finicky with the way it treats chars. It upgrades them to ints and from then on isalpha isupper do not work. I have tried with --unsigned-char gcc switch the result was the same, only with a cast to (unsigned char) do all the char related functions work as advertised. I am sorry to say I have not had any input from other internationals using BSDI with a non-english character set. -- Incredible Networks LTD Angelos Karageorgiou 20 Karea st, +30.1.92.12.312 (voice) 116 36 Athens, Greece. +30.1.92.12.314 (fax) http://www.incredible.com angelos@incredible.com (e-mail)
This worked, however it made all selects of ~* '^sometext' sequential. Why can't we use index searches in this case? I believe it is :-) But this SELECT key FROM t WHERE key ~* '^ sometext'; (note the space after the carret) still does not work! It says index scan will be used and finds nothng. If 'sometext' is all uppercase the proper result is returned. The problem seems to be in the regex handling anyway. I tried compiling regcomp.c with -funsigned-char, but that was not enough. Further hacking... Daniel >>>Tatsuo Ishii said:> Ok. Now I almost certain that:> > (strcmp(opname,"~*")> == 0 && isalpha((unsignedchar)n->val.val.str[pos])))> > should work for you. Can you confirm this?> --> Tatsuo Ishii
This is not because of BSD/OS, but because of gcc. I found some 'char' references in the regex routines and am working on that currently. I can confirm that other software on BSD/OS works correctly with locales, such as vi, less etc. Apparently we need unsigned char in order to do anything sensible with char values over 127. Daniel >>>Angelos Karageorgiou said: > > (strcmp(opname,"~*")> > == 0 && isalpha((unsigned char)n->val.val.str[pos])))> > > > should workfor you. Can you confirm this?> > --> > Tatsuo Ishii> > It did for me , BSD/OS is veeeery finicky with the way it treatschars. It > upgrades them to ints and from then on isalpha isupper do not work.> I have tried with --unsigned-chargcc switch the result was the same, only> with a cast to (unsigned char) do all the char related functionswork> as advertised.> > I am sorry to say I have not had any input from other internationals using> BSDI with anon-english character set.> > > -- > Incredible Networks LTD Angelos Karageorgiou> 20 Karea st, +30.1.92.12.312(voice)> 116 36 Athens, Greece. +30.1.92.12.314 (fax)> http://www.incredible.com angelos@incredible.com(e-mail)>
>This worked, however it made all selects of ~* '^sometext' sequential. That's correct behavior. >Why can't we use index searches in this case? I believe it is :-) No. It's due to the nature of the Btree index. >But this > >SELECT key FROM t WHERE key ~* '^ sometext'; > >(note the space after the carret) still does not work! It says index scan will >be used and finds nothng. If 'sometext' is all uppercase the proper result is >returned. The problem seems to be in the regex handling anyway. I tried >compiling regcomp.c with -funsigned-char, but that was not enough. Further >hacking... Again, that should not be index scan. BTW, if you want to play with regex, you might find retest.c be useful. You can build the test tool by: make retest enjoy:-) -- Tatsuo Ishii
>>>Tatsuo Ishii said:> >This worked, however it made all selects of ~* '^sometext' sequential.> > That's correct behavior.>> >Why can't we use index searches in this case? I believe it is :-)> > No. It's due to the nature of the Btreeindex. But why it did use index scan when the case of the field and the expression matched and did find the correct results? > >But this> >> >SELECT key FROM t WHERE key ~* '^ sometext';> > Again, that should not be index scan. SELECT key FROM t WHERE key ~* '^ '; (space only) uses index and works correctly.... weird! > BTW, if you want to play with regex, you might find retest.c be> useful. You can build the test tool by:> > make retest>> enjoy:-) In a wild guess I changed all 'char' to 'unsigned char' in regcomp.c, with no positive or negative results. Maybe there are other places in regex where this should be done, such as regexec.c? Is there an regex guru over here ... :-) My long-time wondering about regex in Postgres has always been - isn't there something better than this old regex code that we can use? Daniel
Daniel Kalchev <daniel@digsys.bg> writes: > To summarize the problem. If key contains (equivalent cyrillic > letters) 'ABC', 'ABCD', 'DAB' and 'ABX' and the query is: > SELECT key FROM t WHERE key ~* '^AB'; > index scan will be used and the correct tuples ('ABC', 'ABCD' and > 'ABX') will be returned. If the query is > SELECT key FROM t WHERE key ~* '^ab'; > index scan will be used and no tuples will be returned. Hm. Is it possible that isalpha() is doing the wrong thing on your machine? makeIndexable() currently assumes that isalpha() returns true for any character that is subject to case conversion, but I wonder whether that's a good enough test. The other possibility is that regexp's internal handling of case-insensitive matching is not right. regards, tom lane
> >>>Tatsuo Ishii said: > > > SELECT key FROM t WHERE key ~* '^somestring' > > > > > > returns no tuples and explain says it will use the index on key. Why is th > is? > > > > That's strange. It should be seq scan in this case? > > I forgot to mention, that if 'somestring' is all uppercase, everything works > (the key field in the table is all uppercase). It still says index scan will > be used. > > To summarize the problem. If key contains (equivalent cyrillic letters) 'ABC', > 'ABCD', 'DAB' and 'ABX' and the query is: > > SELECT key FROM t WHERE key ~* '^AB'; Oops, forgot ~* is case-insensitive. The conditions we adde for this require the query to be between > A and < a, which is not very restrictive. No real better way to do this. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
>>>Tom Lane said:> Daniel Kalchev <daniel@digsys.bg> writes:> > To summarize the problem. If key contains (equivalent cyrillic>> letters) 'ABC', 'ABCD', 'DAB' and 'ABX' and the query is:> > > SELECT key FROM t WHERE key ~* '^AB';> > > indexscan will be used and the correct tuples ('ABC', 'ABCD' and> > 'ABX') will be returned. If the query is> > > SELECTkey FROM t WHERE key ~* '^ab';> > > index scan will be used and no tuples will be returned.> > Hm. Is it possiblethat isalpha() is doing the wrong thing on your> machine? makeIndexable() currently assumes that isalpha() returnstrue> for any character that is subject to case conversion, but I wonder> whether that's a good enough test. In fact, after giving it some though... the expression in gram.y (strcmp(opname,"~*") == 0 && isalpha(n->val.val.str[pos]))) is wrong. The statement in my view decides that a regular expression is not indexable if it contains special characters or if it contains non-alpha characters. Therefore, the statement should be written as: (strcmp(opname,"~*") == 0 && !isalpha((unsigned char)n->val.val.str[pos]))) (two fixes :) This makes indexes work for '^abc' (lowercase ASCII). But does not find anything, which means regex does not work. It does not work for both ASCII and non-ASCII text/patterns. :-( > The other possibility is that regexp's internal handling of> case-insensitive matching is not right. I believe it to be terribly wrong, and some releases ago it worked with 8-bit characters by just compiling it with -funsigned-char. Now this breaks things... Daniel
Daniel Kalchev <daniel@digsys.bg> writes: > In fact, after giving it some though... the expression in gram.y > (strcmp(opname,"~*") > == 0 && isalpha(n->val.val.str[pos]))) > is wrong. The statement in my view decides that a regular expression is not > indexable if it contains special characters or if it contains non-alpha > characters. Therefore, the statement should be written as: > (strcmp(opname,"~*") > == 0 && !isalpha((unsigned char)n->val.val.str[pos]))) No, it's not wrong, at least not in that way! You've missed the point entirely. ~* is the *case insensitive* regexp match operator. Therefore if I have a pattern like '^abc' it can match anything beginning with either 'a' or 'A'. If the index restriction were to include the letter 'a' then it would exclude valid matches starting with 'A'. The simplest solution, which is what's in makeIndexable(), is to exclude case-foldable characters from the index restriction pattern. In this particular case you end up getting no index restriction at all, but that is indeed what's supposed to happen. I am not sure that isalpha() is an adequate test for case-foldable characters in non-ASCII locales, but inverting it is definitely wrong ;-) regards, tom lane
Tom, So you say that this check prevents the use of indexes, when we use the ~* operator and have alpha characters in the pattern, because apparently the index cannot do case insensitive matching. I was under the (apparently wrong impression) that it was possible to use indexes for case insensitive matching. Dreaming... :-) For your information, isalpha() is the correct match for case foldable characters, at least in the cp1251 (windows-1251) locale. I believe a more correct test could be to access the locale's MAPLOWER and MAPUPPER tables. It is not the case in Bulgarian, but there might be languages where an letter does not exist in both upper and lower cases and therefore requires more complex handling. Perhaps such situation exists in the multibyte locales. Please excuse my confusion. :-) Daniel >>>Tom Lane said:> Daniel Kalchev <daniel@digsys.bg> writes:> > In fact, after giving it some though... the expression ingram.y> > > (strcmp(opname,"~*")> > == 0 && isalpha(n->val.val.str[pos])))> > > is wrong. The statementin my view decides that a regular expression is no t > > indexable if it contains special characters or if itcontains non-alpha > > characters. Therefore, the statement should be written as:> > > (strcmp(opname,"~*")>> == 0 && !isalpha((unsigned char)n->val.val.str[pos])))> > No, it's not wrong, at least not in thatway! You've missed the point> entirely. ~* is the *case insensitive* regexp match operator.> Therefore if I have apattern like '^abc' it can match anything> beginning with either 'a' or 'A'. If the index restriction were to> includethe letter 'a' then it would exclude valid matches starting with> 'A'. The simplest solution, which is what's inmakeIndexable(), is> to exclude case-foldable characters from the index restriction pattern.> In this particular case youend up getting no index restriction at all,> but that is indeed what's supposed to happen.> > I am not sure that isalpha()is an adequate test for case-foldable> characters in non-ASCII locales, but inverting it is definitely wrong ;-)>> regards, tom lane