Обсуждение: another locale problem

Поиск
Список
Период
Сортировка

another locale problem

От
Daniel Kalchev
Дата:
While playing with --enable-locale and the like expressions I came across this 
phenomenon ('somestring' is in cyrillic - last 64 codes from 0xa0 to 0xff)

SELECT key FROM t WHERE key ~* 'somestring'

returns the correct answers, properly matching upper/lower case characters - 
this with --enable-locale, --with-mb=WIN and commenting out the USE_LOCALE 
restrictions in gram.y. Explain shows Sequential scan, as expected...

However

SELECT key FROM t WHERE key ~* '^somestring'

returns no tuples and explain says it will use the index on key. Why is this?

If 'somestring' is ASCII characters, explain always gives sequential scan and 
in both cases returns the proper results.

I am willing to do some testing if anyone has ideas how to patch the regex 
code (I hate it :-).

Regards,
Daniel



Re: [HACKERS] another locale problem

От
Tatsuo Ishii
Дата:
> While playing with --enable-locale and the like expressions I came across this 
> phenomenon ('somestring' is in cyrillic - last 64 codes from 0xa0 to 0xff)
> 
> SELECT key FROM t WHERE key ~* 'somestring'
> 
> returns the correct answers, properly matching upper/lower case characters - 
> this with --enable-locale, --with-mb=WIN and commenting out the USE_LOCALE 
> restrictions in gram.y. Explain shows Sequential scan, as expected...
> 
> However
> 
> SELECT key FROM t WHERE key ~* '^somestring'
> 
> returns no tuples and explain says it will use the index on key. Why is this?

That's strange. It should be seq scan in this case?

> If 'somestring' is ASCII characters, explain always gives sequential scan and 
> in both cases returns the proper results.
> 
> I am willing to do some testing if anyone has ideas how to patch the regex 
> code (I hate it :-).

Can you test following case:

SELECT key FROM t WHERE key ~* '^Xsomestring'

where X is one of an ASCII character.
---
Tatsuo Ishii


Re: [HACKERS] another locale problem

От
Daniel Kalchev
Дата:
>>>Tatsuo Ishii said:> > SELECT key FROM t WHERE key ~* '^somestring'> > > > returns no tuples and explain says it will
usethe index on key. Why is th    is?> > That's strange. It should be seq scan in this case?
 

I forgot to mention, that if 'somestring' is all uppercase, everything works 
(the key field in the table is all uppercase). It still says index scan will 
be used.

To summarize the problem. If key contains (equivalent cyrillic letters) 'ABC', 
'ABCD', 'DAB' and 'ABX' and the query is:

SELECT key FROM t WHERE key ~* '^AB';

index scan will be used and the correct tuples ('ABC', 'ABCD' and 'ABX') will 
be returned. If the query is

SELECT key FROM t WHERE key ~* '^ab';

index scan will be used and no tuples will be returned. With the query

SELECT key FROM t WHERE key ~* 'ab';

sequential scan will be used and the correct tuples will be returned (all of 
the above).
> Can you test following case:> > SELECT key FROM t WHERE key ~* '^Xsomestring'> > where X is one of an ASCII
character.

Explain says it will use sequential scan and if I insert proper key in the 
table it will be returned.

Daniel



Re: [HACKERS] another locale problem

От
Tatsuo Ishii
Дата:
> > > SELECT key FROM t WHERE key ~* '^somestring'
> > > 
> > > returns no tuples and explain says it will use the index on key. Why is th
>     is?
> > 
> > That's strange. It should be seq scan in this case?
>
>I forgot to mention, that if 'somestring' is all uppercase, everything works 
>(the key field in the table is all uppercase). It still says index scan will 
>be used.

Hmm... I am totally confused by this. Sould be a bug.

>To summarize the problem. If key contains (equivalent cyrillic letters) 'ABC', 
>'ABCD', 'DAB' and 'ABX' and the query is:
>
>SELECT key FROM t WHERE key ~* '^AB';
>
>index scan will be used and the correct tuples ('ABC', 'ABCD' and 'ABX') will 
>be returned. If the query is
>
>SELECT key FROM t WHERE key ~* '^ab';
>
>index scan will be used and no tuples will be returned. With the query

Should be bug.

>SELECT key FROM t WHERE key ~* 'ab';
>
>sequential scan will be used and the correct tuples will be returned (all of 
>the above).

Seems correct result.

> > Can you test following case:
> > 
> > SELECT key FROM t WHERE key ~* '^Xsomestring'
> > 
> > where X is one of an ASCII character.
>
>Explain says it will use sequential scan and if I insert proper key in the 
>table it will be returned.

Expected result.

>From line 5388 of parser/gram.y:                    (strcmp(opname,"~*") == 0 && isalpha(n->val.val.str[pos])))

I suspect isalpha() does not return true if n->val.val.str[pos] is a
non ascii char. Probably that's why 
SELECT key FROM t WHERE key ~* '^somestring'

does not work.

If its argument is a cyrillic char and cyrillic locale enabled then
isalpha() should return true. Can you check this?

Or:                    (strcmp(opname,"~*")
== 0 && isalpha((unsigned char)n->val.val.str[pos])))

works for you?
--
Tatsuo Ishii


Re: [HACKERS] another locale problem

От
Daniel Kalchev
Дата:
>>>Tatsuo Ishii said:
[...]> >From line 5388 of parser/gram.y:>                         (strcmp(opname,"~*") == 0 && is
alpha(n->val.val.str[pos])))>> I suspect isalpha() does not return true if n->val.val.str[pos] is a> non ascii char.
Probablythat's why 
 
[...]

I have tried the following program under BSD/OS 4.0.1:

--- cut here ---
#include <string.h>
#include <locale.h>
#include <rune.h>
#include <errno.h>


main()
{    int i;
    printf ("%s\n", setlocale(LC_CTYPE, "CP1251"));    printf ("--- strcoll ---\n");    printf ("%d\n", strcoll("àáâ",
"àáã"));   printf ("%d\n", strcoll("àáâ", "àáâ"));    printf ("%d\n", strcoll("àáâ", "àáà"));    printf ("--- isalpha
---\n");   i = 'ë';    printf ("%c, %d\n", i, i);    printf ("%d\n", isalpha('ë'));    printf ("%d\n", isalpha('a'));
 
}
--- cut here ---

Where the cyrillic letter is the 'L' lowercase in cyrillic, the other are 
first characters from the alphabet....

The strcoll work always. This is because BSD/OS does NOT support the 
LC_COLLATE and strcoll is effectively strcmp (confirmed by the libc sources).

There are two cases:

1. cc -o test test.c

Returns 0 for isalpha() on cyrillic characers.

2. cc -funsigned-char -o test test.c

Returns 1 for isalpha() on cyrillic characters!

If I substitute the 'ë' character constant with it's code (235), isalpha() 
returns 1.

BSD/OS 4.0.1 uses gcc 2.7.2.1. Apparently somewhere in the code 'char' is used 
(I did use both --enable-locale and --with-mb=WIN). I am still searching for 
it, but the regex code is not very readable... :-(

Daniel



Re: [HACKERS] another locale problem

От
Tatsuo Ishii
Дата:
Ok. Now I almost certain that:
                    (strcmp(opname,"~*")
== 0 && isalpha((unsigned char)n->val.val.str[pos])))

should work for you. Can you confirm this?
--
Tatsuo Ishii


Re: [HACKERS] another locale problem

От
Angelos Karageorgiou
Дата:
On Fri, 11 Jun 1999, Tatsuo Ishii wrote:

> 
> Ok. Now I almost certain that:
> 
>                         (strcmp(opname,"~*")
> == 0 && isalpha((unsigned char)n->val.val.str[pos])))
> 
> should work for you. Can you confirm this?
> --
> Tatsuo Ishii
> 


It did for me , BSD/OS is veeeery finicky with the way it treats chars. It 
upgrades them to ints and from then on isalpha isupper do not work.
I have tried with --unsigned-char gcc switch the result was the same, only
with a cast to (unsigned char) do all the char related functions work
as advertised.

I am sorry to say I have not had any input from other internationals using
BSDI with a non-english character set.


-- 
Incredible Networks LTD       Angelos Karageorgiou
20 Karea st,                  +30.1.92.12.312 (voice)
116 36 Athens, Greece.        +30.1.92.12.314 (fax)
http://www.incredible.com     angelos@incredible.com (e-mail)



Re: [HACKERS] another locale problem

От
Daniel Kalchev
Дата:
This worked, however it made all selects of ~* '^sometext' sequential.

Why can't we use index searches in this case? I believe it is :-)

But this

SELECT key FROM t WHERE key ~* '^ sometext';

(note the space after the carret) still does not work! It says index scan will 
be used and finds nothng. If 'sometext' is all uppercase the proper result is 
returned. The problem seems to be in the regex handling anyway. I tried 
compiling regcomp.c with -funsigned-char, but that was not enough. Further 
hacking...

Daniel

>>>Tatsuo Ishii said:> Ok. Now I almost certain that:> >                         (strcmp(opname,"~*")> == 0 &&
isalpha((unsignedchar)n->val.val.str[pos])))> > should work for you. Can you confirm this?> --> Tatsuo Ishii
 




Re: [HACKERS] another locale problem

От
Daniel Kalchev
Дата:
This is not because of BSD/OS, but because of gcc. I found some 'char' 
references in the regex routines and am working on that currently.

I can confirm that other software on BSD/OS works correctly with locales, such 
as vi, less etc. Apparently we need unsigned char in order to do anything 
sensible with char values over 127.

Daniel

>>>Angelos Karageorgiou said:
> >                         (strcmp(opname,"~*")> > == 0 && isalpha((unsigned char)n->val.val.str[pos])))> > > > should
workfor you. Can you confirm this?> > --> > Tatsuo Ishii> > It did for me , BSD/OS is veeeery finicky with the way it
treatschars. It > upgrades them to ints and from then on isalpha isupper do not work.> I have tried with
--unsigned-chargcc switch the result was the same, only> with a cast to (unsigned char) do all the char related
functionswork> as advertised.> > I am sorry to say I have not had any input from other internationals using> BSDI with
anon-english character set.> > > -- > Incredible Networks LTD       Angelos Karageorgiou> 20 Karea st,
+30.1.92.12.312(voice)> 116 36 Athens, Greece.        +30.1.92.12.314 (fax)> http://www.incredible.com
angelos@incredible.com(e-mail)> 
 




Re: [HACKERS] another locale problem

От
Tatsuo Ishii
Дата:
>This worked, however it made all selects of ~* '^sometext' sequential.

That's correct behavior.

>Why can't we use index searches in this case? I believe it is :-)

No. It's due to the nature of the  Btree index.

>But this
>
>SELECT key FROM t WHERE key ~* '^ sometext';
>
>(note the space after the carret) still does not work! It says index scan will 
>be used and finds nothng. If 'sometext' is all uppercase the proper result is 
>returned. The problem seems to be in the regex handling anyway. I tried 
>compiling regcomp.c with -funsigned-char, but that was not enough. Further 
>hacking...

Again, that should not be index scan.

BTW, if you want to play with regex, you might find retest.c be
useful. You can build the test tool by:

make retest

enjoy:-)
--
Tatsuo Ishii


Re: [HACKERS] another locale problem

От
Daniel Kalchev
Дата:
>>>Tatsuo Ishii said:> >This worked, however it made all selects of ~* '^sometext' sequential.> > That's correct
behavior.>> >Why can't we use index searches in this case? I believe it is :-)> > No. It's due to the nature of the
Btreeindex.
 

But why it did use index scan when the case of the field and the expression 
matched and did find the correct results?
> >But this> >> >SELECT key FROM t WHERE key ~* '^ sometext';> > Again, that should not be index scan.

SELECT key FROM t WHERE key ~* '^ ';

(space only) uses index and works correctly.... weird!
> BTW, if you want to play with regex, you might find retest.c be> useful. You can build the test tool by:> > make
retest>> enjoy:-)
 


In a wild guess I changed all 'char' to 'unsigned char' in regcomp.c, with no 
positive or negative results. Maybe there are other places in regex where this 
should be done, such as regexec.c? Is there an regex guru over here ... :-)

My long-time wondering about regex in Postgres has always been - isn't there 
something better than this old regex code that we can use?

Daniel



Re: [HACKERS] another locale problem

От
Tom Lane
Дата:
Daniel Kalchev <daniel@digsys.bg> writes:
> To summarize the problem. If key contains (equivalent cyrillic
> letters) 'ABC', 'ABCD', 'DAB' and 'ABX' and the query is:

> SELECT key FROM t WHERE key ~* '^AB';

> index scan will be used and the correct tuples ('ABC', 'ABCD' and
> 'ABX') will be returned. If the query is

> SELECT key FROM t WHERE key ~* '^ab';

> index scan will be used and no tuples will be returned.

Hm.  Is it possible that isalpha() is doing the wrong thing on your
machine?  makeIndexable() currently assumes that isalpha() returns true
for any character that is subject to case conversion, but I wonder
whether that's a good enough test.

The other possibility is that regexp's internal handling of
case-insensitive matching is not right.
        regards, tom lane


Re: [HACKERS] another locale problem

От
Bruce Momjian
Дата:
> >>>Tatsuo Ishii said:
>  > > SELECT key FROM t WHERE key ~* '^somestring'
>  > > 
>  > > returns no tuples and explain says it will use the index on key. Why is th
>      is?
>  > 
>  > That's strange. It should be seq scan in this case?
> 
> I forgot to mention, that if 'somestring' is all uppercase, everything works 
> (the key field in the table is all uppercase). It still says index scan will 
> be used.
> 
> To summarize the problem. If key contains (equivalent cyrillic letters) 'ABC', 
> 'ABCD', 'DAB' and 'ABX' and the query is:
> 
> SELECT key FROM t WHERE key ~* '^AB';

Oops, forgot ~* is case-insensitive.  The conditions we adde for this
require the query to be between > A and < a, which is not very
restrictive.

No real better way to do this.


--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] another locale problem

От
Daniel Kalchev
Дата:
>>>Tom Lane said:> Daniel Kalchev <daniel@digsys.bg> writes:> > To summarize the problem. If key contains (equivalent
cyrillic>> letters) 'ABC', 'ABCD', 'DAB' and 'ABX' and the query is:> > > SELECT key FROM t WHERE key ~* '^AB';> > >
indexscan will be used and the correct tuples ('ABC', 'ABCD' and> > 'ABX') will be returned. If the query is> > >
SELECTkey FROM t WHERE key ~* '^ab';> > > index scan will be used and no tuples will be returned.> > Hm.  Is it
possiblethat isalpha() is doing the wrong thing on your> machine?  makeIndexable() currently assumes that isalpha()
returnstrue> for any character that is subject to case conversion, but I wonder> whether that's a good enough test.
 

In fact, after giving it some though... the expression in gram.y

                    (strcmp(opname,"~*")
== 0 && isalpha(n->val.val.str[pos])))

is wrong. The statement in my view decides that a regular expression is not 
indexable if it contains special characters or if it contains non-alpha 
characters. Therefore, the statement should be written as:
                    (strcmp(opname,"~*")
== 0 && !isalpha((unsigned char)n->val.val.str[pos])))

(two fixes :) This makes indexes work for '^abc' (lowercase ASCII). But does 
not find anything, which means regex does not work. It does not work for both 
ASCII and non-ASCII text/patterns. :-(
> The other possibility is that regexp's internal handling of> case-insensitive matching is not right.

I believe it to be terribly wrong, and some releases ago it worked with 8-bit 
characters by just compiling it with -funsigned-char. Now this breaks things...

Daniel



Re: [HACKERS] another locale problem

От
Tom Lane
Дата:
Daniel Kalchev <daniel@digsys.bg> writes:
> In fact, after giving it some though... the expression in gram.y

>                         (strcmp(opname,"~*")
> == 0 && isalpha(n->val.val.str[pos])))

> is wrong. The statement in my view decides that a regular expression is not 
> indexable if it contains special characters or if it contains non-alpha 
> characters. Therefore, the statement should be written as:

>                         (strcmp(opname,"~*")
> == 0 && !isalpha((unsigned char)n->val.val.str[pos])))

No, it's not wrong, at least not in that way!  You've missed the point
entirely.  ~* is the *case insensitive* regexp match operator.
Therefore if I have a pattern like '^abc' it can match anything
beginning with either 'a' or 'A'.  If the index restriction were to
include the letter 'a' then it would exclude valid matches starting with
'A'.  The simplest solution, which is what's in makeIndexable(), is
to exclude case-foldable characters from the index restriction pattern.
In this particular case you end up getting no index restriction at all,
but that is indeed what's supposed to happen.

I am not sure that isalpha() is an adequate test for case-foldable
characters in non-ASCII locales, but inverting it is definitely wrong ;-)
        regards, tom lane


Re: [HACKERS] another locale problem

От
Daniel Kalchev
Дата:
Tom,

So you say that this check prevents the use of indexes, when we use the ~* 
operator and have alpha characters in the pattern, because apparently the 
index cannot do case insensitive matching.

I was under the (apparently wrong impression) that it was possible to use 
indexes for case insensitive matching. Dreaming... :-)

For your information, isalpha() is the correct match for case foldable 
characters, at least in the cp1251 (windows-1251) locale. I believe a more 
correct test could be to access the locale's MAPLOWER and MAPUPPER tables.

It is not the case in Bulgarian, but there might be languages where an letter 
does not exist in both upper and lower cases and therefore requires more 
complex handling. Perhaps such situation exists in the multibyte locales.

Please excuse my confusion. :-)

Daniel

>>>Tom Lane said:> Daniel Kalchev <daniel@digsys.bg> writes:> > In fact, after giving it some though... the expression
ingram.y> > >                         (strcmp(opname,"~*")> > == 0 && isalpha(n->val.val.str[pos])))> > > is wrong. The
statementin my view decides that a regular expression is no    t > > indexable if it contains special characters or if
itcontains non-alpha > > characters. Therefore, the statement should be written as:> > >
(strcmp(opname,"~*")>> == 0 && !isalpha((unsigned char)n->val.val.str[pos])))> > No, it's not wrong, at least not in
thatway!  You've missed the point> entirely.  ~* is the *case insensitive* regexp match operator.> Therefore if I have
apattern like '^abc' it can match anything> beginning with either 'a' or 'A'.  If the index restriction were to>
includethe letter 'a' then it would exclude valid matches starting with> 'A'.  The simplest solution, which is what's
inmakeIndexable(), is> to exclude case-foldable characters from the index restriction pattern.> In this particular case
youend up getting no index restriction at all,> but that is indeed what's supposed to happen.> > I am not sure that
isalpha()is an adequate test for case-foldable> characters in non-ASCII locales, but inverting it is definitely wrong
;-)>>             regards, tom lane