Thread: Like vs '=' bug with indexing
I am reposting this because I'm not sure it actually made it to the list.

I have a function to transform text into a pseudo-metaphone variable. Take this example:

cddbsql=# select song, metatext(song) from cdsongs where metatext(song) like metatext('born to run') limit 3 ;
    song     | metatext
-------------+----------
 Born To Run | brntorn
 Born To Run | brntorn
 Born To Run | brntorn
(3 rows)

Here is the problem: Depending on whether there is an index or not, 'like' behaves differently. Here is a transcript:

cddbsql=# select song, metatext(song) from cdsongs where metatext(song) like metatext('born to run') limit 1 ;
    song     | metatext
-------------+----------
 Born To Run | brntorn
(1 row)

cddbsql=# create index cdsongs_meta_song on cdsongs (metatext(song)) ;
CREATE
cddbsql=# select song, metatext(song) from cdsongs where metatext(song) like metatext('born to run') limit 1 ;
 song | metatext
------+----------
(0 rows)

This happens in both 7.0 and 7.1.
m w <mttf2000@yahoo.com> writes:
> Here is the problem: Depending on whether there is an
> index or not, 'like' behaves differently.

Please provide a complete, self-contained example with which we can
reproduce the problem.

			regards, tom lane
--- Tom Lane <tgl@sss.pgh.pa.us> wrote:
> m w <mttf2000@yahoo.com> writes:
> > Here is the problem: Depending on whether there is
> > an index or not, 'like' behaves differently.
>
> Please provide a complete, self-contained example
> with which we can reproduce the problem.

I am trying to create a test function and some data that reproduces the problem easily, but I think I know what it is. It is a two-bug issue.

I had a bug in my code: I appended a "\0" to the end of my strings when I converted from a C string to a postgres "text" object.

I think this exposes a bug in postgres where either index or table scans (I'm not sure which just yet) treat a zero differently than a non-zero in a varchar. It looks as if one section of code is using the zero to terminate a string and another section is not, so at some point data which should be the same differs either in length, or in the trailing zero compared to an uninitialized byte.

Removing the terminating zero from the postgres string fixes the problem, but, if I understand postgres well enough, this should not make a difference, and should be reported as a bug anyway.
m w <mttf2000@yahoo.com> writes:
> I think this exposes a bug in postgres where either
> index or table scans (I'm not sure which just yet)
> treat a zero differently than a non-zero in a varchar.

Embedded zeroes aren't supported in char/varchar/text fields, and cannot
be supported in a portable fashion, since these datatypes rely on
functions like strcoll() that don't allow embedded nulls in strings.
It wouldn't surprise me too much if there are inconsistent behaviors
between indexscans and seqscans for such invalid data.

It doesn't seem real practical for us to examine the output of every
C-coded function to make sure it produces a valid value of the datatype.
Illegal returned values are a fault of the function, and ensuing
misbehaviors are still its fault ...

			regards, tom lane
> Embedded zeroes aren't supported in char/varchar/text fields,
> and cannot be supported in a portable fashion, since these datatypes
> rely on functions like strcoll() that don't allow embedded nulls in
> strings.  It wouldn't surprise me too much if there are inconsistent
> behaviors between indexscans and seqscans for such invalid data.
>
> It doesn't seem real practical for us to examine the output of every
> C-coded function to make sure it produces a valid value of the datatype.
> Illegal returned values are a fault of the function, and ensuing
> misbehaviors are still its fault ...

Fair enough, but I think it should raise a caution flag when two different behaviors can be seen with the same query. This may have a common cause with other index vs non-index behavior.
Tom Lane wrote:
>
> m w <mttf2000@yahoo.com> writes:
> > I think this exposes a bug in postgres where either
> > index or table scans (I'm not sure which just yet)
> > treat a zero differently than a non-zero in a varchar.
>
> Embedded zeroes aren't supported in char/varchar/text fields,
> and cannot be supported in a portable fashion, since these datatypes
> rely on functions like strcoll() that don't allow embedded nulls in
> strings.

Is there no simple (i.e. cheap) way to disallow \0 altogether for these types then? Perhaps just strip them out in textin() (or is it text_in())?

> It wouldn't surprise me too much if there are inconsistent
> behaviors between indexscans and seqscans for such invalid data.

Shouldn't they both use the _same_ strcoll() and friends?

> It doesn't seem real practical for us to examine the output of every
> C-coded function to make sure it produces a valid value of the datatype.
> Illegal returned values are a fault of the function, and ensuing
> misbehaviors are still its fault ...

Should we not examine "the _possible_ outputs of every C-coded function to make sure it produces a valid value of the datatype" ;)

For me, producing invalid data for a datatype seems very much like a bug, and it _should_ be reported.

-------------
Hannu
--- Hannu Krosing <hannu@tm.ee> wrote:
> Should we not examine "the _possible_ outputs of
> every C-coded function to make sure it produces a
> valid value of the datatype" ;)
>
> For me, producing invalid data for a datatype
> seems very much like a bug, and it _should_ be
> reported.

No, I think Tom is right: there should be no validation on C functions incorporated into Postgres by users. Who wants that overhead in a production system?

However, when the same SQL query produces different results once you add an index, that speaks of an inconsistency in the system, which could be the source of other problems. I have seen a couple of posts where results from an index scan were not the same as the results from a table scan. Granted, those were language issues, but still, my gut tells me that if I set the length of a variable to x and a trailing zero is included, the system should either fail consistently or work consistently. I don't care which; it should just be consistent.

Inconsistent behavior indicates that a different matching algorithm is used if one uses an index instead of a table scan. That scares me.
Hannu Krosing <hannu@tm.ee> writes:
> Is there no simple (i.e. cheap) way to disallow \0 altogether for
> these types then?
> Perhaps just strip them out in textin() (or is it text_in())?

They *are* stripped out in textin(), by virtue of the fact that textin
expects to see a null-terminated input string.

>> It wouldn't surprise me too much if there are inconsistent
>> behaviors between indexscans and seqscans for such invalid data.

> Shouldn't they both use the _same_ strcoll() and friends?

Irrelevant; the issue is that the various comparison operators may
produce inconsistent results given invalid input. For instance,
texteq() short-circuits to a FALSE result if the lengths of the inputs
are different, which means that 'ab\0' = 'ab' will produce false, even
though a strcoll-based comparison will claim they are equal. I don't
think that means texteq() is wrong to check the lengths first.

> Should we not examine "the _possible_ outputs of every C-coded function
> to make sure it produces a valid value of the datatype" ;)

Go for it.

Possibly chr() should reject chr(0) ...

			regards, tom lane
m w <mttf2000@yahoo.com> writes:
> Inconsistent behavior indicates that a different
> matching algorithm is used if one uses an index
> instead of a table scan. That scares me.

A seq scan and an index scan are inherently different algorithms, so I
don't see exactly how you think we can avoid this risk. In particular,
if you are dealing with a btree index and a "WHERE column = constant"
query, then a seq scan is only going to be concerned with the behavior
of the '=' operator --- does it return TRUE or not for any particular
row? But an index search is inherently going to make ordered
comparisons (<, =, >). So there is always a potential for inconsistent
behavior if the ordering operators produce results that are
inconsistent with simple '='. We cannot design that away --- all we can
do is fix such bugs when one is discovered in a particular datatype.

			regards, tom lane
Tom Lane wrote:
>
> Hannu Krosing <hannu@tm.ee> writes:
> > Is there no simple (i.e. cheap) way to disallow \0 altogether for
> > these types then?
> > Perhaps just strip them out in textin() (or is it text_in())?
>
> They *are* stripped out in textin(), by virtue of the fact that
> textin expects to see a null-terminated input string.

Ok, I was mistaken to think that pg_trigger.tgargs contained real \0's and not fakes, and I was able to get similar output from other char types by using \\000 --- you never can tell how many \\\\ are required to input a single \ at the next level.

Also, I remember being told that bytea _can_ hold embedded \0, no?

> > Should we not examine "the _possible_ outputs of every C-coded function
> > to make sure it produces a valid value of the datatype" ;)
>
> Go for it.
>
> Possibly chr() should reject chr(0) ...

There is no function chr(), at least in 7.0.2, and char is not usable from psql (it gives out strange errors):

hannu=# select char(0);
ERROR:  length for type 'bpchar' must be at least 1
hannu=# select char(32);
ERROR:  parser: parse error at or near ";"
hannu=# select char('a');
ERROR:  parser: parse error at or near "'"

Could it be possible to somehow distinguish between user-callable (safe) functions and "internal" ones?

------------
Hannu
Hannu Krosing <hannu@tm.ee> writes:
> Also, I remember being told that bytea _can_ hold embedded \0, no?

Yes. What has that got to do with text et al?

>> Possibly chr() should reject chr(0) ...
>
> There is no function chr(), at least in 7.0.2.

I think it used to be called ichar(), but then someone pointed out that
Oracle calls it chr().

> and char is not usable from psql (gives out strange errors);

char(n) is a type name, not a function call.

			regards, tom lane
Tom Lane wrote:
>
> Hannu Krosing <hannu@tm.ee> writes:
> > Also, I remember being told that bytea _can_ hold embedded \0, no?
>
> Yes. What has that got to do with text et al?
>
> >> Possibly chr() should reject chr(0) ...
> >
> > There is no function chr(), at least in 7.0.2.
>
> I think it used to be called ichar(), but then someone pointed out that
> Oracle calls it chr().
>
> > and char is not usable from psql (gives out strange errors);
>
> char(n) is a type name, not a function call.

\df lists it as a function taking either a bpchar or text argument:

hannu=# \df
List of functions
       Result        |       Function       |       Arguments
---------------------+----------------------+------------------------
...
 char                | char                 | bpchar
 char                | char                 | text
...

Maybe the cast functions (named similarly to types) should be removed from \df output, or at least marked as such, also in pg_proc?

------------
Hannu