Обсуждение: PATCH: CITEXT 2.0 v4
Howdy, I've attached a new patch with the latest revisions of for the citext contrib module patch. The changes include: * Using strlen() to pass string lengths to the comparison function, since lowercasing the value can change the length. Per Tom Lane. * Made citextcmp consistently return int32, per Tom Lane. * Made the hash index function return the proper value, per Tom Lane. * Removed the COMMENTs and GRANTs from citext.sql.in. * Added a cast function from bpchar to citext, as suggested by Tom Lane. * Set the storage type for CITEXT to "extended", to ensure that it will be toastable. Per Tom Lane. * Fixed the COMMUTATOR of <=. * Changed the cast from citext to bpchar from implicit to assignment. This eliminates ambiguous function resolutions. * Eliminated superflous functions, per Tom Lane. * Removed unnecessary `OPERATOR()` calls in NEGATORs and the like. * Added binary in/out functions. Per Tom Lane * Added an explicit shell type to make the output a bit quieter. * Converted tests to pure SQL and omitted multibyte tests (though a few remain commented-out). * Reorganized and expanded the documentation a bit. This version is far better than I started with, and I'm very grateful for the feedback. Now, I have a few remaining questions to ask, mostly just to get your opinions: * The README for citext 1.0 on pgFoundry says: > I had to make a decision on casting between types for regular > expressions and > decided that if any parameter is of citext type then case > insensitive applies. > For example applying regular expressions with a varchar and a citext > will > produce a case-insensitive result. > > Having thought about this afterwards I realised that since we have > the option > to use case-insensitive results with regular expressions I should > have left the > behaviour exactly as text and then you have the best of both > worlds... oh well > not hard to change for any of you perfectionists! I followed the original and made all the regex and LIKE comparisons case-insensitive. But maybe I should not have? Especially since the regular expression functions (e.g., regexp_replace()) and a few non- regex functions (e.g., replace()) still don't behave case-insensitively? * If the answer is "no", how can I make those functions behave case- insensitively? (See the "TODO" tests.) * Should there be any other casts? To and from name, perhaps? Thanks! David
Вложения
On Jul 15, 2008, at 22:23, David E. Wheeler wrote: > * The README for citext 1.0 on pgFoundry says: > >> I had to make a decision on casting between types for regular >> expressions and >> decided that if any parameter is of citext type then case >> insensitive applies. >> For example applying regular expressions with a varchar and a >> citext will >> produce a case-insensitive result. >> >> Having thought about this afterwards I realised that since we have >> the option >> to use case-insensitive results with regular expressions I should >> have left the >> behaviour exactly as text and then you have the best of both >> worlds... oh well >> not hard to change for any of you perfectionists! > > I followed the original and made all the regex and LIKE comparisons > case-insensitive. But maybe I should not have? Especially since the > regular expression functions (e.g., regexp_replace()) and a few non- > regex functions (e.g., replace()) still don't behave case- > insensitively? I was thinking about this a bit last night and wanted to fill things out a bit. As a programmer, I find Donald Fraser's hindsight to be more appealing, because at least that way I have the option to do matching against CITEXT strings case-sensitively when I want to. OTOH, if what we want is to have CITEXT work more like a case- insensitive collation, then the expectation from the matching operators and functions might be different. Does anyone have any idea whether regex and LIKE matching against a string in a case-insensitive collation would match case-insensitively or not? If so, then maybe the regex and LIKE ops and funcs *should* match case-insensitively? If not, or if only for some collations, then I would think not. Either way, I know of no way, currently, to allow functions like replace(), split_part(), strpos(), and translate() to match case- insensitiely, even if we wanted to. Anyone have any ideas? > * If the answer is "no", how can I make those functions behave case- > insensitively? (See the "TODO" tests.) > > * Should there be any other casts? To and from name, perhaps? Thanks, David
On Wednesday 16 July 2008 13:54:25 David E. Wheeler wrote: > On Jul 15, 2008, at 22:23, David E. Wheeler wrote: > > * The README for citext 1.0 on pgFoundry says: > >> I had to make a decision on casting between types for regular > >> expressions and > >> decided that if any parameter is of citext type then case > >> insensitive applies. > >> For example applying regular expressions with a varchar and a > >> citext will > >> produce a case-insensitive result. > >> > >> Having thought about this afterwards I realised that since we have > >> the option > >> to use case-insensitive results with regular expressions I should > >> have left the > >> behaviour exactly as text and then you have the best of both > >> worlds... oh well > >> not hard to change for any of you perfectionists! > > > > I followed the original and made all the regex and LIKE comparisons > > case-insensitive. But maybe I should not have? Especially since the > > regular expression functions (e.g., regexp_replace()) and a few non- > > regex functions (e.g., replace()) still don't behave case- > > insensitively? > > I was thinking about this a bit last night and wanted to fill things > out a bit. > > As a programmer, I find Donald Fraser's hindsight to be more > appealing, because at least that way I have the option to do matching > against CITEXT strings case-sensitively when I want to. > > OTOH, if what we want is to have CITEXT work more like a case- > insensitive collation, then the expectation from the matching > operators and functions might be different. Does anyone have any idea > whether regex and LIKE matching against a string in a case-insensitive > collation would match case-insensitively or not? If so, then maybe the > regex and LIKE ops and funcs *should* match case-insensitively? If > not, or if only for some collations, then I would think not. > > Either way, I know of no way, currently, to allow functions like > replace(), split_part(), strpos(), and translate() to match case- > insensitiely, even if we wanted to. Anyone have any ideas? > > > * If the answer is "no", how can I make those functions behave case- > > insensitively? (See the "TODO" tests.) > > > > * Should there be any other casts? To and from name, perhaps? > AIUI, your propsing the following: select 'x'::citext = 'X'::citext;?column? ----------t (1 row) select 'x'::citext ~ 'X'::citext;?column? ----------f (1 row) I understand the desire for flexibility, but the above seems wierd to me. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Jul 16, 2008, at 11:20, Robert Treat wrote: >> I was thinking about this a bit last night and wanted to fill things >> out a bit. >> >> As a programmer, I find Donald Fraser's hindsight to be more >> appealing, because at least that way I have the option to do matching >> against CITEXT strings case-sensitively when I want to. >> >> OTOH, if what we want is to have CITEXT work more like a case- >> insensitive collation, then the expectation from the matching >> operators and functions might be different. Does anyone have any idea >> whether regex and LIKE matching against a string in a case- >> insensitive >> collation would match case-insensitively or not? If so, then maybe >> the >> regex and LIKE ops and funcs *should* match case-insensitively? If >> not, or if only for some collations, then I would think not. >> >> Either way, I know of no way, currently, to allow functions like >> replace(), split_part(), strpos(), and translate() to match case- >> insensitiely, even if we wanted to. Anyone have any ideas? >> >>> * If the answer is "no", how can I make those functions behave case- >>> insensitively? (See the "TODO" tests.) >>> >>> * Should there be any other casts? To and from name, perhaps? >> > > AIUI, your propsing the following: > > select 'x'::citext = 'X'::citext; > ?column? > ---------- > t > (1 row) > > select 'x'::citext ~ 'X'::citext; > ?column? > ---------- > f > (1 row) > > I understand the desire for flexibility, but the above seems wierd > to me. That's what Donald Fraser suggested, and I see some value in that, but wanted to get some other opinions. And you're right, that does seem a bit weird. The trouble is that, right now: template1=# select regexp_replace( 'fxx'::citext, 'X'::citext, 'o'); regexp_replace ---------------- fxx (1 row) So there's an inconsistency there. I don't know how to make that work case-insensitively. Best, David
Am 16.07.2008 um 20:38 schrieb David E. Wheeler: > > The trouble is that, right now: > > template1=# select regexp_replace( 'fxx'::citext, 'X'::citext, 'o'); > regexp_replace > ---------------- > fxx > (1 row) > > So there's an inconsistency there. I don't know how to make that > work case-insensitively. Wouldn't it be possible to create a variant of regexp_replace, i.e. regexp_replace(citext,citext,text), which would again lower-case the first two arguments before passing the input to regexp_replace(text,text,text)? Best Regards Michael Paesold
On Jul 17, 2008, at 03:45, Michael Paesold wrote: > Wouldn't it be possible to create a variant of regexp_replace, i.e. > regexp_replace(citext,citext,text), which would again lower-case the > first two arguments before passing the input to > regexp_replace(text,text,text)? Sure, but then you end up with this: template1=# select regexp_replace( 'Fxx'::citext, 'X'::citext, 'o'); regexp_replace ---------------- foo (1 row) Which is just wrong. I'm going to look at the regex C functions today and see if there's an easy way to just always pass them the 'i' flag, which would do the trick. That still won't help replace(), split_part(), or translate(), however. Best, David
David E. Wheeler writes: > On Jul 17, 2008, at 03:45, Michael Paesold wrote: > >> Wouldn't it be possible to create a variant of regexp_replace, i.e. >> regexp_replace(citext,citext,text), which would again lower-case >> the first two arguments before passing the input to >> regexp_replace(text,text,text)? > > Sure, but then you end up with this: > > template1=# select regexp_replace( 'Fxx'::citext, 'X'::citext, 'o'); > regexp_replace > ---------------- > foo > (1 row) Yeah, you are right, I see. :-) > Which is just wrong. I'm going to look at the regex C functions > today and see if there's an easy way to just always pass them the > 'i' flag, which would do the trick. That still won't help replace(), > split_part(), or translate(), however. Calling regex functions with the case-insensitivity option would be great. It should also be possible to rewrite replace() into regexp_replace() by first escaping the regex meta characters. Actually re-implementing those functions in a case insensitive way would still be an option, but of course some amount of work. The question is, how much use case there is. Best Regards Michael Paesold
On Jul 18, 2008, at 01:39, Michael Paesold wrote: > Calling regex functions with the case-insensitivity option would be > great. It should also be possible to rewrite replace() into > regexp_replace() by first escaping the regex meta characters. > > Actually re-implementing those functions in a case insensitive way > would still be an option, but of course some amount of work. The > question is, how much use case there is. Not much for me. I might use the regex functions, but would be happy to manually pass the "i" flag. However, if someone with a lot more C and Pg core knowledge wanted to sit down with me for a couple hours next week and help me bang out these functions, that would be great. I'd love to have the implementation be that much more complete. I do believe that, as it stands now in the v4 patch, citext is pretty close to ready, and certainly commit-able. Thanks, David
On Jul 18, 2008, at 09:53, David E. Wheeler wrote: > However, if someone with a lot more C and Pg core knowledge wanted > to sit down with me for a couple hours next week and help me bang > out these functions, that would be great. I'd love to have the > implementation be that much more complete. I've implemented fixes for the regexp_* functions and strpos() in pure SQL, like so: CREATE OR REPLACE FUNCTION regexp_matches( citext, citext ) RETURNS TEXT[] AS ' SELECT regexp_matches( $1::text, $2::text, ''i'' ); ' LANGUAGE SQL IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION regexp_matches( citext, citext, text ) RETURNS TEXT[] AS ' SELECT regexp_matches( $1::text, $2::text, CASE WHEN strpos($3, ''c'') = 0 THEN $3 || ''i'' ELSE $3 END ); ' LANGUAGE SQL IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION regexp_replace( citext, citext, text ) returns TEXT AS ' SELECT regexp_replace( $1::text, $2::text, $3, ''i''); ' LANGUAGE SQL IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION regexp_replace( citext, citext, text, text ) returns TEXT AS ' SELECT regexp_replace( $1::text, $2::text, $3, CASE WHEN strpos($4, ''c'') = 0 THEN $4 || ''i'' ELSE $4 END); ' LANGUAGE SQL IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION regexp_split_to_array( citext, citext ) RETURNS TEXT[] AS ' SELECT regexp_split_to_array( $1::text, $2::text, ''i'' ); ' LANGUAGE SQL IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION regexp_split_to_array( citext, citext, text ) RETURNS TEXT[] AS ' SELECT regexp_split_to_array( $1::text, $2::text, CASE WHEN strpos($3, ''c'') = 0 THEN $3 || ''i'' ELSE $3 END ); ' LANGUAGE SQL IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION regexp_split_to_table( citext, citext ) RETURNS SETOF TEXT AS ' SELECT regexp_split_to_table( $1::text, $2::text, ''i'' ); ' LANGUAGE SQL IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION regexp_split_to_table( citext, citext, text ) RETURNS SETOF TEXT AS ' SELECT regexp_split_to_table( $1::text, $2::text, CASE WHEN strpos($3, ''c'') = 0 THEN $3 || ''i'' ELSE $3 END ); ' LANGUAGE SQL IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION strpos( citext, citext ) RETURNS INT AS ' SELECT strpos( LOWER( $1::text ), LOWER( $2::text) ); ' LANGUAGE SQL IMMUTABLE STRICT; Not so bad, though it'd be nice to have C functions that just did these things. Still not case-insensitive are: -- replace() -- split_part() -- translate() So, anyone at OSCON this week want to help me with these? Or to convert the above functions to C? Greg? Bruce? Thanks, David
On Jul 18, 2008, at 01:39, Michael Paesold wrote: > Calling regex functions with the case-insensitivity option would be > great. It should also be possible to rewrite replace() into > regexp_replace() by first escaping the regex meta characters. > > Actually re-implementing those functions in a case insensitive way > would still be an option, but of course some amount of work. The > question is, how much use case there is. I've figured out how to make all the functions work using SQL function workarounds, converting things and re-dispatching to the text versions as appropriate. They work quite well, and can be converted to C later if that becomes a requirement. Meanwhile, on the question of whether or not regular expression and LIKE comparisons *should* match case-insensitively, I have a couple more observations: * Thinking about how a true case-insensitive collation would work, I'm quite certain that it would match case-insensitively. Anything else would just be unexpected, because in a case-insensitive collation, lowercase characters are, in practice, identical to uppercase characters. As far as matching is concerned, there is no difference between them. So the matching operators and functions against CITEXT should follow that assumption. * I tried a few matches on MySQL, where the collation is case- insensitive by default, and it confirms my impression: mysql> select 'Foo' regexp 'o$'; +-------------------+ | 'Foo' regexp 'o$' | +-------------------+ | 1 | +-------------------+ 1 row in set (0.00 sec) mysql> select 'Foo' regexp 'O$'; +-------------------+ | 'Foo' regexp 'O$' | +-------------------+ | 1 | +-------------------+ 1 row in set (0.00 sec) mysql> select 'Foo' like '%o'; +-----------------+ | 'Foo' like '%o' | +-----------------+ | 1 | +-----------------+ 1 row in set (0.00 sec) mysql> select 'Foo' like '%O'; +-----------------+ | 'Foo' like '%O' | +-----------------+ | 1 | +-----------------+ 1 row in set (0.00 sec) I'll grant that MySQL may not be the best model for how things should work, but it's something, at least. Anyone else got access to another database with case-insensitive collations to see how LIKE and regular expressions work? Thanks, David