Обсуждение: insensitive collations
With various patches and discussions around collations going on, I figured I'd send in my in-progress patch for insensitive collations. This adds a flag "insensitive" to collations. Such a collation disables various optimizations that assume that strings are equal only if they are byte-wise equal. That then allows use cases such as case-insensitive or accent-insensitive comparisons or handling of strings with different Unicode normal forms. So this doesn't actually make the collation case-insensitive or anything, it just allows a library-provided collation that is, say, case-insensitive to actually work that way. So maybe "insensitive" isn't the right name for this flag, but we can think about that. The jobs of this patch, aside from some DDL extensions, are to track collation assignment in plan types whether they have so far been ignored, and then make the various collation-aware functions take the insensitive flag into account. In comparison functions this just means skipping past the memcmp() optimizations. In hashing functions, this means converting the string to a sort key (think strxfrm()) before hashing. Various pieces are incomplete, but the idea should be clear from this. I have only implemented the ICU implementation in hashtext(); the libc provider branch needs to be added (or maybe we won't want to). All the changes around the "name" type haven't been taken into account. Foreign key support (see ri_GenerateQualCollation()) needs to be addressed. More tests for all the different plans need to be added. But in principle it works quite well, as you can see in the tests added so far. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
Peter Eisentraut wrote: > So this doesn't actually make the collation case-insensitive or > anything, it just allows a library-provided collation that is, say, > case-insensitive to actually work that way. That's great news! > So maybe "insensitive" isn't the right name for this flag, but we > can think about that. To me it seems more natural to find a name for the other behavior, the one that consists of overwriting the locale-sensitive equality with the result of the byte-wise comparison. AFAIU the flag is meant to say: "don't do that". Some ideas that come to mind: as an enum CREATE COLLATION ... ( [EQUALITY = 'natural' | 'bytewise' ] ) as a boolean CREATE COLLATION ... ( [NATURAL_EQUALITY = true | false ] ) defaulting to false or CREATE COLLATION ... ( [BYTEWISE_EQUALITY = true | false ] ) defaulting to true Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
"Daniel Verite" <daniel@manitou-mail.org> writes: > To me it seems more natural to find a name for the other behavior, the > one that consists of overwriting the locale-sensitive equality with > the result of the byte-wise comparison. AFAIU the flag is meant > to say: "don't do that". > Some ideas that come to mind: > as an enum > CREATE COLLATION ... ( [EQUALITY = 'natural' | 'bytewise' ] ) > as a boolean > CREATE COLLATION ... ( [NATURAL_EQUALITY = true | false ] ) > defaulting to false > or > CREATE COLLATION ... ( [BYTEWISE_EQUALITY = true | false ] ) > defaulting to true I don't really find it "natural" for equality to consider obviously distinct values to be equal. As a counterexample, the fact that IEEE arithmetic treats 0 and -0 as equal seems to cause about as many problems as it solves, and newcomers to float math certainly don't find it "natural". So I object to that particular naming. Perhaps we could do something around "uniqueness"? That is, collations could have a boolean property UNIQUE, default true, or something like that. regards, tom lane
On Wed, Dec 19, 2018 at 6:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I don't really find it "natural" for equality to consider obviously > distinct values to be equal. The Unicode consortium calls our current behavior within comparisons "deterministic comparisons" -- it's something they're not so keen on: https://unicode.org/reports/tr10/#Deterministic_Comparison I suggest using their terminology for our current behavior. I can see where Peter was coming from with "natural", but it doesn't quite work. One problem with that terminology is that already refers to a sort that sorts numbers as numbers. How about "linguistic", or "lexical"? There is a significant cultural dimension to this. I suspect that you don't find it "natural" that obviously distinct values compare as equal because you're anglophone. That's the exact behavior you'll get when using an unadorned en_US collation/locale, I think. > As a counterexample, the fact that IEEE > arithmetic treats 0 and -0 as equal seems to cause about as many problems > as it solves, and newcomers to float math certainly don't find it > "natural". So I object to that particular naming. FWIW, I don't think that your IEEE analogy quite works, because you're talking about a property of a datatype. A collation is not intrinsic to any collatable datatype. Fortunately, we're not required to agree on what feels natural. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > The Unicode consortium calls our current behavior within comparisons > "deterministic comparisons" -- it's something they're not so keen on: > https://unicode.org/reports/tr10/#Deterministic_Comparison > I suggest using their terminology for our current behavior. Hm, it's not the greatest terminology perhaps, but if they're using it then that makes it at least semi-standard. I can live with that. > FWIW, I don't think that your IEEE analogy quite works, because you're > talking about a property of a datatype. A collation is not intrinsic > to any collatable datatype. Fortunately, we're not required to agree > on what feels natural. Right, which is exactly why it'd be a bad idea to use "natural" as the name for this property ... regards, tom lane
Tom Lane wrote: > I don't really find it "natural" for equality to consider obviously > distinct values to be equal. According to https://www.merriam-webster.com/dictionary/natural "natural" has no less than 15 meanings. The first in the list is "based on an inherent sense of right and wrong" which I admit is not what we want to imply in this context. The meaning that I was thinking about was close to definitions 4: "following from the nature of the one in question " or 7: "having a specified character by nature " or 13: "closely resembling an original : true to nature" When postgres uses the comparison from a collation with no modification whatsoever, it's true to that collation. When it changes the result from equal to non-equal, it's not. If a collation says that "ABC" = "abc" and postgres says, mmh, OK thanks but I'll go with "ABC" != "abc", then that denatures the collation, in the sense of: "to deprive of natural qualities : change the nature of" (https://www.merriam-webster.com/dictionary/denature) Aside from that, I'd be +1 for "linguistic" as the opposite of "bytewise", I think it tends to be easily understood when expressing that a strcoll()-like function is used as opposed to a strcmp()-like function. I'm -1 for "deterministic" as a replacement for "bytewise". Even if Unicode has choosen that term for exactly the behavior we're talking about, it's heavily used in the more general sense of: "given a particular input, will always produce the same output" (quoted from https://en.wikipedia.org/wiki/Deterministic_algorithm) which we very much expect from all our string comparisons no matter the flags we may put on the collations. "bytewise" might be less academic but it has less potential for wrong interpretations. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
Here is an updated patch. I have updated the naming to "deterministic", as discussed. I have fixed up support for the "name" type, added foreign key support, psql, pg_dump support, more tests. There are a couple of TODOs in bpchar support that I need to look into a bit more. But other than that it's pretty complete. Perhaps it worth pointing out to new reviewers that the ICU tests can be run like so: make check EXTRA_TESTS=collate.icu.utf8 -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
Peter Eisentraut wrote: > Here is an updated patch. When using GROUP BY and ORDER BY on a field with a non-deterministic collation, this pops out: CREATE COLLATION myfr (locale='fr-u-ks-level1', provider='icu', deterministic=false); =# select n from (values ('été' collate "myfr"), ('ete')) x(n) group by 1 order by 1 ; n ----- ete (1 row) =# select n from (values ('été' collate "myfr"), ('ete')) x(n) group by 1 order by 1 desc; n ----- été (1 row) The single-row output is different whether it's sorted in the ASC or DESC direction, even though in theory, ORDER BY is done after GROUP BY, where it shouldn't make that difference. EXPLAIN shows that the sort is done before grouping, which might explain why it happens, but isn't that plan incorrect given the context? postgres=# explain select n from (values ('été' collate "myfr"), ('ete')) x(n) group by 1 order by 1 desc; QUERY PLAN -------------------------------------------------------------------------- Group (cost=0.04..0.04 rows=2 width=32) Group Key: "*VALUES*".column1 -> Sort (cost=0.04..0.04 rows=2 width=32) Sort Key: "*VALUES*".column1 COLLATE myfr DESC -> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=32) (5 rows) Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
On 04/01/2019 17:05, Daniel Verite wrote: > When using GROUP BY and ORDER BY on a field with a non-deterministic > collation, this pops out: > > CREATE COLLATION myfr (locale='fr-u-ks-level1', > provider='icu', deterministic=false); > > =# select n from (values ('été' collate "myfr"), ('ete')) x(n) > group by 1 order by 1 ; > n > ----- > ete > (1 row) > > =# select n from (values ('été' collate "myfr"), ('ete')) x(n) > group by 1 order by 1 desc; > n > ----- > été > (1 row) I don't see anything wrong here. The collation says that both values are equal, so which one is returned is implementation-dependent. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/28/18 9:55 AM, Peter Eisentraut wrote: > Here is an updated patch. > > I have updated the naming to "deterministic", as discussed. Maybe this is orthogonal and best handled elsewhere but have you when working with string equality given unicode normalization forms[1] any thought? I feel there are three sane ways to do unicode string equality: 1) Binary equality 2) Binary equality after normalizing the unicode 3) Collation equality Would there be any point in adding unicode normalization support into the collation system or is this best handle for example with a function run on INSERT or with something else entirely? Right now PosgreSQL does not have any support for normalization forms as far as I know. 1. http://unicode.org/reports/tr15/ Andreas
Peter Eisentraut wrote: > > =# select n from (values ('été' collate "myfr"), ('ete')) x(n) > > group by 1 order by 1 ; > > n > > ----- > > ete > > (1 row) > > > > =# select n from (values ('été' collate "myfr"), ('ete')) x(n) > > group by 1 order by 1 desc; > > n > > ----- > > été > > (1 row) > > I don't see anything wrong here. The collation says that both values > are equal, so which one is returned is implementation-dependent. Is it, but it's impractical if the product of seemingly the same GROUP BY flip-flops between its different valid results. If it can't be avoided, then okay. If it can be avoided at little cost, then it would be better to do it. As a different example, the regression tests are somewhat counting on this already. Consider this part: +CREATE TABLE test3ci (x text COLLATE case_insensitive); +INSERT INTO test1ci VALUES ('abc'), ('def'), ('ghi'); +INSERT INTO test2ci VALUES ('ABC'), ('ghi'); +INSERT INTO test3ci VALUES ('abc'), ('ABC'), ('def'), ('ghi'); ... +SELECT x, count(*) FROM test3ci GROUP BY x ORDER BY x; + x | count +-----+------- + abc | 2 + def | 1 + ghi | 1 +(3 rows) If ABC was returned here instead of abc for whatever reason, that would be correct strictly speaking, yet "make check" would fail. That's impractical. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
On 09/01/2019 19:49, Andreas Karlsson wrote: > On 12/28/18 9:55 AM, Peter Eisentraut wrote: >> Here is an updated patch. >> >> I have updated the naming to "deterministic", as discussed. > > Maybe this is orthogonal and best handled elsewhere but have you when > working with string equality given unicode normalization forms[1] any > thought? Nondeterministic collations do address this by allowing canonically equivalent code point sequences to compare as equal. You still need a collation implementation that actually does compare them as equal; ICU does this, glibc does not AFAICT. > Would there be any point in adding unicode normalization support into > the collation system or is this best handle for example with a function > run on INSERT or with something else entirely? I think there might be value in a feature that normalizes strings as they enter the database, as a component of the encoding conversion infrastructure. But that would be a separate feature. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 09/01/2019 22:01, Daniel Verite wrote: >> I don't see anything wrong here. The collation says that both values >> are equal, so which one is returned is implementation-dependent. > Is it, but it's impractical if the product of seemingly the same GROUP BY > flip-flops between its different valid results. If it can't be avoided, then > okay. If it can be avoided at little cost, then it would be better to do it. But there is no concept of which one of these is the preferred variant, so I don't see how the system is supposed to pick one and then stick to it across separate query invocations. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut wrote: > Here is an updated patch. On a table with pre-existing contents, the creation of a unique index does not seem to detect the duplicates that are equal per the collation and different binary-wise. postgres=# \d test3ci Table "public.test3ci" Column | Type | Collation | Nullable | Default --------+------+------------------+----------+--------- x | text | case_insensitive | | postgres=# select * from test3ci; x ----- abc ABC def ghi (4 rows) postgres=# create unique index idx on test3ci(x); -- EXPECTED TO FAIL CREATE INDEX postgres=# \d test3ci Table "public.test3ci" Column | Type | Collation | Nullable | Default --------+------+------------------+----------+--------- x | text | case_insensitive | | Indexes: "idx" UNIQUE, btree (x) postgres=# select count(*) from test3ci where x='abc'; count ------- 2 (1 row) OTOH with an already existing unique index, attempts to insert such duplicates are rejected as expected. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
On 1/10/19 8:44 AM, Peter Eisentraut wrote: > On 09/01/2019 19:49, Andreas Karlsson wrote: >> Maybe this is orthogonal and best handled elsewhere but have you when >> working with string equality given unicode normalization forms[1] any >> thought? > > Nondeterministic collations do address this by allowing canonically > equivalent code point sequences to compare as equal. You still need a > collation implementation that actually does compare them as equal; ICU > does this, glibc does not AFAICT. Ah, right! You could use -ks-identic[1] for this. >> Would there be any point in adding unicode normalization support into >> the collation system or is this best handle for example with a function >> run on INSERT or with something else entirely? > > I think there might be value in a feature that normalizes strings as > they enter the database, as a component of the encoding conversion > infrastructure. But that would be a separate feature. Agreed. And if we ever implement this we could theoretically optimize the equality of -ks-identic to do a strcmp() rather than having to collate anything. I think it could also be useful to just add functions which can normalize strings, which was in a proposal to the SQL standard which was not accepted.[2] Notes 1. http://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options 2. https://dev.mysql.com/worklog/task/?id=2048 Andreas
Andreas Karlsson wrote: > > Nondeterministic collations do address this by allowing canonically > > equivalent code point sequences to compare as equal. You still need a > > collation implementation that actually does compare them as equal; ICU > > does this, glibc does not AFAICT. > > Ah, right! You could use -ks-identic[1] for this. Strings that differ like that are considered equal even at this level: postgres=# create collation identic (locale='und-u-ks-identic', provider='icu', deterministic=false); CREATE COLLATION postgres=# select 'é' = E'e\u0301' collate "identic"; ?column? ---------- t (1 row) There's a separate setting "colNormalization", or "kk" in BCP 47 From http://www.unicode.org/reports/tr35/tr35-collation.html#Normalization_Setting "The UCA always normalizes input strings into NFD form before the rest of the algorithm. However, this results in poor performance. With normalization=off, strings that are in [FCD] and do not contain Tibetan precomposed vowels (U+0F73, U+0F75, U+0F81) should sort correctly. With normalization=on, an implementation that does not normalize to NFD must at least perform an incremental FCD check and normalize substrings as necessary" But even setting this to false does not mean that NFD and NFC forms of the same text compare as different: postgres=# create collation identickk (locale='und-u-ks-identic-kk-false', provider='icu', deterministic=false); CREATE COLLATION postgres=# select 'é' = E'e\u0301' collate "identickk"; ?column? ---------- t (1 row) AFAIU such strings may only compare as different when they're not in FCD form (http://unicode.org/notes/tn5/#FCD) There are also ICU-specific explanations about FCD here: http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm#Normalization It looks like setting colNormalization to false might provide a performance benefit when you know your contents are in FCD form, which is mostly the case according to ICU: "Note that all NFD strings are in FCD, and in practice most NFC strings will also be in FCD; for that matter most strings (of whatever ilk) will be in FCD. We guarantee that if any input strings are in FCD, that we will get the right results in collation without having to normalize". Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
On 14/01/2019 13:23, Daniel Verite wrote: > On a table with pre-existing contents, the creation of a unique index > does not seem to detect the duplicates that are equal per the > collation and different binary-wise. Fixed in the attached updated patch. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
Peter Eisentraut wrote: > > On a table with pre-existing contents, the creation of a unique index > > does not seem to detect the duplicates that are equal per the > > collation and different binary-wise. > > Fixed in the attached updated patch. Check. I've found another issue with aggregates over distinct: the deduplication seems to ignore the collation. postgres=# select distinct x from test3ci; -- OK x ----- def abc ghi (3 rows) postgres=# select count(distinct x) from test3ci; -- not OK count ------- 4 (1 row) postgres=# select array_agg(distinct x) from test3ci; -- not OK array_agg ------------------- {ABC,abc,def,ghi} (1 row) Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
On 14/01/2019 15:37, Andreas Karlsson wrote: >> Nondeterministic collations do address this by allowing canonically >> equivalent code point sequences to compare as equal. You still need a >> collation implementation that actually does compare them as equal; ICU >> does this, glibc does not AFAICT. > Ah, right! You could use -ks-identic[1] for this. That's the default. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 16/01/2019 14:20, Daniel Verite wrote: > I've found another issue with aggregates over distinct: > the deduplication seems to ignore the collation. I have a fix for that. I'll send it with the next update. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 16/01/2019 21:50, Peter Eisentraut wrote: > On 16/01/2019 14:20, Daniel Verite wrote: >> I've found another issue with aggregates over distinct: >> the deduplication seems to ignore the collation. > > I have a fix for that. I'll send it with the next update. Another patch. This fixes your issue, and it incorporates the findings from the thread "ExecBuildGroupingEqual versus collations", as well as a few other fixes and more tests. As far as I can tell, this covers everything now, meaning all the relevant plan types propagate the collation correctly and all the relevant operators and functions do the right things with them. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
Peter Eisentraut wrote: > Another patch. + <literal>ks</literal> key), in order for such such collations to act in a s/such such/such/ + <para> + The pattern matching operators of all three kinds do not support + nondeterministic collations. If required, apply a different collation to + the expression to work around this limitation. + </para> It's an important point of comparison between CI collations and contrib/citext, since the latter diverts a bunch of functions/operators to make them do case-insensitive pattern matching. The doc for citext explains the rationale for using it versus text, maybe it would need now to be expanded a bit with pros/cons of choosing citext versus non-deterministic collations. The current patch doesn't alter a few string functions that could potentially implement collation-aware string search, such as replace(), strpos(), starts_with(). ISTM that we should not let these functions ignore the collation: they ought to error out until we get their implementation to use the ICU collation-aware string search. FWIW I've been experimenting with usearch_openFromCollator() and other usearch_* functions, and it looks doable to implement at least the 3 above functions based on that, even though the UTF16-ness of the API does not favor us. ICU also provides regexp matching, but not collation-aware, since character-based patterns don't play well with the concept of collation. About a potential collation-aware LIKE, it looks hard to implement, since the algorithm currently used in like_match.c seems purely character-based. AFAICS there's no way to plug calls to usearch_* functions into it, it would need a separate redesign from scratch. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
On 30/01/2019 16:30, Daniel Verite wrote: > Peter Eisentraut wrote: > >> Another patch. Another patch, with expanded documentation, more tests, and some code cleanup. > > + <literal>ks</literal> key), in order for such such collations to act in > a > > s/such such/such/ fixed > > + <para> > + The pattern matching operators of all three kinds do not support > + nondeterministic collations. If required, apply a different collation > to > + the expression to work around this limitation. > + </para> > > It's an important point of comparison between CI collations and > contrib/citext, since the latter diverts a bunch of functions/operators > to make them do case-insensitive pattern matching. > The doc for citext explains the rationale for using it versus text, > maybe it would need now to be expanded a bit with pros/cons of > choosing citext versus non-deterministic collations. possibly addressed by the expanded documentation > The current patch doesn't alter a few string functions that could > potentially implement collation-aware string search, such as > replace(), strpos(), starts_with(). > ISTM that we should not let these functions ignore the collation: they > ought to error out until we get their implementation to use the ICU > collation-aware string search. I have addressed that. Many of these end up at the same one or two low-level C function, so the checking just happens there. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
Another patch to address merge conflicts. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
On Tue, Feb 19, 2019 at 6:47 AM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > Another patch to address merge conflicts. Some remarks on this: * Your draft commit message says: > This patch makes changes in three areas: > > - CREATE COLLATION DDL changes and system catalog changes to support > this new flag. > > - Many executor nodes and auxiliary code are extended to track > collations. Previously, this code would just throw away collation > information, because the eventually-called user-defined functions > didn't use it since they only cared about equality, which didn't > need collation information. > > - String data type functions that do equality comparisons and hashing > are changed to take the (non-)deterministic flag into account. For > comparison, this just means skipping various shortcuts and tie > breakers that use byte-wise comparison. For hashing, we first need > to convert the input string to a canonical "sort key" using the ICU > analogue of strxfrm(). I wonder if it would be better to break this into distinct commits? * Why is support for non-deterministic comparisons limited to the ICU provider? If that's the only case that could possibly be affected, then why did we ever add the varstrcmp() tie-breaker call to strcmp() in the first place? The behavior described in the commit message of bugfix commit 656beff5 describes a case where Hungarian text caused index corruption by being strcoll()-wise equal but not bitwise equal. Besides, I think that you can vendor your own case insensitive collation with glibc, since it's based on UCA. This restriction feels like it's actually a kludge to work around the fact that database-wide collations have this foreign key related restriction in your patch: > @@ -2901,11 +2921,20 @@ ri_AttributesEqual(Oid eq_opr, Oid typeid, > } > > /* > - * Apply the comparison operator. We assume it doesn't care about > - * collations. > - */ > - return DatumGetBool(FunctionCall2(&entry->eq_opr_finfo, > - oldvalue, newvalue)); > + * Apply the comparison operator. > + * > + * Note: This function is part of a call stack that determines whether an > + * update to a row is significant enough that it needs checking or action > + * on the other side of a foreign-key constraint. Therefore, the > + * comparison here would need to be done with the collation of the *other* > + * table. For simplicity (e.g., we might not even have the other table > + * open), we'll just use the default collation here, which could lead to > + * some false negatives. All this would break if we ever allow > + * database-wide collations to be nondeterministic. > + */ > + return DatumGetBool(FunctionCall2Coll(&entry->eq_opr_finfo, > + DEFAULT_COLLATION_OID, > + oldvalue, newvalue)); > } The existing restriction on ICU collations (that they cannot be database-wide) side-steps the issue. * Can said restriction somehow be lifted? That seems like it would be a lot cleaner. * Why have you disable this optimization?: > /* Fast pre-check for equality, as discussed in varstr_cmp() */ > - if (len1 == len2 && memcmp(a1p, a2p, len1) == 0) > + if ((!sss->locale || sss->locale->deterministic) && > + len1 == len2 && memcmp(a1p, a2p, len1) == 0) I don't see why this is necessary. A non-deterministic collation cannot indicate that bitwise identical strings are non-equal. * Perhaps you should add a "Tip" referencing the feature to the contrib/citext documentation. -- Peter Geoghegan
On 2019-02-21 03:17, Peter Geoghegan wrote: > I wonder if it would be better to break this into distinct commits? I thought about that. Especially the planner/executor changes could be done separately, sort of as a way to address the thread "ExecBuildGroupingEqual versus collations". But I'm not sure if they would have good test coverage on their own. I can work on this if people think this would be useful. > * Why is support for non-deterministic comparisons limited to the ICU > provider? If that's the only case that could possibly be affected, > then why did we ever add the varstrcmp() tie-breaker call to strcmp() > in the first place? The behavior described in the commit message of > bugfix commit 656beff5 describes a case where Hungarian text caused > index corruption by being strcoll()-wise equal but not bitwise equal. > Besides, I think that you can vendor your own case insensitive > collation with glibc, since it's based on UCA. The original test case (described here: https://www.postgresql.org/message-id/27064.1134753128%40sss.pgh.pa.us) no longer works, so it was probably fixed on the glibc side. The git log of the hu_HU locale definition shows that it has been "fixed" in major ways several times over the years, so that's plausible. I tried reproducing some more practical scenarios involving combining diacritics, but glibc apparently doesn't believe strings in different normal forms are equal. At that point I gave up because this doesn't seem worthwhile to support. Moreover, I think allowing this would require a "trusted" strxfrm(), which is currently disabled. >> + return DatumGetBool(FunctionCall2Coll(&entry->eq_opr_finfo, >> + DEFAULT_COLLATION_OID, >> + oldvalue, newvalue)); >> } > > The existing restriction on ICU collations (that they cannot be > database-wide) side-steps the issue. > > * Can said restriction somehow be lifted? That seems like it would be > a lot cleaner. Lift what restriction? That ICU collations cannot be database-wide? Sure that would be nice, but it's a separate project. Even then, I'm not sure that we would allow a database-wide collation to be nondeterministic. That would for example disallow the use of LIKE, which would be weird. In any case, the above issue can be addressed then. I think it's not worth complicating this right now. > * Why have you disable this optimization?: > >> /* Fast pre-check for equality, as discussed in varstr_cmp() */ >> - if (len1 == len2 && memcmp(a1p, a2p, len1) == 0) >> + if ((!sss->locale || sss->locale->deterministic) && >> + len1 == len2 && memcmp(a1p, a2p, len1) == 0) > > I don't see why this is necessary. A non-deterministic collation > cannot indicate that bitwise identical strings are non-equal. Right, I went too far there. > * Perhaps you should add a "Tip" referencing the feature to the > contrib/citext documentation. Good idea. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-02-21 09:36, Peter Eisentraut wrote: >> * Why have you disable this optimization?: >> >>> /* Fast pre-check for equality, as discussed in varstr_cmp() */ >>> - if (len1 == len2 && memcmp(a1p, a2p, len1) == 0) >>> + if ((!sss->locale || sss->locale->deterministic) && >>> + len1 == len2 && memcmp(a1p, a2p, len1) == 0) >> I don't see why this is necessary. A non-deterministic collation >> cannot indicate that bitwise identical strings are non-equal. > Right, I went too far there. > >> * Perhaps you should add a "Tip" referencing the feature to the >> contrib/citext documentation. > Good idea. Here is another patch that fixes these two points. I have also worked on the tests hoping to appease the cfbot. Older ICU versions (<54) don't support all the locale customization options, so many of my new tests in collate.icu.utf8.sql will fail on older systems. What should we do about that? Have another extra test file? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
Peter Eisentraut wrote: [v7-0001-Collations-with-nondeterministic-comparison.patch] +GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation) { + if (collation && !lc_ctype_is_c(collation) && collation != DEFAULT_COLLATION_OID) + { + pg_locale_t locale = pg_newlocale_from_collation(collation); + + if (locale && !locale->deterministic) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("nondeterministic collations are not supported for LIKE"))); + } This test gets side-stepped when pattern_fixed_prefix() in selfuncs.c returns Pattern_Prefix_Exact, and the code optimizes the operation by converting it to a bytewise equality test, or a bytewise range check in the index with Pattern_Type_Prefix. Here's a reproducer: === create collation ciai (locale='und-u-ks-level1', deterministic=false, provider='icu'); create table w(t text collate "C"); insert into w select md5(i::text) from generate_series(1,10000) as i; insert into w values('abc'); create index indexname on w(t ); select t from w where t like 'ABC' collate ciai; t --- (0 rows) select t from w where t like 'ABC%' collate ciai; t --- (0 rows) === For the LIKE operator, I think the fix should be that like_fixed_prefix() should always return Pattern_Prefix_None for non-deterministic collations. For regular expressions, pg_set_regex_collation() is called at some point when checking for a potential prefix, and since it errors out with non-deterministic collations, this issue is taken care of already. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
Peter Eisentraut wrote: > Older ICU versions (<54) don't support all the locale customization > options, so many of my new tests in collate.icu.utf8.sql will fail on > older systems. What should we do about that? Have another extra test file? Maybe stick to the old-style syntax for the regression tests? The declarations that won't work as expected with older ICU versions would be: CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false); 'und-u-ks-level2' is equivalent to 'und@colStrength=secondary' CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false); 'und-u-ks-level1-kc-true' => 'und@colStrength=primary;colCaseLevel=yes' Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
On 2019-03-05 18:48, Daniel Verite wrote: >> Older ICU versions (<54) don't support all the locale customization >> options, so many of my new tests in collate.icu.utf8.sql will fail on >> older systems. What should we do about that? Have another extra test file? > Maybe stick to the old-style syntax for the regression tests? > The declarations that won't work as expected with older ICU versions > would be: > > CREATE COLLATION case_insensitive (provider = icu, locale = > 'und-u-ks-level2', deterministic = false); > > 'und-u-ks-level2' is equivalent to 'und@colStrength=secondary' The problem is not the syntax but that the older ICU versions don't support the *functionality* of ks-level2 or colStrength=secondary. If you try it, you will simply get a normal case-sensitive behavior. It would probably be possible to write all the tests for nondeterministic collations without making use of this functionality, using only canonically equivalent sequences as test data. But that would make the tests extremely weird and unintuitive, so I'd like to avoid that. After thinking about it a bit more, I think making a new test file is reasonable. The file is already fairly long anyway, compared to the typical test file size. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut wrote: > The problem is not the syntax but that the older ICU versions don't > support the *functionality* of ks-level2 or colStrength=secondary. If > you try it, you will simply get a normal case-sensitive behavior. My bad, I see now that the "old locale extension syntax" was actually introduced at the same time than the "language tag syntax" in ICU 54: http://bugs.icu-project.org/trac/ticket/8260 With previous versions, we'd need to call ucol_setAttribute(), with the attributes and values defined here: http://icu-project.org/apiref/icu4c/ucol_8h.html for instance to get colStrength=secondary: ucol_setAttribute(coll, UCOL_STRENGTH , UCOL_SECONDARY, &status); which I've just checked gives the expected result with ICU-4.2. These attributes are flagged as "Stable: ICU 2.0" up to "Stable: ICU 2.8" (for UCOL_NUMERIC_COLLATION ). So if we really wanted to have these functionalities with pre-54 ICU, we could but that would mean implementing an interface to pass to CREATE COLLATION the attributes/values we want to support. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
On 2019-03-04 15:58, Daniel Verite wrote: > For the LIKE operator, I think the fix should be that like_fixed_prefix() > should always return Pattern_Prefix_None for non-deterministic collations. Good catch. I added the fix to match_pattern_prefix() instead. That would be the place to change if someone wanted to fix this properly in the future. While testing this, I also found a somewhat related problem with _pattern_ops operator classes, which also need to be prohibited with nondeterministic collations. Updated patch attached, also with some merge conflicts resolved. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
On 2019-03-07 20:04, Daniel Verite wrote: > With previous versions, we'd need to call ucol_setAttribute(), > with the attributes and values defined here: > http://icu-project.org/apiref/icu4c/ucol_8h.html > for instance to get colStrength=secondary: > ucol_setAttribute(coll, UCOL_STRENGTH , UCOL_SECONDARY, &status); > which I've just checked gives the expected result with ICU-4.2. I see. I'm thinking about adding some ad hoc code to pg_newlocale_from_collation() to parse these keywords ourselves, so we can provide the same interface for old ICU versions. I'll send a separate patch for that. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-03-08 11:09, Peter Eisentraut wrote: > On 2019-03-07 20:04, Daniel Verite wrote: >> With previous versions, we'd need to call ucol_setAttribute(), >> with the attributes and values defined here: >> http://icu-project.org/apiref/icu4c/ucol_8h.html >> for instance to get colStrength=secondary: >> ucol_setAttribute(coll, UCOL_STRENGTH , UCOL_SECONDARY, &status); >> which I've just checked gives the expected result with ICU-4.2. > > I see. I'm thinking about adding some ad hoc code to > pg_newlocale_from_collation() to parse these keywords ourselves, so we > can provide the same interface for old ICU versions. I'll send a > separate patch for that. Patches here. This will allow all the existing collation customization options as well as the ones being proposed in this thread to work in older ICU versions. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
On 2019-03-11 21:36, Peter Eisentraut wrote: > Patches here. This will allow all the existing collation customization > options as well as the ones being proposed in this thread to work in > older ICU versions. This has been committed, and here is an updated main patch, which now has a chance to pass the cfbot builds. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
On 2019-03-18 00:19, Peter Eisentraut wrote: > On 2019-03-11 21:36, Peter Eisentraut wrote: >> Patches here. This will allow all the existing collation customization >> options as well as the ones being proposed in this thread to work in >> older ICU versions. > > This has been committed, and here is an updated main patch, which now > has a chance to pass the cfbot builds. The main patch has also been committed. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Has any progress been made on supporting LIKE for nondeterministic collations? The pattern as well as the expresion needs to use collation-aware character comparisons, so for a suitable collation where ß compares equally to ss: SELECT * from table WHERE name LIKE '%ß%' yields Brian Bruß Steven Sossmix and even if the pattern contains only single-byte UTF-8 characters, a non-accented character in the first 127 might compare equally to a two-byte accented character in the first argument, so the comparisons as well as the character-advancing logic must be collation-aware. This seems to imply that for the general nondeterministic case we need to rewrite the algorithm to use ICU functions for advancing to the next character and for comparing characters at the current position in the pattern and string. Is that accurate? for a database with UTF8 encoding and a collation that is case-insenstitive but accent-sensitive, and where the pattern contains only single-byte characters or wildcard characters, would LIKE and ILIKE be correct with the current per-byte implementation - albeit without any index exploitation? ----- Jim Finnerty, AWS, Amazon Aurora PostgreSQL -- Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
For a UTF8 encoded, case-insensitive (nondeterministic), accent-sensitive ICU collation, a LIKE predicate can be used with a small transformation of the predicate, and the pattern can contain multi-byte characters: from: SELECT * FROM locations WHERE location LIKE 'midi-Pyrené%'; -- ERROR: nondeterministic collations are not supported for LIKE to: SELECT * FROM locations WHERE lower(location) COLLATE "C" LIKE lower('midi-Pyrené%'); and if there is an index defined as follows: CREATE INDEX location_LOWER_C_index ON locations (LOWER(location) COLLATE "C"); then the LIKE predicate above performs the desired CI_AS evaluation and also exploits the index: EXPLAIN VERBOSE SELECT * FROM locations WHERE LOWER(location) COLLATE "C" LIKE LOWER('midi-Pyrené%'); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------ Index Scan using location_LOWER_C_index on public.locations (cost=0.13..8.16 rows=1 width=18) Output: location Index Cond: (((lower((locations.location)::text))::text >= 'midi-pyren?'::text) AND ((lower((locations.location)::text))::text < 'midi-pyren?'::text)) Filter: ((lower((locations.location)::text))::text ~~ 'midi-pyren?%'::text) (4 rows) It turns out that CI_AS represents the vast majority (more than 99.9%) of nondeterministic collations that we are seeing in babelfish, because SQL_Latin1_General_CP1_CI_AS is the default collation in SQL Server. Currently nondeterministic collations are disabled at the database level. The cited reason was because of the lack of LIKE support and because certain catalog views use LIKE. That may still need to be a limitation if those LIKE predicates currently have an index exploitation unless we can create a LOWER(col) COLLATE "C" index on the catalog. Which catalog views were these that had the problem? ----- Jim Finnerty, AWS, Amazon Aurora PostgreSQL -- Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
Jim Finnerty wrote: > For a UTF8 encoded, case-insensitive (nondeterministic), accent-sensitive > ICU > collation, a LIKE predicate can be used with a small transformation of the > predicate, and the pattern can contain multi-byte characters: > > from: > > SELECT * FROM locations WHERE location LIKE 'midi-Pyrené%'; > -- ERROR: nondeterministic collations are not supported for LIKE > > to: > > SELECT * FROM locations WHERE lower(location) COLLATE "C" LIKE > lower('midi-Pyrené%'); For prefix matching, there's a simpler way with non-deterministic collations based on the advice in [1] The trick is that if an ICU collation is assigned to "location", whether it's deterministic or not, SELECT * FROM locations WHERE location LIKE 'midi-Pyrené%'; is equivalent to: SELECT * FROM locations WHERE location BETWEEN 'midi-Pyrené' AND 'midi-Pyrené' || E'\uFFFF'; and that will use a btree index if available. Also, it works with all features of ND-collations and all encodings, not just case-insensitiveness and UTF-8. Now that doesn't solve LIKE '%midi-Pyrené%', or LIKE '%midi_Pyrené%', but that trick could be a building block for an algorithm implementing LIKE with ND-collations in the future. [1] https://unicode-org.github.io/icu/userguide/collation/architecture.html#generating-bounds-for-a-sort-key-prefix-matching Best regards, -- Daniel Vérité PostgreSQL-powered mailer: https://www.manitou-mail.org Twitter: @DanielVerite
Jim Finnerty wrote: > Currently nondeterministic collations are disabled at the database level. Deterministic ICU collations are also disabled. > The cited reason was because of the lack of LIKE support and because certain > catalog views use LIKE. But the catalogs shouldn't use the default collation of the database. commit 586b98fdf1aaef4a27744f8b988479aad4bd9a01 provides some details about this: Author: Tom Lane <tgl@sss.pgh.pa.us> Date: Wed Dec 19 17:35:12 2018 -0500 Make type "name" collation-aware. The "name" comparison operators now all support collations, making them functionally equivalent to "text" comparisons, except for the different physical representation of the datatype. They do, in fact, mostly share the varstr_cmp and varstr_sortsupport infrastructure, which has been slightly enlarged to handle the case. To avoid changes in the default behavior of the datatype, set name's typcollation to C_COLLATION_OID not DEFAULT_COLLATION_OID, so that by default comparisons to a name value will continue to use strcmp semantics. (This would have been the case for system catalog columns anyway, because of commit 6b0faf723, but doing this makes it true for user-created name columns as well. In particular, this avoids locale-dependent changes in our regression test results.) In consequence, tweak a couple of places that made assumptions about collatable base types always having typcollation DEFAULT_COLLATION_OID. I have not, however, attempted to relax the restriction that user- defined collatable types must have that. Hence, "name" doesn't behave quite like a user-defined type; it acts more like a domain with COLLATE "C". (Conceivably, if we ever get rid of the need for catalog name columns to be fixed-length, "name" could actually become such a domain over text. But that'd be a pretty massive undertaking, and I'm not volunteering.) Discussion: https://postgr.es/m/15938.1544377821@sss.pgh.pa.us Best regards, -- Daniel Vérité PostgreSQL-powered mailer: https://www.manitou-mail.org Twitter: @DanielVerite
The two ideas can be combined to create a workaround for accent-sensitive nondeterministic collations that enables an ordinary btree to be exploited if available, and also provides the full LIKE logic in either case-sensitive or case-insensitive collations: SELECT * FROM locations WHERE location LIKE 'midi-Pyrén%ées'; becomes: SELECT * FROM locations WHERE location COLLATE "C" ILIKE 'midi-Pyrén%ées' AND -- For CI collations only location BETWEEN 'midi-Pyrén' AND 'midi-Pyrén' || E'\uFFFF'; -- exploitable by ordinary btree LIKE would also be valid for a case sensitive collation, but then the collation would be CS_AS, so it would be deterministic and no transform would be needed. The expression above produces a good plan, but EXPLAIN complains if the concatenated expression is not a valid character for the current client_encoding, which I had set to WIN1252 to display the accented characters properly on the client: babel=# SELECT * FROM locations babel-# WHERE babel-# location COLLATE "C" ILIKE 'midi-Pyrén%ées' AND -- For CI collations only babel-# location BETWEEN 'midi-Pyrén' AND 'midi-Pyrén' || E'\uFFFF'; -- exploitable by ordinary btree location ----------------- Midi-Pyrénées midi-Pyrénées (2 rows) babel=# EXPLAIN VERBOSE SELECT * FROM locations babel-# WHERE babel-# location COLLATE "C" ILIKE 'midi-Pyrén%ées' AND -- For CI collations only babel-# location BETWEEN 'midi-Pyrén' AND 'midi-Pyrén' || E'\uFFFF'; -- exploitable by ordinary btree ERROR: character with byte sequence 0xef 0xbf 0xbf in encoding "UTF8" has no equivalent in encoding "WIN1252" reset client_encoding; babel=# EXPLAIN VERBOSE SELECT * FROM locations WHERE location COLLATE "C" ILIKE 'midi-Pyrén%ées' AND -- For CI collations only location BETWEEN 'midi-Pyrén' AND 'midi-Pyrén' || E'\uFFFF'; -- exploitable by ordinary btree QUERY PLAN ------------------------------------------------------------------------------------------------------------ Index Only Scan using *location_index* on public.locations (cost=0.13..8.15 rows=1 width=18) Output: location Index Cond: ((locations.location >= 'midi-Pyrén'::text) AND (locations.location <= 'midi-Pyrén�'::text)) Filter: ((locations.location)::text ~~* 'midi-Pyrén%ées'::text) (4 rows) ----- Jim Finnerty, AWS, Amazon Aurora PostgreSQL -- Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
My previous post had a flaw, but fixing that flaw led me to what may be a bug? Using column COLLATE "C" ILIKE pattern almost does what you'd like, but the single-character wildcard is treated as a single byte with "C", and that won't match a multi-byte character. The fix, I thought, would be to use a deterministic CS_AS ICU collation, since we can handle the per-character advance correctly in that case (we think). Well, maybe not. It looks like single-character wildcards using a deterministic ICU collation match a single byte instead of a single character: This creates a deterministic case-sensitive, accent-sensitive collation in a utf8-encoded database: SET client_encoding = WIN1252; CREATE COLLATION CS_AS ( provider = icu, locale = 'utf8@colStrength=secondary;colCaseLevel=yes', deterministic = true); CREATE TABLE locations (location VARCHAR(255) COLLATE CS_AS); CREATE INDEX location_index ON locations (location); INSERT INTO locations VALUES ('Franche-Comté') , ('Midi-Pyrénées') , ('midi-Pyrénées') , ('midi-Pyrenées') , ('Brian Bruß') , ('Brian Bruss') , ('Steven Sossmix') , ('Provence-Alpes-Côte d Azur'); postgres=# SELECT * FROM locations WHERE location LIKE 'Franche-Comté'; location ---------------- Franche-Comté (1 row) postgres=# SELECT * FROM locations WHERE location LIKE 'Franche-Comt_'; -- is this a bug? location ---------- (0 rows) postgres=# SELECT * FROM locations WHERE location LIKE 'Franche-Comt__'; -- the wildcard is applied byte by byte instead of character by character, so the 2-byte accented character is matched only by 2 '_'s location ---------------- Franche-Comté (1 row) ----- Jim Finnerty, AWS, Amazon Aurora PostgreSQL -- Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
Jim Finnerty wrote: > SET client_encoding = WIN1252; > [...] > postgres=# SELECT * FROM locations WHERE location LIKE 'Franche-Comt__'; -- > the wildcard is applied byte by byte instead of character by character, so > the 2-byte accented character is matched only by 2 '_'s > location > ---------------- > Franche-Comté > (1 row) The most plausible explanation is that the client-side text is encoded in UTF-8, rather than WIN1252 as declared. If you added length('Franche-Comté') to the above query, I suspect it would tell that the string is one character longer than expected, and octet_length('Franche-Comté') would be two-byte longer than expected. Also dumping the contents of the "location" column with convert_to() would show that the accents have been wrongly translated, if the explanation of the encoding snafu is correct. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: https://www.manitou-mail.org Twitter: @DanielVerite
you were exactly right. With client_encoding set to UTF8 it now works correctly. the transformation to do the LIKE or ILIKE on a nondeterministic ICU collation CI_AS for which there is an otherwise identical deterministic collation CS_AS is as shown below: SELECT * FROM locations WHERE location LIKE 'midi-Pyr_n%ées'; -- what we want to do ERROR: nondeterministic collations are not supported for LIKE -- explicitly collate with the CS_AS collation and then use ILIKE to get the desired CI_AS behavior: -- note that the single-character wildcard '_' matches either e or é, as expected SELECT * FROM locations WHERE location COLLATE SQL_Latin1_General_CP1_CS_AS ILIKE 'midi-Pyr_n%ées'; location --------------- Midi-Pyrénées midi-Pyrénées midi-Pyrenées (3 rows) EXPLAIN SELECT * FROM locations WHERE location COLLATE SQL_Latin1_General_CP1_CS_AS ILIKE 'midi-Pyr_n%ées'; QUERY PLAN ------------------------------------------------------------------------------- Bitmap Heap Scan on locations (cost=8.90..20.77 rows=1 width=516) Filter: ((location)::text ~~* 'midi-Pyr_n%ées'::text) -> Bitmap Index Scan on location_index (cost=0.00..8.89 rows=150 width=0) (3 rows) ----- Jim Finnerty, AWS, Amazon Aurora PostgreSQL -- Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html