Thread: insensitive collations

insensitive collations

From:
Peter Eisentraut
Date:
With various patches and discussions around collations going on, I
figured I'd send in my in-progress patch for insensitive collations.

This adds a flag "insensitive" to collations.  Such a collation disables
various optimizations that assume that strings are equal only if they
are byte-wise equal.  That then allows use cases such as
case-insensitive or accent-insensitive comparisons or handling of
strings with different Unicode normal forms.

So this doesn't actually make the collation case-insensitive or
anything, it just allows a library-provided collation that is, say,
case-insensitive to actually work that way.  So maybe "insensitive"
isn't the right name for this flag, but we can think about that.

The jobs of this patch, aside from some DDL extensions, are to track
collation assignment in plan types where it has so far been ignored,
and then to make the various collation-aware functions take the
insensitive flag into account.  In comparison functions this just means
skipping past the memcmp() optimizations.  In hashing functions, this
means converting the string to a sort key (think strxfrm()) before hashing.
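The hashing requirement can be sketched outside the database. In this minimal Python illustration, casefold() stands in for the collation's sort-key transform (think strxfrm() or ICU's sort keys); it is a stand-in for illustration, not the patch's code:

```python
# Hash joins and hash aggregation require that strings the collation
# considers equal produce equal hashes.  Hashing a "sort key" derived
# from the string, rather than its raw bytes, guarantees that.
def collation_hash(s: str, sort_key=str.casefold) -> int:
    # sort_key plays the role of the collation's strxfrm()-style
    # transform; casefold() approximates a case-insensitive collation
    return hash(sort_key(s).encode('utf-8'))
```

With this, collation_hash('ABC') equals collation_hash('abc'), whereas hashing the raw bytes would put the two strings in different hash buckets.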

Various pieces are incomplete, but the idea should be clear from this.
I have only implemented the ICU branch in hashtext(); the libc
provider branch still needs to be added (or maybe we won't want to).
The changes around the "name" type haven't been taken into account yet.
Foreign key support (see ri_GenerateQualCollation()) needs to be addressed.
More tests for all the different plans need to be added.  But in
principle it works quite well, as you can see in the tests added so far.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: insensitive collations

From:
"Daniel Verite"
Date:
    Peter Eisentraut wrote:

> So this doesn't actually make the collation case-insensitive or
> anything, it just allows a library-provided collation that is, say,
> case-insensitive to actually work that way.

That's great news!

> So maybe "insensitive" isn't the right name for this flag, but we
> can think about that.

To me it seems more natural to find a name for the other behavior, the
one that consists of overwriting the locale-sensitive equality with
the result of the byte-wise comparison. AFAIU the flag is meant
to say: "don't do that".

Some ideas that come to mind:

as an enum
 CREATE COLLATION ... ( [EQUALITY = 'natural' | 'bytewise' ] )
as a boolean
 CREATE COLLATION ... ( [NATURAL_EQUALITY = true | false ] )
  defaulting to false
or
 CREATE COLLATION ... ( [BYTEWISE_EQUALITY = true | false ] )
  defaulting to true


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: insensitive collations

From:
Tom Lane
Date:
"Daniel Verite" <daniel@manitou-mail.org> writes:
> To me it seems more natural to find a name for the other behavior, the
> one that consists of overwriting the locale-sensitive equality with
> the result of the byte-wise comparison. AFAIU the flag is meant
> to say: "don't do that".

> Some ideas that come to mind:

> as an enum
>  CREATE COLLATION ... ( [EQUALITY = 'natural' | 'bytewise' ] )
> as a boolean
>  CREATE COLLATION ... ( [NATURAL_EQUALITY = true | false ] )
>   defaulting to false
> or
>  CREATE COLLATION ... ( [BYTEWISE_EQUALITY = true | false ] )
>   defaulting to true

I don't really find it "natural" for equality to consider obviously
distinct values to be equal.  As a counterexample, the fact that IEEE
arithmetic treats 0 and -0 as equal seems to cause about as many problems
as it solves, and newcomers to float math certainly don't find it
"natural".  So I object to that particular naming.  Perhaps we could
do something around "uniqueness"?  That is, collations could have
a boolean property UNIQUE, default true, or something like that.

            regards, tom lane


Re: insensitive collations

From:
Peter Geoghegan
Date:
On Wed, Dec 19, 2018 at 6:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I don't really find it "natural" for equality to consider obviously
> distinct values to be equal.

The Unicode consortium calls our current behavior within comparisons
"deterministic comparisons" -- it's something they're not so keen on:

https://unicode.org/reports/tr10/#Deterministic_Comparison

I suggest using their terminology for our current behavior. I can see
where Peter was coming from with "natural", but it doesn't quite work.
One problem with that terminology is that it already refers to a sort
that sorts numbers as numbers. How about "linguistic", or "lexical"?
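For concreteness, TR10's deterministic comparison can be sketched in Python; casefold() is a stand-in for a real collator, and the final bytewise tie-breaker is exactly the part a nondeterministic collation would drop:

```python
# A "deterministic comparison" per Unicode TR10: compare under the
# collation first, then break ties among collation-equal strings with a
# bytewise comparison, so no two distinct strings ever compare equal.
def deterministic_cmp(a: str, b: str, key=str.casefold) -> int:
    ka, kb = key(a), key(b)          # collation-level comparison
    if ka != kb:
        return -1 if ka < kb else 1
    ab, bb = a.encode('utf-8'), b.encode('utf-8')
    if ab == bb:
        return 0
    return -1 if ab < bb else 1      # bytewise tie-breaker
```

Under the stand-in collation, 'ABC' and 'abc' are equal at the collation level, yet the tie-breaker makes deterministic_cmp('ABC', 'abc') nonzero; dropping the tie-breaker is what makes the comparison "insensitive".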

There is a significant cultural dimension to this. I suspect that you
don't find it "natural" that obviously distinct values compare as
equal because you're anglophone. That's the exact behavior you'll get
when using an unadorned en_US collation/locale, I think.

> As a counterexample, the fact that IEEE
> arithmetic treats 0 and -0 as equal seems to cause about as many problems
> as it solves, and newcomers to float math certainly don't find it
> "natural".  So I object to that particular naming.

FWIW, I don't think that your IEEE analogy quite works, because you're
talking about a property of a datatype. A collation is not intrinsic
to any collatable datatype. Fortunately, we're not required to agree
on what feels natural.

-- 
Peter Geoghegan


Re: insensitive collations

From:
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> The Unicode consortium calls our current behavior within comparisons
> "deterministic comparisons" -- it's something they're not so keen on:
> https://unicode.org/reports/tr10/#Deterministic_Comparison
> I suggest using their terminology for our current behavior.

Hm, it's not the greatest terminology perhaps, but if they're using it
then that makes it at least semi-standard.  I can live with that.

> FWIW, I don't think that your IEEE analogy quite works, because you're
> talking about a property of a datatype. A collation is not intrinsic
> to any collatable datatype. Fortunately, we're not required to agree
> on what feels natural.

Right, which is exactly why it'd be a bad idea to use "natural" as the
name for this property ...

            regards, tom lane


Re: insensitive collations

From:
"Daniel Verite"
Date:
    Tom Lane wrote:

> I don't really find it "natural" for equality to consider obviously
> distinct values to be equal.

According to https://www.merriam-webster.com/dictionary/natural
"natural" has no less than 15 meanings. The first in the list is
  "based on an inherent sense of right and wrong"
which I admit is not what we want to imply in this context.

The meaning that I was thinking about was close to definitions
4:    "following from the nature of the one in question "
or 7:    "having a specified character by nature "
or 13:    "closely resembling an original : true to nature"

When postgres uses the comparison from a collation
with no modification whatsoever, it's true to that collation.
When it changes the result from equal to non-equal, it's not.
If a collation says that "ABC" = "abc" and postgres says, mmh, OK
thanks but I'll go with "ABC" != "abc", then that denatures the
collation, in the sense of:
  "to deprive of natural qualities : change the nature of"
(https://www.merriam-webster.com/dictionary/denature)

Aside from that, I'd be +1 for "linguistic" as the opposite of
"bytewise", I think it tends to be easily understood when expressing
that a strcoll()-like function is used as opposed to a strcmp()-like
function.

I'm -1 for "deterministic" as a replacement for "bytewise". Even
if Unicode has chosen that term for exactly the behavior we're talking
about, it's heavily used in the more general sense of:
   "given a particular input, will always produce the same output"
 (quoted from https://en.wikipedia.org/wiki/Deterministic_algorithm)
which we very much expect from all our string comparisons no matter the
flags we may put on the collations. "bytewise" might be less academic
but it has less potential for wrong interpretations.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: insensitive collations

From:
Peter Eisentraut
Date:
Here is an updated patch.

I have updated the naming to "deterministic", as discussed.

I have fixed up support for the "name" type, added foreign key support,
psql, pg_dump support, more tests.  There are a couple of TODOs in
bpchar support that I need to look into a bit more.  But other than that
it's pretty complete.

Perhaps it's worth pointing out to new reviewers that the ICU tests can
be run like so:

    make check EXTRA_TESTS=collate.icu.utf8

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: insensitive collations

From:
"Daniel Verite"
Date:
    Peter Eisentraut wrote:

> Here is an updated patch.


When using GROUP BY and ORDER BY on a field with a non-deterministic
collation, this pops out:

CREATE COLLATION myfr (locale='fr-u-ks-level1',
  provider='icu', deterministic=false);

=# select n from (values ('été' collate "myfr"), ('ete')) x(n)
  group by 1 order by 1 ;
  n
-----
 ete
(1 row)

=#  select n from (values ('été' collate "myfr"), ('ete')) x(n)
  group by 1 order by 1 desc;
  n
-----
 été
(1 row)

The single-row output is different whether it's sorted in the ASC or
DESC direction, even though in theory, ORDER BY is done after GROUP
BY, where it shouldn't make that difference.
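A minimal Python sketch reproduces the effect; an accent-stripping key function is a crude stand-in for the myfr collation, and grouping keeps the first row it encounters, so the sort direction decides which representative survives:

```python
def myfr_key(s: str) -> str:
    # crude stand-in for the accent-insensitive fr-u-ks-level1 collation
    return s.replace('é', 'e')

def group_first(rows, reverse=False):
    # sort first, then keep the first row seen for each group key,
    # mimicking a plan that sorts before grouping
    groups = {}
    for r in sorted(rows, reverse=reverse):
        groups.setdefault(myfr_key(r), r)
    return list(groups.values())
```

group_first(['été', 'ete']) yields ['ete'] while group_first(['été', 'ete'], reverse=True) yields ['été'], matching the ASC/DESC difference above.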

EXPLAIN shows that the sort is done before grouping, which might
explain why it happens, but isn't that plan incorrect given the context?

postgres=# explain select n from (values ('été' collate "myfr"), ('ete'))
x(n)
  group by 1 order by 1  desc;
                                QUERY PLAN
--------------------------------------------------------------------------
 Group  (cost=0.04..0.04 rows=2 width=32)
   Group Key: "*VALUES*".column1
   ->  Sort  (cost=0.04..0.04 rows=2 width=32)
         Sort Key: "*VALUES*".column1 COLLATE myfr DESC
         ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=32)
(5 rows)


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: insensitive collations

From:
Peter Eisentraut
Date:
On 04/01/2019 17:05, Daniel Verite wrote:
> When using GROUP BY and ORDER BY on a field with a non-deterministic
> collation, this pops out:
> 
> CREATE COLLATION myfr (locale='fr-u-ks-level1',
>   provider='icu', deterministic=false);
> 
> =# select n from (values ('été' collate "myfr"), ('ete')) x(n)
>   group by 1 order by 1 ;
>   n  
> -----
>  ete
> (1 row)
> 
> =#  select n from (values ('été' collate "myfr"), ('ete')) x(n)
>   group by 1 order by 1 desc;
>   n  
> -----
>  été
> (1 row)

I don't see anything wrong here.  The collation says that both values
are equal, so which one is returned is implementation-dependent.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: insensitive collations

From:
Andreas Karlsson
Date:
On 12/28/18 9:55 AM, Peter Eisentraut wrote:
> Here is an updated patch.
> 
> I have updated the naming to "deterministic", as discussed.

Maybe this is orthogonal and best handled elsewhere, but have you given
any thought to Unicode normalization forms[1] when working with string
equality? I feel there are three sane ways to do Unicode string equality:

1) Binary equality
2) Binary equality after normalizing the unicode
3) Collation equality
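The three notions can be illustrated with a short Python sketch (the case/accent-stripping key in 3 is a crude stand-in for a real collation, not how a collator actually works):

```python
import unicodedata

a, b = 'é', 'e\u0301'   # precomposed vs. base letter plus combining mark

# 1) binary equality: the code point sequences differ
binary_eq = (a == b)

# 2) binary equality after normalizing both sides to NFC
nfc_eq = (unicodedata.normalize('NFC', a) ==
          unicodedata.normalize('NFC', b))

# 3) collation equality, approximated by a case/accent-insensitive key
def ci_key(s: str) -> str:
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed
                   if not unicodedata.combining(c)).casefold()

collation_eq = (ci_key('É') == ci_key(b))
```

Here binary_eq is False while nfc_eq and collation_eq are both True.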

Would there be any point in adding unicode normalization support into
the collation system, or is this best handled, for example, with a
function run on INSERT, or with something else entirely?

Right now PostgreSQL does not have any support for normalization forms as
far as I know.

1. http://unicode.org/reports/tr15/

Andreas


Re: insensitive collations

From:
"Daniel Verite"
Date:
    Peter Eisentraut wrote:

> > =# select n from (values ('été' collate "myfr"), ('ete')) x(n)
> >   group by 1 order by 1 ;
> >   n
> > -----
> >  ete
> > (1 row)
> >
> > =#  select n from (values ('été' collate "myfr"), ('ete')) x(n)
> >   group by 1 order by 1 desc;
> >   n
> > -----
> >  été
> > (1 row)
>
> I don't see anything wrong here.  The collation says that both values
> are equal, so which one is returned is implementation-dependent.

It is, but it's impractical if the output of seemingly the same GROUP BY
flip-flops between its different valid results. If it can't be avoided, then
okay. If it can be avoided at little cost, then it would be better to do it.

As a different example, the regression tests are somewhat counting on
this already. Consider this part:

+CREATE TABLE test3ci (x text COLLATE case_insensitive);
+INSERT INTO test1ci VALUES ('abc'), ('def'), ('ghi');
+INSERT INTO test2ci VALUES ('ABC'), ('ghi');
+INSERT INTO test3ci VALUES ('abc'), ('ABC'), ('def'), ('ghi');
...
+SELECT x, count(*) FROM test3ci GROUP BY x ORDER BY x;
+  x  | count
+-----+-------
+ abc |     2
+ def |     1
+ ghi |     1
+(3 rows)

If ABC was returned here instead of abc for whatever reason,
that would be correct strictly speaking, yet "make check" would fail.
That's impractical.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: insensitive collations

From:
Peter Eisentraut
Date:
On 09/01/2019 19:49, Andreas Karlsson wrote:
> On 12/28/18 9:55 AM, Peter Eisentraut wrote:
>> Here is an updated patch.
>>
>> I have updated the naming to "deterministic", as discussed.
> 
> Maybe this is orthogonal and best handled elsewhere, but have you given
> any thought to Unicode normalization forms[1] when working with string
> equality?

Nondeterministic collations do address this by allowing canonically
equivalent code point sequences to compare as equal.  You still need a
collation implementation that actually does compare them as equal; ICU
does this, glibc does not AFAICT.

> Would there be any point in adding unicode normalization support into
> the collation system, or is this best handled, for example, with a
> function run on INSERT, or with something else entirely?

I think there might be value in a feature that normalizes strings as
they enter the database, as a component of the encoding conversion
infrastructure.  But that would be a separate feature.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: insensitive collations

From:
Peter Eisentraut
Date:
On 09/01/2019 22:01, Daniel Verite wrote:
>> I don't see anything wrong here.  The collation says that both values
>> are equal, so which one is returned is implementation-dependent.
> It is, but it's impractical if the output of seemingly the same GROUP BY
> flip-flops between its different valid results. If it can't be avoided, then
> okay. If it can be avoided at little cost, then it would be better to do it.

But there is no concept of which one of these is the preferred variant,
so I don't see how the system is supposed to pick one and then stick to
it across separate query invocations.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: insensitive collations

From:
"Daniel Verite"
Date:
    Peter Eisentraut wrote:

> Here is an updated patch.

On a table with pre-existing contents, the creation of a unique index
does not seem to detect the duplicates that are equal per the
collation and different binary-wise.

postgres=# \d test3ci
              Table "public.test3ci"
 Column | Type |    Collation     | Nullable | Default
--------+------+------------------+----------+---------
 x      | text | case_insensitive |          |

postgres=# select * from test3ci;
  x
-----
 abc
 ABC
 def
 ghi
(4 rows)

postgres=# create unique index idx on test3ci(x);  -- EXPECTED TO FAIL
CREATE INDEX

postgres=# \d test3ci
              Table "public.test3ci"
 Column | Type |    Collation     | Nullable | Default
--------+------+------------------+----------+---------
 x      | text | case_insensitive |          |
Indexes:
    "idx" UNIQUE, btree (x)

postgres=# select count(*) from test3ci where x='abc';
 count
-------
     2
(1 row)

OTOH with an already existing unique index, attempts to insert
such duplicates are rejected as expected.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: insensitive collations

From:
Andreas Karlsson
Date:
On 1/10/19 8:44 AM, Peter Eisentraut wrote:
> On 09/01/2019 19:49, Andreas Karlsson wrote:
>> Maybe this is orthogonal and best handled elsewhere, but have you given
>> any thought to Unicode normalization forms[1] when working with string
>> equality?
> 
> Nondeterministic collations do address this by allowing canonically
> equivalent code point sequences to compare as equal.  You still need a
> collation implementation that actually does compare them as equal; ICU
> does this, glibc does not AFAICT.

Ah, right! You could use -ks-identic[1] for this.

>> Would there be any point in adding unicode normalization support into
>> the collation system, or is this best handled, for example, with a
>> function run on INSERT, or with something else entirely?
> 
> I think there might be value in a feature that normalizes strings as
> they enter the database, as a component of the encoding conversion
> infrastructure.  But that would be a separate feature.

Agreed. And if we ever implement this we could theoretically optimize 
the equality of -ks-identic to do a strcmp() rather than having to 
collate anything.

I think it could also be useful to add functions that can normalize
strings; that was proposed for the SQL standard but not accepted.[2]

Notes

1. http://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options
2. https://dev.mysql.com/worklog/task/?id=2048

Andreas



Re: insensitive collations

From:
"Daniel Verite"
Date:
    Andreas Karlsson wrote:

> > Nondeterministic collations do address this by allowing canonically
> > equivalent code point sequences to compare as equal.  You still need a
> > collation implementation that actually does compare them as equal; ICU
> > does this, glibc does not AFAICT.
>
> Ah, right! You could use -ks-identic[1] for this.

Strings that differ like that are considered equal even at this level:

postgres=# create collation identic  (locale='und-u-ks-identic',
    provider='icu', deterministic=false);
CREATE COLLATION

postgres=# select 'é' = E'e\u0301' collate "identic";
 ?column?
----------
 t
(1 row)


There's a separate setting "colNormalization", or "kk" in BCP 47

From
http://www.unicode.org/reports/tr35/tr35-collation.html#Normalization_Setting

  "The UCA always normalizes input strings into NFD form before the
  rest of the algorithm. However, this results in poor performance.
  With normalization=off, strings that are in [FCD] and do not contain
  Tibetan precomposed vowels (U+0F73, U+0F75, U+0F81) should sort
  correctly. With normalization=on, an implementation that does not
  normalize to NFD must at least perform an incremental FCD check and
  normalize substrings as necessary"

But even setting this to false does not mean that NFD and NFC forms
of the same text compare as different:

postgres=# create collation identickk  (locale='und-u-ks-identic-kk-false',
    provider='icu', deterministic=false);
CREATE COLLATION

postgres=# select 'é' = E'e\u0301' collate "identickk";
 ?column?
----------
 t
(1 row)

AFAIU such strings may only compare as different when they're not
in FCD form (http://unicode.org/notes/tn5/#FCD)

There are also ICU-specific explanations about FCD here:
http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm#Normalization

It looks like setting colNormalization to false might provide a
performance benefit when you know your contents are in FCD
form, which is mostly the case according to ICU:

  "Note that all NFD strings are in FCD, and in practice most NFC
  strings will also be in FCD; for that matter most strings (of whatever
  ilk) will be in FCD.
  We guarantee that if any input strings are in FCD, that we will get
  the right results in collation without having to normalize".
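There is no stdlib FCD check, but since Python 3.8 unicodedata.is_normalized() can cheaply test the standard forms, which illustrates the kind of pre-check involved (NFD shown here only as a proxy, since FCD is ICU-specific):

```python
import unicodedata

decomposed = 'e\u0301'    # base letter plus combining acute accent
precomposed = '\u00e9'    # the single code point é

nfd_ok = unicodedata.is_normalized('NFD', decomposed)    # already NFD
nfd_bad = unicodedata.is_normalized('NFD', precomposed)  # not NFD
nfc_ok = unicodedata.is_normalized('NFC', precomposed)   # already NFC
```

A collation implementation can skip normalization entirely for inputs that pass such a check, which is where the performance benefit comes from.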


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: insensitive collations

From:
Peter Eisentraut
Date:
On 14/01/2019 13:23, Daniel Verite wrote:
> On a table with pre-existing contents, the creation of a unique index
> does not seem to detect the duplicates that are equal per the
> collation and different binary-wise.

Fixed in the attached updated patch.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: insensitive collations

From:
"Daniel Verite"
Date:
    Peter Eisentraut wrote:

> > On a table with pre-existing contents, the creation of a unique index
> > does not seem to detect the duplicates that are equal per the
> > collation and different binary-wise.
>
> Fixed in the attached updated patch.

Check. I've found another issue with aggregates over distinct:
the deduplication seems to ignore the collation.

postgres=# select distinct x from test3ci;   -- OK
  x
-----
 def
 abc
 ghi
(3 rows)

postgres=# select count(distinct x) from test3ci;  -- not OK
 count
-------
     4
(1 row)

postgres=# select array_agg(distinct x) from test3ci;  -- not OK
     array_agg
-------------------
 {ABC,abc,def,ghi}
(1 row)


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: insensitive collations

From:
Peter Eisentraut
Date:
On 14/01/2019 15:37, Andreas Karlsson wrote:
>> Nondeterministic collations do address this by allowing canonically
>> equivalent code point sequences to compare as equal.  You still need a
>> collation implementation that actually does compare them as equal; ICU
>> does this, glibc does not AFAICT.
> Ah, right! You could use -ks-identic[1] for this.

That's the default.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: insensitive collations

From:
Peter Eisentraut
Date:
On 16/01/2019 14:20, Daniel Verite wrote:
> I've found another issue with aggregates over distinct:
> the deduplication seems to ignore the collation.

I have a fix for that.  I'll send it with the next update.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: insensitive collations

From:
Peter Eisentraut
Date:
On 16/01/2019 21:50, Peter Eisentraut wrote:
> On 16/01/2019 14:20, Daniel Verite wrote:
>> I've found another issue with aggregates over distinct:
>> the deduplication seems to ignore the collation.
> 
> I have a fix for that.  I'll send it with the next update.

Another patch.  This fixes your issue, and it incorporates the findings
from the thread "ExecBuildGroupingEqual versus collations", as well as a
few other fixes and more tests.

As far as I can tell, this covers everything now, meaning all the
relevant plan types propagate the collation correctly and all the
relevant operators and functions do the right things with them.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: insensitive collations

From:
"Daniel Verite"
Date:
    Peter Eisentraut wrote:

> Another patch.

+    <literal>ks</literal> key), in order for such such collations to act in
a

s/such such/such/

+   <para>
+    The pattern matching operators of all three kinds do not support
+    nondeterministic collations.  If required, apply a different collation
to
+    the expression to work around this limitation.
+   </para>

It's an important point of comparison between CI collations and
contrib/citext, since the latter diverts a bunch of functions/operators
to make them do case-insensitive pattern matching.
The doc for citext explains the rationale for using it versus text;
maybe it would now need to be expanded a bit with the pros and cons of
choosing citext versus non-deterministic collations.

The current patch doesn't alter a few string functions that could
potentially implement collation-aware string search, such as
replace(), strpos(), starts_with().
ISTM that we should not let these functions ignore the collation: they
ought to error out until we get their implementation to use the ICU
collation-aware string search.
FWIW I've been experimenting with usearch_openFromCollator() and
other usearch_* functions, and it looks doable to implement at least the
3 above functions based on that, even though the UTF16-ness of the API
does not favor us.
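The intended semantics of a collation-aware strpos() can be sketched naively in Python; casefold() is a stand-in for a real collator, and ICU's usearch_* machinery does the same job efficiently and correctly for full collations:

```python
# A naive illustration of collation-aware substring search: return the
# 1-based position of the first substring of text that compares equal
# to pattern under the key function, or 0 if there is none.  The
# quadratic scan is only to show the semantics: a match cannot in
# general be found by comparing code points one at a time, because the
# matching substring may differ from the pattern in length and bytes.
def ci_strpos(text: str, pattern: str, key=str.casefold) -> int:
    kp = key(pattern)
    for i in range(len(text) + 1):
        for j in range(i, len(text) + 1):
            if key(text[i:j]) == kp:
                return i + 1
    return 0
```

ci_strpos('The Quick fox', 'quick') returns 5, as strpos() would under a case-insensitive collation; note that ci_strpos('STRASSE', 'straße') finds a match of a different length, which is why a character-at-a-time matcher like the one in like_match.c cannot simply be reused.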

ICU also provides regexp matching, but not collation-aware, since
character-based patterns don't play well with the concept of collation.
About a potential collation-aware LIKE, it looks hard to implement,
since the algorithm currently used in like_match.c seems purely
character-based. AFAICS there's no way to plug calls to usearch_*
functions into it, it would need a separate redesign from scratch.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: insensitive collations

From:
Peter Eisentraut
Date:
On 30/01/2019 16:30, Daniel Verite wrote:
>     Peter Eisentraut wrote:
> 
>> Another patch. 

Another patch, with expanded documentation, more tests, and some code
cleanup.

> 
> +    <literal>ks</literal> key), in order for such such collations to act in
> a
> 
> s/such such/such/

fixed

> 
> +   <para>
> +    The pattern matching operators of all three kinds do not support
> +    nondeterministic collations.  If required, apply a different collation
> to
> +    the expression to work around this limitation.
> +   </para>
> 
> It's an important point of comparison between CI collations and
> contrib/citext, since the latter diverts a bunch of functions/operators
> to make them do case-insensitive pattern matching.
> The doc for citext explains the rationale for using it versus text,
> maybe it would need now to be expanded a bit with pros/cons of
> choosing citext versus non-deterministic collations.

possibly addressed by the expanded documentation

> The current patch doesn't alter a few string functions that could
> potentially implement collation-aware string search, such as
> replace(), strpos(), starts_with().
> ISTM that we should not let these functions ignore the collation: they
> ought to error out until we get their implementation to use the ICU
> collation-aware string search.

I have addressed that.  Many of these end up at the same one or two
low-level C function, so the checking just happens there.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: insensitive collations

From:
Peter Eisentraut
Date:
Another patch to address merge conflicts.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: insensitive collations

From:
Peter Geoghegan
Date:
On Tue, Feb 19, 2019 at 6:47 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> Another patch to address merge conflicts.

Some remarks on this:

* Your draft commit message says:

> This patch makes changes in three areas:
>
> - CREATE COLLATION DDL changes and system catalog changes to support
>   this new flag.
>
> - Many executor nodes and auxiliary code are extended to track
>   collations.  Previously, this code would just throw away collation
>   information, because the eventually-called user-defined functions
>   didn't use it since they only cared about equality, which didn't
>   need collation information.
>
> - String data type functions that do equality comparisons and hashing
>   are changed to take the (non-)deterministic flag into account.  For
>   comparison, this just means skipping various shortcuts and tie
>   breakers that use byte-wise comparison.  For hashing, we first need
>   to convert the input string to a canonical "sort key" using the ICU
>   analogue of strxfrm().

I wonder if it would be better to break this into distinct commits?

* Why is support for non-deterministic comparisons limited to the ICU
provider? If that's the only case that could possibly be affected,
then why did we ever add the varstrcmp() tie-breaker call to strcmp()
in the first place? The behavior described in the commit message of
bugfix commit 656beff5 describes a case where Hungarian text caused
index corruption by being strcoll()-wise equal but not bitwise equal.
Besides, I think that you can vendor your own case insensitive
collation with glibc, since it's based on UCA.

This restriction feels like it's actually a kludge to work around the
fact that database-wide collations have this foreign key related
restriction in your patch:

> @@ -2901,11 +2921,20 @@ ri_AttributesEqual(Oid eq_opr, Oid typeid,
>     }
>
>     /*
> -    * Apply the comparison operator.  We assume it doesn't care about
> -    * collations.
> -    */
> -   return DatumGetBool(FunctionCall2(&entry->eq_opr_finfo,
> -                                     oldvalue, newvalue));
> +    * Apply the comparison operator.
> +    *
> +    * Note: This function is part of a call stack that determines whether an
> +    * update to a row is significant enough that it needs checking or action
> +    * on the other side of a foreign-key constraint.  Therefore, the
> +    * comparison here would need to be done with the collation of the *other*
> +    * table.  For simplicity (e.g., we might not even have the other table
> +    * open), we'll just use the default collation here, which could lead to
> +    * some false negatives.  All this would break if we ever allow
> +    * database-wide collations to be nondeterministic.
> +    */
> +   return DatumGetBool(FunctionCall2Coll(&entry->eq_opr_finfo,
> +                                         DEFAULT_COLLATION_OID,
> +                                         oldvalue, newvalue));
>  }

The existing restriction on ICU collations (that they cannot be
database-wide) side-steps the issue.

* Can said restriction somehow be lifted? That seems like it would be
a lot cleaner.

* Why have you disabled this optimization?:

>     /* Fast pre-check for equality, as discussed in varstr_cmp() */
> -   if (len1 == len2 && memcmp(a1p, a2p, len1) == 0)
> +   if ((!sss->locale || sss->locale->deterministic) &&
> +       len1 == len2 && memcmp(a1p, a2p, len1) == 0)

I don't see why this is necessary. A non-deterministic collation
cannot indicate that bitwise identical strings are non-equal.

* Perhaps you should add a "Tip" referencing the feature to the
contrib/citext documentation.
--
Peter Geoghegan


Re: insensitive collations

From:
Peter Eisentraut
Date:
On 2019-02-21 03:17, Peter Geoghegan wrote:
> I wonder if it would be better to break this into distinct commits?

I thought about that.  Especially the planner/executor changes could be
done separately, sort of as a way to address the thread
"ExecBuildGroupingEqual versus collations".  But I'm not sure if they
would have good test coverage on their own.  I can work on this if
people think this would be useful.

> * Why is support for non-deterministic comparisons limited to the ICU
> provider? If that's the only case that could possibly be affected,
> then why did we ever add the varstrcmp() tie-breaker call to strcmp()
> in the first place? The behavior described in the commit message of
> bugfix commit 656beff5 describes a case where Hungarian text caused
> index corruption by being strcoll()-wise equal but not bitwise equal.
> Besides, I think that you can vendor your own case insensitive
> collation with glibc, since it's based on UCA.

The original test case (described here:
https://www.postgresql.org/message-id/27064.1134753128%40sss.pgh.pa.us)
no longer works, so it was probably fixed on the glibc side.  The git
log of the hu_HU locale definition shows that it has been "fixed" in
major ways several times over the years, so that's plausible.

I tried reproducing some more practical scenarios involving combining
diacritics, but glibc apparently doesn't believe strings in different
normal forms are equal.  At that point I gave up because this doesn't
seem worthwhile to support.

Moreover, I think allowing this would require a "trusted" strxfrm(),
which is currently disabled.

>> +   return DatumGetBool(FunctionCall2Coll(&entry->eq_opr_finfo,
>> +                                         DEFAULT_COLLATION_OID,
>> +                                         oldvalue, newvalue));
>>  }
> 
> The existing restriction on ICU collations (that they cannot be
> database-wide) side-steps the issue.
> 
> * Can said restriction somehow be lifted? That seems like it would be
> a lot cleaner.

Lift what restriction?  That ICU collations cannot be database-wide?
Sure that would be nice, but it's a separate project.  Even then, I'm
not sure that we would allow a database-wide collation to be
nondeterministic.  That would for example disallow the use of LIKE,
which would be weird.  In any case, the above issue can be addressed
then.  I think it's not worth complicating this right now.

> * Why have you disabled this optimization?:
> 
>>     /* Fast pre-check for equality, as discussed in varstr_cmp() */
>> -   if (len1 == len2 && memcmp(a1p, a2p, len1) == 0)
>> +   if ((!sss->locale || sss->locale->deterministic) &&
>> +       len1 == len2 && memcmp(a1p, a2p, len1) == 0)
> 
> I don't see why this is necessary. A non-deterministic collation
> cannot indicate that bitwise identical strings are non-equal.

Right, I went too far there.

> * Perhaps you should add a "Tip" referencing the feature to the
> contrib/citext documentation.

Good idea.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: insensitive collations

From
Peter Eisentraut
Date:
On 2019-02-21 09:36, Peter Eisentraut wrote:
>> * Why have you disabled this optimization?:
>>
>>>     /* Fast pre-check for equality, as discussed in varstr_cmp() */
>>> -   if (len1 == len2 && memcmp(a1p, a2p, len1) == 0)
>>> +   if ((!sss->locale || sss->locale->deterministic) &&
>>> +       len1 == len2 && memcmp(a1p, a2p, len1) == 0)
>> I don't see why this is necessary. A non-deterministic collation
>> cannot indicate that bitwise identical strings are non-equal.
> Right, I went too far there.
> 
>> * Perhaps you should add a "Tip" referencing the feature to the
>> contrib/citext documentation.
> Good idea.

Here is another patch that fixes these two points.

I have also worked on the tests hoping to appease the cfbot.

Older ICU versions (<54) don't support all the locale customization
options, so many of my new tests in collate.icu.utf8.sql will fail on
older systems.  What should we do about that?  Have another extra test file?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: insensitive collations

From
"Daniel Verite"
Date:
    Peter Eisentraut wrote:

[v7-0001-Collations-with-nondeterministic-comparison.patch]

+GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation)
 {
+  if (collation && !lc_ctype_is_c(collation) && collation != DEFAULT_COLLATION_OID)
+  {
+    pg_locale_t    locale = pg_newlocale_from_collation(collation);
+
+    if (locale && !locale->deterministic)
+      ereport(ERROR,
+       (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+        errmsg("nondeterministic collations are not supported for LIKE")));
+  }

This test gets side-stepped when pattern_fixed_prefix() in selfuncs.c
returns Pattern_Prefix_Exact, and the code optimizes the operation by
converting it to a bytewise equality test, or a bytewise range check
in the index with Pattern_Type_Prefix.

Here's a reproducer:

===
create collation ciai (locale='und-u-ks-level1', deterministic=false, provider='icu');

create table w(t text collate "C");

insert into w select md5(i::text) from generate_series(1,10000) as i;
insert into w values('abc');

create index indexname on w(t);

select t from w where t like 'ABC' collate ciai;
 t
---
(0 rows)

select t from w where t like 'ABC%' collate ciai;
 t
---
(0 rows)

===

For the LIKE operator, I think the fix should be that like_fixed_prefix()
should always return Pattern_Prefix_None for non-deterministic collations.

For regular expressions, pg_set_regex_collation() is called at some
point when checking for a potential prefix, and since it errors out with
non-deterministic collations, this issue is taken care of already.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: insensitive collations

From
"Daniel Verite"
Date:
    Peter Eisentraut wrote:

> Older ICU versions (<54) don't support all the locale customization
> options, so many of my new tests in collate.icu.utf8.sql will fail on
> older systems.  What should we do about that?  Have another extra test file?

Maybe stick to the old-style syntax for the regression tests?
The declarations that won't work as expected with older ICU versions
would be:

CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false);

'und-u-ks-level2' is equivalent to 'und@colStrength=secondary'

CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);

'und-u-ks-level1-kc-true' => 'und@colStrength=primary;colCaseLevel=yes'
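Spelled out as DDL, those old-style equivalents would be (an untested sketch, reusing the collation names from the test file):

```sql
CREATE COLLATION case_insensitive (provider = icu,
    locale = 'und@colStrength=secondary',                   -- old-style form of 'und-u-ks-level2'
    deterministic = false);

CREATE COLLATION ignore_accents (provider = icu,
    locale = 'und@colStrength=primary;colCaseLevel=yes',    -- old-style form of 'und-u-ks-level1-kc-true'
    deterministic = false);
```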


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: insensitive collations

From
Peter Eisentraut
Date:
On 2019-03-05 18:48, Daniel Verite wrote:
>> Older ICU versions (<54) don't support all the locale customization
>> options, so many of my new tests in collate.icu.utf8.sql will fail on
>> older systems.  What should we do about that?  Have another extra test file?
> Maybe stick to the old-style syntax for the regression tests?
> The declarations that won't work as expected with older ICU versions
> would be:
> 
> CREATE COLLATION case_insensitive (provider = icu, locale =
> 'und-u-ks-level2', deterministic = false);
> 
> 'und-u-ks-level2' is equivalent to 'und@colStrength=secondary'

The problem is not the syntax but that the older ICU versions don't
support the *functionality* of ks-level2 or colStrength=secondary.  If
you try it, you will simply get a normal case-sensitive behavior.

It would probably be possible to write all the tests for
nondeterministic collations without making use of this functionality,
using only canonically equivalent sequences as test data.  But that
would make the tests extremely weird and unintuitive, so I'd like to
avoid that.

After thinking about it a bit more, I think making a new test file is
reasonable.  The file is already fairly long anyway, compared to the
typical test file size.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: insensitive collations

From
"Daniel Verite"
Date:
    Peter Eisentraut wrote:

> The problem is not the syntax but that the older ICU versions don't
> support the *functionality* of ks-level2 or colStrength=secondary.  If
> you try it, you will simply get a normal case-sensitive behavior.

My bad, I see now that the "old locale extension syntax" was actually
introduced at the same time as the "language tag syntax" in ICU 54:
http://bugs.icu-project.org/trac/ticket/8260

With previous versions, we'd need to call ucol_setAttribute(),
with the attributes and values defined here:
http://icu-project.org/apiref/icu4c/ucol_8h.html
for instance to get colStrength=secondary:
  ucol_setAttribute(coll, UCOL_STRENGTH, UCOL_SECONDARY, &status);
which I've just checked gives the expected result with ICU-4.2.
These attributes are flagged as "Stable: ICU 2.0" up to
"Stable: ICU 2.8" (for UCOL_NUMERIC_COLLATION).

So if we really wanted to have these functionalities with pre-54 ICU,
we could but that would mean implementing an interface to pass
to CREATE COLLATION the attributes/values we want to support.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: insensitive collations

From
Peter Eisentraut
Date:
On 2019-03-04 15:58, Daniel Verite wrote:
> For the LIKE operator, I think the fix should be that like_fixed_prefix() 
> should always return Pattern_Prefix_None for non-deterministic collations.

Good catch.  I added the fix to match_pattern_prefix() instead.  That
would be the place to change if someone wanted to fix this properly in
the future.  While testing this, I also found a somewhat related problem
with _pattern_ops operator classes, which also need to be prohibited
with nondeterministic collations.

Updated patch attached, also with some merge conflicts resolved.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: insensitive collations

From
Peter Eisentraut
Date:
On 2019-03-07 20:04, Daniel Verite wrote:
> With previous versions, we'd need to call ucol_setAttribute(),
> with the attributes and values defined here:
> http://icu-project.org/apiref/icu4c/ucol_8h.html
> for instance to get colStrength=secondary:
>   ucol_setAttribute(coll, UCOL_STRENGTH , UCOL_SECONDARY, &status);
> which I've just checked gives the expected result with ICU-4.2.

I see.  I'm thinking about adding some ad hoc code to
pg_newlocale_from_collation() to parse these keywords ourselves, so we
can provide the same interface for old ICU versions.  I'll send a
separate patch for that.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: insensitive collations

From
Peter Eisentraut
Date:
On 2019-03-08 11:09, Peter Eisentraut wrote:
> On 2019-03-07 20:04, Daniel Verite wrote:
>> With previous versions, we'd need to call ucol_setAttribute(),
>> with the attributes and values defined here:
>> http://icu-project.org/apiref/icu4c/ucol_8h.html
>> for instance to get colStrength=secondary:
>>   ucol_setAttribute(coll, UCOL_STRENGTH , UCOL_SECONDARY, &status);
>> which I've just checked gives the expected result with ICU-4.2.
> 
> I see.  I'm thinking about adding some ad hoc code to
> pg_newlocale_from_collation() to parse these keywords ourselves, so we
> can provide the same interface for old ICU versions.  I'll send a
> separate patch for that.

Patches here.  This will allow all the existing collation customization
options as well as the ones being proposed in this thread to work in
older ICU versions.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: insensitive collations

From
Peter Eisentraut
Date:
On 2019-03-11 21:36, Peter Eisentraut wrote:
> Patches here.  This will allow all the existing collation customization
> options as well as the ones being proposed in this thread to work in
> older ICU versions.

This has been committed, and here is an updated main patch, which now
has a chance to pass the cfbot builds.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: insensitive collations

From
Peter Eisentraut
Date:
On 2019-03-18 00:19, Peter Eisentraut wrote:
> On 2019-03-11 21:36, Peter Eisentraut wrote:
>> Patches here.  This will allow all the existing collation customization
>> options as well as the ones being proposed in this thread to work in
>> older ICU versions.
> 
> This has been committed, and here is an updated main patch, which now
> has a chance to pass the cfbot builds.

The main patch has also been committed.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: insensitive collations

From
Jim Finnerty
Date:
Has any progress been made on supporting LIKE for nondeterministic
collations?

The pattern as well as the expression needs to use collation-aware character
comparisons, so for a suitable collation where ß compares equally to ss:

SELECT * from table WHERE name LIKE '%ß%'
yields
Brian Bruß
Steven Sossmix

and even if the pattern contains only single-byte UTF-8 characters, a
non-accented character in the ASCII range might compare equally to a two-byte
accented character in the first argument, so the comparisons as well as the
character-advancing logic must be collation-aware.  This seems to imply that
for the general nondeterministic case we need to rewrite the algorithm to
use ICU functions for advancing to the next character and for comparing
characters at the current position in the pattern and string.  Is that
accurate?

For a database with UTF8 encoding and a collation that is case-insensitive
but accent-sensitive, and where the pattern contains only single-byte
characters or wildcard characters, would LIKE and ILIKE be correct with the
current per-byte implementation, albeit without any index exploitation?
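For what it's worth, the ß/ss equivalence is easy to reproduce with plain equality, which shows why even character-at-a-time matching is not enough (a sketch assuming an ICU build; the collation name is illustrative):

```sql
CREATE COLLATION ci (provider = icu, locale = 'und-u-ks-level2', deterministic = false);

-- At secondary strength, case and other tertiary differences are ignored,
-- so 'ß' should compare equal to 'ss':
SELECT 'Bruß' = 'Bruss' COLLATE ci;   -- expected: true
-- Note the operands have different lengths: one pattern character may have
-- to consume two input characters, so LIKE would need collation-aware stepping.
```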




-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL
--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html



Re: insensitive collations

From
Jim Finnerty
Date:
For a UTF8 encoded, case-insensitive (nondeterministic), accent-sensitive ICU
collation, a LIKE predicate can be used with a small transformation of the
predicate, and the pattern can contain multi-byte characters:

from:

SELECT * FROM locations WHERE location LIKE 'midi-Pyrené%';
-- ERROR:  nondeterministic collations are not supported for LIKE

to:

SELECT * FROM locations WHERE lower(location) COLLATE "C" LIKE lower('midi-Pyrené%');

and if there is an index defined as follows:

CREATE INDEX location_LOWER_C_index ON
    locations (LOWER(location) COLLATE "C");

then the LIKE predicate above performs the desired CI_AS evaluation and also
exploits the index:

EXPLAIN VERBOSE SELECT * FROM locations WHERE LOWER(location) COLLATE "C" LIKE LOWER('midi-Pyrené%');

                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Index Scan using location_LOWER_C_index on public.locations  (cost=0.13..8.16 rows=1 width=18)
   Output: location
   Index Cond: (((lower((locations.location)::text))::text >= 'midi-pyren?'::text) AND ((lower((locations.location)::text))::text < 'midi-pyren?'::text))
   Filter: ((lower((locations.location)::text))::text ~~ 'midi-pyren?%'::text)
(4 rows)


It turns out that CI_AS represents the vast majority (more than 99.9%) of
nondeterministic collations that we are seeing in Babelfish, because
SQL_Latin1_General_CP1_CI_AS is the default collation in SQL Server.

Currently nondeterministic collations are disabled at the database level.
The cited reasons were the lack of LIKE support and the fact that certain
catalog views use LIKE.  That may still need to be a limitation if those
LIKE predicates currently have an index exploitation, unless we can create a
LOWER(col) COLLATE "C" index on the catalog.  Which catalog views were the
ones that had the problem?




-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL



Re: insensitive collations

From
"Daniel Verite"
Date:
Jim Finnerty wrote:

> For a UTF8 encoded, case-insensitive (nondeterministic), accent-sensitive
> ICU
> collation, a LIKE predicate can be used with a small transformation of the
> predicate, and the pattern can contain multi-byte characters:
>
> from:
>
> SELECT * FROM locations WHERE location LIKE 'midi-Pyrené%';
> -- ERROR:  nondeterministic collations are not supported for LIKE
>
> to:
>
> SELECT * FROM locations WHERE lower(location) COLLATE "C" LIKE lower('midi-Pyrené%');

For prefix matching, there's a simpler way with non-deterministic
collations based on the advice in [1]

The trick is that if an ICU collation is assigned to "location",
whether it's deterministic or not,

SELECT * FROM locations WHERE location LIKE 'midi-Pyrené%';

is equivalent to:

SELECT * FROM locations WHERE location BETWEEN
   'midi-Pyrené' AND 'midi-Pyrené' || E'\uFFFF';

and that will use a btree index if available.

Also, it works with all features of ND-collations and all encodings, not
just case-insensitiveness and UTF-8.

Now that doesn't solve LIKE '%midi-Pyrené%', or LIKE '%midi_Pyrené%',
but that trick could be a building block for an algorithm implementing
LIKE with ND-collations in the future.

[1]

https://unicode-org.github.io/icu/userguide/collation/architecture.html#generating-bounds-for-a-sort-key-prefix-matching


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite



Re: insensitive collations

From
"Daniel Verite"
Date:
    Jim Finnerty wrote:

> Currently nondeterministic collations are disabled at the database level.

Deterministic ICU collations are also disabled.

> The cited reason was because of the lack of LIKE support and because certain
> catalog views use LIKE.

But the catalogs shouldn't use the default collation of the database.

Commit 586b98fdf1aaef4a27744f8b988479aad4bd9a01 provides some details about this:

Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:    Wed Dec 19 17:35:12 2018 -0500

    Make type "name" collation-aware.

    The "name" comparison operators now all support collations, making them
    functionally equivalent to "text" comparisons, except for the different
    physical representation of the datatype.  They do, in fact, mostly share
    the varstr_cmp and varstr_sortsupport infrastructure, which has been
    slightly enlarged to handle the case.

    To avoid changes in the default behavior of the datatype, set name's
    typcollation to C_COLLATION_OID not DEFAULT_COLLATION_OID, so that
    by default comparisons to a name value will continue to use strcmp
    semantics.    (This would have been the case for system catalog columns
    anyway, because of commit 6b0faf723, but doing this makes it true for
    user-created name columns as well.    In particular, this avoids
    locale-dependent changes in our regression test results.)

    In consequence, tweak a couple of places that made assumptions about
    collatable base types always having typcollation DEFAULT_COLLATION_OID.
    I have not, however, attempted to relax the restriction that user-
    defined collatable types must have that.  Hence, "name" doesn't
    behave quite like a user-defined type; it acts more like a domain
    with COLLATE "C".  (Conceivably, if we ever get rid of the need for
    catalog name columns to be fixed-length, "name" could actually become
    such a domain over text.  But that'd be a pretty massive undertaking,
    and I'm not volunteering.)

    Discussion: https://postgr.es/m/15938.1544377821@sss.pgh.pa.us


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite



Re: insensitive collations

From
Jim Finnerty
Date:
The two ideas can be combined to create a workaround for accent-sensitive
nondeterministic collations that enables an ordinary btree to be exploited
if available, and also provides the full LIKE logic in either case-sensitive
or case-insensitive collations:

SELECT * FROM locations WHERE location LIKE 'midi-Pyrén%ées';

becomes:

SELECT * FROM locations
WHERE
    location COLLATE "C" ILIKE 'midi-Pyrén%ées' AND                  -- For CI collations only
    location BETWEEN 'midi-Pyrén' AND 'midi-Pyrén' || E'\uFFFF';     -- exploitable by ordinary btree

LIKE would also be valid for a case-sensitive collation, but then the
collation would be CS_AS, so it would be deterministic and no transform
would be needed.

The expression above produces a good plan, but EXPLAIN complains if the
concatenated string contains a character that has no equivalent in the
current client_encoding, which I had set to WIN1252 to display the accented
characters properly on the client:
babel=# SELECT * FROM locations
babel-# WHERE
babel-#     location COLLATE "C" ILIKE 'midi-Pyrén%ées' AND                  -- For CI collations only
babel-#     location BETWEEN 'midi-Pyrén' AND 'midi-Pyrén' || E'\uFFFF';     -- exploitable by ordinary btree
    location
-----------------
 Midi-Pyrénées
 midi-Pyrénées
(2 rows)

babel=# EXPLAIN VERBOSE SELECT * FROM locations
babel-# WHERE
babel-#     location COLLATE "C" ILIKE 'midi-Pyrén%ées' AND                  -- For CI collations only
babel-#     location BETWEEN 'midi-Pyrén' AND 'midi-Pyrén' || E'\uFFFF';     -- exploitable by ordinary btree
ERROR:  character with byte sequence 0xef 0xbf 0xbf in encoding "UTF8" has no equivalent in encoding "WIN1252"

reset client_encoding;

babel=# EXPLAIN VERBOSE SELECT * FROM locations
WHERE
    location COLLATE "C" ILIKE 'midi-Pyrén%ées' AND                  -- For CI collations only
    location BETWEEN 'midi-Pyrén' AND 'midi-Pyrén' || E'\uFFFF';     -- exploitable by ordinary btree
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Index Only Scan using *location_index* on public.locations  (cost=0.13..8.15 rows=1 width=18)
   Output: location
   Index Cond: ((locations.location >= 'midi-Pyrén'::text) AND (locations.location <= 'midi-Pyrén�'::text))
   Filter: ((locations.location)::text ~~* 'midi-Pyrén%ées'::text)
(4 rows)




-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL



Re: insensitive collations

From
Jim Finnerty
Date:
My previous post had a flaw, but fixing that flaw led me to what may be a
bug?

Using column COLLATE "C" ILIKE pattern almost does what you'd like, but the
single-character wildcard is treated as a single byte with "C", and that
won't match a multi-byte character.  The fix, I thought, would be to use a
deterministic CS_AS ICU collation, since we can handle the per-character
advance correctly in that case (we think).  Well, maybe not.  It looks like
single-character wildcards using a deterministic ICU collation match a
single byte instead of a single character:

This creates a deterministic case-sensitive, accent-sensitive collation in a
utf8-encoded database:

SET client_encoding = WIN1252;
CREATE COLLATION CS_AS (
    provider = icu,
    locale = 'utf8@colStrength=secondary;colCaseLevel=yes',
    deterministic = true);

CREATE TABLE locations (location VARCHAR(255) COLLATE CS_AS);
CREATE INDEX location_index ON locations (location);

INSERT INTO locations VALUES ('Franche-Comté')
                        , ('Midi-Pyrénées')
                        , ('midi-Pyrénées')
                        , ('midi-Pyrenées')
                        , ('Brian Bruß')
                        , ('Brian Bruss')
                        , ('Steven Sossmix')
                        , ('Provence-Alpes-Côte d Azur');

postgres=# SELECT * FROM locations WHERE location LIKE 'Franche-Comté';
    location
----------------
 Franche-Comté
(1 row)

postgres=# SELECT * FROM locations WHERE location LIKE 'Franche-Comt_';   -- is this a bug?
 location
----------
(0 rows)

postgres=# SELECT * FROM locations WHERE location LIKE 'Franche-Comt__';
-- the wildcard is applied byte by byte instead of character by character,
-- so the 2-byte accented character is matched only by 2 '_'s
    location
----------------
 Franche-Comté
(1 row)





-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL



Re: insensitive collations

From
"Daniel Verite"
Date:
    Jim Finnerty wrote:

> SET client_encoding = WIN1252;
> [...]
> postgres=# SELECT * FROM locations WHERE location LIKE 'Franche-Comt__';  --
> the wildcard is applied byte by byte instead of character by character, so
> the 2-byte accented character is matched only by 2 '_'s
>    location
> ----------------
> Franche-Comté
> (1 row)

The most plausible explanation is that the client-side text is encoded
in UTF-8, rather than WIN1252 as declared.

If you added length('Franche-Comté') to the above query, I suspect
it would show that the string is one character longer than
expected, and octet_length('Franche-Comté') would be
two bytes longer than expected.

Also dumping the contents of the "location" column with
convert_to() would show that the accents have been
wrongly translated, if the explanation of the encoding snafu is
correct.
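Concretely, those checks might look like this (a sketch; the expected values
assume 'é' is the only non-ASCII character in the string):

```sql
-- With a correctly declared UTF-8 client:
SELECT length('Franche-Comté')       AS chars,   -- expected: 13
       octet_length('Franche-Comté') AS bytes;   -- expected: 14 ('é' is 2 bytes in UTF-8)

-- If UTF-8 bytes were sent while client_encoding was WIN1252, the server
-- stored 'Ã©' in place of 'é', i.e. one extra character and two extra bytes.
SELECT location, convert_to(location, 'UTF8') FROM locations;
```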


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite



Re: insensitive collations

From
Jim Finnerty
Date:
You were exactly right.  With client_encoding set to UTF8 it now works
correctly.

The transformation to do the LIKE or ILIKE on a nondeterministic ICU
collation CI_AS, for which there is an otherwise identical deterministic
collation CS_AS, is as shown below:

SELECT * FROM locations WHERE location LIKE 'midi-Pyr_n%ées';  -- what we want to do
ERROR:  nondeterministic collations are not supported for LIKE

-- explicitly collate with the CS_AS collation and then use ILIKE to get the desired CI_AS behavior:
-- note that the single-character wildcard '_' matches either e or é, as expected

SELECT * FROM locations WHERE location COLLATE SQL_Latin1_General_CP1_CS_AS ILIKE 'midi-Pyr_n%ées';
   location
---------------
 Midi-Pyrénées
 midi-Pyrénées
 midi-Pyrenées
(3 rows)

EXPLAIN SELECT * FROM locations WHERE location COLLATE SQL_Latin1_General_CP1_CS_AS ILIKE 'midi-Pyr_n%ées';
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Bitmap Heap Scan on locations  (cost=8.90..20.77 rows=1 width=516)
   Filter: ((location)::text ~~* 'midi-Pyr_n%ées'::text)
   ->  Bitmap Index Scan on location_index  (cost=0.00..8.89 rows=150 width=0)
(3 rows)



-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL