Thread: [17] collation provider "builtin"

[17] collation provider "builtin"

From
Jeff Davis
Date:
The locale "C" (and equivalently, "POSIX") is not really a libc locale;
it's implemented internally with memcmp for collation and
pg_ascii_tolower, etc., for ctype.
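
As a rough illustration of the behaviour described above (a Python sketch, not PostgreSQL source; the names `c_collate_cmp` and `ascii_tolower` are made up for this note), the "C" locale can be thought of as comparing raw bytes the way memcmp would, with the shorter string sorting first when one is a prefix of the other, and as changing case only for ASCII letters:

```python
def c_collate_cmp(a: bytes, b: bytes) -> int:
    """Compare like memcmp over the common prefix, then by length."""
    n = min(len(a), len(b))
    for x, y in zip(a[:n], b[:n]):
        if x != y:
            return -1 if x < y else 1
    # Equal prefix: the shorter string sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


def ascii_tolower(data: bytes) -> bytes:
    """Lowercase only the bytes 'A'..'Z'; every other byte is unchanged."""
    return bytes(c + 32 if 0x41 <= c <= 0x5A else c for c in data)


# Uppercase ASCII sorts before lowercase because 'Z' (0x5A) < 'a' (0x61).
print(c_collate_cmp(b"Zebra", b"apple"))
# Bytes outside 'A'..'Z' (here the UTF-8 encoding of U+00C9) pass through.
print(ascii_tolower("\u00c9AB".encode("utf-8")))
```

Because neither function consults an external library, the result cannot change when libc or ICU is upgraded, which is the stability property the proposal relies on.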

The attached patch implements a new collation provider, "builtin",
which only supports "C" and "POSIX". It does not change the initdb
default provider, so it must be requested explicitly. The user will be
guaranteed that collations with provider "builtin" will never change
semantics; therefore they need no version and indexes are not at risk
of corruption. See previous discussion[1].

(Caveat: the "C" locale ordering may depend on the specific encoding.
For UTF-8, memcmp is equivalent to code point order, but that may not
be true of other encodings. Encodings can't change during pg_upgrade,
so indexes are not at risk; but the encoding can change during
dump/reload so results may change.)
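
The caveat can be checked with a small Python sketch. It assumes Python's `utf-8` and `koi8_r` codecs as stand-ins for the corresponding server encodings, and uses Python's own string comparison (which is by code point) as the reference ordering; the sample words are arbitrary:

```python
words = ["ю", "а", "в", "ц", "z", "A"]

# Python compares str values code point by code point.
by_code_point = sorted(words)

# Sorting by encoded bytes mimics memcmp on the stored representation.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_koi8r_bytes = sorted(words, key=lambda s: s.encode("koi8_r"))

print("UTF-8 matches code point order: ", by_utf8_bytes == by_code_point)
print("KOI8-R matches code point order:", by_koi8r_bytes == by_code_point)
print("KOI8-R byte order:", by_koi8r_bytes)
```

KOI8-R lays out Cyrillic letters in a different sequence than Unicode does, so the same data dumped from a KOI8-R database and reloaded into a UTF-8 one could sort differently under "C".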

This built-in provider is just here to support "C" and "POSIX" using
memcmp/pg_ascii_*, and no other locales. It is not intended as a
general license to take on the problem of maintaining locales. We may
support some other locale name to mean "code point order", but like
UCS_BASIC, that would just be an alias for locale "C" that also checks
that the encoding is UTF-8.

Motivation:

Why not just use the "C" locale with the libc provider?

1. It's more clear to the user what's going on: Postgres is managing
the provider; we aren't passing it on to libc at all. With the libc
provider, something like C.UTF-8 leaves room for confusion[2]; with the
built-in provider, "C.UTF-8" is not a supported locale and the user
will get an error if it's requested.

2. The libc provider conflates LC_COLLATE/LC_CTYPE with the default
collation; whereas in the icu and built-in providers, they are separate
concepts. With ICU and builtin, you can set LC_COLLATE and LC_CTYPE for
a database to whatever you want at creation time.

3. If you use libc with locale "C", then future CREATE DATABASE
commands will default to the libc provider (because that would be the
provider for template0), which is not what the user wants if the
purpose is to avoid problems with external collation providers. If you
use the built-in provider instead, then future CREATE DATABASE commands
will only succeed if the user either specifies locale C or explicitly
chooses a new provider; which will allow them a chance to prepare for
any challenges.

4. It makes it easier to document the trade-offs between various
providers without confusing special cases around the C locale.
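
Points 1 and 3 can be read as a simple validation rule. The following is a toy Python model of that rule only, not PostgreSQL code: `resolve_database_locale`, the `(provider, locale)` tuple, and the error text are all hypothetical, and real CREATE DATABASE has many more checks.

```python
BUILTIN_LOCALES = {"C", "POSIX"}  # the only locales the proposal supports


def resolve_database_locale(template, provider=None, locale=None):
    """Toy model: settings not given explicitly are inherited from the template.

    `template` is a hypothetical (provider, locale) pair.
    """
    tmpl_provider, tmpl_locale = template
    provider = provider or tmpl_provider
    locale = locale or tmpl_locale
    if provider == "builtin" and locale not in BUILTIN_LOCALES:
        raise ValueError(
            f'locale "{locale}" is not supported by provider "builtin"'
        )
    return provider, locale


template0 = ("builtin", "C")

# No options given: the new database inherits builtin / C.
print(resolve_database_locale(template0))

# Explicitly choosing another provider is allowed.
print(resolve_database_locale(template0, provider="icu", locale="en-US"))

# Asking for a different locale without changing provider is rejected.
try:
    resolve_database_locale(template0, locale="C.UTF-8")
except ValueError as err:
    print("rejected:", err)
```

The point of the model is that the user cannot drift onto an external provider by accident: leaving the built-in provider takes an explicit choice.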


[1]
https://www.postgresql.org/message-id/87sfb4gwgv.fsf%40news-spur.riddles.org.uk
[2]
https://www.postgresql.org/message-id/8a3dc06f-9b9d-4ed7-9a12-2070d8b0165f@manitou-mail.org


--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachments

Re: [17] collation provider "builtin"

From
Thomas Munro
Date:
On Thu, Jun 15, 2023 at 10:55 AM Jeff Davis <pgsql@j-davis.com> wrote:
> The locale "C" (and equivalently, "POSIX") is not really a libc locale;
> it's implemented internally with memcmp for collation and
> pg_ascii_tolower, etc., for ctype.
>
> The attached patch implements a new collation provider, "builtin",
> which only supports "C" and "POSIX". It does not change the initdb
> default provider, so it must be requested explicitly. The user will be
> guaranteed that collations with provider "builtin" will never change
> semantics; therefore they need no version and indexes are not at risk
> of corruption. See previous discussion[1].

I haven't studied the details yet but +1 for this idea.  It models
what we are actually doing.



Re: [17] collation provider "builtin"

From
Joe Conway
Date:
On 6/14/23 19:20, Thomas Munro wrote:
> On Thu, Jun 15, 2023 at 10:55 AM Jeff Davis <pgsql@j-davis.com> wrote:
>> The locale "C" (and equivalently, "POSIX") is not really a libc locale;
>> it's implemented internally with memcmp for collation and
>> pg_ascii_tolower, etc., for ctype.
>>
>> The attached patch implements a new collation provider, "builtin",
>> which only supports "C" and "POSIX". It does not change the initdb
>> default provider, so it must be requested explicitly. The user will be
>> guaranteed that collations with provider "builtin" will never change
>> semantics; therefore they need no version and indexes are not at risk
>> of corruption. See previous discussion[1].
> 
> I haven't studied the details yet but +1 for this idea.  It models
> what we are actually doing.

+1 agreed

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [17] collation provider "builtin"

From
Peter Eisentraut
Date:
On 15.06.23 00:55, Jeff Davis wrote:
> The locale "C" (and equivalently, "POSIX") is not really a libc locale;
> it's implemented internally with memcmp for collation and
> pg_ascii_tolower, etc., for ctype.
> 
> The attached patch implements a new collation provider, "builtin",
> which only supports "C" and "POSIX". It does not change the initdb
> default provider, so it must be requested explicitly. The user will be
> guaranteed that collations with provider "builtin" will never change
> semantics; therefore they need no version and indexes are not at risk
> of corruption. See previous discussion[1].

What happens if after this patch you continue to specify provider=libc 
and locale=C?  Do you then get the "slow" path?

Should there be some logic in pg_dump to change the provider if locale=C?

What is the transition plan?




Re: [17] collation provider "builtin"

From
Jeff Davis
Date:
On Fri, 2023-06-16 at 16:01 +0200, Peter Eisentraut wrote:
> What happens if after this patch you continue to specify
> provider=libc
> and locale=C?  Do you then get the "slow" path?

Users can continue to use the libc provider as they did before and the
fast path will still work.

> Should there be some logic in pg_dump to change the provider if
> locale=C?

That's not a part of this proposal.

> What is the transition plan?

The built-in provider is for users who want to choose a provider that
is guaranteed not to have the problems of an external provider
(versioning, tracking affected objects, corrupt indexes, and slow
performance). If they initialize with --locale-provider=builtin and
--locale=C, and they want to choose a different locale for another
database, they'll need to specifically choose libc or ICU. Of course
they can still use specific collations attached to columns or queries
as required.

It also acts as a nice complement to ICU (which doesn't support the C
locale) or a potential replacement for many uses of the libc provider
with the C locale. We can discuss later exactly how that will look, but
even if the builtin provider needs to be explicitly requested (as in
the current patch), it's still useful, so I don't think we need to
decide now.

We should also keep in mind that whatever provider is selected at
initdb time also becomes the default for future databases.

Regards,
    Jeff Davis




Re: [17] collation provider "builtin"

From
Jeff Davis
Date:
On Wed, 2023-06-14 at 15:55 -0700, Jeff Davis wrote:
> The locale "C" (and equivalently, "POSIX") is not really a libc
> locale;
> it's implemented internally with memcmp for collation and
> pg_ascii_tolower, etc., for ctype.
>
> The attached patch implements a new collation provider, "builtin",
> which only supports "C" and "POSIX".

Rebased patch attached.

I got some generally positive comments, but it needs some more feedback
on the specifics to be committable.

This might be a good time to summarize my thoughts on collation after
my work in v16:

* Picking a database default collation other than UCS_BASIC (a.k.a.
"C", a.k.a. memcmp(), a.k.a. provider=builtin) is something that should
be done intentionally. It's an impactful choice that affects semantics,
performance, and upgrades/deployment. Beyond that, our implementation
still lacks a good way to manage versions of collation provider
libraries and track object dependencies in a safe way to prevent index
corruption, so the safest choice is really just to use stable memcmp()
semantics.

* The defaults for initdb seem bad in a number of ways, but it's too
hard to change that default now (I tried in v16 and reverted it). So
the job of reasonable choices is left for higher-level tools and
documentation.

* We can handle the collation and character classification
independently. The main use case is to set the collation to memcmp()
semantics (for stability and performance) and set the character
classification to something interesting (on the grounds that it's more
likely to be stable and less likely to be used in an index than a
collation). Right now the only way to do that is to use the libc
provider and set the collation to C and the ctype to a libc locale. But
there is also a use case for having ICU as the provider for character
classification. One option is to have separate datcolprovider=b
(builtin provider) and datctypeprovider=i, so that the collation would
be handled with memcmp and the character classification with daticulocale.
It feels like we're growing the fields in pg_database a little too
much, but the use case seems valid, and perhaps we can reorganize the
catalog representation a bit.
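
To make the third point concrete, here is a small Python sketch of treating collation and character classification as independent choices. It is an analogy only: `sort_key_memcmp` and `ascii_upper` are hypothetical helpers, and Python's Unicode-aware `str.upper()` merely stands in for a richer ctype provider such as a libc locale or ICU.

```python
def sort_key_memcmp(s: str) -> bytes:
    """Collation: order by the UTF-8 bytes, as memcmp() would."""
    return s.encode("utf-8")


def ascii_upper(s: str) -> str:
    """ASCII-only ctype: only a-z are mapped to upper case."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


names = ["\u00e9mile", "Zoe", "adam"]  # "\u00e9" is e with acute accent

# Ordering uses stable byte comparison regardless of the ctype choice.
print(sorted(names, key=sort_key_memcmp))

# Case mapping can then be ASCII-only ...
print([ascii_upper(n) for n in names])

# ... or Unicode-aware, without affecting the sort order above.
print([n.upper() for n in names])
```

Only the sort key would feed an index, so the ordering stays stable even if the case-mapping rules are swapped for something more elaborate.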


--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachments

Re: [17] collation provider "builtin"

From
Jeff Davis
Date:
On Thu, 2023-06-15 at 15:08 -0400, Joe Conway wrote:
> > I haven't studied the details yet but +1 for this idea.  It models
> > what we are actually doing.
>
> +1 agreed

I am combining this discussion with my "built-in CTYPE provider"
proposal here:

https://www.postgresql.org/message-id/804eb67b37f41d3afeb2b6469cbe8bfa79c562cc.camel@j-davis.com

and the most recent patch is posted there. Having a built-in provider
is more useful if it also offers a "C.UTF-8" locale that is superior to
the libc locale of the same name.

Regards,
    Jeff Davis