Discussion: [17] collation provider "builtin"
The locale "C" (and equivalently, "POSIX") is not really a libc locale; it's implemented internally with memcmp for collation and pg_ascii_tolower, etc., for ctype.

The attached patch implements a new collation provider, "builtin", which only supports "C" and "POSIX". It does not change the initdb default provider, so it must be requested explicitly. The user will be guaranteed that collations with provider "builtin" will never change semantics; therefore they need no version and indexes are not at risk of corruption. See previous discussion[1].

(Caveat: the "C" locale ordering may depend on the specific encoding. For UTF-8, memcmp is equivalent to code point order, but that may not be true of other encodings. Encodings can't change during pg_upgrade, so indexes are not at risk; but the encoding can change during dump/reload so results may change.)

This built-in provider is just here to support "C" and "POSIX" using memcmp/pg_ascii_*, and no other locales. It is not intended as a general license to take on the problem of maintaining locales. We may support some other locale name to mean "code point order", but like UCS_BASIC, that would just be an alias for locale "C" that also checks that the encoding is UTF-8.

Motivation:

Why not just use the "C" locale with the libc provider?

1. It's clearer to the user what's going on: Postgres is managing the provider; we aren't passing it on to libc at all. With the libc provider, something like C.UTF-8 leaves room for confusion[2]; with the built-in provider, "C.UTF-8" is not a supported locale and the user will get an error if it's requested.

2. The libc provider conflates LC_COLLATE/LC_CTYPE with the default collation; whereas in the icu and built-in providers, they are separate concepts. With ICU and builtin, you can set LC_COLLATE and LC_CTYPE for a database to whatever you want at creation time.

3. If you use libc with locale "C", then future CREATE DATABASE commands will default to the libc provider (because that would be the provider for template0), which is not what the user wants if the purpose is to avoid problems with external collation providers. If you use the built-in provider instead, then future CREATE DATABASE commands will only succeed if the user either specifies locale C or explicitly chooses a new provider; which will allow them a chance to prepare for any challenges.

4. It makes it easier to document the trade-offs between various providers without confusing special cases around the C locale.

[1] https://www.postgresql.org/message-id/87sfb4gwgv.fsf%40news-spur.riddles.org.uk
[2] https://www.postgresql.org/message-id/8a3dc06f-9b9d-4ed7-9a12-2070d8b0165f@manitou-mail.org

--
Jeff Davis
PostgreSQL Contributor Team - AWS
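To illustrate the encoding caveat in the message above, here is a minimal sketch in Python (not PostgreSQL code). It sorts a few strings by their encoded bytes, which is what a memcmp-based comparison amounts to, and compares the result with code point order. The helper name `bytewise_order` and the sample strings are made up for illustration; ISO-8859-15 (LATIN9) is picked only as one example of a single-byte encoding whose byte values do not follow code point order.

```python
# Sketch: compare byte-wise (memcmp-style) ordering with code point ordering.
# bytewise_order and the sample data are hypothetical, for illustration only.

def bytewise_order(strings, encoding):
    """Sort strings by their encoded bytes, as a memcmp comparison would."""
    return sorted(strings, key=lambda s: s.encode(encoding))

# "z", EURO SIGN (U+20AC), "a", y with diaeresis (U+00FF), e with acute (U+00E9)
words = ["z", "\u20ac", "a", "\u00ff", "\u00e9"]

# Python compares str values code point by code point.
code_point_order = sorted(words)

utf8_order = bytewise_order(words, "utf-8")
latin9_order = bytewise_order(words, "iso8859-15")

print("code point:", code_point_order)
print("UTF-8     :", utf8_order)
print("LATIN9    :", latin9_order)
```

In UTF-8 the two orders agree. In ISO-8859-15 the euro sign is stored as byte 0xA4, so it sorts before U+00E9 and U+00FF even though its code point is higher, which is the kind of difference a dump/reload into another encoding could expose.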
On Thu, Jun 15, 2023 at 10:55 AM Jeff Davis <pgsql@j-davis.com> wrote:
> The locale "C" (and equivalently, "POSIX") is not really a libc locale; it's implemented internally with memcmp for collation and pg_ascii_tolower, etc., for ctype.
>
> The attached patch implements a new collation provider, "builtin", which only supports "C" and "POSIX". It does not change the initdb default provider, so it must be requested explicitly. The user will be guaranteed that collations with provider "builtin" will never change semantics; therefore they need no version and indexes are not at risk of corruption. See previous discussion[1].

I haven't studied the details yet but +1 for this idea. It models what we are actually doing.
On 6/14/23 19:20, Thomas Munro wrote:
> On Thu, Jun 15, 2023 at 10:55 AM Jeff Davis <pgsql@j-davis.com> wrote:
>> The locale "C" (and equivalently, "POSIX") is not really a libc locale; it's implemented internally with memcmp for collation and pg_ascii_tolower, etc., for ctype.
>>
>> The attached patch implements a new collation provider, "builtin", which only supports "C" and "POSIX". It does not change the initdb default provider, so it must be requested explicitly. The user will be guaranteed that collations with provider "builtin" will never change semantics; therefore they need no version and indexes are not at risk of corruption. See previous discussion[1].
>
> I haven't studied the details yet but +1 for this idea. It models what we are actually doing.

+1 agreed

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 15.06.23 00:55, Jeff Davis wrote:
> The locale "C" (and equivalently, "POSIX") is not really a libc locale; it's implemented internally with memcmp for collation and pg_ascii_tolower, etc., for ctype.
>
> The attached patch implements a new collation provider, "builtin", which only supports "C" and "POSIX". It does not change the initdb default provider, so it must be requested explicitly. The user will be guaranteed that collations with provider "builtin" will never change semantics; therefore they need no version and indexes are not at risk of corruption. See previous discussion[1].

What happens if after this patch you continue to specify provider=libc and locale=C? Do you then get the "slow" path?

Should there be some logic in pg_dump to change the provider if locale=C?

What is the transition plan?
On Fri, 2023-06-16 at 16:01 +0200, Peter Eisentraut wrote:
> What happens if after this patch you continue to specify provider=libc and locale=C? Do you then get the "slow" path?

Users can continue to use the libc provider as they did before and the fast path will still work.

> Should there be some logic in pg_dump to change the provider if locale=C?

That's not a part of this proposal.

> What is the transition plan?

The built-in provider is for users who want to choose a provider that is guaranteed not to have the problems of an external provider (versioning, tracking affected objects, corrupt indexes, and slow performance). If they initialize with --locale-provider=builtin and --locale=C, and they want to choose a different locale for another database, they'll need to specifically choose libc or ICU. Of course they can still use specific collations attached to columns or queries as required.

It also acts as a nice complement to ICU (which doesn't support the C locale) or a potential replacement for many uses of the libc provider with the C locale. We can discuss later exactly how that will look, but even if the builtin provider needs to be explicitly requested (as in the current patch), it's still useful, so I don't think we need to decide now.

We should also keep in mind that whatever provider is selected at initdb time also becomes the default for future databases.

Regards,
Jeff Davis
On Wed, 2023-06-14 at 15:55 -0700, Jeff Davis wrote:
> The locale "C" (and equivalently, "POSIX") is not really a libc locale; it's implemented internally with memcmp for collation and pg_ascii_tolower, etc., for ctype.
>
> The attached patch implements a new collation provider, "builtin", which only supports "C" and "POSIX".

Rebased patch attached. I got some generally positive comments, but it needs some more feedback on the specifics to be committable.

This might be a good time to summarize my thoughts on collation after my work in v16:

* Picking a database default collation other than UCS_BASIC (a.k.a. "C", a.k.a. memcmp(), a.k.a. provider=builtin) is something that should be done intentionally. It's an impactful choice that affects semantics, performance, and upgrades/deployment. Beyond that, our implementation still lacks a good way to manage versions of collation provider libraries and track object dependencies in a safe way to prevent index corruption, so the safest choice is really just to use stable memcmp() semantics.

* The defaults for initdb seem bad in a number of ways, but it's too hard to change that default now (I tried in v16 and reverted it). So the job of reasonable choices is left for higher-level tools and documentation.

* We can handle the collation and character classification independently. The main use case is to set the collation to memcmp() semantics (for stability and performance) and set the character classification to something interesting (on the grounds that it's more likely to be stable and less likely to be used in an index than a collation). Right now the only way to do that is to use the libc provider and set the collation to C and the ctype to a libc locale. But there is also a use case for having ICU as the provider for character classification. One option is to have separate datcolprovider=b (builtin provider) and datctypeprovider=i, so that the collation would be handled with memcmp and the character classification with daticulocale. It feels like we're growing the fields in pg_database a little too much, but the use case seems valid, and perhaps we can reorganize the catalog representation a bit.

--
Jeff Davis
PostgreSQL Contributor Team - AWS
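The last point above, treating collation and character classification as independent choices, can be sketched roughly in Python (not PostgreSQL code). Ordering uses the UTF-8 bytes, as memcmp would, while lowercasing is done two ways: an ASCII-only helper in the spirit of pg_ascii_tolower, and Python's built-in `str.lower()` standing in for a richer ctype provider. The names `memcmp_sort` and `ascii_lower` and the sample data are assumptions made for this illustration.

```python
# Sketch: ordering and case mapping chosen independently of each other.
# memcmp_sort and ascii_lower are hypothetical helpers, not PostgreSQL APIs.

def memcmp_sort(strings):
    """Order strings by their UTF-8 bytes, i.e. memcmp-style collation."""
    return sorted(strings, key=lambda s: s.encode("utf-8"))

def ascii_lower(s):
    """Lowercase only A-Z; every other character is left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

names = ["\u00c9mile", "zoe", "Adam"]  # "\u00c9" is E with acute accent

# Collation: stable byte order, independent of any locale library.
sorted_names = memcmp_sort(names)

# Character classification, option 1: ASCII-only case mapping.
ascii_lowered = [ascii_lower(n) for n in sorted_names]

# Character classification, option 2: Unicode-aware case mapping
# (str.lower() is a stand-in here for something like an ICU-backed ctype).
unicode_lowered = [n.lower() for n in sorted_names]

print(sorted_names)
print(ascii_lowered)
print(unicode_lowered)
```

The sort order is the same whichever case mapping is used; only the handling of the non-ASCII capital differs, which is why the two concerns can be configured separately.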
On Thu, 2023-06-15 at 15:08 -0400, Joe Conway wrote:
> > I haven't studied the details yet but +1 for this idea. It models what we are actually doing.
>
> +1 agreed

I am combining this discussion with my "built-in CTYPE provider" proposal here:

https://www.postgresql.org/message-id/804eb67b37f41d3afeb2b6469cbe8bfa79c562cc.camel@j-davis.com

and the most recent patch is posted there. Having a built-in provider is more useful if it also offers a "C.UTF-8" locale that is superior to the libc locale of the same name.

Regards,
Jeff Davis