Обсуждение: [HACKERS] What users can do with custom ICU collations in Postgres 10
There are actually very many customizations to collations that are possible beyond what the "stock" ICU collations provide (whatever "stock" means). Some of these are really cool, and I can imagine use cases where they are very compelling that have nothing to do with internationalization (such customizations are how we should eventually implement case-insensitive collations, once the infrastructure for doing that without breaking hashing is in place). I'd like to give a demo on what is already possible, but not currently documented. I didn't see anyone else comment on this, including Peter E (maybe I missed that?). We should improve the documentation in this area, to get this into the hands of users. Say we're unhappy that numbers come first, which we see here: postgres=# select * from (select '1a' i union select '1b' union select '1c' union select 'a1' union select 'b2' union select 'c3') j order by i collate "en-x-icu";i ────1a1b1ca1b2c3 (6 rows) We may do this to get our desired sort order: postgres=# create collation digitlast (provider=icu, locale='en-u-kr-latn-digit'); CREATE COLLATION postgres=# select * from (select '1a' i union select '1b' union select '1c' union select 'a1' union select 'b2' union select 'c3') j order by i collate "digitlast";i ────a1b2c31a1b1c (6 rows) Note that 'kr' is a specific BCP47 Key [1]. Many different options can be set in this manner. Let's say we are unhappy with the fact that capital letters sort higher than lowercase: postgres=# select * from (select 'B' i union select 'b' union select 'A' union select 'a') j order by i collate "en-x-icu";i ───aAbB (4 rows) ICU provides a solution here, too: postgres=# create collation capitalfirst (provider=icu, locale='en-u-kf-upper'); CREATE COLLATION postgres=# select * from (select 'B' i union select 'b' union select 'A' union select 'a') j order by i collate "capitalfirst";i ───AaBb (4 rows) And, yes -- you can even *combine* these two options by creating a third custom collation. That can be spelled 'en-u-kf-upper-kr-latn-digit', in case you were wondering. Users have certainly complained about not liking this or that aspect of how glibc sorts text many times over the years, particularly around things like how whitespace and punctuation are handled, which they can now do something about [2]. Users can also have numbers sort like numbers should when compared against other numbers, by using the numericOrdering option (not shown). numericOrdering would be great for things like alphanumeric invoice numbers, or the alphanumeric car registration plate numbers that are used in certain countries [3], with fixed number/letter fields. These options are very powerful. [1] http://unicode.org/reports/tr35/tr35-collation.html#Setting_Options [2] http://unicode.org/reports/tr35/tr35-collation.html#Common_Settings [3] https://en.wikipedia.org/wiki/Vehicle_registration_plates_of_the_Republic_of_Ireland -- Peter Geoghegan
On 8/9/17 18:49, Peter Geoghegan wrote: > There are actually very many customizations to collations that are > possible beyond what the "stock" ICU collations provide (whatever > "stock" means). This is very nice indeed, and I was not aware that this was possible with what we already have in place. I'm having trouble finding some concrete documentation for this. The TR 35 link you showed documents the key words and values, BCP 47 documents the syntax, but nothing puts it all together in a form consumable by users. The ICU documentation still mainly focuses on the "old" @keyword=value syntax. I guess we'll have to write our own for now. Given that we cannot reasonably preload all these new variants that you demonstrated, I think it would make sense to drop all the keyword variants from the preloaded set. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Aug 14, 2017 at 9:15 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > I'm having trouble finding some concrete documentation for this. The TR > 35 link you showed documents the key words and values, BCP 47 documents > the syntax, but nothing puts it all together in a form consumable by > users. The ICU documentation still mainly focuses on the "old" > @keyword=value syntax. I guess we'll have to write our own for now. There is an unusual style to the standards that apply here. It's incredibly detailed, and the options are very powerful, but it's in an unfamiliar language. ICU just considers itself a consumer of the CLDR locale stuff, which is a broad standard. We don't have to write comprehensive documentation of these kn/kb/ka/kh options that I pointed out exist. I think it would be nice to cover a few interesting cases, and link to the BCP 47 Unicode extension (TR 35) stuff. Here is a list of scripts, that are all reorderable with this TR 35 stuff (varies somewhat based on CLDR/ICU version): http://unicode.org/iso15924/iso15924-codes.html Here is a CLDR specific XML specification of the variant keywords (can be mapped to specific ICU version easily): http://www.unicode.org/repos/cldr/tags/release-31/common/bcp47/collation.xml > Given that we cannot reasonably preload all these new variants that you > demonstrated, I think it would make sense to drop all the keyword > variants from the preloaded set. Cool. While I am of course in favor of this, I actually understand very well why you had initdb add them. I think that removing them creates a discoverability problem that cannot easily be fixed through documentation. ISTM that we ought to also add an SQL-callable function that lists the most common keyword variants. Some of those are specific to one or two locales, such as traditional Spanish, or the alternative sort orders for Han characters. What do you think of that idea? I guess an alternative idea is to just link to that XML document (collation.xml), which exactly specifies the variants. Users can get the "co" variants there. Should be for the most part obvious which one is interesting to which locale, since there is not that many "co" variants to choose from, and users will probably know what to look for if they look at all. -- Peter Geoghegan
On 10 August 2017 at 06:49, Peter Geoghegan <pg@bowt.ie> wrote:
There are actually very many customizations to collations that are
possible beyond what the "stock" ICU collations provide (whatever
"stock" means). Some of these are really cool, and I can imagine use
cases where they are very compelling that have nothing to do with
internationalization (such customizations are how we should eventually
implement case-insensitive collations, once the infrastructure for
doing that without breaking hashing is in place).
I'd like to give a demo on what is already possible, but not currently
documented. I didn't see anyone else comment on this, including Peter
E (maybe I missed that?). We should improve the documentation in this
area, to get this into the hands of users.
Say we're unhappy that numbers come first, which we see here:
Ooh, this finally gives us a path toward case-insensitive default database collation via CLDR caseLevel.
That *definitely* should be documented and exposed by initdb.
On Tue, Aug 15, 2017 at 11:10 AM, Craig Ringer <craig@2ndquadrant.com> wrote: > Ooh, this finally gives us a path toward case-insensitive default database > collation via CLDR caseLevel. > > http://userguide.icu-project.org/collation > http://www.unicode.org/reports/tr35/tr35-collation.html#Algorithm_Case > > That *definitely* should be documented and exposed by initdb. The addition of an interface to initdb smells like an item for v11~, but the documentation for 10 could be improved in this sense? -- Michael
On 15 August 2017 at 10:16, Michael Paquier <michael.paquier@gmail.com> wrote:
On Tue, Aug 15, 2017 at 11:10 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> Ooh, this finally gives us a path toward case-insensitive default database
> collation via CLDR caseLevel.
>
> http://userguide.icu-project.org/collation
> http://www.unicode.org/reports/tr35/tr35-collation. html#Algorithm_Case
>
> That *definitely* should be documented and exposed by initdb.
The addition of an interface to initdb smells like an item for v11~,
but the documentation for 10 could be improved in this sense?
Yeah, not suggesting changing it for Pg10, way too late.
It's also not enough for case-insensitive DB by its self, since we still do binary compares for equality. There'd need to be deeper surgery to make it work. So I'm getting prematurely excited here.
On 8/14/17 12:15, Peter Eisentraut wrote: > Given that we cannot reasonably preload all these new variants that you > demonstrated, I think it would make sense to drop all the keyword > variants from the preloaded set. After playing with this a bit, I'm having some doubts. While the "k" keys from TR 35 are algorithmic parameters that apply to all locales and can be looked up in the respective documents, I don't find any way a user can discover what collation types ("co") are available for a locale. Any ideas? If there isn't one, I think we need to provide one. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 8/9/17 18:49, Peter Geoghegan wrote: > I'd like to give a demo on what is already possible, but not currently > documented. I didn't see anyone else comment on this, including Peter > E (maybe I missed that?). We should improve the documentation in this > area, to get this into the hands of users. Here is a small piece of documentation. Thoughts? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
On Tue, Aug 15, 2017 at 11:19 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 8/14/17 12:15, Peter Eisentraut wrote: >> Given that we cannot reasonably preload all these new variants that you >> demonstrated, I think it would make sense to drop all the keyword >> variants from the preloaded set. > > After playing with this a bit, I'm having some doubts. While the "k" > keys from TR 35 are algorithmic parameters that apply to all locales and > can be looked up in the respective documents, I don't find any way a > user can discover what collation types ("co") are available for a > locale. Any ideas? If there isn't one, I think we need to provide one. I wanted to do that too, but Tom didn't seem sold on it yesterday. He called it v11 material over on the ICU bug thread. All of the unicode "u" extensions are documented per-CLDR version as an XML file. For example: http://www.unicode.org/repos/cldr/tags/release-31/common/bcp47/collation.xml This isn't ideal, because only some of the "co" variants change things for all possible base collations. But, there isn't that many "co" options to choose from, and I think that for the most part it's reasonably obvious which one is desirable. For example, Chinese people are probably well aware of what Pinyin is, and what stroke is. Things like EOR and search are much more esoteric, but also much less useful. So, I wouldn't hate it if this was the only way that users could discover the variants in v10. -- Peter Geoghegan
On Tue, Aug 15, 2017 at 11:33 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 8/9/17 18:49, Peter Geoghegan wrote: >> I'd like to give a demo on what is already possible, but not currently >> documented. I didn't see anyone else comment on this, including Peter >> E (maybe I missed that?). We should improve the documentation in this >> area, to get this into the hands of users. > > Here is a small piece of documentation. Thoughts? This looks pretty good, but I do have some feedback: * "23.2.2.3. Copying Collations" suggests that the only use of CREATE COLLATION is copying collations, which is far from true with ICU. We should change that at the same time as this change is made. I think that just changing the title would improve the overall flow of the page. * Maybe add an example of numeric ordering -- the "alphanumeric invoice" case, where you want text containing numbers to have the numbers sort as numbers iff the comparison is to be resolved when comparing numbers. I think that that's really useful, and worth specifically calling out. I definitely would have used that had it been available ten years ago. * Let's use "en-u-kr-others-digit" instead of "en-u-kr-latn-digit' in the example. It makes no real difference to us English speakers, but means that the example works the same for those that use a different alphabet. It's more culturally neutral. * If we end up having initdb put all locales rather than all collations in pg_collation, which I think is very likely, then we can put in a link to ICU's locale explorer web resource: https://ssl.icu-project.org/icu-bin/locexp?d_=en&_=en_HK This lets the user see exactly what they'll get from a base locale using an intuitive interface (assuming it matches their CLDR version). -- Peter Geoghegan
On 8/15/17 15:04, Peter Geoghegan wrote: > * "23.2.2.3. Copying Collations" suggests that the only use of CREATE > COLLATION is copying collations, which is far from true with ICU. We > should change that at the same time as this change is made. I think > that just changing the title would improve the overall flow of the > page. I don't understand why that has to be changed and how. > * Maybe add an example of numeric ordering -- the "alphanumeric > invoice" case, where you want text containing numbers to have the > numbers sort as numbers iff the comparison is to be resolved when > comparing numbers. I think that that's really useful, and worth > specifically calling out. I definitely would have used that had it > been available ten years ago. done, quite useful > * Let's use "en-u-kr-others-digit" instead of "en-u-kr-latn-digit' in > the example. It makes no real difference to us English speakers, but > means that the example works the same for those that use a different > alphabet. It's more culturally neutral. I follow what you are saying, but that locale string is not accepted by CREATE COLLATION. > * If we end up having initdb put all locales rather than all > collations in pg_collation, which I think is very likely, then we can > put in a link to ICU's locale explorer web resource: > > https://ssl.icu-project.org/icu-bin/locexp?d_=en&_=en_HK > > This lets the user see exactly what they'll get from a base locale > using an intuitive interface (assuming it matches their CLDR version). done Patch has been posted to another thread. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services