Re: Dealing with collation and strcoll/strxfrm/etc
From | Peter Geoghegan |
---|---|
Subject | Re: Dealing with collation and strcoll/strxfrm/etc |
Date | |
Msg-id | CAM3SWZR=Mz9rc=L0F+XmyYSZbQ+39KQqfS=RV3YKbpyAfp9vAw@mail.gmail.com |
In response to | Dealing with collation and strcoll/strxfrm/etc (Stephen Frost <sfrost@snowman.net>) |
Responses |
Re: Dealing with collation and strcoll/strxfrm/etc
(Stephen Frost <sfrost@snowman.net>)
|
List | pgsql-hackers |
On Mon, Mar 28, 2016 at 7:57 AM, Stephen Frost <sfrost@snowman.net> wrote:
> If we're going to talk about minimum requirements, I'd like to argue
> that we require whatever system we're using to have versioning (which
> glibc currently lacks, as I understand it...) to avoid the risk that
> indexes will become corrupt when whatever we're using for collation
> changes. I'm pretty sure that's already bitten us on at least some
> RHEL6 -> RHEL7 migrations in some locales, even forgetting the issues
> with strcoll vs. strxfrm.

I totally agree that anything we adopt should support versioning. Glibc does have a non-standard versioning scheme, but we don't use it. Other stdlibs may do versioning another way, or not at all. In a world in which ICU is the de facto standard for Postgres (i.e. the actual standard on all major platforms), we mostly just have one thing to target, which seems like something to aim for.

Collations change from time to time, legitimately. Read from "Collation order is not fixed", here:

http://unicode.org/reports/tr10/#Stability

The question is only how we deal with this when it happens. One thing that's attractive about ICU is that it makes this explicit, both for the logical behavior of a collation and for the stability of binary sort keys (Glibc's versioning seemingly just does the former). So the equivalent of strxfrm() output has license to change for technical reasons that are orthogonal to the practical concerns of end-users about how text sorts in their locale. ICU is clear on what it takes to make binary sort keys in indexes work. And various major database systems rely on this being right.

> Regarding key abbreviation and performance, if we are confident that
> strcoll and strxfrm are at least independently internally consistent
> then we could consider offering an option to choose between them.

I think they just need to match, per the standard. After all, abbreviation will sometimes require strcoll() tie-breakers.
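
To illustrate why the two functions need to match, here is a minimal sketch in Python rather than Postgres code. It assumes Python's `locale.strcoll()` and `locale.strxfrm()` wrappers around the C library; the names `ABBREV_LEN`, `consistent`, `abbrev_key`, and `abbrev_compare` are hypothetical and exist only for this example. An abbreviated key is a truncated strxfrm() result, and when two prefixes tie, the comparison falls back to strcoll(). That fallback is only sound if both functions imply the same ordering.

```python
import locale
from functools import cmp_to_key

# Hypothetical abbreviation length, chosen only for illustration.
ABBREV_LEN = 8


def sign(n):
    """Reduce a comparator result to -1, 0, or 1."""
    return (n > 0) - (n < 0)


def consistent(a, b):
    """True if strcoll() and a comparison of strxfrm() keys agree for (a, b)."""
    ka, kb = locale.strxfrm(a), locale.strxfrm(b)
    return sign(locale.strcoll(a, b)) == (ka > kb) - (ka < kb)


def abbrev_key(s):
    """Truncated transformed key: cheap to compare, but lossy."""
    return locale.strxfrm(s)[:ABBREV_LEN]


def abbrev_compare(a, b):
    """Compare abbreviated keys first; use strcoll() only to break ties."""
    ka, kb = abbrev_key(a), abbrev_key(b)
    if ka != kb:
        return -1 if ka < kb else 1
    # The prefixes tie, so the abbreviated keys cannot decide the order.
    return sign(locale.strcoll(a, b))


if __name__ == "__main__":
    # The "C" locale is used so the sketch runs anywhere; swap in a real
    # locale name to probe a particular platform's behavior.
    locale.setlocale(locale.LC_COLLATE, "C")
    words = ["abbreviation-b", "abbreviation-a", "abc", "ab"]
    print(sorted(words, key=cmp_to_key(abbrev_compare)))
    print(all(consistent(a, b) for a in words for b in words))
```

If `consistent()` ever returns false for a pair of strings in some locale, a sort driven by `abbrev_compare()` can disagree with one driven purely by strcoll(), which is the kind of mismatch that would make abbreviated keys unsafe.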
Clearly it would be very naive to imagine that ICU is bug-free. However, I surmise that there is a large difference in how ICU and glibc think about things like strxfrm() or strcoll() stability and consistency. Tom was able to demonstrate that strxfrm() and strcoll() behaved inconsistently without too much effort, contrary to POSIX, and in many common cases. I doubt that the Glibc maintainers are all that concerned about it. Certainly, less concerned than they are about the latest security bug. Whereas if this happened in ICU, it would be a total failure of the project to fulfill its most basic goals. Our disaster would also be a disaster for several other major database systems.

ICU carefully and explicitly considers multiple forms of stability, "deterministic" sort ordering, etc. That *is* a big difference, and it makes me optimistic that there'd be far fewer problems.

I also think that ICU could be a reasonable basis for case-insensitive collations, which would let us kill citext, a module that I consider to be a total kludge. And, we might also be able to lock down WAL compatibility, which would be generally useful.

--
Peter Geoghegan