Обсуждение: how is text-equality handled in postgresql?

Поиск
Список
Период
Сортировка

how is text-equality handled in postgresql?

От
Gábor Farkas
Дата:
hi,

when i create an unique-constraint on a varchar field, how exactly
does postgresql compare the texts? i'm asking because in UNICODE there
are a lot of complexities about this..

or in other words, when are two varchars equal in postgres? when their
bytes are? or some algorithm is applied?

thanks,
gabor


Re: how is text-equality handled in postgresql?

От
Ivan Voras
Дата:
On 15/01/2014 10:10, Gábor Farkas wrote:
> hi,
>
> when i create an unique-constraint on a varchar field, how exactly
> does postgresql compare the texts? i'm asking because in UNICODE there
> are a lot of complexities about this..
>
> or in other words, when are two varchars equal in postgres? when their
> bytes are? or some algorithm is applied?

By default, it is "whatever the operating system thinks it's right".
PostgreSQL doesn't have its own collation code, it uses the OS's locale
support for this.

(which breaks on certain systems which don't have complete UTF-8 support
- I'm in favour of importing ICU at least as an optional dependancy,
similar to what the FreeBSD's patch does:
http://people.freebsd.org/~girgen/postgresql-icu/).


Вложения

Re: how is text-equality handled in postgresql?

От
Amit Langote
Дата:
On Wed, Jan 15, 2014 at 7:39 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> On 15/01/2014 10:10, Gábor Farkas wrote:
>> hi,
>>
>> when i create an unique-constraint on a varchar field, how exactly
>> does postgresql compare the texts? i'm asking because in UNICODE there
>> are a lot of complexities about this..
>>
>> or in other words, when are two varchars equal in postgres? when their
>> bytes are? or some algorithm is applied?
>
> By default, it is "whatever the operating system thinks it's right".
> PostgreSQL doesn't have its own collation code, it uses the OS's locale
> support for this.
>

Just to add to this, whenever strcoll() (a locale aware comparator)
says two strings are equal, postgres re-compares them using strcmp().
See following code snippet off
src/backend/utils/adt/varlena.c:varstr_cmp() -

#ifdef HAVE_LOCALE_T
                if (mylocale)
                        result = strcoll_l(a1p, a2p, mylocale);
                else
#endif
                        result = strcoll(a1p, a2p);

                /*
                 * In some locales strcoll() can claim that
nonidentical strings are
                 * equal.  Believing that would be bad news for a
number of reasons,
                 * so we follow Perl's lead and sort "equal" strings
according to
                 * strcmp().
                 */
                if (result == 0)
                        result = strcmp(a1p, a2p);

--
Amit Langote


Re: how is text-equality handled in postgresql?

От
Ivan Voras
Дата:
On 15/01/2014 12:36, Amit Langote wrote:
> On Wed, Jan 15, 2014 at 7:39 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>> On 15/01/2014 10:10, Gábor Farkas wrote:
>>> hi,
>>>
>>> when i create an unique-constraint on a varchar field, how exactly
>>> does postgresql compare the texts? i'm asking because in UNICODE there
>>> are a lot of complexities about this..
>>>
>>> or in other words, when are two varchars equal in postgres? when their
>>> bytes are? or some algorithm is applied?
>>
>> By default, it is "whatever the operating system thinks it's right".
>> PostgreSQL doesn't have its own collation code, it uses the OS's locale
>> support for this.
>>
>
> Just to add to this, whenever strcoll() (a locale aware comparator)
> says two strings are equal, postgres re-compares them using strcmp().
> See following code snippet off
> src/backend/utils/adt/varlena.c:varstr_cmp() -

>                 /*
>                  * In some locales strcoll() can claim that
> nonidentical strings are
>                  * equal.  Believing that would be bad news for a
> number of reasons,
>                  * so we follow Perl's lead and sort "equal" strings
> according to
>                  * strcmp().
>                  */
>                 if (result == 0)
>                         result = strcmp(a1p, a2p);

That seems odd and inefficient. Why would it be necessary? I would think
indexing (and other collation-sensitive operations) don't care what the
actual collation result is for arbitrary blobs of strings, as long as
they are stable?


Вложения

Re: how is text-equality handled in postgresql?

От
Amit Langote
Дата:
On Wed, Jan 15, 2014 at 9:02 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> On 15/01/2014 12:36, Amit Langote wrote:
>> On Wed, Jan 15, 2014 at 7:39 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>>> On 15/01/2014 10:10, Gábor Farkas wrote:
>>>> hi,
>>>>
>>>> when i create an unique-constraint on a varchar field, how exactly
>>>> does postgresql compare the texts? i'm asking because in UNICODE there
>>>> are a lot of complexities about this..
>>>>
>>>> or in other words, when are two varchars equal in postgres? when their
>>>> bytes are? or some algorithm is applied?
>>>
>>> By default, it is "whatever the operating system thinks it's right".
>>> PostgreSQL doesn't have its own collation code, it uses the OS's locale
>>> support for this.
>>>
>>
>> Just to add to this, whenever strcoll() (a locale aware comparator)
>> says two strings are equal, postgres re-compares them using strcmp().
>> See following code snippet off
>> src/backend/utils/adt/varlena.c:varstr_cmp() -
>
>>                 /*
>>                  * In some locales strcoll() can claim that
>> nonidentical strings are
>>                  * equal.  Believing that would be bad news for a
>> number of reasons,
>>                  * so we follow Perl's lead and sort "equal" strings
>> according to
>>                  * strcmp().
>>                  */
>>                 if (result == 0)
>>                         result = strcmp(a1p, a2p);
>
> That seems odd and inefficient. Why would it be necessary? I would think
> indexing (and other collation-sensitive operations) don't care what the
> actual collation result is for arbitrary blobs of strings, as long as
> they are stable?
>

This is the behavior since quite some time introduced by this commit

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad

--
Amit Langote


Re: how is text-equality handled in postgresql?

От
Ivan Voras
Дата:
On 15/01/2014 13:29, Amit Langote wrote:
> On Wed, Jan 15, 2014 at 9:02 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>> On 15/01/2014 12:36, Amit Langote wrote:

>>>                  * In some locales strcoll() can claim that
>>> nonidentical strings are
>>>                  * equal.  Believing that would be bad news for a
>>> number of reasons,
>>>                  * so we follow Perl's lead and sort "equal" strings
>>> according to
>>>                  * strcmp().
>>>                  */
>>>                 if (result == 0)
>>>                         result = strcmp(a1p, a2p);
>>
>> That seems odd and inefficient. Why would it be necessary? I would think
>> indexing (and other collation-sensitive operations) don't care what the
>> actual collation result is for arbitrary blobs of strings, as long as
>> they are stable?
>>
>
> This is the behavior since quite some time introduced by this commit
>
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad

Ok, the commit comment is:

"Adjust string comparison so that only bitwise-equal strings are considered
equal: if strcoll claims two strings are equal, check it with strcmp, and
sort according to strcmp if not identical.  This fixes inconsistent
behavior under glibc's hu_HU locale, and probably under some other locales
as well.  Also, take advantage of the now-well-defined behavior to speed up
texteq, textne, bpchareq, bpcharne: they may as well just do a bitwise
comparison and not bother with strcoll at all."

... so it's just another workaround for OS specific locale issues - to
me it looks like just another reason to use ICU.


Вложения

Re: how is text-equality handled in postgresql?

От
Tom Lane
Дата:
Ivan Voras <ivoras@freebsd.org> writes:
> On 15/01/2014 12:36, Amit Langote wrote:
>> Just to add to this, whenever strcoll() (a locale aware comparator)
>> says two strings are equal, postgres re-compares them using strcmp().

> That seems odd and inefficient. Why would it be necessary? I would think
> indexing (and other collation-sensitive operations) don't care what the
> actual collation result is for arbitrary blobs of strings, as long as
> they are stable?

If we didn't do it like this, we could not use hashing techniques for
text --- at least not unless we could find a hash function guaranteed
to yield the same values for any two strings that strcoll() claims are
equal.

            regards, tom lane


Re: how is text-equality handled in postgresql?

От
Vick Khera
Дата:

On Wed, Jan 15, 2014 at 4:10 AM, Gábor Farkas <gabor.farkas@gmail.com> wrote:
or in other words, when are two varchars equal in postgres? when their
bytes are? or some algorithm is applied?

On this topic, when I write my strings to the DB and search from the DB, should I canonicalize them first as NKFC (or some other), or just let the DB figure it out? In my specific case I use perl DBI with place holders to submit my queries.