Обсуждение: Proposal - Collation at database level

Поиск
Список
Период
Сортировка

Proposal - Collation at database level

От
Radek Strnad
Дата:
Hello, 
I'm working on implementation of collation at database level using
system locales as a Google Summer of Code 2008 project. You can read my
proposal on the wiki page -
http://wiki.postgresql.org/wiki/Gsoc08-collation . I'm building this
over Alexey Slynko's patch sent two years ago
(http://www.activebait.net/msg00019.html). Currently I'm in stage of
creating catalogs for collations, repertoires, encodings and charsets. 
Because of every single system is containing different locales we have
to guarantee at least those specified in SQL standard. SQL 2003 standard
can be downloaded at http://www.wiscorp.com/sql_2003_standard.zip (~18
MB). File 5wd-02-foundation-2003-09.pdf is specifying the foundation of
collations. One or more collations must be specified out of these: 

— SQL_CHARACTER is an implementation-defined collation. It is applicable
to the SQL_CHARACTER character repertoire.
— GRAPHIC_IRV is a collation in which the ordering is determined by
treating the code points defined by ISO 646:1991 as unsigned integers. It is applicable to the GRAPHIC_IRV
character repertoire.
— LATIN1 is a collation in which the ordering is determined by treating
the code points defined by ISO 8859- 1 as unsigned integers. It is applicable to the LATIN1 character
repertoire.
— ISO8BIT is a collation in which the ordering is determined by treating
the code points defined by ISO 8859-1 as unsigned integers. When restricted to the LATIN1 characters,
it produces the same collation as LATIN1. It is applicable to the ISO8BIT character repertoire.
— UCS_BASIC is a collation in which the ordering is determined entirely
by the Unicode scalar values of the characters in the strings being sorted. It is applicable to the
UCS character repertoire. Since every character repertoire is a subset of the UCS repertoire, the UCS_BASIC
collation is potentially applicable to every character set. NOTE 11 — The Unicode scalar value of a character is its
codepoint
 
treated as an unsigned integer.
— UNICODE is the collation in which the ordering is determined by
applying the Unicode Collation Algorithm with the Default Unicode Collation Element Table, as specified in
[Unicode10]. It is applicable to the UCS character repertoire. Since every character repertoire is a subset of
the UCS repertoire, the UNICODE collation is potentially applicable to every character set.
— SQL_TEXT is an implementation-defined collation. It is applicable to
the SQL_TEXT character repertoire.
— SQL_IDENTIFIER is an implementation-defined collation. It is
applicable to the SQL_IDENTIFIER character repertoire.

I'm thinking of dividing the problem into two parts - in beginning
pg_collation will contain two functions. One will have hard-coded rules
for these basic collations (SQL_CHARACTER, GRAPHIC_IRV, LATIN1, ISO8BIT,
UCS_BASIC). It will compare each string character bitwise and guarantee
that the implementation will meet the SQL standard implemented in
PostgreSQL. 

Second one will allow the user to use installed system locales. The set
of these collations will obviously vary between systems. Catalogs will
contain encoding and collation for calling the system locale function.
This will allow us to use collations such as en_US.utf8, cs_CZ.iso88592
etc. if they will be availible.

We will also need to change the way how strings are compared. Regarding
the set database collation the right function will be used.
http://doxygen.postgresql.org/varlena_8c.html#4c7af81f110f9be0bd8eb2bd99525675

This design will make possible switch to ICU or any other implementation
quite simple and will not cause any major rewriting of what I'm coding
right now.

Catalogs specification with SQL 2003 standard SQL commands for creating
tables follows:

=============================
pg_repertoires
=============================
CREATE TABLE CHARACTER_REPERTOIRES (   CHARACTER_REPERTOIRE_NAME      INFORMATION_SCHEMA.SQL_IDENTIFIER     CONSTRAINT
CHARACTER_REPERTOIRE_NAME_NOT_NULL      NOT NULL,   DEFAULT_COLLATION_CATALOG      INFORMATION_SCHEMA.SQL_IDENTIFIER
CONSTRAINT
 
CHARACTER_REPERTOIRES_DEFAULT_COLLATION_CATALOG_NOT_NULL       NOT NULL,   DEFAULT_COLLATION_SCHEMA
INFORMATION_SCHEMA.SQL_IDENTIFIER    CONSTRAINT CHARACTER_REPERTOIRES_DEFAULT_COLLATION_SCHEMA_NOT_NULL       NOT NULL,
 DEFAULT_COLLATION_NAME         INFORMATION_SCHEMA.SQL_IDENTIFIER     CONSTRAINT
CHARACTER_REPERTOIRES_DEFAULT_COLLATION_NAME_NOT_NULL      NOT NULL,   CONSTRAINT CHARACTER_REPERTOIRES_PRIMARY_KEY
PRIMARYKEY ( CHARACTER_REPERTOIRE_NAME ),   CONSTRAINT CHARACTER_REPERTOIRES_FOREIGN_KEY_COLLATIONS     FOREIGN KEY (
DEFAULT_COLLATION_CATALOG,DEFAULT_COLLATION_SCHEMA,                   DEFAULT_COLLATION_NAME )       REFERENCES
COLLATIONS  )
 


CATALOG(pg_repertoires, ###)
{NameData    repname;        /* repertoire name */Oid        repdefcolloid;        /* default collation catalog */Oid
    repdefcolschema;    /* default collation schema */NameData    repsysname;        /* used repertoire - system or
hard-coded*/
 

} FormData_pg_repertoires;

=============================
pg_collation
=============================
CREATE TABLE COLLATIONS (   COLLATION_CATALOG             INFORMATION_SCHEMA.SQL_IDENTIFIER,   COLLATION_SCHEMA
    INFORMATION_SCHEMA.SQL_IDENTIFIER,   COLLATION_NAME                INFORMATION_SCHEMA.SQL_IDENTIFIER,
PAD_ATTRIBUTE                INFORMATION_SCHEMA.CHARACTER_DATA     CONSTRAINT COLLATIONS_PAD_ATTRIBUTE_CHECK
CHECK( PAD_ATTRIBUTE IN               ( 'NO PAD', 'PAD SPACE' ) ),   COLLATION_TYPE
INFORMATION_SCHEMA.SQL_IDENTIFIER,  COLLATION_DEFINITION          INFORMATION_SCHEMA.CHARACTER_DATA,
COLLATION_DICTIONARY         INFORMATION_SCHEMA.CHARACTER_DATA,   CHARACTER_REPERTOIRE_NAME
INFORMATION_SCHEMA.SQL_IDENTIFIER    CONSTRAINT CHARACTER_REPERTOIRE_NAME_NOT_NULL       NOT NULL,   CONSTRAINT
COLLATIONS_PRIMARY_KEY    PRIMARY KEY ( COLLATION_CATALOG, COLLATION_SCHEMA,
 
COLLATION_NAME ),   CONSTRAINT COLLATIONS_FOREIGN_KEY_SCHEMATA     FOREIGN KEY ( COLLATION_CATALOG, COLLATION_SCHEMA )
    REFERENCES SCHEMATA   )
 

CATALOG(pg_collations, ###)
{NameData    colname;        /* collation name */Oid        colschema;        /* collation schema */bool
colpadattribute;   /* pad attribute */bool        colcasesensitive;    /* case sensitive */bool        colaccent;
/* accent sensitive */regproc        colfunc;        /* used collation function */Oid        colrepertoire;        /*
collationrepertoire */
 

} FormData_pg_collations;

- COLLATION_TYPE, COLLATION_DEFINITION, COLLATION_DICTIONARY are by NULL
by standard. Will be created by view



=============================
pg_charset
=============================
CREATE TABLE CHARACTER_SETS (   CHARACTER_SET_CATALOG          INFORMATION_SCHEMA.SQL_IDENTIFIER,
CHARACTER_SET_SCHEMA          INFORMATION_SCHEMA.SQL_IDENTIFIER,   CHARACTER_SET_NAME
INFORMATION_SCHEMA.SQL_IDENTIFIER,  CHARACTER_REPERTOIRE           INFORMATION_SCHEMA.SQL_IDENTIFIER,   FORM_OF_USE
              INFORMATION_SCHEMA.SQL_IDENTIFIER,   NUMBER_OF_CHARACTERS           INFORMATION_SCHEMA.CARDINAL_NUMBER,
DEFAULT_COLLATE_CATALOG       INFORMATION_SCHEMA.SQL_IDENTIFIER     CONSTRAINT
CHARACTER_SETS_DEFAULT_COLLATE_CATALOG_NOT_NULL      NOT NULL,   DEFAULT_COLLATE_SCHEMA
INFORMATION_SCHEMA.SQL_IDENTIFIER    CONSTRAINT CHARACTER_SETS_DEFAULT_COLLATE_SCHEMA_NOT_NULL       NOT NULL,
DEFAULT_COLLATE_NAME          INFORMATION_SCHEMA.SQL_IDENTIFIER     CONSTRAINT
CHARACTER_SETS_DEFAULT_COLLATE_NAME_NOT_NULL      NOT NULL,   CONSTRAINT CHARACTER_SETS_PRIMARY_KEY     PRIMARY KEY (
CHARACTER_SET_CATALOG,CHARACTER_SET_SCHEMA,
 
CHARACTER_SET_NAME ),   CONSTRAINT CHARACTER_SETS_FOREIGN_KEY_SCHEMATA     FOREIGN KEY ( CHARACTER_SET_CATALOG,
CHARACTER_SET_SCHEMA)       REFERENCES SCHEMATA,   CONSTRAINT CHARACTER_SETS_FOREIGN_KEY_CHARACTER_ENCODING_FORMS
FOREIGNKEY (FORM_OF_USE, CHARACTER_REPERTOIRE )       REFERENCES CHARACTER_ENCODING_FORMS,   CONSTRAINT
CHARACTER_SETS_CHECK_REFERENCES_COLLATIONS    CHECK ( DEFAULT_COLLATE_CATALOG NOT IN              ( SELECT CATALOG_NAME
FROMSCHEMATA )schema           OR              ( DEFAULT_COLLATE_CATALOG, DEFAULT_COLLATE_SCHEMA,
DEFAULT_COLLATE_NAME) IN              ( SELECT COLLATION_CATALOG, COLLATION_SCHEMA,
 
COLLATION_NAME                FROM COLLATIONS ) )   )

CATALOG(pg_charset, ###) 
{NameData    chaname;        /* character set name */Oid        chaschema;        /* character set schema */Oid
charepertoire;       /* repertoire oid */NameData    chaformofuse;        /* character encoding form */int4
chanumofcharacters;   /* number of characters */Oid        chadefcollcatalog;    /* default collate catalog */Oid
chadefcollschema;    /* default collate schema */NameData    chadefcollname;        /* default collate name */
 

} FormData_pg_charset;

=============================
pg_encoding 
=============================
CREATE TABLE CHARACTER_ENCODING_FORMS (   CHARACTER_REPERTOIRE_NAME     INFORMATION_SCHEMA.SQL_IDENTIFIER
CONSTRAINT
CHARACTER_ENCODING_FORMS_CHARACTER_REPERTOIRE_NAME_NOT_NULL       NOT NULL,   CHARACTER_ENCODING_FORM_NAME
INFORMATION_SCHEMA.SQL_IDENTIFIER    CONSTRAINT
 
CHARACTER_ENCODING_FORMS_CHARACTER_ENCODING_FORM_NAME_NOT_NULL       NOT NULL,   CONSTRAINT
CHARACTER_ENCODING_FORMS_PRIMARY_KEY    PRIMARY KEY ( CHARACTER_ENCODING_FORM_NAME,
 
CHARACTER_REPERTOIRE_NAME ),   CONSTRAINT
CHARACTER_ENCODING_FORMS_FOREIGN_KEY_CHARACTER_REPERTOIRES     FOREIGN KEY ( CHARACTER_REPERTOIRE_NAME )
REFERENCESCHARACTER_REPERTOIRES   )
 

CATALOG(pg_encoding, ###)
{NameData    encname;        /* encoding name */NameData    encsystemencoding;    /* system or built-in encoding */

} FormData_pg_encoding;


Regarding the Alexey Slynko's patch mentioned earlier pg_database has
been also extended with lc_collate and lc_ctype records that currently
sets lc_collate and lc_ctype per each database.

Please let me know if my idea is right or needs some adjusts. Thank you
Regards
    Radek Strnad






Re: Proposal - Collation at database level

От
Zdenek Kotala
Дата:
Radek Strnad napsal(a):

<snip>

> 
> I'm thinking of dividing the problem into two parts - in beginning
> pg_collation will contain two functions. One will have hard-coded rules
> for these basic collations (SQL_CHARACTER, GRAPHIC_IRV, LATIN1, ISO8BIT,
> UCS_BASIC). It will compare each string character bitwise and guarantee
> that the implementation will meet the SQL standard implemented in
> PostgreSQL. 
> 
> Second one will allow the user to use installed system locales. The set
> of these collations will obviously vary between systems. Catalogs will
> contain encoding and collation for calling the system locale function.
> This will allow us to use collations such as en_US.utf8, cs_CZ.iso88592
> etc. if they will be availible.
> 
> We will also need to change the way how strings are compared. Regarding
> the set database collation the right function will be used.
> http://doxygen.postgresql.org/varlena_8c.html#4c7af81f110f9be0bd8eb2bd99525675
> 
> This design will make possible switch to ICU or any other implementation
> quite simple and will not cause any major rewriting of what I'm coding
> right now.


Collation function is main point here. How you mentioned one will be only 
wrapper about strcmp and second one about strcoll. (maybe you need four - 
char/wchar) Which function will be used it is defined in pg_collation catalog by 
CREATE COLLATION command. But you need specify name of locale for system 
locales. It means you need attribute for storing locale name.

<snip>

> CATALOG(pg_collations, ###)
> {
>     NameData    colname;        /* collation name */
>     Oid        colschema;        /* collation schema */
>     bool        colpadattribute;    /* pad attribute */
>     bool        colcasesensitive;    /* case sensitive */
>     bool        colaccent;        /* accent sensitive */
>     regproc        colfunc;        /* used collation function */
>     Oid        colrepertoire;        /* collation repertoire */
> 
> } FormData_pg_collations;
> 

It would be good to send list of new and modified SQL commands (like CREATE 
COLLATION) for wide discussion.

    Zdenek


Re: Proposal - Collation at database level

От
Radek Strnad
Дата:
Zdenek Kotala wrote:
> Radek Strnad napsal(a):
>
> <snip>
>
>>
>> I'm thinking of dividing the problem into two parts - in beginning
>> pg_collation will contain two functions. One will have hard-coded rules
>> for these basic collations (SQL_CHARACTER, GRAPHIC_IRV, LATIN1, ISO8BIT,
>> UCS_BASIC). It will compare each string character bitwise and guarantee
>> that the implementation will meet the SQL standard implemented in
>> PostgreSQL.
>> Second one will allow the user to use installed system locales. The set
>> of these collations will obviously vary between systems. Catalogs will
>> contain encoding and collation for calling the system locale function.
>> This will allow us to use collations such as en_US.utf8, cs_CZ.iso88592
>> etc. if they will be availible.
>>
>> We will also need to change the way how strings are compared. Regarding
>> the set database collation the right function will be used.
>> http://doxygen.postgresql.org/varlena_8c.html#4c7af81f110f9be0bd8eb2bd99525675 
>>
>>
>> This design will make possible switch to ICU or any other implementation
>> quite simple and will not cause any major rewriting of what I'm coding
>> right now.
>
>
> Collation function is main point here. How you mentioned one will be 
> only wrapper about strcmp and second one about strcoll. (maybe you 
> need four - char/wchar) Which function will be used it is defined in 
> pg_collation catalog by CREATE COLLATION command. But you need specify 
> name of locale for system locales. It means you need attribute for 
> storing locale name.
>
You're right. I've extended pg_collation for system locale columns. In 
the first stage we actually don't need any other catalogs such as 
encoding, etc. and we can build this functionality only on following 
pg_collation catalog. Used collation function (system or built-in) will 
be decided  on existing collation name.

CATALOG(pg_collations, ###)
{   NameData    colname;        /* collation name */   Oid        colschema;        /* collation schema */   NameData
colcharset;   /*  character set specification */   Oid         colexistingcollation; /* existing collation */   bool
   colpadattribute;    /* pad attribute */   bool        colcasesensitive;    /* case sensitive */   bool
colaccent;       /* accent sensitive */   NameData    colsyslccollate;    /* lc_collate */   NameData    colsyslcctype;
/*lc_ctype */   regproc        colfunc;        /* used collation function */
 
} FormData_pg_collations;


>> FormData_pg_collations;
> It would be good to send list of new and modified SQL commands (like 
> CREATE COLLATION) for wide discussion.
>
CREATE COLLATION <collation name> FOR <character set specification> FROM 
<existing collation name> [ <pad characteristic> ] [ <case sensitive> ] 
[ <accent sensitive> ] [ LC_COLLATE <lc_collate> ] [ LC_CTYPE <lc_ctype> ]

<pad characteristic> := NO PAD | PAD SPACE
<case sensitive> := CASE SENSITIVE | CASE INSENSITIVE
<accent sensitive> := ACCENT SENSITIVE | ACCENT INSENSITIVE

Since you can specify order by in select clause there's no need for 
adding ascending and descending type  of collation. They will allways be 
ascending.

DROP COLLATION <collation name>

CREATE DATABASE ... [ COLLATE <collation name> ] ...

ALTER DATABASE ... [ COLLATE <collation name> ] ...
   Any thoughts?
         Radek




Re: Proposal - Collation at database level

От
Zdenek Kotala
Дата:
Radek Strnad napsal(a):
> Zdenek Kotala wrote:
>> Radek Strnad napsal(a):
>>
>> <snip>
>>
>>>
>>> I'm thinking of dividing the problem into two parts - in beginning
>>> pg_collation will contain two functions. One will have hard-coded rules
>>> for these basic collations (SQL_CHARACTER, GRAPHIC_IRV, LATIN1, ISO8BIT,
>>> UCS_BASIC). It will compare each string character bitwise and guarantee
>>> that the implementation will meet the SQL standard implemented in
>>> PostgreSQL.
>>> Second one will allow the user to use installed system locales. The set
>>> of these collations will obviously vary between systems. Catalogs will
>>> contain encoding and collation for calling the system locale function.
>>> This will allow us to use collations such as en_US.utf8, cs_CZ.iso88592
>>> etc. if they will be availible.
>>>
>>> We will also need to change the way how strings are compared. Regarding
>>> the set database collation the right function will be used.
>>> http://doxygen.postgresql.org/varlena_8c.html#4c7af81f110f9be0bd8eb2bd99525675 
>>>
>>>
>>> This design will make possible switch to ICU or any other implementation
>>> quite simple and will not cause any major rewriting of what I'm coding
>>> right now.
>>
>>
>> Collation function is main point here. How you mentioned one will be 
>> only wrapper about strcmp and second one about strcoll. (maybe you 
>> need four - char/wchar) Which function will be used it is defined in 
>> pg_collation catalog by CREATE COLLATION command. But you need specify 
>> name of locale for system locales. It means you need attribute for 
>> storing locale name.
>>
> You're right. I've extended pg_collation for system locale columns. In 
> the first stage we actually don't need any other catalogs such as 
> encoding, etc. and we can build this functionality only on following 
> pg_collation catalog. Used collation function (system or built-in) will 
> be decided  on existing collation name.
> 
> CATALOG(pg_collations, ###)
> {
>    NameData    colname;        /* collation name */
>    Oid        colschema;        /* collation schema */
>    NameData   colcharset;    /*  character set specification */
>    Oid         colexistingcollation; /* existing collation */
>    bool        colpadattribute;    /* pad attribute */
>    bool        colcasesensitive;    /* case sensitive */
>    bool        colaccent;        /* accent sensitive */
>    NameData    colsyslccollate;    /* lc_collate */
>    NameData    colsyslcctype; /* lc_ctype */
>    regproc        colfunc;        /* used collation function */
> } FormData_pg_collations;
> 
> 
>>> FormData_pg_collations;
>> It would be good to send list of new and modified SQL commands (like 
>> CREATE COLLATION) for wide discussion.
>>
> CREATE COLLATION <collation name> FOR <character set specification> FROM 
> <existing collation name> [ <pad characteristic> ] [ <case sensitive> ] 
> [ <accent sensitive> ] [ LC_COLLATE <lc_collate> ] [ LC_CTYPE <lc_ctype> ]
> 
> <pad characteristic> := NO PAD | PAD SPACE
> <case sensitive> := CASE SENSITIVE | CASE INSENSITIVE
> <accent sensitive> := ACCENT SENSITIVE | ACCENT INSENSITIVE
> 
> Since you can specify order by in select clause there's no need for 
> adding ascending and descending type  of collation. They will allways be 
> ascending.
> 
> DROP COLLATION <collation name>
> 
> CREATE DATABASE ... [ COLLATE <collation name> ] ...
> 
> ALTER DATABASE ... [ COLLATE <collation name> ] ...
> 
>

I think catalog is good. Maybe attributes names colsyslccollate and 
colsyslcctype should be more generic because they could be shared with ICU.

But collation function should be specified in CREATE COLLATION command.
Maybe CREATE COLLATION .... [STRCOL <fn name>]

    Zdenek