Thread: Encoding names


Encoding names

From
Karel Zak
Date:
Hi,
I have been working a bit with encodings (Japanese, Latin(s)) and I see that
PG uses non-standard encoding names.

Why is it SJIS instead of Shift-JIS, EUC_JP instead of EUC-JP,
Latin2 instead of ISO-8859-2?

That is not good, for example, for applications that output data to HTML and
need to set correct meta tags; for this they have to maintain both
PostgreSQL's specific names and the standard names and translate between
them ...
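
For illustration, this is roughly the kind of mapping table such an
application ends up carrying itself (just a sketch; the names and the
helper below are only examples, not taken from any real application):

    #include <strings.h>        /* strcasecmp() */

    /* example mapping from PostgreSQL encoding names to the standard
     * (IANA/MIME) names an HTML meta tag expects -- only a few rows */
    struct encname_map
    {
        const char *pg_name;    /* name as PostgreSQL knows it     */
        const char *std_name;   /* name as HTML/XML clients expect */
    };

    static const struct encname_map encname_map[] = {
        {"EUC_JP", "EUC-JP"},
        {"LATIN1", "ISO-8859-1"},
        {"LATIN2", "ISO-8859-2"},
        {"SJIS",   "Shift_JIS"},
        {NULL, NULL}
    };

    /* return the standard name, or the input itself if unknown */
    static const char *
    pg_to_std_encname(const char *pg_name)
    {
        const struct encname_map *m;

        for (m = encname_map; m->pg_name != NULL; m++)
            if (strcasecmp(pg_name, m->pg_name) == 0)
                return m->std_name;
        return pg_name;
    }

Every such application has to invent this table for itself, and that is
exactly the problem.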
Comments?
        Karel



Re: Encoding names

From
Tatsuo Ishii
Date:
>  I have been working a bit with encodings (Japanese, Latin(s)) and I see that
> PG uses non-standard encoding names.
> 
>  Why is it SJIS instead of Shift-JIS, EUC_JP instead of EUC-JP,
> Latin2 instead of ISO-8859-2?
> 
>  That is not good, for example, for applications that output data to HTML and
> need to set correct meta tags; for this they have to maintain both
> PostgreSQL's specific names and the standard names and translate between
> them ...
> 
>  Comments?

But HTML meta tags used to use their own encoding names such as
x-euc-jp, x-sjis....

Well, the reasons are:

1) the shell does not like "-" (configure and some Unix commands in
   PostgreSQL accept encoding names)

2) I don't like longer names

BTW, I and Thomas (and maybe others) are interested in implementing
CREATE CHARACTER SET stuff in SQL92/99. The encoding names might be
changed at that time...
--
Tatsuo Ishii


Re: Encoding names

From
Karel Zak
Date:
> But HTML meta tags used to use their own encoding names such as
> x-euc-jp, x-sjis....
Not sure; my Mozilla understands the "ISO-xxxx-x" and "Shift-JIS" formats too.
But that's irrelevant; the important point is that names like "Latin2", "SJIS"
or "EUC_JP" are less standard. And it isn't only HTML, but other formats
too (i-mode, WAP, XML ... etc.).

> Well, the reasons are:
> 
> 1) the shell does not like "-" (configure and some Unix commands in
>    PostgreSQL accept encoding names)
>
> 2) I don't like longer names
Sorry, but both are weak arguments, and please never say "I don't like"
when talking about already existing standards; that is the way to chaos.

Sorry for these hard words, but I hope you understand me :-)
 

> BTW, I and Thomas (and maybe others) are interested in implementing
> CREATE CHARACTER SET stuff in SQL92/99. The encoding names might be

Well, I look forward to it.
        Karel



Re: Encoding names

From
Tatsuo Ishii
Date:
> > But HTML meta tags used to use their own encoding names such as
> > x-euc-jp, x-sjis....
> 
>  Not sure; my Mozilla understands the "ISO-xxxx-x" and "Shift-JIS" formats too.
> But that's irrelevant; the important point is that names like "Latin2", "SJIS"
> or "EUC_JP" are less standard. And it isn't only HTML, but other formats
> too (i-mode, WAP, XML ... etc.).

They were introduced recently. If I remember correctly, when I started
to implement the multi-byte functionality, most browsers did not
accept "Shift-JIS" in their meta tags.

> > Well, the reasons are:
> > 
> > 1) the shell does not like "-" (configure and some Unix commands in
> >    PostgreSQL accept encoding names)
> >
> > 2) I don't like longer names
> 
>  Sorry, but both are weak arguments, and please never say "I don't like"
> when talking about already existing standards; that is the way to chaos.
>  
>  Sorry for these hard words, but I hope you understand me :-)

Please understand there is no standard for charset/encoding names in
SQL92/99 itself. The SQL standard just says "you can import any
charset/encoding from anywhere if you can". Please correct me if I am
wrong.

However, I do not object to changing the encoding names if there is enough
agreement (and as long as backward compatibility is kept).

> > BTW, I and Thomas (and maybe others) are interested in implementing
> > CREATE CHARACTER SET stuff in SQL92/99. The encoding names might be
> 
>  Well, I look forward to it.

Good.
--
Tatsuo Ishii


Re: Encoding names

From
Karel Zak
Date:
On Wed, 21 Feb 2001, Tatsuo Ishii wrote:

> Please understand there is no standard for charset/encoding names in
> SQL92/99 itself. The SQL standard just says "you can import any
> charset/encoding from anywhere if you can". Please correct me if I am
> wrong.

Not in the SQL standards, no, but they are all probably known -- for example
the ISO names, or some form of them.

> However, I do not object to changing the encoding names if there is enough
> agreement (and as long as backward compatibility is kept). 
You don't have to change the current names; you can add new lines to
pg_conv_tbl[] with synonymous names for already existing encodings.

An example:
 {LATIN1, "LATIN1",     0, latin12mic, mic2latin1, 0, 0},
 {LATIN1, "ISO-8859-1", 0, latin12mic, mic2latin1, 0, 0},
And if you order this table alphabetically and in pg_char_to_encoding()
use a binary search (Knuth) instead of the current sequential scan with for(),
everything will be faster and nicer. It's easy.
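
Roughly like this -- only a sketch, not a patch; the encoding ids are faked
here to keep it self-contained, in the real code they come from the existing
multibyte headers:

    #include <stdlib.h>         /* bsearch() */
    #include <strings.h>        /* strcasecmp() */

    /* stand-ins for the real encoding ids from the multibyte headers */
    enum { EUC_JP, SJIS, LATIN1, LATIN2 };

    typedef struct
    {
        const char *name;       /* encoding name or synonym */
        int         encoding;   /* encoding id, e.g. LATIN1 */
    } pg_encname;

    /* MUST stay sorted by name (case-insensitively) for bsearch() */
    static const pg_encname pg_encname_tbl[] = {
        {"EUC-JP",     EUC_JP},
        {"EUC_JP",     EUC_JP},
        {"ISO-8859-1", LATIN1},
        {"ISO-8859-2", LATIN2},
        {"LATIN1",     LATIN1},
        {"LATIN2",     LATIN2},
        {"SHIFT-JIS",  SJIS},
        {"SJIS",       SJIS}
    };

    static int
    encname_cmp(const void *key, const void *entry)
    {
        return strcasecmp((const char *) key,
                          ((const pg_encname *) entry)->name);
    }

    int
    pg_char_to_encoding(const char *name)
    {
        const pg_encname *p;

        p = bsearch(name, pg_encname_tbl,
                    sizeof(pg_encname_tbl) / sizeof(pg_encname_tbl[0]),
                    sizeof(pg_encname), encname_cmp);
        return p ? p->encoding : -1;    /* -1 = unknown encoding */
    }
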
What do you think? :-)
    Karel




Re: Encoding names

From
Thomas Lockhart
Date:
>  You don't have to change the current names; you can add new lines to
> pg_conv_tbl[] with synonymous names for already existing encodings...
>  {LATIN1, "LATIN1",     0, latin12mic, mic2latin1, 0, 0},
>  {LATIN1, "ISO-8859-1", 0, latin12mic, mic2latin1, 0, 0},
>  And if you order this table alphabetically and in pg_char_to_encoding()
> use a binary search (Knuth) instead of the current sequential scan with for(),
> everything will be faster and nicer. It's easy.

As you probably know, there is already a binary search algorithm coded
up for the date/time string lookups in utils/adt/datetime.c. Since that
lookup caches the last value (which could be done here too) most lookups
are immediate.
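
Applied to the encoding lookup it would look something like this (just a
sketch, reusing the pg_encname_tbl[] and encname_cmp() from Karel's example
above):

    /* same bsearch() lookup, but remember the last hit the way
     * utils/adt/datetime.c does, so repeated lookups of the same
     * name return immediately (fine in the single-threaded backend) */
    int
    pg_char_to_encoding(const char *name)
    {
        static const pg_encname *last = NULL;   /* last successful hit */
        const pg_encname *p;

        if (last != NULL && strcasecmp(name, last->name) == 0)
            return last->encoding;

        p = bsearch(name, pg_encname_tbl,
                    sizeof(pg_encname_tbl) / sizeof(pg_encname_tbl[0]),
                    sizeof(pg_encname), encname_cmp);
        if (p == NULL)
            return -1;          /* unknown encoding */

        last = p;
        return p->encoding;
    }
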

Are you proposing to make a change Karel, or just encouraging others? :)
                   - Thomas


Re: Encoding names

From
Karel Zak
Date:
On Wed, 21 Feb 2001, Thomas Lockhart wrote:

> >  You don't have to change the current names; you can add new lines to
> > pg_conv_tbl[] with synonymous names for already existing encodings...
> >  {LATIN1, "LATIN1",     0, latin12mic, mic2latin1, 0, 0},
> >  {LATIN1, "ISO-8859-1", 0, latin12mic, mic2latin1, 0, 0},
> >  And if you order this table alphabetically and in pg_char_to_encoding()
> > use a binary search (Knuth) instead of the current sequential scan with for(),
> > everything will be faster and nicer. It's easy.
> 
> As you probably know, there is already a binary search algorithm coded
> up for the date/time string lookups in utils/adt/datetime.c. Since that
> lookup caches the last value (which could be done here too) most lookups
> are immediate.
> 
> Are you proposing to make a change Karel, or just encouraging others? :)
> 
No problem for me. Do you want a patch with this for tomorrow's breakfast?
IMHO it's acceptable for the current 7.1 too; it's a really small change.

Or should Tatsuo do it?
        Karel



Re: Encoding names

From
Tatsuo Ishii
Date:
> > As you probably know, there is already a binary search algorithm coded
> > up for the date/time string lookups in utils/adt/datetime.c. Since that
> > lookup caches the last value (which could be done here too) most lookups
> > are immediate.
> > 
> > Are you proposing to make a change Karel, or just encouraging others? :)
> > 
> 
>  No problem for me. Do you want a patch with this for tomorrow's breakfast?
> IMHO it's acceptable for the current 7.1 too; it's a really small change.
> 
>  Or should Tatsuo do it?

Please go ahead. By the way, there is one more place where you need to tweak
the encoding name table. Take a look at
interfaces/libpq/fe-connect.c. It's ugly to have similar tables in
two places, but I did not find a better way to avoid linking the huge
Unicode conversion tables into the frontend.
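
The difference between the two tables is basically this (a rough sketch of
the idea, not the actual struct definitions in the source):

    /* backend entry: carries the conversion routines, which is what
     * drags the big conversion tables into whatever links this */
    typedef struct
    {
        int         encoding;                     /* encoding id   */
        const char *name;                         /* encoding name */
        void      (*to_mic) (unsigned char *, unsigned char *, int);
        void      (*from_mic) (unsigned char *, unsigned char *, int);
    } pg_enconv;

    /* frontend (libpq) entry: the name/id pair is all it needs, so
     * clients never link the conversion code */
    typedef struct
    {
        int         encoding;
        const char *name;
    } pg_encname;
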
--
Tatsuo Ishii


Re: Encoding names

From
Karel Zak
Date:
On Thu, 22 Feb 2001, Tatsuo Ishii wrote:

> > > As you probably know, there is already a binary search algorithm coded
> > > up for the date/time string lookups in utils/adt/datetime.c. Since that
> > > lookup caches the last value (which could be done here too) most lookups
> > > are immediate.
> > > 
> > > Are you proposing to make a change Karel, or just encouraging others? :)
> > > 
> > 
> >  No problem for me. Do you want a patch with this for tomorrow's breakfast?
> > IMHO it's acceptable for the current 7.1 too; it's a really small change.
> > 
> >  Or should Tatsuo do it?
> 
> Please go ahead. By the way, there is one more place where you need to tweak
> the encoding name table. Take a look at
> interfaces/libpq/fe-connect.c. It's ugly to have similar tables in
> two places, but I did not find a better way to avoid linking the huge
> Unicode conversion tables into the frontend.
Hmm, I see. It's really a little ugly to maintain the same things in two
places. What about this solution:

 * split (in the backend) pg_conv_tbl[] into two tables:

   encstr2enc[]   - maps encoding names (strings) to the encoding 'id'.
                    This table is sorted alphabetically by name.

   pg_conv_tbl[]  - table with the encoding 'id' and the encoding routines.
                    This table is ordered by encoding 'id', and that order
                    lets you find the relevant routines directly, for example:

                        pg_conv_tbl[ LATIN1 ]

This solution allows libpq and the backend to use and share one encstr2enc[]
table and the basic functions that work with it -- like
pg_char_to_encoding().

Maybe it would also be better to define a separate enum datatype for the
encoding 'id' instead of the current mistake-prone '#define's.
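
In code it could look roughly like this (a sketch only; the names, fields
and ids here are made up for illustration):

    /* encoding ids as an enum instead of #defines */
    typedef enum pg_enc
    {
        SQL_ASCII = 0,
        EUC_JP,
        LATIN1,
        LATIN2,
        SJIS,
        _PG_LAST_ENCODING       /* sentinel */
    } pg_enc;

    /* 1) shared by backend and libpq: name -> id, sorted by name,
     *    searched by pg_char_to_encoding() with bsearch()          */
    typedef struct
    {
        const char *name;
        pg_enc      encoding;
    } pg_encname;

    static const pg_encname encstr2enc[] = {
        {"EUC-JP",     EUC_JP},
        {"EUC_JP",     EUC_JP},
        {"ISO-8859-1", LATIN1},
        {"ISO-8859-2", LATIN2},
        {"LATIN1",     LATIN1},
        {"LATIN2",     LATIN2},
        {"SHIFT-JIS",  SJIS},
        {"SJIS",       SJIS},
        {"SQL_ASCII",  SQL_ASCII}
    };

    /* 2) backend only: id -> conversion routines, ordered by id so
     *    pg_conv_tbl[LATIN1] hits the right entry with no search    */
    typedef struct
    {
        pg_enc encoding;
        void (*to_mic) (unsigned char *, unsigned char *, int);
        void (*from_mic) (unsigned char *, unsigned char *, int);
    } pg_enconv;
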
        Karel