Discussion: Encoding names
Hi,

I work a little with encodings (Japanese, Latin(s)) and I see that
PG uses non-standard encoding names.

Why SJIS instead of Shift-JIS, EUC_JP instead of EUC-JP,
Latin2 instead of ISO-8859-2?

This is not good, for example, for applications that output data to HTML
and need to set correct meta tags; such an application has to maintain
PostgreSQL's specific names as well as the standard names and translate
between them...

Comments?

Karel
> I work a little with encodings (Japanese, Latin(s)) and I see that
> PG uses non-standard encoding names.
>
> Why SJIS instead of Shift-JIS, EUC_JP instead of EUC-JP,
> Latin2 instead of ISO-8859-2?
>
> This is not good, for example, for applications that output data to HTML
> and need to set correct meta tags; such an application has to maintain
> PostgreSQL's specific names as well as the standard names and translate
> between them...
>
> Comments?

But HTML meta tags used to use their own encoding names, such as
x-euc-jp, x-sjis...

Well, the reasons are:

1) the shell does not like "-" (configure and some Unix commands in
   PostgreSQL accept encoding names)

2) I don't like longer names

BTW, I and Thomas (and maybe others) are interested in implementing the
CREATE CHARACTER SET stuff from SQL92/99. The encoding names might be
changed at that time...
--
Tatsuo Ishii
> But HTML meta tags used to use their own encoding names, such as
> x-euc-jp, x-sjis...

Not sure; my Mozilla understands the "ISO-xxxx-x" and "Shift-JIS" formats
too. But that is irrelevant; the important thing is that names like
"Latin2", "SJIS", or "EUC_JP" are non-standard. And this is not about
HTML only, but about other formats too (i-mode, WAP, XML, etc.).

> Well, the reasons are:
>
> 1) the shell does not like "-" (configure and some Unix commands in
>    PostgreSQL accept encoding names)
>
> 2) I don't like longer names

Sorry, but neither is a good reason, and please never say "I don't like"
when you are talking about already existing standards; that is the way to
chaos. Sorry for the hard words, but I hope you understand me :-)

> BTW, I and Thomas (and maybe others) are interested in implementing the
> CREATE CHARACTER SET stuff from SQL92/99. The encoding names might be

Well, I look forward to it.

Karel
> > But HTML meta tags used to use their own encoding names, such as
> > x-euc-jp, x-sjis...
>
> Not sure; my Mozilla understands the "ISO-xxxx-x" and "Shift-JIS"
> formats too. But that is irrelevant; the important thing is that names
> like "Latin2", "SJIS", or "EUC_JP" are non-standard. And this is not
> about HTML only, but about other formats too (i-mode, WAP, XML, etc.).

They were introduced recently. If I remember correctly, when I started to
implement the multi-byte functionality, most browsers did not accept
"Shift-JIS" in their meta tags.

> > Well, the reasons are:
> >
> > 1) the shell does not like "-" (configure and some Unix commands in
> >    PostgreSQL accept encoding names)
> >
> > 2) I don't like longer names
>
> Sorry, but neither is a good reason, and please never say "I don't like"
> when you are talking about already existing standards; that is the way
> to chaos.
>
> Sorry for the hard words, but I hope you understand me :-)

Please understand that there is no standard for charset/encoding names in
SQL92/99 itself. The SQL standard just says "you can import any
charset/encoding from anywhere if you can". Please correct me if I am
wrong. However, I do not object to changing the encoding names if there
is enough agreement (and as long as backward compatibility is kept).

> > BTW, I and Thomas (and maybe others) are interested in implementing
> > the CREATE CHARACTER SET stuff from SQL92/99. The encoding names
> > might be
>
> Well, I look forward to it.

Good.
--
Tatsuo Ishii
On Wed, 21 Feb 2001, Tatsuo Ishii wrote:

> Please understand that there is no standard for charset/encoding names
> in SQL92/99 itself. The SQL standard just says "you can import any
> charset/encoding from anywhere if you can". Please correct me if I am
> wrong.

Not in the SQL standards, no, but everybody probably knows, for example,
the ISO names or some form of them.

> However, I do not object to changing the encoding names if there is
> enough agreement (and as long as backward compatibility is kept).

You don't have to change the current names; you can add new lines to
pg_conv_tbl[] with synonym names for the already existing encodings. An
example:

    {LATIN1, "LATIN1", 0, latin12mic, mic2latin1, 0, 0},
    {LATIN1, "ISO-8859-1", 0, latin12mic, mic2latin1, 0, 0},

And if you sort this table alphabetically and use Knuth's binary search
in pg_char_to_encoding() instead of the current sequential scan with
for(), everything will be faster and nicer. It's easy. What do you
say? :-)

Karel
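[Editor's note: Karel's synonym idea can be sketched as follows. This is a minimal, hypothetical illustration, not the actual pg_conv_tbl layout: the real table also carries conversion-routine pointers, and the real lookup is case-insensitive. Alias rows simply map extra standard names onto the same encoding id, the array is kept in strcmp() order, and the lookup uses the C library's bsearch() instead of a sequential for() scan.]

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified encoding ids -- not PostgreSQL's real ones. */
typedef enum { LATIN1, LATIN2, SJIS, EUC_JP } pg_enc;

typedef struct {
    const char *name;           /* lookup key, kept in strcmp() order */
    pg_enc      enc;            /* canonical encoding id              */
} enc_name;

/* Alias rows repeat the same encoding id under another name.
 * The array MUST stay sorted by name for bsearch() to work. */
static const enc_name enc_names[] = {
    {"EUC-JP",     EUC_JP},
    {"EUC_JP",     EUC_JP},
    {"ISO-8859-1", LATIN1},
    {"ISO-8859-2", LATIN2},
    {"LATIN1",     LATIN1},
    {"LATIN2",     LATIN2},
    {"SJIS",       SJIS},
    {"Shift-JIS",  SJIS},
};

static int enc_cmp(const void *key, const void *member)
{
    return strcmp((const char *) key, ((const enc_name *) member)->name);
}

/* Binary-search replacement for the sequential for() scan. */
int pg_char_to_encoding(const char *name)
{
    const enc_name *found = bsearch(name, enc_names,
                                    sizeof(enc_names) / sizeof(enc_names[0]),
                                    sizeof(enc_names[0]), enc_cmp);
    return found ? (int) found->enc : -1;   /* -1: unknown encoding */
}
```

With this layout, pg_char_to_encoding("ISO-8859-1") and pg_char_to_encoding("LATIN1") return the same id, so applications can use the standard names without any renaming of the existing ones.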
> You don't have to change the current names; you can add new lines to
> pg_conv_tbl[] with synonym names for the already existing encodings...
>
>     {LATIN1, "LATIN1", 0, latin12mic, mic2latin1, 0, 0},
>     {LATIN1, "ISO-8859-1", 0, latin12mic, mic2latin1, 0, 0},
>
> And if you sort this table alphabetically and use Knuth's binary search
> in pg_char_to_encoding() instead of the current sequential scan with
> for(), everything will be faster and nicer. It's easy.

As you probably know, there is already a binary search algorithm coded up
for the date/time string lookups in utils/adt/datetime.c. Since that
lookup caches the last value (which could be done here too), most lookups
are immediate.

Are you proposing to make the change, Karel, or just encouraging
others? :)

- Thomas
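[Editor's note: the last-value caching Thomas mentions could look roughly like this. This is a hedged sketch, not the datetime.c code; the table names and values here are invented for illustration. The idea is simply to remember the last entry that matched and test it before falling back to the binary search, so repeated lookups of the same name return immediately.]

```c
#include <string.h>
#include <stddef.h>

/* Invented example table -- sorted by name, as binary search requires. */
typedef struct {
    const char *name;
    int         value;
} lookup_entry;

static const lookup_entry table[] = {
    {"EUC_JP", 1},
    {"LATIN1", 8},
    {"LATIN2", 9},
    {"SJIS",   2},
};
#define TABLE_LEN (sizeof(table) / sizeof(table[0]))

static const lookup_entry *cache = NULL;    /* last successful hit */

int lookup(const char *name)
{
    size_t lo = 0, hi = TABLE_LEN;

    /* Fast path: many callers ask for the same name repeatedly. */
    if (cache && strcmp(name, cache->name) == 0)
        return cache->value;

    /* Slow path: ordinary binary search over the sorted table. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int    cmp = strcmp(name, table[mid].name);

        if (cmp == 0) {
            cache = &table[mid];            /* remember for next time */
            return table[mid].value;
        }
        if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return -1;                              /* not found */
}
```

Note that a single static cache like this is fine for a single-threaded backend process, which is the situation here; a threaded client library would need per-connection state instead.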
On Wed, 21 Feb 2001, Thomas Lockhart wrote:

> As you probably know, there is already a binary search algorithm coded
> up for the date/time string lookups in utils/adt/datetime.c. Since that
> lookup caches the last value (which could be done here too), most
> lookups are immediate.
>
> Are you proposing to make the change, Karel, or just encouraging
> others? :)

No problem for me. Do you want a patch with this for tomorrow's
breakfast? IMHO it's acceptable for the current 7.1 too; it's a really
small change.

Or will Tatsuo do it?

Karel
> > As you probably know, there is already a binary search algorithm
> > coded up for the date/time string lookups in utils/adt/datetime.c.
> > Since that lookup caches the last value (which could be done here
> > too), most lookups are immediate.
> >
> > Are you proposing to make the change, Karel, or just encouraging
> > others? :)
>
> No problem for me. Do you want a patch with this for tomorrow's
> breakfast? IMHO it's acceptable for the current 7.1 too; it's a really
> small change.
>
> Or will Tatsuo do it?

Please go ahead. By the way, there is one more place where you need to
tweak the encoding name table. Take a look at
interfaces/libpq/fe-connect.c. It's ugly to have similar tables in two
places, but I did not find a better way to avoid linking the huge Unicode
conversion tables into the frontend.
--
Tatsuo Ishii
On Thu, 22 Feb 2001, Tatsuo Ishii wrote:

> Please go ahead. By the way, there is one more place where you need to
> tweak the encoding name table. Take a look at
> interfaces/libpq/fe-connect.c. It's ugly to have similar tables in two
> places, but I did not find a better way to avoid linking the huge
> Unicode conversion tables into the frontend.

Hmm, I see. It's really a little ugly to maintain the same things in two
places. What about this solution:

* split pg_conv_tbl[] (in the backend) into two tables:

  encstr2enc[]  -- maps encoding names (strings) to an encoding 'id'.
                   This table is sorted alphabetically.

  pg_conv_tbl[] -- holds the encoding 'id' and the conversion routines.
                   This table is ordered by encoding 'id', and this order
                   allows finding the relevant routines directly, for
                   example: pg_conv_tbl[LATIN1]

This solution allows libpq and the backend to share the one encstr2enc[]
table and the basic functions that work with it, such as
pg_char_to_encoding().

Maybe it would also be better to define a separate enum datatype for the
encoding 'id' instead of the current mistake-prone '#define's.

Karel
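[Editor's note: the proposed split might be sketched like this. The table names follow Karel's proposal, but the field layout, the enum values, and the stubbed routine pointers are assumptions for illustration only: a shared name-to-id table with alias rows, a routine table indexed directly by the enum id, and an enum in place of the #define constants.]

```c
#include <stddef.h>
#include <string.h>

/* An enum replaces the mistake-prone #define constants; the compiler can
 * then catch a bogus encoding id.  Values here are invented. */
typedef enum {
    PG_LATIN1,
    PG_LATIN2,
    PG_NUM_ENCODINGS            /* keep last: gives the table size */
} pg_encoding;

/* encstr2enc[]: name -> id; alias rows reuse the same id.  This table
 * carries no routine pointers, so it can be shared with libpq without
 * dragging the conversion tables into the frontend. */
static const struct { const char *name; pg_encoding enc; } encstr2enc[] = {
    {"ISO-8859-1", PG_LATIN1},
    {"ISO-8859-2", PG_LATIN2},
    {"LATIN1",     PG_LATIN1},
    {"LATIN2",     PG_LATIN2},
};

/* Conversion routine signature (invented for the sketch). */
typedef void (*conv_func)(const unsigned char *src,
                          unsigned char *dst, int len);

/* pg_conv_tbl[]: id -> routines, ordered by id so pg_conv_tbl[PG_LATIN1]
 * is a direct array index.  Routine pointers are stubbed out here;
 * in the backend they would be latin12mic / mic2latin1 etc. */
static const struct { pg_encoding enc; conv_func to_mic; conv_func from_mic; }
pg_conv_tbl[PG_NUM_ENCODINGS] = {
    {PG_LATIN1, NULL, NULL},
    {PG_LATIN2, NULL, NULL},
};

/* Shared lookup over the name table (sequential here for brevity; the
 * sorted table admits the binary search discussed above). */
int pg_char_to_encoding(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(encstr2enc) / sizeof(encstr2enc[0]); i++)
        if (strcmp(name, encstr2enc[i].name) == 0)
            return (int) encstr2enc[i].enc;
    return -1;                  /* unknown encoding */
}
```

The backend resolves a name once via pg_char_to_encoding() and from then on indexes pg_conv_tbl[] directly by id, while libpq links only the small name table.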