Thread: Questions on using multi-byte character in a field of a table (BIG5)

Questions on using multi-byte character in a field of a table (BIG5)

From
"Hui Chun Kit, Jacky"
Date:
Dear all,

    I am having a difficult time using PostgreSQL 6.4 with Chinese BIG5
characters. I am just trying to store BIG5 characters in a text field
and retrieve them correctly. I compiled with --enable-mb. I am on the RH5.1
Intel platform, running PG 6.4.
    I just created a test table:
    create table test ( name char(20), age int);
    For most of the characters in BIG5 it works, and I can insert a
Chinese name into the table, but for some characters, especially those in
my own name, it does not work. I have tracked the problem down but cannot
solve it.
    The cause is that my name in BIG5 coding is "5cb3 54ab c7b3",
or, byte by byte with the non-ASCII bytes in octal, "263   \ 253   T 263 307",
where every two bytes form one character.
That is, "5cb3" ('263' '\' ) is the first character and "54ab" ( '253'
'T' ) is the second character. The problem is that somewhere
between the client frontend (Perl, MS Access) and the value stored in the
database, the '\' is interpreted, so the stored value becomes
"263  253   T 263 307", which is corrupted.
    I don't know exactly where the problem is, since the same data works
fine when I use MySQL.
    Could anyone give me some hints or help?
    Your help is really very much appreciated!


Best Rgds,
Jacky







Re: [HACKERS] Questions on using multi-byte character in a field of a table (BIG5)

From
t-ishii@sra.co.jp (Tatsuo Ishii)
Date:
At 3:46 AM 98.11.22 +0800, Hui Chun Kit, Jacky wrote:
>Dear all,
>
>    I have some difficult time in using postgresql 6.4 with chinese BIG5
>
>characters. I am just looking for storing BIG characters in a text field
>
>and retrieve correctly. I have --enable-mb when I compile. I am on RH5.1

What did you choose for an encoding?
BIG5 is not supported yet in 6.4, sorry.

>intel platform, running PG 6.4.
>    I just created a testing table test
>    create test ( name char(20), age int);
>    For most of the characters in BIG5, it works and I can insert
>chinese name into the table, but for some characters, esp my own name,
>it does not work. I have check the problem out . But cannot solve it.
>    It is because in my name under BIG5 coding it is "5cb3 54ab c7b3"
>or
>in ASCII code "263   \ 253   T 263 307" where two byte is a character.
>That is "5cb3" ('263' '\' ) is the first character and '54ab' ( '253'
>'T' ) becomes the second character. The problem is that somewhere
>between storing the value into database and client frontend (Perl,
>MSAccess) , the '\' is interpreted and thus the stored value becomes
>"263  253   T 263 307" which is distorted.
>    I don't know where exactly is the problem as when I use Mysql, it is
>
>working fine.

As you can see, the problem is that BIG5 can contain some special characters
in the second byte that confuse the PostgreSQL parser. We had a similar
experience with the Japanese Shift JIS code (SJIS). To address the problem,
we added functionality somewhere in the backend to convert between SJIS
and EUC_JP (which never confuses the parser and thus can be used as one of
the backend's native encodings).
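
The failure described above can be reproduced on the raw bytes alone. The following is a minimal Python sketch, not PostgreSQL's actual parser: `naive_unescape` is a hypothetical stand-in for any layer that treats a backslash byte as an escape character, and the lead-byte test (any byte >= 0x80 starts a two-byte character) is a simplification of the real BIG5 ranges. The byte values are the ones reported in the thread.

```python
from typing import List

# Bytes of the name exactly as reported in the thread:
# octal 263, '\', 253, 'T', 263, 307.
NAME_BIG5 = bytes([0o263, 0x5C, 0o253, 0x54, 0o263, 0o307])

BACKSLASH = 0x5C


def naive_unescape(data: bytes) -> bytes:
    """Hypothetical byte-wise unescaper: drop a backslash, keep the byte after it."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == BACKSLASH and i + 1 < len(data):
            out.append(data[i + 1])
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


def backslash_trail_offsets(data: bytes) -> List[int]:
    """Offsets of second bytes equal to 0x5C.

    Simplifying assumption: any byte >= 0x80 is a lead byte and the byte
    after it is the second byte of the same character.
    """
    hits = []
    i = 0
    while i < len(data):
        if data[i] >= 0x80 and i + 1 < len(data):
            if data[i + 1] == BACKSLASH:
                hits.append(i + 1)
            i += 2
        else:
            i += 1
    return hits


if __name__ == "__main__":
    print(backslash_trail_offsets(NAME_BIG5))
    print(naive_unescape(NAME_BIG5).hex(" "))
```

Run as a script, this reports offset 1 as the problematic second byte, and the unescaped result is `b3 ab 54 b3 c7`, which is the corrupted sequence "263 253 T 263 307" from the original report. An encoding whose multi-byte characters use only bytes >= 0x80 cannot trigger this, which is the property attributed to the EUC encodings above.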

To solve your problem, there might be 2 solutions:

o Use EUC_TW (Chinese EUC code) instead of BIG5. 6.4 should be happy
  with EUC_TW. To use EUC_TW, just create a new database:
       createdb mydb with encoding='EUC_TW'.
  or do "configure --with-mb=EUC_TW" and re-install, then re-create
  the database.

  Alternatively, you can use Unicode (UTF-8). Use "UNICODE" instead of
  "EUC_TW" in this case.

o Add an encoding conversion module between BIG5 and EUC_TW to PostgreSQL.
  I wish I could do that, but I have no idea how to write it
  (I don't speak Chinese at all). So your contribution would be welcome!
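
For the Unicode alternative mentioned in the first option, here is a small Python sketch of what converting the problematic name from BIG5 to UTF-8 looks like at the byte level. It assumes the conversion happens on the client before the data is sent, which the thread does not specify, and it uses Python's built-in `big5` codec rather than anything from PostgreSQL. The helper names are illustrative only.

```python
# Bytes of the name as reported in the thread (BIG5).
NAME_BIG5 = bytes([0xB3, 0x5C, 0xAB, 0x54, 0xB3, 0xC7])


def big5_to_utf8(data: bytes) -> bytes:
    """Re-encode BIG5 bytes as UTF-8 using Python's standard codecs."""
    return data.decode("big5").encode("utf-8")


def has_ascii_range_bytes(data: bytes) -> bool:
    """True if any byte is below 0x80, i.e. could be mistaken for ASCII such as a backslash."""
    return any(b < 0x80 for b in data)


if __name__ == "__main__":
    utf8 = big5_to_utf8(NAME_BIG5)
    print(has_ascii_range_bytes(NAME_BIG5))  # BIG5 form contains '\' and 'T'
    print(has_ascii_range_bytes(utf8))       # UTF-8 form of non-ASCII text does not
```

In UTF-8 every byte of a non-ASCII character is 0x80 or above, so the three characters of the name carry no byte that a parser could read as a backslash.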
 

BTW, you said you use Perl. I'm surprised to hear that Perl
can handle BIG5. Is it a modified (localized) version?

You also use MS Access, so you must be using ODBC, which makes me worry
about its support for BIG5. Here in Japan we use a localized version of
the ODBC driver that supports SJIS.

What I want to say here is that your problem may not be only PostgreSQL
itself. I recommend you make sure that your clients can handle
BIG5.
--
Tatsuo Ishii
t-ishii@sra.co.jp



Re: [HACKERS] Questions on using multi-byte character in a field of a table (BIG5)

From
Jacky Hui Chun Kit
Date:
Dear ,
    Thank you very much for your reply. I have been waiting for one for a while.
    I really want to help out with this, but I have some problems:
    1. I am not familiar with the ODBC standard and its internals.
    2. I am not familiar with language encodings and conversion.

    However, I am used to programming in C, C++ and Perl, under both UNIX and VC5.
    Maybe we can cooperate with some other East Asian countries (Korea,
Taiwan) to create a customized ODBC driver for each language encoding we have.
    Besides, Perl does work with Chinese; in fact, I only have a problem with ODBC
now. When I use bind variables in DBD::Pg, everything works. I think this is
because assigning variables in Perl with single quotes instead of double
quotes ($var='sth';) prevents Perl from interpreting the value of the
variable. Of course, I am using EUC_TW as my default
encoding during initdb and createdb.
    Can you tell me where I can find more information on language encodings and
on writing an ODBC driver? I have read the source of PsqlODBC and I think they
are using the Cygnus GNU toolset. Can you tell me more about what you have done?
    Thanks.


Best Rgds,
Jacky Hui
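
One possible reading of why bind variables helped is that a bound value travels separately from the SQL text, so no string-literal escape rules are applied to it. The sketch below illustrates only that general idea in Python. It uses the standard-library `sqlite3` module as a stand-in, which is an assumption made for runnability: it is not DBD::Pg and does not reproduce PostgreSQL 6.4 behaviour. The table mirrors the test table from the thread, with the name stored as raw bytes, and the age value is arbitrary.

```python
import sqlite3

# Bytes of the name as reported in the thread (BIG5), including the 0x5C byte.
NAME_BIG5 = bytes([0xB3, 0x5C, 0xAB, 0x54, 0xB3, 0xC7])


def round_trip_with_bind(value: bytes) -> bytes:
    """Insert a value through a bound parameter and read it back unchanged."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE test (name BLOB, age INTEGER)")
        # The value is passed as a parameter, never spliced into the SQL string.
        # The age (30) is an arbitrary placeholder.
        conn.execute("INSERT INTO test (name, age) VALUES (?, ?)", (value, 30))
        (stored,) = conn.execute("SELECT name FROM test").fetchone()
        return stored
    finally:
        conn.close()


if __name__ == "__main__":
    print(round_trip_with_bind(NAME_BIG5) == NAME_BIG5)
```

Because the bytes never pass through a quoted literal, the 0x5C byte is stored as data rather than being read as an escape character.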






Re: [HACKERS] Questions on using multi-byte character in a field of a table (BIG5)

From
Tatsuo Ishii
Date:
>    Really thanks for your reply... I have been waiting for reply for a while.
>    I realy want to help out with this but I have some problems.
>    1. I am not familiar with ODBC Standards and internal.
>    2. I am not familiar with Language Coding and Convertion.
>
>    But I do used to programming in C, C++, Perl and both under UNIX and VC5
>    Maybe we can cooperate with some other East Asian Countries (Korean,
>Taiwan) to create customized ODBC driver for each language coding we have.
>    Besides, perl do work with Chinese, in fact, I only have problem with ODBC
>now. When I use bind variables in DBD:Pg, all things work. I think this is
>because when assigning variables in perl using single quote instead of double
>quote $var='sth'; would prevent perl from interpreting the value of the
>variable and thus everything works. Of course, I am using EUC_TW as my default
>encoding during initdb and createdb.
>    Can u tell me where can I find more info on language coding and writing
>ODBC dirver. I have read the source of the PsqlODBC and I think they are using
>Crygus GNU toolset.  Can u tell me more about what you guys have done.
>    Thanks.

I talked to the author of the localized version of the PostgreSQL ODBC driver
(http://www.insightdist.com/download/) and found that he is
interested in supporting Big5/EUC_TW. According to him, that
shouldn't be very difficult. I think the major problems are:

(1) the conversion algorithm between EUC_TW and Big5
(2) test data (both Big5 and EUC_TW)
(3) testing (we do not understand Chinese)

For (1), we can refer to
ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf or whatever else
is on the Internet, and I believe this would not be a big concern.

So for us the real problems are (2) and (3).
--
Tatsuo Ishii.