Thread: Storing double-byte strings in text fields.

Storing double-byte strings in text fields.

From
Edmund von der Burg
Date:
Hello,

I am putting together a web site to display a collection of Chinese
woodblock prints. I want to be able to store double byte values (that is
to say Big5, Unicode etc encoded) in a text field for things such as the
artist's name and the title of the print. I have the following questions:

Is this possible using a plain vanilla version of Postgres, ie without the
multi-lingual support enabled? As I understand it multi-lingual support
allows me to store table and field names etc in non-ASCII, but doesn't
really affect what goes into the fields.

Are programs such as pg_dump and the COPY method 8-bit clean or will they
mess up the text? I have done some quick trials and it all seems OK but I
want to be sure before committing.

If the above is not the case will the multi-lingual support fix my
problems? I tried it out but had problems with the backend crashing on
certain queries. I'd also rather not use it as it will be easier to port
my system to other servers if it just needs a plain vanilla install.

I am currently using Postgresql 7.0.3 on RedHat 6.2 (x86) and also on
YellowDog 1.2 (PPC). The web server is Apache 1.3.12 with PHP 4.0.x.


Thanks,

Edmund.


--
 ***********************************************************
 *** Edmund von der Burg ***   edmund@ecclestoad.co.uk   ***
 ***********************************************************

Re: Storing double-byte strings in text fields.

From
Tom Lane
Date:
Edmund von der Burg <edmund@ecclestoad.co.uk> writes:
> If the above is not the case will the multi-lingual support fix my
> problems? I tried it out but had problems with the backend crashing on
> certain queries. I'd also rather not use it as it will be easier to port
> my system to other servers if it just needs a plain vanilla install.

Actually, one could fairly say that MULTIBYTE *is* in the plain vanilla
install; it's certainly in all the prebuilt RPMs that we distribute.

If you're seeing problems then the right answer is to attack 'em
head-on, not run away.  File bug reports!

            regards, tom lane

Re: Storing double-byte strings in text fields.

From
Tatsuo Ishii
Date:
> I am putting together a web site to display a collection of Chinese
> woodblock prints. I want to be able to store double byte values (that is
> to say Big5, Unicode etc encoded) in a text field for things such as the
> artist's name and the title of the print. I have the following questions:
>
> Is this possible using a plain vanilla version of Postgres, ie without the
> multi-lingual support enabled? As I understand it multi-lingual support
> allows me to store table and field names etc in non-ASCII, but doesn't
> really affect what goes into the fields.

As Tom already mentioned, your RPM-based Linux boxes already have
PostgreSQL's multi-byte support enabled.

> Are programs such as pg_dump and the COPY method 8-bit clean or will they
> mess up the text? I have done some quick trials and it all seems OK but I
> want to be sure before committing.

I don't see any reason why COPY or pg_dump would not be 8-bit clean.

> If the above is not the case will the multi-lingual support fix my
> problems? I tried it out but had problems with the backend crashing on
> certain queries. I'd also rather not use it as it will be easier to port
> my system to other servers if it just needs a plain vanilla install.

You said you use Big5. That might be the problem. PostgreSQL does not
accept any encoding that conflicts with ASCII in the backend, and some
Big5 characters have a second byte in the ASCII range. In this case you
need to create the database with EUC_TW encoding and set the environment
variable "PGCLIENTENCODING" to BIG5 in your frontend. This makes the
backend convert Big5 <--> EUC_TW automatically. Oh, you use PHP4? Then,
if you run PHP4 as an Apache module, you need to set the environment
variable before starting Apache. I also suspect you might have trouble
with PHP4 itself. It has a feature called "magic quotes" that adds an
escape character (\) before the second byte of a Big5 character if that
byte is a metacharacter. You need to disable it, otherwise PostgreSQL
will be confused. In summary, you must be very careful when using Big5,
especially with PHP.
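
To make the byte-level issue concrete, here is a minimal Python sketch.
It only illustrates how Big5 bytes look; it is not how PostgreSQL or PHP
work internally. The helper names (ascii_trail_bytes, naive_escape) are
made up for this example, and naive_escape is an assumed stand-in for
any escaping routine that works byte by byte without knowing about Big5.

```python
# Sketch only: illustrates the byte-level issue, not PostgreSQL or PHP internals.

def ascii_trail_bytes(text, encoding="big5"):
    """Return (character, second_byte) pairs whose second byte is below 0x80."""
    hits = []
    for ch in text:
        raw = ch.encode(encoding)
        if len(raw) == 2 and raw[1] < 0x80:
            hits.append((ch, raw[1]))
    return hits


def naive_escape(raw):
    """Hypothetical byte-wise escaping that doubles every backslash byte."""
    return raw.replace(b"\\", b"\\\\")


sample = "功"  # a Big5 character whose second byte equals the ASCII backslash
print(ascii_trail_bytes(sample))
print(naive_escape(sample.encode("big5")))
```

The escaped result has one extra byte inserted after the character's
second byte, so anything that later reads the bytes as Big5 sees a
stray backslash following the character.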

As for Unicode, it is safe as long as you use the UTF-8 encoding.
UCS-2/4 cannot be used with PostgreSQL. PostgreSQL 7.1 will be able to
convert automatically between UTF-8 and other encodings, including
Big5. This might be good news for you.
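
The difference between UTF-8 and UCS-2 can be checked at the byte
level. In UTF-8 every byte of a non-ASCII character has the high bit
set, so none of them can be mistaken for an ASCII character. The sketch
below assumes Python's "utf-16-be" codec as a stand-in for UCS-2 (the
two agree for characters in the Basic Multilingual Plane); the helper
name is invented for the example.

```python
def has_ascii_range_bytes(text, encoding):
    """True if encoding `text` yields any byte below 0x80."""
    return any(b < 0x80 for b in text.encode(encoding))


title = "木版画"  # non-ASCII sample text, chosen only for illustration
print(has_ascii_range_bytes(title, "utf-8"))      # UTF-8 bytes of the sample
print(has_ascii_range_bytes(title, "utf-16-be"))  # UCS-2-style bytes of the sample
```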

Another problem I have seen with Chinese character sets is that data
produced by Chinese applications is sometimes badly broken. PostgreSQL
is actually not very robust against such broken multi-byte strings. I
suspect this may be the cause of the backend crash you had, if the
above does not apply. I don't know.
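
One way to guard against broken input is to check each string on the
client side before sending it to the database. The following is a rough
Python sketch under the assumption that Python's "big5" codec is an
acceptable approximation of what counts as well-formed; PostgreSQL's own
rules may differ, and is_well_formed is a name invented here.

```python
def is_well_formed(raw, encoding="big5"):
    """Return True if `raw` decodes cleanly in the given encoding."""
    try:
        raw.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


good = "功".encode("big5")
truncated = good[:1]  # lead byte only, as if the string was cut mid-character
print(is_well_formed(good), is_well_formed(truncated))
```

Rejecting or logging strings that fail such a check keeps truncated
characters out of the database without relying on the backend to cope
with them.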

> I am currently using Postgresql 7.0.3 on RedHat 6.2 (x86) and also on
> YellowDog 1.2 (PPC). The web server is Apache 1.3.12 with PHP 4.0.x.
--
Tatsuo Ishii