Обсуждение: [ADMIN] what's the efficient/safest way to convert database character set ?

Поиск
Список
Период
Сортировка

[ADMIN] what's the efficient/safest way to convert database character set ?

От
"Huang, Suya"
Дата:

Hi,

 

I’ve got a question of converting database from ascii to UTF-8, what’s the best approach to do so if the database size is very large? Detailed procedure or experience sharing are much appreciated!

 

 

Thanks,

Suya

 

Re: [ADMIN] what's the efficient/safest way to convert database character set ?

От
Steve Atkins
Дата:
On Oct 17, 2013, at 3:13 PM, "Huang, Suya" <Suya.Huang@au.experian.com> wrote:

> Hi,
>
> I’ve got a question of converting database from ascii to UTF-8, what’s the best approach to do so if the database
sizeis very large? Detailed procedure or experience sharing are much appreciated! 
>

The answer to that depends on what you mean by "ascii".

If your current database uses SQL_ASCII encoding - that's not ascii. It could have anything in there, including any mix
ofencodings and there's been no enforcement of any encoding, so there's no way of knowing what they are. If you've had,
forexample, webapps that let people paste word documents into them, you potentially have different encodings used in
differentrows of the same table. 

If your current data is like that then you're probably looking at doing some (manual) data cleanup to work out what
encodingyour data is really in, and converting it to something consistent rather than a simple migration from ascii to
utf8.

Cheers,
  Steve



Re: [ADMIN] what's the efficient/safest way to convert database character set ?

От
John R Pierce
Дата:
On 10/17/2013 3:13 PM, Huang, Suya wrote:
> I’ve got a question of converting database from ascii to UTF-8, what’s
> the best approach to do so if the database size is very large?
> Detailed procedure or experience sharing are much appreciated!


I believe you will need to dump the whole database, and import it into a
new database that uses UTF8 encoding. Ss far as I know, there's no way
to convert encoding in place. As the other gentlemen pointed out, you
also will have to convert/sanitize all text data, as your current
SQL_ASCII fields could easily contain stuff that's not valid UTF8.

for large databases, this is a major undertaking. I find its often
easiest to do a major change like this between the old and a new
database server.

--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



Re: [ADMIN] what's the efficient/safest way to convert database character set ?

От
"Huang, Suya"
Дата:
Thanks Steve,

Yes, we're using SQL_ASCII.

Would you please be more specific about manual data cleanup work here? I'm new to Postgres and don't have any
experiencein character set  conversion before, so any specific experience shared would be very much appreciated. 

Thanks,
Suya

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Steve Atkins
Sent: Friday, October 18, 2013 11:08 AM
To: pgsql-general@postgresql.org General
Subject: Re: [GENERAL] [ADMIN] what's the efficient/safest way to convert database character set ?


On Oct 17, 2013, at 3:13 PM, "Huang, Suya" <Suya.Huang@au.experian.com> wrote:

> Hi,
>
> I've got a question of converting database from ascii to UTF-8, what's the best approach to do so if the database
sizeis very large? Detailed procedure or experience sharing are much appreciated! 
>

The answer to that depends on what you mean by "ascii".

If your current database uses SQL_ASCII encoding - that's not ascii. It could have anything in there, including any mix
ofencodings and there's been no enforcement of any encoding, so there's no way of knowing what they are. If you've had,
forexample, webapps that let people paste word documents into them, you potentially have different encodings used in
differentrows of the same table. 

If your current data is like that then you're probably looking at doing some (manual) data cleanup to work out what
encodingyour data is really in, and converting it to something consistent rather than a simple migration from ascii to
utf8.

Cheers,
  Steve



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [ADMIN] what's the efficient/safest way to convert database character set ?

От
Adrian Klaver
Дата:
On 10/17/2013 08:51 PM, Huang, Suya wrote:
> Thanks Steve,
>
> Yes, we're using SQL_ASCII.
>
> Would you please be more specific about manual data cleanup work here? I'm new to Postgres and don't have any
experiencein character set  conversion before, so any specific experience shared would be very much appreciated. 

This is a good place to start:

http://www.postgresql.org/docs/9.3/interactive/multibyte.html

>
> Thanks,
> Suya
>

--
Adrian Klaver
adrian.klaver@gmail.com


Re: [ADMIN] what's the efficient/safest way to convert database character set ?

От
"Huang, Suya"
Дата:
Forgot to mention, we're using a very old version which is 8.3.11.

I'll take a look at the guide for 8.3 with similar section.

Thanks Adrian!


-----Original Message-----
From: Adrian Klaver [mailto:adrian.klaver@gmail.com]
Sent: Friday, October 18, 2013 3:05 PM
To: Huang, Suya; Steve Atkins; pgsql-general@postgresql.org General
Subject: Re: [GENERAL] [ADMIN] what's the efficient/safest way to convert database character set ?

On 10/17/2013 08:51 PM, Huang, Suya wrote:
> Thanks Steve,
>
> Yes, we're using SQL_ASCII.
>
> Would you please be more specific about manual data cleanup work here? I'm new to Postgres and don't have any
experiencein character set  conversion before, so any specific experience shared would be very much appreciated. 

This is a good place to start:

http://www.postgresql.org/docs/9.3/interactive/multibyte.html

>
> Thanks,
> Suya
>

--
Adrian Klaver
adrian.klaver@gmail.com


Re: [ADMIN] what's the efficient/safest way to convert database character set ?

От
"Huang, Suya"
Дата:
Yes John, we probably will use a new database server here to accommodate those converted database.

By saying export/import, do you mean by :
1. pg_dump  (//should I specify -E  UTF 8 to dump the data in UTF-8 encoding?)
2. create database xxx -E UTF8
3. pg_restore

I also see someone's doing this by the following way:
1. perform a plain text dump of database.
    pg_dump -f db.sql [dbname]
2. convert the character encodings.
    iconv db.sql -f ISO-8859-1 -t UTF-8 -o db.utf8.sql
3. create the UTF8 database
    createdb  utf8db  (// I'm not sure why he's not specifying DB encoding here, maybe better use -E to specify the
encodingas UTF8) 
4.restore the converted UTF8 database.
    psql -d utf8db -f db.utf8.sql

which method is better? For what I can tell now is the second approach would generate bigger dump file size, so better
topipe it to bzip to have a compressed file. But other than that, any other considerations? 

Thanks,
Suya
-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of John R Pierce
Sent: Friday, October 18, 2013 11:23 AM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] [ADMIN] what's the efficient/safest way to convert database character set ?

On 10/17/2013 3:13 PM, Huang, Suya wrote:
> I've got a question of converting database from ascii to UTF-8, what's
> the best approach to do so if the database size is very large?
> Detailed procedure or experience sharing are much appreciated!


I believe you will need to dump the whole database, and import it into a
new database that uses UTF8 encoding. Ss far as I know, there's no way
to convert encoding in place. As the other gentlemen pointed out, you
also will have to convert/sanitize all text data, as your current
SQL_ASCII fields could easily contain stuff that's not valid UTF8.

for large databases, this is a major undertaking. I find its often
easiest to do a major change like this between the old and a new
database server.

--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [ADMIN] what's the efficient/safest way to convert database character set ?

От
John R Pierce
Дата:
On 10/17/2013 9:49 PM, Huang, Suya wrote:
> Yes John, we probably will use a new database server here to accommodate those converted database.
>
> By saying export/import, do you mean by :
> 1. pg_dump  (//should I specify -E  UTF 8 to dump the data in UTF-8 encoding?)
> 2. create database xxx -E UTF8
> 3. pg_restore

I don't believe 8.3 supported multiple different encodings on the same
server instance, thats relatively new.

before you can import your SQL_ASCII data, you need to know what charset
the data is actually in.    Is it UTF8 data stored in SQL_ASCII or is it
LATIN1 (ISO-8859) ? or some sort of Big5 or euc_cn ?  or what?  if it is
already UTF8 data, are you sure that there are no invalid encodings
accidentally stored?   Postgres with SQL_ASCII does no character
validation...  if its all USASCII (0x00 to 0x7F) then you're OK.

I would strongly recommend this new database server be running a
currently supported version, I'd probably use 9.2.   configure the old
server to allow the postgres user on the new server to connect and log
on, and while logged onto the new server, run something like...

     pg_dump -h oldserver -d olddatabase -u postgres -E oldencoding -s
-f olddatabase.schema.sql
     pg_dump -h oldserver -d olddatabase -u postgres -E oldencoding -a
-f olddatabase.data.sql
     createuser newuser
     createdb -O newuser -l en_US.utf8 newdbname
     psql -d newdbname -u newuser -f olddatabase.schema.sql
     psql -d newdbname -u newuser -f olddatabase.data.sql

if the data import fails due to a invalid encoding, then you may have to
pass the .data.sql file through iconv (and remove the set
client_encoding sql commands from it)




--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



Re: [ADMIN] what's the efficient/safest way to convert database character set ?

От
"Huang, Suya"
Дата:
Hi John,

" Is it UTF8 data stored in SQL_ASCII or is it
LATIN1 (ISO-8859) ? or some sort of Big5 or euc_cn ?  or what?  if it is already UTF8 data, are you sure that there are
noinvalid encodings  
accidentally stored?   Postgres with SQL_ASCII does no character
validation...  if its all USASCII (0x00 to 0x7F) then you're OK."

Question: How can I pull out the real character set information from the database? does it rely on the understanding of
businessknowledge? 

so, the real data stored in database decides if we need to use iconv to convert them to UTF-8. If data is  USASCII,
thenthe pg_dump/restore process you provided should be sufficient and complete. Do I understand this correctly? 

Besides, we do have different  encoding database on same server instance, see below:

Welcome to psql 8.3.11, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

postgres=# \l
                 List of databases
            Name             |  Owner   | Encoding
-----------------------------+----------+-----------
 admin                     | dba      | UTF8
 pgdb_1_sky                      | pgdb_1      | SQL_ASCII
 pgdb_1_sky_utf8                 | pgdb_1      | UTF8
 pgdb_1_ca                      | pgdb_1      | SQL_ASCII
 pgdb_1_us                      | pgdb_1      | SQL_ASCII
 pgdb_sky                  | pgdb  | SQL_ASCII
 pgdb_sky_users            | pgdb  | SQL_ASCII
 pgdb_sky_users_utf8       | pgdb  | UTF8
 pgdb_sky_utf8             | pgdb  | UTF8
 pgdb_sky_utf8_86465_old   | pgdb  | UTF8

Thanks,
Suya
-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of John R Pierce
Sent: Friday, October 18, 2013 4:12 PM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] [ADMIN] what's the efficient/safest way to convert database character set ?

On 10/17/2013 9:49 PM, Huang, Suya wrote:
> Yes John, we probably will use a new database server here to accommodate those converted database.
>
> By saying export/import, do you mean by :
> 1. pg_dump  (//should I specify -E  UTF 8 to dump the data in UTF-8
> encoding?) 2. create database xxx -E UTF8 3. pg_restore

I don't believe 8.3 supported multiple different encodings on the same server instance, thats relatively new.

before you can import your SQL_ASCII data, you need to know what charset
the data is actually in.    Is it UTF8 data stored in SQL_ASCII or is it
LATIN1 (ISO-8859) ? or some sort of Big5 or euc_cn ?  or what?  if it is already UTF8 data, are you sure that there are
noinvalid encodings  
accidentally stored?   Postgres with SQL_ASCII does no character
validation...  if its all USASCII (0x00 to 0x7F) then you're OK.

I would strongly recommend this new database server be running a
currently supported version, I'd probably use 9.2.   configure the old
server to allow the postgres user on the new server to connect and log on, and while logged onto the new server, run
somethinglike... 

     pg_dump -h oldserver -d olddatabase -u postgres -E oldencoding -s -f olddatabase.schema.sql
     pg_dump -h oldserver -d olddatabase -u postgres -E oldencoding -a -f olddatabase.data.sql
     createuser newuser
     createdb -O newuser -l en_US.utf8 newdbname
     psql -d newdbname -u newuser -f olddatabase.schema.sql
     psql -d newdbname -u newuser -f olddatabase.data.sql

if the data import fails due to a invalid encoding, then you may have to pass the .data.sql file through iconv (and
removethe set client_encoding sql commands from it) 




--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [ADMIN] what's the efficient/safest way to convert database character set ?

От
John R Pierce
Дата:
On 10/17/2013 10:51 PM, Huang, Suya wrote:
> Question: How can I pull out the real character set information from the database? does it rely on the understanding
ofbusiness knowledge? 


what did you store in it?   because its declared SQL_ASCII, postgres
doesn't know, its all just bytes.   you could have stored standard 7 bit
USASCII, or you could have stored ISO-8859-1 (-2, -3,...), or you could
have stored one of the myriad non-UTF Asian multibyte character
codes.    postgres doesn't know or care what you put in there, and it
doesn't check to ensure its valid.

IF all your data is in a consistent encoding, and you specify that
encoding on the pg_dump command, then the psql command should be able to
restore it as-is to the new UTF8 database via the magic of
client_encoding.    if the data is not consistent, you'll have a much
harder time.





--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



Re: [ADMIN] what's the efficient/safest way to convert database character set ?

От
"Huang, Suya"
Дата:
Hi John,

We stored ISO-8859-1, do we still need to use iconv to convert the file?

Thanks,
Suya

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of John R Pierce
Sent: Friday, October 18, 2013 5:12 PM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] [ADMIN] what's the efficient/safest way to convert database character set ?

On 10/17/2013 10:51 PM, Huang, Suya wrote:
> Question: How can I pull out the real character set information from the database? does it rely on the understanding
ofbusiness knowledge? 


what did you store in it?   because its declared SQL_ASCII, postgres
doesn't know, its all just bytes.   you could have stored standard 7 bit
USASCII, or you could have stored ISO-8859-1 (-2, -3,...), or you could have stored one of the myriad non-UTF Asian
multibytecharacter  
codes.    postgres doesn't know or care what you put in there, and it
doesn't check to ensure its valid.

IF all your data is in a consistent encoding, and you specify that encoding on the pg_dump command, then the psql
commandshould be able to restore it as-is to the new UTF8 database via the magic of  
client_encoding.    if the data is not consistent, you'll have a much
harder time.





--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [ADMIN] what's the efficient/safest way to convert database character set ?

От
John R Pierce
Дата:
On 10/20/2013 3:56 AM, Huang, Suya wrote:
> We stored ISO-8859-1, do we still need to use iconv to convert the file?

you should be able to do the dump specifying pg_dump
--encoding=iso-8559-1 then restore it to the utf8 database, and postgres
will do the conversion for you.



--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



Re: [ADMIN] what's the efficient/safest way to convert database character set ?

От
"Huang, Suya"
Дата:
> We stored ISO-8859-1, do we still need to use iconv to convert the file?

you should be able to do the dump specifying pg_dump
--encoding=iso-8559-1 then restore it to the utf8 database, and postgres will do the conversion for you.

One more question, if the data stored in database including UTF-8 and ISO 8859-1, is that OK to dump them all in
ISO-8859-1?

Thanks,
Suya



Re: [ADMIN] what's the efficient/safest way to convert database character set ?

От
John R Pierce
Дата:
On 10/20/2013 4:10 PM, Huang, Suya wrote:
> uestion, if the data stored in database including UTF-8 and ISO 8859-1, is that OK to dump them all in ISO-8859-1?

absolutely not.   UTF8 is not valid ISO88591, and if interpreted as such
will result in garbage characters.

if you have mixed character encodings in your database, you've got some
issues trying to sort that all out to be consistent.

--
john r pierce                                      37N 122W
somewhere on the middle of the left coast