Обсуждение: [ADMIN] what's the efficient/safest way to convert database character set ?
Hi,
I’ve got a question of converting database from ascii to UTF-8, what’s the best approach to do so if the database size is very large? Detailed procedure or experience sharing are much appreciated!
Thanks,
Suya
Re: [ADMIN] what's the efficient/safest way to convert database character set ?
От
Steve Atkins
Дата:
On Oct 17, 2013, at 3:13 PM, "Huang, Suya" <Suya.Huang@au.experian.com> wrote: > Hi, > > I’ve got a question of converting database from ascii to UTF-8, what’s the best approach to do so if the database sizeis very large? Detailed procedure or experience sharing are much appreciated! > The answer to that depends on what you mean by "ascii". If your current database uses SQL_ASCII encoding - that's not ascii. It could have anything in there, including any mix ofencodings and there's been no enforcement of any encoding, so there's no way of knowing what they are. If you've had, forexample, webapps that let people paste word documents into them, you potentially have different encodings used in differentrows of the same table. If your current data is like that then you're probably looking at doing some (manual) data cleanup to work out what encodingyour data is really in, and converting it to something consistent rather than a simple migration from ascii to utf8. Cheers, Steve
Re: [ADMIN] what's the efficient/safest way to convert database character set ?
От
John R Pierce
Дата:
On 10/17/2013 3:13 PM, Huang, Suya wrote: > I’ve got a question of converting database from ascii to UTF-8, what’s > the best approach to do so if the database size is very large? > Detailed procedure or experience sharing are much appreciated! I believe you will need to dump the whole database, and import it into a new database that uses UTF8 encoding. Ss far as I know, there's no way to convert encoding in place. As the other gentlemen pointed out, you also will have to convert/sanitize all text data, as your current SQL_ASCII fields could easily contain stuff that's not valid UTF8. for large databases, this is a major undertaking. I find its often easiest to do a major change like this between the old and a new database server. -- john r pierce 37N 122W somewhere on the middle of the left coast
Re: [ADMIN] what's the efficient/safest way to convert database character set ?
От
"Huang, Suya"
Дата:
Thanks Steve, Yes, we're using SQL_ASCII. Would you please be more specific about manual data cleanup work here? I'm new to Postgres and don't have any experiencein character set conversion before, so any specific experience shared would be very much appreciated. Thanks, Suya -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Steve Atkins Sent: Friday, October 18, 2013 11:08 AM To: pgsql-general@postgresql.org General Subject: Re: [GENERAL] [ADMIN] what's the efficient/safest way to convert database character set ? On Oct 17, 2013, at 3:13 PM, "Huang, Suya" <Suya.Huang@au.experian.com> wrote: > Hi, > > I've got a question of converting database from ascii to UTF-8, what's the best approach to do so if the database sizeis very large? Detailed procedure or experience sharing are much appreciated! > The answer to that depends on what you mean by "ascii". If your current database uses SQL_ASCII encoding - that's not ascii. It could have anything in there, including any mix ofencodings and there's been no enforcement of any encoding, so there's no way of knowing what they are. If you've had, forexample, webapps that let people paste word documents into them, you potentially have different encodings used in differentrows of the same table. If your current data is like that then you're probably looking at doing some (manual) data cleanup to work out what encodingyour data is really in, and converting it to something consistent rather than a simple migration from ascii to utf8. Cheers, Steve -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [ADMIN] what's the efficient/safest way to convert database character set ?
От
Adrian Klaver
Дата:
On 10/17/2013 08:51 PM, Huang, Suya wrote: > Thanks Steve, > > Yes, we're using SQL_ASCII. > > Would you please be more specific about manual data cleanup work here? I'm new to Postgres and don't have any experiencein character set conversion before, so any specific experience shared would be very much appreciated. This is a good place to start: http://www.postgresql.org/docs/9.3/interactive/multibyte.html > > Thanks, > Suya > -- Adrian Klaver adrian.klaver@gmail.com
Re: [ADMIN] what's the efficient/safest way to convert database character set ?
От
"Huang, Suya"
Дата:
Forgot to mention, we're using a very old version which is 8.3.11. I'll take a look at the guide for 8.3 with similar section. Thanks Adrian! -----Original Message----- From: Adrian Klaver [mailto:adrian.klaver@gmail.com] Sent: Friday, October 18, 2013 3:05 PM To: Huang, Suya; Steve Atkins; pgsql-general@postgresql.org General Subject: Re: [GENERAL] [ADMIN] what's the efficient/safest way to convert database character set ? On 10/17/2013 08:51 PM, Huang, Suya wrote: > Thanks Steve, > > Yes, we're using SQL_ASCII. > > Would you please be more specific about manual data cleanup work here? I'm new to Postgres and don't have any experiencein character set conversion before, so any specific experience shared would be very much appreciated. This is a good place to start: http://www.postgresql.org/docs/9.3/interactive/multibyte.html > > Thanks, > Suya > -- Adrian Klaver adrian.klaver@gmail.com
Re: [ADMIN] what's the efficient/safest way to convert database character set ?
От
"Huang, Suya"
Дата:
Yes John, we probably will use a new database server here to accommodate those converted database. By saying export/import, do you mean by : 1. pg_dump (//should I specify -E UTF 8 to dump the data in UTF-8 encoding?) 2. create database xxx -E UTF8 3. pg_restore I also see someone's doing this by the following way: 1. perform a plain text dump of database. pg_dump -f db.sql [dbname] 2. convert the character encodings. iconv db.sql -f ISO-8859-1 -t UTF-8 -o db.utf8.sql 3. create the UTF8 database createdb utf8db (// I'm not sure why he's not specifying DB encoding here, maybe better use -E to specify the encodingas UTF8) 4.restore the converted UTF8 database. psql -d utf8db -f db.utf8.sql which method is better? For what I can tell now is the second approach would generate bigger dump file size, so better topipe it to bzip to have a compressed file. But other than that, any other considerations? Thanks, Suya -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of John R Pierce Sent: Friday, October 18, 2013 11:23 AM To: pgsql-general@postgresql.org Subject: Re: [GENERAL] [ADMIN] what's the efficient/safest way to convert database character set ? On 10/17/2013 3:13 PM, Huang, Suya wrote: > I've got a question of converting database from ascii to UTF-8, what's > the best approach to do so if the database size is very large? > Detailed procedure or experience sharing are much appreciated! I believe you will need to dump the whole database, and import it into a new database that uses UTF8 encoding. Ss far as I know, there's no way to convert encoding in place. As the other gentlemen pointed out, you also will have to convert/sanitize all text data, as your current SQL_ASCII fields could easily contain stuff that's not valid UTF8. for large databases, this is a major undertaking. I find its often easiest to do a major change like this between the old and a new database server. -- john r pierce 37N 122W somewhere on the middle of the left coast -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [ADMIN] what's the efficient/safest way to convert database character set ?
От
John R Pierce
Дата:
On 10/17/2013 9:49 PM, Huang, Suya wrote: > Yes John, we probably will use a new database server here to accommodate those converted database. > > By saying export/import, do you mean by : > 1. pg_dump (//should I specify -E UTF 8 to dump the data in UTF-8 encoding?) > 2. create database xxx -E UTF8 > 3. pg_restore I don't believe 8.3 supported multiple different encodings on the same server instance, thats relatively new. before you can import your SQL_ASCII data, you need to know what charset the data is actually in. Is it UTF8 data stored in SQL_ASCII or is it LATIN1 (ISO-8859) ? or some sort of Big5 or euc_cn ? or what? if it is already UTF8 data, are you sure that there are no invalid encodings accidentally stored? Postgres with SQL_ASCII does no character validation... if its all USASCII (0x00 to 0x7F) then you're OK. I would strongly recommend this new database server be running a currently supported version, I'd probably use 9.2. configure the old server to allow the postgres user on the new server to connect and log on, and while logged onto the new server, run something like... pg_dump -h oldserver -d olddatabase -u postgres -E oldencoding -s -f olddatabase.schema.sql pg_dump -h oldserver -d olddatabase -u postgres -E oldencoding -a -f olddatabase.data.sql createuser newuser createdb -O newuser -l en_US.utf8 newdbname psql -d newdbname -u newuser -f olddatabase.schema.sql psql -d newdbname -u newuser -f olddatabase.data.sql if the data import fails due to a invalid encoding, then you may have to pass the .data.sql file through iconv (and remove the set client_encoding sql commands from it) -- john r pierce 37N 122W somewhere on the middle of the left coast
Re: [ADMIN] what's the efficient/safest way to convert database character set ?
От
"Huang, Suya"
Дата:
Hi John, " Is it UTF8 data stored in SQL_ASCII or is it LATIN1 (ISO-8859) ? or some sort of Big5 or euc_cn ? or what? if it is already UTF8 data, are you sure that there are noinvalid encodings accidentally stored? Postgres with SQL_ASCII does no character validation... if its all USASCII (0x00 to 0x7F) then you're OK." Question: How can I pull out the real character set information from the database? does it rely on the understanding of businessknowledge? so, the real data stored in database decides if we need to use iconv to convert them to UTF-8. If data is USASCII, thenthe pg_dump/restore process you provided should be sufficient and complete. Do I understand this correctly? Besides, we do have different encoding database on same server instance, see below: Welcome to psql 8.3.11, the PostgreSQL interactive terminal. Type: \copyright for distribution terms \h for help with SQL commands \? for help with psql commands \g or terminate with semicolon to execute query \q to quit postgres=# \l List of databases Name | Owner | Encoding -----------------------------+----------+----------- admin | dba | UTF8 pgdb_1_sky | pgdb_1 | SQL_ASCII pgdb_1_sky_utf8 | pgdb_1 | UTF8 pgdb_1_ca | pgdb_1 | SQL_ASCII pgdb_1_us | pgdb_1 | SQL_ASCII pgdb_sky | pgdb | SQL_ASCII pgdb_sky_users | pgdb | SQL_ASCII pgdb_sky_users_utf8 | pgdb | UTF8 pgdb_sky_utf8 | pgdb | UTF8 pgdb_sky_utf8_86465_old | pgdb | UTF8 Thanks, Suya -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of John R Pierce Sent: Friday, October 18, 2013 4:12 PM To: pgsql-general@postgresql.org Subject: Re: [GENERAL] [ADMIN] what's the efficient/safest way to convert database character set ? On 10/17/2013 9:49 PM, Huang, Suya wrote: > Yes John, we probably will use a new database server here to accommodate those converted database. > > By saying export/import, do you mean by : > 1. pg_dump (//should I specify -E UTF 8 to dump the data in UTF-8 > encoding?) 2. create database xxx -E UTF8 3. pg_restore I don't believe 8.3 supported multiple different encodings on the same server instance, thats relatively new. before you can import your SQL_ASCII data, you need to know what charset the data is actually in. Is it UTF8 data stored in SQL_ASCII or is it LATIN1 (ISO-8859) ? or some sort of Big5 or euc_cn ? or what? if it is already UTF8 data, are you sure that there are noinvalid encodings accidentally stored? Postgres with SQL_ASCII does no character validation... if its all USASCII (0x00 to 0x7F) then you're OK. I would strongly recommend this new database server be running a currently supported version, I'd probably use 9.2. configure the old server to allow the postgres user on the new server to connect and log on, and while logged onto the new server, run somethinglike... pg_dump -h oldserver -d olddatabase -u postgres -E oldencoding -s -f olddatabase.schema.sql pg_dump -h oldserver -d olddatabase -u postgres -E oldencoding -a -f olddatabase.data.sql createuser newuser createdb -O newuser -l en_US.utf8 newdbname psql -d newdbname -u newuser -f olddatabase.schema.sql psql -d newdbname -u newuser -f olddatabase.data.sql if the data import fails due to a invalid encoding, then you may have to pass the .data.sql file through iconv (and removethe set client_encoding sql commands from it) -- john r pierce 37N 122W somewhere on the middle of the left coast -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [ADMIN] what's the efficient/safest way to convert database character set ?
От
John R Pierce
Дата:
On 10/17/2013 10:51 PM, Huang, Suya wrote: > Question: How can I pull out the real character set information from the database? does it rely on the understanding ofbusiness knowledge? what did you store in it? because its declared SQL_ASCII, postgres doesn't know, its all just bytes. you could have stored standard 7 bit USASCII, or you could have stored ISO-8859-1 (-2, -3,...), or you could have stored one of the myriad non-UTF Asian multibyte character codes. postgres doesn't know or care what you put in there, and it doesn't check to ensure its valid. IF all your data is in a consistent encoding, and you specify that encoding on the pg_dump command, then the psql command should be able to restore it as-is to the new UTF8 database via the magic of client_encoding. if the data is not consistent, you'll have a much harder time. -- john r pierce 37N 122W somewhere on the middle of the left coast
Re: [ADMIN] what's the efficient/safest way to convert database character set ?
От
"Huang, Suya"
Дата:
Hi John, We stored ISO-8859-1, do we still need to use iconv to convert the file? Thanks, Suya -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of John R Pierce Sent: Friday, October 18, 2013 5:12 PM To: pgsql-general@postgresql.org Subject: Re: [GENERAL] [ADMIN] what's the efficient/safest way to convert database character set ? On 10/17/2013 10:51 PM, Huang, Suya wrote: > Question: How can I pull out the real character set information from the database? does it rely on the understanding ofbusiness knowledge? what did you store in it? because its declared SQL_ASCII, postgres doesn't know, its all just bytes. you could have stored standard 7 bit USASCII, or you could have stored ISO-8859-1 (-2, -3,...), or you could have stored one of the myriad non-UTF Asian multibytecharacter codes. postgres doesn't know or care what you put in there, and it doesn't check to ensure its valid. IF all your data is in a consistent encoding, and you specify that encoding on the pg_dump command, then the psql commandshould be able to restore it as-is to the new UTF8 database via the magic of client_encoding. if the data is not consistent, you'll have a much harder time. -- john r pierce 37N 122W somewhere on the middle of the left coast -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [ADMIN] what's the efficient/safest way to convert database character set ?
От
John R Pierce
Дата:
On 10/20/2013 3:56 AM, Huang, Suya wrote: > We stored ISO-8859-1, do we still need to use iconv to convert the file? you should be able to do the dump specifying pg_dump --encoding=iso-8559-1 then restore it to the utf8 database, and postgres will do the conversion for you. -- john r pierce 37N 122W somewhere on the middle of the left coast
Re: [ADMIN] what's the efficient/safest way to convert database character set ?
От
"Huang, Suya"
Дата:
> We stored ISO-8859-1, do we still need to use iconv to convert the file? you should be able to do the dump specifying pg_dump --encoding=iso-8559-1 then restore it to the utf8 database, and postgres will do the conversion for you. One more question, if the data stored in database including UTF-8 and ISO 8859-1, is that OK to dump them all in ISO-8859-1? Thanks, Suya
Re: [ADMIN] what's the efficient/safest way to convert database character set ?
От
John R Pierce
Дата:
On 10/20/2013 4:10 PM, Huang, Suya wrote: > uestion, if the data stored in database including UTF-8 and ISO 8859-1, is that OK to dump them all in ISO-8859-1? absolutely not. UTF8 is not valid ISO88591, and if interpreted as such will result in garbage characters. if you have mixed character encodings in your database, you've got some issues trying to sort that all out to be consistent. -- john r pierce 37N 122W somewhere on the middle of the left coast