Обсуждение: retry: converting ASCII to UTF-8

Поиск
Список
Период
Сортировка

retry: converting ASCII to UTF-8

От
Tom Hart
Дата:
I didn't see this come through the first time, so I'm retrying. I
apologize if this comes through twice.
------------------

Hello everybody. I hope your week's going well so far.

I built our data mine in postgreSQL around 3 months ago and I've been
working with it since. Postgres is great and I'm really enjoying it, but
I've hit a bit of a hitch. Originally (and against pgAdmin's good
advice, duh!) I set up the database to use ASCII encoding. However we
have a large base of Spanish speaking members and services, and we need
utf-8 encoding to maintain and support the extended character sets. In
my naivety I thought it would be a relatively simple process to convert
the db but I've found this to not be the case. I tried doing a dump and
restore into a new database with the proper encoding, but pg_restore is
getting hung up on one of the tables, our largest by far (~1gb, not huge
I know). When I tried pg_restore from a command line (I was using
pgAdmin, I know i'm a nub) I received this error.

C:\Program Files\PostgreSQL\8.2\bin>pg_restore.exe -i -h 192.168.1.xxx
-p 5432 -U foobar -d warehouse_utf8 -a -t "transaction" -v
"O:\foo\bar\pg_dump_transaction.backup"
pg_restore: connecting to database for restore
Password:
pg_restore: restoring data for table "transaction"
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 1706; 0 17861 TABLE
DATA transaction foobar
pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
for encoding "UTF8": 0xc52f
HINT:  This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT:  COPY transaction, line 209487
WARNING: errors ignored on restore: 1

I remember reading somewhere recently that I could use iconv to convert
the ASCII encoded dump to UTF-8 encoding, but I'm currently on a windows
box, and a windows server, so is there an easier way to do this? Also I
was thinking perhaps it was possible to do an ETL type setup, where I
can SELECT from the ASCII db and INSERT into the UTF-8 db.

If you haven't gathered yet, I'm pretty in the dark regarding encoding
issues, especially when applied to pg, so any help here would be
appreciated.

--
Tom Hart
IT Specialist
Cooperative Federal
723 Westcott St.
Syracuse, NY 13210
(315) 471-1116 ext. 202
(315) 476-0567 (fax)


Re: retry: converting ASCII to UTF-8

От
"Albe Laurenz"
Дата:
Tom Hart wrote:
> Originally I set up the database to use ASCII encoding. However we
> have a large base of Spanish speaking members and services,
> and we need utf-8 encoding to maintain and support the extended character
> sets. In my naivety I thought it would be a relatively simple process
> to convert the db but I've found this to not be the case. I tried doing
> a dump and restore into a new database with the proper encoding, but
> pg_restore is getting hung up on one of the tables, our largest by far
> (~1gb, not huge I know). When I tried pg_restore from a command line
> (I was using pgAdmin, I know i'm a nub) I received this error.
>
> C:\Program Files\PostgreSQL\8.2\bin>pg_restore.exe -i -h 192.168.1.xxx -p 5432 -U foobar -d warehouse_utf8 -a -t
"transaction"-v "O:\foo\bar\pg_dump_transaction.backup" 
> pg_restore: connecting to database for restore
> Password:
> pg_restore: restoring data for table "transaction"
> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> pg_restore: [archiver (db)] Error from TOC entry 1706; 0 17861 TABLE DATA transaction foobar
> pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xc52f
> HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is
controlledby "client_encoding". 
> CONTEXT:  COPY transaction, line 209487
> WARNING: errors ignored on restore: 1

You will have to figure out in which encoding the data really are.

SQL_ASCII will allow anything in, and the client is responsible for
feeding the right thing.

Somebody stored a 0xC5 in the database, which is not ASCII.
In WIN-1252, ISO8859-1, and ISO8859-15 that would be Å.
Could that be?

Once you find out the correct encoding, you can uye the -E switch of
pg_dump to set that encoding for your exported data.

If your clients entered consistent data, that should work.

If different clients used different encodings, you might end up
sorting it out manually...

Yours,
Laurenz Albe