Обсуждение: COPY command use UTF-8 encoding and NOT UNICODE(16bits)... please confirm. Should postgresql add :set CLIENT_ENCODING to 'UTF-8'; to avoid confusion
Hi all,
I ran into this problem and want to share and have a confirmation.
I tried to use COPY function to load bulk data. I craft myself a UNICODE file from a MSSQL db. I can't load it into the postgresql. I always get the error: CONTEXT: COPY vd, line 1, column vdnum: "ÿþ1"
The problem is that both file are exactly the same... I found that pg_dump create in fact a UTF-8 (Confirm please) file with is UNICODE but with variable length encoding (Ie: Some character user 8 bytes and other 16 bytes ...). See for detail: http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8. The file I crafted is a true UNICODE (16 bytes or UCS-2) file (Confirm please)
So here is the content of the file:
UTF-8 (Postgresql dump):
1 1 1 AC COLUMNÿACNUMÿACDESCÿACDELPAIÿ
UNICODE (crafted from mssql)
1 1 1 AC COLUMNÿACNUMÿACDESCÿACDELPAIÿ
HEX representation UTF-8 (Postgresql dump):
00000000:31 09 31 09 31 09 41 43 09 43 4f 4c 55 4d 4e c3 1.1.1.AC.COLUMNÃ
00000010:bf 41 43 4e 55 4d c3 bf 41 43 44 45 53 43 c3 bf ¿ACNUMÿACDESCÿ
00000020:41 43 44 45 4c 50 41 49 c3 bf ACDELPAIÿ
HEX representation UNICODE (crafted from mssql)
00000000:ff fe 31 00 09 00 31 00 09 00 31 00 09 00 41 00 ÿþ1...1...1...A.
00000010:43 00 09 00 43 00 4f 00 4c 00 55 00 4d 00 4e 00 C...C.O.L.U.M.N.
00000020:ff 00 41 00 43 00 4e 00 55 00 4d 00 ff 00 41 00 ÿ.A.C.N.U.M.ÿ.A.
00000030:43 00 44 00 45 00 53 00 43 00 ff 00 41 00 43 00 C.D.E.S.C.ÿ.A.C.
00000040:44 00 45 00 4c 00 50 00 41 00 49 00 ff 00 D.E.L.P.A.I.ÿ.
So postgresql bug with the FF FE that start the UNICODE document. Is that normal UNICODE file starts with this FF FE ?! Note that I tried to delete those character but they aren`t visible...
So am I right ? Is Postgresql using UTF-8 and don`t really understand UNICODE file (UCS-2)? Is there a way I can make the COPY command with a UNICODE UCS-2 encoding
Thanks for your help
/David
I ran into this problem and want to share and have a confirmation.
I tried to use COPY function to load bulk data. I craft myself a UNICODE file from a MSSQL db. I can't load it into the postgresql. I always get the error: CONTEXT: COPY vd, line 1, column vdnum: "ÿþ1"
The problem is that both file are exactly the same... I found that pg_dump create in fact a UTF-8 (Confirm please) file with is UNICODE but with variable length encoding (Ie: Some character user 8 bytes and other 16 bytes ...). See for detail: http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8. The file I crafted is a true UNICODE (16 bytes or UCS-2) file (Confirm please)
So here is the content of the file:
UTF-8 (Postgresql dump):
1 1 1 AC COLUMNÿACNUMÿACDESCÿACDELPAIÿ
UNICODE (crafted from mssql)
1 1 1 AC COLUMNÿACNUMÿACDESCÿACDELPAIÿ
HEX representation UTF-8 (Postgresql dump):
00000000:31 09 31 09 31 09 41 43 09 43 4f 4c 55 4d 4e c3 1.1.1.AC.COLUMNÃ
00000010:bf 41 43 4e 55 4d c3 bf 41 43 44 45 53 43 c3 bf ¿ACNUMÿACDESCÿ
00000020:41 43 44 45 4c 50 41 49 c3 bf ACDELPAIÿ
HEX representation UNICODE (crafted from mssql)
00000000:ff fe 31 00 09 00 31 00 09 00 31 00 09 00 41 00 ÿþ1...1...1...A.
00000010:43 00 09 00 43 00 4f 00 4c 00 55 00 4d 00 4e 00 C...C.O.L.U.M.N.
00000020:ff 00 41 00 43 00 4e 00 55 00 4d 00 ff 00 41 00 ÿ.A.C.N.U.M.ÿ.A.
00000030:43 00 44 00 45 00 53 00 43 00 ff 00 41 00 43 00 C.D.E.S.C.ÿ.A.C.
00000040:44 00 45 00 4c 00 50 00 41 00 49 00 ff 00 D.E.L.P.A.I.ÿ.
So postgresql bug with the FF FE that start the UNICODE document. Is that normal UNICODE file starts with this FF FE ?! Note that I tried to delete those character but they aren`t visible...
So am I right ? Is Postgresql using UTF-8 and don`t really understand UNICODE file (UCS-2)? Is there a way I can make the COPY command with a UNICODE UCS-2 encoding
Thanks for your help
/David
David Gagnon <dgagnon@siunik.com> writes: > So am I right ? Is Postgresql using UTF-8 and don`t really understand > UNICODE file (UCS-2)? Is there a way I can make the COPY command with a > UNICODE UCS-2 encoding Postgres only supports UTF-8, not any other encoding of Unicode. Sorry. regards, tom lane
Am Mittwoch, den 06.04.2005, 18:12 -0400 schrieb David Gagnon: > Hi all, > > I ran into this problem and want to share and have a confirmation. > > I tried to use COPY function to load bulk data. I craft myself a > UNICODE file from a MSSQL db. I can't load it into the postgresql. I > always get the error: CONTEXT: COPY vd, line 1, column vdnum: "ÿþ1" > > The problem is that both file are exactly the same... I found that > pg_dump create in fact a UTF-8 (Confirm please) file with is UNICODE > but with variable length encoding (Ie: Some character user 8 bytes and > other 16 bytes ...). See for detail: > http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8. The file I crafted > is a true UNICODE (16 bytes or UCS-2) file (Confirm please) > > So here is the content of the file: > UTF-8 (Postgresql dump): > 1 1 1 AC COLUMNÿACNUMÿACDESCÿACDELPAIÿ > > UNICODE (crafted from mssql) > 1 1 1 AC COLUMNÿACNUMÿACDESCÿACDELPAIÿ > > HEX representation UTF-8 (Postgresql dump): > > 00000000:31 09 31 09 31 09 41 43 09 43 4f 4c 55 4d 4e c3 > 1.1.1.AC.COLUMNà > 00000010:bf 41 43 4e 55 4d c3 bf 41 43 44 45 53 43 c3 bf > ¿ACNUMÿACDESCÿ > 00000020:41 43 44 45 4c 50 41 49 c3 bf > ACDELPAIÿ > > HEX representation UNICODE (crafted from mssql) > 00000000:ff fe 31 00 09 00 31 00 09 00 31 00 09 00 41 00 > ÿþ1...1...1...A. > 00000010:43 00 09 00 43 00 4f 00 4c 00 55 00 4d 00 4e 00 > C...C.O.L.U.M.N. > 00000020:ff 00 41 00 43 00 4e 00 55 00 4d 00 ff 00 41 00 > ÿ.A.C.N.U.M.ÿ.A. > 00000030:43 00 44 00 45 00 53 00 43 00 ff 00 41 00 43 00 > C.D.E.S.C.ÿ.A.C. > 00000040:44 00 45 00 4c 00 50 00 41 00 49 00 ff 00 > D.E.L.P.A.I.ÿ. > > So postgresql bug with the FF FE that start the UNICODE document. Is > that normal UNICODE file starts with this FF FE ?! Note that I tried > to delete those character but they aren`t visible... > > So am I right ? Is Postgresql using UTF-8 and don`t really understand > UNICODE file (UCS-2)? Is there a way I can make the COPY command with > a UNICODE UCS-2 encoding Yes, postgres Unicode means utf-8. Windows programs which store unicode text are usually prepending the file with a BOM (byte order mark) (just google for unicode and BOM) So you would need something to convert. For example I know python can read BOM so it can be a matter of open it, read, encode it into utf-8 and write it out again. Regards Tino Wildenhain
On Apr 6, 2005 10:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > David Gagnon <dgagnon@siunik.com> writes: > > So am I right ? Is Postgresql using UTF-8 and don`t really understand > > UNICODE file (UCS-2)? Is there a way I can make the COPY command with a > > UNICODE UCS-2 encoding > > Postgres only supports UTF-8, not any other encoding of Unicode. Sorry. > You can use iconv to convert the file on a *nix-like system, and there may even be a cygwin build if you're on Windows. http://www.gnu.org/software/libiconv/ -- Mike Rylander mrylander@gmail.com GPLS -- PINES Development Database Developer http://open-ils.org
Improvement for COPY command .. unless it already exists (If yes please tell me!)
От
David Gagnon
Дата:
Hi all, Thanks for your reply on my yesterday's question regarding UTF-8 as a UNICODE implementation in postgresql. Is there a way to specify default values in the COPY command? In my example example VDVSSRC and VDVSNUM are the same for the 1500000 rows of the file. If it was possible to default their value instead of putting it in the file, it would have been great! COPY vd (VDNUM=DEFAULT, VDVSSRC=2, VDVSNUM=1, VDKEY, VDDATA) FROM '/cygdrive/f/projects/web-catalog/var/vdOk.backup'; Thanks anyway .. your doing a wonderfull job with postgresql! /David
David Gagnon <dgagnon@siunik.com> writes: > Is there a way to specify default values in the COPY command? There's always ALTER TABLE ... SET DEFAULT. regards, tom lane
Hi, I want to default some columns for this COPY command only. So DEFAULT is not appropriate in this case. In fact I think you suggested a workaround right ? Doing an alter table before and another one after will work ... not fully clean thought since my user may not have the write to modify the table. Being able to specify default values directly in the COPY command will be really nice :-) I just wanted to share my tought :-) Being eable to read zipped file will be cool to .. but I'm not sure it's Chrismas yet right so I didn't ask ;-) Thanks for your help /David Tom Lane wrote: >David Gagnon <dgagnon@siunik.com> writes: > > >>Is there a way to specify default values in the COPY command? >> >> > >There's always ALTER TABLE ... SET DEFAULT. > > regards, tom lane > > >
On Apr 7, 2005, at 10:33 AM, David Gagnon wrote: > Hi, > > I want to default some columns for this COPY command only. So DEFAULT > is not appropriate in this case. In fact I think you suggested a > workaround right ? Doing an alter table before and another one after > will work ... not fully clean thought since my user may not have the > write to modify the table. Being able to specify default values > directly in the COPY command will be really nice :-) I just wanted to > share my tought :-) > > Being eable to read zipped file will be cool to .. but I'm not sure > it's Chrismas yet right so I didn't ask ;-) > > Thanks for your help > /David David, If you use a simple wrapper around copy using DBI (as either an untrusted perl database function) or straight perl, you could read zipped files as a filehandle. As for defaults, you could also write a trigger to accomplish the "default" behavior or, if you use the above perl-based loader, you could implement defaults as well.... Sean > > Tom Lane wrote: > >> David Gagnon <dgagnon@siunik.com> writes: >> >>> Is there a way to specify default values in the COPY command? >>> >> >> There's always ALTER TABLE ... SET DEFAULT. >> >> regards, tom lane >> >> > > > ---------------------------(end of > broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)