Обсуждение: COPY command use UTF-8 encoding and NOT UNICODE(16bits)... please confirm. Should postgresql add :set CLIENT_ENCODING to 'UTF-8'; to avoid confusion

Поиск

Список

Период

Сортировка

COPY command use UTF-8 encoding and NOT UNICODE(16bits)... please confirm. Should postgresql add :set CLIENT_ENCODING to 'UTF-8'; to avoid confusion

От

David Gagnon

Дата:

07 апреля 2005 г., 02:08:40

Hi all,

I ran into this problem and want to share and have a confirmation.

I tried to use COPY function to load bulk data. I craft myself a UNICODE file from a MSSQL db. I can't load it into the postgresql. I always get the error: CONTEXT: COPY vd, line 1, column vdnum: "ÿþ1"

The problem is that both file are exactly the same... I found that pg_dump create in fact a UTF-8 (Confirm please) file with is UNICODE but with variable length encoding (Ie: Some character user 8 bytes and other 16 bytes ...). See for detail: http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8. The file I crafted is a true UNICODE (16 bytes or UCS-2) file (Confirm please)

So here is the content of the file:
UTF-8 (Postgresql dump):
1    1    1    AC    COLUMNÿACNUMÿACDESCÿACDELPAIÿ

UNICODE (crafted from mssql)
1    1    1    AC    COLUMNÿACNUMÿACDESCÿACDELPAIÿ

HEX representation UTF-8 (Postgresql dump):

00000000:31 09 31 09 31 09 41 43 09 43 4f 4c 55 4d 4e c3    1.1.1.AC.COLUMNÃ
00000010:bf 41 43 4e 55 4d c3 bf 41 43 44 45 53 43 c3 bf    ¿ACNUMÃ¿ACDESCÃ¿
00000020:41 43 44 45 4c 50 41 49 c3 bf                      ACDELPAIÃ¿

HEX representation UNICODE (crafted from mssql)
00000000:ff fe 31 00 09 00 31 00 09 00 31 00 09 00 41 00    ÿþ1...1...1...A.
00000010:43 00 09 00 43 00 4f 00 4c 00 55 00 4d 00 4e 00    C...C.O.L.U.M.N.
00000020:ff 00 41 00 43 00 4e 00 55 00 4d 00 ff 00 41 00    ÿ.A.C.N.U.M.ÿ.A.
00000030:43 00 44 00 45 00 53 00 43 00 ff 00 41 00 43 00    C.D.E.S.C.ÿ.A.C.
00000040:44 00 45 00 4c 00 50 00 41 00 49 00 ff 00          D.E.L.P.A.I.ÿ.

So postgresql bug with the FF FE that start the UNICODE document. Is that normal UNICODE file starts with this FF FE ?! Note that I tried to delete those character but they aren`t visible...

So am I right ? Is Postgresql using UTF-8 and don`t really understand UNICODE file (UCS-2)? Is there a way I can make the COPY command with a UNICODE UCS-2 encoding

Thanks for your help
/David

Re: COPY command use UTF-8 encoding and NOT UNICODE(16bits)... please confirm. Should postgresql add :set CLIENT_ENCODING to 'UTF-8'; to avoid confusion

От

Tom Lane

Дата:

07 апреля 2005 г., 02:22:15

David Gagnon <dgagnon@siunik.com> writes:
> So am I right ? Is Postgresql using UTF-8 and don`t really understand
> UNICODE file (UCS-2)?  Is there a way I can make the COPY command with a
> UNICODE UCS-2 encoding

Postgres only supports UTF-8, not any other encoding of Unicode.  Sorry.

            regards, tom lane

Re: COPY command use UTF-8 encoding and NOT

От

Tino Wildenhain

Дата:

07 апреля 2005 г., 02:40:58

Am Mittwoch, den 06.04.2005, 18:12 -0400 schrieb David Gagnon:
> Hi all,
>
>   I ran into this problem and want to share and have a confirmation.
>
> I tried to use COPY function to load bulk data.  I craft myself a
> UNICODE file from a MSSQL db.  I can't load it into the postgresql.  I
> always get the error: CONTEXT:  COPY vd, line 1, column vdnum: "ÿþ1"
>
> The problem is that both file are exactly the same... I found that
> pg_dump create in fact a UTF-8 (Confirm please) file with is UNICODE
> but with variable length encoding (Ie: Some character user 8 bytes and
> other 16 bytes ...).  See for detail:
> http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8.  The file I crafted
> is a true UNICODE (16 bytes or UCS-2) file (Confirm please)
>
> So here is the content of the file:
> UTF-8 (Postgresql dump):
> 1    1    1    AC    COLUMNÿACNUMÿACDESCÿACDELPAIÿ
>
> UNICODE (crafted from mssql)
> 1    1    1    AC    COLUMNÿACNUMÿACDESCÿACDELPAIÿ
>
> HEX representation UTF-8 (Postgresql dump):
>
> 00000000:31 09 31 09 31 09 41 43 09 43 4f 4c 55 4d 4e c3
> 1.1.1.AC.COLUMNÃ
> 00000010:bf 41 43 4e 55 4d c3 bf 41 43 44 45 53 43 c3 bf
> ¿ACNUMÃ¿ACDESCÃ¿
> 00000020:41 43 44 45 4c 50 41 49 c3 bf
> ACDELPAIÃ¿
>
> HEX representation UNICODE (crafted from mssql)
> 00000000:ff fe 31 00 09 00 31 00 09 00 31 00 09 00 41 00
> ÿþ1...1...1...A.
> 00000010:43 00 09 00 43 00 4f 00 4c 00 55 00 4d 00 4e 00
> C...C.O.L.U.M.N.
> 00000020:ff 00 41 00 43 00 4e 00 55 00 4d 00 ff 00 41 00
> ÿ.A.C.N.U.M.ÿ.A.
> 00000030:43 00 44 00 45 00 53 00 43 00 ff 00 41 00 43 00
> C.D.E.S.C.ÿ.A.C.
> 00000040:44 00 45 00 4c 00 50 00 41 00 49 00 ff 00
> D.E.L.P.A.I.ÿ.
>
> So postgresql bug with the FF FE that start the UNICODE document.  Is
> that normal UNICODE file starts with this FF FE ?! Note that I tried
> to delete those character but they aren`t visible...
>
> So am I right ? Is Postgresql using UTF-8 and don`t really understand
> UNICODE file (UCS-2)?  Is there a way I can make the COPY command with
> a UNICODE UCS-2 encoding

Yes, postgres Unicode means utf-8.
Windows programs which store unicode text are usually prepending
the file with a BOM (byte order mark) (just google for unicode
and BOM)
So you would need something to convert. For example I know python
can read BOM so it can be a matter of open it, read, encode it
into utf-8 and write it out again.

Regards
Tino Wildenhain

Re: COPY command use UTF-8 encoding and NOT UNICODE(16bits)... please confirm. Should postgresql add :set CLIENT_ENCODING to 'UTF-8'; to avoid confusion

От

Mike Rylander

Дата:

07 апреля 2005 г., 06:13:45

On Apr 6, 2005 10:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Gagnon <dgagnon@siunik.com> writes:
> > So am I right ? Is Postgresql using UTF-8 and don`t really understand
> > UNICODE file (UCS-2)?  Is there a way I can make the COPY command with a
> > UNICODE UCS-2 encoding
>
> Postgres only supports UTF-8, not any other encoding of Unicode.  Sorry.
>

You can use iconv to convert the file on a *nix-like system, and there
may even be a cygwin build if you're on Windows.

http://www.gnu.org/software/libiconv/

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org

Improvement for COPY command .. unless it already exists (If yes please tell me!)

От

David Gagnon

Дата:

07 апреля 2005 г., 16:18:21

Hi all,

  Thanks for your reply on my yesterday's question regarding UTF-8 as a
UNICODE implementation in postgresql.

Is there a way to specify default values in the COPY command?  In my
example example VDVSSRC and VDVSNUM are the same for the 1500000 rows of
the file.  If it was possible to default their value instead of putting
it in the file, it would have been great!

COPY vd (VDNUM=DEFAULT, VDVSSRC=2, VDVSNUM=1, VDKEY, VDDATA)
FROM '/cygdrive/f/projects/web-catalog/var/vdOk.backup';


Thanks anyway .. your doing a wonderfull job with postgresql!
/David

Re: Improvement for COPY command .. unless it already exists (If yes please tell me!)

От

Tom Lane

Дата:

07 апреля 2005 г., 18:15:32

David Gagnon <dgagnon@siunik.com> writes:
> Is there a way to specify default values in the COPY command?

There's always ALTER TABLE ... SET DEFAULT.

            regards, tom lane

Re: Improvement for COPY command .. unless it already exists

От

David Gagnon

Дата:

07 апреля 2005 г., 18:30:34

Hi,

I want to default some columns for this COPY command only.  So DEFAULT
is not appropriate in this case.  In fact I think you suggested a
workaround right ?  Doing an alter table before and another one after
will work ... not fully clean thought since my user may not have the
write to modify the table.   Being able to specify default values
directly in the COPY command will be really nice :-)  I just wanted to
share my tought :-)

Being eable to read zipped file will be cool to .. but I'm not sure it's
Chrismas yet right so I didn't ask ;-)

Thanks for your help
/David

Tom Lane wrote:

>David Gagnon <dgagnon@siunik.com> writes:
>
>
>>Is there a way to specify default values in the COPY command?
>>
>>
>
>There's always ALTER TABLE ... SET DEFAULT.
>
>            regards, tom lane
>
>
>

Re: Improvement for COPY command .. unless it already exists

От

Sean Davis

Дата:

07 апреля 2005 г., 19:05:22

On Apr 7, 2005, at 10:33 AM, David Gagnon wrote:

> Hi,
>
> I want to default some columns for this COPY command only.  So DEFAULT
> is not appropriate in this case.  In fact I think you suggested a
> workaround right ?  Doing an alter table before and another one after
> will work ... not fully clean thought since my user may not have the
> write to modify the table.   Being able to specify default values
> directly in the COPY command will be really nice :-)  I just wanted to
> share my tought :-)
>
> Being eable to read zipped file will be cool to .. but I'm not sure
> it's Chrismas yet right so I didn't ask ;-)
>
> Thanks for your help
> /David

David,

If you use a simple wrapper around copy using DBI (as either an
untrusted perl database function) or straight perl, you could read
zipped files as a filehandle.

As for defaults, you could also write a trigger to accomplish the
"default" behavior or, if you use the above perl-based loader, you
could implement defaults as well....

Sean

>
> Tom Lane wrote:
>
>> David Gagnon <dgagnon@siunik.com> writes:
>>
>>> Is there a way to specify default values in the COPY command?
>>>
>>
>> There's always ALTER TABLE ... SET DEFAULT.
>>
>>             regards, tom lane
>>
>>
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: COPY command use UTF-8 encoding and NOT UNICODE(16bits)... please confirm. Should postgresql add :set CLIENT_ENCODING to 'UTF-8'; to avoid confusion