Thread: pgsql cannot read utf8 files moved from windows correctly!

pgsql cannot read utf8 files moved from windows correctly!

From
"bookman bookman"
Date:
Hi,

I copied a table from SQL Server 2005 to a text file (it contains many
Chinese words). I saved it as an ANSI-encoded file, but I can't
open it in Ubuntu. I tried GBK, GB18030, and
UTF8; it just could not be opened.

    Then I saved it in Windows with UTF8 encoding, and then I could open it in
Ubuntu. I copied it to PostgreSQL, but the file could not be read
correctly. For example, here is a file:

--book.txt
bookid(int)   bookname(varchar(30))
1                  Java

I created a table "book" in PostgreSQL, then entered the command:
     copy book from '/home/postgres/data/book.txt'
The error was:
    error:invalid input syntax for integer:"  1";
    context:line 1,column bookid
I know that every line of a UTF-8 file starts with "fffe" or "feff"
and ends with "\r\n" on Windows but not on Linux, so the character
"1" has a space before it in the error line.

    Is there any way I can transfer a UTF-8 file from Windows to a Linux system?

Thank you!

Re: pgsql cannot read utf8 files moved from windows correctly!

From
Martijn van Oosterhout
Date:
On Tue, Dec 18, 2007 at 02:53:16PM +0800, bookman bookman wrote:
> I know that every line of utf8 files  is started with "fffe" or "feff"
>  and ended with "\r\n" in windows but not in linux,so  the character
> "1" has a space before it in the error line.

Err, no. In UTF-16 files it is common to begin the *file* with that
character, but UTF-8 doesn't have that character anywhere, it's
illegal. Just stripping them out should be fine.
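A minimal sketch of the stripping step, assuming the file begins with the three-byte sequence EF BB BF that Windows editors often write at the start of UTF-8 files, and that lines end in "\r\n". The function name `strip_bom_and_crlf` and the sample row are hypothetical, used only for illustration:

```python
import codecs


def strip_bom_and_crlf(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM, if any, and convert CRLF line endings to LF."""
    if data.startswith(codecs.BOM_UTF8):  # b'\xef\xbb\xbf'
        data = data[len(codecs.BOM_UTF8):]
    return data.replace(b"\r\n", b"\n")


# Hypothetical one-row sample shaped like a tab-delimited COPY input line.
sample = codecs.BOM_UTF8 + b"1\tJava\r\n"
cleaned = strip_bom_and_crlf(sample)
print(cleaned)
```

In practice the same function could be applied to the bytes read from the exported file before handing the result to COPY.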

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy

Re: pgsql cannot read utf8 files moved from windows correctly!

From
"Trevor Talbot"
Date:
On 12/20/07, Martijn van Oosterhout <kleptog@svana.org> wrote:
> On Tue, Dec 18, 2007 at 02:53:16PM +0800, bookman bookman wrote:

> > I know that every line of utf8 files  is started with "fffe" or "feff"
> >  and ended with "\r\n" in windows but not in linux,so  the character
> > "1" has a space before it in the error line.

> Err, no. In UTF-16 files it is common to begin the *file* with that
> character, but UTF-8 doesn't have that character anywhere, it's
> illegal. Just stripping them out should be fine.

A BOM is perfectly legal in UTF-8, and it's commonly used as a
signature to indicate the text is UTF-8 instead of another encoding.
But yes, it is at the beginning of the file only.

http://unicode.org/faq/utf_bom.html#29
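As an illustration of the signature being legal but only meaningful at the start of the data, here is a small Python sketch. It relies on Python's standard "utf-8" and "utf-8-sig" codecs; the sample bytes are made up for the example:

```python
# Hypothetical file contents: UTF-8 BOM followed by one tab-delimited row.
raw = b"\xef\xbb\xbf1\tJava\n"

# The plain codec accepts the BOM and keeps it as the character U+FEFF.
with_bom = raw.decode("utf-8")

# The "utf-8-sig" codec treats a leading BOM as a signature and drops it.
without_bom = raw.decode("utf-8-sig")

# "utf-8-sig" also accepts input that has no BOM at all.
no_bom_input = b"1\tJava\n".decode("utf-8-sig")

print(repr(with_bom), repr(without_bom))
```

A consumer that does not strip the signature sees an extra invisible character in front of the first field, which is one plausible way for a leading "1" to stop parsing as an integer.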

Re: pgsql cannot read utf8 files moved from windows correctly!

From
"Martin Gainty"
Date:
It seems the use of a BOM in UTF-8 is discouraged:
http://unicode.org/faq/utf_bom.html#BOM
FF FE is UTF-16 little-endian
FE FF is UTF-16 big-endian
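For comparison, the UTF-8 signature is the three bytes EF BB BF. A rough Python sketch of telling the three apart by their leading bytes follows; the helper name `sniff_bom` is hypothetical, and the sketch deliberately ignores UTF-32, whose little-endian BOM also starts with FF FE:

```python
import codecs
from typing import Optional

# BOM byte sequences from the standard library, paired with a label.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(b"\xff\xfeh\x00"))
print(sniff_bom(b"plain ascii"))
```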

Please verify-
Bedankt/
Martin-
----- Original Message -----
From: "Trevor Talbot" <quension@gmail.com>
To: <pgsql-general@postgresql.org>
Sent: Sunday, December 23, 2007 10:39 AM
Subject: Re: [GENERAL] pgsql cannot read utf8 files moved from windows
correctly!


> On 12/20/07, Martijn van Oosterhout <kleptog@svana.org> wrote:
> > On Tue, Dec 18, 2007 at 02:53:16PM +0800, bookman bookman wrote:
>
> > > I know that every line of utf8 files  is started with "fffe" or "feff"
> > >  and ended with "\r\n" in windows but not in linux,so  the character
> > > "1" has a space before it in the error line.
>
> > Err, no. In UTF-16 files it is common to begin the *file* with that
> > character, but UTF-8 doesn't have that character anywhere, it's
> > illegal. Just stripping them out should be fine.
>
> A BOM is perfectly legal in UTF-8, and it's commonly used as a
> signature to indicate the text is UTF-8 instead of another encoding.
> But yes, it is at the beginning of the file only.
>
> http://unicode.org/faq/utf_bom.html#29


Re: pgsql cannot read utf8 files moved from windows correctly!

From
brian
Date:
bookman bookman wrote:
> Hi,
>
> I copied a table in sqlserver2005 to a txt file(There were many
> chinese words in it).I saved it as a file encoded by ANSI,but I cant
> open it in ubuntu.I tried GBK,GB18030,
> UTF8,It just could not be opened.
>
>     Then I save it in windows with encoding UTF8,then I can open it in
> ubuntu.I copied it to postgresql,but the file could not be read
> correctly.For example,here is a file:
>
> --book.txt
> bookid(int)   bookname(varchar(30))
> 1                  Java
>
> I created a table "book" in postgre,then I input the command line:
>      copy book from '/home/postgres/data/book.txt'
> The error was:
>     error:invalid input syntax for integer:"  1";
>     context:line 1,column bookid
> I know that every line of utf8 files  is started with "fffe" or "feff"
>  and ended with "\r\n" in windows but not in linux,so  the character
> "1" has a space before it in the error line.
>

Not long ago I ran into a similar problem with UTF-8 and a BOM. It turned
out that a client of mine had edited some files in an old version of
Homesite for Windows, which has a bit of an issue in this area:

http://kb.adobe.com/selfservice/viewContent.do?externalId=tn_19059&sliceId=1

Perhaps yours is a related problem?

brian

Re: pgsql cannot read utf8 files moved from windows correctly!

From
"Trevor Talbot"
Date:
On 12/23/07, Martin Gainty <mgainty@hotmail.com> wrote:

> it seems the use of BOM in UTF-8 is discouraged
> http://unicode.org/faq/utf_bom.html#BOM

Where do you see it being discouraged?

Re: pgsql cannot read utf8 files moved from windows correctly!

From
"Martin Gainty"
Date:
The specifics:

Some byte oriented protocols expect ASCII characters at the beginning of a
file.
If UTF-8 is used with these protocols, use of the BOM as encoding form
signature should be avoided.

M--

----- Original Message -----
From: "Trevor Talbot" <quension@gmail.com>
To: <pgsql-general@postgresql.org>
Sent: Sunday, December 23, 2007 1:55 PM
Subject: Re: [GENERAL] pgsql cannot read utf8 files moved from windows
correctly!


> On 12/23/07, Martin Gainty <mgainty@hotmail.com> wrote:
>
> > it seems the use of BOM in UTF-8 is discouraged
> > http://unicode.org/faq/utf_bom.html#BOM
>
> Where do you see it being discouraged?


Re: pgsql cannot read utf8 files moved from windows correctly!

From
"Trevor Talbot"
Date:
On 12/23/07, Martin Gainty <mgainty@hotmail.com> wrote:

> the specifics..
>
> Some byte oriented protocols expect ASCII characters at the beginning of a
> file.
> If UTF-8 is used with these protocols, use of the BOM as encoding form
> signature should be avoided.

Sure, but that isn't true of generic text files, which is one of the
major applications of a UTF-8 BOM. Especially when said text files are
being fed to something that understands multiple encodings. The other
items on that page say as much...