Обсуждение: COPY command character set

Поиск
Список
Период
Сортировка

COPY command character set

От
"Peter Headland"
Дата:

The documentation of the COPY command does not state what character set(s) are recognized or written. I need to import and export UTF-8 data; how can I do that?

--

Peter Headland

Architect

Actuate Corporation

Re: COPY command character set

От
Tom Lane
Дата:
"Peter Headland" <pheadland@actuate.com> writes:
> The documentation of the COPY command does not state what character
> set(s) are recognized or written. I need to import and export UTF-8
> data; how can I do that?

set client_encoding = 'utf8';
copy from stdin/to stdout;

            regards, tom lane

Re: COPY command character set

От
"Peter Headland"
Дата:
> set client_encoding = 'utf8';
> copy from stdin/to stdout;

What if I want to do this on the server side (because it's much, much
faster)? Does COPY use the default encoding of the database? If not,
what?

If this is a restrictive as it appears, and there are no outstanding
enhancements planned in this area, I might be interested in improving
this command to allow specifying the encoding and to have it do obvious
stuff like recognize UTF lead bytes automatically. At the very least,
the documentation needs some work to explain these subtleties.

--
Peter Headland
Architect
Actuate Corporation

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, September 09, 2009 19:14
To: Peter Headland
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] COPY command character set

"Peter Headland" <pheadland@actuate.com> writes:
> The documentation of the COPY command does not state what character
> set(s) are recognized or written. I need to import and export UTF-8
> data; how can I do that?

set client_encoding = 'utf8';
copy from stdin/to stdout;

            regards, tom lane

Re: COPY command character set

От
Tom Lane
Дата:
"Peter Headland" <pheadland@actuate.com> writes:
>> set client_encoding = 'utf8';
>> copy from stdin/to stdout;

> What if I want to do this on the server side (because it's much, much
> faster)? Does COPY use the default encoding of the database? If not,
> what?

> If this is a restrictive as it appears, and there are no outstanding
> enhancements planned in this area, I might be interested in improving
> this command to allow specifying the encoding and to have it do obvious
> stuff like recognize UTF lead bytes automatically. At the very least,
> the documentation needs some work to explain these subtleties.

The COPY command reference page saith

    Input data is interpreted according to the current client encoding,
    and output data is encoded in the the current client encoding, even
    if the data does not pass through the client but is read from or
    written to a file.

Seems clear enough to me.

            regards, tom lane

Re: COPY command character set

От
"Peter Headland"
Дата:
> The COPY command reference page saith
>
>    Input data is interpreted according to the current client encoding,
>    and output data is encoded in the the current client encoding, even
>    if the data does not pass through the client but is read from or
>    written to a file.

Rats - I read the manual page twice and that didn't register on my
feeble consciousness. I suspect that I didn't look beyond the word
"client", since I knew I wasn't interested in client behavior and I was
speed-reading. On the assumption that I am not uniquely stupid, maybe we
could re-phrase this slightly, with a "for example", and add a heading
"Localization"?

As a general comment, I18N/L10N is a hairy enough topic that it merits
its own heading in any commands where it is an issue.

How about my suggestion to add a means (extend COPY syntax) to specify
encoding explicitly and handle UTF lead bytes - would that be of
interest?

--
Peter Headland
Architect
Actuate Corporation


-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Thursday, September 10, 2009 10:38
To: Peter Headland
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] COPY command character set

"Peter Headland" <pheadland@actuate.com> writes:
>> set client_encoding = 'utf8';
>> copy from stdin/to stdout;

> What if I want to do this on the server side (because it's much, much
> faster)? Does COPY use the default encoding of the database? If not,
> what?

> If this is a restrictive as it appears, and there are no outstanding
> enhancements planned in this area, I might be interested in improving
> this command to allow specifying the encoding and to have it do
obvious
> stuff like recognize UTF lead bytes automatically. At the very least,
> the documentation needs some work to explain these subtleties.

The COPY command reference page saith

    Input data is interpreted according to the current client encoding,
    and output data is encoded in the the current client encoding, even
    if the data does not pass through the client but is read from or
    written to a file.

Seems clear enough to me.

            regards, tom lane

Re: COPY command character set

От
Adrian Klaver
Дата:
----- "Peter Headland" <pheadland@actuate.com> wrote:

> > The COPY command reference page saith
> >
> >    Input data is interpreted according to the current client
> encoding,
> >    and output data is encoded in the the current client encoding,
> even
> >    if the data does not pass through the client but is read from or
> >    written to a file.
>
> Rats - I read the manual page twice and that didn't register on my
> feeble consciousness. I suspect that I didn't look beyond the word
> "client", since I knew I wasn't interested in client behavior and I
> was
> speed-reading. On the assumption that I am not uniquely stupid, maybe
> we
> could re-phrase this slightly, with a "for example", and add a
> heading
> "Localization"?
>
> As a general comment, I18N/L10N is a hairy enough topic that it
> merits
> its own heading in any commands where it is an issue.
>
> How about my suggestion to add a means (extend COPY syntax) to
> specify
> encoding explicitly and handle UTF lead bytes - would that be of
> interest?
>
> --
> Peter Headland
> Architect
> Actuate Corporation
>

>
> The COPY command reference page saith
>
>     Input data is interpreted according to the current client
> encoding,
>     and output data is encoded in the the current client encoding,
> even
>     if the data does not pass through the client but is read from or
>     written to a file.
>
> Seems clear enough to me.
>
>             regards, tom lane

Maybe the link might help?

http://www.postgresql.org/docs/8.4/interactive/multibyte.html


Adrian Klaver
aklaver@comcast.net

Re: COPY command character set

От
Tom Lane
Дата:
"Peter Headland" <pheadland@actuate.com> writes:
> How about my suggestion to add a means (extend COPY syntax) to specify
> encoding explicitly and handle UTF lead bytes - would that be of
> interest?

There are no lead bytes in UTF-8, and we make no pretense of handling
UTF-16, so I don't think we'd be interested in some hack that cleans
up misencoding problems.

The idea of overriding client_encoding has been suggested before.  I
don't remember if it was rejected or is just languishing on the TODO
list.  I'd be a little worried about sending clients data in an encoding
they aren't expecting, but if it only works for I/O to a file it might
be okay.

            regards, tom lane

Re: COPY command character set

От
"Peter Headland"
Дата:
> There are no lead bytes in UTF-8

Sorry, sloppy use of terminology. I should have said "UTF signatures"
aka the "byte order mark". IOW, the "magic number" bytes commonly found
at the front of UTF encoded files:

UTF-16 little-endian   FF FE
UTF-16 big-endian   FE FF
UTF-8   EF BB BF

These tend to be inserted automatically by text editors, so it would be
advantageous to have them handled automatically by COPY (at least as an
option). Right now, if I edit a UTF-8 file then load it with COPY, I get
errors or bad data if the editor chose to add the 3 signature bytes.

Whilst UTF-16 is not supported internally, COPY seems to be a legitimate
special case, because it is used for migration to/from other tools that
may emit or expect UTF-16. ISTR that Postgres uses UCI? If so it would
be near-trivial to allow COPY to read and write UTF-16. If done via a
syntax extension to COPY (which I think is the most desirable
implementation), this would have no adverse effect on any other
capability. It also seems sufficiently isolated from sensitive/complex
areas of the code that it might make a suitable first project for
someone who is interested in becoming a contributor...

--
Peter Headland
Architect
Actuate Corporation


-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Thursday, September 10, 2009 11:13
To: Peter Headland
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] COPY command character set

"Peter Headland" <pheadland@actuate.com> writes:
> How about my suggestion to add a means (extend COPY syntax) to specify
> encoding explicitly and handle UTF lead bytes - would that be of
> interest?

There are no lead bytes in UTF-8, and we make no pretense of handling
UTF-16, so I don't think we'd be interested in some hack that cleans
up misencoding problems.

The idea of overriding client_encoding has been suggested before.  I
don't remember if it was rejected or is just languishing on the TODO
list.  I'd be a little worried about sending clients data in an encoding
they aren't expecting, but if it only works for I/O to a file it might
be okay.

            regards, tom lane

Re: COPY command character set

От
"Peter Headland"
Дата:
> Maybe the link might help?
> 
> http://www.postgresql.org/docs/8.4/interactive/multibyte.html

That page is too generic; what would be helpful is a section in the doc for each command that is affected by I18N/L10N
considerations,that identifies how that specific command behaves.
 

Now that I have grasped the behavior, I'm more than happy to edit the COPY doc page, if people think that would be
helpful/worthwhile.

-- 
Peter Headland
Architect
Actuate Corporation


-----Original Message-----
From: Adrian Klaver [mailto:aklaver@comcast.net] 
Sent: Thursday, September 10, 2009 11:06
To: Peter Headland
Cc: pgsql-general@postgresql.org; Tom Lane
Subject: Re: [GENERAL] COPY command character set


----- "Peter Headland" <pheadland@actuate.com> wrote:

> > The COPY command reference page saith
> >
> >    Input data is interpreted according to the current client
> encoding,
> >    and output data is encoded in the the current client encoding,
> even
> >    if the data does not pass through the client but is read from or
> >    written to a file.
> 
> Rats - I read the manual page twice and that didn't register on my
> feeble consciousness. I suspect that I didn't look beyond the word
> "client", since I knew I wasn't interested in client behavior and I
> was
> speed-reading. On the assumption that I am not uniquely stupid, maybe
> we
> could re-phrase this slightly, with a "for example", and add a
> heading
> "Localization"?
> 
> As a general comment, I18N/L10N is a hairy enough topic that it
> merits
> its own heading in any commands where it is an issue.
> 
> How about my suggestion to add a means (extend COPY syntax) to
> specify
> encoding explicitly and handle UTF lead bytes - would that be of
> interest?
> 
> -- 
> Peter Headland
> Architect
> Actuate Corporation
> 

> 
> The COPY command reference page saith
> 
>     Input data is interpreted according to the current client
> encoding,
>     and output data is encoded in the the current client encoding,
> even
>     if the data does not pass through the client but is read from or
>     written to a file. 
> 
> Seems clear enough to me.
> 
>             regards, tom lane

Maybe the link might help?

http://www.postgresql.org/docs/8.4/interactive/multibyte.html


Adrian Klaver
aklaver@comcast.net

Re: COPY command character set

От
Alvaro Herrera
Дата:
Peter Headland wrote:

> As a general comment, I18N/L10N is a hairy enough topic that it merits
> its own heading in any commands where it is an issue.

I agree, this seems a good idea because people is often confused by
this.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: COPY command character set

От
Bruce Momjian
Дата:
I have updated the documentation to be more direct about COPY encoding
behavior.  Patch attached and applied.

---------------------------------------------------------------------------

Peter Headland wrote:
> > Maybe the link might help?
> >
> > http://www.postgresql.org/docs/8.4/interactive/multibyte.html
>
> That page is too generic; what would be helpful is a section in the doc for each command that is affected by
I18N/L10Nconsiderations, that identifies how that specific command behaves. 
>
> Now that I have grasped the behavior, I'm more than happy to edit the COPY doc page, if people think that would be
helpful/worthwhile.
>
> --
> Peter Headland
> Architect
> Actuate Corporation
>
>
> -----Original Message-----
> From: Adrian Klaver [mailto:aklaver@comcast.net]
> Sent: Thursday, September 10, 2009 11:06
> To: Peter Headland
> Cc: pgsql-general@postgresql.org; Tom Lane
> Subject: Re: [GENERAL] COPY command character set
>
>
> ----- "Peter Headland" <pheadland@actuate.com> wrote:
>
> > > The COPY command reference page saith
> > >
> > >    Input data is interpreted according to the current client
> > encoding,
> > >    and output data is encoded in the the current client encoding,
> > even
> > >    if the data does not pass through the client but is read from or
> > >    written to a file.
> >
> > Rats - I read the manual page twice and that didn't register on my
> > feeble consciousness. I suspect that I didn't look beyond the word
> > "client", since I knew I wasn't interested in client behavior and I
> > was
> > speed-reading. On the assumption that I am not uniquely stupid, maybe
> > we
> > could re-phrase this slightly, with a "for example", and add a
> > heading
> > "Localization"?
> >
> > As a general comment, I18N/L10N is a hairy enough topic that it
> > merits
> > its own heading in any commands where it is an issue.
> >
> > How about my suggestion to add a means (extend COPY syntax) to
> > specify
> > encoding explicitly and handle UTF lead bytes - would that be of
> > interest?
> >
> > --
> > Peter Headland
> > Architect
> > Actuate Corporation
> >
>
> >
> > The COPY command reference page saith
> >
> >     Input data is interpreted according to the current client
> > encoding,
> >     and output data is encoded in the the current client encoding,
> > even
> >     if the data does not pass through the client but is read from or
> >     written to a file.
> >
> > Seems clear enough to me.
> >
> >             regards, tom lane
>
> Maybe the link might help?
>
> http://www.postgresql.org/docs/8.4/interactive/multibyte.html
>
>
> Adrian Klaver
> aklaver@comcast.net
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +
Index: doc/src/sgml/ref/copy.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/ref/copy.sgml,v
retrieving revision 1.93
diff -c -c -r1.93 copy.sgml
*** doc/src/sgml/ref/copy.sgml    17 Feb 2010 04:19:39 -0000    1.93
--- doc/src/sgml/ref/copy.sgml    23 Feb 2010 05:15:00 -0000
***************
*** 367,376 ****
     </para>

     <para>
!     Input data is interpreted according to the current client encoding,
!     and output data is encoded in the the current client encoding, even
!     if the data does not pass through the client but is read from or
!     written to a file.
     </para>

     <para>
--- 367,376 ----
     </para>

     <para>
!     <command>COPY</command> always processes data according to the
!     current client encoding, even if the data does not pass through
!     the client but is read from or written to a file directly by the
!     server.
     </para>

     <para>

Re: COPY command character set

От
Tom Lane
Дата:
Bruce Momjian <bruce@momjian.us> writes:
> I have updated the documentation to be more direct about COPY encoding
> behavior.  Patch attached and applied.

Uh, why exactly do you find that better?  "Processes data" seems a lot
vaguer to me than the previous wording.  I certainly don't think that
this does much to address Peter's original complaint.

            regards, tom lane

Re: COPY command character set

От
Bruce Momjian
Дата:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > I have updated the documentation to be more direct about COPY encoding
> > behavior.  Patch attached and applied.
>
> Uh, why exactly do you find that better?  "Processes data" seems a lot
> vaguer to me than the previous wording.  I certainly don't think that
> this does much to address Peter's original complaint.

I thought the problem was that we said "input", then "output" and then
got to the point about the server, and I thought the reader just stopped
reading that far, so I tried to shorten it so the idea was sooner, and I
mentioned "server" at the end.  It might not be better, but I tried.

We don't want to highlight the input/output, we want to highlight that
all input and output are controlled by the client encoding.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +

Re: COPY command character set

От
"Peter Headland"
Дата:
As Tom says, this doesn't really address my original issue, which was
not that I read the material on encoding and misunderstood it, but that
I didn't even see that material because it was mixed in with a bunch of
other notes on all sorts of random subjects.

To address this issue in the documentation at large, I would like to see
every command that has I18N/L10N-related behavior have a separate
sub-head for the explanation of that behavior. That way, anyone who
needs to know about that aspect (which should be everyone), just has to
look for the sub-head to be sure they have found what they need to know.
Whilst I know we can't do that for every single cross-command topic, it
seems to me that I18N/L10N is sufficiently important to users of pg that
it merits this treatment.

FWIW, I think error handling/behavior also merits its own sub-heads
throughout. And there should be links within the pages to sub-heads (cf.
DB2's online doc).

Of course, all of this would be a substantial project.

Note that for the COPY command the I18N/L10N material covers both
DateStyle and encoding.

In respect of Bruce's proposed changes, I prefer the original wording
(for the same reasons as Tom), but with the addition of the mention of
the server - "... read from or written to a file directly by the
server".

--
Peter Headland
Architect
Actuate Corporation

-----Original Message-----
From: Bruce Momjian [mailto:bruce@momjian.us]
Sent: Monday, February 22, 2010 22:01
To: Tom Lane
Cc: Peter Headland; Adrian Klaver; pgsql-general@postgresql.org
Subject: Re: [GENERAL] COPY command character set

Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > I have updated the documentation to be more direct about COPY
encoding
> > behavior.  Patch attached and applied.
>
> Uh, why exactly do you find that better?  "Processes data" seems a lot
> vaguer to me than the previous wording.  I certainly don't think that
> this does much to address Peter's original complaint.

I thought the problem was that we said "input", then "output" and then
got to the point about the server, and I thought the reader just stopped
reading that far, so I tried to shorten it so the idea was sooner, and I
mentioned "server" at the end.  It might not be better, but I tried.

We don't want to highlight the input/output, we want to highlight that
all input and output are controlled by the client encoding.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +

Re: COPY command character set

От
Bruce Momjian
Дата:
Peter Headland wrote:
> In respect of Bruce's proposed changes, I prefer the original wording
> (for the same reasons as Tom), but with the addition of the mention of
> the server - "... read from or written to a file directly by the
> server".

OK, done with the attached patch.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +
Index: doc/src/sgml/ref/copy.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/ref/copy.sgml,v
retrieving revision 1.94
diff -c -c -r1.94 copy.sgml
*** doc/src/sgml/ref/copy.sgml    23 Feb 2010 05:17:33 -0000    1.94
--- doc/src/sgml/ref/copy.sgml    23 Feb 2010 21:38:07 -0000
***************
*** 1,5 ****
  <!--
! $PostgreSQL: pgsql/doc/src/sgml/ref/copy.sgml,v 1.94 2010/02/23 05:17:33 momjian Exp $
  PostgreSQL documentation
  -->

--- 1,5 ----
  <!--
! $PostgreSQL: pgsql/doc/src/sgml/ref/copy.sgml,v 1.93 2010/02/17 04:19:39 tgl Exp $
  PostgreSQL documentation
  -->

***************
*** 367,376 ****
     </para>

     <para>
!     <command>COPY</command> always processes data according to the
!     current client encoding, even if the data does not pass through
!     the client but is read from or written to a file directly by the
!     server.
     </para>

     <para>
--- 367,376 ----
     </para>

     <para>
!     Input data is interpreted according to the current client encoding,
!     and output data is encoded in the the current client encoding, even
!     if the data does not pass through the client but is read from or
!     written to a file directly by the server.
     </para>

     <para>