Обсуждение: Re: [HACKERS] multiline CSV fields

Поиск
Список
Период
Сортировка

Re: [HACKERS] multiline CSV fields

От
Andrew Dunstan
Дата:

I wrote:

>
> If it bothers you that much. I'd make a flag, cleared at the start of
> each COPY, and then where we test for CR or LF in CopyAttributeOutCSV,
> if the flag is not set then set it and issue the warning.



I didn't realise until Bruce told me just now that I was on the hook for
this. I guess i should keep my big mouth shut. (Yeah, that's gonna
happen ...)

Anyway, here's a tiny patch that does what I had in mind.

cheers

andrew
Index: copy.c
===================================================================
RCS file: /home/cvsmirror/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.234
diff -c -r1.234 copy.c
*** copy.c    6 Nov 2004 17:46:27 -0000    1.234
--- copy.c    2 Dec 2004 23:34:20 -0000
***************
*** 98,103 ****
--- 98,104 ----
  static EolType eol_type;        /* EOL type of input */
  static int    client_encoding;    /* remote side's character encoding */
  static int    server_encoding;    /* local encoding */
+ static bool embedded_line_warning;

  /* these are just for error messages, see copy_in_error_callback */
  static bool copy_binary;        /* is it a binary copy? */
***************
*** 1190,1195 ****
--- 1191,1197 ----
      attr = tupDesc->attrs;
      num_phys_attrs = tupDesc->natts;
      attr_count = list_length(attnumlist);
+     embedded_line_warning = false;

      /*
       * Get info about the columns we need to process.
***************
*** 2627,2632 ****
--- 2629,2653 ----
           !use_quote && (c = *test_string) != '\0';
           test_string += mblen)
      {
+         /*
+          * We don't know here what the surrounding line end characters
+          * might be. It might not even be under postgres' control. So
+          * we simple warn on ANY embedded line ending character.
+          *
+          * This warning will disappear when we make line parsing field-aware,
+          * so that we can reliably read in embedded line ending characters
+          * regardless of the file's line-end context.
+          *
+          */
+
+         if (!embedded_line_warning  && (c == '\n' || c == '\r') )
+         {
+             embedded_line_warning = true;
+             elog(WARNING,
+                  "CSV fields with embedded linefeed or carriage return "
+                  "characters might not be able to be reimported");
+         }
+
          if (c == delimc || c == quotec || c == '\n' || c == '\r')
              use_quote = true;
          if (!same_encoding)

Re: [HACKERS] multiline CSV fields

От
Tom Lane
Дата:
Andrew Dunstan <andrew@dunslane.net> writes:
> +         if (!embedded_line_warning  && (c == '\n' || c == '\r') )
> +         {
> +             embedded_line_warning = true;
> +             elog(WARNING,
> +                  "CSV fields with embedded linefeed or carriage return "
> +                  "characters might not be able to be reimported");
> +         }

What about forcibly translating them to the two-character sequences \n
or \r?  Or is that not considered a CSV-compatible representation?

            regards, tom lane

Re: [HACKERS] multiline CSV fields

От
Andrew Dunstan
Дата:

Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>
>
>>+         if (!embedded_line_warning  && (c == '\n' || c == '\r') )
>>+         {
>>+             embedded_line_warning = true;
>>+             elog(WARNING,
>>+                  "CSV fields with embedded linefeed or carriage return "
>>+                  "characters might not be able to be reimported");
>>+         }
>>
>>
>
>What about forcibly translating them to the two-character sequences \n
>or \r?  Or is that not considered a CSV-compatible representation?
>
>
>
>

Not compatible AFAIK. Certainly not portably. And the warning would
still be true, because we don't do this unescaping on the way back in. I
think the way the comment in the patch suggests and previous emails have
discussed is the right way to go with this - I will attend to it after
we branch ;-)

cheers

andrew

Re: [HACKERS] multiline CSV fields

От
Bruce Momjian
Дата:
Patch applied.  Thanks.

---------------------------------------------------------------------------


Andrew Dunstan wrote:
>
>
> I wrote:
>
> >
> > If it bothers you that much. I'd make a flag, cleared at the start of
> > each COPY, and then where we test for CR or LF in CopyAttributeOutCSV,
> > if the flag is not set then set it and issue the warning.
>
>
>
> I didn't realise until Bruce told me just now that I was on the hook for
> this. I guess i should keep my big mouth shut. (Yeah, that's gonna
> happen ...)
>
> Anyway, here's a tiny patch that does what I had in mind.
>
> cheers
>
> andrew

[ text/x-patch is unsupported, treating like TEXT/PLAIN ]

> Index: copy.c
> ===================================================================
> RCS file: /home/cvsmirror/pgsql/src/backend/commands/copy.c,v
> retrieving revision 1.234
> diff -c -r1.234 copy.c
> *** copy.c    6 Nov 2004 17:46:27 -0000    1.234
> --- copy.c    2 Dec 2004 23:34:20 -0000
> ***************
> *** 98,103 ****
> --- 98,104 ----
>   static EolType eol_type;        /* EOL type of input */
>   static int    client_encoding;    /* remote side's character encoding */
>   static int    server_encoding;    /* local encoding */
> + static bool embedded_line_warning;
>
>   /* these are just for error messages, see copy_in_error_callback */
>   static bool copy_binary;        /* is it a binary copy? */
> ***************
> *** 1190,1195 ****
> --- 1191,1197 ----
>       attr = tupDesc->attrs;
>       num_phys_attrs = tupDesc->natts;
>       attr_count = list_length(attnumlist);
> +     embedded_line_warning = false;
>
>       /*
>        * Get info about the columns we need to process.
> ***************
> *** 2627,2632 ****
> --- 2629,2653 ----
>            !use_quote && (c = *test_string) != '\0';
>            test_string += mblen)
>       {
> +         /*
> +          * We don't know here what the surrounding line end characters
> +          * might be. It might not even be under postgres' control. So
> +          * we simple warn on ANY embedded line ending character.
> +          *
> +          * This warning will disappear when we make line parsing field-aware,
> +          * so that we can reliably read in embedded line ending characters
> +          * regardless of the file's line-end context.
> +          *
> +          */
> +
> +         if (!embedded_line_warning  && (c == '\n' || c == '\r') )
> +         {
> +             embedded_line_warning = true;
> +             elog(WARNING,
> +                  "CSV fields with embedded linefeed or carriage return "
> +                  "characters might not be able to be reimported");
> +         }
> +
>           if (c == delimc || c == quotec || c == '\n' || c == '\r')
>               use_quote = true;
>           if (!same_encoding)

>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073