Обсуждение: New COPY options: DELIMITER NONE and QUOTE NONE

Поиск
Список
Период
Сортировка

New COPY options: DELIMITER NONE and QUOTE NONE

От
"Joel Jacobson"
Дата:
The thread "Should CSV parsing be stricter about mid-field quotes?" [1] forked
into a new topic, with two new ideas, hence this new thread.

1. COPY ... QUOTE NONE

In the [1] thread, Andrew Dunstan suggested a trick on how to deal with
unquoted but delimited files, such as TSV-files produced by Google Sheets:

> You can use CSV mode pretty reliably for TSV files.
> The trick is to use a quoting char that shouldn't appear,
> such as E'\x01' as well as setting the delimiter to E'\t'.
> Yes, it's far from obvious.

Would it be an improvement to allow specifying `QUOTE NONE` instead?

quotes.tsv:
id quote
1 "E = mc^2" -- Albert Einstein

COPY quotes FROM '/tmp/quotes.tsv' WITH CSV HEADER DELIMITER E'\t' QUOTE NONE;

SELECT * FROM quotes;
id |             quote
----+-------------------------------
  1 | "E = mc^2" -- Albert Einstein
(1 row)

2. COPY ... DELIMITER NONE

This is meant to improve the use-case when wanting to import e.g. an
unstructured log file line-by-line into a single column.

The current trick I've been using is similar to the first one,
that is, to specify a non-existing delimiter. But that involves having to find
some non-existing byte, which is error-prone since future log files might
suddenly start to contain it. So I think it would be better to be to be explicit
about not wanting to delimit fields at all, treating the entire whole line as a column.

Example:

% cat /tmp/apache.log
192.168.1.1 - - [19/May/2023:09:54:17 -0700] "GET /index.html HTTP/1.1" 200 431 "http://www.example.com/home.html" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
192.168.1.2 - - [19/May/2023:09:55:12 -0700] "POST /form.php HTTP/1.1" 200 512 "http://www.example.com/form.html" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"

CREATE TABLE unstructured_log (whole_line text);
COPY unstructured_log FROM '/tmp/apache.log' WITH CSV DELIMITER NONE QUOTE NONE;
SELECT * FROM unstructured_log;
                                                                                                               whole_line
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
192.168.1.1 - - [19/May/2023:09:54:17 -0700] "GET /index.html HTTP/1.1" 200 431 "http://www.example.com/home.html" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
192.168.1.2 - - [19/May/2023:09:55:12 -0700] "POST /form.php HTTP/1.1" 200 512 "http://www.example.com/form.html" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
(2 rows)

I hacked together a broken patch just to demonstrate the idea on syntax
and basic idea. The `COPY ... FROM` examples above works.
But it doesn't work at all for `COPY ... TO`, since it output \0 byte as
delimiter and quote in the output, which is of course not what we want.

Just wanted some feedback to see if there is any broader interest in this,
before proceeding and looking into how to implement it properly.

Is this something we want or are there just a few of us who have needed this in the past?

/Joel

Вложения

Re: New COPY options: DELIMITER NONE and QUOTE NONE

От
Andrew Dunstan
Дата:


On 2023-05-19 Fr 05:24, Joel Jacobson wrote:

I hacked together a broken patch just to demonstrate the idea on syntax
and basic idea. The `COPY ... FROM` examples above works.
But it doesn't work at all for `COPY ... TO`, since it output \0 byte as
delimiter and quote in the output, which is of course not what we want.



I think you've been a bit too cute with the grammar changes, but as you say this is a POC.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: New COPY options: DELIMITER NONE and QUOTE NONE

От
"Joel Jacobson"
Дата:
On Fri, May 19, 2023, at 19:03, Andrew Dunstan wrote:
> I think you've been a bit too cute with the grammar changes, but as you say this is a POC.

Thanks for feedback.

The approach I took for the new grammar rules was inspired by previous commits,
such as de7531a971b, which introduced support for 'FORCE QUOTE '*''. In that
case, a new separate grammar rule was crafted.

Not sure what you mean with it being "too cute", but maybe you think it's a bit
verbose with another grammar rule and it would be better to integrate it into
the existing one?

Example:

| DELIMITER opt_as (Sconst | NONE)
        {
                if ($3 == NONE)
                        $$ = makeDefElem("delimiter", (Node *) makeString("\0"), @1);
                else
                        $$ = makeDefElem("delimiter", (Node *) makeString($3), @1);
        }

/Joel

Re: New COPY options: DELIMITER NONE and QUOTE NONE

От
Andrew Dunstan
Дата:


On 2023-05-20 Sa 02:59, Joel Jacobson wrote:
On Fri, May 19, 2023, at 19:03, Andrew Dunstan wrote:
> I think you've been a bit too cute with the grammar changes, but as you say this is a POC.

Thanks for feedback.

The approach I took for the new grammar rules was inspired by previous commits,
such as de7531a971b, which introduced support for 'FORCE QUOTE '*''. In that
case, a new separate grammar rule was crafted.

Not sure what you mean with it being "too cute", but maybe you think it's a bit
verbose with another grammar rule and it would be better to integrate it into
the existing one?

Example:

| DELIMITER opt_as (Sconst | NONE)
        {
                if ($3 == NONE)
                        $$ = makeDefElem("delimiter", (Node *) makeString("\0"), @1);
                else
                        $$ = makeDefElem("delimiter", (Node *) makeString($3), @1);
        }




I would probably go for something like this for "DELIMITER NONE" in a separate rule:

 | DELIMITER NONE
    {

       $$ = makeDefElem("delimiter_none", (Node *)makeInteger(true), @1);

    }

and deal with that element further down the stack. It looks to me at first glance that your changes would allow "DELIMITER ''" which is probably not what we want.

Similarly for "QUOTE NONE".


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com