Thread: Re: Emitting JSON to file using COPY TO


Re: Emitting JSON to file using COPY TO

From: Joe Conway
Date:
On 11/29/23 10:32, Davin Shearer wrote:
> Thanks for the responses everyone.
> 
> I worked around the issue using the `psql -tc` method as Filip described.
> 
> I think it would be great to support writing JSON using COPY TO at 
> some point so I can emit JSON to files using a PostgreSQL function directly.
> 
> -Davin
> 
> On Tue, Nov 28, 2023 at 2:36 AM Filip Sedlák <filip@sedlakovi.org 
> <mailto:filip@sedlakovi.org>> wrote:
> 
>     This would be a very special case for COPY. It applies only to a single
>     column of JSON values. The original problem can be solved with psql
>     --tuples-only as David wrote earlier.
> 
> 
>     $ psql -tc 'select json_agg(row_to_json(t))
>                    from (select * from public.tbl_json_test) t;'
> 
>        [{"id":1,"t_test":"here's a \"string\""}]
> 
> 
>     Special-casing any encoding/escaping scheme leads to bugs and harder
>     parsing.

(moved to hackers)

I did a quick PoC patch (attached) -- if there is interest and no hard 
objections I would like to get it up to speed for the January commitfest.

Currently the patch lacks documentation and regression test support.

Questions:
----------
1. Is supporting JSON array format sufficient, or does it need to 
support some other options? How flexible does the support scheme need to be?

2. This only supports COPY TO and we would undoubtedly want to support 
COPY FROM for JSON as well, but is that required from the start?

Thanks for any feedback.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Emitting JSON to file using COPY TO

From: Nathan Bossart
Date:
On Fri, Dec 01, 2023 at 02:28:55PM -0500, Joe Conway wrote:
> I did a quick PoC patch (attached) -- if there is interest and no hard
> objections I would like to get it up to speed for the January commitfest.

Cool.  I would expect there to be interest, given all the other JSON
support that has been added thus far.

I noticed that, with the PoC patch, "json" is the only format that must be
quoted.  Without quotes, I see a syntax error.  I'm assuming there's a
conflict with another json-related rule somewhere in gram.y, but I haven't
tracked down exactly which one is causing it.

> 1. Is supporting JSON array format sufficient, or does it need to support
> some other options? How flexible does the support scheme need to be?

I don't presently have a strong opinion on this one.  My instinct would be
to start with something simple, though.  I don't think we offer any special
options for log_destination...

> 2. This only supports COPY TO and we would undoubtedly want to support COPY
> FROM for JSON as well, but is that required from the start?

I would vote for including COPY FROM support from the start.

> !     if (!cstate->opts.json_mode)

I think it's unfortunate that this further complicates the branching in
CopyOneRowTo(), but after some quick glances at the code, I'm not sure it's
worth refactoring a bunch of stuff to make this nicer.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Emitting JSON to file using COPY TO

From: Davin Shearer
Date:
I'm really glad to see this taken up as a possible new feature and will definitely use it if it gets released.  I'm impressed with how clean, understandable, and approachable the postgres codebase is in general and how easy it is to read and understand this patch.

I reviewed the patch (though I didn't build and test the code) and have a concern with adding the '[' at the beginning and ']' at the end of the json output.  Those are already added by `json_agg` (https://www.postgresql.org/docs/current/functions-aggregate.html) as you can see in my initial email.  Adding them in the COPY TO may be redundant (e.g., [[{"key":"value"...}....]]).

I think COPY TO makes good sense to support, though COPY FROM maybe not so much, as JSON isn't necessarily flat and rectangular like CSV.

For my use-case, I'm emitting JSON files to Apache NiFi for processing, and NiFi has superior handling of JSON (via JOLT parsers) versus CSV where parsing is generally done with regex.  I want to be able to emit JSON using a postgres function and thus COPY TO.

Definitely +1 for COPY TO.

I don't think COPY FROM will work out well unless the JSON is required to be flat and rectangular.  I would vote -1 to leave it out due to the necessary restrictions making it not generally useful.

Hope it helps,
Davin

On Fri, Dec 1, 2023 at 6:10 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Fri, Dec 01, 2023 at 02:28:55PM -0500, Joe Conway wrote:
> I did a quick PoC patch (attached) -- if there is interest and no hard
> objections I would like to get it up to speed for the January commitfest.

Cool.  I would expect there to be interest, given all the other JSON
support that has been added thus far.

I noticed that, with the PoC patch, "json" is the only format that must be
quoted.  Without quotes, I see a syntax error.  I'm assuming there's a
conflict with another json-related rule somewhere in gram.y, but I haven't
tracked down exactly which one is causing it.

> 1. Is supporting JSON array format sufficient, or does it need to support
> some other options? How flexible does the support scheme need to be?

I don't presently have a strong opinion on this one.  My instinct would be
to start with something simple, though.  I don't think we offer any special
options for log_destination...

> 2. This only supports COPY TO and we would undoubtedly want to support COPY
> FROM for JSON as well, but is that required from the start?

I would vote for including COPY FROM support from the start.

> !     if (!cstate->opts.json_mode)

I think it's unfortunate that this further complicates the branching in
CopyOneRowTo(), but after some quick glances at the code, I'm not sure it's
worth refactoring a bunch of stuff to make this nicer.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Emitting JSON to file using COPY TO

From: Joe Conway
Date:
On 12/1/23 22:00, Davin Shearer wrote:
> I'm really glad to see this taken up as a possible new feature and will 
> definitely use it if it gets released.  I'm impressed with how clean, 
> understandable, and approachable the postgres codebase is in general and 
> how easy it is to read and understand this patch.
> 
> I reviewed the patch (though I didn't build and test the code) and have 
> a concern with adding the '[' at the beginning and ']' at the end of the 
> json output.  Those are already added by `json_agg` 
> (https://www.postgresql.org/docs/current/functions-aggregate.html 
> <https://www.postgresql.org/docs/current/functions-aggregate.html>) as 
> you can see in my initial email.  Adding them in the COPY TO may be 
> redundant (e.g., [[{"key":"value"...}....]]).

With this patch in place you don't use json_agg() at all. See the 
example output (this is real output with the patch applied):

(oops -- I meant to send this with the same email as the patch)

8<-------------------------------------------------
create table foo(id int8, f1 text, f2 timestamptz);
insert into foo
   select g.i,
          'line: ' || g.i::text,
          clock_timestamp()
   from generate_series(1,4) as g(i);

copy foo to stdout (format 'json');
[
  {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
]
8<-------------------------------------------------


> I think COPY TO makes good sense to support, though COPY FROM maybe not 
> so much as JSON isn't necessarily flat and rectangular like CSV.

Yeah -- definitely not as straightforward, but possibly we just support 
the array-of-jsonobj-rows as input as well?

> For my use-case, I'm emitting JSON files to Apache NiFi for processing, 
> and NiFi has superior handling of JSON (via JOLT parsers) versus CSV 
> where parsing is generally done with regex.  I want to be able to emit 
> JSON using a postgres function and thus COPY TO.
> 
> Definitely +1 for COPY TO.
> 
> I don't think COPY FROM will work out well unless the JSON is required 
> to be flat and rectangular.  I would vote -1 to leave it out due to the 
> necessary restrictions making it not generally useful.


-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From: Joe Conway
Date:
On 12/1/23 18:09, Nathan Bossart wrote:
> On Fri, Dec 01, 2023 at 02:28:55PM -0500, Joe Conway wrote:
>> I did a quick PoC patch (attached) -- if there is interest and no hard
>> objections I would like to get it up to speed for the January commitfest.
> 
> Cool.  I would expect there to be interest, given all the other JSON
> support that has been added thus far.

Thanks for the review

> I noticed that, with the PoC patch, "json" is the only format that must be
> quoted.  Without quotes, I see a syntax error.  I'm assuming there's a
> conflict with another json-related rule somewhere in gram.y, but I haven't
> tracked down exactly which one is causing it.

It seems to be because 'json' is also a type name ($$ = 
SystemTypeName("json")).

What do you think about using 'json_array' instead? It is more specific 
and accurate, and avoids the need to quote.

test=# copy foo to stdout (format json_array);
[
  {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
]

>> 1. Is supporting JSON array format sufficient, or does it need to support
>> some other options? How flexible does the support scheme need to be?
> 
> I don't presently have a strong opinion on this one.  My instinct would be
> to start with something simple, though.  I don't think we offer any special
> options for log_destination...

WFM

>> 2. This only supports COPY TO and we would undoubtedly want to support COPY
>> FROM for JSON as well, but is that required from the start?
> 
> I would vote for including COPY FROM support from the start.

Check. My thought is to only accept the same format we emit -- i.e. only 
take a json array.
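
(For illustration only -- COPY FROM is not implemented in the patch at this 
point, so the round trip below is hypothetical, assuming the "accept what we 
emit" rule:)

8<-------------------------------------------------
-- hypothetical: export the json array, then load it back
copy foo to '/tmp/foo.json' (format 'json');
copy foo from '/tmp/foo.json' (format 'json');  -- would accept the same array
8<-------------------------------------------------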

>> !     if (!cstate->opts.json_mode)
> 
> I think it's unfortunate that this further complicates the branching in
> CopyOneRowTo(), but after some quick glances at the code, I'm not sure it's
> worth refactoring a bunch of stuff to make this nicer.

Yeah that was my conclusion.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From: Tom Lane
Date:
Joe Conway <mail@joeconway.com> writes:
>> I noticed that, with the PoC patch, "json" is the only format that must be
>> quoted.  Without quotes, I see a syntax error.  I'm assuming there's a
>> conflict with another json-related rule somewhere in gram.y, but I haven't
>> tracked down exactly which one is causing it.

While I've not looked too closely, I suspect this might be due to the
FORMAT_LA hack in base_yylex:

            /* Replace FORMAT by FORMAT_LA if it's followed by JSON */
            switch (next_token)
            {
                case JSON:
                    cur_token = FORMAT_LA;
                    break;
            }

So if you are writing a production that might need to match
FORMAT followed by JSON, you need to match FORMAT_LA too.

(I spent a little bit of time last week trying to get rid of
FORMAT_LA, thinking that it didn't look necessary.  Did not
succeed yet.)

            regards, tom lane



Re: Emitting JSON to file using COPY TO

From: Maciek Sakrejda
Date:
On Fri, Dec 1, 2023 at 11:32 AM Joe Conway <mail@joeconway.com> wrote:
> 1. Is supporting JSON array format sufficient, or does it need to
> support some other options? How flexible does the support scheme need to be?

"JSON Lines" is a semi-standard format [1] that's basically just
newline-separated JSON values. (In fact, this is what
log_destination=jsonlog gives you for Postgres logs, no?) It might be
worthwhile to support that, too.

[1]: https://jsonlines.org/
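
(For reference, JSON Lines output for the example table upthread is just one 
object per line, with no enclosing brackets or comma separators:)

8<-------------------------------------------------
{"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
8<-------------------------------------------------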



Re: Emitting JSON to file using COPY TO

From: Nathan Bossart
Date:
On Sat, Dec 02, 2023 at 10:11:20AM -0500, Tom Lane wrote:
> So if you are writing a production that might need to match
> FORMAT followed by JSON, you need to match FORMAT_LA too.

Thanks for the pointer.  That does seem to be the culprit.

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d631ac89a9..048494dd07 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -3490,6 +3490,10 @@ copy_generic_opt_elem:
                 {
                     $$ = makeDefElem($1, $2, @1);
                 }
+            | FORMAT_LA copy_generic_opt_arg
+                {
+                    $$ = makeDefElem("format", $2, @1);
+                }
         ;
 
 copy_generic_opt_arg:

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Emitting JSON to file using COPY TO

From: Joe Conway
Date:
On 12/2/23 16:53, Nathan Bossart wrote:
> On Sat, Dec 02, 2023 at 10:11:20AM -0500, Tom Lane wrote:
>> So if you are writing a production that might need to match
>> FORMAT followed by JSON, you need to match FORMAT_LA too.
> 
> Thanks for the pointer.  That does seem to be the culprit.
> 
> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
> index d631ac89a9..048494dd07 100644
> --- a/src/backend/parser/gram.y
> +++ b/src/backend/parser/gram.y
> @@ -3490,6 +3490,10 @@ copy_generic_opt_elem:
>                   {
>                       $$ = makeDefElem($1, $2, @1);
>                   }
> +            | FORMAT_LA copy_generic_opt_arg
> +                {
> +                    $$ = makeDefElem("format", $2, @1);
> +                }
>           ;
>   
>   copy_generic_opt_arg:


Yep -- I concluded the same. Thanks Tom!

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From: Joe Conway
Date:
On 12/2/23 13:50, Maciek Sakrejda wrote:
> On Fri, Dec 1, 2023 at 11:32 AM Joe Conway <mail@joeconway.com> wrote:
>> 1. Is supporting JSON array format sufficient, or does it need to
>> support some other options? How flexible does the support scheme need to be?
> 
> "JSON Lines" is a semi-standard format [1] that's basically just
> newline-separated JSON values. (In fact, this is what
> log_destination=jsonlog gives you for Postgres logs, no?) It might be
> worthwhile to support that, too.
> 
> [1]: https://jsonlines.org/


Yes, I have seen examples of that associated with other databases (MSSQL 
and DuckDB at least) as well. It probably makes sense to support that 
format too.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From: Andrew Dunstan
Date:
On 2023-12-02 Sa 17:43, Joe Conway wrote:
> On 12/2/23 13:50, Maciek Sakrejda wrote:
>> On Fri, Dec 1, 2023 at 11:32 AM Joe Conway <mail@joeconway.com> wrote:
>>> 1. Is supporting JSON array format sufficient, or does it need to
>>> support some other options? How flexible does the support scheme 
>>> need to be?
>>
>> "JSON Lines" is a semi-standard format [1] that's basically just
>> newline-separated JSON values. (In fact, this is what
>> log_destination=jsonlog gives you for Postgres logs, no?) It might be
>> worthwhile to support that, too.
>>
>> [1]: https://jsonlines.org/
>
>
> Yes, I have seen examples of that associated with other databases 
> (MSSQL and Duckdb at least) as well. It probably makes sense to 
> support that format too.


You can do that today, e.g.


copy (select to_json(q) from table_or_query q) to stdout


You can also do it as a single document as proposed here, like this:


copy (select json_agg(q) from table_or_query q) to stdout


The only downside to that is that it has to construct the aggregate, 
which could be ugly for large datasets, and that's why I'm not opposed 
to this patch.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Emitting JSON to file using COPY TO

From: Joe Conway
Date:
On 12/2/23 17:37, Joe Conway wrote:
> On 12/2/23 16:53, Nathan Bossart wrote:
>> On Sat, Dec 02, 2023 at 10:11:20AM -0500, Tom Lane wrote:
>>> So if you are writing a production that might need to match
>>> FORMAT followed by JSON, you need to match FORMAT_LA too.
>> 
>> Thanks for the pointer.  That does seem to be the culprit.
>> 
>> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
>> index d631ac89a9..048494dd07 100644
>> --- a/src/backend/parser/gram.y
>> +++ b/src/backend/parser/gram.y
>> @@ -3490,6 +3490,10 @@ copy_generic_opt_elem:
>>                   {
>>                       $$ = makeDefElem($1, $2, @1);
>>                   }
>> +            | FORMAT_LA copy_generic_opt_arg
>> +                {
>> +                    $$ = makeDefElem("format", $2, @1);
>> +                }
>>           ;
>>   
>>   copy_generic_opt_arg:
> 
> 
> Yep -- I concluded the same. Thanks Tom!

The attached implements the above repair, as well as adding support for 
array decoration (or not) and/or comma row delimiters when not an array.

This covers the three variations of json import/export formats that I 
have found after light searching (SQL Server and DuckDB).

Still lacks documentation, tests, and COPY FROM support, but here is 
what it looks like in a nutshell:

8<-----------------------------------------------
create table foo(id int8, f1 text, f2 timestamptz);
insert into foo
   select g.i,
          'line: ' || g.i::text,
          clock_timestamp()
   from generate_series(1,4) as g(i);

copy foo to stdout (format json);
{"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}

copy foo to stdout (format json, force_array);
[
  {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
]

copy foo to stdout (format json, force_row_delimiter);
  {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}

copy foo to stdout (force_array);
ERROR:  COPY FORCE_ARRAY requires JSON mode

copy foo to stdout (force_row_delimiter);
ERROR:  COPY FORCE_ROW_DELIMITER requires JSON mode
8<-----------------------------------------------


-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Emitting JSON to file using COPY TO

From: Andrew Dunstan
Date:
On 2023-12-01 Fr 14:28, Joe Conway wrote:
> On 11/29/23 10:32, Davin Shearer wrote:
>> Thanks for the responses everyone.
>>
>> I worked around the issue using the `psql -tc` method as Filip 
>> described.
>>
>> I think it would be great to support writing JSON using COPY TO at 
>> some point so I can emit JSON to files using a PostgreSQL function 
>> directly.
>>
>> -Davin
>>
>> On Tue, Nov 28, 2023 at 2:36 AM Filip Sedlák <filip@sedlakovi.org 
>> <mailto:filip@sedlakovi.org>> wrote:
>>
>>     This would be a very special case for COPY. It applies only to a 
>> single
>>     column of JSON values. The original problem can be solved with psql
>>     --tuples-only as David wrote earlier.
>>
>>
>>     $ psql -tc 'select json_agg(row_to_json(t))
>>                    from (select * from public.tbl_json_test) t;'
>>
>>        [{"id":1,"t_test":"here's a \"string\""}]
>>
>>
>>     Special-casing any encoding/escaping scheme leads to bugs and harder
>>     parsing.
>
> (moved to hackers)
>
> I did a quick PoC patch (attached) -- if there is interest and no hard 
> objections I would like to get it up to speed for the January commitfest.
>
> Currently the patch lacks documentation and regression test support.
>
> Questions:
> ----------
> 1. Is supporting JSON array format sufficient, or does it need to 
> support some other options? How flexible does the support scheme need 
> to be?
>
> 2. This only supports COPY TO and we would undoubtedly want to support 
> COPY FROM for JSON as well, but is that required from the start?
>
> Thanks for any feedback.


I  realize this is just a POC, but I'd prefer to see composite_to_json() 
not exposed. You could use the already public datum_to_json() instead, 
passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third 
arguments.

I think JSON array format is sufficient.

I can see both sides of the COPY FROM argument, but I think insisting on 
that makes this less doable for release 17. On balance I would stick to 
COPY TO for now.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Emitting JSON to file using COPY TO

From: Davin Shearer
Date:
Please be sure to include single and double quotes in the test values since that was the original problem (double quoting in COPY TO breaking the JSON syntax).

On Sun, Dec 3, 2023, 10:11 Andrew Dunstan <andrew@dunslane.net> wrote:

On 2023-12-01 Fr 14:28, Joe Conway wrote:
> On 11/29/23 10:32, Davin Shearer wrote:
>> Thanks for the responses everyone.
>>
>> I worked around the issue using the `psql -tc` method as Filip
>> described.
>>
>> I think it would be great to support writing JSON using COPY TO at
>> some point so I can emit JSON to files using a PostgreSQL function
>> directly.
>>
>> -Davin
>>
>> On Tue, Nov 28, 2023 at 2:36 AM Filip Sedlák <filip@sedlakovi.org
>> <mailto:filip@sedlakovi.org>> wrote:
>>
>>     This would be a very special case for COPY. It applies only to a
>> single
>>     column of JSON values. The original problem can be solved with psql
>>     --tuples-only as David wrote earlier.
>>
>>
>>     $ psql -tc 'select json_agg(row_to_json(t))
>>                    from (select * from public.tbl_json_test) t;'
>>
>>        [{"id":1,"t_test":"here's a \"string\""}]
>>
>>
>>     Special-casing any encoding/escaping scheme leads to bugs and harder
>>     parsing.
>
> (moved to hackers)
>
> I did a quick PoC patch (attached) -- if there is interest and no hard
> objections I would like to get it up to speed for the January commitfest.
>
> Currently the patch lacks documentation and regression test support.
>
> Questions:
> ----------
> 1. Is supporting JSON array format sufficient, or does it need to
> support some other options? How flexible does the support scheme need
> to be?
>
> 2. This only supports COPY TO and we would undoubtedly want to support
> COPY FROM for JSON as well, but is that required from the start?
>
> Thanks for any feedback.


I  realize this is just a POC, but I'd prefer to see composite_to_json()
not exposed. You could use the already public datum_to_json() instead,
passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
arguments.

I think JSON array format is sufficient.

I can see both sides of the COPY FROM argument, but I think insisting on
that makes this less doable for release 17. On balance I would stick to
COPY TO for now.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Emitting JSON to file using COPY TO

From: Joe Conway
Date:
On 12/3/23 10:31, Davin Shearer wrote:
> Please be sure to include single and double quotes in the test values 
> since that was the original problem (double quoting in COPY TO breaking 
> the JSON syntax).

test=# copy (select * from foo limit 4) to stdout (format json);
{"id":2456092,"f1":"line with ' in it: 
2456092","f2":"2023-12-03T10:44:40.9712-05:00"}
{"id":2456093,"f1":"line with \\" in it: 
2456093","f2":"2023-12-03T10:44:40.971221-05:00"}
{"id":2456094,"f1":"line with ' in it: 
2456094","f2":"2023-12-03T10:44:40.971225-05:00"}
{"id":2456095,"f1":"line with \\" in it: 
2456095","f2":"2023-12-03T10:44:40.971228-05:00"}

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From: Joe Conway
Date:
On 12/3/23 10:10, Andrew Dunstan wrote:
> 
> On 2023-12-01 Fr 14:28, Joe Conway wrote:
>> On 11/29/23 10:32, Davin Shearer wrote:
>>> Thanks for the responses everyone.
>>>
>>> I worked around the issue using the `psql -tc` method as Filip 
>>> described.
>>>
>>> I think it would be great to support writing JSON using COPY TO at 
>>> some point so I can emit JSON to files using a PostgreSQL function 
>>> directly.
>>>
>>> -Davin
>>>
>>> On Tue, Nov 28, 2023 at 2:36 AM Filip Sedlák <filip@sedlakovi.org 
>>> <mailto:filip@sedlakovi.org>> wrote:
>>>
>>>     This would be a very special case for COPY. It applies only to a 
>>> single
>>>     column of JSON values. The original problem can be solved with psql
>>>     --tuples-only as David wrote earlier.
>>>
>>>
>>>     $ psql -tc 'select json_agg(row_to_json(t))
>>>                    from (select * from public.tbl_json_test) t;'
>>>
>>>        [{"id":1,"t_test":"here's a \"string\""}]
>>>
>>>
>>>     Special-casing any encoding/escaping scheme leads to bugs and harder
>>>     parsing.
>>
>> (moved to hackers)
>>
>> I did a quick PoC patch (attached) -- if there is interest and no hard 
>> objections I would like to get it up to speed for the January commitfest.
>>
>> Currently the patch lacks documentation and regression test support.
>>
>> Questions:
>> ----------
>> 1. Is supporting JSON array format sufficient, or does it need to 
>> support some other options? How flexible does the support scheme need 
>> to be?
>>
>> 2. This only supports COPY TO and we would undoubtedly want to support 
>> COPY FROM for JSON as well, but is that required from the start?
>>
>> Thanks for any feedback.
> 
> I  realize this is just a POC, but I'd prefer to see composite_to_json()
> not exposed. You could use the already public datum_to_json() instead,
> passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
> arguments.

Ok, thanks, will do

> I think JSON array format is sufficient.

The other formats make sense from a completeness standpoint (versus 
other databases) and the latest patch already includes them, so I still 
lean toward supporting all three formats.

> I can see both sides of the COPY FROM argument, but I think insisting on
> that makes this less doable for release 17. On balance I would stick to
> COPY TO for now.

WFM.

From your earlier post, regarding constructing the aggregate -- not 
extensive testing but one data point:
8<--------------------------
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 10000000
Time: 36353.153 ms (00:36.353)
test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
COPY 1
Time: 46835.238 ms (00:46.835)
8<--------------------------


-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From: Joe Conway
Date:
On 12/3/23 11:03, Joe Conway wrote:
>   From your earlier post, regarding constructing the aggregate -- not
> extensive testing but one data point:
> 8<--------------------------
> test=# copy foo to '/tmp/buf' (format json, force_array);
> COPY 10000000
> Time: 36353.153 ms (00:36.353)
> test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
> COPY 1
> Time: 46835.238 ms (00:46.835)
> 8<--------------------------

Also if the table is large enough, the aggregate method is not even 
feasible whereas the COPY TO method works:
8<--------------------------
test=# select count(*) from foo;
   count
----------
  20000000
(1 row)

test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
ERROR:  out of memory
DETAIL:  Cannot enlarge string buffer containing 1073741822 bytes by 1 
more bytes.

test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 20000000
8<--------------------------

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From: Joe Conway
Date:
On 12/3/23 11:03, Joe Conway wrote:
> On 12/3/23 10:10, Andrew Dunstan wrote:
>> I  realize this is just a POC, but I'd prefer to see composite_to_json()
>> not exposed. You could use the already public datum_to_json() instead,
>> passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
>> arguments.
> 
> Ok, thanks, will do

Just FYI, this change does lose some performance in my not massively 
scientific A/B/A test:

8<---------------------------
-- with datum_to_json()
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 10000000
Time: 37196.898 ms (00:37.197)
Time: 37408.161 ms (00:37.408)
Time: 38393.309 ms (00:38.393)
Time: 36855.438 ms (00:36.855)
Time: 37806.280 ms (00:37.806)

Avg = 37532

-- original patch
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 10000000
Time: 37426.207 ms (00:37.426)
Time: 36068.187 ms (00:36.068)
Time: 38285.252 ms (00:38.285)
Time: 36971.042 ms (00:36.971)
Time: 35690.822 ms (00:35.691)

Avg = 36888

-- with datum_to_json()
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 10000000
Time: 39083.467 ms (00:39.083)
Time: 37249.326 ms (00:37.249)
Time: 38529.721 ms (00:38.530)
Time: 38704.920 ms (00:38.705)
Time: 39001.326 ms (00:39.001)

Avg = 38513
8<---------------------------

That is somewhere in the 3% range.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From: Andrew Dunstan
Date:
On 2023-12-03 Su 12:11, Joe Conway wrote:
> On 12/3/23 11:03, Joe Conway wrote:
>>   From your earlier post, regarding constructing the aggregate -- not
>> extensive testing but one data point:
>> 8<--------------------------
>> test=# copy foo to '/tmp/buf' (format json, force_array);
>> COPY 10000000
>> Time: 36353.153 ms (00:36.353)
>> test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
>> COPY 1
>> Time: 46835.238 ms (00:46.835)
>> 8<--------------------------
>
> Also if the table is large enough, the aggregate method is not even 
> feasible whereas the COPY TO method works:
> 8<--------------------------
> test=# select count(*) from foo;
>   count
> ----------
>  20000000
> (1 row)
>
> test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
> ERROR:  out of memory
> DETAIL:  Cannot enlarge string buffer containing 1073741822 bytes by 1 
> more bytes.
>
> test=# copy foo to '/tmp/buf' (format json, force_array);
> COPY 20000000
> 8<--------------------------


None of this is surprising. As I mentioned, limitations with json_agg() 
are why I support the idea of this patch.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Emitting JSON to file using COPY TO

From: Andrew Dunstan
Date:
On 2023-12-03 Su 14:24, Joe Conway wrote:
> On 12/3/23 11:03, Joe Conway wrote:
>> On 12/3/23 10:10, Andrew Dunstan wrote:
>>> I  realize this is just a POC, but I'd prefer to see 
>>> composite_to_json()
>>> not exposed. You could use the already public datum_to_json() instead,
>>> passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
>>> arguments.
>>
>> Ok, thanks, will do
>
> Just FYI, this change does lose some performance in my not massively 
> scientific A/B/A test:
>
> 8<---------------------------
> -- with datum_to_json()
> test=# \timing
> Timing is on.
> test=# copy foo to '/tmp/buf' (format json, force_array);
> COPY 10000000
> Time: 37196.898 ms (00:37.197)
> Time: 37408.161 ms (00:37.408)
> Time: 38393.309 ms (00:38.393)
> Time: 36855.438 ms (00:36.855)
> Time: 37806.280 ms (00:37.806)
>
> Avg = 37532
>
> -- original patch
> test=# \timing
> Timing is on.
> test=# copy foo to '/tmp/buf' (format json, force_array);
> COPY 10000000
> Time: 37426.207 ms (00:37.426)
> Time: 36068.187 ms (00:36.068)
> Time: 38285.252 ms (00:38.285)
> Time: 36971.042 ms (00:36.971)
> Time: 35690.822 ms (00:35.691)
>
> Avg = 36888
>
> -- with datum_to_json()
> test=# \timing
> Timing is on.
> test=# copy foo to '/tmp/buf' (format json, force_array);
> COPY 10000000
> Time: 39083.467 ms (00:39.083)
> Time: 37249.326 ms (00:37.249)
> Time: 38529.721 ms (00:38.530)
> Time: 38704.920 ms (00:38.705)
> Time: 39001.326 ms (00:39.001)
>
> Avg = 38513
> 8<---------------------------
>
> That is somewhere in the 3% range.


I assume it's because datum_to_json() constructs a text value from which 
you then need to extract the cstring, whereas composite_to_json() just 
gives you back the stringinfo. I guess that's a good enough reason to go 
with exposing composite_to_json().


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Emitting JSON to file using COPY TO

From: Joe Conway
Date:
On 12/3/23 14:52, Andrew Dunstan wrote:
> 
> On 2023-12-03 Su 14:24, Joe Conway wrote:
>> On 12/3/23 11:03, Joe Conway wrote:
>>> On 12/3/23 10:10, Andrew Dunstan wrote:
>>>> I  realize this is just a POC, but I'd prefer to see 
>>>> composite_to_json()
>>>> not exposed. You could use the already public datum_to_json() instead,
>>>> passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
>>>> arguments.
>>>
>>> Ok, thanks, will do
>>
>> Just FYI, this change does lose some performance in my not massively 
>> scientific A/B/A test:
>>
>> 8<---------------------------
<snip>
>> 8<---------------------------
>>
>> That is somewhere in the 3% range.
> 
> I assume it's because datum_to_json() constructs a text value from which
> you then need to extract the cstring, whereas composite_to_json() just
> gives you back the stringinfo. I guess that's a good enough reason to go
> with exposing composite_to_json().

Yeah, that was why I went that route in the first place. If you are good 
with it I will go back to that. The code is a bit simpler too.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From: Davin Shearer
Date:
" being quoted as \\" breaks the JSON. It needs to be \".  This has been my whole problem with COPY TO for JSON.

Please validate that the output is in proper format with correct quoting for special characters. I use `jq` on the command line to validate and format the output. 

On Sun, Dec 3, 2023, 10:51 Joe Conway <mail@joeconway.com> wrote:
On 12/3/23 10:31, Davin Shearer wrote:
> Please be sure to include single and double quotes in the test values
> since that was the original problem (double quoting in COPY TO breaking
> the JSON syntax).

test=# copy (select * from foo limit 4) to stdout (format json);
{"id":2456092,"f1":"line with ' in it:
2456092","f2":"2023-12-03T10:44:40.9712-05:00"}
{"id":2456093,"f1":"line with \\" in it:
2456093","f2":"2023-12-03T10:44:40.971221-05:00"}
{"id":2456094,"f1":"line with ' in it:
2456094","f2":"2023-12-03T10:44:40.971225-05:00"}
{"id":2456095,"f1":"line with \\" in it:
2456095","f2":"2023-12-03T10:44:40.971228-05:00"}

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Emitting JSON to file using COPY TO

From: Joe Conway
Date:
(please don't top quote on the Postgres lists)

On 12/3/23 17:38, Davin Shearer wrote:
> " being quoted as \\" breaks the JSON. It needs to be \".  This has been 
> my whole problem with COPY TO for JSON.
> 
> Please validate that the output is in proper format with correct quoting 
> for special characters. I use `jq` on the command line to validate and 
> format the output.

I just hooked existing "row-to-json machinery" up to the "COPY TO" 
statement. If the output is wrong (just for this use case?), that 
would be a missing feature (or possibly a bug?).

Davin -- how did you work around the issue with the way the built in 
functions output JSON?

Andrew -- comments/thoughts?

Joe


> On Sun, Dec 3, 2023, 10:51 Joe Conway <mail@joeconway.com 
> <mailto:mail@joeconway.com>> wrote:
> 
>     On 12/3/23 10:31, Davin Shearer wrote:
>      > Please be sure to include single and double quotes in the test
>     values
>      > since that was the original problem (double quoting in COPY TO
>     breaking
>      > the JSON syntax).
> 
>     test=# copy (select * from foo limit 4) to stdout (format json);
>     {"id":2456092,"f1":"line with ' in it:
>     2456092","f2":"2023-12-03T10:44:40.9712-05:00"}
>     {"id":2456093,"f1":"line with \\" in it:
>     2456093","f2":"2023-12-03T10:44:40.971221-05:00"}
>     {"id":2456094,"f1":"line with ' in it:
>     2456094","f2":"2023-12-03T10:44:40.971225-05:00"}
>     {"id":2456095,"f1":"line with \\" in it:
>     2456095","f2":"2023-12-03T10:44:40.971228-05:00"}
> 
>     -- 
>     Joe Conway
>     PostgreSQL Contributors Team
>     RDS Open Source Databases
>     Amazon Web Services: https://aws.amazon.com <https://aws.amazon.com>
> 

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From: Davin Shearer
Date:
I worked around it by using select json_agg(t)... and redirecting it to file via psql on the command line. COPY TO was working until we ran into broken JSON and discovered the double quoting issue due to some values containing " in them. 

Re: Emitting JSON to file using COPY TO

From: Andrew Dunstan
Date:
On 2023-12-03 Su 20:14, Joe Conway wrote:
> (please don't top quote on the Postgres lists)
>
> On 12/3/23 17:38, Davin Shearer wrote:
>> " being quoted as \\" breaks the JSON. It needs to be \".  This has 
>> been my whole problem with COPY TO for JSON.
>>
>> Please validate that the output is in proper format with correct 
>> quoting for special characters. I use `jq` on the command line to 
>> validate and format the output.
>
> I just hooked existing "row-to-json machinery" up to the "COPY TO" 
> statement. If the output is wrong (just for this use case?), that 
> would be a missing feature (or possibly a bug?).
>
> Davin -- how did you work around the issue with the way the built in 
> functions output JSON?
>
> Andrew -- comments/thoughts?
>
>

I meant to mention this when I was making comments yesterday.

The patch should not be using CopyAttributeOutText - it will try to 
escape characters such as \, which produces the effect complained of 
here, or else we need to change its setup so we have a way to inhibit 
that escaping.


cheers


andrew



--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Emitting JSON to file using COPY TO

From: Joe Conway
Date:
On 12/4/23 07:41, Andrew Dunstan wrote:
> 
> On 2023-12-03 Su 20:14, Joe Conway wrote:
>> (please don't top quote on the Postgres lists)
>>
>> On 12/3/23 17:38, Davin Shearer wrote:
>>> " being quoted as \\" breaks the JSON. It needs to be \".  This has 
>>> been my whole problem with COPY TO for JSON.
>>>
>>> Please validate that the output is in proper format with correct 
>>> quoting for special characters. I use `jq` on the command line to 
>>> validate and format the output.
>>
>> I just hooked existing "row-to-json machinery" up to the "COPY TO" 
>> statement. If the output is wrong (just for this use case?), that 
>> would be a missing feature (or possibly a bug?).
>>
>> Davin -- how did you work around the issue with the way the built in 
>> functions output JSON?
>>
>> Andrew -- comments/thoughts?
> 
> I meant to mention this when I was making comments yesterday.
> 
> The patch should not be using CopyAttributeOutText - it will try to
> escape characters such as \, which produces the effect complained of
> here, or else we need to change its setup so we have a way to inhibit
> that escaping.


Interesting.

I am surprised this has never been raised as a problem with COPY TO before.

Should the JSON output, as produced by composite_to_json(), be sent 
as-is with no escaping at all? If yes, is JSON somehow unique in this 
regard?

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From: Andrew Dunstan
Date:
On 2023-12-04 Mo 08:37, Joe Conway wrote:
> On 12/4/23 07:41, Andrew Dunstan wrote:
>>
>> On 2023-12-03 Su 20:14, Joe Conway wrote:
>>> (please don't top quote on the Postgres lists)
>>>
>>> On 12/3/23 17:38, Davin Shearer wrote:
>>>> " being quoted as \\" breaks the JSON. It needs to be \".  This has 
>>>> been my whole problem with COPY TO for JSON.
>>>>
>>>> Please validate that the output is in proper format with correct 
>>>> quoting for special characters. I use `jq` on the command line to 
>>>> validate and format the output.
>>>
>>> I just hooked existing "row-to-json machinery" up to the "COPY TO" 
>>> statement. If the output is wrong (just for this use case?), 
>>> that would be a missing feature (or possibly a bug?).
>>>
>>> Davin -- how did you work around the issue with the way the built in 
>>> functions output JSON?
>>>
>>> Andrew -- comments/thoughts?
>>
>> I meant to mention this when I was making comments yesterday.
>>
>> The patch should not be using CopyAttributeOutText - it will try to
>> escape characters such as \, which produces the effect complained of
>> here, or else we need to change its setup so we have a way to inhibit
>> that escaping.
>
>
> Interesting.
>
> I am surprised this has never been raised as a problem with COPY TO 
> before.
>
> Should the JSON output, as produced by composite_to_json(), be sent 
> as-is with no escaping at all? If yes, is JSON somehow unique in this 
> regard?


Text mode output is in such a form that it can be read back in using 
text mode input. There's nothing special about JSON in this respect - 
any text field will be escaped too. But output suitable for text mode 
input is not what you're trying to produce here; you're trying to 
produce valid JSON.

So, yes, the result of composite_to_json, which is already suitably 
escaped, should not be further escaped in this case.
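
(A minimal way to see that escaping at work, assuming standard_conforming_strings 
is on -- text-mode COPY doubles the backslash inside an otherwise valid JSON 
value, which is exactly what broke the output here:)

8<-------------------------------
-- the selected value is valid JSON: {"k":"a\"b"}
copy (select '{"k":"a\"b"}'::text) to stdout;
-- text mode escapes the backslash on output, yielding broken JSON:
-- {"k":"a\\"b"}
8<-------------------------------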


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Emitting JSON to file using COPY TO

From: Joe Conway
Date:
On 12/4/23 09:25, Andrew Dunstan wrote:
> 
> On 2023-12-04 Mo 08:37, Joe Conway wrote:
>> On 12/4/23 07:41, Andrew Dunstan wrote:
>>>
>>> On 2023-12-03 Su 20:14, Joe Conway wrote:
>>>> (please don't top quote on the Postgres lists)
>>>>
>>>> On 12/3/23 17:38, Davin Shearer wrote:
>>>>> " being quoted as \\" breaks the JSON. It needs to be \".  This has 
>>>>> been my whole problem with COPY TO for JSON.
>>>>>
>>>>> Please validate that the output is in proper format with correct 
>>>>> quoting for special characters. I use `jq` on the command line to 
>>>>> validate and format the output.
>>>>
>>>> I just hooked existing "row-to-json machinery" up to the "COPY TO" 
>>>> statement. If the output is wrong (just for this use case?), 
>>>> that would be a missing feature (or possibly a bug?).
>>>>
>>>> Davin -- how did you work around the issue with the way the built in 
>>>> functions output JSON?
>>>>
>>>> Andrew -- comments/thoughts?
>>>
>>> I meant to mention this when I was making comments yesterday.
>>>
>>> The patch should not be using CopyAttributeOutText - it will try to
>>> escape characters such as \, which produces the effect complained of
>>> here, or else we need to change its setup so we have a way to inhibit
>>> that escaping.
>>
>>
>> Interesting.
>>
>> I am surprised this has never been raised as a problem with COPY TO 
>> before.
>>
>> Should the JSON output, as produced by composite_to_json(), be sent 
>> as-is with no escaping at all? If yes, is JSON somehow unique in this 
>> regard?
> 
> 
> Text mode output is in such a form that it can be read back in using
> text mode input. There's nothing special about JSON in this respect -
> any text field will be escaped too. But output suitable for text mode
> input is not what you're trying to produce here; you're trying to
> produce valid JSON.
> 
> So, yes, the result of composite_to_json, which is already suitably
> escaped, should not be further escaped in this case.

Gotcha.

This patch version uses CopySendData() instead and includes 
documentation changes. Still lacks regression tests.

Hopefully this looks better. Any other particular strings I ought to 
test with?

8<------------------
test=# copy (select * from foo limit 4) to stdout (format json, 
force_array true);
[
  {"id":1,"f1":"line with \" in it: 
1","f2":"2023-12-03T12:26:41.596053-05:00"}
,{"id":2,"f1":"line with ' in it: 
2","f2":"2023-12-03T12:26:41.596173-05:00"}
,{"id":3,"f1":"line with \" in it: 
3","f2":"2023-12-03T12:26:41.596179-05:00"}
,{"id":4,"f1":"line with ' in it: 
4","f2":"2023-12-03T12:26:41.596182-05:00"}
]
8<------------------

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Emitting JSON to file using COPY TO

From: Davin Shearer
Date:
Looking great!

For testing, in addition to the quotes, include DOS and Unix EOL, \ and /, Byte Order Markers, and multibyte characters like UTF-8.

Essentially anything considered textual is fair game to be a value. 

On Mon, Dec 4, 2023, 10:46 Joe Conway <mail@joeconway.com> wrote:
On 12/4/23 09:25, Andrew Dunstan wrote:
>
> On 2023-12-04 Mo 08:37, Joe Conway wrote:
>> On 12/4/23 07:41, Andrew Dunstan wrote:
>>>
>>> On 2023-12-03 Su 20:14, Joe Conway wrote:
>>>> (please don't top quote on the Postgres lists)
>>>>
>>>> On 12/3/23 17:38, Davin Shearer wrote:
>>>>> " being quoted as \\" breaks the JSON. It needs to be \".  This has
>>>>> been my whole problem with COPY TO for JSON.
>>>>>
>>>>> Please validate that the output is in proper format with correct
>>>>> quoting for special characters. I use `jq` on the command line to
>>>>> validate and format the output.
>>>>
>>>> I just hooked existing "row-to-json machinery" up to the "COPY TO"
>>>> statement. If the output is wrong (just for this use case?),
>>>> that would be a missing feature (or possibly a bug?).
>>>>
>>>> Davin -- how did you work around the issue with the way the built in
>>>> functions output JSON?
>>>>
>>>> Andrew -- comments/thoughts?
>>>
>>> I meant to mention this when I was making comments yesterday.
>>>
>>> The patch should not be using CopyAttributeOutText - it will try to
>>> escape characters such as \, which produces the effect complained of
>>> here, or else we need to change its setup so we have a way to inhibit
>>> that escaping.
>>
>>
>> Interesting.
>>
>> I am surprised this has never been raised as a problem with COPY TO
>> before.
>>
>> Should the JSON output, as produced by composite_to_json(), be sent
>> as-is with no escaping at all? If yes, is JSON somehow unique in this
>> regard?
>
>
> Text mode output is in such a form that it can be read back in using
> text mode input. There's nothing special about JSON in this respect -
> any text field will be escaped too. But output suitable for text mode
> input is not what you're trying to produce here; you're trying to
> produce valid JSON.
>
> So, yes, the result of composite_to_json, which is already suitably
> escaped, should not be further escaped in this case.

Gotcha.

This patch version uses CopySendData() instead and includes
documentation changes. Still lacks regression tests.

Hopefully this looks better. Any other particular strings I ought to
test with?

8<------------------
test=# copy (select * from foo limit 4) to stdout (format json,
force_array true);
[
  {"id":1,"f1":"line with \" in it:
1","f2":"2023-12-03T12:26:41.596053-05:00"}
,{"id":2,"f1":"line with ' in it:
2","f2":"2023-12-03T12:26:41.596173-05:00"}
,{"id":3,"f1":"line with \" in it:
3","f2":"2023-12-03T12:26:41.596179-05:00"}
,{"id":4,"f1":"line with ' in it:
4","f2":"2023-12-03T12:26:41.596182-05:00"}
]
8<------------------

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Emitting JSON to file using COPY TO

From: Andrew Dunstan
Date:


On 2023-12-04 Mo 13:37, Davin Shearer wrote:
> Looking great!
>
> For testing, in addition to the quotes, include DOS and Unix EOL, \ and /, Byte Order Markers, and multibyte characters like UTF-8.
>
> Essentially anything considered textual is fair game to be a value.


Joe already asked you to avoid top-posting on PostgreSQL lists. See <http://idallen.com/topposting.html> for an explanation.

We don't process BOMs elsewhere, and probably should not here either. They are in fact neither required nor recommended for use with UTF8 data, AIUI. See a recent discussion on this list on that topic: <https://www.postgresql.org/message-id/flat/81ca2b25-6b3a-499a-9a09-2dd21253c2cb%40unitrunker.net>


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Emitting JSON to file using COPY TO

From: Davin Shearer
Date:
Sorry about the top posting / top quoting... the link you sent me gives me a 404.  I'm not exactly sure what top quoting / posting means and Googling those terms wasn't helpful for me, but I've removed the quoting that my mail client is automatically "helpfully" adding to my emails.  I mean no offense.

Okay, digging in more...

If the value contains text that has BOMs [footnote 1] in it, it must be preserved (the database doesn't need to interpret them or do anything special with them - just store it and fetch it).  There are however a few characters that need to be escaped (per https://www.w3docs.com/snippets/java/how-should-i-escape-strings-in-json.html) so that the JSON format isn't broken.  They are:

  1. " (double quote)
  2. \ (backslash)
  3. / (forward slash)
  4. \b (backspace)
  5. \f (form feed)
  6. \n (new line)
  7. \r (carriage return)
  8. \t (horizontal tab)

These characters should be represented in the test cases to see how the escaping behaves and to ensure that the escaping is done properly per JSON requirements.  Forward slash comes as a bit of a surprise to me, but `jq` handles it either way:


➜ echo '{"key": "this / is a forward slash"}' | jq .
{
  "key": "this / is a forward slash"
}
➜ echo '{"key": "this \/ is a forward slash"}' | jq .
{
  "key": "this / is a forward slash"
}

Hope it helps, and thank you!

1. I don't disagree that BOMs shouldn't be used for UTF-8, but I'm also processing UTF-16{BE,LE} and UTF-32{BE,LE} (as well as other textual formats that are neither ASCII nor Unicode).  I don't have the luxury of changing the data that is given.

Re: Emitting JSON to file using COPY TO

From: Joe Conway
Date:
On 12/4/23 17:55, Davin Shearer wrote:
> Sorry about the top posting / top quoting... the link you sent me gives 
> me a 404.  I'm not exactly sure what top quoting / posting means and 
> Googling those terms wasn't helpful for me, but I've removed the quoting 
> that my mail client is automatically "helpfully" adding to my emails.  I 
> mean no offense.

No offense taken. But it is worthwhile to conform to the very long 
established norms of the mailing lists on which you participate. See:

   https://en.wikipedia.org/wiki/Posting_style

I would describe the Postgres list style (based on that link) as

    "inline replying, in which the different parts of the reply follow
     the relevant parts of the original post...[with]...trimming of the
     original text"

> There are however a few characters that need to be escaped

>  1. " (double quote)
>  2. \ (backslash)
>  3. / (forward slash)
>  4. \b (backspace)
>  5. \f (form feed)
>  6. \n (new line)
>  7. \r (carriage return)
>  8. \t (horizontal tab)
> 
> These characters should be represented in the test cases to see how the 
> escaping behaves and to ensure that the escaping is done properly per 
> JSON requirements.

I can look at adding these as test cases. The latest version of the 
patch (attached) includes some of that already. For reference, the tests 
so far include this:

8<-------------------------------
test=# select * from copytest;
   style  |   test   | filler
---------+----------+--------
  DOS     | abc\r   +|      1
          | def      |
  Unix    | abc     +|      2
          | def      |
  Mac     | abc\rdef |      3
  esc\ape | a\r\\r\ +|      4
          | \nb      |
(4 rows)

test=# copy copytest to stdout (format json);
{"style":"DOS","test":"abc\r\ndef","filler":1}
{"style":"Unix","test":"abc\ndef","filler":2}
{"style":"Mac","test":"abc\rdef","filler":3}
{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
8<-------------------------------

At this point "COPY TO" should be sending exactly the unaltered output 
of the postgres JSON processing functions.
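
(One way to verify that claim against the same table -- the existing json 
functions should produce identical rows:)

8<-------------------------------
-- sanity check: should match the COPY output above, row for row
select row_to_json(t) from copytest t;
8<-------------------------------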

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Emitting JSON to file using COPY TO

From: Andrew Dunstan
Date:
On 2023-12-04 Mo 17:55, Davin Shearer wrote:
> Sorry about the top posting / top quoting... the link you sent me 
> gives me a 404.  I'm not exactly sure what top quoting / posting means 
> and Googling those terms wasn't helpful for me, but I've removed the 
> quoting that my mail client is automatically "helpfully" adding to my 
> emails.  I mean no offense.


Hmm. Luckily the Wayback Machine has a copy: 
<http://web.archive.org/web/20230608210806/idallen.com/topposting.html>

Maybe I'll put a copy in the developer wiki.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Emitting JSON to file using COPY TO

From: Joe Conway
Date:
On 12/4/23 21:54, Joe Conway wrote:
> On 12/4/23 17:55, Davin Shearer wrote:
>> There are however a few characters that need to be escaped
> 
>>  1. " (double quote)
>>  2. \ (backslash)
>>  3. / (forward slash)
>>  4. \b (backspace)
>>  5. \f (form feed)
>>  6. \n (new line)
>>  7. \r (carriage return)
>>  8. \t (horizontal tab)
>> 
>> These characters should be represented in the test cases to see how the 
>> escaping behaves and to ensure that the escaping is done properly per 
>> JSON requirements.
> 
> I can look at adding these as test cases.
So I did a quick check:
8<--------------------------
with t(f1) as
(
   values
     (E'aaa\"bbb'::text),
     (E'aaa\\bbb'::text),
     (E'aaa\/bbb'::text),
     (E'aaa\bbbb'::text),
     (E'aaa\fbbb'::text),
     (E'aaa\nbbb'::text),
     (E'aaa\rbbb'::text),
     (E'aaa\tbbb'::text)
)
select
   length(t.f1),
   t.f1,
   row_to_json(t)
from t;
  length |     f1      |    row_to_json
--------+-------------+-------------------
       7 | aaa"bbb     | {"f1":"aaa\"bbb"}
       7 | aaa\bbb     | {"f1":"aaa\\bbb"}
       7 | aaa/bbb     | {"f1":"aaa/bbb"}
       7 | aaa\x08bbb  | {"f1":"aaa\bbbb"}
       7 | aaa\x0Cbbb  | {"f1":"aaa\fbbb"}
       7 | aaa        +| {"f1":"aaa\nbbb"}
         | bbb         |
       7 | aaa\rbbb    | {"f1":"aaa\rbbb"}
       7 | aaa     bbb | {"f1":"aaa\tbbb"}
(8 rows)

8<--------------------------

This is all independent of my patch for COPY TO. If I am reading that 
correctly, everything matches Davin's table *except* the forward slash 
("/"). I defer to the experts on the thread to debate that...

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From: Davin Shearer
Date:
Thanks for the wayback machine link Andrew.  I read it, understood it, and will comply.

Joe, those test cases look great and the outputs are the same as `jq`.

As for forward slashes being escaped, I found this: https://stackoverflow.com/questions/1580647/json-why-are-forward-slashes-escaped.

Forward slash escaping is optional, so not escaping them in Postgres is okay.   The important thing is that the software _reading_ JSON interprets both '\/' and '/' as '/'.
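
(Both spellings are indeed accepted on input; json keeps the input text 
verbatim while jsonb normalizes the optional escape away:)

8<-------------------------------
select '"a\/b"'::json as j, '"a\/b"'::jsonb as jb;
-- j  returns "a\/b"
-- jb returns "a/b"
8<-------------------------------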

Re: Emitting JSON to file using COPY TO

From: Joe Conway
Date:
On 12/5/23 12:43, Davin Shearer wrote:
> Joe, those test cases look great and the outputs are the same as `jq`.

<link to info regarding escaping of forward slashes>

> Forward slash escaping is optional, so not escaping them in Postgres is 
> okay. The important thing is that the software _reading_ JSON 
> interprets both '\/' and '/' as '/'.

Thanks for the review and info. I modified the existing regression test 
thus:

8<--------------------------
create temp table copyjsontest (
     id bigserial,
     f1 text,
     f2 timestamptz);

insert into copyjsontest
   select g.i,
          CASE WHEN g.i % 2 = 0 THEN
            'line with '' in it: ' || g.i::text
          ELSE
            'line with " in it: ' || g.i::text
          END,
          'Mon Feb 10 17:32:01 1997 PST'
   from generate_series(1,5) as g(i);

insert into copyjsontest (f1) values
(E'aaa\"bbb'::text),
(E'aaa\\bbb'::text),
(E'aaa\/bbb'::text),
(E'aaa\bbbb'::text),
(E'aaa\fbbb'::text),
(E'aaa\nbbb'::text),
(E'aaa\rbbb'::text),
(E'aaa\tbbb'::text);
copy copyjsontest to stdout json;
{"id":1,"f1":"line with \" in it: 1","f2":"1997-02-10T20:32:01-05:00"}
{"id":2,"f1":"line with ' in it: 2","f2":"1997-02-10T20:32:01-05:00"}
{"id":3,"f1":"line with \" in it: 3","f2":"1997-02-10T20:32:01-05:00"}
{"id":4,"f1":"line with ' in it: 4","f2":"1997-02-10T20:32:01-05:00"}
{"id":5,"f1":"line with \" in it: 5","f2":"1997-02-10T20:32:01-05:00"}
{"id":1,"f1":"aaa\"bbb","f2":null}
{"id":2,"f1":"aaa\\bbb","f2":null}
{"id":3,"f1":"aaa/bbb","f2":null}
{"id":4,"f1":"aaa\bbbb","f2":null}
{"id":5,"f1":"aaa\fbbb","f2":null}
{"id":6,"f1":"aaa\nbbb","f2":null}
{"id":7,"f1":"aaa\rbbb","f2":null}
{"id":8,"f1":"aaa\tbbb","f2":null}
8<--------------------------

I think the code, documentation, and tests are in pretty good shape at 
this point. Latest version attached.

Any other comments or complaints out there?


-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Emitting JSON to file using COPY TO

From
Davin Shearer
Date:
Hi Joe,

In reviewing the 005 patch, I think that when used with FORCE ARRAY, we should also _imply_ FORCE ROW DELIMITER.  I can't envision a use case where someone would want to use FORCE ARRAY without also using FORCE ROW DELIMITER.  I can, however, envision a use case where someone would want FORCE ROW DELIMITER without FORCE ARRAY, like maybe including into a larger array.  I definitely appreciate these options and the flexibility that they afford from a user perspective.

In the test output, will you also show the different variations with FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, false), (false, true), (true, true)}?  Technically you've already shown me the (false, false) case as those are the defaults.
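
To sketch what I'd expect from those combinations (my guesses at the exact output, not something I've actually run against the patch):

-- (force_array=false, force_row_delimiter=false): plain JSON lines
{"id":1}
{"id":2}

-- (false, true): rows joined by leading commas, no brackets
{"id":1}
,{"id":2}

-- (true, true): a complete JSON array; (true, false) presumably
-- errors out, since FORCE ARRAY needs the delimiters
[
 {"id":1}
,{"id":2}
]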

Thanks!

Re: Emitting JSON to file using COPY TO

From
Andrew Dunstan
Date:
On 2023-12-05 Tu 14:50, Davin Shearer wrote:
> Hi Joe,
>
> In reviewing the 005 patch, I think that when used with FORCE ARRAY, 
> we should also _imply_ FORCE ROW DELIMITER.  I can't envision a use 
> case where someone would want to use FORCE ARRAY without also using 
> FORCE ROW DELIMITER.  I can, however, envision a use case where 
> someone would want FORCE ROW DELIMITER without FORCE ARRAY, like maybe 
> including into a larger array.  I definitely appreciate these options 
> and the flexibility that they afford from a user perspective.
>
> In the test output, will you also show the different variations with 
> FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, false), 
> (false, true), (true, true)}?  Technically you've already shown me the 
> (false, false) case as those are the defaults.
>
>

I don't understand the point of FORCE_ROW_DELIMITER at all. There is 
only one legal delimiter of array items in JSON, and that's a comma. 
There's no alternative and it's not optional. So in the array case you 
MUST have commas and in any other case (e.g. LINES) I can't see why you 
would have them.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/5/23 15:55, Andrew Dunstan wrote:
> 
> On 2023-12-05 Tu 14:50, Davin Shearer wrote:
>> Hi Joe,
>>
>> In reviewing the 005 patch, I think that when used with FORCE ARRAY, 
>> we should also _imply_ FORCE ROW DELIMITER.  I can't envision a use 
>> case where someone would want to use FORCE ARRAY without also using 
>> FORCE ROW DELIMITER.  I can, however, envision a use case where 
>> someone would want FORCE ROW DELIMITER without FORCE ARRAY, like maybe 
>> including into a larger array.  I definitely appreciate these options 
>> and the flexibility that they afford from a user perspective.
>>
>> In the test output, will you also show the different variations with 
>> FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, false), 
>> (false, true), (true, true)}?  Technically you've already shown me the 
>> (false, false) case as those are the defaults.
>>
>>
> 
> I don't understand the point of FORCE_ROW_DELIMITER at all. There is
> only one legal delimiter of array items in JSON, and that's a comma.
> There's no alternative and it's not optional. So in the array case you
> MUST have commas and in any other case (e.g. LINES) I can't see why you
> would have them.

The current patch already *does* imply row delimiters in the array case. 
It says so here:
8<---------------------------
+    <varlistentry>
+     <term><literal>FORCE_ARRAY</literal></term>
+     <listitem>
+      <para>
+       Force output of array decorations at the beginning and end of 
output.
+       This option implies the <literal>FORCE_ROW_DELIMITER</literal>
+       option. It is allowed only in <command>COPY TO</command>, and only
+       when using <literal>JSON</literal> format.
+       The default is <literal>false</literal>.
+      </para>
8<---------------------------

and it does so here:
8<---------------------------
+         if (opts_out->force_array)
+             opts_out->force_row_delimiter = true;
8<---------------------------

and it shows that here:
8<---------------------------
+ copy copytest to stdout (format json, force_array);
+ [
+  {"style":"DOS","test":"abc\r\ndef","filler":1}
+ ,{"style":"Unix","test":"abc\ndef","filler":2}
+ ,{"style":"Mac","test":"abc\rdef","filler":3}
+ ,{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
+ ]
8<---------------------------

It also does not allow explicitly setting row delimiters false while 
force_array is true here:
8<---------------------------

+         if (opts_out->force_array &&
+             force_row_delimiter_specified &&
+             !opts_out->force_row_delimiter)
+             ereport(ERROR,
+                     (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                      errmsg("cannot specify FORCE_ROW_DELIMITER false with 
FORCE_ARRAY true")));
8<---------------------------

Am I understanding something incorrectly?

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/5/23 16:02, Joe Conway wrote:
> On 12/5/23 15:55, Andrew Dunstan wrote:
>> and in any other case (e.g. LINES) I can't see why you
>> would have them.

Oh I didn't address this -- I saw examples in the interwebs of MSSQL 
server I think [1] which had the non-array with commas import and export 
style. It was not that tough to support and the code as written already 
does it, so why not?

[1] 

https://learn.microsoft.com/en-us/sql/relational-databases/json/remove-square-brackets-from-json-without-array-wrapper-option?view=sql-server-ver16#example-multiple-row-result


-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
Andrew Dunstan
Date:
On 2023-12-05 Tu 16:02, Joe Conway wrote:
> On 12/5/23 15:55, Andrew Dunstan wrote:
>>
>> On 2023-12-05 Tu 14:50, Davin Shearer wrote:
>>> Hi Joe,
>>>
>>> In reviewing the 005 patch, I think that when used with FORCE ARRAY, 
>>> we should also _imply_ FORCE ROW DELIMITER.  I can't envision a use 
>>> case where someone would want to use FORCE ARRAY without also using 
>>> FORCE ROW DELIMITER.  I can, however, envision a use case where 
>>> someone would want FORCE ROW DELIMITER without FORCE ARRAY, like 
>>> maybe including into a larger array.  I definitely appreciate these 
>>> options and the flexibility that they afford from a user perspective.
>>>
>>> In the test output, will you also show the different variations with 
>>> FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, 
>>> false), (false, true), (true, true)}? Technically you've already 
>>> shown me the (false, false) case as those are the defaults.
>>>
>>>
>>
>> I don't understand the point of FORCE_ROW_DELIMITER at all. There is
>> only one legal delimiter of array items in JSON, and that's a comma.
>> There's no alternative and it's not optional. So in the array case you
>> MUST have commas and in any other case (e.g. LINES) I can't see why you
>> would have them.
>
> The current patch already *does* imply row delimiters in the array 
> case. It says so here:
> 8<---------------------------
> +    <varlistentry>
> + <term><literal>FORCE_ARRAY</literal></term>
> +     <listitem>
> +      <para>
> +       Force output of array decorations at the beginning and end of 
> output.
> +       This option implies the <literal>FORCE_ROW_DELIMITER</literal>
> +       option. It is allowed only in <command>COPY TO</command>, and 
> only
> +       when using <literal>JSON</literal> format.
> +       The default is <literal>false</literal>.
> +      </para>
> 8<---------------------------
>
> and it does so here:
> 8<---------------------------
> +         if (opts_out->force_array)
> +             opts_out->force_row_delimiter = true;
> 8<---------------------------
>
> and it shows that here:
> 8<---------------------------
> + copy copytest to stdout (format json, force_array);
> + [
> +  {"style":"DOS","test":"abc\r\ndef","filler":1}
> + ,{"style":"Unix","test":"abc\ndef","filler":2}
> + ,{"style":"Mac","test":"abc\rdef","filler":3}
> + ,{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
> + ]
> 8<---------------------------
>
> It also does not allow explicitly setting row delimiters false while 
> force_array is true here:
> 8<---------------------------
>
> +         if (opts_out->force_array &&
> +             force_row_delimiter_specified &&
> +             !opts_out->force_row_delimiter)
> +             ereport(ERROR,
> +                     (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                      errmsg("cannot specify FORCE_ROW_DELIMITER 
> false with FORCE_ARRAY true")));
> 8<---------------------------
>
> Am I understanding something incorrectly?


But what's the point of having it if you're not using FORCE_ARRAY?


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/5/23 16:12, Andrew Dunstan wrote:
> 
> On 2023-12-05 Tu 16:02, Joe Conway wrote:
>> On 12/5/23 15:55, Andrew Dunstan wrote:
>>>
>>> On 2023-12-05 Tu 14:50, Davin Shearer wrote:
>>>> Hi Joe,
>>>>
>>>> In reviewing the 005 patch, I think that when used with FORCE ARRAY, 
>>>> we should also _imply_ FORCE ROW DELIMITER.  I can't envision a use 
>>>> case where someone would want to use FORCE ARRAY without also using 
>>>> FORCE ROW DELIMITER.  I can, however, envision a use case where 
>>>> someone would want FORCE ROW DELIMITER without FORCE ARRAY, like 
>>>> maybe including into a larger array.  I definitely appreciate these 
>>>> options and the flexibility that they afford from a user perspective.
>>>>
>>>> In the test output, will you also show the different variations with 
>>>> FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, 
>>>> false), (false, true), (true, true)}? Technically you've already 
>>>> shown me the (false, false) case as those are the defaults.
>>>>
>>>>
>>>
>>> I don't understand the point of FORCE_ROW_DELIMITER at all. There is
>>> only one legal delimiter of array items in JSON, and that's a comma.
>>> There's no alternative and it's not optional. So in the array case you
>>> MUST have commas and in any other case (e.g. LINES) I can't see why you
>>> would have them.
>>
>> The current patch already *does* imply row delimiters in the array 
>> case. It says so here:
>> 8<---------------------------
>> +    <varlistentry>
>> + <term><literal>FORCE_ARRAY</literal></term>
>> +     <listitem>
>> +      <para>
>> +       Force output of array decorations at the beginning and end of 
>> output.
>> +       This option implies the <literal>FORCE_ROW_DELIMITER</literal>
>> +       option. It is allowed only in <command>COPY TO</command>, and 
>> only
>> +       when using <literal>JSON</literal> format.
>> +       The default is <literal>false</literal>.
>> +      </para>
>> 8<---------------------------
>>
>> and it does so here:
>> 8<---------------------------
>> +         if (opts_out->force_array)
>> +             opts_out->force_row_delimiter = true;
>> 8<---------------------------
>>
>> and it shows that here:
>> 8<---------------------------
>> + copy copytest to stdout (format json, force_array);
>> + [
>> +  {"style":"DOS","test":"abc\r\ndef","filler":1}
>> + ,{"style":"Unix","test":"abc\ndef","filler":2}
>> + ,{"style":"Mac","test":"abc\rdef","filler":3}
>> + ,{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
>> + ]
>> 8<---------------------------
>>
>> It also does not allow explicitly setting row delimiters false while 
>> force_array is true here:
>> 8<---------------------------
>>
>> +         if (opts_out->force_array &&
>> +             force_row_delimiter_specified &&
>> +             !opts_out->force_row_delimiter)
>> +             ereport(ERROR,
>> +                     (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>> +                      errmsg("cannot specify FORCE_ROW_DELIMITER 
>> false with FORCE_ARRAY true")));
>> 8<---------------------------
>>
>> Am I understanding something incorrectly?
> 
> 
> But what's the point of having it if you're not using FORCE_ARRAY?


See the follow up email -- other databases support it so why not? It 
seems to be a thing...

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
Andrew Dunstan
Date:
On 2023-12-05 Tu 16:09, Joe Conway wrote:
> On 12/5/23 16:02, Joe Conway wrote:
>> On 12/5/23 15:55, Andrew Dunstan wrote:
>>> and in any other case (e.g. LINES) I can't see why you
>>> would have them.
>
> Oh I didn't address this -- I saw examples in the interwebs of MSSQL 
> server I think [1] which had the non-array with commas import and 
> export style. It was not that tough to support and the code as written 
> already does it, so why not?
>
> [1] 
>
https://learn.microsoft.com/en-us/sql/relational-databases/json/remove-square-brackets-from-json-without-array-wrapper-option?view=sql-server-ver16#example-multiple-row-result
>
>

That seems quite absurd, TBH. I know we've catered for some absurdity in 
the CSV code (much of it down to me), so maybe we need to be liberal in 
what we accept here too. IMNSHO, we should produce either a single JSON 
document (the ARRAY case) or a series of JSON documents, one per row 
(the LINES case).


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/5/23 16:20, Andrew Dunstan wrote:
> On 2023-12-05 Tu 16:09, Joe Conway wrote:
>> On 12/5/23 16:02, Joe Conway wrote:
>>> On 12/5/23 15:55, Andrew Dunstan wrote:
>>>> and in any other case (e.g. LINES) I can't see why you
>>>> would have them.
>>
>> Oh I didn't address this -- I saw examples in the interwebs of MSSQL 
>> server I think [1] which had the non-array with commas import and 
>> export style. It was not that tough to support and the code as written 
>> already does it, so why not?
> 
> That seems quite absurd, TBH. I know we've catered for some absurdity in
> the CSV code (much of it down to me), so maybe we need to be liberal in
> what we accept here too. IMNSHO, we should produce either a single JSON
> document (the ARRAY case) or a series of JSON documents, one per row
> (the LINES case).


So your preference would be to not allow the non-array-with-commas case 
but if/when we implement COPY FROM we would accept that format? As in 
Postel's law ("be conservative in what you do, be liberal in what you 
accept from others")?

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
Davin Shearer
Date:
> Am I understanding something incorrectly?

No, you've got it.  You already covered the concerns there.

> That seems quite absurd, TBH. I know we've catered for some absurdity in
> the CSV code (much of it down to me), so maybe we need to be liberal in
> what we accept here too. IMNSHO, we should produce either a single JSON
> document (the ARRAY case) or a series of JSON documents, one per row
> (the LINES case).

For what it's worth, I agree with Andrew on this.  I also agree with COPY FROM allowing for potentially bogus commas at the end of non-arrays for interop with other products, but to not do that in COPY TO (unless there is some real compelling case to do so).  Emitting bogus JSON (non-array with commas) feels wrong and would be nice to not perpetuate that, if possible.

Thanks again for doing this.  If I can be of any help, let me know.  If/when this makes it into production, I'll be using this feature for sure.

-Davin


Re: Emitting JSON to file using COPY TO

From
Andrew Dunstan
Date:
On 2023-12-05 Tu 16:46, Joe Conway wrote:
> On 12/5/23 16:20, Andrew Dunstan wrote:
>> On 2023-12-05 Tu 16:09, Joe Conway wrote:
>>> On 12/5/23 16:02, Joe Conway wrote:
>>>> On 12/5/23 15:55, Andrew Dunstan wrote:
>>>>> and in any other case (e.g. LINES) I can't see why you
>>>>> would have them.
>>>
>>> Oh I didn't address this -- I saw examples in the interwebs of MSSQL 
>>> server I think [1] which had the non-array with commas import and 
>>> export style. It was not that tough to support and the code as 
>>> written already does it, so why not?
>>
>> That seems quite absurd, TBH. I know we've catered for some absurdity in
>> the CSV code (much of it down to me), so maybe we need to be liberal in
>> what we accept here too. IMNSHO, we should produce either a single JSON
>> document (the ARRAY case) or a series of JSON documents, one per row
>> (the LINES case).
>
>
> So your preference would be to not allow the non-array-with-commas 
> case but if/when we implement COPY FROM we would accept that format? 
> As in Postel's law ("be conservative in what you do, be liberal in 
> what you accept from others")?


Yes, I think so.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/6/23 07:36, Andrew Dunstan wrote:
> 
> On 2023-12-05 Tu 16:46, Joe Conway wrote:
>> On 12/5/23 16:20, Andrew Dunstan wrote:
>>> On 2023-12-05 Tu 16:09, Joe Conway wrote:
>>>> On 12/5/23 16:02, Joe Conway wrote:
>>>>> On 12/5/23 15:55, Andrew Dunstan wrote:
>>>>>> and in any other case (e.g. LINES) I can't see why you
>>>>>> would have them.
>>>>
>>>> Oh I didn't address this -- I saw examples in the interwebs of MSSQL 
>>>> server I think [1] which had the non-array with commas import and 
>>>> export style. It was not that tough to support and the code as 
>>>> written already does it, so why not?
>>>
>>> That seems quite absurd, TBH. I know we've catered for some absurdity in
>>> the CSV code (much of it down to me), so maybe we need to be liberal in
>>> what we accept here too. IMNSHO, we should produce either a single JSON
>>> document (the ARRAY case) or a series of JSON documents, one per row
>>> (the LINES case).
>>
>> So your preference would be to not allow the non-array-with-commas 
>> case but if/when we implement COPY FROM we would accept that format? 
>> As in Postel's law ("be conservative in what you do, be liberal in 
>> what you accept from others")?
> 
> 
> Yes, I think so.

Awesome. The attached does it that way. I also ran pgindent.

I believe this is ready to commit unless there are further comments or 
objections.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Emitting JSON to file using COPY TO

From
Andrew Dunstan
Date:
On 2023-12-06 We 08:49, Joe Conway wrote:
> On 12/6/23 07:36, Andrew Dunstan wrote:
>>
>> On 2023-12-05 Tu 16:46, Joe Conway wrote:
>>> On 12/5/23 16:20, Andrew Dunstan wrote:
>>>> On 2023-12-05 Tu 16:09, Joe Conway wrote:
>>>>> On 12/5/23 16:02, Joe Conway wrote:
>>>>>> On 12/5/23 15:55, Andrew Dunstan wrote:
>>>>>>> and in any other case (e.g. LINES) I can't see why you
>>>>>>> would have them.
>>>>>
>>>>> Oh I didn't address this -- I saw examples in the interwebs of 
>>>>> MSSQL server I think [1] which had the non-array with commas 
>>>>> import and export style. It was not that tough to support and the 
>>>>> code as written already does it, so why not?
>>>>
>>>> That seems quite absurd, TBH. I know we've catered for some 
>>>> absurdity in
>>>> the CSV code (much of it down to me), so maybe we need to be 
>>>> liberal in
>>>> what we accept here too. IMNSHO, we should produce either a single 
>>>> JSON
>>>> document (the ARRAY case) or a series of JSON documents, one per row
>>>> (the LINES case).
>>>
>>> So your preference would be to not allow the non-array-with-commas 
>>> case but if/when we implement COPY FROM we would accept that format? 
>>> As in Postel's law ("be conservative in what you do, be liberal in 
>>> what you accept from others")?
>>
>>
>> Yes, I think so.
>
> Awesome. The attached does it that way. I also ran pgindent.
>
> I believe this is ready to commit unless there are further comments or 
> objections.


Sorry to bikeshed a little more; I'm a bit late looking at this.

I suspect that most users will actually want the table as a single JSON 
document, so it should probably be the default. In any case FORCE_ARRAY 
as an option has a slightly wrong feel to it. I'm having trouble coming 
up with a good name for the reverse of that, off the top of my head.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Emitting JSON to file using COPY TO

From
Tom Lane
Date:
Joe Conway <mail@joeconway.com> writes:
> I believe this is ready to commit unless there are further comments or 
> objections.

I thought we were still mostly at proof-of-concept stage?

In particular, has anyone done any performance testing?
I'm concerned about that because composite_to_json() has
zero capability to cache any metadata across calls, meaning
there is going to be a large amount of duplicated work
per row.

            regards, tom lane



Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/6/23 10:32, Andrew Dunstan wrote:
> 
> On 2023-12-06 We 08:49, Joe Conway wrote:
>> On 12/6/23 07:36, Andrew Dunstan wrote:
>>>
>>> On 2023-12-05 Tu 16:46, Joe Conway wrote:
>>>> On 12/5/23 16:20, Andrew Dunstan wrote:
>>>>> On 2023-12-05 Tu 16:09, Joe Conway wrote:
>>>>>> On 12/5/23 16:02, Joe Conway wrote:
>>>>>>> On 12/5/23 15:55, Andrew Dunstan wrote:
>>>>>>>> and in any other case (e.g. LINES) I can't see why you
>>>>>>>> would have them.
>>>>>>
>>>>>> Oh I didn't address this -- I saw examples in the interwebs of 
>>>>>> MSSQL server I think [1] which had the non-array with commas 
>>>>>> import and export style. It was not that tough to support and the 
>>>>>> code as written already does it, so why not?
>>>>>
>>>>> That seems quite absurd, TBH. I know we've catered for some 
>>>>> absurdity in
>>>>> the CSV code (much of it down to me), so maybe we need to be 
>>>>> liberal in
>>>>> what we accept here too. IMNSHO, we should produce either a single 
>>>>> JSON
>>>>> document (the ARRAY case) or a series of JSON documents, one per row
>>>>> (the LINES case).
>>>>
>>>> So your preference would be to not allow the non-array-with-commas 
>>>> case but if/when we implement COPY FROM we would accept that format? 
>>>> As in Postel's law ("be conservative in what you do, be liberal in 
>>>> what you accept from others")?
>>>
>>>
>>> Yes, I think so.
>>
>> Awesome. The attached does it that way. I also ran pgindent.
>>
>> I believe this is ready to commit unless there are further comments or 
>> objections.
> 
> Sorry to bikeshed a little more, I'm a bit late looking at this.
> 
> I suspect that most users will actually want the table as a single JSON
> document, so it should probably be the default. In any case FORCE_ARRAY
> as an option has a slightly wrong feel to it.

Sure, I can make that happen, although I figured that for the 
many-rows scenario the single array's size might be an issue for whatever 
you are importing into.

> I'm having trouble coming up with a good name for the reverse of
> that, off the top of my head.

Will think about it and propose something with the next patch revision.


-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/6/23 10:44, Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
>> I believe this is ready to commit unless there are further comments or 
>> objections.
> 
> I thought we were still mostly at proof-of-concept stage?

The concept is narrowly scoped enough that I think we are homing in on 
the final patch.

> In particular, has anyone done any performance testing?
> I'm concerned about that because composite_to_json() has
> zero capability to cache any metadata across calls, meaning
> there is going to be a large amount of duplicated work
> per row.

I will devise some kind of test and report back. I suppose something 
with many rows and many narrow columns comparing time to COPY 
text/csv/json modes would do the trick?
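
Something along these lines, I imagine (just a sketch, assuming the 
patch is applied and a superuser session, since COPY TO a file writes 
server-side):

8<--------------------------
create table perftest as
  select g as c1, g as c2, g as c3, g as c4, g as c5, g as c6,
         g as c7, g as c8, g as c9, g as c10, g as c11
  from generate_series(1, 10000000) g;
\timing on
copy perftest to '/dev/null' (format text);
copy perftest to '/dev/null' (format csv);
copy perftest to '/dev/null' (format json);
8<--------------------------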

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
Andrew Dunstan
Date:
On 2023-12-06 We 10:44, Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
>> I believe this is ready to commit unless there are further comments or
>> objections.
> I thought we were still mostly at proof-of-concept stage?
>
> In particular, has anyone done any performance testing?
> I'm concerned about that because composite_to_json() has
> zero capability to cache any metadata across calls, meaning
> there is going to be a large amount of duplicated work
> per row.
>
>             


Yeah, that's hard to deal with, too, as it can be called recursively.

OTOH I'd rather have a version of this that worked slowly than none at all.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Emitting JSON to file using COPY TO

From
Tom Lane
Date:
Joe Conway <mail@joeconway.com> writes:
> On 12/6/23 10:44, Tom Lane wrote:
>> In particular, has anyone done any performance testing?

> I will devise some kind of test and report back. I suppose something 
> with many rows and many narrow columns comparing time to COPY 
> text/csv/json modes would do the trick?

Yeah.  If it's at least in the same ballpark as the existing text/csv
formats then I'm okay with it.  I'm worried that it might be 10x worse,
in which case I think we'd need to do something.

            regards, tom lane



Re: Emitting JSON to file using COPY TO

From
Sehrope Sarkuni
Date:
Big +1 to this overall feature.

This is something I've wanted for a long time as well. While it's possible to use a COPY with text output for a trivial case, the double escaping falls apart quickly for arbitrary data. It's really only usable when you know exactly what you are querying and know it will not be a problem.

Regarding the defaults for the output, I think JSON lines (rather than a JSON array of objects) would be preferred. It's more natural to combine them and generate that type of data on the fly rather than forcing aggregation into a single object.

Couple more features / use cases come to mind as well. Even if they're not part of a first round of this feature I think it'd be helpful to document them now as it might give some ideas for what does make that first cut:

1. Outputting a top level JSON object without the additional column keys. IIUC, the top level keys are always the column names. A common use case would be a single json/jsonb column that is already formatted exactly as the user would like for output. Rather than enveloping it in an object with a dedicated key, it would be nice to be able to output it directly. This would allow non-object results to be outputted as well (e.g., lines of JSON arrays, numbers, or strings). Due to how JSON is structured, I think this would play nice with the JSON lines vs. array concept.

COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, SOME_OPTION_TO_NOT_ENVELOPE)
{"foo":1}
{"foo":2}
{"foo":3}

2. An option to ignore null fields so they are excluded from the output. This would not be a default but would allow shrinking the total size of the output data in many situations. This would be recursive to allow nested objects to be shrunk down (not just the top level). This might be worthwhile as a standalone JSON function though handling it during output would be more efficient as it'd only be read once.

COPY (SELECT json_build_object('foo', CASE WHEN x > 1 THEN x END) FROM generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, SOME_OPTION_TO_NOT_ENVELOPE, JSON_SKIP_NULLS)
{}
{"foo":2}
{"foo":3}

3. Reverse of #2 when copying data in to allow defaulting missing fields to NULL.
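
A sketch of how #3 might look on the input side (the option name is invented for illustration; nothing like it exists yet):

COPY t (a, b) FROM STDIN WITH (FORMAT JSON, JSON_MISSING_FIELDS_AS_NULL);
{"a": 1}
{"a": 2, "b": "x"}
\.
-- the first row would load with b set to NULL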

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/

Re: Emitting JSON to file using COPY TO

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2023-12-06 We 10:44, Tom Lane wrote:
>> In particular, has anyone done any performance testing?
>> I'm concerned about that because composite_to_json() has
>> zero capability to cache any metadata across calls, meaning
>> there is going to be a large amount of duplicated work
>> per row.

> Yeah, that's hard to deal with, too, as it can be called recursively.

Right.  On the plus side, if we did improve this it would presumably
also benefit other callers of composite_to_json[b].

> OTOH I'd rather have a version of this that worked slowly than none at all.

It might be acceptable to plan on improving the performance later,
depending on just how bad it is now.

            regards, tom lane



Re: Emitting JSON to file using COPY TO

From
Nathan Bossart
Date:
On Wed, Dec 06, 2023 at 11:28:59AM -0500, Tom Lane wrote:
> It might be acceptable to plan on improving the performance later,
> depending on just how bad it is now.

On 10M rows with 11 integers each, I'm seeing the following:

    (format text)
    Time: 10056.311 ms (00:10.056)
    Time: 8789.331 ms (00:08.789)
    Time: 8755.070 ms (00:08.755)

    (format csv)
    Time: 12295.480 ms (00:12.295)
    Time: 12311.059 ms (00:12.311)
    Time: 12305.469 ms (00:12.305)

    (format json)
    Time: 24568.621 ms (00:24.569)
    Time: 23756.234 ms (00:23.756)
    Time: 24265.730 ms (00:24.266)

'perf top' tends to look a bit like this:

  13.31%  postgres                      [.] appendStringInfoString
   7.57%  postgres                      [.] datum_to_json_internal
   6.82%  postgres                      [.] SearchCatCache1
   5.35%  [kernel]                      [k] intel_gpio_irq
   3.57%  postgres                      [.] composite_to_json
   3.31%  postgres                      [.] IsValidJsonNumber

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Emitting JSON to file using COPY TO

From
Nathan Bossart
Date:
On Wed, Dec 06, 2023 at 10:33:49AM -0600, Nathan Bossart wrote:
>     (format csv)
>     Time: 12295.480 ms (00:12.295)
>     Time: 12311.059 ms (00:12.311)
>     Time: 12305.469 ms (00:12.305)
> 
>     (format json)
>     Time: 24568.621 ms (00:24.569)
>     Time: 23756.234 ms (00:23.756)
>     Time: 24265.730 ms (00:24.266)

I should also note that the json output is 85% larger than the csv output.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Emitting JSON to file using COPY TO

From
"Daniel Verite"
Date:
    Andrew Dunstan wrote:

> IMNSHO, we should produce either a single JSON
> document (the ARRAY case) or a series of JSON documents, one per row
> (the LINES case).

"COPY Operations" in the doc says:

" The backend sends a CopyOutResponse message to the frontend, followed
   by zero or more CopyData messages (always one per row), followed by
   CopyDone".

In the ARRAY case, the first messages with the copyjsontest
regression test look like this (tshark output):

PostgreSQL
    Type: CopyOut response
    Length: 13
    Format: Text (0)
    Columns: 3
    Format: Text (0)
PostgreSQL
    Type: Copy data
    Length: 6
    Copy data: 5b0a
PostgreSQL
    Type: Copy data
    Length: 76
    Copy data:
207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031…

The first Copy data message with contents "5b0a" (the bytes for
"[" followed by a newline) does not qualify as a row of data with
3 columns as advertised in the CopyOut message. Isn't that a problem?

At least the json non-ARRAY case ("json lines") doesn't have
this issue, since every CopyData message corresponds effectively
to a row in the table.
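
Concretely, a client reading this with libpq sees the decorations as
separate PQgetCopyData() results. A minimal sketch (mine, not from the
regression test), assuming conn is an open connection on which the
COPY has already been issued:

#include <stdio.h>
#include <libpq-fe.h>

static void
drain_copy_out(PGconn *conn)
{
    char   *buf;
    int     len;

    /* each CopyData message arrives as one buffer: "[", a row, or "]" */
    while ((len = PQgetCopyData(conn, &buf, 0)) > 0)
    {
        fwrite(buf, 1, len, stdout);
        PQfreemem(buf);
    }
    /* len == -1 means the COPY is done; -2 means an error occurred */
}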


[1] https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-COPY


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite



Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/6/23 13:59, Daniel Verite wrote:
>     Andrew Dunstan wrote:
> 
>> IMNSHO, we should produce either a single JSON 
>> document (the ARRAY case) or a series of JSON documents, one per row 
>> (the LINES case).
> 
> "COPY Operations" in the doc says:
> 
> " The backend sends a CopyOutResponse message to the frontend, followed
>     by zero or more CopyData messages (always one per row), followed by
>     CopyDone".
> 
> In the ARRAY case, the first messages with the copyjsontest
> regression test look like this (tshark output):
> 
> PostgreSQL
>      Type: CopyOut response
>      Length: 13
>      Format: Text (0)
>      Columns: 3
>     Format: Text (0)
> PostgreSQL
>      Type: Copy data
>      Length: 6
>      Copy data: 5b0a
> PostgreSQL
>      Type: Copy data
>      Length: 76
>      Copy data:
> 207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031…
> 
> The first Copy data message with contents "5b0a" does not qualify
> as a row of data with 3 columns as advertised in the CopyOut
> message. Isn't that a problem?


Is it a real problem, or just a bit of documentation change that I missed?

Anything receiving this and looking for a json array should know how to 
assemble the data correctly despite the extra CopyData messages.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/6/23 11:44, Nathan Bossart wrote:
> On Wed, Dec 06, 2023 at 10:33:49AM -0600, Nathan Bossart wrote:
>>     (format csv)
>>     Time: 12295.480 ms (00:12.295)
>>     Time: 12311.059 ms (00:12.311)
>>     Time: 12305.469 ms (00:12.305)
>> 
>>     (format json)
>>     Time: 24568.621 ms (00:24.569)
>>     Time: 23756.234 ms (00:23.756)
>>     Time: 24265.730 ms (00:24.266)
> 
> I should also note that the json output is 85% larger than the csv output.

I'll see if I can add some caching to composite_to_json(), but based on 
the relative data size it does not sound like there is much performance 
left on the table to go after, no?

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
Tom Lane
Date:
Joe Conway <mail@joeconway.com> writes:
> I'll see if I can add some caching to composite_to_json(), but based on 
> the relative data size it does not sound like there is much performance 
> left on the table to go after, no?

If Nathan's perf results hold up elsewhere, it seems like some
micro-optimization around the text-pushing (appendStringInfoString)
might be more useful than caching.  The 7% spent in cache lookups
could be worth going after later, but it's not the top of the list.

The output size difference does say that maybe we should pay some
attention to the nearby request to not always label every field.
Perhaps there should be an option for each row to transform to
a JSON array rather than an object?

            regards, tom lane



Re: Emitting JSON to file using COPY TO

From
Andrew Dunstan
Date:
On 2023-12-06 We 15:20, Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
>> I'll see if I can add some caching to composite_to_json(), but based on
>> the relative data size it does not sound like there is much performance
>> left on the table to go after, no?
> If Nathan's perf results hold up elsewhere, it seems like some
> micro-optimization around the text-pushing (appendStringInfoString)
> might be more useful than caching.  The 7% spent in cache lookups
> could be worth going after later, but it's not the top of the list.
>
> The output size difference does say that maybe we should pay some
> attention to the nearby request to not always label every field.
> Perhaps there should be an option for each row to transform to
> a JSON array rather than an object?
>
>             


I doubt it. People who want this are likely to want pretty much what 
this patch is providing, not something they would have to transform in 
order to get it. If they want space-efficient data they won't really be 
wanting JSON. Maybe they want Protocol Buffers or something in that vein.

I see there's a nearby proposal to make this area pluggable at 
<https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com>


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/6/23 11:28, Sehrope Sarkuni wrote:
> Big +1 to this overall feature.

cool!

> Regarding the defaults for the output, I think JSON lines (rather than a 
> JSON array of objects) would be preferred. It's more natural to combine 
> them and generate that type of data on the fly rather than forcing 
> aggregation into a single object.

So that is +2 (Sehrope and me) for the status quo (JSON lines), and +2 
(Andrew and Davin) for defaulting to json arrays. Anyone else want to 
weigh in on that issue?

> Couple more features / use cases come to mind as well. Even if they're 
> not part of a first round of this feature I think it'd be helpful to 
> document them now as it might give some ideas for what does make that 
> first cut:
> 
> 1. Outputting a top level JSON object without the additional column 
> keys. IIUC, the top level keys are always the column names. A common use 
> case would be a single json/jsonb column that is already formatted 
> exactly as the user would like for output. Rather than enveloping it in 
> an object with a dedicated key, it would be nice to be able to output it 
> directly. This would allow non-object results to be outputted as well 
> (e.g., lines of JSON arrays, numbers, or strings). Due to how JSON is 
> structured, I think this would play nice with the JSON lines vs. array 
> concept.
> 
> COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM 
> generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, 
> SOME_OPTION_TO_NOT_ENVELOPE)
> {"foo":1}
> {"foo":2}
> {"foo":3}

Your example does not match what you describe, or do I misunderstand? I 
thought your goal was to eliminate the repeated "foo" from each row...

> 2. An option to ignore null fields so they are excluded from the output. 
> This would not be a default but would allow shrinking the total size of 
> the output data in many situations. This would be recursive to allow 
> nested objects to be shrunk down (not just the top level). This might be 
> worthwhile as a standalone JSON function though handling it during 
> output would be more efficient as it'd only be read once.
> 
> COPY (SELECT json_build_object('foo', CASE WHEN x > 1 THEN x END) FROM 
> generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, 
> SOME_OPTION_TO_NOT_ENVELOPE, JSON_SKIP_NULLS)
> {}
> {"foo":2}
> {"foo":3}

clear enough I think

> 3. Reverse of #2 when copying data in to allow defaulting missing fields 
> to NULL.

good to record the ask, but applies to a different feature (COPY FROM 
instead of COPY TO).

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
Sehrope Sarkuni
Date:
On Wed, Dec 6, 2023 at 4:03 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>> The output size difference does say that maybe we should pay some
>> attention to the nearby request to not always label every field.
>> Perhaps there should be an option for each row to transform to
>> a JSON array rather than an object?

> I doubt it. People who want this are likely to want pretty much what
> this patch is providing, not something they would have to transform in
> order to get it. If they want space-efficient data they won't really be
> wanting JSON. Maybe they want Protocol Buffers or something in that vein.

For arrays vs. objects, it's not just about data size. There are plenty of situations where a JSON array is superior to an object (e.g. duplicate column names). Lines of JSON arrays of strings are pretty much CSV with JSON escaping rules and a pair of wrapping brackets. It's common for tabular data in node.js environments as you don't need a separate CSV parser.

Each one has its place and a default of the row_to_json(...) representation of the row still makes sense. But if the user has the option of outputting a single json/jsonb field for each row without an object or array wrapper, then it's possible to support all of these use cases as the user can explicitly pick whatever envelope makes sense:

-- Lines of JSON arrays:
COPY (SELECT json_build_array('test-' || a, b) FROM generate_series(1, 3) a, generate_series(5,6) b) TO STDOUT WITH (FORMAT JSON, SOME_OPTION_TO_DISABLE_ENVELOPE);
["test-1", 5]
["test-2", 5]
["test-3", 5]
["test-1", 6]
["test-2", 6]
["test-3", 6]

-- Lines of JSON strings:
COPY (SELECT to_json('test-' || x) FROM generate_series(1, 5) x) TO STDOUT WITH (FORMAT JSON, SOME_OPTION_TO_DISABLE_ENVELOPE);
"test-1"
"test-2"
"test-3"
"test-4"
"test-5"

I'm not sure how I feel about the behavior being automatic if it's a single top level json / jsonb field rather than requiring the explicit option. It's probably what a user would want but it also feels odd to change the output wrapper automatically based on the fields in the response. If it is automatic and the user wants the additional envelope, the option always exists to wrap it further in another: json_build_object('some_field', my_field_i_want_wrapped)

The duplicate field names would be a good test case too. I haven't gone through this patch but I'm guessing it doesn't filter out duplicates so the behavior would match up with row_to_json(...), i.e. duplicates are preserved:

=> SELECT row_to_json(t.*) FROM (SELECT 1 AS a, 2 AS a) t;
  row_to_json  
---------------
 {"a":1,"a":2}

If so, that's a good test case to add as however that's handled should be deterministic.
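
Presumably the COPY path would produce the same thing, e.g. (guessing at the output here, since the patch goes through composite_to_json):

COPY (SELECT 1 AS a, 2 AS a) TO STDOUT WITH (FORMAT JSON);
{"a":1,"a":2}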

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/

 

Re: Emitting JSON to file using COPY TO

From
Nathan Bossart
Date:
On Wed, Dec 06, 2023 at 03:20:46PM -0500, Tom Lane wrote:
> If Nathan's perf results hold up elsewhere, it seems like some
> micro-optimization around the text-pushing (appendStringInfoString)
> might be more useful than caching.  The 7% spent in cache lookups
> could be worth going after later, but it's not the top of the list.

Agreed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Emitting JSON to file using COPY TO

From
Sehrope Sarkuni
Date:
On Wed, Dec 6, 2023 at 4:29 PM Joe Conway <mail@joeconway.com> wrote:
>> 1. Outputting a top level JSON object without the additional column
>> keys. IIUC, the top level keys are always the column names. A common use
>> case would be a single json/jsonb column that is already formatted
>> exactly as the user would like for output. Rather than enveloping it in
>> an object with a dedicated key, it would be nice to be able to output it
>> directly. This would allow non-object results to be outputted as well
>> (e.g., lines of JSON arrays, numbers, or strings). Due to how JSON is
>> structured, I think this would play nice with the JSON lines vs. array
>> concept.
>>
>> COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM
>> generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON,
>> SOME_OPTION_TO_NOT_ENVELOPE)
>> {"foo":1}
>> {"foo":2}
>> {"foo":3}

> Your example does not match what you describe, or do I misunderstand? I
> thought your goal was to eliminate the repeated "foo" from each row...

The "foo" in this case is explicit as I'm adding it when building the object. What I was trying to show was not adding an additional object wrapper / envelope.

So each row is:

{"foo":1}

Rather than:

"{"json_build_object":{"foo":1}}

If each row has exactly one json / jsonb field, then the user has already indicated the format for each row.

That same mechanism can be used to remove the "foo" entirely via a json/jsonb array.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/

 

Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/6/23 16:42, Sehrope Sarkuni wrote:
> On Wed, Dec 6, 2023 at 4:29 PM Joe Conway <mail@joeconway.com 
> <mailto:mail@joeconway.com>> wrote:
> 
>      > 1. Outputting a top level JSON object without the additional column
>      > keys. IIUC, the top level keys are always the column names. A
>     common use
>      > case would be a single json/jsonb column that is already formatted
>      > exactly as the user would like for output. Rather than enveloping
>     it in
>      > an object with a dedicated key, it would be nice to be able to
>     output it
>      > directly. This would allow non-object results to be outputted as
>     well
>      > (e.g., lines of JSON arrays, numbers, or strings). Due to how
>     JSON is
>      > structured, I think this would play nice with the JSON lines vs.
>     array
>      > concept.
>      >
>      > COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM
>      > generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON,
>      > SOME_OPTION_TO_NOT_ENVELOPE)
>      > {"foo":1}
>      > {"foo":2}
>      > {"foo":3}
> 
>     Your example does not match what you describe, or do I misunderstand? I
>     thought your goal was to eliminate the repeated "foo" from each row...
> 
> 
> The "foo" in this case is explicit as I'm adding it when building the 
> object. What I was trying to show was not adding an additional object 
> wrapper / envelope.
> 
> So each row is:
> 
> {"foo":1}
> 
> Rather than:
> 
> "{"json_build_object":{"foo":1}}

I am still getting confused ;-)

Let's focus on the current proposed patch with a "minimum required 
feature set".

Right now the default behavior is "JSON lines":
8<-------------------------------
COPY (SELECT x.i, 'val' || x.i as v FROM
       generate_series(1, 3) x(i))
TO STDOUT WITH (FORMAT JSON);
{"i":1,"v":"val1"}
{"i":2,"v":"val2"}
{"i":3,"v":"val3"}
8<-------------------------------

and the other, non-default option is "JSON array":
8<-------------------------------
COPY (SELECT x.i, 'val' || x.i as v FROM
       generate_series(1, 3) x(i))
TO STDOUT WITH (FORMAT JSON, FORCE_ARRAY);
[
  {"i":1,"v":"val1"}
,{"i":2,"v":"val2"}
,{"i":3,"v":"val3"}
]
8<-------------------------------

So the questions are:
1. Do those two formats work for the initial implementation?
2. Is the default correct or should it be switched
    e.g. rather than specifying FORCE_ARRAY to get an
    array, something like FORCE_NO_ARRAY to get JSON lines
    and the JSON array is default?

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
"David G. Johnston"
Date:
On Wed, Dec 6, 2023 at 3:38 PM Joe Conway <mail@joeconway.com> wrote:
> So the questions are:
> 1. Do those two formats work for the initial implementation?

Yes.  We provide a stream-oriented format and one atomic-import format.

> 2. Is the default correct or should it be switched
>     e.g. rather than specifying FORCE_ARRAY to get an
>     array, something like FORCE_NO_ARRAY to get JSON lines
>     and the JSON array is default?


No default?

Require explicit of a sub-format when the main format is JSON.

JSON_OBJECT_ROWS
JSON_ARRAY_OF_OBJECTS

For a future compact array-structured-composites sub-format:
JSON_ARRAY_OF_ARRAYS
JSON_ARRAY_ROWS
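
Usage would then read something like this (sketching the proposed syntax only; none of these names exist today):

COPY tbl TO STDOUT WITH (FORMAT JSON_OBJECT_ROWS);
COPY tbl TO STDOUT WITH (FORMAT JSON_ARRAY_OF_OBJECTS);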

David J.

Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/6/23 14:47, Joe Conway wrote:
> On 12/6/23 13:59, Daniel Verite wrote:
>>     Andrew Dunstan wrote:
>> 
>>> IMNSHO, we should produce either a single JSON 
>>> document (the ARRAY case) or a series of JSON documents, one per row 
>>> (the LINES case).
>> 
>> "COPY Operations" in the doc says:
>> 
>> " The backend sends a CopyOutResponse message to the frontend, followed
>>     by zero or more CopyData messages (always one per row), followed by
>>     CopyDone".
>> 
>> In the ARRAY case, the first messages with the copyjsontest
>> regression test look like this (tshark output):
>> 
>> PostgreSQL
>>      Type: CopyOut response
>>      Length: 13
>>      Format: Text (0)
>>      Columns: 3
>>     Format: Text (0)
>> PostgreSQL
>>      Type: Copy data
>>      Length: 6
>>      Copy data: 5b0a
>> PostgreSQL
>>      Type: Copy data
>>      Length: 76
>>      Copy data:
>> 207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031…
>> 
>> The first Copy data message with contents "5b0a" does not qualify
>> as a row of data with 3 columns as advertised in the CopyOut
>> message. Isn't that a problem?
> 
> 
> Is it a real problem, or just a bit of documentation change that I missed?
> 
> Anything receiving this and looking for a json array should know how to
> assemble the data correctly despite the extra CopyData messages.

Hmm, maybe the real problem here is that Columns do not equal "3" for 
the json mode case -- that should really say "1" I think, because the 
row is not represented as 3 columns but rather 1 json object.

Does that sound correct?

Assuming yes, there is still maybe an issue that there are two more 
"rows" than actual output rows (the "[" and the "]"), but maybe those 
are less likely to cause some hazard?

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
"David G. Johnston"
Date:
On Wed, Dec 6, 2023 at 4:09 PM Joe Conway <mail@joeconway.com> wrote:
On 12/6/23 14:47, Joe Conway wrote:
> On 12/6/23 13:59, Daniel Verite wrote:
>>      Andrew Dunstan wrote:
>>
>>> IMNSHO, we should produce either a single JSON
>>> document (the ARRAY case) or a series of JSON documents, one per row
>>> (the LINES case).
>>
>> "COPY Operations" in the doc says:
>>
>> " The backend sends a CopyOutResponse message to the frontend, followed
>>     by zero or more CopyData messages (always one per row), followed by
>>     CopyDone".
>>
>> In the ARRAY case, the first messages with the copyjsontest
>> regression test look like this (tshark output):
>>
>> PostgreSQL
>>      Type: CopyOut response
>>      Length: 13
>>      Format: Text (0)
>>      Columns: 3
>>      Format: Text (0)

> Anything receiving this and looking for a json array should know how to
> assemble the data correctly despite the extra CopyData messages.

> Hmm, maybe the real problem here is that Columns do not equal "3" for
> the json mode case -- that should really say "1" I think, because the
> row is not represented as 3 columns but rather 1 json object.
>
> Does that sound correct?
>
> Assuming yes, there is still maybe an issue that there are two more
> "rows" than actual output rows (the "[" and the "]"), but maybe those
> are less likely to cause some hazard?


What is the limitation, if any, of introducing new type codes for these?  n = 2..N for the different variants?  Or even -1 for "raw text"?  And document that columns and structural rows need to be determined out-of-band.  Continuing to use 1 (text) for this non-csv data seems like a hack even if we can technically make it function.  The semantics, especially for the array case, are completely discarded or wrong.

David J.

Re: Emitting JSON to file using COPY TO

From
"David G. Johnston"
Date:
On Wed, Dec 6, 2023 at 4:28 PM David G. Johnston <david.g.johnston@gmail.com> wrote:
On Wed, Dec 6, 2023 at 4:09 PM Joe Conway <mail@joeconway.com> wrote:
On 12/6/23 14:47, Joe Conway wrote:
> On 12/6/23 13:59, Daniel Verite wrote:
>>      Andrew Dunstan wrote:
>>
>>> IMNSHO, we should produce either a single JSON
>>> document (the ARRAY case) or a series of JSON documents, one per row
>>> (the LINES case).
>>
>> "COPY Operations" in the doc says:
>>
>> " The backend sends a CopyOutResponse message to the frontend, followed
>>     by zero or more CopyData messages (always one per row), followed by
>>     CopyDone".
>>
>> In the ARRAY case, the first messages with the copyjsontest
>> regression test look like this (tshark output):
>>
>> PostgreSQL
>>      Type: CopyOut response
>>      Length: 13
>>      Format: Text (0)
>>      Columns: 3
>>      Format: Text (0)

> Anything receiving this and looking for a json array should know how to
> assemble the data correctly despite the extra CopyData messages.

> Hmm, maybe the real problem here is that Columns do not equal "3" for
> the json mode case -- that should really say "1" I think, because the
> row is not represented as 3 columns but rather 1 json object.
>
> Does that sound correct?
>
> Assuming yes, there is still maybe an issue that there are two more
> "rows" than actual output rows (the "[" and the "]"), but maybe those
> are less likely to cause some hazard?


> What is the limitation, if any, of introducing new type codes for these?  n = 2..N for the different variants?  Or even -1 for "raw text"?  And document that columns and structural rows need to be determined out-of-band.  Continuing to use 1 (text) for this non-csv data seems like a hack even if we can technically make it function.  The semantics, especially for the array case, are completely discarded or wrong.


Also, it seems like this answer would be easier to make if we implement COPY FROM now: how is the server supposed to deal with decomposing this data into tables without accurate type information?  I don't see implementing only half of the feature being a good idea.  I've had much more desire for FROM compared to TO personally.

David J.

Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/6/23 18:28, David G. Johnston wrote:
> On Wed, Dec 6, 2023 at 4:09 PM Joe Conway <mail@joeconway.com 
> <mailto:mail@joeconway.com>> wrote:
> 
>     On 12/6/23 14:47, Joe Conway wrote:
>      > On 12/6/23 13:59, Daniel Verite wrote:
>      >>      Andrew Dunstan wrote:
>      >>
>      >>> IMNSHO, we should produce either a single JSON
>      >>> document (the ARRAY case) or a series of JSON documents, one
>     per row
>      >>> (the LINES case).
>      >>
>      >> "COPY Operations" in the doc says:
>      >>
>      >> " The backend sends a CopyOutResponse message to the frontend,
>     followed
>      >>     by zero or more CopyData messages (always one per row),
>     followed by
>      >>     CopyDone".
>      >>
>      >> In the ARRAY case, the first messages with the copyjsontest
>      >> regression test look like this (tshark output):
>      >>
>      >> PostgreSQL
>      >>      Type: CopyOut response
>      >>      Length: 13
>      >>      Format: Text (0)
>      >>      Columns: 3
>      >>      Format: Text (0)
> 
>      > Anything receiving this and looking for a json array should know
>     how to
>      > assemble the data correctly despite the extra CopyData messages.
> 
>     Hmm, maybe the real problem here is that Columns do not equal "3" for
>     the json mode case -- that should really say "1" I think, because the
>     row is not represented as 3 columns but rather 1 json object.
> 
>     Does that sound correct?
> 
>     Assuming yes, there is still maybe an issue that there are two more
>     "rows" than actual output rows (the "[" and the "]"), but maybe those
>     are less likely to cause some hazard?
> 
> 
> What is the limitation, if any, of introducing new type codes for 
> these.  n = 2..N for the different variants?  Or even -1 for "raw 
> text"?  And document that columns and structural rows need to be 
> determined out-of-band.  Continuing to use 1 (text) for this non-csv 
> data seems like a hack even if we can technically make it function.  The 
> semantics, especially for the array case, are completely discarded or wrong.

I am not following you here. SendCopyBegin looks like this currently:

8<--------------------------------
SendCopyBegin(CopyToState cstate)
{
    StringInfoData buf;
    int            natts = list_length(cstate->attnumlist);
    int16        format = (cstate->opts.binary ? 1 : 0);
    int            i;

    pq_beginmessage(&buf, PqMsg_CopyOutResponse);
    pq_sendbyte(&buf, format);    /* overall format */
    pq_sendint16(&buf, natts);
    for (i = 0; i < natts; i++)
        pq_sendint16(&buf, format); /* per-column formats */
    pq_endmessage(&buf);
    cstate->copy_dest = COPY_FRONTEND;
}
8<--------------------------------

The "1" is saying are we binary mode or not. JSON mode will never be 
sending in binary in the current implementation at least. And it always 
aggregates all the columns as one json object. So the correct answer is 
(I think):
8<--------------------------------
*************** SendCopyBegin(CopyToState cstate)
*** 146,154 ****

       pq_beginmessage(&buf, PqMsg_CopyOutResponse);
       pq_sendbyte(&buf, format);    /* overall format */
!     pq_sendint16(&buf, natts);
!     for (i = 0; i < natts; i++)
!         pq_sendint16(&buf, format); /* per-column formats */
       pq_endmessage(&buf);
       cstate->copy_dest = COPY_FRONTEND;
   }
--- 150,169 ----

       pq_beginmessage(&buf, PqMsg_CopyOutResponse);
       pq_sendbyte(&buf, format);    /* overall format */
!     if (!cstate->opts.json_mode)
!     {
!         pq_sendint16(&buf, natts);
!         for (i = 0; i < natts; i++)
!             pq_sendint16(&buf, format); /* per-column formats */
!     }
!     else
!     {
!         /*
!          * JSON mode is always one non-binary column
!          */
!         pq_sendint16(&buf, 1);
!         pq_sendint16(&buf, 0);
!     }
       pq_endmessage(&buf);
       cstate->copy_dest = COPY_FRONTEND;
   }
8<--------------------------------

That still leaves the need to fix the documentation:

" The backend sends a CopyOutResponse message to the frontend, followed
    by zero or more CopyData messages (always one per row), followed by
    CopyDone"

probably "always one per row" would be changed to note that json array 
format outputs two extra rows for the start/end bracket.

In fact, as written the patch does this:
8<--------------------------------
COPY (SELECT x.i, 'val' || x.i as v FROM
       generate_series(1, 3) x(i) WHERE false)
TO STDOUT WITH (FORMAT JSON, FORCE_ARRAY);
[
]
8<--------------------------------

Not sure if that is a problem or not.
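
For comparison, the same empty query without FORCE_ARRAY should presumably 
emit no output at all, since there are no rows to serialize (a sketch, not 
verified against the patch):
8<--------------------------------
COPY (SELECT x.i, 'val' || x.i as v FROM
       generate_series(1, 3) x(i) WHERE false)
TO STDOUT WITH (FORMAT JSON);
8<--------------------------------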

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/6/23 18:38, David G. Johnston wrote:
> On Wed, Dec 6, 2023 at 4:28 PM David G. Johnston 
> <david.g.johnston@gmail.com <mailto:david.g.johnston@gmail.com>> wrote:
> 
>     On Wed, Dec 6, 2023 at 4:09 PM Joe Conway <mail@joeconway.com
>     <mailto:mail@joeconway.com>> wrote:
> 
>         On 12/6/23 14:47, Joe Conway wrote:
>          > On 12/6/23 13:59, Daniel Verite wrote:
>          >>      Andrew Dunstan wrote:
>          >>
>          >>> IMNSHO, we should produce either a single JSON
>          >>> document (the ARRAY case) or a series of JSON documents,
>         one per row
>          >>> (the LINES case).
>          >>
>          >> "COPY Operations" in the doc says:
>          >>
>          >> " The backend sends a CopyOutResponse message to the
>         frontend, followed
>          >>     by zero or more CopyData messages (always one per row),
>         followed by
>          >>     CopyDone".
>          >>
>          >> In the ARRAY case, the first messages with the copyjsontest
>          >> regression test look like this (tshark output):
>          >>
>          >> PostgreSQL
>          >>      Type: CopyOut response
>          >>      Length: 13
>          >>      Format: Text (0)
>          >>      Columns: 3
>          >>      Format: Text (0)
> 
>          > Anything receiving this and looking for a json array should
>         know how to
>          > assemble the data correctly despite the extra CopyData messages.
> 
>         Hmm, maybe the real problem here is that Columns do not equal
>         "3" for
>         the json mode case -- that should really say "1" I think,
>         because the
>         row is not represented as 3 columns but rather 1 json object.
> 
>         Does that sound correct?
> 
>         Assuming yes, there is still maybe an issue that there are two more
>         "rows" that actual output rows (the "[" and the "]"), but maybe
>         those
>         are less likely to cause some hazard?
> 
> 
>     What is the limitation, if any, of introducing new type codes for
>     these.  n = 2..N for the different variants?  Or even -1 for "raw
>     text"?  And document that columns and structural rows need to be
>     determined out-of-band.  Continuing to use 1 (text) for this non-csv
>     data seems like a hack even if we can technically make it function. 
>     The semantics, especially for the array case, are completely
>     discarded or wrong.
> 
> Also, it seems like this answer would be easier to make if we implement 
> COPY FROM now since how is the server supposed to deal with decomposing 
> this data into tables without accurate type information?  I don't see 
> implementing only half of the feature being a good idea.  I've had much 
> more desire for FROM compared to TO personally.

Several people have weighed in on the side of getting COPY TO done by 
itself first. Given how long this discussion has already become for a 
relatively small and simple feature, I am a big fan of not expanding the 
scope now.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
"David G. Johnston"
Date:
On Wed, Dec 6, 2023 at 4:45 PM Joe Conway <mail@joeconway.com> wrote:

" The backend sends a CopyOutResponse message to the frontend, followed
    by zero or more CopyData messages (always one per row), followed by
    CopyDone"

probably "always one per row" would be changed to note that json array
format outputs two extra rows for the start/end bracket.

Fair, I was ascribing much more semantic meaning to this than it wants.

I don't see any real requirement, given the lack of semantics, to mention JSON at all.  It is one CopyData per row, regardless of the contents.  We don't delineate between the header and non-header data in CSV.  It isn't a protocol concern.

But I still cannot shake the belief that using a format code of 1 - which really could be interpreted as meaning "textual csv" in practice - for this JSON output is unwise and we should introduce a new integer value for the new fundamental output format.

David J.

Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/6/23 19:39, David G. Johnston wrote:
> On Wed, Dec 6, 2023 at 4:45 PM Joe Conway <mail@joeconway.com 
> <mailto:mail@joeconway.com>> wrote:
> 
> 
>     " The backend sends a CopyOutResponse message to the frontend, followed
>          by zero or more CopyData messages (always one per row), followed by
>          CopyDone"
> 
>     probably "always one per row" would be changed to note that json array
>     format outputs two extra rows for the start/end bracket.
> 
> 
> Fair, I was ascribing much more semantic meaning to this than it wants.
> 
> I don't see any real requirement, given the lack of semantics, to 
> mention JSON at all.  It is one CopyData per row, regardless of the 
> contents.  We don't delineate between the header and non-header data in 
> CSV.  It isn't a protocol concern.

good point

> But I still cannot shake the belief that using a format code of 1 - 
> which really could be interpreted as meaning "textual csv" in practice - 
> for this JSON output is unwise and we should introduce a new integer 
> value for the new fundamental output format.

No, I am pretty sure you still have that wrong. The "1" means binary 
mode. As in
8<----------------------
FORMAT

     Selects the data format to be read or written: text, csv (Comma 
Separated Values), or binary. The default is text.
8<----------------------

That is completely separate from text and csv. It literally means to use 
the binary output functions instead of the usual ones:

8<----------------------
         if (cstate->opts.binary)
             getTypeBinaryOutputInfo(attr->atttypid,
                                     &out_func_oid,
                                     &isvarlena);
         else
             getTypeOutputInfo(attr->atttypid,
                               &out_func_oid,
                               &isvarlena);
8<----------------------

Both "text" and "csv" mode use are non-binary output formats. I believe 
the JSON output format is also non-binary.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
"David G. Johnston"
Date:
On Wed, Dec 6, 2023 at 5:57 PM Joe Conway <mail@joeconway.com> wrote:
On 12/6/23 19:39, David G. Johnston wrote:
> On Wed, Dec 6, 2023 at 4:45 PM Joe Conway <mail@joeconway.com
> <mailto:mail@joeconway.com>> wrote:

> But I still cannot shake the belief that using a format code of 1 -
> which really could be interpreted as meaning "textual csv" in practice -
> for this JSON output is unwise and we should introduce a new integer
> value for the new fundamental output format.

No, I am pretty sure you still have that wrong. The "1" means binary
mode

Ok.  I made the same typo twice; I did mean to write 0 instead of 1.  But the point that we should introduce a 2 still stands.  The new code would mean: use text output functions, but there is no inherent tabular structure in the underlying contents.  Instead, the copy format is JSON, the output layout depends upon the json options in the copy command, and there really shouldn't be any attempt to turn the contents directly into a tabular data structure like you presently do with the CSV data under format 0.  Ignore the column count and column formats as they are fixed or non-existent.

David J.

Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/6/23 18:09, Joe Conway wrote:
> On 12/6/23 14:47, Joe Conway wrote:
>> On 12/6/23 13:59, Daniel Verite wrote:
>>>     Andrew Dunstan wrote:
>>> 
>>>> IMNSHO, we should produce either a single JSON 
>>>> document (the ARRAY case) or a series of JSON documents, one per row 
>>>> (the LINES case).
>>> 
>>> "COPY Operations" in the doc says:
>>> 
>>> " The backend sends a CopyOutResponse message to the frontend, followed
>>>     by zero or more CopyData messages (always one per row), followed by
>>>     CopyDone".
>>> 
>>> In the ARRAY case, the first messages with the copyjsontest
>>> regression test look like this (tshark output):
>>> 
>>> PostgreSQL
>>>      Type: CopyOut response
>>>      Length: 13
>>>      Format: Text (0)
>>>      Columns: 3
>>>     Format: Text (0)
>>> PostgreSQL
>>>      Type: Copy data
>>>      Length: 6
>>>      Copy data: 5b0a
>>> PostgreSQL
>>>      Type: Copy data
>>>      Length: 76
>>>      Copy data:
>>> 207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031…
>>> 
>>> The first Copy data message with contents "5b0a" does not qualify
>>> as a row of data with 3 columns as advertised in the CopyOut
>>> message. Isn't that a problem?
>> 
>> 
>> Is it a real problem, or just a bit of documentation change that I missed?
>> 
>> Anything receiving this and looking for a json array should know how to
>> assemble the data correctly despite the extra CopyData messages.
> 
> Hmm, maybe the real problem here is that Columns do not equal "3" for
> the json mode case -- that should really say "1" I think, because the
> row is not represented as 3 columns but rather 1 json object.
> 
> Does that sound correct?
> 
> Assuming yes, there is still maybe an issue that there are two more
> "rows" that actual output rows (the "[" and the "]"), but maybe those
> are less likely to cause some hazard?


The attached should fix the CopyOut response to say one column. I.e. it 
ought to look something like:

PostgreSQL
      Type: CopyOut response
      Length: 13
      Format: Text (0)
      Columns: 1
      Format: Text (0)
PostgreSQL
      Type: Copy data
      Length: 6
      Copy data: 5b0a
PostgreSQL
      Type: Copy data
      Length: 76
      Copy data: [...]


-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/6/23 20:09, David G. Johnston wrote:
> On Wed, Dec 6, 2023 at 5:57 PM Joe Conway <mail@joeconway.com 
> <mailto:mail@joeconway.com>> wrote:
> 
>     On 12/6/23 19:39, David G. Johnston wrote:
>      > On Wed, Dec 6, 2023 at 4:45 PM Joe Conway <mail@joeconway.com
>     <mailto:mail@joeconway.com>
>      > <mailto:mail@joeconway.com <mailto:mail@joeconway.com>>> wrote:
> 
>      > But I still cannot shake the belief that using a format code of 1 -
>      > which really could be interpreted as meaning "textual csv" in
>     practice -
>      > for this JSON output is unwise and we should introduce a new integer
>      > value for the new fundamental output format.
> 
>     No, I am pretty sure you still have that wrong. The "1" means binary
>     mode
> 
> 
> Ok.  I made the same typo twice, I did mean to write 0 instead of 1.

Fair enough.

> But the point that we should introduce a 2 still stands.  The new code 
> would mean: use text output functions but that there is no inherent 
> tabular structure in the underlying contents.  Instead the copy format 
> was JSON and the output layout is dependent upon the json options in the 
> copy command and that there really shouldn't be any attempt to turn the 
> contents directly into a tabular data structure like you presently do 
> with the CSV data under format 0.  Ignore the column count and column 
> formats as they are fixed or non-existent.

I think that amounts to a protocol change, which we tend to avoid at all 
costs.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
"David G. Johnston"
Date:
On Wed, Dec 6, 2023 at 6:14 PM Joe Conway <mail@joeconway.com> wrote:

> But the point that we should introduce a 2 still stands.  The new code
> would mean: use text output functions but that there is no inherent
> tabular structure in the underlying contents.  Instead the copy format
> was JSON and the output layout is dependent upon the json options in the
> copy command and that there really shouldn't be any attempt to turn the
> contents directly into a tabular data structure like you presently do
> with the CSV data under format 0.  Ignore the column count and column
> formats as they are fixed or non-existent.

I think that amounts to a protocol change, which we tend to avoid at all
costs.


I wasn't sure on that point but figured it might be the case.  It is a value change, not structural, which seems like it is the kind of modification any living system might allow and be expected to have.  But I also don't see any known problem with the current change of content semantics without the format identification change.  Most of the relevant context ends up out-of-band in the copy command itself.

David J.

Re: Emitting JSON to file using COPY TO

From
"Euler Taveira"
Date:
On Wed, Dec 6, 2023, at 3:59 PM, Daniel Verite wrote:
The first Copy data message with contents "5b0a" does not qualify
as a row of data with 3 columns as advertised in the CopyOut
message. Isn't that a problem?

At least the json non-ARRAY case ("json lines") doesn't have
this issue, since every CopyData message corresponds effectively
to a row in the table.

Moreover, if your interface wants to process the COPY data stream while
receiving it, you cannot provide the "json array" format, because each row
(plus all of the previously received ones) is not a valid JSON document.
Hence, a JSON parser cannot be run until you receive the whole data set.
(wal2json format 1 has this disadvantage. Format 2 was born to provide a
better alternative -- each row is a valid JSON document.) I'm not saying that
"json array" is not useful, but that for large data sets it is less useful.
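
To illustrate the contrast, in the lines mode each CopyData message is a
complete JSON document on its own, so a streaming client can parse row by
row. A sketch (the output is what I would expect from the patch's default
mode):

COPY (SELECT g AS id FROM generate_series(1, 3) g)
TO STDOUT WITH (FORMAT 'json');
{"id":1}
{"id":2}
{"id":3}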


--
Euler Taveira

Re: Emitting JSON to file using COPY TO

From
Nathan Bossart
Date:
On Wed, Dec 06, 2023 at 03:20:46PM -0500, Tom Lane wrote:
> If Nathan's perf results hold up elsewhere, it seems like some
> micro-optimization around the text-pushing (appendStringInfoString)
> might be more useful than caching.  The 7% spent in cache lookups
> could be worth going after later, but it's not the top of the list.

Hah, it turns out my benchmark of 110M integers really stresses the
JSONTYPE_NUMERIC path in datum_to_json_internal().  That particular path
calls strlen() twice: once for IsValidJsonNumber(), and once in
appendStringInfoString().  If I save the result from IsValidJsonNumber()
and give it to appendBinaryStringInfo() instead, the COPY goes ~8% faster.
It's probably worth giving datum_to_json_internal() a closer look in a new
thread.

diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 71ae53ff97..1951e93d9d 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -180,6 +180,7 @@ datum_to_json_internal(Datum val, bool is_null, StringInfo result,
 {
     char       *outputstr;
     text       *jsontext;
+    int         len;
 
     check_stack_depth();
 
@@ -223,8 +224,8 @@ datum_to_json_internal(Datum val, bool is_null, StringInfo result,
              * Don't call escape_json for a non-key if it's a valid JSON
              * number.
              */
-            if (!key_scalar && IsValidJsonNumber(outputstr, strlen(outputstr)))
-                appendStringInfoString(result, outputstr);
+            if (!key_scalar && IsValidJsonNumber(outputstr, (len = strlen(outputstr))))
+                appendBinaryStringInfo(result, outputstr, len);
             else
                 escape_json(result, outputstr);
             pfree(outputstr);
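
For anyone wanting to reproduce this, a rough sketch (the table name is made
up, writing to /dev/null just avoids I/O noise, COPY TO a server-side file
needs the appropriate privileges, and 'json' is quoted to sidestep the
grammar quirk mentioned elsewhere in the thread):

CREATE TABLE bench_ints AS
  SELECT g AS n FROM generate_series(1, 110000000) g;
\timing on
COPY bench_ints TO '/dev/null' WITH (FORMAT 'json');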

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/6/23 21:56, Nathan Bossart wrote:
> On Wed, Dec 06, 2023 at 03:20:46PM -0500, Tom Lane wrote:
>> If Nathan's perf results hold up elsewhere, it seems like some
>> micro-optimization around the text-pushing (appendStringInfoString)
>> might be more useful than caching.  The 7% spent in cache lookups
>> could be worth going after later, but it's not the top of the list.
> 
> Hah, it turns out my benchmark of 110M integers really stresses the
> JSONTYPE_NUMERIC path in datum_to_json_internal().  That particular path
> calls strlen() twice: once for IsValidJsonNumber(), and once in
> appendStringInfoString().  If I save the result from IsValidJsonNumber()
> and give it to appendBinaryStringInfo() instead, the COPY goes ~8% faster.
> It's probably worth giving datum_to_json_internal() a closer look in a new
> thread.

Yep, after looking through that code I was going to make the point that 
your 110M-integer test was over-indexing on that one type. I am sure there 
are other micro-optimizations to be made here, but I also think that it 
is outside the scope of the COPY TO JSON patch.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
Andrew Dunstan
Date:


On 2023-12-06 We 17:56, David G. Johnston wrote:
On Wed, Dec 6, 2023 at 3:38 PM Joe Conway <mail@joeconway.com> wrote:
So the questions are:
1. Do those two formats work for the initial implementation?

Yes.  We provide a stream-oriented format and one atomic-import format.

2. Is the default correct or should it be switched
    e.g. rather than specifying FORCE_ARRAY to get an
    array, something like FORCE_NO_ARRAY to get JSON lines
    and the JSON array is default?


No default?

Require explicit specification of a sub-format when the main format is JSON.

JSON_OBJECT_ROWS
JSON_ARRAY_OF_OBJECTS

For a future compact array-structured-composites sub-format:
JSON_ARRAY_OF_ARRAYS
JSON_ARRAY_ROWS



No default seems unlike the way we treat other COPY options. I'm not terribly fussed about which format to have as the default, but I think we should have one.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Emitting JSON to file using COPY TO

From
"Daniel Verite"
Date:
    Joe Conway wrote:

> The attached should fix the CopyOut response to say one column. I.e. it
> ought to look something like:

Spending more time with the doc I came to the opinion that in this bit
of the protocol, in CopyOutResponse (B)
...
Int16
The number of columns in the data to be copied (denoted N below).
...

this number must be the number of columns in the source.
That is, for COPY table(a,b,c) the number is 3, independently
of whether the result is formatted in text, csv, json or binary.

I think that changing it for json can reasonably be interpreted
as a protocol break and we should not do it.

The fact that this value does not help parsing the CopyData
messages that come next is not a new issue. A reader that
doesn't know the field separator and whether it's text or csv
cannot parse these messages into fields anyway.
But just knowing how many columns there are in the original
data might be useful by itself and we don't want to break that.


The other question for me is, in the CopyData message, this
bit:
" Messages sent from the backend will always correspond to single data rows"

ISTM that considering that the "[" starting the json array is a
"data row" is a stretch.
That might be interpreted as a protocol break, depending
on how strict the interpretation is.



Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite



Re: Emitting JSON to file using COPY TO

From
"David G. Johnston"
Date:
On Thursday, December 7, 2023, Daniel Verite <daniel@manitou-mail.org> wrote:
        Joe Conway wrote:

> The attached should fix the CopyOut response to say one column. I.e. it
> ought to look something like:

Spending more time with the doc I came to the opinion that in this bit
of the protocol, in CopyOutResponse (B)
...
Int16
The number of columns in the data to be copied (denoted N below).
...

this number must be the number of columns in the source.
That is, for COPY table(a,b,c) the number is 3, independently
of whether the result is formatted in text, csv, json or binary.

I think that changing it for json can reasonably be interpreted
as a protocol break and we should not do it.

The fact that this value does not help parsing the CopyData
messages that come next is not a new issue. A reader that
doesn't know the field separator and whether it's text or csv
cannot parse these messages into fields anyway.
But just knowing how many columns there are in the original
data might be useful by itself and we don't want to break that.

This argument for leaving 3 as the column count makes sense to me.  I agree this content is not meant to facilitate interpreting the contents at a protocol level.
 


The other question for me is, in the CopyData message, this
bit:
" Messages sent from the backend will always correspond to single data rows"

ISTM that considering that the "[" starting the json array is a
"data row" is a stretch.
That might be interpreted as a protocol break, depending
on how strict the interpretation is.

We already effectively interpret this as "one content line per CopyData message" in the csv-with-header-line case.  I'd probably reword it to state that explicitly, and then we again don't have to worry about the protocol caring about any data semantics of the underlying content, only physical semantics.

David J.

Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/7/23 08:35, Daniel Verite wrote:
>     Joe Conway wrote:
> 
>> The attached should fix the CopyOut response to say one column. I.e. it 
>> ought to look something like:
> 
> Spending more time with the doc I came to the opinion that in this bit
> of the protocol, in CopyOutResponse (B)
> ...
> Int16
> The number of columns in the data to be copied (denoted N below).
> ...
> 
> this number must be the number of columns in the source.
> That is, for COPY table(a,b,c) the number is 3, independently
> of whether the result is formatted in text, csv, json or binary.
> 
> I think that changing it for json can reasonably be interpreted
> as a protocol break and we should not do it.
> 
> The fact that this value does not help parsing the CopyData
> messages that come next is not a new issue. A reader that
> doesn't know the field separator and whether it's text or csv
> cannot parse these messages into fields anyway.
> But just knowing how many columns there are in the original
> data might be useful by itself and we don't want to break that.

Ok, that sounds reasonable to me -- I will revert that change.

> The other question for me is, in the CopyData message, this
> bit:
> " Messages sent from the backend will always correspond to single data rows"
> 
> ISTM that considering that the "[" starting the json array is a
> "data row" is a stretch.
> That might be interpreted as a protocol break, depending
> on how strict the interpretation is.

If we really think that is a problem I can see about changing it to this 
format for json array:

8<------------------
copy
(
   with ss(f1, f2) as
   (
     select 1, g.i from generate_series(1, 3) g(i)
   )
   select ss from ss
) to stdout (format json, force_array);
[{"ss":{"f1":1,"f2":1}}
,{"ss":{"f1":1,"f2":2}}
,{"ss":{"f1":1,"f2":3}}]
8<------------------

Is this acceptable to everyone?

Or maybe this is preferred?
8<------------------
[{"ss":{"f1":1,"f2":1}},
  {"ss":{"f1":1,"f2":2}},
  {"ss":{"f1":1,"f2":3}}]
8<------------------

Or as long as we are painting the shed, maybe this?
8<------------------
[{"ss":{"f1":1,"f2":1}},
{"ss":{"f1":1,"f2":2}},
{"ss":{"f1":1,"f2":3}}]
8<------------------

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/7/23 08:52, Joe Conway wrote:
> Or maybe this is preferred?
> 8<------------------
> [{"ss":{"f1":1,"f2":1}},
>    {"ss":{"f1":1,"f2":2}},
>    {"ss":{"f1":1,"f2":3}}]
> 8<------------------

I don't know why my mail client keeps adding extra spaces, but the 
intention here is a single space in front of row 2 and 3 in order to 
line the json objects up at column 2.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
"David G. Johnston"
Date:
On Thursday, December 7, 2023, Joe Conway <mail@joeconway.com> wrote:
On 12/7/23 08:35, Daniel Verite wrote:
        Joe Conway wrote:

The attached should fix the CopyOut response to say one column. I.e. it ought to look something like:

Spending more time with the doc I came to the opinion that in this bit
of the protocol, in CopyOutResponse (B)
...
Int16
The number of columns in the data to be copied (denoted N below).
...

this number must be the number of columns in the source.
That is, for COPY table(a,b,c) the number is 3, independently
of whether the result is formatted in text, csv, json or binary.

I think that changing it for json can reasonably be interpreted
as a protocol break and we should not do it.

The fact that this value does not help parsing the CopyData
messages that come next is not a new issue. A reader that
doesn't know the field separator and whether it's text or csv
cannot parse these messages into fields anyway.
But just knowing how many columns there are in the original
data might be useful by itself and we don't want to break that.

Ok, that sounds reasonable to me -- I will revert that change.

The other question for me is, in the CopyData message, this
bit:
" Messages sent from the backend will always correspond to single data rows"

ISTM that considering that the "[" starting the json array is a
"data row" is a stretch.
That might be interpreted as a protocol break, depending
on how strict the interpretation is.

If we really think that is a problem I can see about changing it to this format for json array:

8<------------------
copy
(
  with ss(f1, f2) as
  (
    select 1, g.i from generate_series(1, 3) g(i)
  )
  select ss from ss
) to stdout (format json, force_array);
[{"ss":{"f1":1,"f2":1}}
,{"ss":{"f1":1,"f2":2}}
,{"ss":{"f1":1,"f2":3}}]
8<------------------

Is this acceptable to everyone?

Or maybe this is preferred?
8<------------------
[{"ss":{"f1":1,"f2":1}},
 {"ss":{"f1":1,"f2":2}},
 {"ss":{"f1":1,"f2":3}}]
8<------------------

Or as long as we are painting the shed, maybe this?
8<------------------
[{"ss":{"f1":1,"f2":1}},
{"ss":{"f1":1,"f2":2}},
{"ss":{"f1":1,"f2":3}}]
8<------------------

Those are all the same breakage though - if truly interpreted as data rows the protocol is basically written such that the array format is not supportable and only the lines format can be used.  Hence my “format 0 doesn’t work” comment for array output and we should explicitly add format 2 where we explicitly decouple lines of output from rows of data.  That said, it would seem in practice format 0 already decouples them and so the current choice of the brackets on their own lines is acceptable.

I’d prefer to keep them on their own line.

I also don’t know why you introduced another level of object nesting here.  That seems quite undesirable.

David J.

Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/7/23 09:11, David G. Johnston wrote:
> Those are all the same breakage though - if truly interpreted as data 
> rows the protocol is basically written such that the array format is not 
> supportable and only the lines format can be used.  Hence my “format 0 
> doesn’t work” comment for array output and we should explicitly add 
> format 2 where we explicitly decouple lines of output from rows of 
> data.  That said, it would seem in practice format 0 already decouples 
> them and so the current choice of the brackets on their own lines is 
> acceptable.
> 
> I’d prefer to keep them on their own line.

WFM ¯\_(ツ)_/¯

I am merely responding with options to the many people opining on the 
thread.

> I also don’t know why you introduced another level of object nesting 
> here.  That seems quite undesirable.

I didn't add anything. It is an artifact of the particular query I wrote 
in the copy to statement (I did "select ss from ss" instead of "select * 
from ss"), mea culpa.

This is what the latest patch, as written today, outputs:
8<----------------------
copy
(select 1, g.i from generate_series(1, 3) g(i))
to stdout (format json, force_array);
[
  {"?column?":1,"i":1}
,{"?column?":1,"i":2}
,{"?column?":1,"i":3}
]
8<----------------------

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
Dave Cramer
Date:



On Thu, 7 Dec 2023 at 08:47, David G. Johnston <david.g.johnston@gmail.com> wrote:
On Thursday, December 7, 2023, Daniel Verite <daniel@manitou-mail.org> wrote:
        Joe Conway wrote:

> The attached should fix the CopyOut response to say one column. I.e. it
> ought to look something like:

Spending more time with the doc I came to the opinion that in this bit
of the protocol, in CopyOutResponse (B)
...
Int16
The number of columns in the data to be copied (denoted N below).
...

this number must be the number of columns in the source.
That is, for COPY table(a,b,c) the number is 3, independently
of whether the result is formatted in text, csv, json or binary.

I think that changing it for json can reasonably be interpreted
as a protocol break and we should not do it.

The fact that this value does not help parsing the CopyData
messages that come next is not a new issue. A reader that
doesn't know the field separator and whether it's text or csv
cannot parse these messages into fields anyway.
But just knowing how many columns there are in the original
data might be useful by itself and we don't want to break that.

This argument for leaving 3 as the column count makes sense to me.  I agree this content is not meant to facilitate interpreting the contents at a protocol level.

I'd disagree. From my POV if the data comes back as a JSON Array this is one object and this should be reflected in the column count. 
 


The other question for me is, in the CopyData message, this
bit:
" Messages sent from the backend will always correspond to single data rows"

ISTM that considering that the "[" starting the json array is a
"data row" is a stretch.
That might be interpreted as a protocol break, depending
on how strict the interpretation is.

Well technically it is a single row if you send an array.

Regardless, per Euler's comment above, I expect the JSON lines format is going to be the preferred one, as the client doesn't have to wait for the entire object before starting to parse.

Dave

Re: Emitting JSON to file using COPY TO

From
"Daniel Verite"
Date:
    Joe Conway wrote:

> copyto_json.007.diff

When the source has json fields with non-significant line feeds, the COPY
output has these line feeds too, which makes the output incompatible
with rule #2 at https://jsonlines.org  ("2. Each Line is a Valid JSON
Value").

create table j(f json);

insert into j values('{"a":1,
"b":2
}');

copy j to stdout (format json);

Result:
{"f":{"a":1,
"b":2
}}

Is that expected? copy.sgml in 007 doesn't describe the output
in terms of lines so it's hard to tell from the doc.


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite



Re: Emitting JSON to file using COPY TO

From
"Daniel Verite"
Date:
    Dave Cramer wrote:

> > This argument for leaving 3 as the column count makes sense to me.  I
> > agree this content is not meant to facilitate interpreting the contents at
> > a protocol level.
> >
>
> I'd disagree. From my POV if the data comes back as a JSON Array this is
> one object and this should be reflected in the column count.

The doc says this:
"Int16
    The number of columns in the data to be copied (denoted N below)."

and this formulation is repeated in PQnfields() for libpq:

"PQnfields
    Returns the number of columns (fields) to be copied."

How to interpret that sentence?
"to be copied" from what, into what, and by what way?

A plausible interpretation is "to be copied from the source data
into the COPY stream, by the backend".    So the number of columns
to be copied likely refers to the columns of the dataset, not the
"in-transit form" that is text or csv or json.

The interpretation you're proposing also makes sense, that it's just
one json column per row, or even a single-row single-column for the
entire dataset in the force_array case, but then the question is why
isn't that number of columns always 1 for the original "text" format,
since each row is represented in the stream as a single long piece of
text?


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite



Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 12/8/23 14:45, Daniel Verite wrote:
>     Joe Conway wrote:
> 
>> copyto_json.007.diff
> 
> When the source has json fields with non-significant line feeds, the COPY
> output has these line feeds too, which makes the output incompatible
> with rule #2 at https://jsonlines.org  ("2. Each Line is a Valid JSON
> Value").
> 
> create table j(f json);
> 
> insert into j values('{"a":1,
> "b":2
> }');
> 
> copy j to stdout (format json);
> 
> Result:
> {"f":{"a":1,
> "b":2
> }}
> 
> Is that expected? copy.sgml in 007 doesn't describe the output
> in terms of lines so it's hard to tell from the doc.

The patch as-is just does the equivalent of row_to_json():
8<----------------------------
select row_to_json(j) from j;
  row_to_json
--------------
  {"f":{"a":1,+
  "b":2       +
  }}
(1 row)
8<----------------------------

So yeah, that is how it works today. I will take a look at what it would 
take to fix it.
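
In the meantime, one possible workaround (a sketch, assuming jsonb's
normalized single-line text output passes through the json mode unchanged)
is to cast the column to jsonb, which strips insignificant whitespace:
8<----------------------------
COPY (SELECT f::jsonb AS f FROM j) TO STDOUT (FORMAT JSON);
{"f":{"a": 1, "b": 2}}
8<----------------------------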


-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
Hannu Krosing
Date:

On Sat, Dec 2, 2023 at 4:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Joe Conway <mail@joeconway.com> writes:
> >> I noticed that, with the PoC patch, "json" is the only format that must be
> >> quoted.  Without quotes, I see a syntax error.


In the longer term we should move any specific COPY flag names and values
out of the grammar, and move their checking into the parts that actually
implement whatever the flag influences.

Similar to what we do with OPTIONS at all levels of FDW definitions
(WRAPPER itself, SERVER, USER MAPPING, FOREIGN TABLE).

[*] https://www.postgresql.org/docs/current/sql-createforeigndatawrapper.html

> >> I'm assuming there's a
> >> conflict with another json-related rule somewhere in gram.y, but I haven't
> >> tracked down exactly which one is causing it.
>
> While I've not looked too closely, I suspect this might be due to the
> FORMAT_LA hack in base_yylex:
>
>             /* Replace FORMAT by FORMAT_LA if it's followed by JSON */
>             switch (next_token)
>             {
>                 case JSON:
>                     cur_token = FORMAT_LA;
>                     break;
>             }

My hope is that turning the WITH into a fully independent part with no
grammar-defined keys or values would also solve the issue of quoting
"json".

For backwards compatibility we may even go the route of keeping the
WITH as-is, but adding OPTIONS, which can take any values at the grammar
level.

I shared my "Pluggable Copy " talk slides from Berlin '22 in another thread

--
Hannu



Re: Emitting JSON to file using COPY TO

From
Dean Rasheed
Date:
On Thu, 7 Dec 2023 at 01:10, Joe Conway <mail@joeconway.com> wrote:
>
> The attached should fix the CopyOut response to say one column.
>

Playing around with this, I found a couple of cases that generate an error:

COPY (SELECT 1 UNION ALL SELECT 2) TO stdout WITH (format json);

COPY (VALUES (1), (2)) TO stdout WITH (format json);

both of those generate the following:

ERROR:  record type has not been registered

Regards,
Dean



Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date:
On 1/8/24 14:36, Dean Rasheed wrote:
> On Thu, 7 Dec 2023 at 01:10, Joe Conway <mail@joeconway.com> wrote:
>>
>> The attached should fix the CopyOut response to say one column.
>>
> 
> Playing around with this, I found a couple of cases that generate an error:
> 
> COPY (SELECT 1 UNION ALL SELECT 2) TO stdout WITH (format json);
> 
> COPY (VALUES (1), (2)) TO stdout WITH (format json);
> 
> both of those generate the following:
> 
> ERROR:  record type has not been registered


Thanks -- will have a look

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
jian he
Date:
On Tue, Jan 9, 2024 at 4:40 AM Joe Conway <mail@joeconway.com> wrote:
>
> On 1/8/24 14:36, Dean Rasheed wrote:
> > On Thu, 7 Dec 2023 at 01:10, Joe Conway <mail@joeconway.com> wrote:
> >>
> >> The attached should fix the CopyOut response to say one column.
> >>
> >
> > Playing around with this, I found a couple of cases that generate an error:
> >
> > COPY (SELECT 1 UNION ALL SELECT 2) TO stdout WITH (format json);
> >
> > COPY (VALUES (1), (2)) TO stdout WITH (format json);
> >
> > both of those generate the following:
> >
> > ERROR:  record type has not been registered
>
>
> Thanks -- will have a look
>
> --
> Joe Conway
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
>
>
>

In the function CopyOneRowTo, I tried calling the function BlessTupleDesc again:

+BlessTupleDesc(slot->tts_tupleDescriptor);
rowdata = ExecFetchSlotHeapTupleDatum(slot);

Please check the attachments (one test.sql file, one patch, one variant that blesses twice).

Now the error cases are gone and fewer cases return an error,
but the new result is not the expected one.

`COPY (SELECT g from generate_series(1,1) g) TO stdout WITH (format json);`
returns
{"":1}
The expected result would be `{"g":1}`.

I think the reason may be related to the function copy_dest_startup.

Attachments

Re: Emitting JSON to file using COPY TO

From
jian he
Date:
On Tue, Jan 16, 2024 at 11:46 AM jian he <jian.universality@gmail.com> wrote:
>
>
> I think the reason may be related to the function copy_dest_startup.
I was wrong about this sentence.

In the function CopyOneRowTo, change the `if (!cstate->opts.json_mode)`
else branch to the following:

else
{
    Datum       rowdata;
    StringInfo  result;

    if (slot->tts_tupleDescriptor->natts == 1)
    {
        /* Flat-copy the attribute array */
        memcpy(TupleDescAttr(slot->tts_tupleDescriptor, 0),
               TupleDescAttr(cstate->queryDesc->tupDesc, 0),
               1 * sizeof(FormData_pg_attribute));
    }
    BlessTupleDesc(slot->tts_tupleDescriptor);
    rowdata = ExecFetchSlotHeapTupleDatum(slot);
    result = makeStringInfo();
    composite_to_json(rowdata, result, false);
    if (json_row_delim_needed &&
        cstate->opts.force_array)
    {
        CopySendChar(cstate, ',');
    }
    else if (cstate->opts.force_array)
    {
        /* first row needs no delimiter */
        CopySendChar(cstate, ' ');
        json_row_delim_needed = true;
    }
    CopySendData(cstate, result->data, result->len);
}

All the cases work now, but it is more like a hack,
because I cannot fully explain why it works.
-------------------------------------------------------------------------------
demo


drop function if exists execute_into_test cascade;
NOTICE:  function execute_into_test() does not exist, skipping
DROP FUNCTION
drop type if exists execute_into_test cascade;
NOTICE:  type "execute_into_test" does not exist, skipping
DROP TYPE
create type eitype as (i integer, y integer);
CREATE TYPE
create or replace function execute_into_test() returns eitype as $$
declare
    _v eitype;
begin
    execute 'select 1,2' into _v;
    return _v;
end; $$ language plpgsql;
CREATE FUNCTION

COPY (SELECT 1 from generate_series(1,1) g) TO stdout WITH (format json);
{"?column?":1}
COPY (SELECT g from generate_series(1,1) g) TO stdout WITH (format json);
{"g":1}
COPY (SELECT g,1 from generate_series(1,1) g) TO stdout WITH (format json);
{"g":1,"?column?":1}
COPY (select * from execute_into_test()) TO stdout WITH (format json);
{"i":1,"y":2}
COPY (select * from execute_into_test() sub) TO stdout WITH (format json);
{"i":1,"y":2}
COPY (select sub from execute_into_test() sub) TO stdout WITH (format json);
{"sub":{"i":1,"y":2}}
COPY (select sub.i from execute_into_test() sub) TO stdout WITH (format json);
{"i":1}
COPY (select sub.y from execute_into_test() sub) TO stdout WITH (format json);
{"y":2}
COPY (VALUES (1), (2)) TO stdout WITH (format json);
{"column1":1}
{"column1":2}
 COPY (SELECT 1 UNION ALL SELECT 2) TO stdout WITH (format json);
{"?column?":1}
{"?column?":2}



Re: Emitting JSON to file using COPY TO

From
Masahiko Sawada
Date:
On Thu, Dec 7, 2023 at 10:10 AM Joe Conway <mail@joeconway.com> wrote:
>
> On 12/6/23 18:09, Joe Conway wrote:
> > On 12/6/23 14:47, Joe Conway wrote:
> >> On 12/6/23 13:59, Daniel Verite wrote:
> >>>     Andrew Dunstan wrote:
> >>>
> >>>> IMNSHO, we should produce either a single JSON
> >>>> document (the ARRAY case) or a series of JSON documents, one per row
> >>>> (the LINES case).
> >>>
> >>> "COPY Operations" in the doc says:
> >>>
> >>> " The backend sends a CopyOutResponse message to the frontend, followed
> >>>     by zero or more CopyData messages (always one per row), followed by
> >>>     CopyDone".
> >>>
> >>> In the ARRAY case, the first messages with the copyjsontest
> >>> regression test look like this (tshark output):
> >>>
> >>> PostgreSQL
> >>>      Type: CopyOut response
> >>>      Length: 13
> >>>      Format: Text (0)
> >>>      Columns: 3
> >>>     Format: Text (0)
> >>> PostgreSQL
> >>>      Type: Copy data
> >>>      Length: 6
> >>>      Copy data: 5b0a
> >>> PostgreSQL
> >>>      Type: Copy data
> >>>      Length: 76
> >>>      Copy data:
> >>> 207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031…
> >>>
> >>> The first Copy data message with contents "5b0a" does not qualify
> >>> as a row of data with 3 columns as advertised in the CopyOut
> >>> message. Isn't that a problem?
> >>
> >>
> >> Is it a real problem, or just a bit of documentation change that I missed?
> >>
> >> Anything receiving this and looking for a json array should know how to
> >> assemble the data correctly despite the extra CopyData messages.
> >
> > Hmm, maybe the real problem here is that Columns do not equal "3" for
> > the json mode case -- that should really say "1" I think, because the
> > row is not represented as 3 columns but rather 1 json object.
> >
> > Does that sound correct?
> >
> > Assuming yes, there is still maybe an issue that there are two more
> > "rows" that actual output rows (the "[" and the "]"), but maybe those
> > are less likely to cause some hazard?
>
>
> The attached should fix the CopyOut response to say one column. I.e. it
> ought to look something like:
>
> PostgreSQL
>       Type: CopyOut response
>       Length: 13
>       Format: Text (0)
>       Columns: 1
>       Format: Text (0)
> PostgreSQL
>       Type: Copy data
>       Length: 6
>       Copy data: 5b0a
> PostgreSQL
>       Type: Copy data
>       Length: 76
>       Copy data: [...]
>

If I'm not missing anything, copyto_json.007.diff is the latest patch, but it
needs to be rebased to the current HEAD. Here are random comments:

---
 if (opts_out->json_mode)
+   {
+       if (is_from)
+           ereport(ERROR,
+                   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                    errmsg("cannot use JSON mode in COPY FROM")));
+   }
+   else if (opts_out->force_array)
+       ereport(ERROR,
+               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                errmsg("COPY FORCE_ARRAY requires JSON mode")));

I think that flattening these two conditions makes the code more readable:

if (opts_out->json_mode && is_from)
ereport(ERROR, ...);

if (!opts_out->json_mode && opts_out->force_array)
ereport(ERROR, ...);

Also these checks can be moved close to other checks at the end of
ProcessCopyOptions().

---
@@ -3395,6 +3395,10 @@ copy_opt_item:
                {
                    $$ = makeDefElem("format", (Node *) makeString("csv"), @1);
                }
+           | JSON
+               {
+                   $$ = makeDefElem("format", (Node *) makeString("json"), @1);
+               }
            | HEADER_P
                {
                    $$ = makeDefElem("header", (Node *) makeBoolean(true), @1);
@@ -3427,6 +3431,10 @@ copy_opt_item:
                {
                    $$ = makeDefElem("encoding", (Node *) makeString($2), @1);
                }
+           | FORCE ARRAY
+               {
+                   $$ = makeDefElem("force_array", (Node *)
makeBoolean(true), @1);
+               }
        ;

I believe we don't need to support new options in old-style syntax.

---
@@ -3469,6 +3477,10 @@ copy_generic_opt_elem:
                {
                    $$ = makeDefElem($1, $2, @1);
                }
+           | FORMAT_LA copy_generic_opt_arg
+               {
+                   $$ = makeDefElem("format", $2, @1);
+               }
        ;

I think it's not necessary. "format" option is already handled in
copy_generic_opt_elem.

---
+/* need delimiter to start next json array element */
+static bool json_row_delim_needed = false;

I think it's cleaner to include json_row_delim_needed into CopyToStateData.

---
Splitting the patch into two patches: add json format and add
force_array option would make reviews easy.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Emitting JSON to file using COPY TO

From
jian he
Date:
On Fri, Jan 19, 2024 at 4:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> If I'm not missing anything, copyto_json.007.diff is the latest patch, but it
> needs to be rebased to the current HEAD. Here are random comments:
>

Please check the latest version.

>  if (opts_out->json_mode)
> +   {
> +       if (is_from)
> +           ereport(ERROR,
> +                   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                    errmsg("cannot use JSON mode in COPY FROM")));
> +   }
> +   else if (opts_out->force_array)
> +       ereport(ERROR,
> +               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                errmsg("COPY FORCE_ARRAY requires JSON mode")));
>
> I think that flattening these two conditions makes the code more readable:

I made it two separate condition checks.

> if (opts_out->json_mode && is_from)
> ereport(ERROR, ...);
>
> if (!opts_out->json_mode && opts_out->force_array)
> ereport(ERROR, ...);
>
> Also these checks can be moved close to other checks at the end of
> ProcessCopyOptions().
>
Yes. I did it; please check it.

> @@ -3395,6 +3395,10 @@ copy_opt_item:
>                 {
>                     $$ = makeDefElem("format", (Node *) makeString("csv"), @1);
>                 }
> +           | JSON
> +               {
> +                   $$ = makeDefElem("format", (Node *) makeString("json"), @1);
> +               }
>             | HEADER_P
>                 {
>                     $$ = makeDefElem("header", (Node *) makeBoolean(true), @1);
> @@ -3427,6 +3431,10 @@ copy_opt_item:
>                 {
>                     $$ = makeDefElem("encoding", (Node *) makeString($2), @1);
>                 }
> +           | FORCE ARRAY
> +               {
> +                   $$ = makeDefElem("force_array", (Node *)
> makeBoolean(true), @1);
> +               }
>         ;
>
> I believe we don't need to support new options in old-style syntax.
>
> ---
> @@ -3469,6 +3477,10 @@ copy_generic_opt_elem:
>                 {
>                     $$ = makeDefElem($1, $2, @1);
>                 }
> +           | FORMAT_LA copy_generic_opt_arg
> +               {
> +                   $$ = makeDefElem("format", $2, @1);
> +               }
>         ;
>
> I think it's not necessary. "format" option is already handled in
> copy_generic_opt_elem.
>

Testing it, I found out this part is necessary,
because a query with WITH like `copy (select 1) to stdout with
(format json, force_array false);` will fail.

> ---
> +/* need delimiter to start next json array element */
> +static bool json_row_delim_needed = false;
>
> I think it's cleaner to include json_row_delim_needed into CopyToStateData.
Yes, I agree. So I did it.

> ---
> Splitting the patch into two patches: add json format and add
> force_array option would make reviews easy.
>
Done. One patch for the json format, another one for the force_array option.

I also made the following cases fail.
copy copytest to stdout (format csv, force_array false);
ERROR:  specify COPY FORCE_ARRAY is only allowed in JSON mode.

If we copy from a table, we call table_scan_getnextslot and there is no need
to worry about the TupleDesc.
However, if we copy a query output as format json, we may need to consider it.

cstate->queryDesc->tupDesc is the output TupleDesc; we can rely on this.
When copying a query result to json, I memcpy() cstate->queryDesc->tupDesc
into the slot's tts_tupleDescriptor,
so composite_to_json can use cstate->queryDesc->tupDesc to do the work.
I guess this will make it more bullet-proof.

Attachments

Re: Emitting JSON to file using COPY TO

From
Junwang Zhao
Date:
Hi hackers,

Kou-san (CCed) has been working on *Make COPY format extendable[1]*, so
I think making *copy to json* based on that work might be the right direction.

I wrote an extension for that purpose, and here is the patch set together
with Kou-san's *extendable copy format* implementation:

0001-0009 is the implementation of extendable copy format
00010 is the pg_copy_json extension

I also created a PR[2] if anybody likes the github review style.

The *extendable copy format* feature is still being developed; I am posting
this email in case the patch set in this thread gets committed without
awareness of the *extendable copy format* feature.

I'd like to hear your opinions.

[1]: https://www.postgresql.org/message-id/20240124.144936.67229716500876806.kou%40clear-code.com
[2]: https://github.com/zhjwpku/postgres/pull/2/files

-- 
Regards
Junwang Zhao

Attachments

Re: Emitting JSON to file using COPY TO

From
vignesh C
Date:
On Sat, 27 Jan 2024 at 11:25, Junwang Zhao <zhjwpku@gmail.com> wrote:
>
> Hi hackers,
>
> Kou-san(CCed) has been working on *Make COPY format extendable[1]*, so
> I think making *copy to json* based on that work might be the right direction.
>
> I wrote an extension for that purpose, and here is the patch set together
> with Kou-san's *extendable copy format* implementation:
>
> 0001-0009 is the implementation of extendable copy format
> 00010 is the pg_copy_json extension
>
> I also created a PR[2] if anybody likes the github review style.
>
> The *extendable copy format* feature is still being developed; I am posting
> this email in case the patch set in this thread gets committed without
> awareness of the *extendable copy format* feature.
>
> I'd like to hear your opinions.

CFBot shows that one of the tests is failing, as in [1]:
[05:46:41.678] /bin/sh: 1: cannot open
/tmp/cirrus-ci-build/contrib/pg_copy_json/sql/test_copy_format.sql: No
such file
[05:46:41.678] diff:
/tmp/cirrus-ci-build/contrib/pg_copy_json/expected/test_copy_format.out:
No such file or directory
[05:46:41.678] diff:
/tmp/cirrus-ci-build/contrib/pg_copy_json/results/test_copy_format.out:
No such file or directory
[05:46:41.678] # diff command failed with status 512: diff
"/tmp/cirrus-ci-build/contrib/pg_copy_json/expected/test_copy_format.out"
"/tmp/cirrus-ci-build/contrib/pg_copy_json/results/test_copy_format.out"
> "/tmp/cirrus-ci-build/contrib/pg_copy_json/results/test_copy_format.out.diff"
[05:46:41.678] Bail out!make[2]: *** [../../src/makefiles/pgxs.mk:454:
check] Error 2
[05:46:41.679] make[1]: *** [Makefile:96: check-pg_copy_json-recurse] Error 2
[05:46:41.679] make: *** [GNUmakefile:71: check-world-contrib-recurse] Error 2

Please post an updated version for the same.

[1] - https://cirrus-ci.com/task/5322439115145216

Regards,
Vignesh



Re: Emitting JSON to file using COPY TO

From
Junwang Zhao
Date:
Hi Vignesh,

On Wed, Jan 31, 2024 at 5:50 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Sat, 27 Jan 2024 at 11:25, Junwang Zhao <zhjwpku@gmail.com> wrote:
> >
> > Hi hackers,
> >
> > Kou-san(CCed) has been working on *Make COPY format extendable[1]*, so
> > I think making *copy to json* based on that work might be the right direction.
> >
> > I wrote an extension for that purpose, and here is the patch set together
> > with Kou-san's *extendable copy format* implementation:
> >
> > 0001-0009 is the implementation of extendable copy format
> > 00010 is the pg_copy_json extension
> >
> > I also created a PR[2] if anybody likes the github review style.
> >
> > The *extendable copy format* feature is still being developed; I am posting
> > this email in case the patch set in this thread gets committed without
> > awareness of the *extendable copy format* feature.
> >
> > I'd like to hear your opinions.
>
> CFBot shows that one of the tests is failing, as in [1]:
> [05:46:41.678] /bin/sh: 1: cannot open
> /tmp/cirrus-ci-build/contrib/pg_copy_json/sql/test_copy_format.sql: No
> such file
> [05:46:41.678] diff:
> /tmp/cirrus-ci-build/contrib/pg_copy_json/expected/test_copy_format.out:
> No such file or directory
> [05:46:41.678] diff:
> /tmp/cirrus-ci-build/contrib/pg_copy_json/results/test_copy_format.out:
> No such file or directory
> [05:46:41.678] # diff command failed with status 512: diff
> "/tmp/cirrus-ci-build/contrib/pg_copy_json/expected/test_copy_format.out"
> "/tmp/cirrus-ci-build/contrib/pg_copy_json/results/test_copy_format.out"
> > "/tmp/cirrus-ci-build/contrib/pg_copy_json/results/test_copy_format.out.diff"
> [05:46:41.678] Bail out!make[2]: *** [../../src/makefiles/pgxs.mk:454:
> check] Error 2
> [05:46:41.679] make[1]: *** [Makefile:96: check-pg_copy_json-recurse] Error 2
> [05:46:41.679] make: *** [GNUmakefile:71: check-world-contrib-recurse] Error 2
>
> Please post an updated version for the same.

Thanks for the reminder; the patch set I posted is not for commit but
for further discussion.

I will post more information about the *extendable copy* feature
when it's about to be committed.

>
> [1] - https://cirrus-ci.com/task/5322439115145216
>
> Regards,
> Vignesh



--
Regards
Junwang Zhao



Re: Emitting JSON to file using COPY TO

From
Alvaro Herrera
Date
On 2024-Jan-23, jian he wrote:

> > +           | FORMAT_LA copy_generic_opt_arg
> > +               {
> > +                   $$ = makeDefElem("format", $2, @1);
> > +               }
> >         ;
> >
> > I think it's not necessary. "format" option is already handled in
> > copy_generic_opt_elem.
> 
> I tested it and found that this part is necessary,
> because a query with WITH, like `copy (select 1) to stdout with
> (format json, force_array false);`, will fail.

Right, because "FORMAT JSON" is turned into FORMAT_LA JSON by parser.c
(see base_yylex there).  I'm not really sure but I think it might be
better to make it "| FORMAT_LA JSON" instead of invoking the whole
copy_generic_opt_arg syntax.  Not because of performance, but just
because it's much clearer what's going on.
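
For illustration, here is a minimal sketch of that narrower production
(not from any posted patch; the action just mirrors the makeString-style
DefElems used elsewhere in gram.y):

    /* Sketch: match FORMAT JSON directly instead of going through
     * copy_generic_opt_arg.  FORMAT_LA is the token base_yylex makes
     * by folding lookahead for "FORMAT JSON". */
    | FORMAT_LA JSON
        {
            $$ = makeDefElem("format", (Node *) makeString("json"), @1);
        }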

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



Re: Emitting JSON to file using COPY TO

From
jian he
Date
On Wed, Jan 31, 2024 at 9:26 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Jan-23, jian he wrote:
>
> > > +           | FORMAT_LA copy_generic_opt_arg
> > > +               {
> > > +                   $$ = makeDefElem("format", $2, @1);
> > > +               }
> > >         ;
> > >
> > > I think it's not necessary. "format" option is already handled in
> > > copy_generic_opt_elem.
> >
> > I tested it and found that this part is necessary,
> > because a query with WITH, like `copy (select 1) to stdout with
> > (format json, force_array false);`, will fail.
>
> Right, because "FORMAT JSON" is turned into FORMAT_LA JSON by parser.c
> (see base_yylex there).  I'm not really sure but I think it might be
> better to make it "| FORMAT_LA JSON" instead of invoking the whole
> copy_generic_opt_arg syntax.  Not because of performance, but just
> because it's much clearer what's going on.
>

Sorry to bother you.
I haven't applied any patch; this is just on master.
I don't know much about gram.y.

copy (select 1)  to stdout with  (format json1);
ERROR:  COPY format "json1" not recognized
LINE 1: copy (select 1)  to stdout with  (format json1);
                                          ^
copy (select 1) to stdout with  (format json);
ERROR:  syntax error at or near "format"
LINE 1: copy (select 1) to stdout with  (format json);
                                         ^

json is a keyword. Is it possible to escape it, so that
`copy (select 1) to stdout with (format json)` produces the same error
message as `copy (select 1) to stdout with (format json1)`?



Re: Emitting JSON to file using COPY TO

From
Alvaro Herrera
Date
On 2024-Feb-02, jian he wrote:

> copy (select 1) to stdout with  (format json);
> ERROR:  syntax error at or near "format"
> LINE 1: copy (select 1) to stdout with  (format json);
>                                          ^
> 
> json is a keyword. Is it possible to escape it, so that
> `copy (select 1) to stdout with (format json)` produces the same error
> message as `copy (select 1) to stdout with (format json1)`?

Sure, you can use 
 copy (select 1) to stdout with  (format "json");
and then you get
ERROR:  COPY format "json" not recognized

Is that what you meant?

If you want the server to send this message when the JSON word is not in
quotes, I'm afraid that's not possible, due to the funny nature of the
FORMAT keyword when the JSON keyword appears after it.  But why do you
care?  If you use the patch, then you no longer get the "not recognized"
error message, because the JSON format is indeed a recognized one.

Maybe I didn't understand your question.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



Re: Emitting JSON to file using COPY TO

From
jian he
Date
On Fri, Feb 2, 2024 at 5:48 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> If you want the server to send this message when the JSON word is not in
> quotes, I'm afraid that's not possible, due to the funny nature of the
> FORMAT keyword when the JSON keyword appears after it.  But why do you
> care?  If you use the patch, then you no longer get the "not recognized"
> error message, because the JSON format is indeed a recognized one.
>

"JSON word is not in quotes" is my intention.

Now it seems that when people implement a custom format for COPY,
if the format name is a keyword, then it needs to be quoted.

Thanks for clarifying!



Re: Emitting JSON to file using COPY TO

From
jian he
Date
On Fri, Jan 19, 2024 at 4:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> if (opts_out->json_mode && is_from)
> ereport(ERROR, ...);
>
> if (!opts_out->json_mode && opts_out->force_array)
> ereport(ERROR, ...);
>
> Also these checks can be moved close to other checks at the end of
> ProcessCopyOptions().
>
> ---
> @@ -3395,6 +3395,10 @@ copy_opt_item:
>                 {
>                     $$ = makeDefElem("format", (Node *) makeString("csv"), @1);
>                 }
> +           | JSON
> +               {
> +                   $$ = makeDefElem("format", (Node *) makeString("json"), @1);
> +               }
>             | HEADER_P
>                 {
>                     $$ = makeDefElem("header", (Node *) makeBoolean(true), @1);
> @@ -3427,6 +3431,10 @@ copy_opt_item:
>                 {
>                     $$ = makeDefElem("encoding", (Node *) makeString($2), @1);
>                 }
> +           | FORCE ARRAY
> +               {
> +                   $$ = makeDefElem("force_array", (Node *)
> makeBoolean(true), @1);
> +               }
>         ;
>
> I believe we don't need to support new options in old-style syntax.
>
You are right about the force_array case; we don't need to add
force_array-related changes to gram.y. (A sketch of the relocated option
checks follows below.)
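
A hedged sketch of what those relocated checks might look like near the
end of ProcessCopyOptions() (field names such as json_mode and
force_array follow the quoted review; the actual patch may differ):

    /* Sketch only: cross-option validation, per Sawada-san's review. */
    if (opts_out->json_mode && is_from)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("COPY FROM does not support FORMAT JSON")));

    if (!opts_out->json_mode && opts_out->force_array)
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("COPY FORCE_ARRAY requires FORMAT JSON")));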


On Wed, Jan 31, 2024 at 9:26 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Jan-23, jian he wrote:
>
> > > +           | FORMAT_LA copy_generic_opt_arg
> > > +               {
> > > +                   $$ = makeDefElem("format", $2, @1);
> > > +               }
> > >         ;
> > >
> > > I think it's not necessary. "format" option is already handled in
> > > copy_generic_opt_elem.
> >
> > I tested it and found that this part is necessary,
> > because a query with WITH, like `copy (select 1) to stdout with
> > (format json, force_array false);`, will fail.
>
> Right, because "FORMAT JSON" is turned into FORMAT_LA JSON by parser.c
> (see base_yylex there).  I'm not really sure but I think it might be
> better to make it "| FORMAT_LA JSON" instead of invoking the whole
> copy_generic_opt_arg syntax.  Not because of performance, but just
> because it's much clearer what's going on.
>
I am not sure which alternative you are referring to.
I've rebased the patch and made some cosmetic changes; now I think it's
pretty neat.
You could make your change based on it, and then I may understand the
alternative you are referring to.

Attachments

Re: Emitting JSON to file using COPY TO

From
"Andrey M. Borodin"
Date
Hello everyone!

Thanks for working on this, really nice feature!

> On 9 Jan 2024, at 01:40, Joe Conway <mail@joeconway.com> wrote:
>
> Thanks -- will have a look

Joe, folks have recently proposed a lot of patches in this thread that
seem to have diverged from the original implementation approach.
As the author of CF entry [0], can you please comment on which patch
version needs review?

Thanks!


Best regards, Andrey Borodin.

[0] https://commitfest.postgresql.org/47/4716/


Re: Emitting JSON to file using COPY TO

From
Joe Conway
Date
On 3/8/24 12:28, Andrey M. Borodin wrote:
> Hello everyone!
> 
> Thanks for working on this, really nice feature!
> 
>> On 9 Jan 2024, at 01:40, Joe Conway <mail@joeconway.com> wrote:
>> 
>> Thanks -- will have a look
> 
> Joe, folks have recently proposed a lot of patches in this thread that
> seem to have diverged from the original implementation approach.
> As the author of CF entry [0], can you please comment on which patch
> version needs review?


I don't know if I agree with the proposed changes, but I have also been 
waiting to see how the parallel discussion regarding COPY extensibility 
shakes out.

And there were a couple of issues found that need to be tracked down.

Additionally I have had time/availability challenges recently.

Overall, chances seem slim that this will make it into 17, but I have 
not quite given up hope yet either.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

From
jian he
Date
On Sat, Mar 9, 2024 at 2:03 AM Joe Conway <mail@joeconway.com> wrote:
>
> On 3/8/24 12:28, Andrey M. Borodin wrote:
> > Hello everyone!
> >
> > Thanks for working on this, really nice feature!
> >
> >> On 9 Jan 2024, at 01:40, Joe Conway <mail@joeconway.com> wrote:
> >>
> >> Thanks -- will have a look
> >
> > Joe, folks have recently proposed a lot of patches in this thread that
> > seem to have diverged from the original implementation approach.
> > As the author of CF entry [0], can you please comment on which patch
> > version needs review?
>
>
> I don't know if I agree with the proposed changes, but I have also been
> waiting to see how the parallel discussion regarding COPY extensibility
> shakes out.
>
> And there were a couple of issues found that need to be tracked down.
>
> Additionally I have had time/availability challenges recently.
>
> Overall, chances seem slim that this will make it into 17, but I have
> not quite given up hope yet either.

Hi.
Here is a summary of the changes I've made in the v9 patches at [0].

meta: rebased. Now you need to use `git apply` or `git am`; previously,
for copyto_json.007.diff, you needed to use GNU patch.


At [1], Dean Rasheed found some corner cases in which the returned
slot's tts_tupleDescriptor from
`
ExecutorRun(cstate->queryDesc, ForwardScanDirection, 0, true);
processed = ((DR_copy *) cstate->queryDesc->dest)->processed;
`
cannot be used for composite_to_json.
Generally, DestReceiver->rStartup is meant to send the TupleDesc to the
DestReceiver. The COPY TO DestReceiver's rStartup function is
copy_dest_startup; however, copy_dest_startup is a no-op.
That means that, to make the final TupleDesc of a COPY TO (FORMAT JSON)
operation bulletproof, we need to copy the tupDesc from CopyToState's
queryDesc. This only applies when the COPY TO source is a query
(example: copy (select 1) to stdout), not a table.
The above is my interpretation.


At [2], Masahiko Sawada made several points, mainly to split the patch
in two: one part for FORMAT JSON, and a second for the FORCE_ARRAY
option. Splitting it in two makes it easier to review, I think.
My changes also addressed all the points Masahiko Sawada mentioned.



[0] https://postgr.es/m/CACJufxHd6ZRmJJBsDOGpovaVAekMS-u6AOrcw0Ja-Wyi-0kGtA@mail.gmail.com
[1] https://postgr.es/m/CAEZATCWh29787xf=4NgkoixeqRHrqi0Qd33Z6_-F8t2dZ0yLCQ@mail.gmail.com
[2] https://postgr.es/m/CAD21AoCb02zhZM3vXb8HSw8fwOsL+iRdEFb--Kmunv8PjPAWjw@mail.gmail.com



Re: Emitting JSON to file using COPY TO

From
jian he
Date
On Sat, Mar 9, 2024 at 9:13 AM jian he <jian.universality@gmail.com> wrote:
>
> On Sat, Mar 9, 2024 at 2:03 AM Joe Conway <mail@joeconway.com> wrote:
> >
> > On 3/8/24 12:28, Andrey M. Borodin wrote:
> > > Hello everyone!
> > >
> > > Thanks for working on this, really nice feature!
> > >
> > >> On 9 Jan 2024, at 01:40, Joe Conway <mail@joeconway.com> wrote:
> > >>
> > >> Thanks -- will have a look
> > >
> > > Joe, folks have recently proposed a lot of patches in this thread that
> > > seem to have diverged from the original implementation approach.
> > > As the author of CF entry [0], can you please comment on which patch
> > > version needs review?
> >
> >
> > I don't know if I agree with the proposed changes, but I have also been
> > waiting to see how the parallel discussion regarding COPY extensibility
> > shakes out.
> >
> > And there were a couple of issues found that need to be tracked down.
> >
> > Additionally I have had time/availability challenges recently.
> >
> > Overall, chances seem slim that this will make it into 17, but I have
> > not quite given up hope yet either.
>
> Hi.
> Here is a summary of the changes I've made in the v9 patches at [0].
>
> meta: rebased. Now you need to use `git apply` or `git am`; previously,
> for copyto_json.007.diff, you needed to use GNU patch.
>
>
> At [1], Dean Rasheed found some corner cases in which the returned
> slot's tts_tupleDescriptor from
> `
> ExecutorRun(cstate->queryDesc, ForwardScanDirection, 0, true);
> processed = ((DR_copy *) cstate->queryDesc->dest)->processed;
> `
> cannot be used for composite_to_json.
> Generally, DestReceiver->rStartup is meant to send the TupleDesc to the
> DestReceiver. The COPY TO DestReceiver's rStartup function is
> copy_dest_startup; however, copy_dest_startup is a no-op.
> That means that, to make the final TupleDesc of a COPY TO (FORMAT JSON)
> operation bulletproof, we need to copy the tupDesc from CopyToState's
> queryDesc. This only applies when the COPY TO source is a query
> (example: copy (select 1) to stdout), not a table.
> The above is my interpretation.
>

Trying to simplify the explanation:
first, refer to the struct DestReceiver.
For COPY TO (FORMAT JSON), we don't send the preliminary TupleDesc to
the DestReceiver via the rStartup function pointer within struct
_DestReceiver.

In `CopyOneRowTo(CopyToState cstate, TupleTableSlot *slot)`,
the slot is the final slot returned by query execution,
but we cannot use its tupdesc (slot->tts_tupleDescriptor) to do
composite_to_json,
because the final returned slot's TupleDesc may change during query execution.

So we need to copy the tupDesc from CopyToState's queryDesc.
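
A hedged sketch of that idea (names here are illustrative; QueryDesc
does expose the planned result descriptor as tupDesc, and
CreateTupleDescCopy() is the standard way to copy one, but the field
the patch stores it in may differ):

    /* Sketch: when the COPY TO source is a query, capture a stable
     * copy of the planned result descriptor up front, instead of
     * trusting slot->tts_tupleDescriptor at row-emission time.
     * "cstate->tupDesc" stands for wherever the patch caches it. */
    if (cstate->queryDesc != NULL)
        cstate->tupDesc = CreateTupleDescCopy(cstate->queryDesc->tupDesc);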

Also rebased; it now applies cleanly.

Attachments