Обсуждение: [PGdocs] fix description for handling pf non-ASCII characters

Поиск
Список
Период
Сортировка

[PGdocs] fix description for handling pf non-ASCII characters

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear hackers,

While discussing based on the article[1] with Japanese developers,
I found inconsistencies between codes and documents.

45b1a67a[2] changed the behavior when non-ASCII characters was set as application_name,
cluster_name and postgres_fdw.application_name, but it seemed not to be documented.
Previously non-ASCII chars were replaed with question makrs '?', but now they are replaced
with a hex escape instead.

How do you think? Is my understanding correct?

Acknowledgement:
Sawada-san and Shinoda-san led the developer's discussion.
Fujii-san was confirmed my points. Thank you for all of their works!

[1]:
https://h50146.www5.hpe.com/products/software/oe/linux/mainstream/support/lcc/pdf/PostgreSQL16Beta1_New_Features_en_20230528_1.pdf
[2]:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=45b1a67a0fcb3f1588df596431871de4c93cb76f;hp=da5d4ea5aaac4fc02f2e2aec272efe438dd4e171

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Вложения

Re: [PGdocs] fix description for handling pf non-ASCII characters

От
jian he
Дата:
On Fri, Jun 23, 2023 at 10:25 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear hackers,
>
> While discussing based on the article[1] with Japanese developers,
> I found inconsistencies between codes and documents.
>
> 45b1a67a[2] changed the behavior when non-ASCII characters was set as application_name,
> cluster_name and postgres_fdw.application_name, but it seemed not to be documented.
> Previously non-ASCII chars were replaed with question makrs '?', but now they are replaced
> with a hex escape instead.
>
> How do you think? Is my understanding correct?
>
> Acknowledgement:
> Sawada-san and Shinoda-san led the developer's discussion.
> Fujii-san was confirmed my points. Thank you for all of their works!
>
> [1]:
https://h50146.www5.hpe.com/products/software/oe/linux/mainstream/support/lcc/pdf/PostgreSQL16Beta1_New_Features_en_20230528_1.pdf
> [2]:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=45b1a67a0fcb3f1588df596431871de4c93cb76f;hp=da5d4ea5aaac4fc02f2e2aec272efe438dd4e171
>
> Best Regards,
> Hayato Kuroda
> FUJITSU LIMITED
>

in your patch:
> printable ASCII characters will be replaced with a hex escape.

My wording is not good. I think the result will be: ASCII characters
will be as is, non-ASCII characters will be replaced with "a hex
escape".

set application_name to 'abc漢字Abc';
SET
test16=# show application_name;
        application_name
--------------------------------
 abc\xe6\xbc\xa2\xe5\xad\x97Abc
(1 row)

I see multi escape, so I am not sure "a hex escape".

to properly render it back to  'abc漢字Abc'
here is how i do it:
select 'abc' || convert_from(decode(' e6bca2e5ad97','hex'), 'UTF8') || 'Abc';

I guess it's still painful if your application_name has non-ASCII chars.



RE: [PGdocs] fix description for handling pf non-ASCII characters

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Jian,

Thank you for checking my patch!

>     
> in your patch:
> > printable ASCII characters will be replaced with a hex escape.
> 
> My wording is not good. I think the result will be: ASCII characters
> will be as is, non-ASCII characters will be replaced with "a hex
> escape".

Yeah, your point was right. I have already said:
"anything other than printable ASCII characters will be replaced with a hex escape"
IIUC They have same meaning.

You might want to say the line was not good, so reworded like
"non-ASCII characters will be replaced with hexadecimal strings." How do you think?

> set application_name to 'abc漢字Abc';
> SET
> test16=# show application_name;
>         application_name
> --------------------------------
>  abc\xe6\xbc\xa2\xe5\xad\x97Abc
> (1 row)
> 
> I see multi escape, so I am not sure "a hex escape".

Not sure what you said, but I could not find word "hex escape" in the document.
So I used "hexadecimal string" instead. Is it acceptable? 

> to properly render it back to  'abc漢字Abc'
> here is how i do it:
> select 'abc' || convert_from(decode(' e6bca2e5ad97','hex'), 'UTF8') || 'Abc';

Yeah, your approach seems right, but I'm not sure it is related with us.
Just to confirm, I don't have interest the method for rendering non-ASCII characters.
My motivation of the patch was to document the the incompatibility noted in [1]:

>
Changed the conversion rules when non-ASCII characters are specified for ASCII-only
strings such as parameters application_name and cluster_name. Previously, it was
converted in byte units with a question mark (?), but in PostgreSQL 16, it is
converted to a hexadecimal string.
>

> I guess it's still painful if your application_name has non-ASCII chars.

I agreed that, but no one has recommended to use non-ASCII.

[1]:
https://h50146.www5.hpe.com/products/software/oe/linux/mainstream/support/lcc/pdf/PostgreSQL16Beta1_New_Features_en_20230528_1.pdf

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Вложения

Re: [PGdocs] fix description for handling pf non-ASCII characters

От
jian he
Дата:
On Thu, Jun 29, 2023 at 3:51 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Jian,
>
> Thank you for checking my patch!
>
> >
> > in your patch:
> > > printable ASCII characters will be replaced with a hex escape.
> >
> > My wording is not good. I think the result will be: ASCII characters
> > will be as is, non-ASCII characters will be replaced with "a hex
> > escape".
>
> Yeah, your point was right. I have already said:
> "anything other than printable ASCII characters will be replaced with a hex escape"
> IIUC They have same meaning.
>
> You might want to say the line was not good, so reworded like
> "non-ASCII characters will be replaced with hexadecimal strings." How do you think?
>
> > set application_name to 'abc漢字Abc';
> > SET
> > test16=# show application_name;
> >         application_name
> > --------------------------------
> >  abc\xe6\xbc\xa2\xe5\xad\x97Abc
> > (1 row)
> >
> > I see multi escape, so I am not sure "a hex escape".
>
> Not sure what you said, but I could not find word "hex escape" in the document.
> So I used "hexadecimal string" instead. Is it acceptable?
>
> > to properly render it back to  'abc漢字Abc'
> > here is how i do it:
> > select 'abc' || convert_from(decode(' e6bca2e5ad97','hex'), 'UTF8') || 'Abc';
>
> Yeah, your approach seems right, but I'm not sure it is related with us.
> Just to confirm, I don't have interest the method for rendering non-ASCII characters.
> My motivation of the patch was to document the the incompatibility noted in [1]:
>
> >
> Changed the conversion rules when non-ASCII characters are specified for ASCII-only
> strings such as parameters application_name and cluster_name. Previously, it was
> converted in byte units with a question mark (?), but in PostgreSQL 16, it is
> converted to a hexadecimal string.
> >
>
> > I guess it's still painful if your application_name has non-ASCII chars.
>
> I agreed that, but no one has recommended to use non-ASCII.
>
> [1]:
https://h50146.www5.hpe.com/products/software/oe/linux/mainstream/support/lcc/pdf/PostgreSQL16Beta1_New_Features_en_20230528_1.pdf
>
> Best Regards,
> Hayato Kuroda
> FUJITSU LIMITED

looks fine. Do you need to add to commitfest?



RE: [PGdocs] fix description for handling pf non-ASCII characters

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Jian,

> On Thu, Jun 29, 2023 at 3:51 PM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Jian,
> >
> > Thank you for checking my patch!
> >
> > >
> > > in your patch:
> > > > printable ASCII characters will be replaced with a hex escape.
> > >
> > > My wording is not good. I think the result will be: ASCII characters
> > > will be as is, non-ASCII characters will be replaced with "a hex
> > > escape".
> >
> > Yeah, your point was right. I have already said:
> > "anything other than printable ASCII characters will be replaced with a hex
> escape"
> > IIUC They have same meaning.
> >
> > You might want to say the line was not good, so reworded like
> > "non-ASCII characters will be replaced with hexadecimal strings." How do you
> think?
> >
> > > set application_name to 'abc漢字Abc';
> > > SET
> > > test16=# show application_name;
> > >         application_name
> > > --------------------------------
> > >  abc\xe6\xbc\xa2\xe5\xad\x97Abc
> > > (1 row)
> > >
> > > I see multi escape, so I am not sure "a hex escape".
> >
> > Not sure what you said, but I could not find word "hex escape" in the document.
> > So I used "hexadecimal string" instead. Is it acceptable?
> >
> > > to properly render it back to  'abc漢字Abc'
> > > here is how i do it:
> > > select 'abc' || convert_from(decode(' e6bca2e5ad97','hex'), 'UTF8') || 'Abc';
> >
> > Yeah, your approach seems right, but I'm not sure it is related with us.
> > Just to confirm, I don't have interest the method for rendering non-ASCII
> characters.
> > My motivation of the patch was to document the the incompatibility noted in [1]:
> >
> > >
> > Changed the conversion rules when non-ASCII characters are specified for
> ASCII-only
> > strings such as parameters application_name and cluster_name. Previously, it
> was
> > converted in byte units with a question mark (?), but in PostgreSQL 16, it is
> > converted to a hexadecimal string.
> > >
> >
> > > I guess it's still painful if your application_name has non-ASCII chars.
> >
> > I agreed that, but no one has recommended to use non-ASCII.
> >
> > [1]:
> https://h50146.www5.hpe.com/products/software/oe/linux/mainstream/suppo
> rt/lcc/pdf/PostgreSQL16Beta1_New_Features_en_20230528_1.pdf
> >
> > Best Regards,
> > Hayato Kuroda
> > FUJITSU LIMITED
> 
> looks fine. Do you need to add to commitfest?

Thank you for your confirmation. ! I registered to following:

https://commitfest.postgresql.org/44/4437/


Best Regards,
Hayato Kuroda
FUJITSU LIMITED
 

Re: [PGdocs] fix description for handling pf non-ASCII characters

От
"Karl O. Pinc"
Дата:
Hello Hayato and Jian,

On Tue, 4 Jul 2023 01:30:56 +0000
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

> Dear Jian,

> > looks fine. Do you need to add to commitfest?  
> 
> Thank you for your confirmation. ! I registered to following:
> 
> https://commitfest.postgresql.org/44/4437/

The way the Postgres commitfest process works is that
someone has to update the page to mark "reviewed" and the
reviewer has to use the commitfest website to pass
the patches to a committer.

I see a few problems with the English and style of the patches
and am commenting below and have signed up as a reviewer.  At
commitfest.postgresql.org I have marked the thread
as needing author attention.  Hayato, you will need
to mark the thread as needing review when you have
replied to this message.

Jian, you might want to sign on as a reviewer as well.
It can be nice to have that record of your participation.

Now, to reviewing the patch:

First, it is now best practice in the PG docs to
put a line break at the end of each sentence.
At least for the sentences on the lines you change.
(No need to update the whole document!)  Please
do this in the next version of your patch.  I don't
know if this is a requirement for acceptance by
a committer, but it won't hurt.

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e700782d3c..a4ce99ba4d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7040,9 +7040,8 @@ local0.*    /var/log/postgresql
         The name will be displayed in the
<structname>pg_stat_activity</structname> view and included in CSV log
entries.  It can also be included in regular log entries via the <xref
linkend="guc-log-line-prefix"/> parameter.
-        Only printable ASCII characters may be used in the
-        <varname>application_name</varname> value. Other characters
will be
-        replaced with question marks (<literal>?</literal>).
+        Non-ASCII characters used in the
<varname>application_name</varname>
+        will be replaced with hexadecimal strings.
        </para>
       </listitem>
      </varlistentry>

Don't use the future tense to describe the system's behavior.  Instead
of "will be" write "are".  (Yes, this change would be an improvement
on the original text.  We should fix it while we're working on it
and our attention is focused.)

It is more accurate, if I understand the issue, to say that characters
are replaced with hexadecimal _representations_ of the input bytes.
Finally, it would be good to know what representation we're talking
about.  Perhaps just give the \xhh example and say: the Postgres
C-style escaped hexadecimal byte value.  And hyperlink to
https://www.postgresql.org/docs/16/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-ESCAPE

(The docbook would be, depending on text you want to link:

<link linkend="sql-syntax-strings-escape">C-style escaped hexadecimal
byte value</link>.

I think.  You link to id="someidvalue" attribute values.)


@@ -8037,10 +8036,9 @@ COPY postgres_log FROM
'/full/path/to/logfile.csv' WITH csv; <para>
         The name can be any string of less
         than <symbol>NAMEDATALEN</symbol> characters (64 characters in
a standard
-        build). Only printable ASCII characters may be used in the
-        <varname>cluster_name</varname> value. Other characters will be
-        replaced with question marks (<literal>?</literal>).  No name
is shown
-        if this parameter is set to the empty string
<literal>''</literal> (which is
+        build). Non-ASCII characters used in the
<varname>cluster_name</varname>
+        will be replaced with hexadecimal strings. No name is shown if
this
+        parameter is set to the empty string <literal>''</literal>
(which is the default). This parameter can only be set at server start.
        </para>
       </listitem>

Same review as for the first patch hunk.

diff --git a/doc/src/sgml/postgres-fdw.sgml
b/doc/src/sgml/postgres-fdw.sgml index 5062d712e7..98785e87ea 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -1067,9 +1067,8 @@ postgres=# SELECT postgres_fdw_disconnect_all();
       of any length and contain even non-ASCII characters.  However
when it's passed to and used as <varname>application_name</varname>
       in a foreign server, note that it will be truncated to less than
-      <symbol>NAMEDATALEN</symbol> characters and anything other than
-      printable ASCII characters will be replaced with question
-      marks (<literal>?</literal>).
+      <symbol>NAMEDATALEN</symbol> characters and non-ASCII characters
will be
+      replaced with hexadecimal strings.
       See <xref linkend="guc-application-name"/> for details.
      </para>
 
Same review as for the first patch hunk.

Since the both of you have looked and confirmed that the
actual behavior matches the new documentation I have not
done this.

But, have either of you checked that we're really talking about
replacing everything outside the 7-bit ASCII encodings?  
My reading of the commit referenced in the first email of this
thread says that it's everything outside of the _printable_
ASCII encodings, ASCII values outside of the range 32 to 127,
inclusive.  

Please check.  The docs should probably say "printable ASCII",
or "non-printable ASCII", depending.  I think the meaning
of "printable ASCII" is widely enough known to be 32-127.
So "printable" is good enough, right?

Regards,

Karl <kop@karlpinc.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein



RE: [PGdocs] fix description for handling pf non-ASCII characters

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Karl,

Thank you for reviewing! PSA new version.

> I see a few problems with the English and style of the patches
> and am commenting below and have signed up as a reviewer.

Your effort is quite helpful for me.

> At
> commitfest.postgresql.org I have marked the thread
> as needing author attention.  Hayato, you will need
> to mark the thread as needing review when you have
> replied to this message.

Sure. I will change the status after posting the patch.

Before replying your comments, I thought I should show the difference between
versions. Regarding old versions (here PG15 was used), non-ASCIIs (like Japanese) are
replaced with '?'.

```
psql (15.4)
Type "help" for help.

postgres=# SET application_name TO 'あああ';
SET
postgres=# SHOW application_name ;
 application_name
------------------
 ?????????
(1 row)
```

As for the HEAD, as my patch said, non-ASCIIs are replaced
with hexadecimal representations. (Were my terminologies correct?).

```
psql (17devel)
Type "help" for help.

postgres=# SET application_name TO 'あああ';
SET
postgres=# SHOW application_name ;
           application_name
--------------------------------------
 \xe3\x81\x82\xe3\x81\x82\xe3\x81\x82
(1 row)
```

Note that non-printable ASCIIs are also replaced with the same rule.

```
psql (15.4)
Type "help" for help.

postgres=# SET application_name TO E'\x03';
SET
postgres=# SHOW application_name ;
 application_name
------------------
 ?
(1 row)

psql (17devel)
Type "help" for help.

postgres=# SET application_name TO E'\x03';
SET
postgres=# SHOW application_name ;
 application_name
------------------
 \x03
(1 row)
```

> Now, to reviewing the patch:
>
> First, it is now best practice in the PG docs to
> put a line break at the end of each sentence.
> At least for the sentences on the lines you change.
> (No need to update the whole document!)  Please
> do this in the next version of your patch.  I don't
> know if this is a requirement for acceptance by
> a committer, but it won't hurt.

I didn't know that. Could you please tell me if you have a source? Anyway,
I put a line break for each sentences for now.

> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> index e700782d3c..a4ce99ba4d 100644
> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -7040,9 +7040,8 @@ local0.*    /var/log/postgresql
>          The name will be displayed in the
> <structname>pg_stat_activity</structname> view and included in CSV log
> entries.  It can also be included in regular log entries via the <xref
> linkend="guc-log-line-prefix"/> parameter.
> -        Only printable ASCII characters may be used in the
> -        <varname>application_name</varname> value. Other characters
> will be
> -        replaced with question marks (<literal>?</literal>).
> +        Non-ASCII characters used in the
> <varname>application_name</varname>
> +        will be replaced with hexadecimal strings.
>         </para>
>        </listitem>
>       </varlistentry>
>
> Don't use the future tense to describe the system's behavior.  Instead
> of "will be" write "are".  (Yes, this change would be an improvement
> on the original text.  We should fix it while we're working on it
> and our attention is focused.)
>
> It is more accurate, if I understand the issue, to say that characters
> are replaced with hexadecimal _representations_ of the input bytes.
> Finally, it would be good to know what representation we're talking
> about.  Perhaps just give the \xhh example and say: the Postgres
> C-style escaped hexadecimal byte value.  And hyperlink to
> https://www.postgresql.org/docs/16/sql-syntax-lexical.html#SQL-SYNTAX-ST
> RINGS-ESCAPE
>
> (The docbook would be, depending on text you want to link:
>
> <link linkend="sql-syntax-strings-escape">C-style escaped hexadecimal
> byte value</link>.
>
> I think.  You link to id="someidvalue" attribute values.)

IIUC the word " Postgres" cannot be used in the doc.
Based on your all comments, I changed as below. How do you think?

```
        Characters that are not printable ASCII, like <literal>\x03</literal>,
        are replaced with the <productname>PostgreSQL</productname>
        <link linkend="sql-syntax-strings-escape">C-style escaped hexadecimal byte value</link>.
```


> @@ -8037,10 +8036,9 @@ COPY postgres_log FROM
> '/full/path/to/logfile.csv' WITH csv; <para>
>          The name can be any string of less
>          than <symbol>NAMEDATALEN</symbol> characters (64 characters
> in
> a standard
> -        build). Only printable ASCII characters may be used in the
> -        <varname>cluster_name</varname> value. Other characters will be
> -        replaced with question marks (<literal>?</literal>).  No name
> is shown
> -        if this parameter is set to the empty string
> <literal>''</literal> (which is
> +        build). Non-ASCII characters used in the
> <varname>cluster_name</varname>
> +        will be replaced with hexadecimal strings. No name is shown if
> this
> +        parameter is set to the empty string <literal>''</literal>
> (which is the default). This parameter can only be set at server start.
>         </para>
>        </listitem>
>
> Same review as for the first patch hunk.

Fixed like above. You can refer the patch.

> diff --git a/doc/src/sgml/postgres-fdw.sgml
> b/doc/src/sgml/postgres-fdw.sgml index 5062d712e7..98785e87ea 100644
> --- a/doc/src/sgml/postgres-fdw.sgml
> +++ b/doc/src/sgml/postgres-fdw.sgml
> @@ -1067,9 +1067,8 @@ postgres=# SELECT postgres_fdw_disconnect_all();
>        of any length and contain even non-ASCII characters.  However
> when it's passed to and used as <varname>application_name</varname>
>        in a foreign server, note that it will be truncated to less than
> -      <symbol>NAMEDATALEN</symbol> characters and anything other than
> -      printable ASCII characters will be replaced with question
> -      marks (<literal>?</literal>).
> +      <symbol>NAMEDATALEN</symbol> characters and non-ASCII
> characters
> will be
> +      replaced with hexadecimal strings.
>        See <xref linkend="guc-application-name"/> for details.
>       </para>
>
> Same review as for the first patch hunk.

Fixed like above.

> Since the both of you have looked and confirmed that the
> actual behavior matches the new documentation I have not
> done this.

I showed the result again, please see.

> But, have either of you checked that we're really talking about
> replacing everything outside the 7-bit ASCII encodings?
> My reading of the commit referenced in the first email of this
> thread says that it's everything outside of the _printable_
> ASCII encodings, ASCII values outside of the range 32 to 127,
> inclusive.
>
> Please check.  The docs should probably say "printable ASCII",
> or "non-printable ASCII", depending.  I think the meaning
> of "printable ASCII" is widely enough known to be 32-127.
> So "printable" is good enough, right?

For me, "non-printable ASCII" sounds like control characters. So that I used the
sentence "Characters that are not printable ASCII ... are replaced with...".
Please tell me if you have better explanation?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Вложения

Re: [PGdocs] fix description for handling pf non-ASCII characters

От
"Karl O. Pinc"
Дата:
On Tue, 26 Sep 2023 13:40:26 +0000
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

> Your effort is quite helpful for me.

You're welcome.

> Before replying your comments, I thought I should show the difference
> between versions. Regarding old versions (here PG15 was used),
> non-ASCIIs (like Japanese) are replaced with '?'.
>
> ```
> psql (15.4)
> Type "help" for help.
>
> postgres=# SET application_name TO 'あああ';
> SET
> postgres=# SHOW application_name ;
>  application_name
> ------------------
>  ?????????
> (1 row)
> ```
>
> As for the HEAD, as my patch said, non-ASCIIs are replaced
> with hexadecimal representations. (Were my terminologies correct?).
>
> ```
> psql (17devel)
> Type "help" for help.
>
> postgres=# SET application_name TO 'あああ';
> SET
> postgres=# SHOW application_name ;
>            application_name
> --------------------------------------
>  \xe3\x81\x82\xe3\x81\x82\xe3\x81\x82
> (1 row)
> ```

I think you're terminology is correct, but see my
suggestion at bottom.

Never hurts to have output to look at.  I noticed here
and when reading the patch that changed the output --
the patch is operating on bytes, not characters per-se.

> > First, it is now best practice in the PG docs to
> > put a line break at the end of each sentence.
> > At least for the sentences on the lines you change.
> > (No need to update the whole document!)  Please
> > do this in the next version of your patch.  I don't
> > know if this is a requirement for acceptance by
> > a committer, but it won't hurt.
>
> I didn't know that. Could you please tell me if you have a source?

I thought I could find an email but the search is taking
forever.  If I find something I'll let you know.
I could even be mis-remembering, but it's a nice
practice regardless.

There are a number of undocumented conventions.
Another is the use of gender neutral language.

The place to discuss doc conventions/styles would
be the pgsql-docs list.  (If you felt like
asking there.)

You could try submitting another patch to add various
doc conventions to the official docs at
https://www.postgresql.org/docs/current/docguide-style.html
:-)

> Anyway, I put a line break for each sentences for now.

Thanks.

A related thing that's nice to have is to limit the line
length of the documentation source to 80 characters or less.
79 is probably best.  Since the source text around your patch
conforms to this convention you should also.

> IIUC the word " Postgres" cannot be used in the doc.

I think you're right.

> Based on your all comments, I changed as below. How do you think?
>
> ```
>         Characters that are not printable ASCII, like
> <literal>\x03</literal>, are replaced with the
> <productname>PostgreSQL</productname> <link
> linkend="sql-syntax-strings-escape">C-style escaped hexadecimal byte
> value</link>. ```

>
> > Since the both of you have looked and confirmed that the
> > actual behavior matches the new documentation I have not
> > done this.
>
> I showed the result again, please see.

Should the committer be interested, your patch applies cleanly
and the docs build as expected.

Also, based on the comments in the
patch which changed the system's behavior, I believe that
your patch updates all the relevant places in the documentation.

> > But, have either of you checked that we're really talking about
> > replacing everything outside the 7-bit ASCII encodings?
> > My reading of the commit referenced in the first email of this
> > thread says that it's everything outside of the _printable_
> > ASCII encodings, ASCII values outside of the range 32 to 127,
> > inclusive.
> >
> > Please check.  The docs should probably say "printable ASCII",
> > or "non-printable ASCII", depending.  I think the meaning
> > of "printable ASCII" is widely enough known to be 32-127.
> > So "printable" is good enough, right?
>
> For me, "non-printable ASCII" sounds like control characters. So that
> I used the sentence "Characters that are not printable ASCII ... are
> replaced with...". Please tell me if you have better explanation?

Your explanation sounds good to me.

I now think that you should consider another change to your wording.
Instead of starting with "Characters that are not printable ASCII ..."
consider writing "The bytes of the string which are not printable ASCII
...".  Notice above that characters (e.g. あ) generate output for
each non-ASCII byte (e.g. \xe3\x81\x82).  So my thought is that
the docs should be talking about bytes.

For the last hunk you'd change around "anything".  Write:
"... it will be truncated to less than NAMEDATALEN characters and
the bytes of the string which are not printable ASCII characters ...".

Notice that I have also changed "that" to "which" just above.
I _think_ this is better English.  See sense 3 of:
https://en.wiktionary.org/wiki/which
See also the first paragraph of:
https://en.wikipedia.org/wiki/Relative_pronoun

If the comments above move you, send another patch.  Seems to me
we're close to sending this on to a committer.

Regards,

Karl <kop@karlpinc.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein



Re: [PGdocs] fix description for handling pf non-ASCII characters

От
Tom Lane
Дата:
"Karl O. Pinc" <kop@karlpinc.com> writes:
> For the last hunk you'd change around "anything".  Write:
> "... it will be truncated to less than NAMEDATALEN characters and
> the bytes of the string which are not printable ASCII characters ...".

> Notice that I have also changed "that" to "which" just above.  
> I _think_ this is better English.

No, I'm pretty sure you're mistaken.  It's been a long time since
high school English, but the way I think this works is that "that"
introduces a restrictive clause, which narrows the scope of what
you are saying.  That is, you say "that" when you want to talk
about only the bytes of the string that aren't ASCII.  But "which"
introduces a non-restrictive clause that adds information or
commentary.  If you say "bytes of the string which are not ASCII",
you are effectively making a side assertion that no byte of the
string is ASCII.  Which is not the meaning you want here.

A smell test that works for native speakers (not sure how helpful
it is for others) is: if the sentence would read well with commas
or parens added before and after the clause, then it's probably
non-restrictive and should use "which".  If it looks wrong that way
then it's a restrictive clause and should use "that".

            regards, tom lane



Re: [PGdocs] fix description for handling pf non-ASCII characters

От
"Karl O. Pinc"
Дата:
Sep 26, 2023 1:10:55 PM Tom Lane <tgl@sss.pgh.pa.us>:

> "Karl O. Pinc" <kop@karlpinc.com> writes:
>> For the last hunk you'd change around "anything".  Write:
>> "... it will be truncated to less than NAMEDATALEN characters and
>> the bytes of the string which are not printable ASCII characters ...".
>
>> Notice that I have also changed "that" to "which" just above. 
>> I _think_ this is better English.
>
> No, I'm pretty sure you're mistaken.  It's been a long time since
> high school English, but the way I think this works is that "that"
> introduces a restrictive clause, which narrows the scope of what
> you are saying.  That is, you say "that" when you want to talk
> about only the bytes of the string that aren't ASCII.  But "which"
> introduces a non-restrictive clause that adds information or
> commentary.  If you say "bytes of the string which are not ASCII",
> you are effectively making a side assertion that no byte of the
> string is ASCII.  Which is not the meaning you want here.

Makes sense to me.  "That" it is.

Thanks for the help. I never would have figured that out.




Re: [PGdocs] fix description for handling pf non-ASCII characters

От
Peter Smith
Дата:
Hi Kuroda-san.

Here are my review comments for your v3 patch.

TBH, I felt the new text descriptions deviated a bit too much from the
originals. IMO only quite a small tweak was needed, so my suggested
text in the comments below reflects that.

======
Commit message.

1.
missing description

======
src/sgml/config.sgml

2. application_name:

-        Only printable ASCII characters may be used in the
-        <varname>application_name</varname> value. Other characters will be
-        replaced with question marks (<literal>?</literal>).
+        Characters that are not printable ASCII, like <literal>\x03</literal>,
+        are replaced with the <productname>PostgreSQL</productname>
+        <link linkend="sql-syntax-strings-escape">C-style escaped
hexadecimal byte value</link>.

BEFORE
Other characters will be replaced with question marks (<literal>?</literal>).

SUGGESTION
Other characters will be replaced with <link
linkend="sql-syntax-strings-escape">C-style escaped hexadecimal byte
values</link>.

~~~

3. cluster_name:

-        build). Only printable ASCII characters may be used in the
-        <varname>cluster_name</varname> value. Other characters will be
-        replaced with question marks (<literal>?</literal>).  No name is shown
-        if this parameter is set to the empty string
<literal>''</literal> (which is
-        the default). This parameter can only be set at server start.
+        build).
+        Characters that are not printable ASCII, like <literal>\x03</literal>,
+        are replaced with the <productname>PostgreSQL</productname>
+        <link linkend="sql-syntax-strings-escape">C-style escaped
hexadecimal byte value</link>.
+        No name is shown if this parameter is set to the empty string
+        <literal>''</literal> (which is the default). This parameter can only
+        be set at server start.

<same as previous review comment #2>

======
src/sgml/postgres-fdw.sgml

4.
      <para>
       <varname>postgres_fdw.application_name</varname> can be any string
-      of any length and contain even non-ASCII characters.  However when
-      it's passed to and used as <varname>application_name</varname>
+      of any length and contain even characters that are not printable ASCII.
+      However when it's passed to and used as
<varname>application_name</varname>
       in a foreign server, note that it will be truncated to less than
       <symbol>NAMEDATALEN</symbol> characters and anything other than
-      printable ASCII characters will be replaced with question
-      marks (<literal>?</literal>).
+      printable ASCII characters are replaced with the
<productname>PostgreSQL</productname>
+      <link linkend="sql-syntax-strings-escape">C-style escaped
hexadecimal byte value</link>.
       See <xref linkend="guc-application-name"/> for details.
      </para>

~

AFAICT the first change wasn't necessary.

~

As for the 2nd change:

BEFORE
... and anything other than printable ASCII characters will be
replaced with question marks (<literal>?</literal>).

SUGGESTION
... and anything other than printable ASCII characters will be
replaced with <link linkend="sql-syntax-strings-escape">C-style
escaped hexadecimal byte values</link>.

======
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: [PGdocs] fix description for handling pf non-ASCII characters

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Karl,

Thank you for reviewing!

> A related thing that's nice to have is to limit the line
> length of the documentation source to 80 characters or less.
> 79 is probably best.  Since the source text around your patch
> conforms to this convention you should also.

IIUC it is not hard limit, but I followed this.

> Should the committer be interested, your patch applies cleanly
> and the docs build as expected.

Yeah, but cfbot accepted previous version. Did you have anything in your mind?

> Also, based on the comments in the
> patch which changed the system's behavior, I believe that
> your patch updates all the relevant places in the documentation.

Thanks. Actually, I think it should be backpatched to PG16 because the commit was
done last year. I will make the patch for it after deciding the explanation.

> 
> I now think that you should consider another change to your wording.
> Instead of starting with "Characters that are not printable ASCII ..."
> consider writing "The bytes of the string which are not printable ASCII
> ...".  Notice above that characters (e.g. あ) generate output for
> each non-ASCII byte (e.g. \xe3\x81\x82).  So my thought is that
> the docs should be talking about bytes.
> 
> For the last hunk you'd change around "anything".  Write:
> "... it will be truncated to less than NAMEDATALEN characters and
> the bytes of the string which are not printable ASCII characters ...".
> 

Hmm, what you said looked right. But as Peter pointed out [1], the fix seems too
much. So I attached three version of patches. How do you think?
For me, type C is best.

A. A patch which completely follows your comments. The name is "v3-0001-...patch".
   Cfbot tests it.
B. A patch which completely follows Peter's comments [1]. The name is "Peter_v3-....txt".
C. A patch which follows both comments. Based on b, but some comments
   (Don't use the future tense, "Other characters"->"The bytes of other characters"...)
   were picked. The name is "Both_v3-....txt".

[1]: https://www.postgresql.org/message-id/CAHut%2BPvEbKC8ABA_daX-XPNOTFzuAmHGhjPj%3DtPZYQskRHECOg%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Вложения

RE: [PGdocs] fix description for handling pf non-ASCII characters

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Peter,

Thank you for reviewing!

> 
> TBH, I felt the new text descriptions deviated a bit too much from the
> originals. IMO only quite a small tweak was needed, so my suggested
> text in the comments below reflects that.

Good point, my patch may be too much.

> Commit message.
> 
> 1.
> missing description

Added.
If we should use only printable ascii as a commit message, I can use '\x03'
instead of 'あああ'.

> src/sgml/config.sgml
> 
> 2. application_name:
> 
> -        Only printable ASCII characters may be used in the
> -        <varname>application_name</varname> value. Other characters will
> be
> -        replaced with question marks (<literal>?</literal>).
> +        Characters that are not printable ASCII, like <literal>\x03</literal>,
> +        are replaced with the <productname>PostgreSQL</productname>
> +        <link linkend="sql-syntax-strings-escape">C-style escaped
> hexadecimal byte value</link>.
> 
> BEFORE
> Other characters will be replaced with question marks (<literal>?</literal>).
> 
> SUGGESTION
> Other characters will be replaced with <link
> linkend="sql-syntax-strings-escape">C-style escaped hexadecimal byte
> values</link>.
> 
> ~~~
> 
> 3. cluster_name:
> 
> -        build). Only printable ASCII characters may be used in the
> -        <varname>cluster_name</varname> value. Other characters will be
> -        replaced with question marks (<literal>?</literal>).  No name is
> shown
> -        if this parameter is set to the empty string
> <literal>''</literal> (which is
> -        the default). This parameter can only be set at server start.
> +        build).
> +        Characters that are not printable ASCII, like <literal>\x03</literal>,
> +        are replaced with the <productname>PostgreSQL</productname>
> +        <link linkend="sql-syntax-strings-escape">C-style escaped
> hexadecimal byte value</link>.
> +        No name is shown if this parameter is set to the empty string
> +        <literal>''</literal> (which is the default). This parameter can only
> +        be set at server start.
> 
> <same as previous review comment #2>
> 
> ======
> src/sgml/postgres-fdw.sgml
> 
> 4.
>       <para>
>        <varname>postgres_fdw.application_name</varname> can be any
> string
> -      of any length and contain even non-ASCII characters.  However when
> -      it's passed to and used as <varname>application_name</varname>
> +      of any length and contain even characters that are not printable ASCII.
> +      However when it's passed to and used as
> <varname>application_name</varname>
>        in a foreign server, note that it will be truncated to less than
>        <symbol>NAMEDATALEN</symbol> characters and anything other than
> -      printable ASCII characters will be replaced with question
> -      marks (<literal>?</literal>).
> +      printable ASCII characters are replaced with the
> <productname>PostgreSQL</productname>
> +      <link linkend="sql-syntax-strings-escape">C-style escaped
> hexadecimal byte value</link>.
>        See <xref linkend="guc-application-name"/> for details.
>       </para>
> 
> ~
> 
> AFAICT the first change wasn't necessary.
> 
> ~
> 
> As for the 2nd change:
> 
> BEFORE
> ... and anything other than printable ASCII characters will be
> replaced with question marks (<literal>?</literal>).
> 
> SUGGESTION
> ... and anything other than printable ASCII characters will be
> replaced with <link linkend="sql-syntax-strings-escape">C-style
> escaped hexadecimal byte values</link>.

They seem good, but they conflict with Karl's comments.
I made three patches based on comments [1], could you check?

[1]:
https://www.postgresql.org/message-id/TYAPR01MB58663EB061888B2715A39217F5C2A%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

RE: [PGdocs] fix description for handling pf non-ASCII characters

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Tom,

> No, I'm pretty sure you're mistaken.  It's been a long time since
> high school English, but the way I think this works is that "that"
> introduces a restrictive clause, which narrows the scope of what
> you are saying.  That is, you say "that" when you want to talk
> about only the bytes of the string that aren't ASCII.  But "which"
> introduces a non-restrictive clause that adds information or
> commentary.  If you say "bytes of the string which are not ASCII",
> you are effectively making a side assertion that no byte of the
> string is ASCII.  Which is not the meaning you want here.
>
> A smell test that works for native speakers (not sure how helpful
> it is for others) is: if the sentence would read well with commas
> or parens added before and after the clause, then it's probably
> non-restrictive and should use "which".  If it looks wrong that way
> then it's a restrictive clause and should use "that".

Thanks for giving your opinion. The suggestion is quite helpful for me,
non-native speaker. If you check my patch [1] I'm very happy.

[1]:
https://www.postgresql.org/message-id/TYAPR01MB58663EB061888B2715A39217F5C2A%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED




Re: [PGdocs] fix description for handling pf non-ASCII characters

От
"Karl O. Pinc"
Дата:
On Wed, 27 Sep 2023 12:58:54 +0000
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

> > Should the committer be interested, your patch applies cleanly
> > and the docs build as expected.  
> 
> Yeah, but cfbot accepted previous version. Did you have anything in
> your mind?

No.  I'm letting the committer know everything I've checked
so that they can decide what they want to check.

> Hmm, what you said looked right. But as Peter pointed out [1], the
> fix seems too much. So I attached three version of patches. How do
> you think? For me, type C is best.
> 
> A. A patch which completely follows your comments. The name is
> "v3-0001-...patch". Cfbot tests it.
> B. A patch which completely follows Peter's comments [1]. The name is
> "Peter_v3-....txt". 
> C. A patch which follows both comments. Based on
> b, but some comments (Don't use the future tense, "Other
> characters"->"The bytes of other characters"...) were picked. The
> name is "Both_v3-....txt".

I also like C.  Fewer words is better.  So long
as nothing is left unsaid fewer words make for clarity.

However, in the last hunk, "of other than" does not read well.
Instead of writing
"and the bytes of other than printable ASCII characters"
you want "and the bytes that are not printable ASCII characters".
That would be my suggestion.  

Regards,

Karl <kop@karlpinc.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein



Re: [PGdocs] fix description for handling pf non-ASCII characters

От
Peter Smith
Дата:
On Wed, Sep 27, 2023 at 11:59 PM Karl O. Pinc <kop@karlpinc.com> wrote:
>
> On Wed, 27 Sep 2023 12:58:54 +0000
> "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:
>
> > > Should the committer be interested, your patch applies cleanly
> > > and the docs build as expected.
> >
> > Yeah, but cfbot accepted previous version. Did you have anything in
> > your mind?
>
> No.  I'm letting the committer know everything I've checked
> so that they can decide what they want to check.
>
> > Hmm, what you said looked right. But as Peter pointed out [1], the
> > fix seems too much. So I attached three version of patches. How do
> > you think? For me, type C is best.
> >
> > A. A patch which completely follows your comments. The name is
> > "v3-0001-...patch". Cfbot tests it.
> > B. A patch which completely follows Peter's comments [1]. The name is
> > "Peter_v3-....txt".
> > C. A patch which follows both comments. Based on
> > b, but some comments (Don't use the future tense, "Other
> > characters"->"The bytes of other characters"...) were picked. The
> > name is "Both_v3-....txt".
>
> I also like C.  Fewer words is better.  So long
> as nothing is left unsaid fewer words make for clarity.
>
> However, in the last hunk, "of other than" does not read well.
> Instead of writing
> "and the bytes of other than printable ASCII characters"
> you want "and the bytes that are not printable ASCII characters".
> That would be my suggestion.
>

I also prefer Option C, but...

~~~

+        <varname>application_name</varname> value.
+        The bytes of other characters are replaced with
+        <link linkend="sql-syntax-strings-escape">C-style escaped hexadecimal
+        byte values</link>.

V

+        <varname>cluster_name</varname> value.
+        The bytes of other characters are replaced with
+        <link linkend="sql-syntax-strings-escape">C-style escaped hexadecimal
+        byte values</link>.

V

+      <symbol>NAMEDATALEN</symbol> characters and the bytes of other than
+      printable ASCII characters are replaced with <link
+      linkend="sql-syntax-strings-escape">C-style escaped hexadecimal byte
+      values</link>.


IIUC all of these 3 places can have exactly the same wording change
(e.g. like Karl's last suggestion [1]).

SUGGESTION
Any bytes that are not printable ASCII characters are replaced with
<link linkend="sql-syntax-strings-escape">C-style escaped hexadecimal
byte values</link>.

======
[1] https://www.postgresql.org/message-id/20230927085924.4198c3d2%40slate.karlpinc.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: [PGdocs] fix description for handling pf non-ASCII characters

От
"Karl O. Pinc"
Дата:
On Thu, 28 Sep 2023 09:49:03 +1000
Peter Smith <smithpb2250@gmail.com> wrote:

> On Wed, Sep 27, 2023 at 11:59 PM Karl O. Pinc <kop@karlpinc.com>
> wrote:
> >
> > On Wed, 27 Sep 2023 12:58:54 +0000
> > "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:
> >
> > > > Should the committer be interested, your patch applies cleanly
> > > > and the docs build as expected.
> > >
> > > Yeah, but cfbot accepted previous version. Did you have anything
> > > in your mind?
> >
> > No.  I'm letting the committer know everything I've checked
> > so that they can decide what they want to check.
> >
> > > Hmm, what you said looked right. But as Peter pointed out [1], the
> > > fix seems too much. So I attached three version of patches. How do
> > > you think? For me, type C is best.
> > >
> > > A. A patch which completely follows your comments. The name is
> > > "v3-0001-...patch". Cfbot tests it.
> > > B. A patch which completely follows Peter's comments [1]. The
> > > name is "Peter_v3-....txt".
> > > C. A patch which follows both comments. Based on
> > > b, but some comments (Don't use the future tense, "Other
> > > characters"->"The bytes of other characters"...) were picked. The
> > > name is "Both_v3-....txt".
> >
> > I also like C.  Fewer words is better.  So long
> > as nothing is left unsaid fewer words make for clarity.
> >
> > However, in the last hunk, "of other than" does not read well.
> > Instead of writing
> > "and the bytes of other than printable ASCII characters"
> > you want "and the bytes that are not printable ASCII characters".
> > That would be my suggestion.
> >
>
> I also prefer Option C, but...
>
> ~~~
>
> +        <varname>application_name</varname> value.
> +        The bytes of other characters are replaced with
> +        <link linkend="sql-syntax-strings-escape">C-style escaped
> hexadecimal
> +        byte values</link>.
>
> V
>
> +        <varname>cluster_name</varname> value.
> +        The bytes of other characters are replaced with
> +        <link linkend="sql-syntax-strings-escape">C-style escaped
> hexadecimal
> +        byte values</link>.
>
> V
>
> +      <symbol>NAMEDATALEN</symbol> characters and the bytes of other
> than
> +      printable ASCII characters are replaced with <link
> +      linkend="sql-syntax-strings-escape">C-style escaped
> hexadecimal byte
> +      values</link>.
>
>
> IIUC all of these 3 places can have exactly the same wording change
> (e.g. like Karl's last suggestion [1]).
>
> SUGGESTION
> Any bytes that are not printable ASCII characters are replaced with
> <link linkend="sql-syntax-strings-escape">C-style escaped hexadecimal
> byte values</link>.

I don't see the utility in having exactly the same phrase everywhere,
especially since the last hunk is modifying the end of a long
sentence.  (Apologies if I'm mis-reading what Peter wrote above.)

I like short sentences.  So I prefer "The bytes of other characters"
rather than "Any bytes that are not printable ASCII characters"
for the first 2 hunks.  In context I don't see the need to repeat
the whole "printable ASCII characters" part that appears in the
preceding sentence of both hunks.  "Other" is clear, IMHO.

But because I like short sentences I now think that it's a good
idea to break the long sentence of the last hunk into two.
Add a period and use the Peter's SUGGESTION above as the
text for the second sentence.

Is this desireable?

Regards,

Karl <kop@karlpinc.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein

P.S. Hayato, it is good practice to cc everybody who has
replied to a thread.  At least I think that's what I see,
it's not just people being lazy with reply-all. So I'm
adding Tom Lane back to the thread.  He can tell us otherwise
if I'm wrong to add him back.



Re: [PGdocs] fix description for handling pf non-ASCII characters

От
Peter Smith
Дата:
On Thu, Sep 28, 2023 at 10:30 AM Karl O. Pinc <kop@karlpinc.com> wrote:
>
> On Thu, 28 Sep 2023 09:49:03 +1000
> Peter Smith <smithpb2250@gmail.com> wrote:
>
> > On Wed, Sep 27, 2023 at 11:59 PM Karl O. Pinc <kop@karlpinc.com>
> > wrote:
> > >
> > > On Wed, 27 Sep 2023 12:58:54 +0000
> > > "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:
> > >
> > > > > Should the committer be interested, your patch applies cleanly
> > > > > and the docs build as expected.
> > > >
> > > > Yeah, but cfbot accepted previous version. Did you have anything
> > > > in your mind?
> > >
> > > No.  I'm letting the committer know everything I've checked
> > > so that they can decide what they want to check.
> > >
> > > > Hmm, what you said looked right. But as Peter pointed out [1], the
> > > > fix seems too much. So I attached three version of patches. How do
> > > > you think? For me, type C is best.
> > > >
> > > > A. A patch which completely follows your comments. The name is
> > > > "v3-0001-...patch". Cfbot tests it.
> > > > B. A patch which completely follows Peter's comments [1]. The
> > > > name is "Peter_v3-....txt".
> > > > C. A patch which follows both comments. Based on
> > > > b, but some comments (Don't use the future tense, "Other
> > > > characters"->"The bytes of other characters"...) were picked. The
> > > > name is "Both_v3-....txt".
> > >
> > > I also like C.  Fewer words is better.  So long
> > > as nothing is left unsaid fewer words make for clarity.
> > >
> > > However, in the last hunk, "of other than" does not read well.
> > > Instead of writing
> > > "and the bytes of other than printable ASCII characters"
> > > you want "and the bytes that are not printable ASCII characters".
> > > That would be my suggestion.
> > >
> >
> > I also prefer Option C, but...
> >
> > ~~~
> >
> > +        <varname>application_name</varname> value.
> > +        The bytes of other characters are replaced with
> > +        <link linkend="sql-syntax-strings-escape">C-style escaped
> > hexadecimal
> > +        byte values</link>.
> >
> > V
> >
> > +        <varname>cluster_name</varname> value.
> > +        The bytes of other characters are replaced with
> > +        <link linkend="sql-syntax-strings-escape">C-style escaped
> > hexadecimal
> > +        byte values</link>.
> >
> > V
> >
> > +      <symbol>NAMEDATALEN</symbol> characters and the bytes of other
> > than
> > +      printable ASCII characters are replaced with <link
> > +      linkend="sql-syntax-strings-escape">C-style escaped
> > hexadecimal byte
> > +      values</link>.
> >
> >
> > IIUC all of these 3 places can have exactly the same wording change
> > (e.g. like Karl's last suggestion [1]).
> >
> > SUGGESTION
> > Any bytes that are not printable ASCII characters are replaced with
> > <link linkend="sql-syntax-strings-escape">C-style escaped hexadecimal
> > byte values</link>.
>
> I don't see the utility in having exactly the same phrase everywhere,
> especially since the last hunk is modifying the end of a long
> sentence.  (Apologies if I'm mis-reading what Peter wrote above.)
>
> I like short sentences.  So I prefer "The bytes of other characters"
> rather than "Any bytes that are not printable ASCII characters"
> for the first 2 hunks.  In context I don't see the need to repeat
> the whole "printable ASCII characters" part that appears in the
> preceding sentence of both hunks.  "Other" is clear, IMHO.
>

I had in mind something like a SHIFT-JIS encoding where a single
"character" may include some trail bytes that happen to be in the
ASCII printable range. AFAIK because the new logic is processing
bytes, not characters, I thought the end result could be a mix of
escaped and unescaped bytes for the single SJIS character. In that
context, I felt "The bytes of other characters" was not quite
accurate.

But now looking at PostgreSQL-supported character sets [1] I saw SJIS
is not supported anyhow. Unfortunately, I am not familiar enough with
other encodings to know if there is still a chance of similar
printable ASCII trail bytes so I am fine with whatever wording is
chosen.

> But because I like short sentences I now think that it's a good
> idea to break the long sentence of the last hunk into two.
> Add a period and use the Peter's SUGGESTION above as the
> text for the second sentence.
>
> Is this desireable?
>

+1.

======
[1] https://www.postgresql.org/docs/current/multibyte.html

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: [PGdocs] fix description for handling pf non-ASCII characters

От
Tom Lane
Дата:
Peter Smith <smithpb2250@gmail.com> writes:
> I had in mind something like a SHIFT-JIS encoding where a single
> "character" may include some trail bytes that happen to be in the
> ASCII printable range. AFAIK because the new logic is processing
> bytes, not characters, I thought the end result could be a mix of
> escaped and unescaped bytes for the single SJIS character.

It will not, because ...

> But now looking at PostgreSQL-supported character sets [1] I saw SJIS
> is not supported anyhow. Unfortunately, I am not familiar enough with
> other encodings to know if there is still a chance of similar
> printable ASCII trail bytes so I am fine with whatever wording is
> chosen.

... trailing bytes that could be mistaken for ASCII are precisely
the property that causes us to reject an encoding as not backend-safe.
So this code doesn't need to consider that hazard, and processing the
string byte-by-byte is perfectly OK.

I'd be inclined to keep the text as simple as possible and not focus on
the distinction between bytes and characters.

            regards, tom lane



Re: [PGdocs] fix description for handling pf non-ASCII characters

От
Peter Smith
Дата:
On Thu, Sep 28, 2023 at 11:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> ... trailing bytes that could be mistaken for ASCII are precisely
> the property that causes us to reject an encoding as not backend-safe.

Oh, that is good to know. Thanks for the information.

======
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: [PGdocs] fix description for handling pf non-ASCII characters

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Karl,

Thank you for giving comments! PSA new version.
I attached two patches - one is for HEAD, and another one is for REL_16_STABLE
branch. As shown below, PG16 has the same behavior.

```
psql (16beta3)
Type "help" for help.

postgres=# SET application_name TO 'あああ';
SET
postgres=# SHOW application_name ;
           application_name           
--------------------------------------
 \xe3\x81\x82\xe3\x81\x82\xe3\x81\x82
(1 row)
```


> > > > A. A patch which completely follows your comments. The name is
> > > > "v3-0001-...patch". Cfbot tests it.
> > > > B. A patch which completely follows Peter's comments [1]. The
> > > > name is "Peter_v3-....txt".
> > > > C. A patch which follows both comments. Based on
> > > > b, but some comments (Don't use the future tense, "Other
> > > > characters"->"The bytes of other characters"...) were picked. The
> > > > name is "Both_v3-....txt".
> > >
> > > I also like C.  Fewer words is better.  So long
> > > as nothing is left unsaid fewer words make for clarity.

Okay, basically I used C.

> > >
> > > However, in the last hunk, "of other than" does not read well.
> > > Instead of writing
> > > "and the bytes of other than printable ASCII characters"
> > > you want "and the bytes that are not printable ASCII characters".
> > > That would be my suggestion.
> > >
> >
> > I also prefer Option C, but...
> >
> > ~~~
> >
> > +        <varname>application_name</varname> value.
> > +        The bytes of other characters are replaced with
> > +        <link linkend="sql-syntax-strings-escape">C-style escaped
> > hexadecimal
> > +        byte values</link>.
> >
> > V
> >
> > +        <varname>cluster_name</varname> value.
> > +        The bytes of other characters are replaced with
> > +        <link linkend="sql-syntax-strings-escape">C-style escaped
> > hexadecimal
> > +        byte values</link>.
> >
> > V
> >
> > +      <symbol>NAMEDATALEN</symbol> characters and the bytes of
> other
> > than
> > +      printable ASCII characters are replaced with <link
> > +      linkend="sql-syntax-strings-escape">C-style escaped
> > hexadecimal byte
> > +      values</link>.
> >
> >
> > IIUC all of these 3 places can have exactly the same wording change
> > (e.g. like Karl's last suggestion [1]).
> >
> > SUGGESTION
> > Any bytes that are not printable ASCII characters are replaced with
> > <link linkend="sql-syntax-strings-escape">C-style escaped hexadecimal
> > byte values</link>.
> 
> I don't see the utility in having exactly the same phrase everywhere,
> especially since the last hunk is modifying the end of a long
> sentence.  (Apologies if I'm mis-reading what Peter wrote above.)

Right, here we cannot use exactly the same sentence.

> 
> I like short sentences.  So I prefer "The bytes of other characters"
> rather than "Any bytes that are not printable ASCII characters"
> for the first 2 hunks.  In context I don't see the need to repeat
> the whole "printable ASCII characters" part that appears in the
> preceding sentence of both hunks.  "Other" is clear, IMHO.

Based on the suggestion [1], I removed the word "byte".
(Sorry, but a comment from senior members has higher priority)

> 
> But because I like short sentences I now think that it's a good
> idea to break the long sentence of the last hunk into two.
> Add a period and use the Peter's SUGGESTION above as the
> text for the second sentence.

Right, the sentence is separated into two.

[1]: https://www.postgresql.org/message-id/803569.1695863971%40sss.pgh.pa.us

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Вложения

RE: [PGdocs] fix description for handling pf non-ASCII characters

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Peter,

Thank you for reviewing!

> > > > > A. A patch which completely follows your comments. The name is
> > > > > "v3-0001-...patch". Cfbot tests it.
> > > > > B. A patch which completely follows Peter's comments [1]. The
> > > > > name is "Peter_v3-....txt".
> > > > > C. A patch which follows both comments. Based on
> > > > > b, but some comments (Don't use the future tense, "Other
> > > > > characters"->"The bytes of other characters"...) were picked. The
> > > > > name is "Both_v3-....txt".
> > > >
> > > > I also like C.  Fewer words is better.  So long
> > > > as nothing is left unsaid fewer words make for clarity.
> > > >
> > > > However, in the last hunk, "of other than" does not read well.
> > > > Instead of writing
> > > > "and the bytes of other than printable ASCII characters"
> > > > you want "and the bytes that are not printable ASCII characters".
> > > > That would be my suggestion.
> > > >
> > >
> > > I also prefer Option C, but...

Okay, C was chosen.


> > >
> > > ~~~
> > >
> > > +        <varname>application_name</varname> value.
> > > +        The bytes of other characters are replaced with
> > > +        <link linkend="sql-syntax-strings-escape">C-style escaped
> > > hexadecimal
> > > +        byte values</link>.
> > >
> > > V
> > >
> > > +        <varname>cluster_name</varname> value.
> > > +        The bytes of other characters are replaced with
> > > +        <link linkend="sql-syntax-strings-escape">C-style escaped
> > > hexadecimal
> > > +        byte values</link>.
> > >
> > > V
> > >
> > > +      <symbol>NAMEDATALEN</symbol> characters and the bytes of
> other
> > > than
> > > +      printable ASCII characters are replaced with <link
> > > +      linkend="sql-syntax-strings-escape">C-style escaped
> > > hexadecimal byte
> > > +      values</link>.
> > >
> > >
> > > IIUC all of these 3 places can have exactly the same wording change
> > > (e.g. like Karl's last suggestion [1]).
> > >
> > > SUGGESTION
> > > Any bytes that are not printable ASCII characters are replaced with
> > > <link linkend="sql-syntax-strings-escape">C-style escaped hexadecimal
> > > byte values</link>.

Hmm, I felt that using exactly the same wording seemed strange here, so similar
words were used. Also, based on the comment [1], "byte" was removed.

> 
> I had in mind something like a SHIFT-JIS encoding where a single
> "character" may include some trail bytes that happen to be in the
> ASCII printable range. AFAIK because the new logic is processing
> bytes, not characters, I thought the end result could be a mix of
> escaped and unescaped bytes for the single SJIS character. In that
> context, I felt "The bytes of other characters" was not quite
> accurate.
> 
> But now looking at PostgreSQL-supported character sets [1] I saw SJIS
> is not supported anyhow. Unfortunately, I am not familiar enough with
> other encodings to know if there is still a chance of similar
> printable ASCII trail bytes so I am fine with whatever wording is
> chosen.

Based on the discussion [1], I did not handle the part.

> 
> > But because I like short sentences I now think that it's a good
> > idea to break the long sentence of the last hunk into two.
> > Add a period and use the Peter's SUGGESTION above as the
> > text for the second sentence.
> >
> > Is this desireable?
> >
> 
> +1.

OK, divided.

New patch is available in [2].

[1]: https://www.postgresql.org/message-id/803569.1695863971%40sss.pgh.pa.us
[2]:
https://www.postgresql.org/message-id/TYAPR01MB5866DD962CA4FC03E338C6BBF5C1A%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: [PGdocs] fix description for handling pf non-ASCII characters

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Tom,

Thank you for giving a comment!
New patch is available in [1].

> I'd be inclined to keep the text as simple as possible and not focus on
> the distinction between bytes and characters.
>

Okay, in the latest version, the word "byte" was removed.

[1]:
https://www.postgresql.org/message-id/TYAPR01MB5866DD962CA4FC03E338C6BBF5C1A%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED




RE: [PGdocs] fix description for handling pf non-ASCII characters

От
"Hayato Kuroda (Fujitsu)"
Дата:
> I attached two patches - one is for HEAD, and another one is for REL_16_STABLE
> branch. As shown below, PG16 has the same behavior.

Hmm, cfbot got angry because it tried to apply both of patches. To avoid it, I repost renamed patch.
(I'm happy if we can specify the target branch of patches)

Sorry for inconvenience.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Вложения

Re: [PGdocs] fix description for handling pf non-ASCII characters

От
Michael Paquier
Дата:
On Thu, Sep 28, 2023 at 03:23:30AM +0000, Hayato Kuroda (Fujitsu) wrote:
> Hmm, cfbot got angry because it tried to apply both of patches. To avoid it, I repost renamed patch.
> (I'm happy if we can specify the target branch of patches)

I was looking at this thread overall, the three v3 flavors of the doc
changes and v4.

-        <varname>application_name</varname> value. Other characters will be
-        replaced with question marks (<literal>?</literal>).
+        <varname>application_name</varname> value.
+        Other characters are replaced with <link
+        linkend="sql-syntax-strings-escape">C-style hexadecimal escapes</link>.

The simplicity of the change in v4 seems like the best approach to me,
so +1 for that (including the mention to "C-style").
--
Michael

Вложения

Re: [PGdocs] fix description for handling pf non-ASCII characters

От
Peter Smith
Дата:
v4 LGTM.

======
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: [PGdocs] fix description for handling pf non-ASCII characters

От
"Karl O. Pinc"
Дата:
On Thu, 28 Sep 2023 12:54:33 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> I was looking at this thread overall, the three v3 flavors of the doc
> changes and v4.
> 
> -        <varname>application_name</varname> value. Other characters
> will be
> -        replaced with question marks (<literal>?</literal>).
> +        <varname>application_name</varname> value.
> +        Other characters are replaced with <link
> +        linkend="sql-syntax-strings-escape">C-style hexadecimal
> escapes</link>.
> 
> The simplicity of the change in v4 seems like the best approach to me,
> so +1 for that (including the mention to "C-style").

I agree with Tom that it's not worth spending anyone's attention
on bytes v.s. characters.

So I'm marking the patch ready for committer.
(I have not tried the version that patches against PGv16.)

Thank you everyone, especially Hayato, for spending time
and making this better.

Regards,

Karl <kop@karlpinc.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein



Re: [PGdocs] fix description for handling pf non-ASCII characters

От
Michael Paquier
Дата:
On Thu, Sep 28, 2023 at 02:23:59PM +1000, Peter Smith wrote:
> v4 LGTM.

Applied v4 down to 16, then.
--
Michael

Вложения

RE: [PGdocs] fix description for handling pf non-ASCII characters

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Michael,

I confirmed your commit. Thanks!
CF entry was closed as "Committed".

Best Regards,
Hayato Kuroda
FUJITSU LIMITED