Обсуждение: Minor documentation error regarding streaming replication protocol

Поиск
Список
Период
Сортировка

Minor documentation error regarding streaming replication protocol

От
Brar Piening
Дата:
While implementing streaming replication client functionality for Npgsql
I stumbled upon a minor documentation error at
https://www.postgresql.org/docs/current/protocol-replication.html

The "content" return value for the TIMELINE_HISTORYcommand is advertised
as bytea while it is in fact raw ASCII bytes.

How did I find out?
Since the value I get doesn't start with a "\x" and contains ascii text,
although I've bytea_outputset to hex, I first thought that the streaming
replication protocol simply doesn't honor bytea_output, but then I
realized that I also get unencoded tabs and newlines which wouldn't be
possible if the value woud be passed through byteaout.

This is certainly a minor problem since the timeline history file only
contains generated strings that are ASCII-only, so just using the
unencoded bytes is actually easier than decoding bytea.
OTOH it did cost me a few hours (writing a bytea decoder and figuring
out why it doesn't work by looking at varlena.c and proving the docs
wrong) so I want to point this out here since it is possibly an
unintended behavior or at least a documentation error.
Also I'm wary of taking dependency on an undocumented implementation
detail that could possibly change at any point.

Regards,
Brar




Re: Minor documentation error regarding streaming replication protocol

От
Brar Piening
Дата:
Brar Piening wrote:
> This is certainly a minor problem

After thinking about it a little more I think that at least the fact
that even the row description message advertises the content as bytea
could be considered as a bug - no?





Re: Minor documentation error regarding streaming replication protocol

От
Bruce Momjian
Дата:
On Sun, Sep 13, 2020 at 10:25:01PM +0200, Brar Piening wrote:
> While implementing streaming replication client functionality for Npgsql
> I stumbled upon a minor documentation error at
> https://www.postgresql.org/docs/current/protocol-replication.html
> 
> The "content" return value for the TIMELINE_HISTORYcommand is advertised
> as bytea while it is in fact raw ASCII bytes.
> 
> How did I find out?
> Since the value I get doesn't start with a "\x" and contains ascii text,
> although I've bytea_outputset to hex, I first thought that the streaming
> replication protocol simply doesn't honor bytea_output, but then I
> realized that I also get unencoded tabs and newlines which wouldn't be
> possible if the value woud be passed through byteaout.
> 
> This is certainly a minor problem since the timeline history file only
> contains generated strings that are ASCII-only, so just using the
> unencoded bytes is actually easier than decoding bytea.
> OTOH it did cost me a few hours (writing a bytea decoder and figuring
> out why it doesn't work by looking at varlena.c and proving the docs
> wrong) so I want to point this out here since it is possibly an
> unintended behavior or at least a documentation error.
> Also I'm wary of taking dependency on an undocumented implementation
> detail that could possibly change at any point.

I have looked at this.  It seems SendTimeLineHistory() is sending raw
bytes from the history file, with no encoding conversion, and
ReceiveXlogStream() is receiving it, again assuming it is just plain
text.  I am not sure we really have an SQL data type where we do this. 
BYTEA doesn't do encoding conversion, but does backslash procesing, and
TEXT does encoding conversion.

I suppose we either have to document this as BYTEA with no backslash
processing, or TEXT with no encoding conversion --- I think I prefer the
later.

Attached is a patch to update the docs, and the data type passed by
SendTimeLineHistory().  Does this look safe to everyone?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Вложения

Re: Minor documentation error regarding streaming replication protocol

От
Michael Paquier
Дата:
On Thu, Oct 08, 2020 at 04:23:06PM -0400, Bruce Momjian wrote:
> I have looked at this.  It seems SendTimeLineHistory() is sending raw
> bytes from the history file, with no encoding conversion, and
> ReceiveXlogStream() is receiving it, again assuming it is just plain
> text.  I am not sure we really have an SQL data type where we do this.
> BYTEA doesn't do encoding conversion, but does backslash procesing, and
> TEXT does encoding conversion.
>
> I suppose we either have to document this as BYTEA with no backslash
> processing, or TEXT with no encoding conversion --- I think I prefer the
> later.

As StartupXLOG() tells, The timeline history file can include as
reason the recovery target name which may not be made just of ASCII
characters as that's the value specified in pg_create_restore_point by
the user, so bytea is correct, no?
--
Michael

Вложения

Re: Minor documentation error regarding streaming replication protocol

От
Bruce Momjian
Дата:
On Fri, Oct  9, 2020 at 08:52:50AM +0900, Michael Paquier wrote:
> On Thu, Oct 08, 2020 at 04:23:06PM -0400, Bruce Momjian wrote:
> > I have looked at this.  It seems SendTimeLineHistory() is sending raw
> > bytes from the history file, with no encoding conversion, and
> > ReceiveXlogStream() is receiving it, again assuming it is just plain
> > text.  I am not sure we really have an SQL data type where we do this. 
> > BYTEA doesn't do encoding conversion, but does backslash procesing, and
> > TEXT does encoding conversion.
> > 
> > I suppose we either have to document this as BYTEA with no backslash
> > processing, or TEXT with no encoding conversion --- I think I prefer the
> > later.
> 
> As StartupXLOG() tells, The timeline history file can include as
> reason the recovery target name which may not be made just of ASCII
> characters as that's the value specified in pg_create_restore_point by
> the user, so bytea is correct, no?

Good point.  The reporter was assuming the data would come to the client
in the bytea output format specified by the GUC, e.g. \x..., so that
doesn't happen either.  As I said before, it is more raw bytes, but we
don't have an SQL data type for that.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Minor documentation error regarding streaming replication protocol

От
Bruce Momjian
Дата:
On Thu, Oct  8, 2020 at 11:17:59PM -0400, Bruce Momjian wrote:
> Good point.  The reporter was assuming the data would come to the client
> in the bytea output format specified by the GUC, e.g. \x..., so that
> doesn't happen either.  As I said before, it is more raw bytes, but we
> don't have an SQL data type for that.

I did some more research on this.  It turns out timeline 'content' is
the only BYTEA listed in the protocol docs, even though it just passes C
strings to pq_sendbytes(), just like many other fields like the field
above it, the timeline history filename.  The proper fix is to change
the code to return the timeline history file contents as TEXT instead of
BYTEA.

Patch attached.  I would like to backpatch this to all supported
versions so we are consistent and people don't think different PG
versions use different return values for this.  Is that safe?  Looking
at the uses of this in our code, it seems so.  We aren't doing BYTEA
escaping or TEXT encoding conversion in any of these cases.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Вложения

Re: Minor documentation error regarding streaming replication protocol

От
Tom Lane
Дата:
Bruce Momjian <bruce@momjian.us> writes:
> Patch attached.  I would like to backpatch this to all supported
> versions so we are consistent and people don't think different PG
> versions use different return values for this.  Is that safe?  Looking
> at the uses of this in our code, it seems so.  We aren't doing BYTEA
> escaping or TEXT encoding conversion in any of these cases.

I do not think a back-patch is a good idea.  Yeah, it probably
won't break anything, but by the same token it won't fix anything.

            regards, tom lane



Aw: Re: Minor documentation error regarding streaming replication protocol

От
Brar Piening
Дата:
Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>> Good point. The reporter was assuming the data would come to the client
>> in the bytea output format specified by the GUC, e.g. \x..., so that
>> doesn't happen either. As I said before, it is more raw bytes, but we
>> don't have an SQL data type for that.

> I did some more research on this. It turns out timeline 'content' is
> the only BYTEA listed in the protocol docs, even though it just passes C
> strings to pq_sendbytes(), just like many other fields like the field
> above it, the timeline history filename. The proper fix is to change
> the code to return the timeline history file contents as TEXT instead of
> BYTEA.

In the light of what Michael wrote above, I don't think that this is really enough.

If the timeline history file can contain strings which "may not be made just of ASCII characters" this would probably
makethe client side assume that the content is being sent as TEXT encoded in client_encoding which again isn't true.
 
In the worst case this could lead to nasty decoding bugs on the client side which could even result in security
issues.

Since you probably can't tell in which encoding the aforementioned "recovery target name" was written to the timeline
historyfile, I agree with Michael that BYTEA is probably the sanest way to send this file.
 

IMO the best way out of this is to either really encode the content as BYTEA by passing it through byteaout() and by
thatescaping characters <0x20 and >0x7e, or to document that the file is being sent "as raw bytes that can be read as
'byteaEscape Format' by parsers compatible with byteain()" (this works because byteain() doesn't check whether bytes
<0x20or >0x7e are actually escaped).
 

Again, reading the raw bytes, either via byteain() or just as raw bytes, isn't really a problem and I don't want to
bringyou into a situation where the cure is worse than the disease.
 



Re: Aw: Re: Minor documentation error regarding streaming replication protocol

От
Tom Lane
Дата:
Brar Piening <Brar@gmx.de> writes:
> Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>> I did some more research on this. It turns out timeline 'content' is
>> the only BYTEA listed in the protocol docs, even though it just passes C
>> strings to pq_sendbytes(), just like many other fields like the field
>> above it, the timeline history filename. The proper fix is to change
>> the code to return the timeline history file contents as TEXT instead of
>> BYTEA.

> If the timeline history file can contain strings which "may not be made just of ASCII characters" this would probably
makethe client side assume that the content is being sent as TEXT encoded in client_encoding which again isn't true. 

I think this is overthinking the problem, TBH.  Yeah, perhaps somebody
put non-ASCII comments into that file, but they'd probably be expecting
the exact same encoding they used there to come out the other end.

We should certainly *not* apply byteaout, as that would break cases that
work fine today; and if we do not do that, it is quite misleading to
suggest that the data is given in bytea format.

I don't really see why our only documentation choices are "text" or
"bytea".  Can't we just write "(raw file contents)" or the like?

            regards, tom lane



Re: Aw: Re: Minor documentation error regarding streaming replication protocol

От
Fujii Masao
Дата:

On 2020/10/15 23:03, Tom Lane wrote:
> Brar Piening <Brar@gmx.de> writes:
>> Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>>> I did some more research on this. It turns out timeline 'content' is
>>> the only BYTEA listed in the protocol docs, even though it just passes C
>>> strings to pq_sendbytes(), just like many other fields like the field
>>> above it, the timeline history filename. The proper fix is to change
>>> the code to return the timeline history file contents as TEXT instead of
>>> BYTEA.
> 
>> If the timeline history file can contain strings which "may not be made just of ASCII characters" this would
probablymake the client side assume that the content is being sent as TEXT encoded in client_encoding which again isn't
true.
> 
> I think this is overthinking the problem, TBH.  Yeah, perhaps somebody
> put non-ASCII comments into that file, but they'd probably be expecting
> the exact same encoding they used there to come out the other end.
> 
> We should certainly *not* apply byteaout, as that would break cases that
> work fine today;

+1

> and if we do not do that, it is quite misleading to
> suggest that the data is given in bytea format.
> 
> I don't really see why our only documentation choices are "text" or
> "bytea".  Can't we just write "(raw file contents)" or the like?

+1

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Aw: Re: Minor documentation error regarding streaming replication protocol

От
Tom Lane
Дата:
Bruce Momjian <bruce@momjian.us> writes:
> On Thu, Oct 15, 2020 at 10:03:46AM -0400, Tom Lane wrote:
>> I don't really see why our only documentation choices are "text" or
>> "bytea".  Can't we just write "(raw file contents)" or the like?

> Agreed.  The reason they are those labels is that the C code has to
> label the output fields, so it can only use SQL types.  There are many
> protocol fields mislabeled in this way, not just timeline history.

Ah, right.  Well, let's label it as text and then in the description
of the message, note that the field contains the raw file contents,
whose encoding is unknown to the server.

            regards, tom lane



Aw: Re: Re: Minor documentation error regarding streaming replication protocol

От
Brar Piening
Дата:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Well, let's label it as text and then in the description
> of the message, note that the field contains the raw file contents,
> whose encoding is unknown to the server.

+1

This would definitely have solved my initial issue and looks like a sane way forward without over-engineering the
problemor causing any backwards-compatibility issues.
 

Regards,
Brar Piening



Re: Aw: Re: Minor documentation error regarding streaming replication protocol

От
Tom Lane
Дата:
Bruce Momjian <bruce@momjian.us> writes:
> We would want the timeline history file contents label changed from
> BYTEA to TEXT in the docs changed for all supported versions, add a C
> comment to all backbranches that BYTEA is the same as TEXT protocol
> fields, and change the C code to return TEXT IN PG 14.  Is that what
> people want?

I still think there is no need to touch back branches.  What you
propose here is more likely to confuse people than help them.
Having the documentation disagree with the code about how the
field is labeled is not good either.

Furthermore, it absolutely does not make sense to say (or imply)
that the unknown-encoding business applies to all text fields.
There are a very small number of fields where we should say that.

            regards, tom lane



Re: Minor documentation error regarding streaming replication protocol

От
Jeff Davis
Дата:
On Thu, 2020-10-08 at 23:17 -0400, Bruce Momjian wrote:
> As I said before, it is more raw bytes, but
> we
> don't have an SQL data type for that.

Sorry to jump in to this thread late.

Can't we just set both "filename" and "content" to be BYTEA, but then
set the format code to 1 (indicating binary)?

For clients that do look at the descriptor, they are more likely to get
it right; and for clients that don't look at the descriptor, there will
be no change.

Regards,
    Jeff Davis





Re: Minor documentation error regarding streaming replication protocol

От
Andres Freund
Дата:
Hi,

On 2020-12-15 01:22:44 -0800, Jeff Davis wrote:
> Attached a simple patch that enforces an all-ASCII restore point name
> rather than documenting the possibility of non-ASCII text.

+1


> diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
> index 290658b22c..48daed56f6 100644
> --- a/src/backend/access/transam/xlogfuncs.c
> +++ b/src/backend/access/transam/xlogfuncs.c
> @@ -44,6 +44,8 @@
>  static StringInfo label_file;
>  static StringInfo tblspc_map_file;
>  
> +static bool is_all_ascii(const char *str);

Minor nit: I'd put this into common/string.[ch], besides
pg_clean_ascii(). Probably renaming it to pg_is_ascii().

Greetings,

Andres Freund



Re: Minor documentation error regarding streaming replication protocol

От
Michael Paquier
Дата:
On Tue, Dec 15, 2020 at 12:54:51PM -0800, Andres Freund wrote:
> Minor nit: I'd put this into common/string.[ch], besides
> pg_clean_ascii(). Probably renaming it to pg_is_ascii().

Yeah.  There is already one pg_is_ascii_string() in saslprep.c,
is_all_ascii() in  collationcmds.c and string_is_ascii() in
pgp-pgsql.c.  I don't think we need a fourth copy of the same logic,
even if it is dead simple.
--
Michael

Вложения

Re: Minor documentation error regarding streaming replication protocol

От
Brar Piening
Дата:
Jeff Davis wrote:
> Attached a simple patch that enforces an all-ASCII restore point name
> rather than documenting the possibility of non-ASCII text.

+1

This is probably the best solution because it gives guarantees to the
client without causing compatibility issues with old clients.

Thanks!

Brar




Re: Minor documentation error regarding streaming replication protocol

От
Jeff Davis
Дата:
On Tue, 2020-12-15 at 12:54 -0800, Andres Freund wrote:
> Hi,
> 
> On 2020-12-15 01:22:44 -0800, Jeff Davis wrote:
> > Attached a simple patch that enforces an all-ASCII restore point
> > name
> > rather than documenting the possibility of non-ASCII text.
> 
> +1

On second thought, I'm going to retract this patch. The docs[1] say:

"You can, if you like, add comments to a history file to record your
own notes about how and why this particular timeline was created. Such
comments will be especially valuable when you have a thicket of
different timelines as a result of experimentation."

It doesn't say how the comments should be written, but they are
apparently "#"-style comments, and presumably can contain text in
whatever encoding you want, and will be sent along to the client.

I don't see a lot of value in taking an already minor hole and closing
it halfway.

Changing the descriptor to be bytea-binary would be a principled
solution, but others objected and I don't have strong enough feelings
about it. I can live with the documentation note mentioning that it's
really binary data, and just ignore the descriptor.

Regards,
    Jeff Davis

[1] 
https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-TIMELINES





Re: Minor documentation error regarding streaming replication protocol

От
Michael Paquier
Дата:
On Thu, Dec 17, 2020 at 03:13:25PM -0800, Jeff Davis wrote:
> On second thought, I'm going to retract this patch. The docs[1] say:
>
> "You can, if you like, add comments to a history file to record your
> own notes about how and why this particular timeline was created. Such
> comments will be especially valuable when you have a thicket of
> different timelines as a result of experimentation."

This actually rings a bell.  Now that I think about it, I have seen in
the past Japanese customers making use of Japanese characters in
history files.  I am adding Fujii-san and Horiguchi-san in CC for more
details as I recall that they were involved in such things, with
pg_rman coming first into mind.  (No idea about French users.)
--
Michael

Вложения

Re: Minor documentation error regarding streaming replication protocol

От
Kyotaro Horiguchi
Дата:
At Fri, 18 Dec 2020 10:18:45 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Thu, Dec 17, 2020 at 03:13:25PM -0800, Jeff Davis wrote:
> > On second thought, I'm going to retract this patch. The docs[1] say:
> > 
> > "You can, if you like, add comments to a history file to record your
> > own notes about how and why this particular timeline was created. Such
> > comments will be especially valuable when you have a thicket of
> > different timelines as a result of experimentation."
> 
> This actually rings a bell.  Now that I think about it, I have seen in
> the past Japanese customers making use of Japanese characters in
> history files.  I am adding Fujii-san and Horiguchi-san in CC for more
> details as I recall that they were involved in such things, with
> pg_rman coming first into mind.  (No idea about French users.)

Sorry, my faint memory says that something like that happened on
backup label but I don't recall that clearly.

pg_basebackup can set the LABEL field in the file and the string fed
to the command's "-l" option is copied to there as-is (that is,
getting no conversion), as far as client encoding is valid as server
encoding. (since database encoding of walsender is always SQL_ASCII)
I'm not sure it is the expected behavior.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center