Обсуждение: Minor documentation error regarding streaming replication protocol
While implementing streaming replication client functionality for Npgsql I stumbled upon a minor documentation error at https://www.postgresql.org/docs/current/protocol-replication.html The "content" return value for the TIMELINE_HISTORYcommand is advertised as bytea while it is in fact raw ASCII bytes. How did I find out? Since the value I get doesn't start with a "\x" and contains ascii text, although I've bytea_outputset to hex, I first thought that the streaming replication protocol simply doesn't honor bytea_output, but then I realized that I also get unencoded tabs and newlines which wouldn't be possible if the value woud be passed through byteaout. This is certainly a minor problem since the timeline history file only contains generated strings that are ASCII-only, so just using the unencoded bytes is actually easier than decoding bytea. OTOH it did cost me a few hours (writing a bytea decoder and figuring out why it doesn't work by looking at varlena.c and proving the docs wrong) so I want to point this out here since it is possibly an unintended behavior or at least a documentation error. Also I'm wary of taking dependency on an undocumented implementation detail that could possibly change at any point. Regards, Brar
Brar Piening wrote: > This is certainly a minor problem After thinking about it a little more I think that at least the fact that even the row description message advertises the content as bytea could be considered as a bug - no?
On Sun, Sep 13, 2020 at 10:25:01PM +0200, Brar Piening wrote: > While implementing streaming replication client functionality for Npgsql > I stumbled upon a minor documentation error at > https://www.postgresql.org/docs/current/protocol-replication.html > > The "content" return value for the TIMELINE_HISTORYcommand is advertised > as bytea while it is in fact raw ASCII bytes. > > How did I find out? > Since the value I get doesn't start with a "\x" and contains ascii text, > although I've bytea_outputset to hex, I first thought that the streaming > replication protocol simply doesn't honor bytea_output, but then I > realized that I also get unencoded tabs and newlines which wouldn't be > possible if the value woud be passed through byteaout. > > This is certainly a minor problem since the timeline history file only > contains generated strings that are ASCII-only, so just using the > unencoded bytes is actually easier than decoding bytea. > OTOH it did cost me a few hours (writing a bytea decoder and figuring > out why it doesn't work by looking at varlena.c and proving the docs > wrong) so I want to point this out here since it is possibly an > unintended behavior or at least a documentation error. > Also I'm wary of taking dependency on an undocumented implementation > detail that could possibly change at any point. I have looked at this. It seems SendTimeLineHistory() is sending raw bytes from the history file, with no encoding conversion, and ReceiveXlogStream() is receiving it, again assuming it is just plain text. I am not sure we really have an SQL data type where we do this. BYTEA doesn't do encoding conversion, but does backslash procesing, and TEXT does encoding conversion. I suppose we either have to document this as BYTEA with no backslash processing, or TEXT with no encoding conversion --- I think I prefer the later. Attached is a patch to update the docs, and the data type passed by SendTimeLineHistory(). Does this look safe to everyone? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Вложения
On Thu, Oct 08, 2020 at 04:23:06PM -0400, Bruce Momjian wrote: > I have looked at this. It seems SendTimeLineHistory() is sending raw > bytes from the history file, with no encoding conversion, and > ReceiveXlogStream() is receiving it, again assuming it is just plain > text. I am not sure we really have an SQL data type where we do this. > BYTEA doesn't do encoding conversion, but does backslash procesing, and > TEXT does encoding conversion. > > I suppose we either have to document this as BYTEA with no backslash > processing, or TEXT with no encoding conversion --- I think I prefer the > later. As StartupXLOG() tells, The timeline history file can include as reason the recovery target name which may not be made just of ASCII characters as that's the value specified in pg_create_restore_point by the user, so bytea is correct, no? -- Michael
Вложения
On Fri, Oct 9, 2020 at 08:52:50AM +0900, Michael Paquier wrote: > On Thu, Oct 08, 2020 at 04:23:06PM -0400, Bruce Momjian wrote: > > I have looked at this. It seems SendTimeLineHistory() is sending raw > > bytes from the history file, with no encoding conversion, and > > ReceiveXlogStream() is receiving it, again assuming it is just plain > > text. I am not sure we really have an SQL data type where we do this. > > BYTEA doesn't do encoding conversion, but does backslash procesing, and > > TEXT does encoding conversion. > > > > I suppose we either have to document this as BYTEA with no backslash > > processing, or TEXT with no encoding conversion --- I think I prefer the > > later. > > As StartupXLOG() tells, The timeline history file can include as > reason the recovery target name which may not be made just of ASCII > characters as that's the value specified in pg_create_restore_point by > the user, so bytea is correct, no? Good point. The reporter was assuming the data would come to the client in the bytea output format specified by the GUC, e.g. \x..., so that doesn't happen either. As I said before, it is more raw bytes, but we don't have an SQL data type for that. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Thu, Oct 8, 2020 at 11:17:59PM -0400, Bruce Momjian wrote: > Good point. The reporter was assuming the data would come to the client > in the bytea output format specified by the GUC, e.g. \x..., so that > doesn't happen either. As I said before, it is more raw bytes, but we > don't have an SQL data type for that. I did some more research on this. It turns out timeline 'content' is the only BYTEA listed in the protocol docs, even though it just passes C strings to pq_sendbytes(), just like many other fields like the field above it, the timeline history filename. The proper fix is to change the code to return the timeline history file contents as TEXT instead of BYTEA. Patch attached. I would like to backpatch this to all supported versions so we are consistent and people don't think different PG versions use different return values for this. Is that safe? Looking at the uses of this in our code, it seems so. We aren't doing BYTEA escaping or TEXT encoding conversion in any of these cases. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Вложения
Bruce Momjian <bruce@momjian.us> writes: > Patch attached. I would like to backpatch this to all supported > versions so we are consistent and people don't think different PG > versions use different return values for this. Is that safe? Looking > at the uses of this in our code, it seems so. We aren't doing BYTEA > escaping or TEXT encoding conversion in any of these cases. I do not think a back-patch is a good idea. Yeah, it probably won't break anything, but by the same token it won't fix anything. regards, tom lane
Bruce Momjian <bruce(at)momjian(dot)us> wrote: >> Good point. The reporter was assuming the data would come to the client >> in the bytea output format specified by the GUC, e.g. \x..., so that >> doesn't happen either. As I said before, it is more raw bytes, but we >> don't have an SQL data type for that. > I did some more research on this. It turns out timeline 'content' is > the only BYTEA listed in the protocol docs, even though it just passes C > strings to pq_sendbytes(), just like many other fields like the field > above it, the timeline history filename. The proper fix is to change > the code to return the timeline history file contents as TEXT instead of > BYTEA. In the light of what Michael wrote above, I don't think that this is really enough. If the timeline history file can contain strings which "may not be made just of ASCII characters" this would probably makethe client side assume that the content is being sent as TEXT encoded in client_encoding which again isn't true. In the worst case this could lead to nasty decoding bugs on the client side which could even result in security issues. Since you probably can't tell in which encoding the aforementioned "recovery target name" was written to the timeline historyfile, I agree with Michael that BYTEA is probably the sanest way to send this file. IMO the best way out of this is to either really encode the content as BYTEA by passing it through byteaout() and by thatescaping characters <0x20 and >0x7e, or to document that the file is being sent "as raw bytes that can be read as 'byteaEscape Format' by parsers compatible with byteain()" (this works because byteain() doesn't check whether bytes <0x20or >0x7e are actually escaped). Again, reading the raw bytes, either via byteain() or just as raw bytes, isn't really a problem and I don't want to bringyou into a situation where the cure is worse than the disease.
Brar Piening <Brar@gmx.de> writes: > Bruce Momjian <bruce(at)momjian(dot)us> wrote: >> I did some more research on this. It turns out timeline 'content' is >> the only BYTEA listed in the protocol docs, even though it just passes C >> strings to pq_sendbytes(), just like many other fields like the field >> above it, the timeline history filename. The proper fix is to change >> the code to return the timeline history file contents as TEXT instead of >> BYTEA. > If the timeline history file can contain strings which "may not be made just of ASCII characters" this would probably makethe client side assume that the content is being sent as TEXT encoded in client_encoding which again isn't true. I think this is overthinking the problem, TBH. Yeah, perhaps somebody put non-ASCII comments into that file, but they'd probably be expecting the exact same encoding they used there to come out the other end. We should certainly *not* apply byteaout, as that would break cases that work fine today; and if we do not do that, it is quite misleading to suggest that the data is given in bytea format. I don't really see why our only documentation choices are "text" or "bytea". Can't we just write "(raw file contents)" or the like? regards, tom lane
On 2020/10/15 23:03, Tom Lane wrote: > Brar Piening <Brar@gmx.de> writes: >> Bruce Momjian <bruce(at)momjian(dot)us> wrote: >>> I did some more research on this. It turns out timeline 'content' is >>> the only BYTEA listed in the protocol docs, even though it just passes C >>> strings to pq_sendbytes(), just like many other fields like the field >>> above it, the timeline history filename. The proper fix is to change >>> the code to return the timeline history file contents as TEXT instead of >>> BYTEA. > >> If the timeline history file can contain strings which "may not be made just of ASCII characters" this would probablymake the client side assume that the content is being sent as TEXT encoded in client_encoding which again isn't true. > > I think this is overthinking the problem, TBH. Yeah, perhaps somebody > put non-ASCII comments into that file, but they'd probably be expecting > the exact same encoding they used there to come out the other end. > > We should certainly *not* apply byteaout, as that would break cases that > work fine today; +1 > and if we do not do that, it is quite misleading to > suggest that the data is given in bytea format. > > I don't really see why our only documentation choices are "text" or > "bytea". Can't we just write "(raw file contents)" or the like? +1 Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Bruce Momjian <bruce@momjian.us> writes: > On Thu, Oct 15, 2020 at 10:03:46AM -0400, Tom Lane wrote: >> I don't really see why our only documentation choices are "text" or >> "bytea". Can't we just write "(raw file contents)" or the like? > Agreed. The reason they are those labels is that the C code has to > label the output fields, so it can only use SQL types. There are many > protocol fields mislabeled in this way, not just timeline history. Ah, right. Well, let's label it as text and then in the description of the message, note that the field contains the raw file contents, whose encoding is unknown to the server. regards, tom lane
Aw: Re: Re: Minor documentation error regarding streaming replication protocol
От
Brar Piening
Дата:
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Well, let's label it as text and then in the description > of the message, note that the field contains the raw file contents, > whose encoding is unknown to the server. +1 This would definitely have solved my initial issue and looks like a sane way forward without over-engineering the problemor causing any backwards-compatibility issues. Regards, Brar Piening
Bruce Momjian <bruce@momjian.us> writes: > We would want the timeline history file contents label changed from > BYTEA to TEXT in the docs changed for all supported versions, add a C > comment to all backbranches that BYTEA is the same as TEXT protocol > fields, and change the C code to return TEXT IN PG 14. Is that what > people want? I still think there is no need to touch back branches. What you propose here is more likely to confuse people than help them. Having the documentation disagree with the code about how the field is labeled is not good either. Furthermore, it absolutely does not make sense to say (or imply) that the unknown-encoding business applies to all text fields. There are a very small number of fields where we should say that. regards, tom lane
On Thu, 2020-10-08 at 23:17 -0400, Bruce Momjian wrote: > As I said before, it is more raw bytes, but > we > don't have an SQL data type for that. Sorry to jump in to this thread late. Can't we just set both "filename" and "content" to be BYTEA, but then set the format code to 1 (indicating binary)? For clients that do look at the descriptor, they are more likely to get it right; and for clients that don't look at the descriptor, there will be no change. Regards, Jeff Davis
Hi, On 2020-12-15 01:22:44 -0800, Jeff Davis wrote: > Attached a simple patch that enforces an all-ASCII restore point name > rather than documenting the possibility of non-ASCII text. +1 > diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c > index 290658b22c..48daed56f6 100644 > --- a/src/backend/access/transam/xlogfuncs.c > +++ b/src/backend/access/transam/xlogfuncs.c > @@ -44,6 +44,8 @@ > static StringInfo label_file; > static StringInfo tblspc_map_file; > > +static bool is_all_ascii(const char *str); Minor nit: I'd put this into common/string.[ch], besides pg_clean_ascii(). Probably renaming it to pg_is_ascii(). Greetings, Andres Freund
On Tue, Dec 15, 2020 at 12:54:51PM -0800, Andres Freund wrote: > Minor nit: I'd put this into common/string.[ch], besides > pg_clean_ascii(). Probably renaming it to pg_is_ascii(). Yeah. There is already one pg_is_ascii_string() in saslprep.c, is_all_ascii() in collationcmds.c and string_is_ascii() in pgp-pgsql.c. I don't think we need a fourth copy of the same logic, even if it is dead simple. -- Michael
Вложения
Jeff Davis wrote: > Attached a simple patch that enforces an all-ASCII restore point name > rather than documenting the possibility of non-ASCII text. +1 This is probably the best solution because it gives guarantees to the client without causing compatibility issues with old clients. Thanks! Brar
On Tue, 2020-12-15 at 12:54 -0800, Andres Freund wrote: > Hi, > > On 2020-12-15 01:22:44 -0800, Jeff Davis wrote: > > Attached a simple patch that enforces an all-ASCII restore point > > name > > rather than documenting the possibility of non-ASCII text. > > +1 On second thought, I'm going to retract this patch. The docs[1] say: "You can, if you like, add comments to a history file to record your own notes about how and why this particular timeline was created. Such comments will be especially valuable when you have a thicket of different timelines as a result of experimentation." It doesn't say how the comments should be written, but they are apparently "#"-style comments, and presumably can contain text in whatever encoding you want, and will be sent along to the client. I don't see a lot of value in taking an already minor hole and closing it halfway. Changing the descriptor to be bytea-binary would be a principled solution, but others objected and I don't have strong enough feelings about it. I can live with the documentation note mentioning that it's really binary data, and just ignore the descriptor. Regards, Jeff Davis [1] https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-TIMELINES
On Thu, Dec 17, 2020 at 03:13:25PM -0800, Jeff Davis wrote: > On second thought, I'm going to retract this patch. The docs[1] say: > > "You can, if you like, add comments to a history file to record your > own notes about how and why this particular timeline was created. Such > comments will be especially valuable when you have a thicket of > different timelines as a result of experimentation." This actually rings a bell. Now that I think about it, I have seen in the past Japanese customers making use of Japanese characters in history files. I am adding Fujii-san and Horiguchi-san in CC for more details as I recall that they were involved in such things, with pg_rman coming first into mind. (No idea about French users.) -- Michael
Вложения
At Fri, 18 Dec 2020 10:18:45 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Thu, Dec 17, 2020 at 03:13:25PM -0800, Jeff Davis wrote: > > On second thought, I'm going to retract this patch. The docs[1] say: > > > > "You can, if you like, add comments to a history file to record your > > own notes about how and why this particular timeline was created. Such > > comments will be especially valuable when you have a thicket of > > different timelines as a result of experimentation." > > This actually rings a bell. Now that I think about it, I have seen in > the past Japanese customers making use of Japanese characters in > history files. I am adding Fujii-san and Horiguchi-san in CC for more > details as I recall that they were involved in such things, with > pg_rman coming first into mind. (No idea about French users.) Sorry, my faint memory says that something like that happened on backup label but I don't recall that clearly. pg_basebackup can set the LABEL field in the file and the string fed to the command's "-l" option is copied to there as-is (that is, getting no conversion), as far as client encoding is valid as server encoding. (since database encoding of walsender is always SQL_ASCII) I'm not sure it is the expected behavior. regards. -- Kyotaro Horiguchi NTT Open Source Software Center