Обсуждение: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

Поиск
Список
Период
Сортировка

[PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

От
Evan Jones
Дата:
This patch fixes a rare parsing bug with unicode characters on Mac OS X. The problem is that isspace() on Mac OS X changes its behaviour with the locale. Use scanner_isspace instead, which only returns true for ASCII whitespace. It appears other places in the Postgres code have already run into this, since a number of places use scanner_isspace instead. However, there are still a lot of other calls to isspace(). I'll try to take a quick look to see if there might be other instances of this bug.

The bug is that in the following hstore value, the unicode character "disappears", and is replaced with "key\xc4", because it is parsed incorrectly:

select E'keyą=>value'::hstore;
     hstore      
-----------------
 "keyą"=>"value"
(1 row)

select 'keyą=>value'::hstore::text::bytea;
              bytea              
----------------------------------
 \x226b6579c4223d3e2276616c756522
(1 row)

The correct result should be:

     hstore      
-----------------
 "keyą"=>"value"
(1 row)

That query is added to the regression test. The query works on Linux, but failed on Mac OS X.

For a more detailed explanation of how isspace() works, on Mac OS X, see: https://github.com/evanj/isspace_locale

Thanks!

Evan Jones

Вложения

Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

От
Michael Paquier
Дата:
On Mon, Jun 05, 2023 at 11:26:56AM -0400, Evan Jones wrote:
> This patch fixes a rare parsing bug with unicode characters on Mac OS X.
> The problem is that isspace() on Mac OS X changes its behaviour with the
> locale. Use scanner_isspace instead, which only returns true for ASCII
> whitespace. It appears other places in the Postgres code have already run
> into this, since a number of places use scanner_isspace instead. However,
> there are still a lot of other calls to isspace(). I'll try to take a quick
> look to see if there might be other instances of this bug.

Indeed.  It looks like 9ae2661 missed this spot.
--
Michael

Вложения

Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

От
Evan Jones
Дата:
On Tue, Jun 6, 2023 at 7:37 AM Michael Paquier <michael@paquier.xyz> wrote:
Indeed.  It looks like 9ae2661 missed this spot.

I didn't think to look for a previous fix, thanks for finding this commit id!

I did a quick look at the places found with "git grep isspace" yesterday. I agree with the comment from commit 9ae2661: "I've left alone isspace() calls in places that aren't really expecting any non-ASCII input characters, such as float8in()." There are a number of other calls where I think it would likely be safe, and possibly even a good idea, to replace isspace() with scanner_isspace(). However, I couldn't find any where I could cause a bug like the one I hit in hstore parsing.

Original mailing list post for commit 9ae2661 in case it is helpful for others: https://www.postgresql.org/message-id/10129.1495302480@sss.pgh.pa.us

Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

От
Michael Paquier
Дата:
On Tue, Jun 06, 2023 at 10:16:09AM -0400, Evan Jones wrote:
> I did a quick look at the places found with "git grep isspace" yesterday. I
> agree with the comment from commit 9ae2661: "I've left alone isspace()
> calls in places that aren't really expecting any non-ASCII input
> characters, such as float8in()." There are a number of other calls where I
> think it would likely be safe, and possibly even a good idea, to replace
> isspace() with scanner_isspace(). However, I couldn't find any where I
> could cause a bug like the one I hit in hstore parsing.

Yes, I agree with this feeling.  Like 9ae2661, I can't get really
excited about plastering more of that, especially if it were for
timezone value input or dictionary options.  One area with a new
isspace() since 2017 is multirangetypes.c, but it is just a copy of
rangetypes.c.

> Original mailing list post for commit 9ae2661 in case it is helpful for
> others: https://www.postgresql.org/message-id/10129.1495302480@sss.pgh.pa.us

I have reproduced the original problem reported on macOS 13.4, which
is close to the top of what's available.

Passing to pg_regress some options to use something else than UTF-8
leads to a failure in the tests, so we need a split like
fussyztrmatch to test that:
REGRESS_OPTS='--encoding=SQL_ASCII --no-locale' make check

An other error pattern without a split could be found on Windows, as
of:
 select E'key\u0105=>value'::hstore;
-     hstore
------------------
- "keyÄ…"=>"value"
-(1 row)
-
+ERROR:  character with byte sequence 0xc4 0x85 in encoding "UTF8" has
no equivalent in encoding "WIN1252"
+LINE 1: select E'key\u0105=>value'::hstore;

We don't do that for unaccent, actually, leading to similar failures..
I'll launch a separate thread about that shortly.

With that fixed, the fix has been applied and backpatched.  Thanks for
the report, Evan!
--
Michael

Вложения

Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

От
Evan Jones
Дата:
Unfortunately I just noticed a possible "bug" with this change. The scanner_isspace() function only recognizes *five* ASCII space characters: ' ' \t \n \r \f. It *excludes* VTAB \v, which the C standard function isspace() includes. This means this patch changed the behavior of hstore parsing for some "unusual" cases where the \v character was previously ignored, and now is not, such as: "select 'k=>\vv'::hstore" . It seems unlikely to me that anyone would be depending on this. The application/programming language library would need to be explicitly depending on VTAB being ignored as leading/trailing characters for hstore key/values. I am hopeful that most implementations encode hstore values the same way Postgres does: always using quoted strings, which avoids this problem.

However, if we think this change could be a problem, one fix would be to switch scanner_isspace() to array_isspace(), which returns true for these *six* ASCII characters. I am happy to submit a patch to do this.

However, I am now wondering if the fact that scanner_isspace() and array_isspace() disagree with each other could be problematic somewhere, but so far I haven found anything.
 

Problematic example before my hstore change:

$ printf "select 'k=>\vv'::hstore" | psql
  hstore  
----------
 "k"=>"v"
(1 row)

Same example after my hstore change on postgres master commit a14e75eb0b from 2023-06-16:

$ printf "select 'k=>\vv'::hstore" | psql
    hstore    
--------------
 "k"=>"\x0Bv"
(1 row)




On Sun, Jun 11, 2023 at 8:18 PM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Jun 06, 2023 at 10:16:09AM -0400, Evan Jones wrote:
> I did a quick look at the places found with "git grep isspace" yesterday. I
> agree with the comment from commit 9ae2661: "I've left alone isspace()
> calls in places that aren't really expecting any non-ASCII input
> characters, such as float8in()." There are a number of other calls where I
> think it would likely be safe, and possibly even a good idea, to replace
> isspace() with scanner_isspace(). However, I couldn't find any where I
> could cause a bug like the one I hit in hstore parsing.

Yes, I agree with this feeling.  Like 9ae2661, I can't get really
excited about plastering more of that, especially if it were for
timezone value input or dictionary options.  One area with a new
isspace() since 2017 is multirangetypes.c, but it is just a copy of
rangetypes.c.

> Original mailing list post for commit 9ae2661 in case it is helpful for
> others: https://www.postgresql.org/message-id/10129.1495302480@sss.pgh.pa.us

I have reproduced the original problem reported on macOS 13.4, which
is close to the top of what's available.

Passing to pg_regress some options to use something else than UTF-8
leads to a failure in the tests, so we need a split like
fussyztrmatch to test that:
REGRESS_OPTS='--encoding=SQL_ASCII --no-locale' make check

An other error pattern without a split could be found on Windows, as
of:
 select E'key\u0105=>value'::hstore;
-     hstore     
------------------
- "keyÄ…"=>"value"
-(1 row)
-
+ERROR:  character with byte sequence 0xc4 0x85 in encoding "UTF8" has
no equivalent in encoding "WIN1252"
+LINE 1: select E'key\u0105=>value'::hstore;

We don't do that for unaccent, actually, leading to similar failures..
I'll launch a separate thread about that shortly.

With that fixed, the fix has been applied and backpatched.  Thanks for
the report, Evan!
--
Michael

Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

От
Michael Paquier
Дата:
On Sat, Jun 17, 2023 at 10:57:05AM -0400, Evan Jones wrote:
> However, if we think this change could be a problem, one fix would be to
> switch scanner_isspace() to array_isspace(), which returns true for these
> *six* ASCII characters. I am happy to submit a patch to do this.

The difference between scanner_isspace() and array_isspace() is that
the former matches with what scan.l stores as rules for whitespace
characters, but the latter works on values.  For hstore, we want the
latter, with something that works on values.  To keep the change
locale to hstore, I think that we should just introduce an
hstore_isspace() which is a copy of array_isspace.  That's a
duplication, sure, but I think that we may want to think harder about
\v in the flex scanner, and that's just a few extra lines for
something that has not changed in 13 years for arrays.  That's also
easier to think about for stable branches.  If you can send a patch,
that helps a lot, for sure!

Worth noting that the array part has been changed in 2010, with
95cacd1, for the same reason as what you've proposed for hstore.
Thread is here, and it does not mention our flex rules, either:
https://www.postgresql.org/message-id/8F72262C-5694-4626-A87F-00604FB5E1D6@trumpet.io

Perhaps we could consider \v as a whitespace in the flex scanner
itself, but I am scared to do that in any stable branch.  Perhaps
we could consider that for HEAD in 17~?  That's a lot to work around
an old BSD bug that macOS has inherited, though.
--
Michael

Вложения

Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

От
Michael Paquier
Дата:
On Sun, Jun 18, 2023 at 10:50:16AM +0900, Michael Paquier wrote:
> The difference between scanner_isspace() and array_isspace() is that
> the former matches with what scan.l stores as rules for whitespace
> characters, but the latter works on values.  For hstore, we want the
> latter, with something that works on values.  To keep the change
> locale to hstore, I think that we should just introduce an
> hstore_isspace() which is a copy of array_isspace.  That's a
> duplication, sure, but I think that we may want to think harder about
> \v in the flex scanner, and that's just a few extra lines for
> something that has not changed in 13 years for arrays.  That's also
> easier to think about for stable branches.  If you can send a patch,
> that helps a lot, for sure!

At the end, no need to do that.  I have been able to hack the
attached, that shows the difference of treatment for \v when running
in macOS.  Evan, what do you think?
--
Michael

Вложения

Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

От
Tom Lane
Дата:
Michael Paquier <michael@paquier.xyz> writes:
> At the end, no need to do that.  I have been able to hack the
> attached, that shows the difference of treatment for \v when running
> in macOS.  Evan, what do you think?

FWIW, I think the status quo is fine.  Having hstore do something that
is neither its historical behavior nor aligned with the core parser
doesn't seem like a great idea.  I don't buy this argument that
somebody might be depending on the handling of \v in particular.  It's
not any stronger than the argument that they might be depending on,
say, recognizing no-break space (0xA0) in LATIN1, which the old code
did (probably, depending on platform) and scanner_isspace will not.

If anything, the answer for these concerns is that d522b05c8
should not have been back-patched.  But I'm okay with where we are.

            regards, tom lane



Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

От
Michael Paquier
Дата:
On Sun, Jun 18, 2023 at 12:38:12PM -0400, Tom Lane wrote:
> FWIW, I think the status quo is fine.  Having hstore do something that
> is neither its historical behavior nor aligned with the core parser
> doesn't seem like a great idea.

Okay.  Fine by me.

> I don't buy this argument that
> somebody might be depending on the handling of \v in particular.  It's
> not any stronger than the argument that they might be depending on,
> say, recognizing no-break space (0xA0) in LATIN1, which the old code
> did (probably, depending on platform) and scanner_isspace will not.

Another thing that I was wondering, though..  Do you think that there
would be an argument in being stricter in the hstore code regarding
the handling of multi-byte characters with some checks based on
IS_HIGHBIT_SET() when parsing the keys and values?
--
Michael

Вложения

Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

От
Tom Lane
Дата:
Michael Paquier <michael@paquier.xyz> writes:
> Another thing that I was wondering, though..  Do you think that there
> would be an argument in being stricter in the hstore code regarding
> the handling of multi-byte characters with some checks based on
> IS_HIGHBIT_SET() when parsing the keys and values?

What have you got in mind?  We should already have validated encoding
correctness before the text ever gets to hstore_in, and I'm not clear
what additional checks would be useful.

            regards, tom lane



Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

От
Michael Paquier
Дата:
On Sun, Jun 18, 2023 at 09:10:59PM -0400, Tom Lane wrote:
> What have you got in mind?  We should already have validated encoding
> correctness before the text ever gets to hstore_in, and I'm not clear
> what additional checks would be useful.

I was staring at the hstore parsing code and got the impression that
multi-byte character handling could be improved, but looking closer it
seems that I got that wrong.  Apologies for the noise.
--
Michael

Вложения

Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

От
Evan Jones
Дата:
Thanks for the detailed discussion. To confirm that I've understood everything:

* Michael's proposed patch to add hstore_isspace() would be a potential fix: it resolves my original bug, and does not change the behavior of '\v'.
* We believe the change to '\v' is not a problem, and may be an improvement because it now follows the "core" Postgres parser.

In conclusion: we don't need to make an additional change. Thank you all for investigating!

My one last suggestion: We *could* revert the backpatching if we are concerned about this change, but I'm not personally sure that is necessary. As we discussed, this is an unusual corner case in an "extension" type that many won't even have enabled.

Evan


On Tue, Jun 20, 2023 at 2:02 AM Michael Paquier <michael@paquier.xyz> wrote:
On Sun, Jun 18, 2023 at 09:10:59PM -0400, Tom Lane wrote:
> What have you got in mind?  We should already have validated encoding
> correctness before the text ever gets to hstore_in, and I'm not clear
> what additional checks would be useful.

I was staring at the hstore parsing code and got the impression that
multi-byte character handling could be improved, but looking closer it
seems that I got that wrong.  Apologies for the noise.
--
Michael

Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

От
Michael Paquier
Дата:
On Tue, Jun 20, 2023 at 09:04:26AM -0400, Evan Jones wrote:
> My one last suggestion: We *could* revert the backpatching if we are
> concerned about this change, but I'm not personally sure that is necessary.
> As we discussed, this is an unusual corner case in an "extension" type that
> many won't even have enabled.

As a whole, I'd like to think that this is an improvement even for
stable branches with these weird isspace() handlings, so I'm OK with
the current status in all the branches.  There's an argument about \v,
IMO, but I won't fight hard for it either even if it would be more
consistent with the way array values are handled.
--
Michael

Вложения

Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

От
Tom Lane
Дата:
Michael Paquier <michael@paquier.xyz> writes:
> As a whole, I'd like to think that this is an improvement even for
> stable branches with these weird isspace() handlings, so I'm OK with
> the current status in all the branches.

Sounds like we're all content with that.

> There's an argument about \v,
> IMO, but I won't fight hard for it either even if it would be more
> consistent with the way array values are handled.

I'd be okay with adding \v to the set of whitespace characters in
scan.l and scanner_isspace (and other affected places) for v17.
Don't want to back-patch it though.

            regards, tom lane



Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

От
Michael Paquier
Дата:
On Tue, Jun 20, 2023 at 11:39:31PM -0400, Tom Lane wrote:
> I'd be okay with adding \v to the set of whitespace characters in
> scan.l and scanner_isspace (and other affected places) for v17.
> Don't want to back-patch it though.

Okay.  No idea where this will lead, but for now I have sent a patch
that adds \v to the parser paths where it would be needed, as far as I
checked:
https://www.postgresql.org/message-id/ZJKcjNwWHHvw9ksQ@paquier.xyz
--
Michael

Вложения

Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

От
Thomas Munro
Дата:
FTR I ran into a benign case of the phenomenon in this thread when
dealing with row types.  In rowtypes.c, we double-quote stuff
containing spaces, but we detect them by passing individual bytes of
UTF-8 sequences to isspace().  Like macOS, Windows thinks that 0xa0 is
a space when you do that, so for example the Korean character '점'
(code point C810, UTF-8 sequence EC A0 90) gets quotes on Windows but
not on Linux.  That confused a migration/diff tool while comparing
Windows and Linux database servers using that representation.  Not a
big deal, I guess no one ever promised that the format was stable
across platforms, and I don't immediately see a way for anything more
serious to go wrong (though I may lack imagination).  It does seem a
bit weird to be using locale-aware tokenising for a machine-readable
format, and then making sure its behaviour is undefined by feeding it
chopped up bytes.



Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

От
Evan Jones
Дата:
Thanks for bringing this up! I just looked at the uses if isspace() in that file. It looks like it is the usual thing: it is allowing leading or trailing whitespace when parsing values, or for this "needs quoting" logic on output. The fix would be the same: this *should* be using scanner_isspace. This has the same disadvantage: it would change Postgres's results for some inputs that contain these non-ASCII "space" characters.


Here is a quick demonstration of this issue, showing that the quoting behavior is different between these two. Mac OS X with the "default" locale includes quotes because ą includes  0x85 in its UTF-8 encoding:

postgres=# SELECT ROW('keyą');
   row    
----------
 ("keyą")
(1 row)

On Mac OS X with the LANG=C environment variable set, it does not include quotes:

postgres=# SELECT ROW('keyą');
  row  
--------
 (keyą)
(1 row)
 

On Mon, Oct 9, 2023 at 11:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:
FTR I ran into a benign case of the phenomenon in this thread when
dealing with row types.  In rowtypes.c, we double-quote stuff
containing spaces, but we detect them by passing individual bytes of
UTF-8 sequences to isspace().  Like macOS, Windows thinks that 0xa0 is
a space when you do that, so for example the Korean character '점'
(code point C810, UTF-8 sequence EC A0 90) gets quotes on Windows but
not on Linux.  That confused a migration/diff tool while comparing
Windows and Linux database servers using that representation.  Not a
big deal, I guess no one ever promised that the format was stable
across platforms, and I don't immediately see a way for anything more
serious to go wrong (though I may lack imagination).  It does seem a
bit weird to be using locale-aware tokenising for a machine-readable
format, and then making sure its behaviour is undefined by feeding it
chopped up bytes.

Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

От
Michael Paquier
Дата:
On Tue, Oct 10, 2023 at 10:51:10AM -0400, Evan Jones wrote:
> Here is a quick demonstration of this issue, showing that the quoting
> behavior is different between these two. Mac OS X with the "default" locale
> includes quotes because ą includes  0x85 in its UTF-8 encoding:

Ugh.  rowtypes.c has reminded me as well of gistfuncs.c in pageinspect
where included columns are printed in a ROW-like fashion.  And it also
uses isspace() when we check if double quotes are needed or not.  So
the use of the quotes would equally depend on what macos thinks is
a correct space in this case.
--
Michael

Вложения