Thread: [PATCH] json_lex_string: don't overread on bad UTF8


[PATCH] json_lex_string: don't overread on bad UTF8

From: Jacob Champion
Date: Tue, Apr 30, 2024 10:39:04 -0700
Hi all,

When json_lex_string() hits certain types of invalid input, it calls
pg_encoding_mblen_bounded(), which assumes that its input is
null-terminated and calls strnlen(). But the JSON lexer is constructed
with an explicit string length, and we don't ensure that the string is
null-terminated in all cases, so we can walk off the end of the
buffer. This isn't really relevant on the server side, where you'd
have to get a superuser to help you break string encodings, but for
client-side usage on untrusted input (such as my OAuth patch) it would
be more important.
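
For context, pg_encoding_mblen_bounded() is essentially just
strnlen(mbstr, pg_encoding_mblen(encoding, mbstr)), and
pg_encoding_mblen() reports the length that the lead byte claims
without looking at the rest of the buffer. A standalone sketch of the
failure mode (illustrative only, not the actual call path):

#include <string.h>

int
main(void)
{
    /* A lone UTF-8 lead byte at the end of a buffer with no NUL
     * terminator; the lead byte claims a 2-byte sequence. */
    char        buf[1] = {(char) 0xC3};

    /* pg_encoding_mblen() would return 2 here, so the bounded variant
     * effectively does strnlen(buf, 2) -- which inspects buf[1], one
     * byte past the end of the buffer. */
    (void) strnlen(buf, 2);
    return 0;
}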

Attached is a draft patch that explicitly checks against the
end-of-string pointer and clamps the token_terminator to it. Note that
this removes the only caller of pg_encoding_mblen_bounded() and I'm
not sure what we should do with that function. It seems like a
reasonable API, just not here.
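
The shape of the clamp is roughly this (a sketch of the idea, not the
patch text itself):

char       *term = s + pg_encoding_mblen(lex->input_encoding, s);

/* don't let the token extend past the end of the input */
lex->token_terminator = (term <= end) ? term : end;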

The new test needs to record two versions of the error message, one
for invalid token and one for invalid escape sequence. This is
because, for smaller chunk sizes, the partial-token logic in the
incremental JSON parser skips the affected code entirely when it can't
find an ending double-quote.

Tangentially: Should we maybe rethink pieces of the json_lex_string
error handling? For example, do we really want to echo an incomplete
multibyte sequence once we know it's bad? It also looks like there are
places where the FAIL_AT_CHAR_END macro is called after the `s`
pointer has already advanced past the code point of interest. I'm not
sure if that's intentional.

Thanks,
--Jacob

Attachments

Re: [PATCH] json_lex_string: don't overread on bad UTF8

From: Michael Paquier
Date: Tue, Apr 30, 2024 11:09 PM
On Tue, Apr 30, 2024 at 10:39:04AM -0700, Jacob Champion wrote:
> When json_lex_string() hits certain types of invalid input, it calls
> pg_encoding_mblen_bounded(), which assumes that its input is
> null-terminated and calls strnlen(). But the JSON lexer is constructed
> with an explicit string length, and we don't ensure that the string is
> null-terminated in all cases, so we can walk off the end of the
> buffer. This isn't really relevant on the server side, where you'd
> have to get a superuser to help you break string encodings, but for
> client-side usage on untrusted input (such as my OAuth patch) it would
> be more important.

I'm not sure I like the fact that this advances token_terminator
first.  Wouldn't it be better to calculate pg_encoding_mblen() first,
then save token_terminator?  I feel a bit uneasy about saving a value
in token_terminator past the end of the string.  That's a nit in this
context, still...

> Attached is a draft patch that explicitly checks against the
> end-of-string pointer and clamps the token_terminator to it. Note that
> this removes the only caller of pg_encoding_mblen_bounded() and I'm
> not sure what we should do with that function. It seems like a
> reasonable API, just not here.

Hmm.  Keeping it around as currently designed means that it could
cause more harm than good in the long term, even in the stable
branches if new code uses it.  There is a risk of new code
incorrectly using it again, even though its top comment says that it
relies on the string being nul-terminated.  A safer alternative would
be to redesign it so that the length of the string is provided as
input, removing the dependency on strlen in its internals, perhaps.
Anyway, without any callers of it, I'd be tempted to wipe it from
HEAD and call it a day.
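
If we did keep it, the redesign could look something like this
hypothetical signature (pg_encoding_mblen_len is an invented name,
only to sketch the idea):

static inline int
pg_encoding_mblen_len(int encoding, const char *mbstr, size_t remaining)
{
    int         mblen = pg_encoding_mblen(encoding, mbstr);

    /* never report more bytes than the caller says are available */
    return (mblen <= (int) remaining) ? mblen : (int) remaining;
}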

> The new test needs to record two versions of the error message, one
> for invalid token and one for invalid escape sequence. This is
> because, for smaller chunk sizes, the partial-token logic in the
> incremental JSON parser skips the affected code entirely when it can't
> find an ending double-quote.

Ah, that makes sense.  That looks OK here.  A comment around the test
to document that would be appropriate, I guess.

> Tangentially: Should we maybe rethink pieces of the json_lex_string
> error handling? For example, do we really want to echo an incomplete
> multibyte sequence once we know it's bad? It also looks like there are
> places where the FAIL_AT_CHAR_END macro is called after the `s`
> pointer has already advanced past the code point of interest. I'm not
> sure if that's intentional.

Advancing the tracking pointer 's' before reporting an error related
to the end of the string is bad practice, IMO, and we should avoid
that.  json_lex_string() does not offer a warm feeling in that regard
with escape characters, at least :/
--
Michael

Attachments

Re: [PATCH] json_lex_string: don't overread on bad UTF8

From: Jacob Champion
Date: Wed, May 01, 2024 16:22:24 -0700
On Tue, Apr 30, 2024 at 11:09 PM Michael Paquier <michael@paquier.xyz> wrote:
> I'm not sure I like the fact that this advances token_terminator
> first.  Wouldn't it be better to calculate pg_encoding_mblen() first,
> then save token_terminator?  I feel a bit uneasy about saving a value
> in token_terminator past the end of the string.  That's a nit in this
> context, still...

v2 tries it that way; see what you think. Is the concern that someone
might add code later that escapes that macro early?

> Hmm.  Keeping it around as currently designed means that it could
> cause more harm than good in the long term, even in the stable
> branches if new code uses it.  There is a risk of new code
> incorrectly using it again, even though its top comment says that it
> relies on the string being nul-terminated.  A safer alternative would
> be to redesign it so that the length of the string is provided as
> input, removing the dependency on strlen in its internals, perhaps.
> Anyway, without any callers of it, I'd be tempted to wipe it from
> HEAD and call it a day.

Removed in v2.

> > The new test needs to record two versions of the error message, one
> > for invalid token and one for invalid escape sequence. This is
> > because, for smaller chunk sizes, the partial-token logic in the
> > incremental JSON parser skips the affected code entirely when it can't
> > find an ending double-quote.
>
> Ah, that makes sense.  That looks OK here.  A comment around the test
> to document that would be appropriate, I guess.

Done.

> Advancing the tracking pointer 's' before reporting an error related
> to the end of the string is bad practice, IMO, and we should avoid
> that.  json_lex_string() does not offer a warm feeling in that regard
> with escape characters, at least :/

Yeah... I think some expansion of the json_errdetail test coverage is
probably in my future. :)

Thanks,
--Jacob

Attachments

Re: [PATCH] json_lex_string: don't overread on bad UTF8

From: Michael Paquier
Date: Thu, May 02, 2024 11:23:13 +0900
On Wed, May 01, 2024 at 04:22:24PM -0700, Jacob Champion wrote:
> On Tue, Apr 30, 2024 at 11:09 PM Michael Paquier <michael@paquier.xyz> wrote:
>> I'm not sure I like the fact that this advances token_terminator
>> first.  Wouldn't it be better to calculate pg_encoding_mblen() first,
>> then save token_terminator?  I feel a bit uneasy about saving a value
>> in token_terminator past the end of the string.  That's a nit in this
>> context, still...
>
> v2 tries it that way; see what you think. Is the concern that someone
> might add code later that escapes that macro early?

Yeah, I am not sure if that's something that would really happen, but
it seems like a good practice to follow anyway, keeping things in a
clean state at all times.

>> Ah, that makes sense.  That looks OK here.  A comment around the test
>> to document that would be appropriate, I guess.
>
> Done.

That seems OK at a quick glance.  I don't have much room to work on
this patch this week, between Golden Week and keeping an eye on the
buildfarm, but I should be able to get to it next week once the next
round of minor releases is tagged.

As for the fact that we may end up printing unfinished UTF-8
sequences, I'd be curious to hear your thoughts.  Now, the information
provided about the partial byte sequences can also be useful for
debugging, on top of having the error code, no?
--
Michael

Attachments

Re: [PATCH] json_lex_string: don't overread on bad UTF8

From: Michael Paquier
Date: Wed, May 1, 2024 8:40 PM
On Thu, May 02, 2024 at 11:23:13AM +0900, Michael Paquier wrote:
> As for the fact that we may end up printing unfinished UTF-8
> sequences, I'd be curious to hear your thoughts.  Now, the information
> provided about the partial byte sequences can also be useful for
> debugging, on top of having the error code, no?

By the way, while I have it in mind...  I am not sure that it is
worth spending cycles on detecting the unfinished sequences and
making them printable.  Wouldn't it be enough for most cases to
adjust token_error() to truncate the byte sequences we cannot print?
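
Sketching what I mean (a hypothetical helper, nothing that exists in
the tree): walk complete characters from the start of the token and
stop before the first one that would run past the end, so the message
never echoes a partial sequence:

static size_t
printable_token_length(int encoding, const char *token, size_t len)
{
    size_t      off = 0;

    while (off < len)
    {
        int         w = pg_encoding_mblen(encoding, token + off);

        if (off + w > len)
            break;              /* incomplete trailing sequence */
        off += w;
    }
    return off;
}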

Another thing that I think would be nice would be to calculate the
location of what we're parsing on a given line, and provide that in
the error context.  That would not be backpatchable as it requires a
change in JsonLexContext, unfortunately, but it would help make more
sense of an error when the incomplete byte sequence is at the
beginning of a token or after an expected character.
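
The column part would be cheap once the lexer knows where the current
line starts; a sketch only, with line_start as an assumed new field
of JsonLexContext:

int         column = (int) (lex->token_start - lex->line_start) + 1;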
--
Michael

Attachments

Re: [PATCH] json_lex_string: don't overread on bad UTF8

From: Jacob Champion
Date:
On Wed, May 1, 2024 at 8:40 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, May 02, 2024 at 11:23:13AM +0900, Michael Paquier wrote:
> > As for the fact that we may end up printing unfinished UTF-8
> > sequences, I'd be curious to hear your thoughts.  Now, the information
> > provided about the partial byte sequences can also be useful for
> > debugging, on top of having the error code, no?

Yes, but which information do you want? Do you want to know the bad
byte sequence, or see the glyph that corresponds to it (which is
probably �)? The glyph is better as long as it's complete; if it's a
bad sequence, then maybe you'd prefer to know the particular byte, but
that assumes a lot of technical knowledge on the part of whoever's
reading the message.

> By the way, while I have it in mind...  I am not sure that it is
> worth spending cycles on detecting the unfinished sequences and
> making them printable.  Wouldn't it be enough for most cases to
> adjust token_error() to truncate the byte sequences we cannot print?

Maybe. I'm beginning to wonder if I'm overthinking this particular
problem, and if we should just go ahead and print the bad sequence. At
least for the case of UTF-8 console encoding, replacement glyphs will
show up as needed.

There is the matter of a client that's not using UTF-8, though. Do we
deal with that correctly today? (I understand why it was done the way
it was, at least on the server side, but it's still really weird to
have code that parses "JSON" that isn't actually Unicode.)

> Another thing that I think would be nice would be to calculate the
> location of what we're parsing on a given line, and provide that in
> the error context.  That would not be backpatchable as it requires a
> change in JsonLexContext, unfortunately, but it would help make more
> sense of an error when the incomplete byte sequence is at the
> beginning of a token or after an expected character.

+1, at least that way you can skip directly to the broken spot during
a postmortem.

Thanks,
--Jacob



Re: [PATCH] json_lex_string: don't overread on bad UTF8

From: Peter Eisentraut
Date: Fri, May 3, 2024 4:54 AM
On 30.04.24 19:39, Jacob Champion wrote:
> Tangentially: Should we maybe rethink pieces of the json_lex_string
> error handling? For example, do we really want to echo an incomplete
> multibyte sequence once we know it's bad?

I can't quite find the place you might be looking at in 
json_lex_string(), but for the general encoding conversion we have what 
would appear to be the same behavior in report_invalid_encoding(), and 
we go out of our way there to produce a verbose error message including 
the invalid data.




Re: [PATCH] json_lex_string: don't overread on bad UTF8

From: Jacob Champion
Date: Fri, May 03, 2024 07:05:38 -0700
On Fri, May 3, 2024 at 4:54 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>
> On 30.04.24 19:39, Jacob Champion wrote:
> > Tangentially: Should we maybe rethink pieces of the json_lex_string
> > error handling? For example, do we really want to echo an incomplete
> > multibyte sequence once we know it's bad?
>
> I can't quite find the place you might be looking at in
> json_lex_string(),

(json_lex_string() reports the beginning and end of the "area of
interest" via the JsonLexContext; it's json_errdetail() that turns
that into an error message.)

> but for the general encoding conversion we have what
> would appear to be the same behavior in report_invalid_encoding(), and
> we go out of our way there to produce a verbose error message including
> the invalid data.

We could port something like that to src/common. IMO that'd be more
suited for an actual conversion routine, though, as opposed to a
parser that for the most part assumes you didn't lie about the input
encoding and is just trying not to crash if you're wrong. Most of the
time, the parser just copies bytes between delimiters and it's
up to the caller to handle encodings... the exceptions to that are the
\uXXXX escapes and the error handling.

Offhand, are all of our supported frontend encodings
self-synchronizing? By that I mean, is it safe to print a partial byte
sequence if the locale isn't UTF-8? (As I type this I'm staring at
Shift-JIS, and thinking "probably not.")
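
A concrete illustration of why I suspect not (the classic Shift-JIS
case, sketched in C for the record):

/* Shift-JIS encodes the katakana 'ソ' as the two bytes 0x83 0x5C, and
 * the trail byte 0x5C on its own is ASCII '\'.  Echo a truncated
 * sequence and a reader -- or anything downstream that interprets
 * backslashes -- sees a valid-looking '\' instead of an obviously
 * broken byte. */
const unsigned char so[] = {0x83, 0x5C};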

Actually -- hopefully this is not too much of a tangent -- that
further crystallizes a vague unease about the API that I have. The
JsonLexContext is initialized with something called the
"input_encoding", but that encoding is necessarily also the output
encoding for parsed string literals and error messages. For the server
side that's fine, but frontend clients have the input_encoding locked
to UTF-8, which seems like it might cause problems? Maybe I'm missing
code somewhere, but I don't see a conversion routine from
json_errdetail() to the actual client/locale encoding. (And the parser
does not support multibyte input_encodings that contain ASCII in trail
bytes.)

Thanks,
--Jacob



Re: [PATCH] json_lex_string: don't overread on bad UTF8

From: Michael Paquier
Date: Mon, May 6, 2024 8:43 PM
On Fri, May 03, 2024 at 07:05:38AM -0700, Jacob Champion wrote:
> On Fri, May 3, 2024 at 4:54 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>> but for the general encoding conversion we have what
>> would appear to be the same behavior in report_invalid_encoding(), and
>> we go out of our way there to produce a verbose error message including
>> the invalid data.

I was looking for that a couple of days ago in the backend but could
not put my finger on it.  Thanks.

> We could port something like that to src/common. IMO that'd be more
> suited for an actual conversion routine, though, as opposed to a
> parser that for the most part assumes you didn't lie about the input
> encoding and is just trying not to crash if you're wrong. Most of the
> time, the parser just copies bytes between delimiters and it's
> up to the caller to handle encodings... the exceptions to that are the
> \uXXXX escapes and the error handling.

Hmm.  That would still leave the backpatch issue at hand, which is
kind of confusing to leave as it is.  Would it be complicated to
truncate the entire byte sequence in the error message and just give
up because we cannot do better if the input byte sequence is
incomplete?  We could still have some information depending on the
string given as input, which should be enough, hopefully.  With the
location pointing to the beginning of the sequence, even better.

> Offhand, are all of our supported frontend encodings
> self-synchronizing? By that I mean, is it safe to print a partial byte
> sequence if the locale isn't UTF-8? (As I type this I'm staring at
> Shift-JIS, and thinking "probably not.")
>
> Actually -- hopefully this is not too much of a tangent -- that
> further crystallizes a vague unease about the API that I have. The
> JsonLexContext is initialized with something called the
> "input_encoding", but that encoding is necessarily also the output
> encoding for parsed string literals and error messages. For the server
> side that's fine, but frontend clients have the input_encoding locked
> to UTF-8, which seems like it might cause problems? Maybe I'm missing
> code somewhere, but I don't see a conversion routine from
> json_errdetail() to the actual client/locale encoding. (And the parser
> does not support multibyte input_encodings that contain ASCII in trail
> bytes.)

Referring to json_lex_string() that does UTF-8 -> ASCII -> give-up in
its conversion for FRONTEND, I guess?  Yep.  This limitation looks
like a problem, especially when plugging that into libpq.
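
Roughly this shape for the \uXXXX escapes, if I read the FRONTEND
branch right (paraphrased from memory, not the literal source; 'ch'
is the decoded code point):

if (lex->input_encoding == PG_UTF8)
    /* convert the code point to UTF-8 bytes and append them */ ;
else if (ch <= 0x007f)
    /* plain ASCII, safe to append in any supported encoding */ ;
else
    return JSON_UNICODE_HIGH_ESCAPE;    /* give up */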
--
Michael

Attachments

Re: [PATCH] json_lex_string: don't overread on bad UTF8

From: Jacob Champion
Date: Tue, May 07, 2024 14:06:10 -0700
On Mon, May 6, 2024 at 8:43 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, May 03, 2024 at 07:05:38AM -0700, Jacob Champion wrote:
> > We could port something like that to src/common. IMO that'd be more
> > suited for an actual conversion routine, though, as opposed to a
> > parser that for the most part assumes you didn't lie about the input
> > encoding and is just trying not to crash if you're wrong. Most of the
> > time, the parser just copies bytes between delimiters and it's
> > up to the caller to handle encodings... the exceptions to that are the
> > \uXXXX escapes and the error handling.
>
> Hmm.  That would still leave the backpatch issue at hand, which is
> kind of confusing to leave as it is.  Would it be complicated to
> truncate the entire byte sequence in the error message and just give
> up because we cannot do better if the input byte sequence is
> incomplete?

Maybe I've misunderstood, but isn't that what's being done in v2?

> > Maybe I'm missing
> > code somewhere, but I don't see a conversion routine from
> > json_errdetail() to the actual client/locale encoding. (And the parser
> > does not support multibyte input_encodings that contain ASCII in trail
> > bytes.)
>
> Referring to json_lex_string() that does UTF-8 -> ASCII -> give-up in
> its conversion for FRONTEND, I guess?  Yep.  This limitation looks
> like a problem, especially when plugging that into libpq.

Okay. How we deal with that will likely guide the "optimal" fix to
error reporting, I think...

Thanks,
--Jacob



Re: [PATCH] json_lex_string: don't overread on bad UTF8

From: Michael Paquier
Date: Tue, May 7, 2024 10:31 PM
On Tue, May 07, 2024 at 02:06:10PM -0700, Jacob Champion wrote:
> Maybe I've misunderstood, but isn't that what's being done in v2?

Something a bit different...  I was wondering if it would be possible
to tweak this code to truncate the data in the generated error string
so that the incomplete multi-byte sequence is entirely cut out, which
would amount to setting token_terminator to "s" (the last byte before
the incomplete byte sequence) rather than "term" (the last byte
available, even if incomplete):
#define FAIL_AT_CHAR_END(code) \
    do { \
        char       *term = s + pg_encoding_mblen(lex->input_encoding, s); \
        lex->token_terminator = (term <= end) ? term : s; \
        return code; \
    } while (0)

But looking closer, I can see that in the JSON_INVALID_TOKEN case,
when !tok_done, we set token_terminator to point to the end of the
token, and that would include an incomplete byte sequence like in your
case.  :/

At the end of the day, I think that I'm OK with your patch to avoid
the overread for now in the back-branches.  This situation makes me
uncomfortable and we should put more effort into printing error
messages in a readable format, but that could always be tackled later
as a separate problem...  And I don't see anything backpatchable on
short notice for v16.

Thoughts and/or objections?
--
Michael

Attachments

Re: [PATCH] json_lex_string: don't overread on bad UTF8

From: Jacob Champion
Date: Wed, May 08, 2024 07:01:08 -0700
On Tue, May 7, 2024 at 10:31 PM Michael Paquier <michael@paquier.xyz> wrote:
> But looking closer, I can see that in the JSON_INVALID_TOKEN case,
> when !tok_done, we set token_terminator to point to the end of the
> token, and that would include an incomplete byte sequence like in your
> case.  :/

Ah, I see what you're saying. Yeah, that approach would need some more
invasive changes.

> This situation makes me
> uncomfortable and we should put more effort into printing error
> messages in a readable format, but that could always be tackled later
> as a separate problem...  And I don't see anything backpatchable on
> short notice for v16.

Agreed. Fortunately (or unfortunately?) I think the JSON
client-encoding work is now a prerequisite for OAuth in libpq, so
hopefully some improvements can fall out of that work too.

> Thoughts and/or objections?

None here.

Thanks!
--Jacob



Re: [PATCH] json_lex_string: don't overread on bad UTF8

From: Michael Paquier
Date: Wed, May 8, 2024 9:27 PM
On Wed, May 08, 2024 at 07:01:08AM -0700, Jacob Champion wrote:
> On Tue, May 7, 2024 at 10:31 PM Michael Paquier <michael@paquier.xyz> wrote:
>> But looking closer, I can see that in the JSON_INVALID_TOKEN case,
>> when !tok_done, we set token_terminator to point to the end of the
>> token, and that would include an incomplete byte sequence like in your
>> case.  :/
>
> Ah, I see what you're saying. Yeah, that approach would need some more
> invasive changes.

My first feeling was actually to do that, and report the location in
the input string where we are seeing issues.  All code paths playing
with token_terminator would need to track that.

> Agreed. Fortunately (or unfortunately?) I think the JSON
> client-encoding work is now a prerequisite for OAuth in libpq, so
> hopefully some improvements can fall out of that work too.

I'm afraid so.  I don't quite see how this would be OK to tweak on
stable branches, but all areas that could report error states with
partial byte sequence contents would benefit from such a change.

>> Thoughts and/or objections?
>
> None here.

This is a bit mitigated by the fact that d6607016c738 is recent, but
this has been incorrect since v13, so I've backpatched down to that.
--
Michael

Attachments

Re: [PATCH] json_lex_string: don't overread on bad UTF8

From: Jacob Champion
Date:
On Wed, May 8, 2024 at 9:27 PM Michael Paquier <michael@paquier.xyz> wrote:
> This is a bit mitigated by the fact that d6607016c738 is recent, but
> this has been incorrect since v13, so I've backpatched down to that.

Thank you!

--Jacob