Re: [PATCH] json_lex_string: don't overread on bad UTF8

From Jacob Champion
Subject Re: [PATCH] json_lex_string: don't overread on bad UTF8
Msg-id CAOYmi+=BomJrQUBgy5FQY9ZtHvuK7WOJNB6foPUv21qfb2+YPw@mail.gmail.com
In response to Re: [PATCH] json_lex_string: don't overread on bad UTF8  (Peter Eisentraut <peter@eisentraut.org>)
Responses Re: [PATCH] json_lex_string: don't overread on bad UTF8
List pgsql-hackers
On Fri, May 3, 2024 at 4:54 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>
> On 30.04.24 19:39, Jacob Champion wrote:
> > Tangentially: Should we maybe rethink pieces of the json_lex_string
> > error handling? For example, do we really want to echo an incomplete
> > multibyte sequence once we know it's bad?
>
> I can't quite find the place you might be looking at in
> json_lex_string(),

(json_lex_string() reports the beginning and end of the "area of
interest" via the JsonLexContext; it's json_errdetail() that turns
that into an error message.)

> but for the general encoding conversion we have what
> would appear to be the same behavior in report_invalid_encoding(), and
> we go out of our way there to produce a verbose error message including
> the invalid data.

We could port something like that to src/common. IMO that'd be more
suited for an actual conversion routine, though, as opposed to a
parser that for the most part assumes you didn't lie about the input
encoding and is just trying not to crash if you're wrong. Most of the
time, the parser just copies the bytes between delimiters and leaves
encoding handling to the caller; the exceptions to that are the
\uXXXX escapes and the error handling.

Offhand, are all of our supported frontend encodings
self-synchronizing? By that I mean, is it safe to print a partial byte
sequence if the locale isn't UTF-8? (As I type this I'm staring at
Shift-JIS, and thinking "probably not.")
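
To make the self-synchronization worry concrete, here is a rough sketch. The
function names are made up for illustration, and the Shift-JIS byte ranges are
the commonly documented ones (including the usual vendor extensions), not
anything copied from the PostgreSQL tree. The point is that a UTF-8
continuation byte can never be ASCII, whereas a Shift-JIS trail byte can, so a
dangling Shift-JIS lead byte may absorb whatever ASCII byte is printed after it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Shift-JIS lead bytes (assumed ranges, including vendor extensions). */
static bool
sjis_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Shift-JIS trail bytes; note the overlap with ASCII 0x40..0x7E. */
static bool
sjis_is_trail(unsigned char b)
{
    return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
}

/* UTF-8 continuation bytes always look like 10xxxxxx, so never ASCII. */
static bool
utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Count the characters a simple Shift-JIS decoder would see in buf.
 * A lead byte followed by any valid trail byte is taken as one character,
 * even if that trail byte was meant to be a separate ASCII character.
 */
static size_t
sjis_count_chars(const unsigned char *buf, size_t len)
{
    size_t      n = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < len; n++)
    {
        if (sjis_is_lead(buf[i]) && i + 1 < len && sjis_is_trail(buf[i + 1]))
            i += 2;             /* two-byte character */
        else
            i += 1;             /* single byte, or lead byte with no trail */
    }
    return n;
}
```

With the three bytes 0x83 'A' 'B' (a truncated sequence followed by two ASCII
letters), sjis_count_chars() sees two characters rather than three, because
'A' is swallowed as a trail byte. A UTF-8 decoder in the same position can
resynchronize, since 'A' fails utf8_is_continuation().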

Actually -- hopefully this is not too much of a tangent -- that
further crystallizes a vague unease about the API that I have. The
JsonLexContext is initialized with something called the
"input_encoding", but that encoding is necessarily also the output
encoding for parsed string literals and error messages. For the server
side that's fine, but frontend clients have the input_encoding locked
to UTF-8, which seems like it might cause problems? Maybe I'm missing
code somewhere, but I don't see a conversion routine from
json_errdetail() to the actual client/locale encoding. (And the parser
does not support multibyte input_encodings that contain ASCII in trail
bytes.)
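
For the parenthetical about ASCII in trail bytes, a small illustration of what
goes wrong, assuming a lexer that scans string literals one byte at a time.
Both helpers below are hypothetical and are not the actual json_lex_string()
logic; the Shift-JIS lead-byte ranges are again the commonly documented ones.
The two-byte Shift-JIS sequence 0x83 0x5C ends in the same byte value as an
ASCII backslash, so a byte-wise scan reports an escape that isn't there.

```c
#include <assert.h>
#include <stddef.h>

/* Byte-oriented scan, the way an ASCII-transparent lexer would do it. */
static ptrdiff_t
bytewise_find_backslash(const unsigned char *s, size_t len)
{
    assert(s != NULL);
    for (size_t i = 0; i < len; i++)
    {
        if (s[i] == '\\')
            return (ptrdiff_t) i;
    }
    return -1;                  /* no backslash found */
}

/* Shift-JIS-aware scan: skip the trail byte of each two-byte character. */
static ptrdiff_t
sjis_find_backslash(const unsigned char *s, size_t len)
{
    assert(s != NULL);
    for (size_t i = 0; i < len; i++)
    {
        unsigned char b = s[i];

        if ((b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC))
        {
            i++;                /* step onto the trail byte; the loop skips it */
            continue;
        }
        if (b == '\\')
            return (ptrdiff_t) i;
    }
    return -1;                  /* no real backslash found */
}
```

Given the literal bytes '"' 0x83 0x5C '"', bytewise_find_backslash() returns
offset 2 while sjis_find_backslash() returns -1, which is the kind of
disagreement that makes such encodings unsafe as an input_encoding for a
byte-wise parser.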

Thanks,
--Jacob


