Clean up formatting.c's logic for matching constant strings.
seq_search(), which is used to match input substrings to constants
such as month and day names, had a lot of bizarre and unnecessary
behaviors. It was mostly possible to avert our eyes from that before,
but we don't want to duplicate those behaviors in the upcoming patch
to allow recognition of non-English month and day names. So it's time
to clean this up. In particular:
* seq_search scribbled on the input string, which is a pretty dangerous
thing to do, especially in the badly underdocumented way it was done here.
Fortunately the input string is a temporary copy, but that was being made
three subroutine levels away, making it something easy to break
accidentally. The behavior is externally visible nonetheless, in the form
of odd case-folding in error reports about unrecognized month/day names.
The scribbling is evidently being done to save a few calls to pg_tolower,
but that's such a cheap function (at least for ASCII data) that it's
pretty pointless to worry about. In HEAD I switched it to be
pg_ascii_tolower to ensure it is cheap in all cases; but there are corner
cases in Turkish where this'd change behavior, so leave it as pg_tolower
in the back branches.
* seq_search insisted on knowing the case form (all-upper, all-lower,
or initcap) of the constant strings, so that it didn't have to case-fold
them to perform case-insensitive comparisons. This likewise seems like
excessive micro-optimization, given that pg_tolower is certainly very
cheap for ASCII data. It seems unsafe to assume that we know the case
form that will come out of pg_locale.c for localized month/day names, so
it's better just to define the comparison rule as "downcase all strings
before comparing". (The choice between downcasing and upcasing is
arbitrary so far as English is concerned, but it might not be in other
locales, so follow citext's lead here.)
* seq_search also had a parameter that'd cause it to report a match
after a maximum number of characters, even if the constant string were
longer than that. This was not actually used because no caller passed
a value small enough to cut off a comparison. Replicating that behavior
for localized month/day names seems expensive as well as useless, so
let's get rid of that too.
* from_char_seq_search used the maximum-length parameter to truncate
the input string in error reports about not finding a matching name.
This leads to rather confusing reports in many cases. Worse, it is
outright dangerous if the input string isn't all-ASCII, because we
risk truncating the string in the middle of a multibyte character.
That'd lead either to delivering an illegible error message to the
client, or to encoding-conversion failures that obscure the actual
data problem. Get rid of that in favor of truncating at whitespace
if any (a suggestion due to Alvaro Herrera).
In addition to fixing these things, I const-ified the input string
pointers of DCH_from_char and its subroutines, to make sure there
aren't any other scribbling-on-input problems.
The risk of generating a badly-encoded error message seems like
enough of a bug to justify back-patching, so patch all supported
branches.
Discussion: https://postgr.es/m/29432.1579731087@sss.pgh.pa.us
Branch
------
master
Details
-------
https://git.postgresql.org/pg/commitdiff/4c70098ffa8cf19e79e7b124ccac05dd338c937b
Modified Files
--------------
src/backend/utils/adt/formatting.c | 172 ++++++++++++++++-----------------
src/test/regress/expected/horology.out | 4 +-
2 files changed, 85 insertions(+), 91 deletions(-)