Обсуждение: Underscores in numeric literals
Here is a patch to add support for underscores in numeric literals, for visual grouping, like 1_500_000_000 0b10001000_00000000 0o_1_755 0xFFFF_FFFF 1.618_034 per SQL:202x draft. This adds support in the lexer as well as in the integer type input functions. TODO: float/numeric type input support I did some performance tests similar to what was done in [0] and didn't find any problematic deviations. Other tests would be welcome. [0]: https://www.postgresql.org/message-id/flat/b239564c-cad0-b23e-c57e-166d883cb97d@enterprisedb.com
Вложения
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: > Here is a patch to add support for underscores in numeric literals, for > visual grouping, like > 1_500_000_000 > 0b10001000_00000000 > 0o_1_755 > 0xFFFF_FFFF > 1.618_034 > per SQL:202x draft. > This adds support in the lexer as well as in the integer type input > functions. > TODO: float/numeric type input support Hmm ... I'm on board with allowing this in SQL if the committee says so. I'm not especially on board with accepting it in datatype input functions. There's been zero demand for that AFAIR. Moreover, I don't think we need the inevitable I/O performance hit, nor the increased risk of accepting garbage, nor the certainty of inconsistency with other places that don't get converted (because they depend on strtoul() or whatever). We already accept that numeric input is different from numeric literals: you can't write Infinity or NaN in SQL without quotes. So I don't see an argument that we have to allow this in numeric input for consistency. regards, tom lane
On Tue, Dec 27, 2022 at 09:55:32AM -0500, Tom Lane wrote: > Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: > > Here is a patch to add support for underscores in numeric literals, for > > visual grouping, like > > > 1_500_000_000 > > 0b10001000_00000000 > > 0o_1_755 > > 0xFFFF_FFFF > > 1.618_034 > > > per SQL:202x draft. > > > This adds support in the lexer as well as in the integer type input > > functions. > > TODO: float/numeric type input support > > Hmm ... I'm on board with allowing this in SQL if the committee says > so. > I'm not especially on board with accepting it in datatype input > functions. There's been zero demand for that AFAIR. Moreover, > I don't think we need the inevitable I/O performance hit, nor the > increased risk of accepting garbage, nor the certainty of > inconsistency with other places that don't get converted (because > they depend on strtoul() or whatever). +1 to accept underscores only in literals and leave input functions alone. (When I realized that python3.6 changed to accept things like int("3_5"), I felt compelled to write a wrapper to check for embedded underscores and raise an exception in that case. And I'm sure it affected performance.) -- Justin
On 2022-12-27 Tu 09:55, Tom Lane wrote: > We already accept that numeric input is different from numeric > literals: you can't write Infinity or NaN in SQL without quotes. > So I don't see an argument that we have to allow this in numeric > input for consistency. > That's almost the same, but not quite, ISTM. Those are things you can't say without quotes, but here unless I'm mistaken you'd be disallowing this style if you use quotes. I get the difficulties with input functions, but it seems like we'll be building lots of grounds for confusion. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Wed, 28 Dec 2022 at 14:28, Andrew Dunstan <andrew@dunslane.net> wrote: > > On 2022-12-27 Tu 09:55, Tom Lane wrote: > > We already accept that numeric input is different from numeric > > literals: you can't write Infinity or NaN in SQL without quotes. > > So I don't see an argument that we have to allow this in numeric > > input for consistency. > > That's almost the same, but not quite, ISTM. Those are things you can't > say without quotes, but here unless I'm mistaken you'd be disallowing > this style if you use quotes. I get the difficulties with input > functions, but it seems like we'll be building lots of grounds for > confusion. > Yeah, it's easy to see why something like 'NaN' needs quotes, but it would be harder to explain why something like 1000_000 mustn't have quotes, and couldn't be used as input to COPY. My feeling is that we should try to make the datatype input functions accept anything that is legal syntax as a numeric literal, even if the reverse isn't always possible. That said, I think it's very important to minimise any performance hit, especially in the existing case of inputs with no underscores. Looking at the patch's changes to pg_strtointNN(), I think there's more that can be done to reduce that performance hit. As it stands, every input character is checked to see if it's an underscore, and then there's a new check at the end to ensure that the input string doesn't have a trailing underscore. Both of those can be avoided by rearranging things a little, as in the attached v2 patch. In the v2 patch, each input character is only compared with underscore if it's not a digit, so in the case of an input with no underscores or trailing spaces, the new checks for underscores are never executed. In addition, if an underscore is seen, it now checks that the next character is a digit. This eliminates the possibility of two underscores in a row, and also of a trailing underscore, and so there is no need for the final check for trailing underscores. Thus, if the input consists only of digits, it never has to test for underscores at all, and the performance hit for this case is minimised. My other concern with this patch is that the responsibility for handling underscores is distributed over a couple of different places. I had the same concern about the non-decimal integer patch, but at the time I couldn't see any way round it. Now that we have soft error handling though, I think that there is a way to improve this, centralising the logic for both underscore and non-decimal handling to one place for each datatype, reducing code duplication and the chances of bugs. For example, make_const() in the T_Float case has gained new code to parse both the sign and base-prefix of the input, duplicating the logic in pg_strtointNN(). That can now be avoided by having it call pg_strtoint64_safe() with an ErrorSaveContext, instead of strtoi64(). In the process, it would then gain the ability to handle underscores, so they wouldn't need to be stripped off elsewhere. Similarly, process_integer_literal() could be made to call pg_strtoint32_safe() with an ErrorSaveContext instead of strtoint(), and it then wouldn't need to strip off underscores, or be passed the number's base, since pg_strtoint32_safe() would handle all of that. In addition, I think that strip_underscores() could then go away if numeric_in() were made to handle underscores. Essentially then, that would move all responsibility for parsing underscores and non-decimal integers to the datatype input functions, or their support routines, rather than having it distributed. Regards, Dean
Вложения
Oh, one other minor nit -- in parser/scan.l: -real ({decinteger}|{numeric})[Ee][-+]?{decdigit}+ +real ({decinteger}|{numeric})[Ee][-+]?{decinteger}+ the final "+" isn't necessary now. Regards, Dean
On Wed, 4 Jan 2023 at 09:28, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > > In addition, I think that strip_underscores() could then go away if > numeric_in() were made to handle underscores. > > Essentially then, that would move all responsibility for parsing > underscores and non-decimal integers to the datatype input functions, > or their support routines, rather than having it distributed. > Here's an update with those changes. Regards, Dean
Вложения
On 23.01.23 21:45, Dean Rasheed wrote: > On Wed, 4 Jan 2023 at 09:28, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: >> >> In addition, I think that strip_underscores() could then go away if >> numeric_in() were made to handle underscores. >> >> Essentially then, that would move all responsibility for parsing >> underscores and non-decimal integers to the datatype input functions, >> or their support routines, rather than having it distributed. > > Here's an update with those changes. This looks good to me. Did you have any thoughts about what to do with the float types? I guess we could handle those in a separate patch?
On Tue, 31 Jan 2023 at 15:28, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > Did you have any thoughts about what to do with the float types? I > guess we could handle those in a separate patch? > I was assuming that we'd do nothing for float types, because anything we did would necessarily impact their performance. Regards, Dean
On 31.01.23 17:09, Dean Rasheed wrote: > On Tue, 31 Jan 2023 at 15:28, Peter Eisentraut > <peter.eisentraut@enterprisedb.com> wrote: >> >> Did you have any thoughts about what to do with the float types? I >> guess we could handle those in a separate patch? >> > > I was assuming that we'd do nothing for float types, because anything > we did would necessarily impact their performance. Yeah, as long as we are using strtof() and strtod() we should just leave it alone. If we have break that open and hand-code something, we can reconsider it. So I think you could go ahead with committing your patch and we can consider this topic done for now.
On Thu, 2 Feb 2023 at 22:40, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > On 31.01.23 17:09, Dean Rasheed wrote: > > On Tue, 31 Jan 2023 at 15:28, Peter Eisentraut > > <peter.eisentraut@enterprisedb.com> wrote: > >> > >> Did you have any thoughts about what to do with the float types? I > >> guess we could handle those in a separate patch? > >> > > > > I was assuming that we'd do nothing for float types, because anything > > we did would necessarily impact their performance. > > Yeah, as long as we are using strtof() and strtod() we should just leave > it alone. If we have break that open and hand-code something, we can > reconsider it. > > So I think you could go ahead with committing your patch and we can > consider this topic done for now. > Done. Regards, Dean