Discussion: [HACKERS] Sum aggregate calculation for single precision real
Hi hackers,

I wonder why the SUM aggregate is calculated for the real (float4) type using a floating point accumulator? It causes very confusing and unexpected behavior:

postgres=# select sum(l_quantity) from lineitem where l_shipdate <= '1998-12-01';
     sum
-------------
 1.52688e+09
(1 row)

postgres=# select sum(l_quantity+0.0) from lineitem where l_shipdate <= '1998-12-01';
    sum
------------
 1529738036

Is it specified in any SQL standard how aggregates should be calculated? At least Oracle and MS-SQL calculate SUM for the single precision type in a different (and more natural) way. Are there any reasons for using the float4pl function for the SUM aggregate instead of float4_accum?

Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
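The discrepancy follows directly from IEEE-754 single precision: once the running sum is large, a float4 accumulator rounds away most of each increment's low-order bits. A minimal Python sketch of the effect, outside the database (the f32 helper and the 25.5 constant are illustrative stand-ins for a float4 accumulator and an l_quantity-sized value, not PostgreSQL's actual code):

```python
import struct

def f32(x):
    """Round a Python float (a double) to IEEE-754 single precision,
    the way a float4 accumulator rounds every partial sum."""
    return struct.unpack('f', struct.pack('f', x))[0]

acc4 = 0.0                    # simulates sum(l_quantity): float4 accumulator
acc8 = 0.0                    # simulates sum(l_quantity + 0.0): float8 accumulator
v = f32(25.5)                 # exactly representable in float4
for _ in range(1_000_000):
    acc4 = f32(acc4 + v)      # every partial sum squeezed into a 24-bit mantissa
    acc8 += v                 # partial sums kept in double precision

print(acc8)                   # 25500000.0 (exact in double)
print(acc4)                   # drifts far from 25500000 once the sum outgrows the mantissa
```

Once the accumulator passes 2^23, the 25.5 increment can no longer be represented at the accumulator's exponent, so each addition picks up a systematic half-ulp error.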
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
> I wonder why the SUM aggregate is calculated for the real (float4) type using a floating point accumulator?

If you can't deal with the vagaries of floating-point arithmetic, you shouldn't be storing your data in float format. Use numeric.

> Are there any reasons for using the float4pl function for the SUM aggregate instead of float4_accum?

The latter is probably a good two orders of magnitude slower, and it wouldn't really do much to solve the inherent accuracy problems of adding float4 values that have a wide dynamic range. The expectation for SUM(float4) is that you want speed and are prepared to cope with the consequences. It's easy enough to cast your input to float8 if you want a wider accumulator, or to numeric if you'd like more stable (not necessarily more accurate :-() results. I do not think it's the database's job to make those choices for you.

regards, tom lane
On 13.02.2017 19:20, Tom Lane wrote:
> Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
>> I wonder why the SUM aggregate is calculated for the real (float4) type using a floating point accumulator?
> If you can't deal with the vagaries of floating-point arithmetic, you shouldn't be storing your data in float format. Use numeric.
4-byte floats are widely used, for example in trading applications, just because they are half the size of double and the range of the stored data is relatively small (not many significant digits are needed). At the same time the volume of stored data is very large, and switching from float4 to float8 would almost double it: two times more storage and an almost two times increase in query execution time.
So this is not an acceptable answer.
>> Are there any reasons for using the float4pl function for the SUM aggregate instead of float4_accum?
> The latter is probably a good two orders of magnitude slower, and it wouldn't really do much to solve the inherent accuracy problems of adding float4 values that have a wide dynamic range.
It is not true - please notice the query execution time of these two queries:
postgres=# select sum(l_quantity) from lineitem where l_shipdate <= '1998-12-01';
sum
-------------
1.52688e+09
(1 row)
Time: 2858.852 ms
postgres=# select sum(l_quantity+0.0) from lineitem where l_shipdate <= '1998-12-01';
sum
------------
1529738036
(1 row)
Time: 3174.529 ms
It looks like aggregate calculation itself is no longer a bottleneck in Postgres, compared with the tuple-deforming cost.
> The expectation for SUM(float4) is that you want speed and are prepared to cope with the consequences. It's easy enough to cast your input to float8 if you want a wider accumulator, or to numeric if you'd like more stable (not necessarily more accurate :-() results. I do not think it's the database's job to make those choices for you.
From my point of view, this is a strange and wrong expectation.
I am choosing the "float4" type for a column just because it is enough to represent the range of data I have, and I need to minimize record size.
But when I calculate a sum, I expect to receive a more or less precise result. Certainly I realize that even when using double it is possible to lose precision during the calculation, and the result may depend on summation order (if we add very small and very large values). But in real use cases (for example in trading data) such a large spread in attribute values is very rare. If you have, for example, a stock price, it is very unlikely that one company has the value 0.000001 and another 10000000.0.
At least in the TPC-H example (which certainly deals with synthetic generated data), the double type produces an almost precise result.
> regards, tom lane
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 2/13/17 10:45 AM, Konstantin Knizhnik wrote:
> It is not true - please notice the query execution time of these two queries:

I bet you'd get even less difference if you simply cast to float8 instead of adding 0.0. Same result, no floating point addition.

>> The expectation for SUM(float4) is that you want speed and are
>> prepared to cope with the consequences. It's easy enough to cast your
>> input to float8 if you want a wider accumulator, or to numeric if
>> you'd like more stable (not necessarily more accurate :-() results.
>> I do not think it's the database's job to make those choices for you.
>
> From my point of view, this is a strange and wrong expectation.
> I am choosing the "float4" type for a column just because it is enough to
> represent the range of data I have, and I need to minimize record size.

In other words, you've decided to trade accuracy for performance...

> But when I calculate a sum, I expect to receive a more or less precise
> result. Certainly I realize that even when using double it is

... but now you want to trade performance for accuracy? Why would you expect the database to magically come to that conclusion?

-- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532)
On Tue, Feb 14, 2017 at 8:59 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> From my point of view, this is a strange and wrong expectation.
>> I am choosing the "float4" type for a column just because it is enough to
>> represent the range of data I have, and I need to minimize record size.
>
> In other words, you've decided to trade accuracy for performance...
>
>> But when I calculate a sum, I expect to receive a more or less precise
>> result. Certainly I realize that even when using double it is
>
> ... but now you want to trade performance for accuracy? Why would you expect the database to magically come to that conclusion?

Well put. Although it's worth noting that we aren't 100% consistent about this stuff: sum(smallint), sum(integer), and sum(bigint) all use an output data type different from the input data type, but other versions of sum() don't. To some extent all of these decisions are just guesses about what users will find useful, and as this thread shows, not everybody's going to agree. But I don't think our guesses are flagrantly unreasonable or anything. There's also nothing to prevent Konstantin or anybody else who doesn't like the default behavior from creating their own version of sum(float4) and putting it in a schema that's listed before pg_catalog in search_path.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> Well put. Although it's worth noting that we aren't 100% consistent
> about this stuff: sum(smallint), sum(integer), and sum(bigint) all use
> an output data type different from the input data type, but other
> versions of sum() don't.

In those cases I believe the main reason for the different output type is that there's a significant risk of overflow if we don't. See commits bec98a31c and 5f7c2bdb5 for some history. You could perhaps make an argument that sum(float4) would have less risk of overflow if it accumulated in and returned float8, but frankly that seems a bit thin.

regards, tom lane
On 14.02.2017 16:59, Jim Nasby wrote:
> On 2/13/17 10:45 AM, Konstantin Knizhnik wrote:
>> It is not true - please notice the query execution time of these two queries:
>
> I bet you'd get even less difference if you simply cast to float8 instead of adding 0.0. Same result, no floating point addition.
>
>>> The expectation for SUM(float4) is that you want speed and are
>>> prepared to cope with the consequences. It's easy enough to cast your
>>> input to float8 if you want a wider accumulator, or to numeric if
>>> you'd like more stable (not necessarily more accurate :-() results.
>>> I do not think it's the database's job to make those choices for you.
>>
>> From my point of view, this is a strange and wrong expectation.
>> I am choosing the "float4" type for a column just because it is enough to
>> represent the range of data I have, and I need to minimize record size.
>
> In other words, you've decided to trade accuracy for performance...
I cannot agree with this...
1. If I choose the float4 type to store a bid price (which usually has 5-6 significant digits), I do not lose precision and accuracy does not suffer.
Accuracy matters when I am calculating the sum of prices. But the assumption that the accuracy of the sum calculation should depend on the type of the summed field is not obvious. It may be more or less clear to C programmers, but not to SQL users.
In all databases I have tested, the SUM of single precision floats is calculated using at least double precision numbers (or the numeric type).
2. There is no huge gap in performance between accumulating in float4 and in float8 - certainly not "orders of magnitude":
postgres=# select sum(l_quantity) from lineitem_projection;
sum
-------------
1.07374e+09
(1 row)
Time: 4659.509 ms (00:04.660)
postgres=# select sum(l_quantity::float8) from lineitem_projection;
sum
------------
1529738036
(1 row)
Time: 5465.320 ms (00:05.465)
So I do not think there is actually a compromise here between performance and accuracy.
But the current implementation leads to much confusion and contradicts users' expectations:
1. sum(l_quantity) and sum(l_quantity::float8) are completely different (by a factor of 1.5!!! - we lose half a billion dollars :)
2. avg(l_quantity)*count(l_quantity) is not equal to sum(l_quantity). But when casting to float8, the results match.
3. The sum of the aggregates over groups is not equal to the total sum (once again, no problem for the float8 type).
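Point 3 is easy to reproduce outside the database. A hedged Python sketch (the f32 helper and the ten-way interleaved split are made-up stand-ins for float4pl and a GROUP BY, not PostgreSQL code): with a float4 accumulator the grand total and the sum of per-group sums disagree, while a float8 accumulator keeps both consistent to within double precision.

```python
import math
import random
import struct

def f32(x):
    """Round a double to IEEE-754 single precision (simulated float4)."""
    return struct.unpack('f', struct.pack('f', x))[0]

def sum_f32(xs):
    """float4pl-style summation: every partial sum rounded to float4."""
    acc = 0.0
    for v in xs:
        acc = f32(acc + v)
    return acc

random.seed(0)
# 200,000 float4 values in an l_quantity-like range
data = [f32(random.uniform(0.0, 50.0)) for _ in range(200_000)]

# Grand total vs. sum of ten "GROUP BY" bucket sums, float4 accumulator:
total32 = sum_f32(data)
regrouped32 = sum_f32(sum_f32(data[i::10]) for i in range(10))

# The same two computations with a float8 accumulator:
total64 = sum(data)
regrouped64 = sum(sum(data[i::10]) for i in range(10))

exact = math.fsum(data)   # correctly-rounded reference sum
```

The float4 results sit far from the exact sum (and typically differ from each other), whereas both float8 results match the exact sum to roughly double-precision accuracy.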
>> But when I calculate a sum, I expect to receive a more or less precise
>> result. Certainly I realize that even when using double it is
>
> ... but now you want to trade performance for accuracy? Why would you expect the database to magically come to that conclusion?
See above - there is no trading here. Please notice that the current Postgres implementation of the AVG aggregate calculates both the sum and the sum of squares, even though the latter is not needed for AVG.
The comment in the code says:
* It might seem attractive to optimize this by having multiple accumulator
* functions that only calculate the sums actually needed. But on most
* modern machines, a couple of extra floating-point multiplies will be
* insignificant compared to the other per-tuple overhead, so I've chosen
* to minimize code space instead.
And it is true!
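For reference, the transition state that float4_accum maintained at the time of this thread can be modeled as a (N, sum(X), sum(X*X)) triple kept in double precision. This is an illustrative Python sketch of that idea, not the actual C code; the function names are made up:

```python
def float4_accum(state, x):
    """Illustrative model of the float4_accum transition function:
    the state is {N, sum(X), sum(X*X)}, all carried as float8."""
    n, sx, sxx = state
    return (n + 1.0, sx + x, sxx + x * x)

def float8_avg(state):
    """Illustrative model of the avg() final function; it ignores
    sum(X*X), which only the stddev/variance finals need."""
    n, sx, _ = state
    return sx / n

state = (0.0, 0.0, 0.0)
for v in (1.0, 2.0, 3.0, 4.0):
    state = float4_accum(state, v)

print(state)              # (4.0, 10.0, 30.0)
print(float8_avg(state))  # 2.5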
In addition to the results above, here is the timing for AVG calculation:
postgres=# select avg(l_quantity) from lineitem_projection;
avg
------------------
25.5015621964919
(1 row)
postgres=# select avg(l_quantity::float8) from lineitem_projection;
avg
------------------
25.5015621964919
(1 row)
Please notice that avg for float is calculated using float4_accum, which uses a float8 accumulator and also calculates sumX2!
Time: 6103.807 ms (00:06.104)
So I do not see reasonable arguments here for using float4pl for sum(float4)!
And I do not know of any database which has such strange behavior.
I know that "be like others", and especially "be like Oracle", are never good arguments for the Postgres community, but doing something differently (and, IMHO, wrongly) without any significant reason seems very strange.
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Tue, Feb 14, 2017 at 11:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> You could perhaps make an argument that sum(float4) would have less risk
> of overflow if it accumulated in and returned float8, but frankly that
> seems a bit thin.

I think that's more or less the argument Konstantin is in fact making. Whether it's a good argument or a thin one is a value judgement. Personally, I find it somewhere in the middle: I think the way it works now is reasonable, and I think what he wants would have been reasonable as well. However, I find it hard to believe it would be worth changing now on backward compatibility grounds. He doesn't like the way it works currently, but we have no way of knowing how many people who are happy with the way it works today would become unhappy if we changed it. We need a fairly compelling reason to risk breaking somebody's SQL, and I don't think this rises to that level.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I think that's more or less the argument Konstantin is in fact making.
> Whether it's a good argument or a thin one is a value judgement.
> Personally, I find it somewhere in the middle: I think the way it
> works now is reasonable, and I think what he wants would have been
> reasonable as well. However, I find it hard to believe it would be
> worth changing now on backward compatibility grounds.

I agree with that conclusion. It might have been reasonable to change it fifteen years ago when we changed the integer sum() implementations, but I think doing so now would make more people unhappy than it would make happy; or at least, there's little reason to think the reverse is true.

regards, tom lane
On Wed, Feb 15, 2017 at 9:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 14, 2017 at 11:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> You could perhaps make an argument that sum(float4) would have less risk
>> of overflow if it accumulated in and returned float8, but frankly that
>> seems a bit thin.
>
> I think that's more or less the argument Konstantin is in fact making.
> Whether it's a good argument or a thin one is a value judgement.
> Personally, I find it somewhere in the middle: I think the way it
> works now is reasonable, and I think what he wants would have been
> reasonable as well. However, I find it hard to believe it would be
> worth changing now on backward compatibility grounds. He doesn't like
> the way it works currently, but we have no way of knowing how many
> people who are happy with the way it works today would become unhappy
> if we changed it. We need a fairly compelling reason to risk breaking
> somebody's SQL, and I don't think this rises to that level.

I know this is said from time to time in this list, but a third option that wouldn't break anybody's SQL would be using compensated summation in the input type. AFAIK, that can only increase precision, but it will cost cycles. The impact could however fall below the noise and be worth a try.
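For concreteness, compensated (Kahan) summation keeps a small correction term alongside the accumulator, entirely in the input precision. A hedged Python sketch, with an illustrative f32 helper simulating float4 arithmetic (names are made up; real float4 NaN/infinity handling is ignored):

```python
import struct

def f32(x):
    """Round a double to IEEE-754 single precision (simulated float4)."""
    return struct.unpack('f', struct.pack('f', x))[0]

def naive_sum(xs):
    """Plain float4pl-style accumulation."""
    acc = 0.0
    for v in xs:
        acc = f32(acc + v)
    return acc

def kahan_sum(xs):
    """Kahan summation entirely in single precision:
    c carries the low-order bits each rounded addition loses."""
    acc = c = 0.0
    for v in xs:
        y = f32(v - c)
        t = f32(acc + y)
        c = f32(f32(t - acc) - y)
        acc = t
    return acc

data = [f32(0.1)] * 1_000_000   # true sum is about 100000.0015
naive = naive_sum(data)
kahan = kahan_sum(data)
print(naive)   # drifts well away from 100000
print(kahan)   # stays within a couple of float4 ulps of the true sum
```

The extra three single-precision operations per value are the "cost cycles" part of the trade-off; the error bound no longer grows with the number of inputs.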
On 15 February 2017 at 12:52, Robert Haas <robertmhaas@gmail.com> wrote:
> Personally, I find it somewhere in the middle: I think the way it
> works now is reasonable, and I think what he wants would have been
> reasonable as well. However, I find it hard to believe it would be
> worth changing now on backward compatibility grounds.

Honestly I was quite surprised to discover that sum(float4) worked this way because I assumed it followed the same pattern as integers.

But I wonder what you mean by backward compatibility grounds. If sum(float4) returned a float8 the only compatibility issue would be someone who did "create table as" and then expected to get a float4 column and instead got a float8 column. That seems like a much more minor corner case than most any other backward incompatible change in any release.

Moreover, it wouldn't be hard to make sum(float4) use a float8 as an accumulator and then cast to float4 for the final state. That would be 100% compatible with the existing behaviour aside from producing more accurate results.

(Though as an aside, I think Konstantin would be much better served by using integers and storing cents or whatever unit of currency is small enough. That would actually result in accurate results which neither float4 nor float8 guarantee.)

-- greg
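Greg's variant - accumulate in float8, round only the final state back to float4 - can be sketched numerically (a Python stand-in for the SQL types; f32 is an illustrative helper rounding to single precision, and the data is a made-up constant stream, not TPC-H):

```python
import struct

def f32(x):
    """Round a double to IEEE-754 single precision (simulated float4)."""
    return struct.unpack('f', struct.pack('f', x))[0]

data = [f32(25.5)] * 1_000_000

# Today's sum(float4): float4 accumulator, float4 result
acc = 0.0
for v in data:
    acc = f32(acc + v)
current = acc

# Sketched proposal: float8 accumulator, float4 final state
acc8 = 0.0
for v in data:
    acc8 += v
proposed = f32(acc8)   # same output type as today, correctly rounded value

print(current)         # far from 25500000
print(proposed)        # 25500000.0
```

The output type (and thus views, CREATE TABLE AS, etc.) stays float4; only the intermediate state widens.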
On Fri, Feb 17, 2017 at 6:51 AM, Greg Stark <stark@mit.edu> wrote:
> On 15 February 2017 at 12:52, Robert Haas <robertmhaas@gmail.com> wrote:
>> Personally, I find it somewhere in the middle: I think the way it
>> works now is reasonable, and I think what he wants would have been
>> reasonable as well. However, I find it hard to believe it would be
>> worth changing now on backward compatibility grounds.
>
> Honestly I was quite surprised to discover that sum(float4) worked
> this way because I assumed it followed the same pattern as integers.
>
> But I wonder what you mean by backward compatibility grounds. If
> sum(float4) returned a float8 the only compatibility issue would be
> someone who did "create table as" and then expected to get a float4
> column and instead got a float8 column. That seems like a much more
> minor corner case than most any other backward incompatible change in
> any release.

That is what I mean, yes. Plus somebody could have a view defined, and the types would change. I think it's only possible to consider that kind of thing as minor because most people don't use float4 for anything. Changing the SQL types of things tends to result in fairly wide-ranging SQL breakage.

> Moreover, it wouldn't be hard to make sum(float4) use a float8 as an
> accumulator and then cast to float4 for the final state. That would be
> 100% compatible with the existing behaviour aside from producing more
> accurate results.

Sure, but if somebody wants that, they can get it already just by a minor change to the SQL. The point is that adding up float4 as float4 is a reasonable thing to do. Adding it as float8 might also be a reasonable thing to do, but nobody's keeping anybody from doing that. It's just not the behavior you get out of the box.

> (Though as an aside, I think Konstantin would be much better served by
> using integers and storing cents or whatever unit of currency is small
> enough. That would actually result in accurate results which neither
> float4 nor float8 guarantee.)

Yep. Or changing the SQL to do sum(f4::float8)::float4, if that's what he wants.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Feb 17, 2017 at 6:51 AM, Greg Stark <stark@mit.edu> wrote:
>> Moreover, it wouldn't be hard to make sum(float4) use a float8 as an
>> accumulator and then cast to float4 for the final state. That would be
>> 100% compatible with the existing behaviour aside from producing more
>> accurate results.
>
> Sure, but if somebody wants that, they can get it already just by a
> minor change to the SQL. The point is that adding up float4 as float4
> is a reasonable thing to do. Adding it as float8 might also be a
> reasonable thing to do, but nobody's keeping anybody from doing that.

Also, if we changed sum(float4) to work that way, it would become very hard to sum float4 in float4 --- you'd pretty much have to build your own aggregate function, which is a lot harder than just inserting a cast. The argument for changing this boils down to "nobody would ever want that", which I do not think I buy. It's been a long time since I studied numerical analysis, but I think there are applications where you do want to do that.

regards, tom lane