Обсуждение: [HACKERS] Sum aggregate calculation for single precsion real

Поиск
Список
Период
Сортировка

[HACKERS] Sum aggregate calculation for single precsion real

От
Konstantin Knizhnik
Дата:
Hi hackers,

I wonder why SUM aggregate is calculated for real (float4) type using 
floating point accumulator?
It cause very confusing and unexpected behavior:

-- postgres=# select sum(l_quantity)  from lineitem where l_shipdate <= '1998-12-01';     sum
------------- 1.52688e+09
(1 row)

postgres=# select sum(l_quantity+0.0)  from lineitem where l_shipdate <= '1998-12-01';    sum
------------ 1529738036


It is specified in any SQL standard how aggregates should be calculated?
At least Oracle and MS-SQL are calculating SUM for single precision type in different (and more natual) way.
Are there are reasons of using float4pl function for SUM aggregate instead of float4_accum?



Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Sum aggregate calculation for single precsion real

От
Tom Lane
Дата:
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
> I wonder why SUM aggregate is calculated for real (float4) type using 
> floating point accumulator?

If you can't deal with the vagaries of floating-point arithmetic, you
shouldn't be storing your data in float format.  Use numeric.

> Are there are reasons of using float4pl function for SUM aggregate instead of float4_accum?

The latter is probably a good two orders of magnitude slower, and it
wouldn't really do much to solve the inherent accuracy problems of
adding float4 values that have a wide dynamic range.

The expectation for SUM(float4) is that you want speed and are
prepared to cope with the consequences.  It's easy enough to cast your
input to float8 if you want a wider accumulator, or to numeric if
you'd like more stable (not necessarily more accurate :-() results.
I do not think it's the database's job to make those choices for you.
        regards, tom lane



Re: [HACKERS] Sum aggregate calculation for single precsion real

От
Konstantin Knizhnik
Дата:


On 13.02.2017 19:20, Tom Lane wrote:
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
I wonder why SUM aggregate is calculated for real (float4) type using 
floating point accumulator?
If you can't deal with the vagaries of floating-point arithmetic, you
shouldn't be storing your data in float format.  Use numeric.

4-byte floats are widely used for example in trading applications just because it is two times shorter then double and range of stored data is relatively small (do not need a lot of significant digits). At the same time volume of stored data is very large and switching from float4 to float8 will almost double it. It requires two times more storage and almost two times increase query execution time.
So this is not acceptable answer.


Are there are reasons of using float4pl function for SUM aggregate instead of float4_accum?
The latter is probably a good two orders of magnitude slower, and it
wouldn't really do much to solve the inherent accuracy problems of
adding float4 values that have a wide dynamic range.

It is not true - please notice query execution time of this two queries:

postgres=# select sum(l_quantity)  from lineitem where l_shipdate <= '1998-12-01';
     sum    
-------------
 1.52688e+09
(1 row)

Time: 2858.852 ms
postgres=# select sum(l_quantity+0.0)  from lineitem where l_shipdate <= '1998-12-01';
    sum    
------------
 1529738036
(1 row)

Time: 3174.529 ms


Looks like now in Postgres aggregate calculation itself is not a bottleneck, comparing with tuple deform cost.



The expectation for SUM(float4) is that you want speed and are
prepared to cope with the consequences.  It's easy enough to cast your
input to float8 if you want a wider accumulator, or to numeric if
you'd like more stable (not necessarily more accurate :-() results.
I do not think it's the database's job to make those choices for you.

From my point of your it is strange and wrong expectation.
I am choosing "float4" type for a column just because it is enough to represent range of data I have and I need to minimize size of record.
But when I am calculating sum, I expect to receive more or less precise result. Certainly I realize that even in case of using double it is possible to loose precision while calculation and result may depend on sum order (if we add very small and very larger values). But in real use cases (for example in trading data) such large difference in attribute values is very rare. If you have, for example, stock price, then it is very unlikely that one company has value 0.000001 and another 10000000.0
At least in TPC-H example (which certainly deal with dummy generated data), double type produce "almost price" result.

		regards, tom lane

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Sum aggregate calculation for single precsion real

От
Jim Nasby
Дата:
On 2/13/17 10:45 AM, Konstantin Knizhnik wrote:
> It is not true - please notice query execution time of this two queries:

I bet you'd get even less difference if you simply cast to float8 
instead of adding 0.0. Same result, no floating point addition.

>> The expectation for SUM(float4) is that you want speed and are
>> prepared to cope with the consequences.  It's easy enough to cast your
>> input to float8 if you want a wider accumulator, or to numeric if
>> you'd like more stable (not necessarily more accurate :-() results.
>> I do not think it's the database's job to make those choices for you.
>
> From my point of your it is strange and wrong expectation.
> I am choosing "float4" type for a column just because it is enough to
> represent range of data I have and I need to minimize size of record.

In other words, you've decided to trade accuracy for performance...

> But when I am calculating sum, I expect to receive more or less precise
> result. Certainly I realize that even in case of using double it is

... but now you want to trade performance for accuracy? Why would you 
expect the database to magically come to that conclusion?
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Sum aggregate calculation for single precsion real

От
Robert Haas
Дата:
On Tue, Feb 14, 2017 at 8:59 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> From my point of your it is strange and wrong expectation.
>> I am choosing "float4" type for a column just because it is enough to
>> represent range of data I have and I need to minimize size of record.
>
> In other words, you've decided to trade accuracy for performance...
>
>> But when I am calculating sum, I expect to receive more or less precise
>> result. Certainly I realize that even in case of using double it is
>
> ... but now you want to trade performance for accuracy? Why would you expect
> the database to magically come to that conclusion?

Well put.  Although it's worth noting that we aren't 100% consistent
about this stuff: sum(smallint), sum(integer), and sum(bigint) all use
an output data type different from the input data type, but other
versions of sum() don't.  To some extent all of these decisions are
just guesses about what users will find useful, and as this thread
shows, not everybody's going to agree.  But I don't think our guesses
are flagrantly unreasonable or anything.

There's also nothing to prevent Konstantin or anybody else who doesn't
like the default behavior to create their own version of sum(float4)
and put it in a schema that's listed before pg_catalog in search_path.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Sum aggregate calculation for single precsion real

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> Well put.  Although it's worth noting that we aren't 100% consistent
> about this stuff: sum(smallint), sum(integer), and sum(bigint) all use
> an output data type different from the input data type, but other
> versions of sum() don't.

In those cases I believe the main reason for the different output type is
that there's a significant risk of overflow if we don't.  See commits
bec98a31c and 5f7c2bdb5 for some history.

You could perhaps make an argument that sum(float4) would have less risk
of overflow if it accumulated in and returned float8, but frankly that
seems a bit thin.
        regards, tom lane



Re: [HACKERS] Sum aggregate calculation for single precsion real

От
Konstantin Knizhnik
Дата:


On 14.02.2017 16:59, Jim Nasby wrote:
On 2/13/17 10:45 AM, Konstantin Knizhnik wrote:
It is not true - please notice query execution time of this two queries:

I bet you'd get even less difference if you simply cast to float8 instead of adding 0.0. Same result, no floating point addition.

The expectation for SUM(float4) is that you want speed and are
prepared to cope with the consequences.  It's easy enough to cast your
input to float8 if you want a wider accumulator, or to numeric if
you'd like more stable (not necessarily more accurate :-() results.
I do not think it's the database's job to make those choices for you.

From my point of your it is strange and wrong expectation.
I am choosing "float4" type for a column just because it is enough to
represent range of data I have and I need to minimize size of record.

In other words, you've decided to trade accuracy for performance...

Could not agree with it...
1. If I choose float4 type to store bid price (which usually has 5-6 significant digits) - I do not loose precision and accuracy is not suffered.
The accuracy is important when I am calculating sum of prices. But here the assumption that accuracy of sum calculation should depend on type of summed field
is non obvious. May be it is more or less clear for C programmers but not for SQL users.
In all database I have tested SUM  of single precision floats is calculated at least using double precision numbers (or using numeric type).

2. There is no huge gap in performance between accumulating  in float4 and float8. There are no "orders of magnitude":
postgres=# select sum(l_quantity) from lineitem_projection;
     sum    
-------------
 1.07374e+09
(1 row)

Time: 4659.509 ms (00:04.660)

postgres=# select sum(l_quantity::float8) from lineitem_projection;
    sum    
------------
 1529738036
(1 row)

Time: 5465.320 ms (00:05.465)


So do not think that there is actually compromise here between performance and accuracy.
But current implementation cause leads to many confusions and contradictions with users expectations:

1. The fact that sum(l_quantity) and sum(l_quantity::float8) are absolutely different (1.5 times!!! - we loose 0.5 milliard dollars:)
2. avg(l_quantity)*count(l_quantity) is not equal to sum(l_quantity)    But in case of casting to float8 result is the same.
3. sum of aggregates for groups is not equal to total sum (once again no problem for float8 type_/

But when I am calculating sum, I expect to receive more or less precise
result. Certainly I realize that even in case of using double it is

... but now you want to trade performance for accuracy? Why would you expect the database to magically come to that conclusion?

Se above. No trading here. Please notice that current Postgres implementation of AVG aggregates calculates at sum and sum of squares even if last one is not needed for AVG.
The comment in the code says:

 * It might seem attractive to optimize this by having multiple accumulator
 * functions that only calculate the sums actually needed.  But on most
 * modern machines, a couple of extra floating-point multiplies will be
 * insignificant compared to the other per-tuple overhead, so I've chosen
 * to minimize code space instead.

And it is true!
In the addition to the results above I can add AVG timing for AVG calculation:

postgres=# select avg(l_quantity) from lineitem_projection;
       avg       
------------------
 25.5015621964919
(1 row)

postgres=# select avg(l_quantity::float8) from lineitem_projection;
       avg       
------------------
 25.5015621964919
(1 row)

Please notice that avg for float is calculated using float4_accum which use float8 accumulator and also calculates sumX2!


Time: 6103.807 ms (00:06.104)



So I do not see reasonable arguments here for using float4pl for sum(float4)!
And I do not know any database which has such strange behavior.
I know that "be as others" or especially "be as Oracle" are never good argument for Postgres community but doing something differently (and IMHO  wrong) without any significant reasons seems to be very strange.


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Sum aggregate calculation for single precsion real

От
Robert Haas
Дата:
On Tue, Feb 14, 2017 at 11:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> You could perhaps make an argument that sum(float4) would have less risk
> of overflow if it accumulated in and returned float8, but frankly that
> seems a bit thin.

I think that's more or less the argument Konstantin is in fact making.
Whether it's a good argument or a thin one is a value judgement.
Personally, I find it somewhere in the middle: I think the way it
works now is reasonable, and I think what he wants would have been
reasonable as well.  However, I find it hard to believe it would be
worth changing now on backward compatibility grounds.  He doesn't like
the way it works currently, but we have no way of knowing how many
people who are happy with the way it works today would become unhappy
if we changed it.  We need a fairly compelling reason to risk breaking
somebody's SQL, and I don't think this rises to that level.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Sum aggregate calculation for single precsion real

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> I think that's more or less the argument Konstantin is in fact making.
> Whether it's a good argument or a thin one is a value judgement.
> Personally, I find it somewhere in the middle: I think the way it
> works now is reasonable, and I think what he wants would have been
> reasonable as well.  However, I find it hard to believe it would be
> worth changing now on backward compatibility grounds.

I agree with that conclusion.  It might have been reasonable to change
it fifteen years ago when we changed the integer sum() implementations,
but I think doing so now would make more people unhappy than it would
make happy; or at least, there's little reason to think the reverse
is true.
        regards, tom lane



Re: [HACKERS] Sum aggregate calculation for single precsion real

От
Claudio Freire
Дата:
On Wed, Feb 15, 2017 at 9:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 14, 2017 at 11:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> You could perhaps make an argument that sum(float4) would have less risk
>> of overflow if it accumulated in and returned float8, but frankly that
>> seems a bit thin.
>
> I think that's more or less the argument Konstantin is in fact making.
> Whether it's a good argument or a thin one is a value judgement.
> Personally, I find it somewhere in the middle: I think the way it
> works now is reasonable, and I think what he wants would have been
> reasonable as well.  However, I find it hard to believe it would be
> worth changing now on backward compatibility grounds.  He doesn't like
> the way it works currently, but we have no way of knowing how many
> people who are happy with the way it works today would become unhappy
> if we changed it.  We need a fairly compelling reason to risk breaking
> somebody's SQL, and I don't think this rises to that level.


I know this is said from time to time in this list, but a third option
that wouldn't break anybody's SQL would be using compensated summation
in the input type.

AFAIK, that can only increase precision, but it will cost cycles. The
impact could however fall below the noise and be worth a try.



Re: [HACKERS] Sum aggregate calculation for single precsion real

От
Greg Stark
Дата:
On 15 February 2017 at 12:52, Robert Haas <robertmhaas@gmail.com> wrote:
> Personally, I find it somewhere in the middle: I think the way it
> works now is reasonable, and I think what he wants would have been
> reasonable as well.  However, I find it hard to believe it would be
> worth changing now on backward compatibility grounds.

Honestly I was quite surprised to discover that sum(float4) worked
this way because I assumed it followed the same pattern as integers.

But I wonder what you mean by backward compatibility grounds. If
sum(float4) returned a float8 the only compatibility issue would be
someone who did "create table as" and then expected to get a float4
column and instead got a float8 column. That seems like a much more
minor corner case than most any other backward incompatible change in
any release.

Moreover, it wouldn't be hard to make sum(float4) use a float8 as an
accumulator and then cast to float4 for the final state. That would be
100% compatible with the existing behaviour aside from producing more
accurate results.

(Though as an aside, I think Konstantin would be much better served by
using integers and storing cents or whatever unit of currency is small
enough. That would actually result in accurate results which neither
float4 nor float8 guarantee.)

-- 
greg



Re: [HACKERS] Sum aggregate calculation for single precsion real

От
Robert Haas
Дата:
On Fri, Feb 17, 2017 at 6:51 AM, Greg Stark <stark@mit.edu> wrote:
> On 15 February 2017 at 12:52, Robert Haas <robertmhaas@gmail.com> wrote:
>> Personally, I find it somewhere in the middle: I think the way it
>> works now is reasonable, and I think what he wants would have been
>> reasonable as well.  However, I find it hard to believe it would be
>> worth changing now on backward compatibility grounds.
>
> Honestly I was quite surprised to discover that sum(float4) worked
> this way because I assumed it followed the same pattern as integers.
>
> But I wonder what you mean by backward compatibility grounds. If
> sum(float4) returned a float8 the only compatibility issue would be
> someone who did "create table as" and then expected to get a float4
> column and instead got a float8 column. That seems like a much more
> minor corner case than most any other backward incompatible change in
> any release.

That is what I mean, yes.  Plus somebody could have a view defined,
and the types would change.  I think it's only possible to consider
that kind of thing as minor because most people don't use float4 for
anything.  Changing the SQL types of things tends to result in fairly
wide-ranging SQL breakage.

> Moreover, it wouldn't be hard to make sum(float4) use a float8 as an
> accumulator and then cast to float4 for the final state. That would be
> 100% compatible with the existing behaviour aside from producing more
> accurate results.

Sure, but if somebody wants that, they can get it already just by a
minor change to the SQL.  The point is that adding up float4 as float4
is a reasonable thing to do.  Adding it as float8 might also be a
reasonable thing to do, but nobody's keeping anybody from doing that.
It's just not the behavior you get out of the box.

> (Though as an aside, I think Konstantin would be much better served by
> using integers and storing cents or whatever unit of currency is small
> enough. That would actually result in accurate results which neither
> float4 nor float8 guarantee.)

Yep.  Or changing the SQL to do sum(f4::float8)::float4, if that's
what he wants.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Sum aggregate calculation for single precsion real

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Feb 17, 2017 at 6:51 AM, Greg Stark <stark@mit.edu> wrote:
>> Moreover, it wouldn't be hard to make sum(float4) use a float8 as an
>> accumulator and then cast to float4 for the final state. That would be
>> 100% compatible with the existing behaviour aside from producing more
>> accurate results.

> Sure, but if somebody wants that, they can get it already just by a
> minor change to the SQL.  The point is that adding up float4 as float4
> is a reasonable thing to do.  Adding it as float8 might also be a
> reasonable thing to do, but nobody's keeping anybody from doing that.

Also, if we changed sum(float4) to work that way, it would become very
hard to sum float4 in float4 --- you'd pretty much have to build your
own aggregate function, which is a lot harder than just inserting a
cast.  The argument for changing this boils down to "nobody would ever
want that", which I do not think I buy.  It's been a long time since
I studied numerical analysis, but I think there are applications where
you do want to do that.
        regards, tom lane