Обсуждение: BUG #16625: Query Optimizer - Performance bug related to removal of unnecessary aggregate function

Поиск

Список

Период

Сортировка

BUG #16625: Query Optimizer - Performance bug related to removal of unnecessary aggregate function

От

PG Bug reporting form

Дата:

19 сентября 2020 г., 06:32:41

The following bug has been logged on the website:

Bug reference:      16625
Logged by:          XINYU LIU
Email address:      xinyuliu@umich.edu
PostgreSQL version: 12.3
Operating system:   Ubuntu 19.10
Description:

Hello,

We are developing a tool for automatically finding performance bugs in
PostgreSQL. Our key insight is that given a pair of semantic equivalent
queries, a robust DBMS should return the same result within a similar
execution time. Significant time difference suggests a potential performance
bug in the DBMS.

We are sharing a pair of TPC-H queries that exhibit a potential performance
bug in this report:

First query:
SELECT "s_suppkey" 
FROM   "supplier" 
WHERE  s_suppkey > 100;  

Second query:
SELECT "s_suppkey" 
FROM   "supplier" 
WHERE  s_suppkey > 100 
GROUP  BY s_suppkey; 

[Actual Behavior]
We executed both queries on the TPC-H benchmark of scale factor 5: the first
query takes only 17 millisecond, while the second query takes 42
millisecond. We think the time difference results from different plans
selected.

[Query Execution Plan]
First query:
                                                       QUERY PLAN
                                      
--------------------------------------------------------------------------------------------------------------
 Seq Scan on supplier  (cost=0.00..1734.00 rows=49909 width=4) (actual
time=0.047..14.898 rows=49900 loops=1)
   Filter: (s_suppkey > 100)
   Rows Removed by Filter: 100
 Planning Time: 0.639 ms
 Execution Time: 17.469 ms
(5 rows)




Second query:
                                                                      QUERY
PLAN

--------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=1858.77..2357.86 rows=49909 width=4) (actual
time=30.093..38.541 rows=49900 loops=1)
   Group Key: s_suppkey
   ->  Seq Scan on supplier  (cost=0.00..1734.00 rows=49909 width=4) (actual
time=0.047..13.656 rows=49900 loops=1)
         Filter: (s_suppkey > 100)
         Rows Removed by Filter: 100
 Planning Time: 0.669 ms
 Execution Time: 42.270 ms
(7 rows)


[Expected Behavior]
Since these two queries are semantically equivalent, we were hoping that
PostgreSQL will return the same results in roughly the same amount of
time.


[Test Environment]
Ubuntu 19.10
PostgreSQL v12.3
Database: TPC-H benchmark (with scale factor 5)


[Steps for reproducing our observations]
    
* Download the dataset from the link:
https://drive.google.com/file/d/13rFa1BNDi4e2RmXBn-yEQkcqt6lsBu1c/view?usp=sharing
* Set up TPC-H benchmark
tar xzvf tpch5_postgresql.tar.gz
cd tpch5_postgresql
db=tpch5
createdb $db
psql -d $db < dss.ddl
for i in `ls *.tbl`
do
    echo $i
    name=`echo $i|cut -d'.' -f1`
    psql -d $db -c "COPY $name FROM '`pwd`/$i' DELIMITER '|' ENCODING
'LATIN1';"
done
psql -d $db < dss_postgres.ri
*Execute the queries

Re: BUG #16625: Query Optimizer - Performance bug related to removal of unnecessary aggregate function

От

Christophe Pettus

Дата:

19 сентября 2020 г., 18:57:24


> On Sep 18, 2020, at 20:32, PG Bug reporting form <noreply@postgresql.org> wrote:
[snip]
> First query:
> SELECT "s_suppkey"
> FROM   "supplier"
> WHERE  s_suppkey > 100;
>
> Second query:
> SELECT "s_suppkey"
> FROM   "supplier"
> WHERE  s_suppkey > 100
> GROUP  BY s_suppkey;

[snip]

> [Expected Behavior]
> Since these two queries are semantically equivalent, we were hoping that
> PostgreSQL will return the same results in roughly the same amount of
> time.

These two queries are not semantically equivalent, as described.  I might guess that the table definition has a primary
key(or should have had) on "supplier"."s_suppkey", and thus the GROUP BY should be redundant, but there's nothing in
thereport or the query plans that indicate this is the case. 
--
-- Christophe Pettus
   xof@thebuild.com

Re: BUG #16625: Query Optimizer - Performance bug related to removal of unnecessary aggregate function

От

Xinyu Liu

Дата:

19 сентября 2020 г., 19:05:39

We will make sure we conclude such important context information in our future bug reports.

On Sat, Sep 19, 2020 at 11:57 AM Christophe Pettus <xof@thebuild.com> wrote:

> On Sep 18, 2020, at 20:32, PG Bug reporting form <noreply@postgresql.org> wrote:
[snip]
> First query:
> SELECT "s_suppkey"
> FROM "supplier"
> WHERE s_suppkey > 100;
>
> Second query:
> SELECT "s_suppkey"
> FROM "supplier"
> WHERE s_suppkey > 100
> GROUP BY s_suppkey;

[snip]

> [Expected Behavior]
> Since these two queries are semantically equivalent, we were hoping that
> PostgreSQL will return the same results in roughly the same amount of
> time.

These two queries are not semantically equivalent, as described. I might guess that the table definition has a primary key (or should have had) on "supplier"."s_suppkey", and thus the GROUP BY should be redundant, but there's nothing in the report or the query plans that indicate this is the case.
--
-- Christophe Pettus
xof@thebuild.com

-Xinyu

Re: BUG #16625: Query Optimizer - Performance bug related to removal of unnecessary aggregate function

От

David Rowley

Дата:

21 сентября 2020 г., 03:00:09

On Sun, 20 Sep 2020 at 03:53, PG Bug reporting form
<noreply@postgresql.org> wrote:
> We are developing a tool for automatically finding performance bugs in
> PostgreSQL. Our key insight is that given a pair of semantic equivalent
> queries, a robust DBMS should return the same result within a similar
> execution time. Significant time difference suggests a potential performance
> bug in the DBMS.

It's important to differentiate between bugs and optimisations that
PostgreSQL does not perform.   I imagine the findings of running this
tool is best directed towards a wiki page on
https://wiki.postgresql.org/

It's also important to always go ahead and apply such optimisations
without any regard to the cost of checking if the optimisation can
apply.  For optimisations like this, it's not always just a simple
case of attempting to apply them regardless. Checking for such cases
will penalise queries where the optimisation cannot be applied. The
people that benefit are the ones that write bad SQL and the people
that lose out are the ones who quite good SQL. That's not a
particularly good incentive to write good SQL.  In some cases, the
cost of checking if the optimisation can be applied will be so
negligible that it's worth it as the gains are good if it can be
applied.  The answer to whether this the case for this particilar
optimisation will depend on who you ask.

> We are sharing a pair of TPC-H queries that exhibit a potential performance
> bug in this report:
>
> First query:
> SELECT "s_suppkey"
> FROM   "supplier"
> WHERE  s_suppkey > 100;
>
> Second query:
> SELECT "s_suppkey"
> FROM   "supplier"
> WHERE  s_suppkey > 100
> GROUP  BY s_suppkey;
>
> [Actual Behavior]
> We executed both queries on the TPC-H benchmark of scale factor 5: the first
> query takes only 17 millisecond, while the second query takes 42
> millisecond. We think the time difference results from different plans
> selected.

Accounting for the information on the other email that mentions
s_suppkey is the primary key of the supplier table, this is not a bug.
It's simply an optimisation that we currently don't apply. There's a
patch around that aims to implement this but it's not yet been
applied. So the optimisation may appear in some future version of
PostgreSQL. If you'd like to help with that then please look at
https://commitfest.postgresql.org/29/2433/

I'd suggest a wiki page would be a good place to note down other
possible future optimisations. There's no shortage of possible
optimisations that we don't apply, and an endless stream of bug
reports is not the way to have new optimisations added to the
PostgreSQL planner.

David

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: BUG #16625: Query Optimizer - Performance bug related to removal of unnecessary aggregate function

BUG #16625: Query Optimizer - Performance bug related to removal of unnecessary aggregate function

Re: BUG #16625: Query Optimizer - Performance bug related to removal of unnecessary aggregate function

Re: BUG #16625: Query Optimizer - Performance bug related to removal of unnecessary aggregate function

Re: BUG #16625: Query Optimizer - Performance bug related to removal of unnecessary aggregate function