Обсуждение: BUG #16625: Query Optimizer - Performance bug related to removal of unnecessary aggregate function
BUG #16625: Query Optimizer - Performance bug related to removal of unnecessary aggregate function
От
PG Bug reporting form
Дата:
The following bug has been logged on the website: Bug reference: 16625 Logged by: XINYU LIU Email address: xinyuliu@umich.edu PostgreSQL version: 12.3 Operating system: Ubuntu 19.10 Description: Hello, We are developing a tool for automatically finding performance bugs in PostgreSQL. Our key insight is that given a pair of semantic equivalent queries, a robust DBMS should return the same result within a similar execution time. Significant time difference suggests a potential performance bug in the DBMS. We are sharing a pair of TPC-H queries that exhibit a potential performance bug in this report: First query: SELECT "s_suppkey" FROM "supplier" WHERE s_suppkey > 100; Second query: SELECT "s_suppkey" FROM "supplier" WHERE s_suppkey > 100 GROUP BY s_suppkey; [Actual Behavior] We executed both queries on the TPC-H benchmark of scale factor 5: the first query takes only 17 millisecond, while the second query takes 42 millisecond. We think the time difference results from different plans selected. [Query Execution Plan] First query: QUERY PLAN -------------------------------------------------------------------------------------------------------------- Seq Scan on supplier (cost=0.00..1734.00 rows=49909 width=4) (actual time=0.047..14.898 rows=49900 loops=1) Filter: (s_suppkey > 100) Rows Removed by Filter: 100 Planning Time: 0.639 ms Execution Time: 17.469 ms (5 rows) Second query: QUERY PLAN -------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=1858.77..2357.86 rows=49909 width=4) (actual time=30.093..38.541 rows=49900 loops=1) Group Key: s_suppkey -> Seq Scan on supplier (cost=0.00..1734.00 rows=49909 width=4) (actual time=0.047..13.656 rows=49900 loops=1) Filter: (s_suppkey > 100) Rows Removed by Filter: 100 Planning Time: 0.669 ms Execution Time: 42.270 ms (7 rows) [Expected Behavior] Since these two queries are semantically equivalent, we were hoping that PostgreSQL will return the same results in roughly the same amount of time. [Test Environment] Ubuntu 19.10 PostgreSQL v12.3 Database: TPC-H benchmark (with scale factor 5) [Steps for reproducing our observations] * Download the dataset from the link: https://drive.google.com/file/d/13rFa1BNDi4e2RmXBn-yEQkcqt6lsBu1c/view?usp=sharing * Set up TPC-H benchmark tar xzvf tpch5_postgresql.tar.gz cd tpch5_postgresql db=tpch5 createdb $db psql -d $db < dss.ddl for i in `ls *.tbl` do echo $i name=`echo $i|cut -d'.' -f1` psql -d $db -c "COPY $name FROM '`pwd`/$i' DELIMITER '|' ENCODING 'LATIN1';" done psql -d $db < dss_postgres.ri *Execute the queries
Re: BUG #16625: Query Optimizer - Performance bug related to removal of unnecessary aggregate function
От
Christophe Pettus
Дата:
> On Sep 18, 2020, at 20:32, PG Bug reporting form <noreply@postgresql.org> wrote: [snip] > First query: > SELECT "s_suppkey" > FROM "supplier" > WHERE s_suppkey > 100; > > Second query: > SELECT "s_suppkey" > FROM "supplier" > WHERE s_suppkey > 100 > GROUP BY s_suppkey; [snip] > [Expected Behavior] > Since these two queries are semantically equivalent, we were hoping that > PostgreSQL will return the same results in roughly the same amount of > time. These two queries are not semantically equivalent, as described. I might guess that the table definition has a primary key(or should have had) on "supplier"."s_suppkey", and thus the GROUP BY should be redundant, but there's nothing in thereport or the query plans that indicate this is the case. -- -- Christophe Pettus xof@thebuild.com
Thank you so much for your reply!
Yes, I should have pointed out in the bug report that "supplier"."s_suppkey" is the primary key and thus the GROUP BY should be redundant.
I am also attaching the result of "\d supplier" to this email:
tpch5=# \d supplier
Table "public.supplier"
Column | Type | Collation | Nullable | Default
-------------+------------------------+-----------+----------+---------
s_suppkey | integer | | not null |
s_name | character(25) | | not null |
s_address | character varying(40) | | not null |
s_nationkey | integer | | not null |
s_phone | character(15) | | not null |
s_acctbal | numeric(15,2) | | not null |
s_comment | character varying(101) | | not null |
Indexes:
"supplier_pkey" PRIMARY KEY, btree (s_suppkey)
Foreign-key constraints:
"supplier_fk1" FOREIGN KEY (s_nationkey) REFERENCES nation(n_nationkey)
Referenced by:
TABLE "partsupp" CONSTRAINT "partsupp_fk1" FOREIGN KEY (ps_suppkey) REFERENCES supplier(s_suppkey)
Yes, I should have pointed out in the bug report that "supplier"."s_suppkey" is the primary key and thus the GROUP BY should be redundant.
I am also attaching the result of "\d supplier" to this email:
tpch5=# \d supplier
Table "public.supplier"
Column | Type | Collation | Nullable | Default
-------------+------------------------+-----------+----------+---------
s_suppkey | integer | | not null |
s_name | character(25) | | not null |
s_address | character varying(40) | | not null |
s_nationkey | integer | | not null |
s_phone | character(15) | | not null |
s_acctbal | numeric(15,2) | | not null |
s_comment | character varying(101) | | not null |
Indexes:
"supplier_pkey" PRIMARY KEY, btree (s_suppkey)
Foreign-key constraints:
"supplier_fk1" FOREIGN KEY (s_nationkey) REFERENCES nation(n_nationkey)
Referenced by:
TABLE "partsupp" CONSTRAINT "partsupp_fk1" FOREIGN KEY (ps_suppkey) REFERENCES supplier(s_suppkey)
We will make sure we conclude such important context information in our future bug reports.
On Sat, Sep 19, 2020 at 11:57 AM Christophe Pettus <xof@thebuild.com> wrote:
> On Sep 18, 2020, at 20:32, PG Bug reporting form <noreply@postgresql.org> wrote:
[snip]
> First query:
> SELECT "s_suppkey"
> FROM "supplier"
> WHERE s_suppkey > 100;
>
> Second query:
> SELECT "s_suppkey"
> FROM "supplier"
> WHERE s_suppkey > 100
> GROUP BY s_suppkey;
[snip]
> [Expected Behavior]
> Since these two queries are semantically equivalent, we were hoping that
> PostgreSQL will return the same results in roughly the same amount of
> time.
These two queries are not semantically equivalent, as described. I might guess that the table definition has a primary key (or should have had) on "supplier"."s_suppkey", and thus the GROUP BY should be redundant, but there's nothing in the report or the query plans that indicate this is the case.
--
-- Christophe Pettus
xof@thebuild.com
-Xinyu
On Sun, 20 Sep 2020 at 03:53, PG Bug reporting form <noreply@postgresql.org> wrote: > We are developing a tool for automatically finding performance bugs in > PostgreSQL. Our key insight is that given a pair of semantic equivalent > queries, a robust DBMS should return the same result within a similar > execution time. Significant time difference suggests a potential performance > bug in the DBMS. It's important to differentiate between bugs and optimisations that PostgreSQL does not perform. I imagine the findings of running this tool is best directed towards a wiki page on https://wiki.postgresql.org/ It's also important to always go ahead and apply such optimisations without any regard to the cost of checking if the optimisation can apply. For optimisations like this, it's not always just a simple case of attempting to apply them regardless. Checking for such cases will penalise queries where the optimisation cannot be applied. The people that benefit are the ones that write bad SQL and the people that lose out are the ones who quite good SQL. That's not a particularly good incentive to write good SQL. In some cases, the cost of checking if the optimisation can be applied will be so negligible that it's worth it as the gains are good if it can be applied. The answer to whether this the case for this particilar optimisation will depend on who you ask. > We are sharing a pair of TPC-H queries that exhibit a potential performance > bug in this report: > > First query: > SELECT "s_suppkey" > FROM "supplier" > WHERE s_suppkey > 100; > > Second query: > SELECT "s_suppkey" > FROM "supplier" > WHERE s_suppkey > 100 > GROUP BY s_suppkey; > > [Actual Behavior] > We executed both queries on the TPC-H benchmark of scale factor 5: the first > query takes only 17 millisecond, while the second query takes 42 > millisecond. We think the time difference results from different plans > selected. Accounting for the information on the other email that mentions s_suppkey is the primary key of the supplier table, this is not a bug. It's simply an optimisation that we currently don't apply. There's a patch around that aims to implement this but it's not yet been applied. So the optimisation may appear in some future version of PostgreSQL. If you'd like to help with that then please look at https://commitfest.postgresql.org/29/2433/ I'd suggest a wiki page would be a good place to note down other possible future optimisations. There's no shortage of possible optimisations that we don't apply, and an endless stream of bug reports is not the way to have new optimisations added to the PostgreSQL planner. David