Обсуждение: Weird behaviour of the planner

Поиск
Список
Период
Сортировка

Weird behaviour of the planner

От
Guillaume Lelarge
Дата:
Hi,

While working on a slow query of one of our customers, I found this behaviour which seems quite weird to me. Here it is:

(no changes on any parameter)
EXPLAIN (ANALYZE) SELECT DISTINCT * FROM gleu2;

                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Unique  (cost=12005.97..12049.20 rows=1 width=1430) (actual time=20055.294..20323.348 rows=1 loops=1)
   ->  Sort  (cost=12005.97..12006.30 rows=132 width=1430) (actual time=20055.290..20105.738 rows=60000 loops=1)
         Sort Key: (... 130 columns ...)
         Sort Method: external sort  Disk: 84464kB
         ->  Seq Scan on gleu2  (cost=0.00..12001.32 rows=132 width=1430) (actual time=0.109..114.142 rows=60000 loops=1)
 Planning time: 10.012 ms
 Execution time: 20337.854 ms
(7 rows)

That looks quite good. The total cost is 12049, so I expect this plan to have the smaller cost as it's the choosen plan. Now, I'm disabling Sort, and here is what I get:

SET enable_sort TO off;
EXPLAIN (ANALYZE) SELECT DISTINCT * FROM gleu2;

                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=12044.22..12044.23 rows=1 width=1430) (actual time=508.342..508.343 rows=1 loops=1)
   Group Key: (... 130 columns ...)
   ->  Seq Scan on gleu2  (cost=0.00..12001.32 rows=132 width=1430) (actual time=0.036..57.088 rows=60000 loops=1)
 Planning time: 1.335 ms
 Execution time: 508.529 ms
(5 rows)

I now have a Hash Aggregate. Sounds good, but I do have a smaller total cost than the former plan. How is it possible that the planner chooses a plan with a higher cost? (and a much much higher duration, but it can't obviously know that).

Still working on my customer's issue, I found that it worked great before commit 3fc6e2d7f5b652b417fa6937c34de2438d60fa9f (Make the upper part of the planner work by generating and comparing Paths.) and not after. That kind of patch is way out of my league to be honest :) But before that commit, the Hash Aggregate is choosen right out of the box. And after that commit, the Sort is choosen even with its higher cost.

Oh, BTW, yeah, I know. The row estimate is quite off... I'm not asking how to make my query go faster, I'm just wondering why PostgreSQL chooses a higher-cost plan :)

Thanks.


--
Guillaume.

Re: Weird behaviour of the planner

От
David Rowley
Дата:
On 2 August 2018 at 02:48, Guillaume Lelarge <guillaume@lelarge.info> wrote:
> EXPLAIN (ANALYZE) SELECT DISTINCT * FROM gleu2;
>
>                                                        QUERY PLAN
>
---------------------------------------------------------------------------------------------------------------------
>  Unique  (cost=12005.97..12049.20 rows=1 width=1430) (actual
> time=20055.294..20323.348 rows=1 loops=1)
>    ->  Sort  (cost=12005.97..12006.30 rows=132 width=1430) (actual
> time=20055.290..20105.738 rows=60000 loops=1)
>          Sort Key: (... 130 columns ...)
>          Sort Method: external sort  Disk: 84464kB
>          ->  Seq Scan on gleu2  (cost=0.00..12001.32 rows=132 width=1430)
> (actual time=0.109..114.142 rows=60000 loops=1)
>  Planning time: 10.012 ms
>  Execution time: 20337.854 ms
> (7 rows)
>
> That looks quite good. The total cost is 12049, so I expect this plan to
> have the smaller cost as it's the choosen plan. Now, I'm disabling Sort, and
> here is what I get:
>
> SET enable_sort TO off;
> EXPLAIN (ANALYZE) SELECT DISTINCT * FROM gleu2;
>
>                                                        QUERY PLAN
>
---------------------------------------------------------------------------------------------------------------------
>  HashAggregate  (cost=12044.22..12044.23 rows=1 width=1430) (actual
> time=508.342..508.343 rows=1 loops=1)
>    Group Key: (... 130 columns ...)
>    ->  Seq Scan on gleu2  (cost=0.00..12001.32 rows=132 width=1430) (actual
> time=0.036..57.088 rows=60000 loops=1)
>  Planning time: 1.335 ms
>  Execution time: 508.529 ms
> (5 rows)

When the planner adds a new path it compares the cost not exactly, but
with a 'fuzz factor'.  It's very likely that the hashagg path did not
make it as it was not fuzzily any cheaper than the unique path. By
default, this fuzz factor is 1%.

It seems in your case the costs don't quite match reality which is
quite likely due to the poor row estimates on "gleu2".  Has that table
been analyzed recently? or is there some reason that auto-vacuum is
not getting to it?

There's a bit more reading of what I'm talking about in
https://github.com/postgres/postgres/blob/master/src/backend/optimizer/util/pathnode.c#L141

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Weird behaviour of the planner

От
Guillaume Lelarge
Дата:
2018-08-01 16:59 GMT+02:00 David Rowley <david.rowley@2ndquadrant.com>:
On 2 August 2018 at 02:48, Guillaume Lelarge <guillaume@lelarge.info> wrote:
> EXPLAIN (ANALYZE) SELECT DISTINCT * FROM gleu2;
>
>                                                        QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------
>  Unique  (cost=12005.97..12049.20 rows=1 width=1430) (actual
> time=20055.294..20323.348 rows=1 loops=1)
>    ->  Sort  (cost=12005.97..12006.30 rows=132 width=1430) (actual
> time=20055.290..20105.738 rows=60000 loops=1)
>          Sort Key: (... 130 columns ...)
>          Sort Method: external sort  Disk: 84464kB
>          ->  Seq Scan on gleu2  (cost=0.00..12001.32 rows=132 width=1430)
> (actual time=0.109..114.142 rows=60000 loops=1)
>  Planning time: 10.012 ms
>  Execution time: 20337.854 ms
> (7 rows)
>
> That looks quite good. The total cost is 12049, so I expect this plan to
> have the smaller cost as it's the choosen plan. Now, I'm disabling Sort, and
> here is what I get:
>
> SET enable_sort TO off;
> EXPLAIN (ANALYZE) SELECT DISTINCT * FROM gleu2;
>
>                                                        QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------
>  HashAggregate  (cost=12044.22..12044.23 rows=1 width=1430) (actual
> time=508.342..508.343 rows=1 loops=1)
>    Group Key: (... 130 columns ...)
>    ->  Seq Scan on gleu2  (cost=0.00..12001.32 rows=132 width=1430) (actual
> time=0.036..57.088 rows=60000 loops=1)
>  Planning time: 1.335 ms
>  Execution time: 508.529 ms
> (5 rows)

When the planner adds a new path it compares the cost not exactly, but
with a 'fuzz factor'.  It's very likely that the hashagg path did not
make it as it was not fuzzily any cheaper than the unique path. By
default, this fuzz factor is 1%.


That may definitely be it. There's not much of a difference in the total costs.

It seems in your case the costs don't quite match reality which is
quite likely due to the poor row estimates on "gleu2".  Has that table
been analyzed recently? or is there some reason that auto-vacuum is
not getting to it?


This is a small test case of a much bigger query joining a large number of tables, materialized views, views (calling functions), etc. The actual plan contains 84 nodes (32 scans, lots of joins, and a few other nodes). The mis-estimate is to follow what the big query gives me.

There's a bit more reading of what I'm talking about in
https://github.com/postgres/postgres/blob/master/src/backend/optimizer/util/pathnode.c#L141


I'm gonna read that. Thank you.


--
Guillaume.