Re: Weirdly pesimistic estimates in optimizer
From        | Tomas Vondra
Subject     | Re: Weirdly pesimistic estimates in optimizer
Date        |
Msg-id      | 54F4C835.8030902@2ndquadrant.com
In reply to | Weirdly pesimistic estimates in optimizer  (David Kubečka <kubecka.dav@gmail.com>)
Responses   | Re: Weirdly pesimistic estimates in optimizer  (David Kubečka <kubecka.dav@gmail.com>)
List        | pgsql-hackers
Hi David ;-)

On 2.3.2015 20:19, David Kubečka wrote:
>
> The question is why the optimizer, or rather the cost estimator,
> produced such different estimates upon a very small change in input.
> Moreover, it seems that the main culprit of the bad estimates isn't
> actually directly related to the outer table, but rather to the inner
> table. Just compare the estimates for the two index scans:
>
> With 'random_fk_dupl':
>   ->  Index Scan using facts_fk_idx on facts  (cost=0.42..5.75
>       rows=100 width=15) (actual time=0.009..0.117 rows=98 loops=100)
> With 'random_fk_uniq':
>   ->  Index Scan using facts_fk_idx on facts  (cost=0.42..214.26
>       rows=100 width=15) (actual time=0.007..0.109 rows=98 loops=100)
>
> I have read the optimizer README file and also looked briefly at the
> code, but this seems to be something not related to a particular
> implementation of an algorithm (e.g. nested loop). Perhaps it's the
> way cost estimates are propagated down (or sideways? that would be
> weird...) the query tree. But I am really not sure, since this is my
> first time looking at the optimizer code base. I should also add that
> I have reproduced this behaviour on all versions of Pg from 9.2 up to
> current devel.

Interesting. I've only spent a few minutes looking at this, but it certainly is a bit strange that using a smaller table with unique values results in a slower plan than using a table with duplicates (but the same number of unique values).
Another observation is that the planner does see other plans, because tweaking the cost variables like this:

    SET random_page_cost = 2;

produces this plan (which on my machine runs just a tad slower than the first query in your post):

                               QUERY PLAN
    ----------------------------------------------------------------------
     HashAggregate  (cost=10564.81..10688.47 rows=9893 width=15)
       Group Key: facts.fk
       ->  Nested Loop  (cost=5.42..10515.34 rows=9893 width=15)
             ->  HashAggregate  (cost=2.24..3.23 rows=99 width=4)
                   Group Key: random_fk_uniq.fk
                   ->  Seq Scan on random_fk_uniq  (cost=0.00..1.99 ...)
             ->  Bitmap Heap Scan on facts  (cost=3.18..105.18 rows=100 ...)
                   Recheck Cond: (fk = random_fk_uniq.fk)
                   ->  Bitmap Index Scan on facts_fk_idx  (cost=0.00..3.15 ...)
                         Index Cond: (fk = random_fk_uniq.fk)

I can get similar results by setting cpu_operator_cost = 0.005. And further lowering random_page_cost to 1.1 actually gets me this:

                               QUERY PLAN
    ----------------------------------------------------------------------
     HashAggregate  (cost=6185.41..6309.08 rows=9893 width=15)
       Group Key: facts.fk
       ->  Nested Loop  (cost=2.66..6135.95 rows=9893 width=15)
             ->  HashAggregate  (cost=2.24..3.23 rows=99 width=4)
                   Group Key: random_fk_uniq.fk
                   ->  Seq Scan on random_fk_uniq  (cost=0.00..1.99 ...)
             ->  Index Scan using facts_fk_idx on facts  (cost=0.42..60.95 ...)
                   Index Cond: (fk = random_fk_uniq.fk)

which is exactly the first plan from your post. I still don't understand why using the smaller table produces a less efficient plan, but the estimated costs are very close (20013 vs. 18147 is a very small difference in this context), and the estimator shoots for minimal average error. So it may seem 'weirdly pessimistic' here, but it might be equally optimistic for other queries. Also, keep in mind that the cost model is just a simplification of reality, so it can't be correct 100% of the time.
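[For reference, the cost-variable tweaks above can be tried without affecting other sessions by scoping them to a single transaction. This is only a sketch: the exact query from the original post is not quoted in this message, so the SELECT below is an assumed aggregate over facts restricted to the keys in random_fk_uniq.]

```sql
-- Assumed schema from the thread: facts(fk, f) and random_fk_uniq(fk).
BEGIN;
-- SET LOCAL scopes the change to this transaction only.
SET LOCAL random_page_cost = 2;        -- default is 4
SET LOCAL cpu_operator_cost = 0.005;   -- default is 0.0025

EXPLAIN ANALYZE
SELECT facts.fk, sum(facts.f)
FROM facts
WHERE facts.fk IN (SELECT fk FROM random_fk_uniq)
GROUP BY facts.fk;

ROLLBACK;  -- discards the SET LOCAL changes along with the transaction
```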
The fact that this small difference in costs results in a significant difference in duration suggests that the default optimizer cost values do not reflect your environment. For example, you assume that this is very fast:

    NestLoop
      ->  Seq Scan on A
      ->  Index Scan using B_Y_IDX on B
            Index Condition: B.Y = A.X

but index scans often produce a lot of random I/O, and that may be quite expensive if it really hits the storage. That's what the default value (random_page_cost=4) is set for. But if you're on a system with lots of RAM, or an SSD, ... that simply is not the case, and the defaults may penalize index scans (and prefer more sequential workloads).

The other thing you might do is create an index on (fk,f), which on the new releases produces this:

                               QUERY PLAN
    --------------------------------------------------------------------------
     HashAggregate  (cost=783.77..907.43 rows=9893 width=15)
       Group Key: facts.fk
       ->  Nested Loop  (cost=2.66..734.30 rows=9893 width=15)
             ->  HashAggregate  (cost=2.24..3.23 rows=99 width=4)
                   Group Key: random_fk_uniq.fk
                   ->  Seq Scan on random_fk_uniq  (cost=0.00..1.99 ...)
             ->  Index Only Scan using facts_2_idx on facts  (cost=...)
                   Index Cond: (fk = random_fk_uniq.fk)

which limits the amount of random I/O by scanning only the index, and is even faster than the first query.

Regards

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
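[The covering index suggested in the message above might be set up as sketched below. The actual DDL is not shown in the thread; the index name facts_2_idx and the columns (fk, f) are taken from the quoted plan.]

```sql
-- Covering index so the aggregate can be answered by an index-only scan:
-- both the join key (fk) and the aggregated column (f) live in the index,
-- so the heap need not be visited for visible-to-all rows.
CREATE INDEX facts_2_idx ON facts (fk, f);

-- Index-only scans depend on the visibility map being up to date:
VACUUM ANALYZE facts;
```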