Обсуждение: missing estimation for coalesce function
Hi
I have a report from my customer about migration his application from Oracle to Postgres.
The most significant issue was missing correct estimation for coalesce function. He had to rewrite coalesce(var, X) = X to "var IS NULL or var = X". Then the result was very satisfactory.
Example:
create table xxx(a int);
insert into xxx select null from generate_series(1,10000);
insert into xxx select 1 from generate_series(1,1000);
insert into xxx select 0 from generate_series(1,1000);
insert into xxx select null from generate_series(1,10000);
insert into xxx select 1 from generate_series(1,1000);
insert into xxx select 0 from generate_series(1,1000);
analyze xxx;
postgres=# explain analyze select * from xxx where coalesce(a, 0) = 0;
QUERY PLAN
----------------------------------------------------------------------------------------------------
Seq Scan on xxx (cost=0.00..194.00 rows=60 width=4) (actual time=0.041..4.276 rows=11000 loops=1)
Filter: (COALESCE(a, 0) = 0)
Rows Removed by Filter: 1000
Planning Time: 0.099 ms
Execution Time: 5.412 ms
(5 rows)
QUERY PLAN
----------------------------------------------------------------------------------------------------
Seq Scan on xxx (cost=0.00..194.00 rows=60 width=4) (actual time=0.041..4.276 rows=11000 loops=1)
Filter: (COALESCE(a, 0) = 0)
Rows Removed by Filter: 1000
Planning Time: 0.099 ms
Execution Time: 5.412 ms
(5 rows)
postgres=# explain analyze select * from xxx where a is null or a = 0;
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Seq Scan on xxx (cost=0.00..194.00 rows=10167 width=4) (actual time=0.052..5.891 rows=11000 loops=1)
Filter: ((a IS NULL) OR (a = 0))
Rows Removed by Filter: 1000
Planning Time: 0.136 ms
Execution Time: 7.522 ms
(5 rows)
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Seq Scan on xxx (cost=0.00..194.00 rows=10167 width=4) (actual time=0.052..5.891 rows=11000 loops=1)
Filter: ((a IS NULL) OR (a = 0))
Rows Removed by Filter: 1000
Planning Time: 0.136 ms
Execution Time: 7.522 ms
(5 rows)
I think so pattern coalesce(var, X) = X is very common so can be very interesting to support it better.
Regards
Pavel
On Wed, Nov 27, 2019 at 08:47:56AM +0100, Pavel Stehule wrote: > Hi > > I have a report from my customer about migration his application from > Oracle to Postgres. > > The most significant issue was missing correct estimation for coalesce > function. He had to rewrite coalesce(var, X) = X to "var IS NULL or var = > X". Then the result was very satisfactory. > > Example: > > create table xxx(a int); > insert into xxx select null from generate_series(1,10000); > insert into xxx select 1 from generate_series(1,1000); > insert into xxx select 0 from generate_series(1,1000); > analyze xxx; > > postgres=# explain analyze select * from xxx where coalesce(a, 0) = 0; > QUERY PLAN > > ---------------------------------------------------------------------------------------------------- > Seq Scan on xxx (cost=0.00..194.00 rows=60 width=4) (actual > time=0.041..4.276 rows=11000 loops=1) > Filter: (COALESCE(a, 0) = 0) > Rows Removed by Filter: 1000 > Planning Time: 0.099 ms > Execution Time: 5.412 ms > (5 rows) > > postgres=# explain analyze select * from xxx where a is null or a = 0; > QUERY PLAN > > ------------------------------------------------------------------------------------------------------- > Seq Scan on xxx (cost=0.00..194.00 rows=10167 width=4) (actual > time=0.052..5.891 rows=11000 loops=1) > Filter: ((a IS NULL) OR (a = 0)) > Rows Removed by Filter: 1000 > Planning Time: 0.136 ms > Execution Time: 7.522 ms > (5 rows) > > I think so pattern coalesce(var, X) = X is very common so can be very > interesting to support it better. Better support sounds great! How specifically might this be better supported? On this relatively short table, I see planning times considerably longer, I assume because they need to take a function call into account, and execution times longer but not all that much longer. I tried with 3 million rows, and got the representative samples below: shackle@[local]:5413/ctest(13devel)(149711) # EXPLAIN ANALYZE SELECT * FROM xxx WHERE COALESCE(a, 0)=0; QUERY PLAN ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════ Gather (cost=1000.00..30391.00 rows=15000 width=4) (actual time=1.315..346.406 rows=999772 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Seq Scan on xxx (cost=0.00..27891.00 rows=6250 width=4) (actual time=0.029..216.419 rows=333257 loops=3) Filter: (COALESCE(a, 0) = 0) Rows Removed by Filter: 666743 Planning Time: 0.204 ms Execution Time: 389.307 ms (8 rows) Time: 391.394 ms shackle@[local]:5413/ctest(13devel)(149711) # EXPLAIN ANALYZE SELECT * FROM xxx WHERE a IS NULL OR a = 0; QUERY PLAN ═════════════════════════════════════════════════════════════════════════════════════════════════════════════ Seq Scan on xxx (cost=0.00..49766.00 rows=995700 width=4) (actual time=0.043..524.401 rows=999772 loops=1) Filter: ((a IS NULL) OR (a = 0)) Rows Removed by Filter: 2000228 Planning Time: 0.106 ms Execution Time: 560.593 ms (5 rows) Time: 561.186 ms Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Hi
čt 28. 11. 2019 v 3:56 odesílatel David Fetter <david@fetter.org> napsal:
On Wed, Nov 27, 2019 at 08:47:56AM +0100, Pavel Stehule wrote:
> Hi
>
> I have a report from my customer about migration his application from
> Oracle to Postgres.
>
> The most significant issue was missing correct estimation for coalesce
> function. He had to rewrite coalesce(var, X) = X to "var IS NULL or var =
> X". Then the result was very satisfactory.
>
> Example:
>
> create table xxx(a int);
> insert into xxx select null from generate_series(1,10000);
> insert into xxx select 1 from generate_series(1,1000);
> insert into xxx select 0 from generate_series(1,1000);
> analyze xxx;
>
> postgres=# explain analyze select * from xxx where coalesce(a, 0) = 0;
> QUERY PLAN
>
> ----------------------------------------------------------------------------------------------------
> Seq Scan on xxx (cost=0.00..194.00 rows=60 width=4) (actual
> time=0.041..4.276 rows=11000 loops=1)
> Filter: (COALESCE(a, 0) = 0)
> Rows Removed by Filter: 1000
> Planning Time: 0.099 ms
> Execution Time: 5.412 ms
> (5 rows)
>
> postgres=# explain analyze select * from xxx where a is null or a = 0;
> QUERY PLAN
>
> -------------------------------------------------------------------------------------------------------
> Seq Scan on xxx (cost=0.00..194.00 rows=10167 width=4) (actual
> time=0.052..5.891 rows=11000 loops=1)
> Filter: ((a IS NULL) OR (a = 0))
> Rows Removed by Filter: 1000
> Planning Time: 0.136 ms
> Execution Time: 7.522 ms
> (5 rows)
>
> I think so pattern coalesce(var, X) = X is very common so can be very
> interesting to support it better.
Better support sounds great!
How specifically might this be better supported? On this relatively
short table, I see planning times considerably longer, I assume
because they need to take a function call into account, and execution
times longer but not all that much longer. I tried with 3 million
rows, and got the representative samples below:
shackle@[local]:5413/ctest(13devel)(149711) # EXPLAIN ANALYZE SELECT * FROM xxx WHERE COALESCE(a, 0)=0;
QUERY PLAN
══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Gather (cost=1000.00..30391.00 rows=15000 width=4) (actual time=1.315..346.406 rows=999772 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on xxx (cost=0.00..27891.00 rows=6250 width=4) (actual time=0.029..216.419 rows=333257 loops=3)
Filter: (COALESCE(a, 0) = 0)
Rows Removed by Filter: 666743
Planning Time: 0.204 ms
Execution Time: 389.307 ms
(8 rows)
Time: 391.394 ms
shackle@[local]:5413/ctest(13devel)(149711) # EXPLAIN ANALYZE SELECT * FROM xxx WHERE a IS NULL OR a = 0;
QUERY PLAN
═════════════════════════════════════════════════════════════════════════════════════════════════════════════
Seq Scan on xxx (cost=0.00..49766.00 rows=995700 width=4) (actual time=0.043..524.401 rows=999772 loops=1)
Filter: ((a IS NULL) OR (a = 0))
Rows Removed by Filter: 2000228
Planning Time: 0.106 ms
Execution Time: 560.593 ms
(5 rows)
Time: 561.186 ms
I didn't thing about rewriting. The correct solution should be via own selectivity function. Now for coalesce is used 5% estimation (like for other functions). Probably it should not be hard code because coalesce is a node already. But it is part of code that I never modified.
Pavel
Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
čt 28. 11. 2019 v 4:48 odesílatel Pavel Stehule <pavel.stehule@gmail.com> napsal:
Hičt 28. 11. 2019 v 3:56 odesílatel David Fetter <david@fetter.org> napsal:On Wed, Nov 27, 2019 at 08:47:56AM +0100, Pavel Stehule wrote:
> Hi
>
> I have a report from my customer about migration his application from
> Oracle to Postgres.
>
> The most significant issue was missing correct estimation for coalesce
> function. He had to rewrite coalesce(var, X) = X to "var IS NULL or var =
> X". Then the result was very satisfactory.
>
> Example:
>
> create table xxx(a int);
> insert into xxx select null from generate_series(1,10000);
> insert into xxx select 1 from generate_series(1,1000);
> insert into xxx select 0 from generate_series(1,1000);
> analyze xxx;
>
> postgres=# explain analyze select * from xxx where coalesce(a, 0) = 0;
> QUERY PLAN
>
> ----------------------------------------------------------------------------------------------------
> Seq Scan on xxx (cost=0.00..194.00 rows=60 width=4) (actual
> time=0.041..4.276 rows=11000 loops=1)
> Filter: (COALESCE(a, 0) = 0)
> Rows Removed by Filter: 1000
> Planning Time: 0.099 ms
> Execution Time: 5.412 ms
> (5 rows)
>
> postgres=# explain analyze select * from xxx where a is null or a = 0;
> QUERY PLAN
>
> -------------------------------------------------------------------------------------------------------
> Seq Scan on xxx (cost=0.00..194.00 rows=10167 width=4) (actual
> time=0.052..5.891 rows=11000 loops=1)
> Filter: ((a IS NULL) OR (a = 0))
> Rows Removed by Filter: 1000
> Planning Time: 0.136 ms
> Execution Time: 7.522 ms
> (5 rows)
>
> I think so pattern coalesce(var, X) = X is very common so can be very
> interesting to support it better.
Better support sounds great!
How specifically might this be better supported? On this relatively
short table, I see planning times considerably longer, I assume
because they need to take a function call into account, and execution
times longer but not all that much longer. I tried with 3 million
rows, and got the representative samples below:
shackle@[local]:5413/ctest(13devel)(149711) # EXPLAIN ANALYZE SELECT * FROM xxx WHERE COALESCE(a, 0)=0;
QUERY PLAN
══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Gather (cost=1000.00..30391.00 rows=15000 width=4) (actual time=1.315..346.406 rows=999772 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on xxx (cost=0.00..27891.00 rows=6250 width=4) (actual time=0.029..216.419 rows=333257 loops=3)
Filter: (COALESCE(a, 0) = 0)
Rows Removed by Filter: 666743
Planning Time: 0.204 ms
Execution Time: 389.307 ms
(8 rows)
Time: 391.394 ms
shackle@[local]:5413/ctest(13devel)(149711) # EXPLAIN ANALYZE SELECT * FROM xxx WHERE a IS NULL OR a = 0;
QUERY PLAN
═════════════════════════════════════════════════════════════════════════════════════════════════════════════
Seq Scan on xxx (cost=0.00..49766.00 rows=995700 width=4) (actual time=0.043..524.401 rows=999772 loops=1)
Filter: ((a IS NULL) OR (a = 0))
Rows Removed by Filter: 2000228
Planning Time: 0.106 ms
Execution Time: 560.593 ms
(5 rows)
Time: 561.186 msI didn't thing about rewriting. The correct solution should be via own selectivity function. Now for coalesce is used 5% estimation (like for other functions). Probably it should not be hard code because coalesce is a node already. But it is part of code that I never modified.
but support functions can be used
postgres=# create table test(id integer);
CREATE TABLE
postgres=# insert into test select generate_series(1,100000);
INSERT 0 100000
postgres=# insert into test select null from generate_series(1,1000);
INSERT 0 1000
postgres=# analyze test;
ANALYZE
postgres=# create index on test(id);
CREATE INDEX
postgres=# explain analyze select * from test where coalesce(id, 10) = 10;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Seq Scan on test (cost=0.00..1708.50 rows=505 width=4) (actual time=0.062..18.370 rows=1001 loops=1) │
│ Filter: (COALESCE(id, 10) = 10) │
│ Rows Removed by Filter: 99999 │
│ Planning Time: 37.212 ms │
│ Execution Time: 18.479 ms │
└───────────────────────────────────────────────────────────────────────────────────────────────────────┘
(5 rows)
postgres=# explain analyze select * from test where id is null or id = 10;
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Bitmap Heap Scan on test (cost=24.30..482.35 rows=964 width=4) (actual time=0.197..0.334 rows=1001 loops=1) │
│ Recheck Cond: ((id IS NULL) OR (id = 10)) │
│ Heap Blocks: exact=5 │
│ -> BitmapOr (cost=24.30..24.30 rows=964 width=0) (actual time=0.189..0.189 rows=0 loops=1) │
│ -> Bitmap Index Scan on test_id_idx (cost=0.00..19.52 rows=963 width=0) (actual time=0.170..0.170 rows=1000 loops=1) │
│ Index Cond: (id IS NULL) │
│ -> Bitmap Index Scan on test_id_idx (cost=0.00..4.30 rows=1 width=0) (actual time=0.019..0.019 rows=1 loops=1) │
│ Index Cond: (id = 10) │
│ Planning Time: 0.090 ms │
│ Execution Time: 0.413 ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(10 rows)
CREATE TABLE
postgres=# insert into test select generate_series(1,100000);
INSERT 0 100000
postgres=# insert into test select null from generate_series(1,1000);
INSERT 0 1000
postgres=# analyze test;
ANALYZE
postgres=# create index on test(id);
CREATE INDEX
postgres=# explain analyze select * from test where coalesce(id, 10) = 10;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Seq Scan on test (cost=0.00..1708.50 rows=505 width=4) (actual time=0.062..18.370 rows=1001 loops=1) │
│ Filter: (COALESCE(id, 10) = 10) │
│ Rows Removed by Filter: 99999 │
│ Planning Time: 37.212 ms │
│ Execution Time: 18.479 ms │
└───────────────────────────────────────────────────────────────────────────────────────────────────────┘
(5 rows)
postgres=# explain analyze select * from test where id is null or id = 10;
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Bitmap Heap Scan on test (cost=24.30..482.35 rows=964 width=4) (actual time=0.197..0.334 rows=1001 loops=1) │
│ Recheck Cond: ((id IS NULL) OR (id = 10)) │
│ Heap Blocks: exact=5 │
│ -> BitmapOr (cost=24.30..24.30 rows=964 width=0) (actual time=0.189..0.189 rows=0 loops=1) │
│ -> Bitmap Index Scan on test_id_idx (cost=0.00..19.52 rows=963 width=0) (actual time=0.170..0.170 rows=1000 loops=1) │
│ Index Cond: (id IS NULL) │
│ -> Bitmap Index Scan on test_id_idx (cost=0.00..4.30 rows=1 width=0) (actual time=0.019..0.019 rows=1 loops=1) │
│ Index Cond: (id = 10) │
│ Planning Time: 0.090 ms │
│ Execution Time: 0.413 ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(10 rows)
There can be strong benefit from replacement if indexes are used.
Pavel
Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Wed, 2019-11-27 at 08:47 +0100, Pavel Stehule wrote: > The most significant issue was missing correct estimation for coalesce function. > He had to rewrite coalesce(var, X) = X to "var IS NULL or var = X". > Then the result was very satisfactory. > > postgres=# explain analyze select * from xxx where coalesce(a, 0) = 0; > QUERY PLAN > ---------------------------------------------------------------------------------------------------- > Seq Scan on xxx (cost=0.00..194.00 rows=60 width=4) (actual time=0.041..4.276 rows=11000 loops=1) I think that this is asking for a planner support function: https://www.postgresql.org/docs/current/xfunc-optimization.html Yours, Laurenz Albe
čt 28. 11. 2019 v 15:51 odesílatel Laurenz Albe <laurenz.albe@cybertec.at> napsal:
On Wed, 2019-11-27 at 08:47 +0100, Pavel Stehule wrote:
> The most significant issue was missing correct estimation for coalesce function.
> He had to rewrite coalesce(var, X) = X to "var IS NULL or var = X".
> Then the result was very satisfactory.
>
> postgres=# explain analyze select * from xxx where coalesce(a, 0) = 0;
> QUERY PLAN
> ----------------------------------------------------------------------------------------------------
> Seq Scan on xxx (cost=0.00..194.00 rows=60 width=4) (actual time=0.041..4.276 rows=11000 loops=1)
I think that this is asking for a planner support function:
https://www.postgresql.org/docs/current/xfunc-optimization.html
Probably it needs more work - currently this support is for SRF function or for boolean functions.
On second hand coalesce is not function - it's expr node. Originally I though so selectivity function can be enough. Now I think so it is not enough. It is similar to DISTINCT FROM operator.
So some plan can look like
1. introduction isnull_or_eq operator
2. this operator can be used for indexscan too
3. implement selectivity function for this operator (and maybe for coalesce)
4. translate COALESCE(var, const) = const --> var isnull_or_eq const
I am not sure if @4 is possible or if some more complex transformations are possible COALESCE(var1, var2) = var2
But what I read about it - MSSQL and Oracle has does this optimization
Regards
Pavel
Yours,
Laurenz Albe