Обсуждение: Poor performance due to parallel seq scan on indexed date field
Dead simple date scan across a big-ish table (est. 23,153,666 rows)
explain analyze select count(*) from vw_pitches where game_date >= '2022-06-21' and game_date <= '2023-06-21';
The view does do some joins but those don't seem to be the issue to me.
Planner does:
Finalize Aggregate (cost=3596993.88..3596993.89 rows=1 width=8) (actual time=69980.491..69982.076 rows=1 loops=1)
-> Gather (cost=3596993.46..3596993.87 rows=4 width=8) (actual time=69979.137..69982.071 rows=5 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Partial Aggregate (cost=3595993.46..3595993.47 rows=1 width=8) (actual time=69975.136..69975.137 rows=1 loops=5)
-> Nested Loop (cost=0.44..3591408.37 rows=1834034 width=0) (actual time=0.882..69875.934 rows=1458419 loops=5)
-> Parallel Seq Scan on pitches p (cost=0.00..3537431.89 rows=1834217 width=12) (actual time=0.852..68914.256 rows=1458419 loops=5)
Filter: ((game_date >= '2022-06-21'::date) AND (game_date <= '2023-06-21'::date))
Rows Removed by Filter: 3212310
-> Memoize (cost=0.44..0.47 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=7292095)
Cache Key: p.pitcher_identity_id
Cache Mode: logical
Hits: 1438004 Misses: 21042 Evictions: 0 Overflows: 0 Memory Usage: 2138kB
Worker 0: Hits: 1429638 Misses: 21010 Evictions: 0 Overflows: 0 Memory Usage: 2134kB
Worker 1: Hits: 1456755 Misses: 21435 Evictions: 0 Overflows: 0 Memory Usage: 2177kB
Worker 2: Hits: 1433557 Misses: 21201 Evictions: 0 Overflows: 0 Memory Usage: 2154kB
Worker 3: Hits: 1428727 Misses: 20726 Evictions: 0 Overflows: 0 Memory Usage: 2105kB
-> Index Only Scan using identity_pkey on identity idpitcher (cost=0.43..0.46 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=105414)
Index Cond: (identity_id = p.pitcher_identity_id)
Heap Fetches: 83
Planning Time: 1.407 ms
Execution Time: 69982.927 ms
-> Gather (cost=3596993.46..3596993.87 rows=4 width=8) (actual time=69979.137..69982.071 rows=5 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Partial Aggregate (cost=3595993.46..3595993.47 rows=1 width=8) (actual time=69975.136..69975.137 rows=1 loops=5)
-> Nested Loop (cost=0.44..3591408.37 rows=1834034 width=0) (actual time=0.882..69875.934 rows=1458419 loops=5)
-> Parallel Seq Scan on pitches p (cost=0.00..3537431.89 rows=1834217 width=12) (actual time=0.852..68914.256 rows=1458419 loops=5)
Filter: ((game_date >= '2022-06-21'::date) AND (game_date <= '2023-06-21'::date))
Rows Removed by Filter: 3212310
-> Memoize (cost=0.44..0.47 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=7292095)
Cache Key: p.pitcher_identity_id
Cache Mode: logical
Hits: 1438004 Misses: 21042 Evictions: 0 Overflows: 0 Memory Usage: 2138kB
Worker 0: Hits: 1429638 Misses: 21010 Evictions: 0 Overflows: 0 Memory Usage: 2134kB
Worker 1: Hits: 1456755 Misses: 21435 Evictions: 0 Overflows: 0 Memory Usage: 2177kB
Worker 2: Hits: 1433557 Misses: 21201 Evictions: 0 Overflows: 0 Memory Usage: 2154kB
Worker 3: Hits: 1428727 Misses: 20726 Evictions: 0 Overflows: 0 Memory Usage: 2105kB
-> Index Only Scan using identity_pkey on identity idpitcher (cost=0.43..0.46 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=105414)
Index Cond: (identity_id = p.pitcher_identity_id)
Heap Fetches: 83
Planning Time: 1.407 ms
Execution Time: 69982.927 ms
Is there something to be done here? Kind of a frequent style of query and quite slow.
--
Wells Oliver
wells.oliver@gmail.com
wells.oliver@gmail.com
Am 21.06.23 um 20:31 schrieb Wells Oliver: > Dead simple date scan across a big-ish table (est. 23,153,666 rows) > > explain analyze select count(*) from vw_pitches where game_date >= > '2022-06-21' and game_date <= '2023-06-21'; > > The view does do some joins but those don't seem to be the issue to me. > > Planner does: > > Finalize Aggregate (cost=3596993.88..3596993.89 rows=1 width=8) > (actual time=69980.491..69982.076 rows=1 loops=1) > -> Gather (cost=3596993.46..3596993.87 rows=4 width=8) (actual > time=69979.137..69982.071 rows=5 loops=1) > Workers Planned: 4 > Workers Launched: 4 > -> Partial Aggregate (cost=3595993.46..3595993.47 rows=1 > width=8) (actual time=69975.136..69975.137 rows=1 loops=5) > -> Nested Loop (cost=0.44..3591408.37 rows=1834034 > width=0) (actual time=0.882..69875.934 rows=1458419 loops=5) > -> Parallel Seq Scan on pitches p > (cost=0.00..3537431.89 rows=1834217 width=12) (actual > time=0.852..68914.256 rows=1458419 loops=5) > Filter: ((game_date >= '2022-06-21'::date) > AND (game_date <= '2023-06-21'::date)) > Rows Removed by Filter: 3212310 > -> Memoize (cost=0.44..0.47 rows=1 width=4) > (actual time=0.000..0.000 rows=1 loops=7292095) > Cache Key: p.pitcher_identity_id > Cache Mode: logical > Hits: 1438004 Misses: 21042 Evictions: 0 > Overflows: 0 Memory Usage: 2138kB > Worker 0: Hits: 1429638 Misses: 21010 > Evictions: 0 Overflows: 0 Memory Usage: 2134kB > Worker 1: Hits: 1456755 Misses: 21435 > Evictions: 0 Overflows: 0 Memory Usage: 2177kB > Worker 2: Hits: 1433557 Misses: 21201 > Evictions: 0 Overflows: 0 Memory Usage: 2154kB > Worker 3: Hits: 1428727 Misses: 20726 > Evictions: 0 Overflows: 0 Memory Usage: 2105kB > -> Index Only Scan using identity_pkey on > identity idpitcher (cost=0.43..0.46 rows=1 width=4) (actual > time=0.007..0.007 rows=1 loops=105414) > Index Cond: (identity_id = > p.pitcher_identity_id) > Heap Fetches: 83 > Planning Time: 1.407 ms > Execution Time: 69982.927 ms > > Is there something to be done here? Kind of a frequent style of query > and quite slow. > Could you provide the definition of the view(s) down to the base tables? -- Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012
Вложения
It's just this.
CREATE OR REPLACE VIEW vw_pitches AS
SELECT
p.year,
p.game_id,
p.game_date,
p.game_level,
...
from synergy.pitches as p
join alias.identity as idpitcher
on p.pitcher_identity_id = idpitcher.identity_id
left join alias.identity as idcatcher
on p.catcher_identity_id = idcatcher.identity_id
left join alias.identity as idbatter
on p.batter_identity_id = idbatter.identity_id;
SELECT
p.year,
p.game_id,
p.game_date,
p.game_level,
...
from synergy.pitches as p
join alias.identity as idpitcher
on p.pitcher_identity_id = idpitcher.identity_id
left join alias.identity as idcatcher
on p.catcher_identity_id = idcatcher.identity_id
left join alias.identity as idbatter
on p.batter_identity_id = idbatter.identity_id;
The alias.identity.identity_id column is indexed.
The main issue is SELECT COUNT(*) over wide date ranges, which is something we'd like to do frequently.
On Wed, Jun 21, 2023 at 11:37 AM Holger Jakobs <holger@jakobs.com> wrote:
Am 21.06.23 um 20:31 schrieb Wells Oliver:
> Dead simple date scan across a big-ish table (est. 23,153,666 rows)
>
> explain analyze select count(*) from vw_pitches where game_date >=
> '2022-06-21' and game_date <= '2023-06-21';
>
> The view does do some joins but those don't seem to be the issue to me.
>
> Planner does:
>
> Finalize Aggregate (cost=3596993.88..3596993.89 rows=1 width=8)
> (actual time=69980.491..69982.076 rows=1 loops=1)
> -> Gather (cost=3596993.46..3596993.87 rows=4 width=8) (actual
> time=69979.137..69982.071 rows=5 loops=1)
> Workers Planned: 4
> Workers Launched: 4
> -> Partial Aggregate (cost=3595993.46..3595993.47 rows=1
> width=8) (actual time=69975.136..69975.137 rows=1 loops=5)
> -> Nested Loop (cost=0.44..3591408.37 rows=1834034
> width=0) (actual time=0.882..69875.934 rows=1458419 loops=5)
> -> Parallel Seq Scan on pitches p
> (cost=0.00..3537431.89 rows=1834217 width=12) (actual
> time=0.852..68914.256 rows=1458419 loops=5)
> Filter: ((game_date >= '2022-06-21'::date)
> AND (game_date <= '2023-06-21'::date))
> Rows Removed by Filter: 3212310
> -> Memoize (cost=0.44..0.47 rows=1 width=4)
> (actual time=0.000..0.000 rows=1 loops=7292095)
> Cache Key: p.pitcher_identity_id
> Cache Mode: logical
> Hits: 1438004 Misses: 21042 Evictions: 0
> Overflows: 0 Memory Usage: 2138kB
> Worker 0: Hits: 1429638 Misses: 21010
> Evictions: 0 Overflows: 0 Memory Usage: 2134kB
> Worker 1: Hits: 1456755 Misses: 21435
> Evictions: 0 Overflows: 0 Memory Usage: 2177kB
> Worker 2: Hits: 1433557 Misses: 21201
> Evictions: 0 Overflows: 0 Memory Usage: 2154kB
> Worker 3: Hits: 1428727 Misses: 20726
> Evictions: 0 Overflows: 0 Memory Usage: 2105kB
> -> Index Only Scan using identity_pkey on
> identity idpitcher (cost=0.43..0.46 rows=1 width=4) (actual
> time=0.007..0.007 rows=1 loops=105414)
> Index Cond: (identity_id =
> p.pitcher_identity_id)
> Heap Fetches: 83
> Planning Time: 1.407 ms
> Execution Time: 69982.927 ms
>
> Is there something to be done here? Kind of a frequent style of query
> and quite slow.
>
Could you provide the definition of the view(s) down to the base tables?
--
Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012
Wells Oliver
wells.oliver@gmail.com
wells.oliver@gmail.com
Is there an index on synergy.pitches.game_date?
It's no guarantee* that will speed up the query, but "index on WHERE clause fields" is the first thing to look for.
* The problem with "over wide date ranges" is that the query optimizer might decide that the date range is so wide that it's just as cheap -- or cheaper -- to scan the whole table.
It's no guarantee* that will speed up the query, but "index on WHERE clause fields" is the first thing to look for.
* The problem with "over wide date ranges" is that the query optimizer might decide that the date range is so wide that it's just as cheap -- or cheaper -- to scan the whole table.
On 6/21/23 13:39, Wells Oliver wrote:
It's just this.CREATE OR REPLACE VIEW vw_pitches AS
SELECT
p.year,
p.game_id,
p.game_date,
p.game_level,
...
from synergy.pitches as p
join alias.identity as idpitcher
on p.pitcher_identity_id = idpitcher.identity_id
left join alias.identity as idcatcher
on p.catcher_identity_id = idcatcher.identity_id
left join alias.identity as idbatter
on p.batter_identity_id = idbatter.identity_id;The alias.identity.identity_id column is indexed.The main issue is SELECT COUNT(*) over wide date ranges, which is something we'd like to do frequently.On Wed, Jun 21, 2023 at 11:37 AM Holger Jakobs <holger@jakobs.com> wrote:
Am 21.06.23 um 20:31 schrieb Wells Oliver:
> Dead simple date scan across a big-ish table (est. 23,153,666 rows)
>
> explain analyze select count(*) from vw_pitches where game_date >=
> '2022-06-21' and game_date <= '2023-06-21';
>
> The view does do some joins but those don't seem to be the issue to me.
>
> Planner does:
>
> Finalize Aggregate (cost=3596993.88..3596993.89 rows=1 width=8)
> (actual time=69980.491..69982.076 rows=1 loops=1)
> -> Gather (cost=3596993.46..3596993.87 rows=4 width=8) (actual
> time=69979.137..69982.071 rows=5 loops=1)
> Workers Planned: 4
> Workers Launched: 4
> -> Partial Aggregate (cost=3595993.46..3595993.47 rows=1
> width=8) (actual time=69975.136..69975.137 rows=1 loops=5)
> -> Nested Loop (cost=0.44..3591408.37 rows=1834034
> width=0) (actual time=0.882..69875.934 rows=1458419 loops=5)
> -> Parallel Seq Scan on pitches p
> (cost=0.00..3537431.89 rows=1834217 width=12) (actual
> time=0.852..68914.256 rows=1458419 loops=5)
> Filter: ((game_date >= '2022-06-21'::date)
> AND (game_date <= '2023-06-21'::date))
> Rows Removed by Filter: 3212310
> -> Memoize (cost=0.44..0.47 rows=1 width=4)
> (actual time=0.000..0.000 rows=1 loops=7292095)
> Cache Key: p.pitcher_identity_id
> Cache Mode: logical
> Hits: 1438004 Misses: 21042 Evictions: 0
> Overflows: 0 Memory Usage: 2138kB
> Worker 0: Hits: 1429638 Misses: 21010
> Evictions: 0 Overflows: 0 Memory Usage: 2134kB
> Worker 1: Hits: 1456755 Misses: 21435
> Evictions: 0 Overflows: 0 Memory Usage: 2177kB
> Worker 2: Hits: 1433557 Misses: 21201
> Evictions: 0 Overflows: 0 Memory Usage: 2154kB
> Worker 3: Hits: 1428727 Misses: 20726
> Evictions: 0 Overflows: 0 Memory Usage: 2105kB
> -> Index Only Scan using identity_pkey on
> identity idpitcher (cost=0.43..0.46 rows=1 width=4) (actual
> time=0.007..0.007 rows=1 loops=105414)
> Index Cond: (identity_id =
> p.pitcher_identity_id)
> Heap Fetches: 83
> Planning Time: 1.407 ms
> Execution Time: 69982.927 ms
>
> Is there something to be done here? Kind of a frequent style of query
> and quite slow.
>
Could you provide the definition of the view(s) down to the base tables?
--
Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012--Wells Oliver
wells.oliver@gmail.com
--
Born in Arizona, moved to Babylonia.
Born in Arizona, moved to Babylonia.
On Wed, 2023-06-21 at 11:31 -0700, Wells Oliver wrote: > Dead simple date scan across a big-ish table (est. 23,153,666 rows) > > explain analyze select count(*) from vw_pitches where game_date >= '2022-06-21' and game_date <= '2023-06-21'; > > The view does do some joins but those don't seem to be the issue to me. > > Planner does: > > Finalize Aggregate (cost=3596993.88..3596993.89 rows=1 width=8) (actual time=69980.491..69982.076 rows=1 loops=1) > -> Gather (cost=3596993.46..3596993.87 rows=4 width=8) (actual time=69979.137..69982.071 rows=5 loops=1) > Workers Planned: 4 > Workers Launched: 4 > -> Partial Aggregate (cost=3595993.46..3595993.47 rows=1 width=8) (actual time=69975.136..69975.137 rows=1 loops=5) > -> Nested Loop (cost=0.44..3591408.37 rows=1834034 width=0) (actual time=0.882..69875.934 rows=1458419loops=5) > -> Parallel Seq Scan on pitches p (cost=0.00..3537431.89 rows=1834217 width=12) (actual time=0.852..68914.256rows=1458419 loops=5) > Filter: ((game_date >= '2022-06-21'::date) AND (game_date <= '2023-06-21'::date)) > Rows Removed by Filter: 3212310 > -> Memoize (cost=0.44..0.47 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=7292095) > Cache Key: p.pitcher_identity_id > Cache Mode: logical > Hits: 1438004 Misses: 21042 Evictions: 0 Overflows: 0 Memory Usage: 2138kB > Worker 0: Hits: 1429638 Misses: 21010 Evictions: 0 Overflows: 0 Memory Usage: 2134kB > Worker 1: Hits: 1456755 Misses: 21435 Evictions: 0 Overflows: 0 Memory Usage: 2177kB > Worker 2: Hits: 1433557 Misses: 21201 Evictions: 0 Overflows: 0 Memory Usage: 2154kB > Worker 3: Hits: 1428727 Misses: 20726 Evictions: 0 Overflows: 0 Memory Usage: 2105kB > -> Index Only Scan using identity_pkey on identity idpitcher (cost=0.43..0.46 rows=1 width=4)(actual time=0.007..0.007 rows=1 loops=105414) > Index Cond: (identity_id = p.pitcher_identity_id) > Heap Fetches: 83 > Planning Time: 1.407 ms > Execution Time: 69982.927 ms > > Is there something to be done here? Kind of a frequent style of query and quite slow. First, make sure that PostgreSQL isn't picking the correct plan. Play with "enable_seqscan" and see if the query becomes slower. Then, make sure that "random_page_cost" is set correctly. For SSDs and other storage where random I/O is not more expensive, 1.1 or 1.0 are better values than the default 4. Finally, try disabling or reducing parallel query. PostgreSQL does not optimize for low resource usage, but for short execution time. Perhaps running the qurey with 5 processes actually is faster this way, but it might use way more resorces than a different execution plan. Yours, Laurenz Albe
On Wed, Jun 21, 2023 at 2:32 PM Wells Oliver <wells.oliver@gmail.com> wrote:
Dead simple date scan across a big-ish table (est. 23,153,666 rows)
Simple, but massive.
You should turn on track_io_timing (if it is not already) and then do EXPLAIN (ANALYZE, BUFFERS)
You haven't shown that poor performance is due to the parallel seq scan, only that one particular plan happens to have both of those features at the same time. If you disable parallel queries (max_parallel_workers_per_gather=0), what plan do you get instead and how long does it take? What if you turn off enable_seqscan altogether?
Cheers,
Jeff
On Wed, Jun 21, 2023 at 2:40 PM Wells Oliver <wells.oliver@gmail.com> wrote:
It's just this.CREATE OR REPLACE VIEW vw_pitches AS
SELECT
p.year,
p.game_id,
p.game_date,
p.game_level,
...
from synergy.pitches as p
join alias.identity as idpitcher
on p.pitcher_identity_id = idpitcher.identity_id
left join alias.identity as idcatcher
on p.catcher_identity_id = idcatcher.identity_id
left join alias.identity as idbatter
on p.batter_identity_id = idbatter.identity_id;
Is there a reason the join to idpitcher is not a left join, like the other joins to alias.identity are? If it were, then this join could also be removed by the planner, and then you wouldn't need access to p.pitcher_identity_id which means it should be able to use an index-only scan on the game_date index.
Alternatively, if you made a multicolumn index over (game_date, pitcher_identity_id), then it could use that index as an index-only scan even with the existing view definition.
Cheers,
Jeff