Discussion: Avoiding seq scan over 3.2 million rows

Avoiding seq scan over 3.2 million rows

From:
"Andrus"
Date:
explain analyze SELECT sum(xxx)
   FROM dok JOIN rid USING (dokumnr)
   WHERE  dok.kuupaev BETWEEN '2008-04-01' AND '2008-04-30'

"Hash Join  (cost=29584.84..308259.32 rows=142686 width=0) (actual
time=68510.748..96932.174 rows=117883 loops=1)"
"  Hash Cond: ("outer".dokumnr = "inner".dokumnr)"
"  ->  Seq Scan on rid  (cost=0.00..195599.68 rows=3217368 width=4) (actual
time=17.130..56572.857 rows=3247363 loops=1)"
"  ->  Hash  (cost=29243.76..29243.76 rows=53231 width=4) (actual
time=15878.782..15878.782 rows=44685 loops=1)"
"        ->  Index Scan using dok_kuupaev_idx on dok  (cost=0.00..29243.76
rows=53231 width=4) (actual time=47.102..15651.059 rows=44685 loops=1)"
"              Index Cond: ((kuupaev >= '2008-04-01'::date) AND (kuupaev <=
'2008-04-30'::date))"
"Total runtime: 97364.282 ms"

The query performs a seq scan over 3.2 million rows.

dokumnr is of type int, kuupaev is of type DATE.
There are regular indexes on

rid(dokumnr)
dok(dokumnr)
dok(kuupaev)

Vacuum is running automatically.
How can I speed up this query?

Andrus.

"PostgreSQL 8.1.4 on i686-pc-linux-gnu, compiled by GCC
i686-pc-linux-gnu-gcc (GCC) 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0,
pie-8.7.9)"


Re: Avoiding seq scan over 3.2 million rows

From:
Tom Lane
Date:
"Andrus" <kobruleht2@hot.ee> writes:
> explain analyze SELECT sum(xxx)
>    FROM dok JOIN rid USING (dokumnr)
>    WHERE  dok.kuupaev BETWEEN '2008-04-01' AND '2008-04-30'

> "Hash Join  (cost=29584.84..308259.32 rows=142686 width=0) (actual
> time=68510.748..96932.174 rows=117883 loops=1)"
> "  Hash Cond: ("outer".dokumnr = "inner".dokumnr)"
> "  ->  Seq Scan on rid  (cost=0.00..195599.68 rows=3217368 width=4) (actual
> time=17.130..56572.857 rows=3247363 loops=1)"
> "  ->  Hash  (cost=29243.76..29243.76 rows=53231 width=4) (actual
> time=15878.782..15878.782 rows=44685 loops=1)"
> "        ->  Index Scan using dok_kuupaev_idx on dok  (cost=0.00..29243.76
> rows=53231 width=4) (actual time=47.102..15651.059 rows=44685 loops=1)"
> "              Index Cond: ((kuupaev >= '2008-04-01'::date) AND (kuupaev <=
> '2008-04-30'::date))"
> "Total runtime: 97364.282 ms"

> The query performs a seq scan over 3.2 million rows.

There isn't anything particularly wrong with that plan.  The alternative
that you seem to be wishing for would involve ~50000 index probes into
"rid", which is hardly going to be free.

You could try reducing random_page_cost to push the planner in the
direction of preferring the indexscan solution, but whether this is
actually better in your situation remains to be seen.

            regards, tom lane
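
Tom's suggestion can be tried per-session before touching postgresql.conf. This is a minimal sketch; the value 2 below is just an illustrative starting point (the default for random_page_cost in 8.1 is 4):

```sql
-- Lower random_page_cost for this session only, nudging the planner
-- toward index scans, then re-run the query to compare plans.
SET random_page_cost = 2;

EXPLAIN ANALYZE SELECT sum(xxx)
  FROM dok JOIN rid USING (dokumnr)
 WHERE dok.kuupaev BETWEEN '2008-04-01' AND '2008-04-30';

-- Revert to the configured value afterwards.
RESET random_page_cost;
```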

Re: Avoiding seq scan over 3.2 million rows

From:
tv@fuzzy.cz
Date:
> "Andrus" <kobruleht2@hot.ee> writes:
>> explain analyze SELECT sum(xxx)
>>    FROM dok JOIN rid USING (dokumnr)
>>    WHERE  dok.kuupaev BETWEEN '2008-04-01' AND '2008-04-30'
>
>> "Hash Join  (cost=29584.84..308259.32 rows=142686 width=0) (actual
>> time=68510.748..96932.174 rows=117883 loops=1)"
>> "  Hash Cond: ("outer".dokumnr = "inner".dokumnr)"
>> "  ->  Seq Scan on rid  (cost=0.00..195599.68 rows=3217368 width=4)
>> (actual
>> time=17.130..56572.857 rows=3247363 loops=1)"
>> "  ->  Hash  (cost=29243.76..29243.76 rows=53231 width=4) (actual
>> time=15878.782..15878.782 rows=44685 loops=1)"
>> "        ->  Index Scan using dok_kuupaev_idx on dok
>> (cost=0.00..29243.76
>> rows=53231 width=4) (actual time=47.102..15651.059 rows=44685 loops=1)"
>> "              Index Cond: ((kuupaev >= '2008-04-01'::date) AND (kuupaev
>> <=
>> '2008-04-30'::date))"
>> "Total runtime: 97364.282 ms"
>
>> The query performs a seq scan over 3.2 million rows.
>
> There isn't anything particularly wrong with that plan.  The alternative
> that you seem to be wishing for would involve ~50000 index probes into
> "rid", which is hardly going to be free.
>
> You could try reducing random_page_cost to push the planner in the
> direction of preferring the indexscan solution, but whether this is
> actually better in your situation remains to be seen.

Or just set "enable_seqscan = off" - that penalizes the seq scan so the
planner chooses an index scan instead. You can then run EXPLAIN ANALYZE
again to see how fast the index scan actually is in this case.
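
A minimal sketch of that experiment (enable_seqscan only affects the current session; remember to turn it back on):

```sql
-- Disable seq scans for this session to force the index-scan plan,
-- then compare the EXPLAIN ANALYZE timings against the original plan.
SET enable_seqscan = off;

EXPLAIN ANALYZE SELECT sum(xxx)
  FROM dok JOIN rid USING (dokumnr)
 WHERE dok.kuupaev BETWEEN '2008-04-01' AND '2008-04-30';

-- Restore the default so other queries are not affected.
SET enable_seqscan = on;
```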

You might also try some "standard" optimizations - I'd try clustering the
'rid' table on the "dokumnr" column, though I doubt it will outperform the
seq scan in this case. If the rows are wide, you can "compact" the table
into a narrower copy so it reads faster during the seq scan. Try
something like

CREATE TABLE rid2 AS SELECT dokumnr, xxx FROM rid;
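
For completeness, the clustering idea could look like the sketch below. The index name rid_dokumnr_idx is an assumption (the original post only says an index on rid(dokumnr) exists, not what it is called), and the syntax shown is the 8.1 form of CLUSTER:

```sql
-- Physically reorder rid by the index on dokumnr (8.1 syntax:
-- CLUSTER indexname ON tablename); index name assumed here.
CLUSTER rid_dokumnr_idx ON rid;

-- Re-collect statistics so the planner sees the new physical layout.
ANALYZE rid;
```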

regards
Tomas