Обсуждение: Index usage vs large repetitions of key

Поиск
Список
Период
Сортировка

Index usage vs large repetitions of key

От
Francisco Reyes
Дата:
I have a table with a "year" column.
A count grouped by year is as follows:
=> select year, count(*) from ystats group by year;
 year | count
------+--------
 1992 | 367191
 1993 | 341324
 1994 | 324867
 1995 | 314674
 1996 | 307149
 1997 | 303540
 1998 | 298040
 1999 | 300269
 2000 | 302648
 2001 | 305797
 2002 | 127445
(11 rows)

I created an index by year, but it never seems to get used, not even with
"set enable_seqscan to off;"

I have numerous queries I do against this table that only need to acces
one or two years.

I am making the wrong assumption here that doing an index scan would be
significantly more efficient? If so how do I help the optimizer see this?


Re: Index usage vs large repetitions of key

От
Neil Conway
Дата:
On Sat, 4 May 2002 16:25:47 -0400 (EDT)
"Francisco Reyes" <lists@natserv.com> wrote:
> I have numerous queries I do against this table that only need to acces
> one or two years.

Can you post some of the queries that are problematic, as well as the
output of EXPLAIN ANALYZE for them?

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC

Re: Index usage vs large repetitions of key

От
Francisco Reyes
Дата:
On Sat, 4 May 2002, Neil Conway wrote:

> On Sat, 4 May 2002 16:25:47 -0400 (EDT)
> "Francisco Reyes" <lists@natserv.com> wrote:
> > I have numerous queries I do against this table that only need to acces
> > one or two years.
>
> Can you post some of the queries that are problematic, as well as the
> output of EXPLAIN ANALYZE for them?

Most of the queries are large and ugly.
Following is simple one that is very much like the common queries we have.

-- Query
explain analyze
select record_key, ystart, cstart
        from ystats, hearn
        where year = 2002 and
        ystats.record_key = hearn.horse_key and
        cstat_date > '1-1-2002'
        and ystart <> cstart ;
--

-- Explain Analyze
Hash Join  (cost=528.26..101381.81 rows=14 width=12) (actual
time=5237.61..16835.63 rows=69552 loops=1)
  ->  Seq Scan on ystats  (cost=0.00..99960.80 rows=178505 width=6)
      (actual time=2049.91..13066.82 rows=127445 loops=1)

  ->  Hash  (cost=527.88..527.88 rows=154 width=6)
      (actual time=833.22..833.22 rows=0 loops=1)
        ->  Index Scan using he_cstat_date on hearn
            (cost=0.00..527.88 rows=154 width=6)
            (actual time=0.47..568.92 rows=40821 loops=1)
Total runtime: 17525.13 msec
--

The estimate for ystats comes out to 178505, which is not far from the
actual 127445. This is MUCH smaller than the 3 Million + records on that
table, yet the optimizer insists on doing a sequential scan.

The estimate for hearn was also pretty bad. :-(
The optimizer estimated 154 rows and 40,821 were returned.


Re: Index usage vs large repetitions of key

От
Francisco Reyes
Дата:
On Sun, 5 May 2002 felix@crowfix.com wrote:

> This sequential scan bugged the heck out of me, until I finally
> understood what is going on.  Usually records on disk are scattered
> all over, so many per data page.  I think the standard is 8K.  Suppose
> your table has 10 records per data page.  Roughly 1/20 of the records
> will be selected, so it is going to hit about half the data pages.  In
> this case, it is better to simply march through all data pages
> sequentially than to read half the data pages randomly, and also hit a
> lot of index pages, also randomly scattered around the disk.  It's
> better for the OS to read twice as much sequentially compared to half
> as much randomly.  Throw in more random reading for index pages, and
> the sequential scan is a big win.


Thanks for the explanation. So I guess it is a factor of how big each row
is, the percentage of records to be selected and the page size.
The particular queries I am doing would return about %10 of the records
and using 8K pages there would be about 200 rows per page. Now I better
understand why the index is not been used.

> I think there is some way to force an indexed read, but I have
> forgotten what little I knew about that.  If there is, you could try
> both ways and compare timings.

Based on this info it may make sense to let it do the sequential scan.

In the coming months the table in question is going to grow 3 to 4 times
it's number of records so at that point the index may make more sense.

Is there a drawback on having the index right now?
I guess it would make the optimizer's work more even though it would
likely not choose the index anyway.


Re: Index usage vs large repetitions of key

От
Neil Conway
Дата:
On Tue, 7 May 2002 09:48:13 -0400 (EDT)
"Francisco Reyes" <lists@natserv.com> wrote:
> On Sun, 5 May 2002 felix@crowfix.com wrote:
> > I think there is some way to force an indexed read, but I have
> > forgotten what little I knew about that.  If there is, you could try
> > both ways and compare timings.
>
> Based on this info it may make sense to let it do the sequential scan.

You can easily test this hypothesis by disabling sequential scans (SET
enable_seqscan = off;), and using EXPLAIN ANALYZE to compare the performance
of the resulting query plan with the one chosen by the planner to
begin with.

> In the coming months the table in question is going to grow 3 to 4 times
> it's number of records so at that point the index may make more sense.
>
> Is there a drawback on having the index right now?

Yes; inserts and updates will need to update the index. Depending on
your queries, this can be a significant performance hit.

> I guess it would make the optimizer's work more even though it would
> likely not choose the index anyway.

My guess would be that this wouldn't be a very significant factor.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC

Re: Index usage vs large repetitions of key

От
Francisco Reyes
Дата:
On Tue, 7 May 2002, Neil Conway wrote:

> On Tue, 7 May 2002 09:48:13 -0400 (EDT)
> "Francisco Reyes" <lists@natserv.com> wrote:
> > On Sun, 5 May 2002 felix@crowfix.com wrote:
> > > I think there is some way to force an indexed read, but I have
> > > forgotten what little I knew about that.  If there is, you could try
> > > both ways and compare timings.
> >
> > Based on this info it may make sense to let it do the sequential scan.
>
> You can easily test this hypothesis by disabling sequential scans (SET
> enable_seqscan = off;), and using EXPLAIN ANALYZE to compare the performance
> of the resulting query plan with the one chosen by the planner to
> begin with.

I tried to set enable_seqscan = off and it still did a sequential scan.

> > Is there a drawback on having the index right now?
>
> Yes; inserts and updates will need to update the index. Depending on
> your queries, this can be a significant performance hit.


This is a "reporting" server and I do a set of "copy" jobs once a day,
followed by a vacuum analyze.. and a nightly "vacuum full"