Thread: Odd statistics behaviour in 7.2


Odd statistics behaviour in 7.2

From:
Gordon Runkle
Date:
Hello, all,

I'm having a strange problem with v7.2 relating to statistics collection
and plan calculation.  I'm not sure if this relates to the problems Marc
was seeing, but here goes.

I have a table with 1,066,673 rows.  The column I'm interested in has
this distribution of values:
 tdnr_ct |   ct
---------+--------
      16 |      1
       4 |      1
       3 |     58
       2 |  68904
       1 | 928171

This means that 'ct' records have 'tdnr_ct' duplicate values.  As you can
see, the index I have on this column is highly selective, and should be
used to look up records based on this column.  In v7.1.3, it always does.
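(For anyone wanting to reproduce this kind of breakdown: a distribution like the above can be computed with a nested aggregate. The column name 'tdnr' below is my guess at the underlying column; substitute the real one.)

```sql
-- "How many values appear tdnr_ct times each?"
SELECT tdnr_ct, count(*) AS ct
FROM (SELECT tdnr, count(*) AS tdnr_ct
      FROM cat24_ticket_doc_id
      GROUP BY tdnr) AS dup_counts
GROUP BY tdnr_ct
ORDER BY tdnr_ct DESC;
```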

Under v7.2, it only sometimes does.  I've looked at the statistics,
thanks to what I learned from Tom and Marc's discussion, and I see that
sometimes when I VACUUM ANALYZE the table, 'n_distinct' for this column
gets a value of '-1' (desirable), and other times a value such as 59483
or something.
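(The value I'm looking at comes from the pg_stats view, e.g. -- column name hypothetical again:)

```sql
SELECT attname, n_distinct, null_frac, most_common_vals
FROM pg_stats
WHERE tablename = 'cat24_ticket_doc_id'
  AND attname = 'tdnr';
```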

This is with the default setting for the statistics.

Doing a 'SET STATISTICS 40' on the column got me to '-0.106047', which is
better.  But even so, the values do change somewhat over subsequent runs
of VACUUM ANALYZE.  And sometimes I get the coveted '-1'.
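(For the archives: the per-column target is set with ALTER TABLE and only takes effect at the next ANALYZE; column name is hypothetical as above.)

```sql
ALTER TABLE cat24_ticket_doc_id
    ALTER COLUMN tdnr SET STATISTICS 40;
VACUUM ANALYZE cat24_ticket_doc_id;
```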

The query I'm running is fairly complex.  The difference between getting
the index lookup versus the sequential scan causes an order of magnitude 
difference in run time.

The query plans are below.  Same query, no changes, just the difference
in statistics.

The desirable query plan:

Unique  (cost=176572.08..177673.89 rows=3673 width=176)
  ->  Sort  (cost=176572.08..176572.08 rows=36727 width=176)
        ->  Merge Join  (cost=172982.30..173787.35 rows=36727 width=176)
              ->  Sort  (cost=169436.41..169436.41 rows=27883 width=142)
                    ->  Nested Loop  (cost=0.00..167377.66 rows=27883 width=142)
                          ->  Seq Scan on pprv_ticket ptk  (cost=0.00..3345.83 rows=27883 width=125)
                          ->  Index Scan using xie01_cat24 on cat24_ticket_doc_id c24  (cost=0.00..5.87 rows=1 width=17)
              ->  Sort  (cost=3545.89..3545.89 rows=37048 width=34)
                    ->  Seq Scan on pprv_violation pe  (cost=0.00..734.48 rows=37048 width=34)
              SubPlan
                ->  Aggregate  (cost=5.87..5.87 rows=1 width=17)
                      ->  Index Scan using xie01_cat24 on cat24_ticket_doc_id  (cost=0.00..5.87 rows=1 width=17)
                ->  Aggregate  (cost=5.88..5.88 rows=1 width=17)
                      ->  Index Scan using xie01_cat24 on cat24_ticket_doc_id  (cost=0.00..5.88 rows=1 width=17)

The undesirable query plan:

Unique  (cost=1129322.57..1187392.58 rows=193567 width=176)
  ->  Sort  (cost=1129322.57..1129322.57 rows=1935667 width=176)
        ->  Merge Join  (cost=204226.57..249046.32 rows=1935667 width=176)
              ->  Merge Join  (cost=200135.91..209436.90 rows=525268 width=142)
                    ->  Sort  (cost=6435.89..6435.89 rows=27883 width=125)
                          ->  Seq Scan on pprv_ticket ptk  (cost=0.00..3335.83 rows=27883 width=125)
                    ->  Sort  (cost=193700.02..193700.02 rows=1066173 width=17)
                          ->  Seq Scan on cat24_ticket_doc_id c24  (cost=0.00..50164.73 rows=1066173 width=17)
              ->  Sort  (cost=4090.66..4090.66 rows=37048 width=34)
                    ->  Seq Scan on pprv_violation pv  (cost=0.00..734.48 rows=37048 width=34)
              SubPlan
                ->  Aggregate  (cost=74.72..74.72 rows=1 width=17)
                      ->  Index Scan using xie01_cat24 on cat24_ticket_doc_id  (cost=0.00..74.67 rows=19 width=17)
                ->  Aggregate  (cost=29.12..29.12 rows=1 width=17)
                      ->  Index Scan using xie07_cat24 on cat24_ticket_doc_id  (cost=0.00..29.12 rows=1 width=17)


I hope I've given enough information that it makes sense.  If there's anything
I can do on my end to help figure this out, let me know.

Thanks,

Gordon.
-- 
"Far and away the best prize that life has to offer is the chance to work hard at work worth doing."      -- Theodore
Roosevelt


Re: Odd statistics behaviour in 7.2

From:
Tom Lane
Date:
Gordon Runkle <gar@integrated-dynamics.com> writes:
> I have a table with 1,066,673 rows.  The column I'm interested in has
> this distribution of values:

>  tdnr_ct |   ct   
> ---------+--------
>       16 |      1
>        4 |      1
>        3 |     58
>        2 |  68904
>        1 | 928171

> This means that 'ct' records have 'tdnr_ct' duplicate values.

I'm confused.  You mean that there is one value that appears 16 times,
one that appears 4 times, etc etc, and 928171 values that appear only
once?

> Under v7.2, it only sometimes does.  I've looked at the statistics,
> thanks to what I learned from Tom and Marc's discussion, and I see that
> sometimes when I VACUUM ANALYZE the table, 'n_distinct' for this column
> gets a value of '-1' (desirable), and other times a value such as 59483
> or something.

This seems quite bizarre; given those stats it's hard to see how you
could get anything but -1 or close to it, even with a very unlucky
statistical sampling.  Don't suppose you'd want to trace through the
ANALYZE code and find out why it's computing a bad value?
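[For readers of the archives: 7.2's ANALYZE extrapolates n_distinct from a sample using the Haas-Stokes "Duj1" estimator.  The sketch below is a simplified Python rendering of that logic, not the exact analyze.c code; it also models only the dominant part of Gordon's distribution (the triple-duplicated and 16x/4x values are ignored).]

```python
import random

def duj1_estimate(sample, total_rows):
    """Haas-Stokes "Duj1" n_distinct estimator, as used (in spirit) by
    PostgreSQL 7.2's ANALYZE.  Returns -1.0 when every sampled value is
    unique, mirroring ANALYZE's "column looks unique" shortcut; negative
    results mean a fraction of the row count."""
    n = len(sample)
    counts = {}
    for v in sample:
        counts[v] = counts.get(v, 0) + 1
    d = len(counts)                                   # distinct in sample
    f1 = sum(1 for c in counts.values() if c == 1)    # sampled exactly once
    if f1 == n:
        return -1.0
    return (n * d) / (n - f1 + f1 * n / total_rows)

# Roughly Gordon's distribution: ~928k values appearing once,
# ~69k values appearing twice (slightly fewer rows than the real table).
random.seed(1)
N = 1_066_173
population = list(range(928_171))                                  # once each
population += [10**9 + x for x in range(68_904) for _ in (0, 1)]   # twice each
sample = random.sample(population, 3000)
estimate = duj1_estimate(sample, N)
```

With a sample this small, nearly every sampled value is a singleton, so the estimate should land near the row count (or hit the -1.0 shortcut), never a number like 59483 -- which is what makes the reported behaviour puzzling.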

Alternatively, if you could send me a dump of just the ct column,
I could try to reproduce the behavior here.  (CREATE TABLE foo AS
SELECT ct FROM yourtab and then pg_dump -t foo should do it.)
        regards, tom lane


Re: Odd statistics behaviour in 7.2

From:
"Gordon A. Runkle"
Date:
On Wed, 2002-02-13 at 11:21, Tom Lane wrote:
> Gordon Runkle <gar@integrated-dynamics.com> writes:
> > I have a table with 1,066,673 rows.  The column I'm interested in has
> > this distribution of values:
> 
> >  tdnr_ct |   ct   
> > ---------+--------
> >       16 |      1
> >        4 |      1
> >        3 |     58
> >        2 |  68904
> >        1 | 928171
> 
> > This means that 'ct' records have 'tdnr_ct' duplicate values.
> 
> I'm confused.  You mean that there is one value that appears 16 times,
> one that appears 4 times, etc etc, and 928171 values that appear only
> once?

Yes, exactly.  I could have stated that more clearly, but probably not
at 0-dark-thirty...  ;-)


> > Under v7.2, it only sometimes does.  I've looked at the statistics,
> > thanks to what I learned from Tom and Marc's discussion, and I see that
> > sometimes when I VACUUM ANALYZE the table, 'n_distinct' for this column
> gets a value of '-1' (desirable), and other times a value such as 59483
> > or something.
> 
> This seems quite bizarre; given those stats it's hard to see how you
> could get anything but -1 or close to it, even with a very unlucky
> statistical sampling.  Don't suppose you'd want to trace through the
> ANALYZE code and find out why it's computing a bad value?

I can do that.  I need to build a version of PostgreSQL with debug
enabled, right?
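[Something along these lines from a source tree should do it; the prefix is illustrative, and --enable-cassert adds internal sanity checks on top of the debug symbols:]

```shell
./configure --prefix=$HOME/pgsql-debug --enable-debug --enable-cassert
make && make install
```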


> Alternatively, if you could send me a dump of just the ct column,
> I could try to reproduce the behavior here.  (CREATE TABLE foo AS
> SELECT ct FROM yourtab and then pg_dump -t foo should do it.)

I can do that too.  It's pretty large, I'll email you separately
with a URL from which you can retrieve it.  Thanks!

Thomas suggested in his reply that perhaps the table isn't randomly
populated, but if it is storing the data in the order in which it was
COPYed in, it's pretty random.  The column's values generally trend
upward, but pretty randomly.  We load 6000-10000 new records per week.

Thanks again,

Gordon.
-- 
"Far and away the best prize that life has to offer is the chance to work hard at work worth doing."      -- Theodore
Roosevelt