Re: Problem with n_distinct being consistently inaccurate.

Поиск
Список
Период
Сортировка
От Nick Fankhauser
Тема Re: Problem with n_distinct being consistently inaccurate.
Дата
Msg-id NEBBLAAHGLEEPCGOBHDGOEIOIJAA.nickf@ontko.com
обсуждение исходный текст
Ответ на Re: Problem with n_distinct being consistently inaccurate.  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: Problem with n_distinct being consistently inaccurate.  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-admin
> AFAIK, estimating number of distinct values from a small sample is
> inherently an ill-conditioned problem.

If I had been getting estimates all over the map, I'd have been a bit more
unconcerned, but what I'm seeing is a very consistent number that also
increases and tends to be more consistent in proportion to the stats
"target" number. This makes me think that there is more at work here than
the inaccuracy likely to occur from small samples. It's as if the algorithm
and sample size (even at default) are pretty reasonable for returning
consistent results in this case, but a multiplier needs to be changed.

For instance, with the various values for statistics, if I do an analyze on
the table and then look at n_distinct 6 times, these are the results I get:

(actual number is 92,000)

set statistics = -1 (default):
13549
14268
14772
14518
13863
13526

mean = 14083
std dev = 518 or 3.7% of mean


set statistics = 100
22457
22598
22566
22580
22767
22490

mean = 22576
std dev = 108 or .5% of mean


set statistics = 500
39878
39984
40018
39977
39885
40070

mean = 39968
std dev = 75 or .2% of mean


set statistics = 1000
51428
51503
51486
51658
51625
51589

mean = 51548
std dev = 74 or .1% of mean


> You could try sticking the correct n_distinct into pg_statistic by hand
> just to see if it really does change the plan

OK... but I'm a bit confused on how to get to the right row in pg_statistic.

when I do a \d on pg_stats, it appears that pg_statistic.starelid matches up
with pg_class.oid, but apparently this is not the case. Is there a place I
can look to find which keys correspond among the pg_catalog tables?


> but I'd like to think
> that getting within a factor of 2 is good enough.

Probably so... but with the default stats, it is more like a factor of 6,
which seems significant to me, and if my conjecture is correct, it might be
an easy fix. (Easy for me to say, since I'm not a developer. <grin>)

-Nick




В списке pgsql-admin по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: Problem with n_distinct being consistently inaccurate.
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Problem with n_distinct being consistently inaccurate.