Re: n_distinct off by a factor of 1000

From: Michael Lewis
Subject: Re: n_distinct off by a factor of 1000
Date:
Msg-id: CAHOFxGpv06xOQqxXcfCrb-A5LgwOyBaeHuZvVP3CEvk8L=MsTQ@mail.gmail.com
In reply to: Re: n_distinct off by a factor of 1000  ("Peter J. Holzer" <hjp-pgsql@hjp.at>)
Responses: Re: n_distinct off by a factor of 1000  (Klaudie Willis <Klaudie.Willis@protonmail.com>)
Re: n_distinct off by a factor of 1000  ("Peter J. Holzer" <hjp-pgsql@hjp.at>)
List: pgsql-general


On Wed, Jun 24, 2020, 2:35 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
Yes, estimating the number of distinct values from a relatively small
sample is hard when you don't know the underlying distribution. It might
be possible to analyze the sample to find the distribution and get a
better estimate. But I'm not sure how useful that would really be: If
a few values are very common and most are very rare, you are probably also
much more likely to use the common values in a query. And for those you
would massively underestimate their frequency if you had an accurate
n_distinct value. That might be just as bad or even worse.


This would only be true for values that are "common" but not in the MCV list, right?

If we could increase the sampling ratio beyond the hard-coded 300x to get a more representative sample, and use that to estimate ndistinct (and also the frequency of the most common values), but only actually store the 100 MCVs (or whatever the stats target is set to for the system or column), then the issue might be mitigated without increasing planning time from stats that are larger than prudent. The "only" cost should be longer processing time when (auto)analyzing... plus, I suppose, the overhead of considering this potential new setting in all analyze cases.
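In the meantime, PostgreSQL does let you work around a bad estimate per column: you can pin n_distinct directly, or raise the statistics target to enlarge the sample (though today that also enlarges the stored MCV list, which is exactly the coupling described above). A sketch, assuming a hypothetical table t with a skewed column k:

```sql
-- Pin the estimate when you know the real cardinality;
-- takes effect at the next ANALYZE. A negative value means
-- a fraction of the row count (e.g. -0.05 = 5% of rows).
ALTER TABLE t ALTER COLUMN k SET (n_distinct = 1000000);
ANALYZE t;

-- Alternatively, enlarge the sample (and, as a side effect,
-- the stored MCV list) for this column only; max is 10000.
ALTER TABLE t ALTER COLUMN k SET STATISTICS 1000;
ANALYZE t;
```

The first form sidesteps the sampling problem entirely but has to be maintained by hand as the data changes; the second improves the estimate statistically at the cost of larger stats and longer planning.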

In the pgsql-general list, by date sent:

Previous
From: "Peter J. Holzer"
Date:
Message: Re: n_distinct off by a factor of 1000
Next
From: Tim Cross
Date:
Message: Re: Persistent Connections