Re: n_distinct off by a factor of 1000

Поиск
Список
Период
Сортировка
От Peter J. Holzer
Тема Re: n_distinct off by a factor of 1000
Дата
Msg-id 20200624203519.GA24048@hjp.at
обсуждение исходный текст
Ответ на n_distinct off by a factor of 1000  (Klaudie Willis <Klaudie.Willis@protonmail.com>)
Ответы Re: n_distinct off by a factor of 1000  (Michael Lewis <mlewis@entrata.com>)
Список pgsql-general
[Please keep replies on the list]

On 2020-06-24 11:02:22 +0000, Klaudie Willis wrote:
> Holzer,  thanks for your feedback.  Yes, your guess is very good.  The
> data consists of millions of instruments that occur a handful of cases
> (rare), and instruments that are very common.
>
> I am still a little surprised that it is so hard to sample data and
> estimate distinct values within a 1000X. My intuition misleads me into
> thinking this should be simpler, but I understand that this is no
> simple task at all.  To your information, it seems like SQL server
> 2016 makes the same "mistake" when calculating distincts (or 1/density
> as they call it). I have similar data there that I have looked into,
> and it seems "fooled" as well.

Yes, estimating the number of distinct values from a relatively small
sample is hard when you don't know the underlying distribution. It might
be possible to analyze the sample to find the distribution and get a
better estimate. But I'm not sure how useful that would really be: If
a few values are very common and most very rare you are probably also
much more likely to use the common values in a query: And for those you
you would massively underestimate their frequency if you had an accurate
n_distinct value. That might be just as bad or even worse.

I'm wondering whether it would make sense to use a range instead of a
single value in cost calculations. Then the planner could prefer plans
which had reasonable cost over the whole range over plans which are
good at one end of the range but horrible at the other.

I'm a afraid that would lead to combinatorial explosion, though. Even
with a small number of ranges there would be a large number of possible
combinations to calculate. So one would probably have to resort to monte
carlo simulation or soemthing like that.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Вложения

В списке pgsql-general по дате отправления:

Предыдущее
От: Erwin Sebastian Andreasen
Дата:
Сообщение: Curious behaviour with "order by random()"
Следующее
От: Michael Lewis
Дата:
Сообщение: Re: n_distinct off by a factor of 1000