Re: estimating # of distinct values

Поиск
Список
Период
Сортировка
От tv@fuzzy.cz
Тема Re: estimating # of distinct values
Дата
Msg-id 3f230db8df732b42dffcc8836be7e03e.squirrel@sq.gransy.com
обсуждение исходный текст
Ответ на estimating # of distinct values  (Tomas Vondra <tv@fuzzy.cz>)
Список pgsql-hackers
> On Fri, 2011-01-07 at 12:32 +0100, tv@fuzzy.cz wrote:
>> the problem is you will eventually need to drop the results and rebuild
>> it, as the algorithms do not handle deletes (ok, Florian mentioned an
>> algorithm L_0 described in one of the papers, but I'm not sure we can
>> use
>> it).
>
> Yes, but even then you can start with much better cards if you already
> have an estimate of what it looks like, based on the fact that you did
> continuous updating of it. For example you'll have a pretty good
> estimate of the bounds of the number of distinct values, while if you
> really start from scratch you have nothing to start with but assume that
> you must cope with the complete range between all values are distinct ->
> there's only a few of them.

Sure, using the previous estimate is a good idea. I just wanted to point
out there is no reasonable way to handle deletes, so that you have to drop
the stats are rebuild it from scratch.

The biggest problem is not choosing a reasonable parameters (some of the
parameters can handle a few billions ndistinct values with something like
128kB of memory and less than 5% error). The really serious concern is I/O
generated by rebuilding the stats.

>> Another thing I'm not sure about is where to store those intermediate
>> stats (used to get the current estimate, updated incrementally).
>
> The forks implementation proposed in other responses is probably the
> best idea if usable. It will also solve you the problem of memory
> limitations, at the expense of more resources used if the structure
> grows too big, but it will still be probably fast enough if it stays
> small and in cache.

Hm, the forks seem to be an interesting option. It's probably much better
than storing that directly in the memory (not a good idea if there is a
lot of data). And you don't really need the data to get the estimate, it's
needed just when updating the stats.

So the last thing I'm not sure is how to store the changed rows, so that
the update process can get a list of new values. Someone already offered
LISTEN/NOTIFY, but I guess we could just write the ctids into a file
(maybe another fork)?

regards
Tomas



В списке pgsql-hackers по дате отправления:

Предыдущее
От: David Fetter
Дата:
Сообщение: Re: hstore ?& operator versus mathematics
Следующее
От: Magnus Hagander
Дата:
Сообщение: Re: Streaming base backups