MCV lists for highly skewed distributions

Поиск
Список
Период
Сортировка
От Jeff Janes
Тема MCV lists for highly skewed distributions
Дата
Msg-id CAMkU=1yvdGvW9TmiLAhz2erFnvnPFYHbOZuO+a=4DVkzpuQ2tw@mail.gmail.com
обсуждение исходный текст
Ответы Re: MCV lists for highly skewed distributions  (John Naylor <jcnaylor@gmail.com>)
Re: MCV lists for highly skewed distributions  (Simon Riggs <simon@2ndquadrant.com>)
Список pgsql-hackers
I want to revive a patch I sent  couple years ago to the performance list, as I have seen the issue pop up repeatedly since then.

The problem is that if you have a highly skewed distribution, such as exponential frequencies, it usually stores too few entries in the MCV list.

For example:

create table foo as select floor(-log(random())/log(2))::int  as who from generate_series(1,10000000);

This will usually store 0 to 5 as MCV with the default stat target[1].  Then value 6 is not stored, and its ~75k repetitions get averaged over the remaining distinct values.  This leads to horrible misplanning for rare values.  For example, who=17 has 37 entries, but is estimated to have 20,000.  The correct join for 37 values is often not the correct one for 20,000.

If we stored just a few more values, their inclusion in the MCV would mean they are depleted from the residual count, correctly lowering the estimate we would get for very rare values not included in the sample.

So instead of having the threshold of 1.25x the average frequency over all values, I think we should use 1.25x the average frequency of only those values not already included in the MCV, as in the attached.

As it is, you can partially overcome the too short MCV list by cranking up the default statistics target, but this is a weak effect and comes at a high cost of CPU time.  In some of the real distributions I've looked at, cranking up default statistics target is almost entirely ineffective.

I think that perhaps maxmincount should also use the dynamic values_cnt_remaining rather than the static one.  After all, things included in the MCV don' get represented in the histogram.  When I've seen planning problems due to skewed distributions I also usually see redundant values in the histogram boundary list which I think should be in the MCV list instead. But I have not changed that here, pending discussion.

[1] Occasionally it will store a much longer MCV list, because no values was sampled exactly once, which triggers a different code path in which all seen values are put in the MCV and the histogram is NULL.  This is not reliable, as whether the least-sample value is present in the sample once or twice is pretty brittle.

Cheers,

Jeff
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: [HACKERS] taking stdbool.h into use
Следующее
От: David Rowley
Дата:
Сообщение: Re: [HACKERS] path toward faster partition pruning