Re: cost_sort() improvements

Поиск

Список

Период

Сортировка

От	Teodor Sigaev
Тема	Re: cost_sort() improvements
Дата	12 июля 2018 г. 21:00:27
Msg-id	72f7e61e-95ad-ee6a-ba62-a500dd089f1a@sigaev.ru обсуждение исходный текст
Ответ на	Re: cost_sort() improvements (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Ответы	Re: cost_sort() improvements (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Список	pgsql-hackers

Дерево обсуждения

> One more thought about estimating the worst case - I wonder if simply
> multiplying the per-tuple cost by 1.5 is the right approach. It does not
> seem particularly principled, and it's trivial simple to construct
> counter-examples defeating it (imagine columns with 99% of the rows
> having the same value, and then many values in exactly one row - that
> will defeat the ndistinct-based group size estimates).

Agree. But right now that is best what we have. and constant 1.5 easily could be 
changed to 1 or 10 - it is just arbitrary value, intuitively chosen.  As I 
mentioned in v7 patch description, new estimation function provides ~10% bigger 
estimation and this constant should not be very large, because we could easily 
get overestimation.

> 
> The reason why constructing such counter-examples is so simple is that
> this relies entirely on the ndistinct stats, ignoring the other stats we
> already have. I think this might leverage the per-column MCV lists (and
> eventually multi-column MCV lists, if that ever gets committed).
> 
> We're estimating the number of tuples in group (after fixing values in
> the preceding columns), because that's what determines the number of
> comparisons we need to make at a given stage. The patch does this by
> estimating the average group size, by estimating ndistinct of the
> preceding columns G(i-1), and computing ntuples/G(i-1).
> 
> But consider that we also have MCV lists for each column - if there is a
> very common value, it should be there. So let's say M(i) is a frequency
> of the most common value in i-th column, determined either from the MCV
> list or as 1/ndistinct for that column. Then if m(i) is minimum of M(i)
> for the fist i columns, then the largest possible group of values when
> comparing values in i-th column is
> 
>      N * m(i-1)
> 
> This seems like a better way to estimate the worst case. I don't know if
> this should be used instead of NG(i), or if those two estimates should
> be combined in some way.
Agree, definitely we need an estimation of larger group size to use it in 
cost_sort. But I don't feel power to do that, pls, could you attack a computing 
group size issue? Thank you.

> 
> I think estimating the largest group we need to sort should be helpful
> for the incremental sort patch, so I'm adding Alexander to this thread.Agree
I think so. suggested estimation algorithm should be modified to support leading 
preordered keys and this form could be directly used in GROUP BY and incremental 
sort patches.
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Teodor Sigaev
Дата: 12 июля 2018 г., 20:48:21
Сообщение: Re: cost_sort() improvements

Следующее

От: Nikita Glukhov
Дата: 12 июля 2018 г., 21:06:21
Сообщение: Re: [PATCH] Add missing type conversion functions for PL/Python

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: cost_sort() improvements

Предыдущее

Следующее