Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Dean Rasheed
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date:
Msg-id: CAEZATCX04NOAqJm7+vb_jyrMxDo44Pzx8xd_z5rRH7snsLqxEA@mail.gmail.com
In reply to: Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
On 16 July 2018 at 21:55, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>
> On 07/16/2018 02:54 PM, Dean Rasheed wrote:
>> On 16 July 2018 at 13:23, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>>> The top-level clauses allow us to make such deductions, with deeper
>>>>> clauses it's much more difficult (perhaps impossible). Because for
>>>>> example with (a=1 AND b=1) there can be just a single match, so if we
>>>>> find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR
>>>>> b=2)) it's not that simple, because there may be multiple combinations
>>>>> and so a match in MCV does not guarantee anything.
>>>>
>>>> Actually, it guarantees a lower bound on the overall selectivity, and
>>>> maybe that's the best that we can do in the absence of any other
>>>> stats.
>>>>
>>> Hmmm, is that actually true? Let's consider a simple example, with two
>>> columns, each with just 2 values, and a "perfect" MCV list:
>>>
>>>     a | b | frequency
>>>    -------------------
>>>     1 | 1 | 0.5
>>>     2 | 2 | 0.5
>>>
>>> And let's estimate sel(a=1 & b=2).
>>
>> OK. In this case, there are no MCV matches, so there is no lower bound (it's 0).
>>
>> What we could do though is also impose an upper bound, based on the
>> sum of non-matching MCV frequencies. In this case, the upper bound is
>> also 0, so we could actually say the resulting selectivity is 0.
>>
>
> Hmmm, it's not very clear to me how we would decide which of these cases
> applies, because in most cases we don't have an MCV list covering 100% of the rows.
>
> Anyways, I've been thinking about how to modify the code to work the way
> you proposed (in a way sufficient for a PoC). But after struggling with
> it for a while it occurred to me it might be useful to do it on paper
> first, to verify how it would work on your examples.
>
> So let's use this data
>
> create table foo(a int, b int);
> insert into foo select 1,1 from generate_series(1,50000);
> insert into foo select 1,2 from generate_series(1,40000);
> insert into foo select 1,x/10 from generate_series(30,250000) g(x);
> insert into foo select 2,1 from generate_series(1,30000);
> insert into foo select 2,2 from generate_series(1,20000);
> insert into foo select 2,x/10 from generate_series(30,500000) g(x);
> insert into foo select 3,1 from generate_series(1,10000);
> insert into foo select 3,2 from generate_series(1,5000);
> insert into foo select 3,x from generate_series(3,600000) g(x);
> insert into foo select x,x/10 from generate_series(4,750000) g(x);
>
> Assuming we have perfectly exact statistics, we have these MCV lists
> (both univariate and multivariate):
>
>   select a, count(*), round(count(*) /2254937.0, 4) AS frequency
>     from foo group by a order by 2 desc;
>
>      a    | count  | frequency
>   --------+--------+-----------
>         3 | 614998 |    0.2727
>         2 | 549971 |    0.2439
>         1 | 339971 |    0.1508
>      1014 |      1 |    0.0000
>     57220 |      1 |    0.0000
>     ...
>
>   select b, count(*), round(count(*) /2254937.0, 4) AS frequency
>     from foo group by b order by 2 desc;
>
>      b    | count | frequency
>   --------+-------+-----------
>         1 | 90010 |    0.0399
>         2 | 65010 |    0.0288
>         3 |    31 |    0.0000
>         7 |    31 |    0.0000
>        ...
>
>   select a, b, count(*), round(count(*) /2254937.0, 4) AS frequency
>     from foo group by a, b order by 3 desc;
>
>      a    |   b    | count | frequency
>   --------+--------+-------+-----------
>         1 |      1 | 50000 |    0.0222
>         1 |      2 | 40000 |    0.0177
>         2 |      1 | 30000 |    0.0133
>         2 |      2 | 20000 |    0.0089
>         3 |      1 | 10000 |    0.0044
>         3 |      2 |  5000 |    0.0022
>         2 |  12445 |    10 |    0.0000
>         ...
>
> And let's estimate the two queries with complex clauses, where the
> multivariate stats gave 2x overestimates.
>
> SELECT * FROM foo WHERE a=1 and (b=1 or b=2);
> -- actual 90000, univariate: 24253, multivariate: 181091
>
>    univariate:
>
>      sel(a=1) = 0.1508
>      sel(b=1) = 0.0399
>      sel(b=2) = 0.0288
>      sel(b=1 or b=2) = 0.0673
>
>    multivariate:
>      sel(a=1 and (b=1 or b=2)) = 0.0399 (0.0770)
>
> The second multivariate estimate comes from assuming only the first 5
> items make it to the multivariate MCV list (covering 6.87% of the data)
> and extrapolating the selectivity to the non-MCV data too.
>
> (Notice it's about 2x the actual selectivity, so this extrapolation, due
> to not realizing the MCV list already contains all the matches, is pretty
> much responsible for the whole over-estimate.)
>

Agreed. I think the actual MCV stats I got included the first 6
entries, but yes, that's only around 7% of the data.
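
For what it's worth, both of those numbers are easy to sanity-check
directly against the test data (a rough check, assuming the foo table
and the 2254937 total row count from your message):

    -- Actual selectivity of the example query: 90000 matching rows -> 0.0399
    SELECT round(count(*) / 2254937.0, 4) AS actual_sel
      FROM foo WHERE a = 1 AND (b = 1 OR b = 2);

    -- Coverage of the six most common (a,b) combinations, i.e. what I believe
    -- ended up in the multivariate MCV list here: ~0.0687, i.e. roughly 7%
    SELECT round(sum(cnt) / 2254937.0, 4) AS mcv_coverage
      FROM (SELECT count(*) AS cnt
              FROM foo GROUP BY a, b
             ORDER BY count(*) DESC LIMIT 6) top_items;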


> So, how would the proposed algorithm work? Let's start with "a=1":
>
>    sel(a=1) = 0.1508
>
> I don't see much point in applying the two "b" clauses independently (nor
> do I see how it would be done, as it's effectively a single clause):
>
>    sel(b=1 or b=2) = 0.0673
>
> And we get
>
>    total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101
>
> From the multivariate MCV we get
>
>    mcv_sel = 0.0399
>
> And finally
>
>    total_sel = Max(total_sel, mcv_sel) = 0.0399
>
> Which is great, but I don't see how that actually helps anything. We
> still only have the estimate for the ~7% covered by the MCV list, and
> not the remaining non-MCV part.
>

Right. If these are the only stats available, and there are just 2
top-level clauses like this, it either returns the MCV estimate, or
the old univariate estimate (whichever is larger). It avoids
over-estimating, but will almost certainly under-estimate when the MCV
matches are incomplete.
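
Just to spell out what that combination amounts to for this query (a
rough sketch with the frequencies hard-coded from the figures above,
not anything read from the actual stats):

    -- univariate product vs. multivariate MCV match, take the larger of the two
    SELECT greatest(0.1508 * 0.0673,   -- sel(a=1) * sel(b=1 or b=2) ~ 0.0101
                    0.0399             -- matching MCV items (1,1) and (1,2)
                   ) AS total_sel;     -- -> 0.0399, i.e. just the MCV part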


> I could do the same thing for the second query, but the problem there is
> actually exactly the same - extrapolation from the MCV part to the non-MCV
> part roughly doubles the estimate.
>
> So unless I'm applying your algorithm incorrectly, this does not seem
> like a very promising direction :-(
>

You could be right. Actually it's the order dependence with more than
2 top-level clauses that bothers me most about this algorithm. It's
also not entirely obvious how to include histogram stats in this
scheme.

A different approach that I have been thinking about is to attempt, in
mcv_update_match_bitmap(), to work out the probability of all the
clauses matching and the value not being an MCV value. For example,
given a clause like a=1 whose univariate estimate we know (0.1508 in
the above example), and knowing that the MCV values account for
0.0222+0.0177 of that, the remainder must come from non-MCV values. So
we could say that the probability that a=1 and the value is not in the
MCV list is 0.1508-0.0222-0.0177 = 0.1109. The question then is
whether we could combine these non-MCV probabilities to give an
estimate of how many non-MCV values we need to worry about. I've not
fully thought that through, but it might be useful. The problem is
that it's still likely to over-estimate when the MCV matches fully
cover all possibilities, and I still don't see a way to reliably
detect that case.
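
To make that concrete for the a=1 clause above (again just hard-coding
the quoted frequencies, not something the patch computes today):

    -- probability that a row has a=1 and its (a,b) combination is not an MCV item
    SELECT 0.1508            -- univariate sel(a=1)
         - 0.0222            -- MCV item (a=1, b=1)
         - 0.0177            -- MCV item (a=1, b=2)
           AS non_mcv_sel;   -- -> 0.1109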

Regards,
Dean

