Re: multivariate statistics v14

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: multivariate statistics v14
Дата
Msg-id eb8522d2-f1ee-7401-2118-e7d6b8990483@2ndquadrant.com
обсуждение исходный текст
Ответ на Re: multivariate statistics v14  (Tatsuo Ishii <ishii@postgresql.org>)
Ответы Re: multivariate statistics v14  (Petr Jelinek <petr@2ndquadrant.com>)
Список pgsql-hackers
On 03/23/2016 06:20 AM, Tatsuo Ishii wrote:
>>> I am now looking into the create statistics doc to see if the example
>>> appearing in it is working. I will get back if I find any.
>
> I have the ref doc: CREATE STATISTICS
>
> There are nice examples how the multivariate statistics gives better
> row number estimation. So I gave them a try.
>
> "Create table t1 with two functionally dependent columns,
>  i.e. knowledge of a value in the first column is sufficient for
>  determining the value in the other column" The example creates table
>  "t1", then populates it using generate_series. After CREATE
>  STATISTICS, ANALYZE and EXPLAIN. I expected the EXPLAIN demonstrates
>  how result rows estimation is enhanced by using the multivariate
>  statistics.
>
> Here is the EXPLAIN output using the multivariate statistics:
>
> EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 1);
>                                             QUERY PLAN
> ---------------------------------------------------------------------------------------------------
>  Seq Scan on t1  (cost=0.00..19425.00 rows=98 width=8) (actual time=76.876..76.876 rows=0 loops=1)
>    Filter: ((a = 1) AND (b = 1))
>    Rows Removed by Filter: 1000000
>  Planning time: 0.146 ms
>  Execution time: 76.896 ms
> (5 rows)
>
> Here is the EXPLAIN output without the multivariate statistics:
>
> EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 1);
>                                             QUERY PLAN
> --------------------------------------------------------------------------------------------------
>  Seq Scan on t1  (cost=0.00..19425.00 rows=1 width=8) (actual time=78.867..78.867 rows=0 loops=1)
>    Filter: ((a = 1) AND (b = 1))
>    Rows Removed by Filter: 1000000
>  Planning time: 0.102 ms
>  Execution time: 78.885 ms
> (5 rows)
>
> It seems the row numbers estimation (98) using the multivariate
> statistics is actually *worse* than the one (1) not using the
> statistics because the actual row number is 0.

Yes, there's a mistake in the first query, because the conditions 
actually are not compatible. I.e. (i/100)=1 and (i/500)=1 have no 
overlapping rows, clearly. It should be

EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);

instead. Will fix.

>
> Next example (using table "t2") is much better than the case using t1.
>
> Here is the EXPLAIN output using the multivariate statistics:
>
> EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
>                                                QUERY PLAN
> --------------------------------------------------------------------------------------------------------
>  Seq Scan on t2  (cost=0.00..19425.00 rows=9633 width=8) (actual time=0.012..75.350 rows=10000 loops=1)
>    Filter: ((a = 1) AND (b = 1))
>    Rows Removed by Filter: 990000
>  Planning time: 0.107 ms
>  Execution time: 75.680 ms
> (5 rows)
>
> Here is the EXPLAIN output without the multivariate statistics:
>
> EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
>                                               QUERY PLAN
> ------------------------------------------------------------------------------------------------------
>  Seq Scan on t2  (cost=0.00..19425.00 rows=91 width=8) (actual time=0.008..76.614 rows=10000 loops=1)
>    Filter: ((a = 1) AND (b = 1))
>    Rows Removed by Filter: 990000
>  Planning time: 0.067 ms
>  Execution time: 76.935 ms
> (5 rows)
>
> This time it seems the row numbers estimation (9633) using the
> multivariate statistics is much better than the one (91) not using the
> statistics because the actual row number is 10000.
>
> The last example (using table "t3") seems no effect by multivariate statistics.

Yes. There's a typo in the example - it analyzes the wrong table (t2 
instead of t3). Once I fix that, the estimates are much better.

> In summary, the only case which shows the effect of the multivariate
> statistics is the "t2" case. So I don't see why other examples are
> shown in the manual. Am I missing something?

No, thanks for spotting those mistakes. I'll fix them and submit a new 
version of the patch - either later today or perhaps tomorrow.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Merlin Moncure
Дата:
Сообщение: Re: NOT EXIST for PREPARE
Следующее
От: Alvaro Herrera
Дата:
Сообщение: Re: Proposal: Generic WAL logical messages