Re: POC: GROUP BY optimization

Поиск
Список
Период
Сортировка
От jian he
Тема Re: POC: GROUP BY optimization
Дата
Msg-id CACJufxFoswKvkBi6putzLw67QCYsP+m9d3NLcbhkAj9Di=8eVQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: POC: GROUP BY optimization  (Andrei Lepikhov <a.lepikhov@postgrespro.ru>)
Ответы Re: POC: GROUP BY optimization
Re: POC: GROUP BY optimization
Список pgsql-hackers
On Thu, May 16, 2024 at 3:47 PM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
>
> On 24.04.2024 13:25, jian he wrote:
> > hi.
> > I found an interesting case.
> >
> > CREATE TABLE t1 AS
> >    SELECT (i % 10)::numeric AS x,(i % 10)::int8 AS y,'abc' || i % 10 AS
> > z, i::int4 AS w
> >    FROM generate_series(1, 100) AS i;
> > CREATE INDEX t1_x_y_idx ON t1 (x, y);
> > ANALYZE t1;
> > SET enable_hashagg = off;
> > SET enable_seqscan = off;
> >
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,z,y,w;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,w,y,z;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,z,w,y;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,w,z,y;
> > the above part will use:
> >    ->  Incremental Sort
> >           Sort Key: x, $, $, $
> >           Presorted Key: x
> >           ->  Index Scan using t1_x_y_idx on t1
> >
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY z,y,w,x;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY w,y,z,x;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY y,z,x,w;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY y,w,x,z;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY y,x,z,w;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY y,x,w,z;
> >
> > these will use:
> >    ->  Incremental Sort
> >           Sort Key: x, y, $, $
> >           Presorted Key: x, y
> >           ->  Index Scan using t1_x_y_idx on t1
> >
> > I guess this is fine, but not optimal?
> It looks like a bug right now - in current implementation we don't
> differentiate different orders. So:
> 1. Applying all the patches from the thread which I proposed as an
> answer to T.Lane last rebuke - does behavior still the same?.

I've applied these 4 patches in  this thread.
final_improvements.patch
minor_comment.patch
get_useful_group_keys_orderings.patch
group_by_asymmetry.diff

The behavior is still the same as the master.
meaning that below quoted queries are still using "Presorted Key: x".

> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,z,y,w;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,w,y,z;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,z,w,y;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,w,z,y;
> > the above part will use:
> >    ->  Incremental Sort
> >           Sort Key: x, $, $, $
> >           Presorted Key: x
> >           ->  Index Scan using t1_x_y_idx on t1


> 2. Could you try to find the reason?
the following are my understanding, it may be wrong...

function get_useful_group_keys_orderings, may generate alternative
pathkeys and group clauses.
for different PathKeyInfo
sub functions within add_paths_to_grouping_rel:
make_ordered_path, create_agg_path will yield the same cost.
function add_path (within add_paths_to_grouping_rel) will add these
paths (different PathKeyInfos) in a *sequential* order.

then later in the function set_cheapest.
`foreach(p, parent_rel->pathlist)`  will iterate these different paths
in a sequential order.
so in set_cheapest if 2 path costs are the same, then the first path
residing in parent_rel->pathlist wins.


fo a concrete example:
CREATE TABLE t1 AS
SELECT (i % 10)::numeric AS x,(i % 10)::int8 AS y,'abc' || i % 10 AS
z, i::float4 AS w FROM generate_series(1, 100) AS i;
CREATE INDEX t1_x_y_idx ON t1 (x, y);
ANALYZE t1;
SET enable_hashagg = off;
SET enable_seqscan = off;
EXPLAIN (COSTS OFF, verbose) SELECT count(*) FROM t1 GROUP BY x,z,y,w;

add_paths_to_grouping_rel will do the following in a sequential order.
A. generate and add original PathKeyInfo(groupby x,z,y,w)
B. generate and add alternative PathKeyInfo(groupby x,y,z,w). (because
of the index path)
C. can_hash is true, generate a HashAgg Path (because of
enable_hashagg is set to off, so this path the cost is very high)

As mentioned previously,
both A and B can use Incremental Sort, set_cheapest will choose the A
instead of B,
because A step generated path **first** satisfies increment sort.



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andy Fan
Дата:
Сообщение: Re: using extended statistics to improve join estimates
Следующее
От: vignesh C
Дата:
Сообщение: Re: Pgoutput not capturing the generated columns