Обсуждение: why memoize is not used for correlated subquery

Поиск
Список
Период
Сортировка

why memoize is not used for correlated subquery

От
Pavel Stehule
Дата:
Hi

I am playing with examples for P2D2, and I found few issues related to memoize

1. I use dataset https://pgsql.cz/files/obce.sql - it is data about czech population

Dictionary - "obec" -> "village", "pocet_muzu" -> "number_of_men", "pocet_zen" -> "number_of_woman", "okres" -> "district", "nazev" -> "name"

I wrote the query - biggest village per district

select nazev 
  from obce o 
  where pocet_muzu + pocet_zen = (select max(pocet_muzu + pocet_zen)
                                    from obce
                                   where o.okres_id = okres_id);



I expected usage of memoize, because in this query, it can be very effective https://explain.depesz.com/s/0ubC

(2024-05-28 09:09:58) postgres=# explain select nazev from obce o where pocet_muzu + pocet_zen = (select max(pocet_muzu + pocet_zen) from obce where o.okres_id = okres_id);
                                                QUERY PLAN                                                
══════════════════════════════════════════════════════════════════════════════════════════════════════════
Seq Scan on obce o  (cost=0.00..33588.33 rows=31 width=10)
  Filter: ((pocet_muzu + pocet_zen) = (SubPlan 2))
  SubPlan 2
    ->  Result  (cost=5.34..5.35 rows=1 width=4)
          InitPlan 1
            ->  Limit  (cost=0.28..5.34 rows=1 width=4)
                  ->  Index Scan Backward using obce_expr_idx on obce  (cost=0.28..409.92 rows=81 width=4)
                        Index Cond: ((pocet_muzu + pocet_zen) IS NOT NULL)
                        Filter: ((o.okres_id)::text = (okres_id)::text)
(9 rows)


But it doesn't do. I rewrote this query to lateral join, and memoize was used, but the result was not good, because filter wa pushed to subquery

explain select * from obce o, lateral (select max(pocet_zen + pocet_muzu) from obce where o.okres_id = okres_id) where pocet_zen + pocet_muzu = max;
                                              QUERY PLAN                                              
══════════════════════════════════════════════════════════════════════════════════════════════════════
Nested Loop  (cost=12.83..19089.82 rows=31 width=45)
  ->  Seq Scan on obce o  (cost=0.00..121.50 rows=6250 width=41)
  ->  Memoize  (cost=12.83..12.85 rows=1 width=4)
        Cache Key: (o.pocet_zen + o.pocet_muzu), o.okres_id
        Cache Mode: binary
        ->  Subquery Scan on unnamed_subquery  (cost=12.82..12.84 rows=1 width=4)
              Filter: ((o.pocet_zen + o.pocet_muzu) = unnamed_subquery.max)
              ->  Aggregate  (cost=12.82..12.83 rows=1 width=4)
                    ->  Index Scan using obce_okres_id_idx on obce  (cost=0.28..12.41 rows=81 width=8)
                          Index Cond: ((okres_id)::text = (o.okres_id)::text)
(10 rows)

and then the effect of memoize is almost negative https://explain.depesz.com/s/TKLL

When I used optimization fence, then memoize was used effectively https://explain.depesz.com/s/hhgi

explain select * from (select * from obce o, lateral (select max(pocet_zen + pocet_muzu) from obce where o.okres_id = okres_id) offset 0) where pocet_zen + pocet_muzu = max;
                                              QUERY PLAN                                              
══════════════════════════════════════════════════════════════════════════════════════════════════════
Subquery Scan on unnamed_subquery  (cost=12.83..1371.93 rows=31 width=45)
  Filter: ((unnamed_subquery.pocet_zen + unnamed_subquery.pocet_muzu) = unnamed_subquery.max)
  ->  Nested Loop  (cost=12.83..1278.18 rows=6250 width=45)
        ->  Seq Scan on obce o  (cost=0.00..121.50 rows=6250 width=41)
        ->  Memoize  (cost=12.83..12.84 rows=1 width=4)
              Cache Key: o.okres_id
              Cache Mode: binary
              ->  Aggregate  (cost=12.82..12.83 rows=1 width=4)
                    ->  Index Scan using obce_okres_id_idx on obce  (cost=0.28..12.41 rows=81 width=8)
                          Index Cond: ((okres_id)::text = (o.okres_id)::text)
(10 rows)

My question is - does memoize support subqueries? And can be enhanced to support this exercise without LATERAL and optimization fences?

Regards

Pavel







Re: why memoize is not used for correlated subquery

От
Tender Wang
Дата:


Pavel Stehule <pavel.stehule@gmail.com> 于2024年5月28日周二 15:31写道:
Hi


My question is - does memoize support subqueries? And can be enhanced to support this exercise without LATERAL and optimization fences?


The commit messages in memoize may answer your question:

   "For now, the planner will only consider using a result cache for
    parameterized nested loop joins.  This works for both normal joins and
    also for LATERAL type joins to subqueries.  It is possible to use this new
    node for other uses in the future.  For example, to cache results from
    correlated subqueries.  However, that's not done here due to some
    difficulties obtaining a distinct estimation on the outer plan to
    calculate the estimated cache hit ratio.  Currently we plan the inner plan
    before planning the outer plan so there is no good way to know if a result
    cache would be useful or not since we can't estimate the number of times
    the subplan will be called until the outer plan is generated." 

git show b6002a796d
--
Tender Wang
OpenPie:  https://en.openpie.com/

Re: why memoize is not used for correlated subquery

От
David Rowley
Дата:
On Tue, 28 May 2024 at 19:31, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> My question is - does memoize support subqueries? And can be enhanced to support this exercise without LATERAL and
optimizationfences?
 

It's only currently considered for parameterized nested loop joins,
not for subplans.

I wrote a bit about this in [1] and there's even a patch.  The problem
with it is that we plan subqueries and generate an actual plan before
planning the outer query.  This means we don't have an ndistinct
estimate for the parameters to the subquery when we plan it, therefore
we can't tell if Memoize is a good choice or not.  It isn't a good
choice if each set of parameters the subplan is called with is unique.
That would always be a cache miss and would only result in making the
query run more slowly.

I imagined making this work by delaying the plan creation for
subqueries until the same time as create_plan() for the outer query.
If we have a Path with and without a Memoize node, at some point after
planning the outer query, we can choose which Path is the cheapest
based on the ndistinct estimate for the parameters.

David

[1] https://www.postgresql.org/message-id/CAApHDvpGX7RN%2Bsh7Hn9HWZQKp53SjKaL%3DGtDzYheHWiEd-8moQ%40mail.gmail.com



Re: why memoize is not used for correlated subquery

От
Pavel Stehule
Дата:


út 28. 5. 2024 v 9:48 odesílatel David Rowley <dgrowleyml@gmail.com> napsal:
On Tue, 28 May 2024 at 19:31, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> My question is - does memoize support subqueries? And can be enhanced to support this exercise without LATERAL and optimization fences?

It's only currently considered for parameterized nested loop joins,
not for subplans.

I wrote a bit about this in [1] and there's even a patch.  The problem
with it is that we plan subqueries and generate an actual plan before
planning the outer query.  This means we don't have an ndistinct
estimate for the parameters to the subquery when we plan it, therefore
we can't tell if Memoize is a good choice or not.  It isn't a good
choice if each set of parameters the subplan is called with is unique.
That would always be a cache miss and would only result in making the
query run more slowly.

I imagined making this work by delaying the plan creation for
subqueries until the same time as create_plan() for the outer query.
If we have a Path with and without a Memoize node, at some point after
planning the outer query, we can choose which Path is the cheapest
based on the ndistinct estimate for the parameters.

Thank you for explanation 

Pavel

David

[1] https://www.postgresql.org/message-id/CAApHDvpGX7RN%2Bsh7Hn9HWZQKp53SjKaL%3DGtDzYheHWiEd-8moQ%40mail.gmail.com

Re: why memoize is not used for correlated subquery

От
Andy Fan
Дата:

> I imagined making this work by delaying the plan creation for
> subqueries until the same time as create_plan() for the outer query.

Do you mean sublinks rather than subqueries? if so, we can get another
benefit I want very much.

explain (costs off) select * from t1 where t1.a = 1
                    and exists (select 1 from t2 where  t2.a = t1.a and random() > 0);
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Seq Scan on t1
   Filter: ((a = 1) AND EXISTS(SubPlan 1))
   SubPlan 1
     ->  Seq Scan on t2
           Filter: ((a = t1.a) AND (random() > '0'::double precision))

As for now, when we are planing the sublinks, we don't know t1.a = 1
which may lost some optimization chance.  Considering the t2.a is a
partition key of t2, this would yield some big improvement for planning
a big number of partitioned table.

-- 
Best Regards
Andy Fan