Re: Use cases for lateral that do not involve a set returning function

Поиск
Список
Период
Сортировка
От AJ Welch
Тема Re: Use cases for lateral that do not involve a set returning function
Дата
Msg-id CAO-RzR+Yw1uOMVLn6zKGk6+pnsJVmCDN9V2okK-rNxeD8T4jMA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Use cases for lateral that do not involve a set returning function  (Albe Laurenz <laurenz.albe@wien.gv.at>)
Список pgsql-general
Thanks for the response. Yea, using lateral there definitely reads better to me than using a correlated subquery. And it makes sense that performance is ok since you're filtering on a specific person's id (as you hinted at with `WHERE p.id = ...`) and the nested loop forced by `order by...limit 1` presumably only loops once.

However, I would consider that more of an OLTP style query where I'm kind of more interested in OLAP style queries as the referenced post was building a funnel analysis over an events table. I don't think their approach will scale. I guess it's kind of a specific question but that post got me wondering if there are any use cases for lateral outside of SRFs that do not generate a nested loop AND can not be achieved without lateral? Basically, something I could use in an OLAP query that I couldn't use prior to 9.3.

Also, just for the heck of it, I took a look at the explain plans for both of your queries without `WHERE p.id = ...` to see how they would scale.

Correlated subquery:

QUERY PLAN
Hash Right Join  (cost=163943.00..5932603426739.67 rows=7499298 width=24)
  Hash Cond: (n.people_id = p.people_id)
  Join Filter: (NOT (SubPlan 1))
  ->  Seq Scan on names n  (cost=0.00..320543.41 rows=14998595 width=18)
        Filter: (now() > validfrom)
  ->  Hash  (cost=77028.00..77028.00 rows=5000000 width=10)
        ->  Seq Scan on people p  (cost=0.00..77028.00 rows=5000000 width=10)
  SubPlan 1
    ->  Seq Scan on names n2  (cost=0.00..395543.88 rows=1 width=0)
          Filter: ((validfrom > n.validfrom) AND (people_id = p.people_id) AND (now() > validfrom))


Lateral:

Nested Loop Left Join  (cost=358043.67..1790218514528.00 rows=5000000 width=24)
  ->  Seq Scan on people p  (cost=0.00..77028.00 rows=5000000 width=10)
  ->  Limit  (cost=358043.67..358043.67 rows=1 width=18)
        ->  Sort  (cost=358043.67..358043.68 rows=4 width=18)
              Sort Key: n.validfrom
              ->  Seq Scan on names n  (cost=0.00..358043.65 rows=4 width=18)
                    Filter: ((people_id = p.people_id) AND (now() > validfrom))

Granted I haven't set up any indexes, but it looks like the correlated subquery, after an initial access of the names table before the join, accesses the names table again for each (person, name) pair after the join (in the join filter). So it's worse than just 2 scans per person. Indeed, the lateral subquery seems better because it accesses the person table and then the names table once for each person. However, I might instead do something like this to access each table just once:

explain

select *
from people p
left join (
  select *, rank() over(partition by people_id
                        order by validfrom desc) as rank
  from names
) n on p.people_id = n.people_id
   and n.rank = 1

QUERY PLAN
Hash Right Join  (cost=3427870.26..3942220.56 rows=5000000 width=36)
  Hash Cond: (n.people_id = p.people_id)
  ->  Subquery Scan on n  (cost=3263927.26..3751430.31 rows=75000 width=26)
        Filter: (n.rank = 1)
        ->  WindowAgg  (cost=3263927.26..3563929.14 rows=15000094 width=18)
              ->  Sort  (cost=3263927.26..3301427.49 rows=15000094 width=18)
                    Sort Key: names.people_id, names.validfrom
                    ->  Seq Scan on names  (cost=0.00..245542.94 rows=15000094 width=18)
  ->  Hash  (cost=77028.00..77028.00 rows=5000000 width=10)
        ->  Seq Scan on people p  (cost=0.00..77028.00 rows=5000000 width=10)

Thanks,
AJ

On Tue, Dec 9, 2014 at 2:24 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
AJ Welch wrote:
> http://blog.heapanalytics.com/postgresqls-powerful-new-join-type-lateral/
>
> I suspected some of the claims in the post may not have been accurate. This one in particular:
>
> "Without lateral joins, we would need to resort to PL/pgSQL to do this analysis. Or, if our data set
> were small, we could get away with complex, inefficient queries."
>
>
> The sum(1) and order by time limit 1 approach seemed less than ideal to me and I thought this analysis
> could be done with normal left joins instead of lateral left joins. So I came up with a proof of
> concept:
>
> https://github.com/ajw0100/snippets/tree/master/SQL/lateral
>
>
> Is my conclusion in the README correct? Does anything beyond select...from...where force a nested
> loop? In that case, is lateral really only useful with set returning functions as the docs suggest?
> Does anyone know of any use cases for lateral that do not involve a set returning function?

Only recently I used lateral joins to optimize a query.

This is a sample of how the query looked bfore:

SELECT ...
FROM people p
     LEFT JOIN names n
        ON (n.people_id = p.people_id
            AND current_timestamp > n.validfrom
            AND NOT EXISTS (SELECT 1 FROM names n2
                            WHERE n2.people_id = p.people_id
                            AND current_timestamp > n2.validfrom
                            AND n2.validfrom > n.validfrom)
           )
WHERE p.id = ...

So basically it is supposed to find the latest valid name for a person.

This required two scans of the "names" table per "person" record.

I rewrote it as

SELECT ...
FROM people p
     LEFT JOIN LATERAL (SELECT * FROM names n
                        WHERE n.people_id = p.people_id
                        AND current_timestamp > n.validfrom
                        ORDER BY n.validfrom DESC LIMIT 1) n
        ON TRUE
WHERE p.id = ...

With the correct index this touched fewer blocks and worked faster.
Also, though this is of course a matter of taste, it is more readable.

Of course this forces a nested loop, but that is not bad as such.
In my case it was not problem (I tried to hint at that with the WHERE clause).

So yes, I think that LATERAL is useful even without set returning functions.

Yours,
Laurenz Albe

В списке pgsql-general по дате отправления:

Предыдущее
От: Eric Svenson
Дата:
Сообщение: Re: Fwd: Fwd: Problem with pg_dump and decimal mark
Следующее
От: István
Дата:
Сообщение: Weird CPU utilization patterns with Postgres