Hi list.
I have a question about the different plans produced by postgres for
an outer join versus an inner join.
(complete sql script attached)
Take these two tables:
CREATE TABLE album
( id SERIAL PRIMARY KEY,
title TEXT NOT NULL
);
CREATE INDEX album_title ON album (title);
CREATE TABLE track
( id SERIAL PRIMARY KEY,
album_id INT NOT NULL REFERENCES album(id),
title TEXT NOT NULL
);
CREATE INDEX track_album ON track(album_id);
CREATE INDEX track_title ON track(title);
where crucially `track` references `album(id)` with a `NOT NULL` reference.
Now, if we query both an inner and an outer join on `track` and
`album` (after having suitably filled-in data), we get very different
plans where only the inner join exploits indices.
That is:
EXPLAIN ANALYZE
SELECT t.id
FROM track t
INNER JOIN album a ON (t.album_id = a.id)
ORDER BY a.title ASC
LIMIT 10;
Produces this query plan:
QUERY
PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.56..3.29 rows=10 width=36) (actual time=0.038..0.052
rows=10 loops=1)
-> Nested Loop (cost=0.56..3606.93 rows=13200 width=36) (actual
time=0.036..0.046 rows=10 loops=1)
-> Index Scan using album_title on album a
(cost=0.28..113.23 rows=1397 width=36) (actual time=0.015..0.016
rows=1 loops=1)
-> Index Scan using track_album on track t (cost=0.29..1.84
rows=66 width=8) (actual time=0.012..0.018 rows=10 loops=1)
Index Cond: (album_id = a.id)
Planning Time: 0.473 ms
Execution Time: 0.096 ms
(7 rows)
While this:
EXPLAIN ANALYZE
SELECT t.id
FROM track t
LEFT JOIN album a ON (t.album_id = a.id)
ORDER BY a.title ASC
LIMIT 10;
Produces this query plan:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Limit (cost=604.43..604.45 rows=10 width=36) (actual
time=20.934..20.943 rows=10 loops=1)
-> Sort (cost=604.43..637.43 rows=13200 width=36) (actual
time=20.932..20.934 rows=10 loops=1)
Sort Key: a.title
Sort Method: top-N heapsort Memory: 27kB
-> Hash Left Join (cost=42.43..319.18 rows=13200 width=36)
(actual time=1.082..12.333 rows=10000 loops=1)
Hash Cond: (t.album_id = a.id)
-> Seq Scan on track t (cost=0.00..242.00 rows=13200
width=8) (actual time=0.031..3.919 rows=10000 loops=1)
-> Hash (cost=24.97..24.97 rows=1397 width=36)
(actual time=0.990..0.991 rows=1000 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 101kB
-> Seq Scan on album a (cost=0.00..24.97
rows=1397 width=36) (actual time=0.014..0.451 rows=1000 loops=1)
Planning Time: 0.251 ms
Execution Time: 20.999 ms
(12 rows)
My question then is, shouldn't the inner and outer join queries be
semantically equivalent when the columns we are joining on are
non-nullable foreign keys?
Is there some corner case I'm not considering?
Would it be a good addition to postgres if it could detect this and
produce a plan that exploits the indices?
(My root motivation for asking this question is this github issue:
https://github.com/hasura/graphql-engine/issues/5949)
Regards,
Philip