Missed query planner optimization: `n in (select q)` -> `n in (q)`

Поиск
Список
Период
Сортировка
От Josh
Тема Missed query planner optimization: `n in (select q)` -> `n in (q)`
Дата
Msg-id d502b53a497ec80e5d1fcd18e353e7e9@sirjosh3917.com
обсуждение исходный текст
Ответы Re: Missed query planner optimization: `n in (select q)` -> `n in (q)`  ("David G. Johnston" <david.g.johnston@gmail.com>)
Список pgsql-performance
Hey y'all!

So recently, I ran into an issue where a query I wrote wasn't using an index, presumably because what I was doing was
toohard for the query planner to figure out. I've distilled the problem into its essence to the best of my ability, and
foundthat it's because `select` seems to hinder it. 

The problem boils down to the planner not figuring out that these two queries should use an index:

```sql
-- Setup
create table numbers(n int);
insert into numbers (n) select generate_series(1, 1000000);
create index numbers_n_idx on numbers(n);

-- Non-indexed queries
explain analyze select numbers.n from (values (5000000)) quantities(q)
join numbers on numbers.n in (select q);

explain analyze select numbers.n from (values (5000000)) quantities(q)
join numbers on numbers.n = any(select q);
```

These examples may seem silly, so let me provide a "case study" query that should justify the need for such an
optimization.I had a query that was generating an array of items, and wanted to join it to a table given that some
columnof that table was present in the array. It looked like so: 

```sql
select numbers.n from quantities join numbers on numbers.n in (select unnest(quantities.q));
```

This turned out to be horrendously slow, because it was performing a sequential scan! I did however end up settling on
thefollowing form: 

```sql
select numbers.n from quantities join numbers on numbers.n = any(quantities.q);
```

This was only possible because I was dealing with arrays though, and an operation such as `in (select unnest...)` can
beeasily converted to `= any(...)`. However for the general case, I believe an optimization in this area may provide
benefitas there may exist a circumstance that does not have an alternative to a sub-query select (`= any()` was my
alternative),but I am just a database newbie. 

I've noticed this problem has been around since at least 11.7, and is still present as of the `postgres:15beta1` docker
image.I've attached a script which reproduces the issue. It uses docker, so I'm confident you'll be able to run it
withoutissue. 

Finally, I ask:

- Is this an issue that should be fixed? I'm a database newbie so I have no idea about the deep semantics of SQL and
whata select inside a `join_condition` could imply to the planner to prevent it from optimizing it. 

- If "yes" to the previous question, what would be the precise semantics of such an optimization? I loosely say `n in
(selectq)` -> `n in (q)` for all n and q, but of course I don't have enough knowledge to know that this is correct in
termsof whatever Postgres' internal query optimization IR is. 

- Can a database newbie like myself contribute an optimization pass in Postgres to fix this? I'm fascinated by the work
y'alldo, and submitting a patch to Postgres that makes it into production would make my week. 

Thank you for your time, and have a great day!

Вложения

В списке pgsql-performance по дате отправления:

Предыдущее
От: Justin Pryzby
Дата:
Сообщение: Re: Strange behavior of limit clause in complex query
Следующее
От: "David G. Johnston"
Дата:
Сообщение: Re: Missed query planner optimization: `n in (select q)` -> `n in (q)`