On Sat, Aug 9, 2014 at 5:15 AM, Josh Berkus <josh@agliodbs.com> wrote:
Folks,
So one thing we tell users who have chronically long IN() lists is that they should create a temporary table and join against that instead. Other than not having the code, is there a reason why PostgreSQL shouldn't do something like this behind the scenes, automatically?
Hi Josh,
I have known about that problem for many years.
There is a workaround which doesn't require temporary tables (I have used that approach quite a lot when performance mattered):
Instead of:

    SELECT * FROM sometable
    WHERE somefield IN (val1, val2, ...) AND other_filters;

the query can be rewritten as:

    SELECT * FROM sometable
    JOIN (VALUES (val1), (val2), ...) AS v(somefield)
      ON v.somefield = sometable.somefield
    WHERE other_filters;
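As a concrete sketch of the rewrite (the table and column names here are made up for illustration, not from any real schema):

```sql
-- Hypothetical example; "orders", "customer_id", and "status" are invented names.
-- Original IN-list form:
SELECT * FROM orders
WHERE customer_id IN (17, 42, 99)
  AND status = 'shipped';

-- Rewritten JOIN VALUES form; the planner can build a hash over the VALUES rows:
SELECT o.*
FROM orders o
JOIN (VALUES (17), (42), (99)) AS v(customer_id)
  ON v.customer_id = o.customer_id
WHERE o.status = 'shipped';
```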
When there is no index on somefield, the two forms produce different plans: the IN list becomes a sequential scan with a "Filter: (somefield = ANY ('{...}'::integer[]))" condition evaluated per row, while the VALUES form becomes a hash join against the VALUES list.
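Schematically, the two plan shapes look roughly like this (illustrative sketch, not actual EXPLAIN output):

```sql
-- IN-list form: every row is checked against the whole list.
--   Seq Scan on sometable
--     Filter: (somefield = ANY ('{...}'::integer[]))
--
-- JOIN VALUES form: the list is hashed once, then probed per row.
--   Hash Join
--     Hash Cond: (sometable.somefield = "*VALUES*".column1)
--     ->  Seq Scan on sometable
--     ->  Hash
--           ->  Values Scan on "*VALUES*"
```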
On synthetic data (fully in-memory, integer values) I observed the following timings:

    List length    IN           JOIN VALUES
    10             5.39 ms      5.38 ms
    100            9.74 ms      5.49 ms
    1000           53.02 ms     9.89 ms
    10000          231.10 ms    13.14 ms
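A minimal sketch of how such a synthetic comparison can be reproduced (the table name, row count, and value range are my assumptions, not the original test setup):

```sql
-- Hypothetical benchmark setup; all names and sizes are assumptions.
CREATE TABLE sometable AS
  SELECT (random() * 100000)::integer AS somefield
  FROM generate_series(1, 1000000);

-- Time the IN-list form (e.g. via \timing in psql, or EXPLAIN ANALYZE):
EXPLAIN ANALYZE
SELECT count(*) FROM sometable
WHERE somefield IN (1, 2, 3 /* ... up to N literals ... */);

-- Time the JOIN VALUES form with the same literals:
EXPLAIN ANALYZE
SELECT count(*) FROM sometable s
JOIN (VALUES (1), (2), (3) /* ... up to N rows ... */) AS v(somefield)
  ON v.somefield = s.somefield;
```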
So beyond about 10 elements, the VALUES/hash-join approach is the clear winner.
With text literals in the IN list, the performance difference is even more pronounced (about two orders of magnitude for a 10000-element list).
However, if the IN list is used for a primary-key lookup, there is no visible performance difference between the two approaches.
So yes, there is some room for optimizing "Filter: (somefield = ANY ('{...}'::integer[]))" via hashing.
"People problems are solved with people. If people cannot solve the problem, try technology. People will then wish they'd listened at the first stage."