Обсуждение: BUG #14107: Major query planner bug regarding subqueries and indices
BUG #14107: Major query planner bug regarding subqueries and indices
От
mathiaskunter@gmail.com
Дата:
The following bug has been logged on the website: Bug reference: 14107 Logged by: Mathias Kunter Email address: mathiaskunter@gmail.com PostgreSQL version: 9.5.0 Operating system: Windows 7 Description: The query planner doesn't use an index although it could, causing an unneccessary sequential table scan. Step by step instructions to reproduce the problem are given below. Step 1 - just create a simple test table with an indexed id column: CREATE TABLE test (id serial NOT NULL, CONSTRAINT pkey PRIMARY KEY (id)); Step 2 - note that the index is used for the following query as expected: EXPLAIN SELECT * FROM test WHERE id = 1 OR id IN (2); QUERY PLAN ------------------------------------------------------------------------- Bitmap Heap Scan on test (cost=8.33..13.67 rows=2 width=4) Recheck Cond: ((id = 1) OR (id = 2)) -> BitmapOr (cost=8.33..8.33 rows=2 width=0) -> Bitmap Index Scan on pkey (cost=0.00..4.16 rows=1 width=0) Index Cond: (id = 1) -> Bitmap Index Scan on pkey (cost=0.00..4.16 rows=1 width=0) Index Cond: (id = 2) Step 3 - note that the index is NOT used for the following query: EXPLAIN SELECT * FROM test WHERE id = 1 OR id IN (SELECT id FROM test WHERE id = 2); QUERY PLAN ------------------------------------------------------------------------------------- Seq Scan on test (cost=8.17..56.42 rows=1275 width=4) Filter: ((id = 1) OR (hashed SubPlan 1)) SubPlan 1 -> Index Only Scan using pkey on test test_1 (cost=0.16..8.17 rows=1 width=4) Index Cond: (id = 2)
Re: BUG #14107: Major query planner bug regarding subqueries and indices
От
"David G. Johnston"
Дата:
On Thu, Apr 21, 2016 at 4:56 AM, <mathiaskunter@gmail.com> wrote: > The following bug has been logged on the website: > > Bug reference: 14107 > Logged by: Mathias Kunter > Email address: mathiaskunter@gmail.com > PostgreSQL version: 9.5.0 > Operating system: Windows 7 > Description: > > The query planner doesn't use an index although it could, causing an > unneccessary sequential table scan. Step by step instructions to reproduc= e > the problem are given below. > > > Step 1 - just create a simple test table with an indexed id column: > > CREATE TABLE test (id serial NOT NULL, CONSTRAINT pkey PRIMARY KEY (id)); > > > Step 2 - note that the index is used for the following query as expected: > > EXPLAIN SELECT * FROM test WHERE id =3D 1 OR id IN (2); > QUERY PLAN > ------------------------------------------------------------------------- > Bitmap Heap Scan on test (cost=3D8.33..13.67 rows=3D2 width=3D4) > Recheck Cond: ((id =3D 1) OR (id =3D 2)) > -> BitmapOr (cost=3D8.33..8.33 rows=3D2 width=3D0) > -> Bitmap Index Scan on pkey (cost=3D0.00..4.16 rows=3D1 width= =3D0) > Index Cond: (id =3D 1) > -> Bitmap Index Scan on pkey (cost=3D0.00..4.16 rows=3D1 width= =3D0) > Index Cond: (id =3D 2) > > > Step 3 - note that the index is NOT used for the following query: > > EXPLAIN SELECT * FROM test WHERE id =3D 1 OR id IN (SELECT id FROM test W= HERE > id =3D 2); > QUERY PLAN > > -------------------------------------------------------------------------= ------------ > Seq Scan on test (cost=3D8.17..56.42 rows=3D1275 width=3D4) > Filter: ((id =3D 1) OR (hashed SubPlan 1)) > SubPlan 1 > -> Index Only Scan using pkey on test test_1 (cost=3D0.16..8.17 ro= ws=3D1 > width=3D4) > Index Cond: (id =3D 2) > > =E2=80=8BTo lazy to research at the moment but I think this has been fixed = and released. You show 9.5.0 as your version. Update and you should be fine. David J=E2=80=8B.
The problem unfortunately persists in 9.5.2 as well, where the query plan is exactly the same as in 9.5.0. (Tested on Windows 7, but this presumably is a cross-platform bug.) From what I can tell, this affects all queries of the type SELECT ... WHERE <condition> OR <indexed_column> IN (<subselect>); As a workaround, you can use UNION, which works as expected: SELECT ... WHERE <condition> UNION SELECT ... WHERE <indexed_column> IN (<subselect>); However, as it currently stands, queries of the above form are de facto unusable with PostgreSQL. So this is pretty serious IMO. Am 21.04.2016 um 23:01 schrieb David G. Johnston: > On Thu, Apr 21, 2016 at 4:56 AM, <mathiaskunter@gmail.com > <mailto:mathiaskunter@gmail.com>>wrote: > > The following bug has been logged on the website: > > Bug reference: 14107 > Logged by: Mathias Kunter > Email address: mathiaskunter@gmail.com > <mailto:mathiaskunter@gmail.com> > PostgreSQL version: 9.5.0 > Operating system: Windows 7 > Description: > > The query planner doesn't use an index although it could, causing an > unneccessary sequential table scan. Step by step instructions to > reproduce > the problem are given below. > > > Step 1 - just create a simple test table with an indexed id column: > > CREATE TABLE test (id serial NOT NULL, CONSTRAINT pkey PRIMARY KEY > (id)); > > > Step 2 - note that the index is used for the following query as > expected: > > EXPLAIN SELECT * FROM test WHERE id = 1 OR id IN (2); > QUERY PLAN > ------------------------------------------------------------------------- > Bitmap Heap Scan on test (cost=8.33..13.67 rows=2 width=4) > Recheck Cond: ((id = 1) OR (id = 2)) > -> BitmapOr (cost=8.33..8.33 rows=2 width=0) > -> Bitmap Index Scan on pkey (cost=0.00..4.16 rows=1 width=0) > Index Cond: (id = 1) > -> Bitmap Index Scan on pkey (cost=0.00..4.16 rows=1 width=0) > Index Cond: (id = 2) > > > Step 3 - note that the index is NOT used for the following query: > > EXPLAIN SELECT * FROM test WHERE id = 1 OR id IN (SELECT id FROM > test WHERE > id = 2); > QUERY PLAN > ------------------------------------------------------------------------------------- > Seq Scan on test (cost=8.17..56.42 rows=1275 width=4) > Filter: ((id = 1) OR (hashed SubPlan 1)) > SubPlan 1 > -> Index Only Scan using pkey on test test_1 (cost=0.16..8.17 > rows=1 > width=4) > Index Cond: (id = 2) > > > âTo lazy to research at the moment but I think this has been fixed and > released. You show 9.5.0 as your version. Update and you should be fine. > > David Jâ. >
On 21 April 2016 at 23:56, <mathiaskunter@gmail.com> wrote: > The following bug has been logged on the website: > > Bug reference: 14107 > Logged by: Mathias Kunter > Email address: mathiaskunter@gmail.com > PostgreSQL version: 9.5.0 > Operating system: Windows 7 > Description: > > The query planner doesn't use an index although it could, causing an > unneccessary sequential table scan. Step by step instructions to reproduce > the problem are given below. > .. > Step 3 - note that the index is NOT used for the following query: > > EXPLAIN SELECT * FROM test WHERE id = 1 OR id IN (SELECT id FROM test WHERE > id = 2); > QUERY PLAN > ------------------------------------------------------------------------------------- > Seq Scan on test (cost=8.17..56.42 rows=1275 width=4) > Filter: ((id = 1) OR (hashed SubPlan 1)) > SubPlan 1 > -> Index Only Scan using pkey on test test_1 (cost=0.16..8.17 rows=1 > width=4) > Index Cond: (id = 2) Thanks for the report, but I think your subject line is a little exaggerated. I'd describe this as a missed optimisation, and not even a bug. Having said that, I have looked previously at adding more smarts to the planner around detecting when each relation can return, at most, a single row, based on the restriction quals. When that can be proved, instead of performing a join to that relation, use an init plan, such as what you'd get if you'd written the query as; EXPLAIN SELECT * FROM test WHERE id = 1 OR id = (SELECT id FROM test WHERE id = 2); then just remove the join/subplan altogether. I'd not actually thought about expanding this to IN and NOT IN, but likely it would be possible, providing NULLs could be handled correctly too. Was this the optimisation you think is missing? or did you expect the subquery to feed the bitmap scan keys? If so I'm not too sure how that could be made to work well when the row estimates are way off, and the bitmap index scan has to deal with millions or billions of scan keys. It might not end well. As for if the IN() clause could be detected to return a single row... well that's in my "I'd like to do that one day list". -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> Thanks for the report, but I think your subject line is a little > exaggerated. Sorry for that :-) > I'd not actually thought about expanding this to IN and NOT IN, but > likely it would be possible, providing NULLs could be handled > correctly too. Was this the optimisation you think is missing? No - let's consider a real-world example where the problem becomes more evident. Assume we have the following two tables: CREATE TABLE book (id SERIAL NOT NULL, name VARCHAR, authorId SERIAL, CONSTRAINT book_pkey PRIMARY KEY (id)); CREATE TABLE author (id SERIAL NOT NULL, surname VARCHAR, CONSTRAINT author_pkey PRIMARY KEY (id)); All relevant columns are indexed. The book table contains 1.3 M rows. Now let's say we want to find books by either name or by author name. A simple problem, actually. The following query is used (which could of course also be rewritten to use a JOIN instead, but that's not the point here). We find out that it has an unexpected and unacceptably long execution time: EXPLAIN SELECT * FROM book WHERE name = 'Harry Potter' OR authorId IN (SELECT id FROM author WHERE surname = 'Rowling'); QUERY PLAN ------------------------------------------------------------------------------------------- Seq Scan on book (cost=13.68..26792.88 rows=576709 width=40) Filter: (((name)::text = 'Harry Potter'::text) OR (hashed SubPlan 1)) SubPlan 1 -> Bitmap Heap Scan on author (cost=4.20..13.67 rows=6 width=4) Recheck Cond: ((surname)::text = 'Rowling'::text) -> Bitmap Index Scan on author_surname_index (cost=0.00..4.20 rows=6 width=0) Index Cond: ((surname)::text = 'Rowling'::text) In contrast to that, consider the query plan for the following logically equivalent query, which executes in just a few milliseconds (!): EXPLAIN SELECT * FROM book WHERE name = 'Harry Potter' UNION SELECT * FROM book WHERE authorId IN (SELECT id FROM author WHERE surname = 'Rowling'); QUERY PLAN ------------------------------------------------------------------------------------------------------------ HashAggregate (cost=454.87..455.99 rows=112 width=29) Group Key: book.id, book.name, book.authorid -> Append (cost=0.43..454.03 rows=112 width=29) -> Index Scan using book_name_index on book (cost=0.43..16.48 rows=3 width=29) Index Cond: ((name)::text = 'Harry Potter'::text) -> Nested Loop (cost=4.63..436.43 rows=109 width=29) -> Bitmap Heap Scan on author (cost=4.20..13.67 rows=6 width=4) Recheck Cond: ((surname)::text = 'Rowling'::text) -> Bitmap Index Scan on author_surname_index (cost=0.00..4.20 rows=6 width=0) Index Cond: ((surname)::text = 'Rowling'::text) -> Index Scan using book_authorid_index on book book_1 (cost=0.43..70.28 rows=18 width=29) Index Cond: (authorid = author.id)
Sorry, typo: > CREATE TABLE book (id SERIAL NOT NULL, name VARCHAR, authorId SERIAL, CONSTRAINT book_pkey PRIMARY KEY (id)); It should of course be "authorId INTEGER" instead of "authorId SERIAL".
2016-04-22 17:02 GMT+03:00 Mathias Kunter <mathiaskunter@gmail.com>: > EXPLAIN SELECT * FROM book WHERE name = 'Harry Potter' OR authorId IN > (SELECT id FROM author WHERE surname = 'Rowling'); > QUERY PLAN > > ------------------------------------------------------------------------------------------- > Seq Scan on book (cost=13.68..26792.88 rows=576709 width=40) > ... > > EXPLAIN SELECT * FROM book WHERE name = 'Harry Potter' UNION SELECT * FROM > book WHERE authorId IN (SELECT id FROM author WHERE surname = 'Rowling'); > QUERY PLAN > > ------------------------------------------------------------------------------------------------------------ > HashAggregate (cost=454.87..455.99 rows=112 width=29) > Queries return different number of rows, meaning they're not fully equivalent. Can you provide full DDL for the tables, constraints and indexes in use? -- Victor Y. Yegorov
On Fri, Apr 22, 2016 at 4:14 PM, Victor Yegorov <vyegorov@gmail.com> wrote: >> EXPLAIN SELECT * FROM book WHERE name = 'Harry Potter' UNION SELECT * FROM ... > Queries return different number of rows, meaning they're not fully > equivalent. IIRC plain-EXPLAIN returns estimates, you need explain analyze to be comparable. Francisco Olarte.
2016-04-22 19:24 GMT+03:00 Francisco Olarte <folarte@peoplecall.com>: > IIRC plain-EXPLAIN returns estimates, you need explain analyze to be > comparable. Of course, sorry for this. Well, then `EXPLAIN (analyze, buffers)` is also wanted, together with object definitions. -- Victor Y. Yegorov
> Queries return different number of rows, meaning they're not fully equivalent. Well, I think in the given example they should actually be fully equivalent, since the unique id column is selected as well. Thus, UNION can't do any unwanted row eliminations. > Well, then `EXPLAIN (analyze, buffers)` is also wanted, together with > object definitions. So here are the exact SQL commands to reproduce the problem. -- Create table "book" CREATE TABLE book (id SERIAL NOT NULL, name VARCHAR, author INTEGER, CONSTRAINT book_pkey PRIMARY KEY (id)); CREATE INDEX book_name_index ON book (name); CREATE INDEX book_author_index ON book (author); -- Create table "author" CREATE TABLE author (id SERIAL NOT NULL, name VARCHAR, CONSTRAINT author_pkey PRIMARY KEY (id)); CREATE INDEX author_name_index ON author (name); -- Insert some test data so that the planner would never assume a sequential scan could be faster INSERT INTO book (id, name, author) SELECT generate_series(1, 100000), md5(random()::text), (generate_series(1, 100000) - 1) % 10000 + 1; INSERT INTO author (id, name) SELECT generate_series(1, 10000), md5(random()::text); ANALYZE book; ANALYZE author; -- Check query plan when using OR operator EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM book WHERE name = 'Harry Potter' OR author IN (SELECT id FROM author WHERE name = 'Rowling'); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------- Seq Scan on book (cost=8.30..2443.30 rows=50001 width=41) (actual time=25.527..25.527 rows=0 loops=1) Filter: (((name)::text = 'Harry Potter'::text) OR (hashed SubPlan 1)) Rows Removed by Filter: 100000 Buffers: shared hit=937 SubPlan 1 -> Index Scan using author_name_index on author (cost=0.29..8.30 rows=1 width=4) (actual time=0.041..0.041 rows=0 loops=1) Index Cond: ((name)::text = 'Rowling'::text) Buffers: shared hit=2 Planning time: 0.237 ms Execution time: 25.603 ms -- Check query plan when using UNION operator EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM book WHERE name = 'Harry Potter' UNION SELECT * FROM book WHERE author IN (SELECT id FROM author WHERE name = 'Rowling'); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=58.42..58.53 rows=11 width=41) (actual time=0.066..0.066 rows=0 loops=1) Group Key: book.id, book.name, book.author Buffers: shared hit=5 -> Append (cost=0.42..58.34 rows=11 width=41) (actual time=0.061..0.061 rows=0 loops=1) Buffers: shared hit=5 -> Index Scan using book_name_index on book (cost=0.42..8.44 rows=1 width=41) (actual time=0.035..0.035 rows=0 loops=1) Index Cond: ((name)::text = 'Harry Potter'::text) Buffers: shared hit=3 -> Nested Loop (cost=4.66..49.79 rows=10 width=41) (actual time=0.019..0.019 rows=0 loops=1) Buffers: shared hit=2 -> Index Scan using author_name_index on author (cost=0.29..8.30 rows=1 width=4) (actual time=0.015..0.015 rows=0 loops=1) Index Cond: ((name)::text = 'Rowling'::text) Buffers: shared hit=2 -> Bitmap Heap Scan on book book_1 (cost=4.37..41.39 rows=10 width=41) (never executed) Recheck Cond: (author = author.id) -> Bitmap Index Scan on book_author_index (cost=0.00..4.37 rows=10 width=0) (never executed) Index Cond: (author = author.id) Planning time: 0.669 ms Execution time: 0.183 ms
Hmm... and this is even worse (on the data you provided): EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM book WHERE name = 'Harry Potter' OR EXISTS ( SELECT 1 FROM author WHERE author.id = book.author AND author.name = 'Rowling' ); Seq Scan on book (cost=0.00..832685.00 rows=50000 width=41) (actual time=35.848..35.848 rows=0 loops=1) Filter: (((name)::text = 'Harry Potter'::text) OR (alternatives: SubPlan 1 or hashed SubPlan 2)) Rows Removed by Filter: 100000 Buffers: shared hit=937 SubPlan 1 -> Index Scan using author_name_index on author (cost=0.29..8.30 rows=1 width=0) (never executed) Index Cond: ((name)::text = 'Rowling'::text) Filter: (id = book.author) SubPlan 2 -> Index Scan using author_name_index on author author_1 (cost=0.29..8.30 rows=1 width=4) (actual time=0.027..0.027 rows=0 loops=1) Index Cond: ((name)::text = 'Rowling'::text) Buffers: shared hit=2 Planning time: 0.268 ms Execution time: 35.923 ms ----- WBR, Yaroslav Schekin. -- View this message in context: http://postgresql.nabble.com/BUG-14107-Major-query-planner-bug-regarding-subqueries-and-indices-tp5899873p5900147.html Sent from the PostgreSQL - bugs mailing list archive at Nabble.com.
Re: Re: BUG #14107: Major query planner bug regarding subqueries and indices
От
Mathias Kunter
Дата:
> Hmm... and this is even worse (on the data you provided): > > EXPLAIN (ANALYZE, BUFFERS) > SELECT * > FROM book > WHERE name = 'Harry Potter' > OR EXISTS ( > SELECT 1 > FROM author > WHERE author.id = book.author AND author.name = 'Rowling' > ); Yes, but the problem seems to be even bigger. Apparently it's neither limited to subqueries nor to the operators EXISTS, IN, NOT IN, ANY, SOME, and ALL. It rather seems that the planner has a severe bug regarding usage of the OR operator itself. This seems hard to believe, so please verify the query plans given below (and also earlier). I'd be happy if I'm mistaken on this. EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM book JOIN author ON (book.author = author.id) WHERE book.name = 'Harry Potter' OR author.name = 'Rowling'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ Hash Join (cost=309.00..4118.40 rows=11 width=78) (actual time=325.283..325.283 rows=0 loops=1) Hash Cond: (book.author = author.id) Join Filter: (((book.name)::text = 'Harry Potter'::text) OR ((author.name)::text = 'Rowling'::text)) Rows Removed by Join Filter: 100000 Buffers: shared hit=1019 -> Seq Scan on book (cost=0.00..1935.00 rows=100000 width=41) (actual time=0.010..130.936 rows=100000 loops=1) Buffers: shared hit=935 -> Hash (cost=184.00..184.00 rows=10000 width=37) (actual time=28.933..28.933 rows=10000 loops=1) Buckets: 16384 Batches: 1 Memory Usage: 802kB Buffers: shared hit=84 -> Seq Scan on author (cost=0.00..184.00 rows=10000 width=37) (actual time=0.007..14.061 rows=10000 loops=1) Buffers: shared hit=84 Planning time: 0.456 ms Execution time: 325.546 ms EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM book WHERE author IN (SELECT id FROM author WHERE name = 'Rowling') OR FALSE; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------- Seq Scan on book (cost=8.30..2193.30 rows=50000 width=41) (actual time=13.838..13.838 rows=0 loops=1) Filter: (hashed SubPlan 1) Rows Removed by Filter: 100000 Buffers: shared hit=937 SubPlan 1 -> Index Scan using author_name_index on author (cost=0.29..8.30 rows=1 width=4) (actual time=0.032..0.032 rows=0 loops=1) Index Cond: ((name)::text = 'Rowling'::text) Buffers: shared hit=2 Planning time: 0.204 ms Execution time: 13.910 ms
> Yes, but the problem seems to be even bigger. Seems like it. I've asked folks on IRC for help, and it was confirmed the bug exists since 8.2, at least: https://gist.github.com/ilmari/b8ae0c37a7465d6fb3987dc651923660 Also, I've created sqlfiddle based on your example: http://sqlfiddle.com/#!15/76ef4/3 ----- WBR, Yaroslav Schekin. -- View this message in context: http://postgresql.nabble.com/BUG-14107-Major-query-planner-bug-regarding-subqueries-and-indices-tp5899873p5900430.html Sent from the PostgreSQL - bugs mailing list archive at Nabble.com.
> Seems like it. I've asked folks on IRC for help, and it was confirmed the > bug exists since 8.2 So, as it currently stands, queries of the following synopsis are de facto unusable with PostgreSQL, as these will always ignore any existing indices and trigger a full table scan instead: SELECT ... FROM a WHERE a.x = ? OR a.y IN (...); SELECT ... FROM a JOIN b ON (...) WHERE a.x = ? OR b.y = ?; The following SQL fiddle demonstrates the bug: http://sqlfiddle.com/#!15/ab714/2 Does anybody care? Shouldn't something like this be fixed ASAP??
Mathias Kunter <mathiaskunter@gmail.com> writes: > Does anybody care? Shouldn't something like this be fixed ASAP?? As was already stated upthread, this is not a bug; it's an opportunity for future improvement. It's unlikely to get "fixed ASAP" because there would be quite a bit of work involved, possibly including new executor infrastructure not just planner work. As far as the first case goes, the executor doesn't currently have any direct way to treat "a.x IN (SELECT ...)" as an indexqual for a.x. In simple cases the planner can work around that by transforming the query into a join against a unique-ified version of the sub-select, but that doesn't work if the IN is underneath an OR. If you know that the sub-select isn't going to return very many rows, you could do SELECT ... FROM a WHERE a.x = ? OR a.y = ANY(ARRAY(SELECT ...)); but this would blow up rather badly with a large sub-select result, so I'm not sure I want to try to make the planner transform it that way automatically. I don't actually see any way to do very much with your second example at all: > SELECT ... FROM a JOIN b ON (...) WHERE a.x = ? OR b.y = ?; There's no way to push anything down to either the A or B scans from that WHERE condition: you can't remove any rows before the join because they might join to rows on the other side that satisfy the other half of the OR. regards, tom lane
On 29 April 2016 at 13:10, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I don't actually see any way to do very much with your second example at > all: > >> SELECT ... FROM a JOIN b ON (...) WHERE a.x = ? OR b.y = ?; > > There's no way to push anything down to either the A or B scans from > that WHERE condition: you can't remove any rows before the join because > they might join to rows on the other side that satisfy the other half > of the OR. I was confused when I read that too. The only way I thought to transform would be to create a UNION for each condition, which in theory would be possible providing the SELECT lists contained the PK columns. Of course, in practice this would be difficult, since UNIONs are planned later, and having the query execute in that way would have to be costed, and only used if cheaper. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> If you know that the > sub-select isn't going to return very many rows, you could do > > SELECT ... FROM a WHERE a.x = ? OR a.y = ANY(ARRAY(SELECT ...)); Isn't the planner already doing something like this, since the following query is using the index as expected: SELECT ... FROM a WHERE a.x IN (SELECT ...); > but this would blow up rather badly with a large sub-select result, > so I'm not sure I want to try to make the planner transform it that > way automatically. Wouldn't it be possible then to use this optimization based on the estimated result size of the subquery? I think this would almost always be faster than a sequential scan anyway. I observed that using the ANY(ARRAY(SELECT...)) syntax on my small test tables (100 K rows) already improves the query time by a factor of more than 100, and it will be even more when tables are large. Please consider implementing this optimization! > I don't actually see any way to do very much with your second example at > all: > >> SELECT ... FROM a JOIN b ON (...) WHERE a.x = ? OR b.y = ?; Assuming both joined tables contain a PK (or another unique column), then it should be possible by replanning the query as: SELECT ... FROM a JOIN b ON ((join_cond AND a.x = ?) OR (join_cond AND b.y = ?)); (Let "join_cond" denote the original join condition here.) Now, the JOIN implementation must be smart enough to handle OR conditions: First, obtain the rows satisfying join_cond AND a.x = ? by using the index as usual. For each matching row, create the tuple (a.id, b.id) and insert it into a search tree (or hash or whatever). Then, obtain the rows satisfying join_cond AND b.y = ? For each matching row, query the search tree whether it already contains the tuple (a.id, b.id), and only add the current row to the final result if it doesn't.
Sorry for bumping this one more time, but I'd like to share some more real-life performance test results of using ANY(ARRAY(...)) instead of IN(...), hoping that you'd maybe still consider implementing such an optimization into the query planner. Since the test results indicate that the performance boost can really be massive on certain query types (factor 1000), I think that it'd really be worth the work. ===== Test setup ===== The tables "mb.release" and "mb.release_group" both contain about 1.5 million rows of real data, taken from the MusicBrainz database, and are of course properly indexed. All performance tests have been repeated a few times to be comparable. The test covers subqueries which return just a few rows and also subqueries which return more than 100000 rows. The queries test the performance of IN vs. ANY(ARRAY()) when used in different scenarios. For reference, the full query plans of all used queries are linked below. ===== Tested queries ===== 1) SELECT id FROM mb.release WHERE release_group IN (SELECT id FROM mb.release_group WHERE name = 'Bear'); 2) SELECT id FROM mb.release WHERE release_group IN (SELECT id FROM mb.release_group WHERE name < 'Bear'); 3) SELECT id FROM mb.release WHERE name = 'Tiger' OR release_group IN (SELECT id FROM mb.release_group WHERE name = 'Bear'); 4) SELECT id FROM mb.release WHERE name = 'Tiger' OR release_group IN (SELECT id FROM mb.release_group WHERE name < 'Bear'); 5) SELECT id FROM mb.release WHERE name = 'Tiger' AND release_group IN (SELECT id FROM mb.release_group WHERE name = 'Bear'); 6) SELECT id FROM mb.release WHERE name = 'Tiger' AND release_group IN (SELECT id FROM mb.release_group WHERE name < 'Bear'); ===== Test results ===== All numbers are given in milliseconds and show the total query time (planning + execution). ------------------------------------------- | Query | IN (...) | = ANY(ARRAY(...)) | ------------------------------------------- | 1 | 0.7 | 0.4 | | 2 | 6001.1 | 2517.8 | | 3 | 711.3 | 0.5 | | 4 | > 1000000.0 | 1962.6 | | 5 | 0.8 | 0.5 | | 6 | 0.9 | 492.7 | ------------------------------------------- Note: Query 4 using the IN operator has been canceled after running for more than 15 minutes. ===== Full query plans ===== For reference, all query plans of this performance test have been recorded using EXPLAIN (ANALYZE, BUFFERS). Please find them at http://pastebin.com/zymkbcSf
On 11 May 2016 at 23:55, Mathias Kunter <mathiaskunter@gmail.com> wrote: > Sorry for bumping this one more time, but I'd like to share some more > real-life performance test results of using ANY(ARRAY(...)) instead of > IN(...), hoping that you'd maybe still consider implementing such an > optimization into the query planner. Since the test results indicate that > the performance boost can really be massive on certain query types (factor > 1000), I think that it'd really be worth the work. > > > ===== Test setup ===== > > The tables "mb.release" and "mb.release_group" both contain about 1.5 > million rows of real data, taken from the MusicBrainz database, and are of > course properly indexed. All performance tests have been repeated a few > times to be comparable. > > The test covers subqueries which return just a few rows and also subqueries > which return more than 100000 rows. The queries test the performance of IN > vs. ANY(ARRAY()) when used in different scenarios. > > For reference, the full query plans of all used queries are linked below. > > > ===== Tested queries ===== > > 1) SELECT id FROM mb.release WHERE release_group IN (SELECT id FROM > mb.release_group WHERE name = 'Bear'); > > 2) SELECT id FROM mb.release WHERE release_group IN (SELECT id FROM > mb.release_group WHERE name < 'Bear'); > > 3) SELECT id FROM mb.release WHERE name = 'Tiger' OR release_group IN > (SELECT id FROM mb.release_group WHERE name = 'Bear'); > > 4) SELECT id FROM mb.release WHERE name = 'Tiger' OR release_group IN > (SELECT id FROM mb.release_group WHERE name < 'Bear'); > > 5) SELECT id FROM mb.release WHERE name = 'Tiger' AND release_group IN > (SELECT id FROM mb.release_group WHERE name = 'Bear'); > > 6) SELECT id FROM mb.release WHERE name = 'Tiger' AND release_group IN > (SELECT id FROM mb.release_group WHERE name < 'Bear'); > > > ===== Test results ===== > > All numbers are given in milliseconds and show the total query time > (planning + execution). > > ------------------------------------------- > | Query | IN (...) | = ANY(ARRAY(...)) | > ------------------------------------------- > | 1 | 0.7 | 0.4 | > | 2 | 6001.1 | 2517.8 | > | 3 | 711.3 | 0.5 | > | 4 | > 1000000.0 | 1962.6 | > | 5 | 0.8 | 0.5 | > | 6 | 0.9 | 492.7 | > ------------------------------------------- > > Note: Query 4 using the IN operator has been canceled after running for more > than 15 minutes. How do you find the ANY(ARRAY(...)) version performs with say 10 million records in the array? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> How do you find the ANY(ARRAY(...)) version performs with say 10 > million records in the array? I've tested with a subquery which returns about 20 million different rows. In this case IN(...) is about 5 times faster than ANY(ARRAY(...)) for me. The exact numbers are: IN(...): about 22 seconds ANY(ARRAY(...)): about 115 seconds However, estimated query costs seem to be always correct. So shouldn't it be quite easy for the planner to create query plans for both the ANY(ARRAY(...)) and the IN(...) version and then just use the plan where costs are cheaper?