Обсуждение: Unusual slowdown using subselects
I'm stress testing my application by creating large data sets. This particular query selects rows from the schedule table that have a specific owner_id. (I'll show you the results of explain) calendar=# explain select * from schedule where schedule.owner_id=101 or schedule.owner_id=102; Index Scan using schedule_id_index, schedule_id_index on schedule (cost=0.00..78.64 rows=20 width=40) Looks great and executes very fast. calendar=# explain select group_id from groups where user_id=101; NOTICE: QUERY PLAN: Index Scan using groups_id_index on groups (cost=0.00..2.02 rows=1 width=4) Again, very fast. The groups table maps users to groups. However, this next one is slow. calendar=# explain select * from schedule where schedule.owner_id in (select group_id from groups where user_id=101); NOTICE: QUERY PLAN: Seq Scan on schedule (cost=0.00..2039895.00 rows=1000000 width=40) SubPlan -> Materialize (cost=2.02..2.02 rows=1 width=4) -> Index Scan using groups_id_index on groups (cost=0.00..2.02 rows=1 width=4) You'll see in this one, where the first example did a index scan, this one with a very similar query does a seq scan. The two queries should be nearly identical, but this one runs very slowly. Can anyone explain why this happens and/or how I can do a sub-select like this and get fast results? Thank you John Aughey
> calendar=# explain select * from schedule where schedule.owner_id in > (select group_id from groups where user_id=101); > calendar=# explain select * from schedule where schedule.owner_id=101 or > schedule.owner_id=102; A sub-select is different than a join, and is tretaed differently by the database, and they are much, much slower than simply doing a join. So, we use subselects only when absolutely, positively necessary, otherwise, we just do a join. steve
> calendar=# explain select * from schedule where schedule.owner_id in > (select group_id from groups where user_id=101); > NOTICE: QUERY PLAN: > Seq Scan on schedule (cost=0.00..2039895.00 rows=1000000 width=40) > SubPlan > -> Materialize (cost=2.02..2.02 rows=1 width=4) > -> Index Scan using groups_id_index on groups (cost=0.00..2.02 > rows=1 width=4) > > You'll see in this one, where the first example did a index scan, this one > with a very similar query does a seq scan. The two queries should be > nearly identical, but this one runs very slowly. > > Can anyone explain why this happens and/or how I can do a sub-select like > this and get fast results? Try using EXISTS rather than IN (see the FAQ for more information).
John Aughey wrote: > I'm stress testing my application by creating large data sets. This > particular query selects rows from the schedule table that have a specific > owner_id. (I'll show you the results of explain) > > calendar=# explain select * from schedule where schedule.owner_id=101 or > schedule.owner_id=102; > Index Scan using schedule_id_index, schedule_id_index on schedule > (cost=0.00..78.64 rows=20 width=40) > > Looks great and executes very fast. > > calendar=# explain select group_id from groups where > user_id=101; > NOTICE: QUERY PLAN: > Index Scan using groups_id_index on groups (cost=0.00..2.02 rows=1 > width=4) > > Again, very fast. The groups table maps users to groups. > > However, this next one is slow. > > calendar=# explain select * from schedule where schedule.owner_id in > (select group_id from groups where user_id=101); > NOTICE: QUERY PLAN: > Seq Scan on schedule (cost=0.00..2039895.00 rows=1000000 width=40) > SubPlan > -> Materialize (cost=2.02..2.02 rows=1 width=4) > -> Index Scan using groups_id_index on groups (cost=0.00..2.02 > rows=1 width=4) > In my DB: # explain SELECT * FROM grls WHERE grls.ag_id = 24; NOTICE: QUERY PLAN: Index Scan using grls_ag_id on grls (cost=0.00..597.87 rows=849 width=122) # explain SELECT ag_id FROM agncs WHERE ag_id = 24; NOTICE: QUERY PLAN: Seq Scan on agncs (cost=0.00..1.31 rows=1 width=4) And together: # explain select * from grls where grls.ag_id in (select ag_id from agncs where ag_id = 24); NOTICE: QUERY PLAN: Seq Scan on grls (cost=0.00..40623.38 rows=30195 width=122) SubPlan -> Materialize (cost=1.31..1.31 rows=1 width=4) -> Seq Scan on agncs (cost=0.00..1.31 rows=1 width=4) -------------------------------------- # select count(*) from grls; 30195 Summarize - with subselect indices ignores and search look all DB rows. IT'S BUG. P.S. Sorry my English.
> In my DB: > # explain SELECT * FROM grls WHERE grls.ag_id = 24; > NOTICE: QUERY PLAN: > Index Scan using grls_ag_id on grls (cost=0.00..597.87 rows=849 width=122) > > # explain SELECT ag_id FROM agncs WHERE ag_id = 24; > NOTICE: QUERY PLAN: > Seq Scan on agncs (cost=0.00..1.31 rows=1 width=4) > > And together: > # explain select * from grls where grls.ag_id in (select ag_id from agncs > where ag_id = 24); > NOTICE: QUERY PLAN: > Seq Scan on grls (cost=0.00..40623.38 rows=30195 width=122) > SubPlan > -> Materialize (cost=1.31..1.31 rows=1 width=4) > -> Seq Scan on agncs (cost=0.00..1.31 rows=1 width=4) > -------------------------------------- > # select count(*) from grls; > 30195 > > Summarize - with subselect indices ignores and search look all DB rows. Not quite. in(<subselect>) doesn't use indexes (postgres doesn't realize that this is effectively a join), but exists(<subselect>) often will, and is the work around for the problem until it gets fixed (see FAQ for more information - I forget the number, the title of the question is something like "why are my subselects using in so slow."