Обсуждение: SQL 'in' vs join.

Поиск
Список
Период
Сортировка

SQL 'in' vs join.

От
mlw
Дата:
Why is a "select * from table1 where field in (select field from table2
where condition )"

is so dramatically bad compared to:

"select * from table1, table2 where table1.field = table2.field and
condition"

I can't understand why the first query isn't optimized better than the
second one. The 'in' query forces a full table scan (it shouldn't) and
the second one uses the indexes. Does anyone know why?

I know I am no SQL guru, but my gut tells me that the 'in' operator
should be far more efficient than a join. 

Here are the actual queries:

cdinfo=# explain select trackid from zsong where muzenbr in (select
muzenbr from ztitles where title = 'Mulan') ;
NOTICE:  QUERY PLAN:
Seq Scan on zsong  (cost=100000000.00..219321449380756.66 rows=2193213
width=4) SubPlan   ->  Materialize  (cost=100000022.50..100000022.50 rows=10 width=4)         ->  Seq Scan on ztitles
(cost=100000000.00..100000022.50
rows=10 width=4) 

cdinfo=# explain select trackid from zsong, ztitles where
ztitles.muzenbr = zsong.muzenbr and title = 'Mulan' ;
NOTICE:  QUERY PLAN:
Merge Join  (cost=0.00..183664.10 rows=219321 width=12) ->  Index Scan using zsong_muznbr on zsong
(cost=0.00..156187.31
rows=2193213 width=8) ->  Index Scan using ztitles_pkey on ztitles  (cost=0.00..61.50
rows=10 width=4)  

cdinfo=# \d zsong                              Table "zsong"Attribute |       Type        |                 Modifier
-----------+-------------------+-------------------------------------------muzenbr   | integer           |disc      |
integer          |trk       | integer           |song      | character varying |trackid   | integer           | not
nulldefault
 
nextval('trackid'::text)artistid  | integer           |acd       | character varying |
Indices: zsong_muznbr,        zsong_pkey    

cdinfo=# \d ztitles             Table "ztitles"Attribute  |       Type        | Modifier
------------+-------------------+----------muzenbr    | integer           | not nullartistid   | integer
|cat2      | character varying |cat3       | character varying |cat4       | character varying |performer  | character
varying|performer2 | character varying |title      | character varying |artist1    | character varying |engineer   |
charactervarying |producer   | character varying |labelname  | character varying |catalog    | character varying
|distribut | character varying |released   | character varying |origrel    | character varying |nbrdiscs   | character
varying|spar       | character varying |minutes    | character varying |seconds    | character varying |monostereo |
charactervarying |studiolive | character varying |available  | character(1)      |previews   | character varying
|pnotes    | character varying |acd        | character varying |
 
Index: ztitles_pkey 

-- 
http://www.mohawksoft.com


Re: SQL 'in' vs join.

От
Hannu Krosing
Дата:
mlw wrote:
> 
> Why is a "select * from table1 where field in (select field from table2
> where condition )"
> 
> is so dramatically bad compared to:
> 
> "select * from table1, table2 where table1.field = table2.field and
> condition"
> 
> I can't understand why the first query isn't optimized better than the
> second one. The 'in' query forces a full table scan (it shouldn't) and
> the second one uses the indexes. Does anyone know why?

Its not done yet, and probably hsomewhat difficult to do in a general
fashion

> I know I am no SQL guru, but my gut tells me that the 'in' operator
> should be far more efficient than a join.
> 
> Here are the actual queries:
> 
> cdinfo=# explain select trackid from zsong where muzenbr in (select
> muzenbr from ztitles where title = 'Mulan') ;

try

explainselect trackid  from zsong where muzenbr in (        select muzenbr          from ztitles         where title =
'Mulan'          and ztitles.muzenbr=zsong.muzenbr );
 

this should hint the current optimizer to do the right thing;

-----------------
Hannu


Re: SQL 'in' vs join.

От
mlw
Дата:
Hannu Krosing wrote:
> 
> mlw wrote:
> >
> > Why is a "select * from table1 where field in (select field from table2
> > where condition )"
> >
> > is so dramatically bad compared to:
> >
> > "select * from table1, table2 where table1.field = table2.field and
> > condition"
> >
> > I can't understand why the first query isn't optimized better than the
> > second one. The 'in' query forces a full table scan (it shouldn't) and
> > the second one uses the indexes. Does anyone know why?
> 
> Its not done yet, and probably hsomewhat difficult to do in a general
> fashion
> 
> > I know I am no SQL guru, but my gut tells me that the 'in' operator
> > should be far more efficient than a join.
> >
> > Here are the actual queries:
> >
> > cdinfo=# explain select trackid from zsong where muzenbr in (select
> > muzenbr from ztitles where title = 'Mulan') ;
> 
> try
> 
> explain
>  select trackid
>    from zsong
>   where muzenbr in (
>          select muzenbr
>            from ztitles
>           where title = 'Mulan'
>             and ztitles.muzenbr=zsong.muzenbr
>   );
> 
> this should hint the current optimizer to do the right thing;
> 
> -----------------
> Hannu

Nope:

cdinfo=# explain
cdinfo-#  select trackid
cdinfo-#    from zsong
cdinfo-#   where muzenbr in (
cdinfo(#          select muzenbr
cdinfo(#            from ztitles
cdinfo(#           where title = 'Mulan'
cdinfo(#             and ztitles.muzenbr=zsong.muzenbr
cdinfo(#   );
NOTICE:  QUERY PLAN:
Seq Scan on zsong  (cost=100000000.00..104474515.18 rows=2193213
width=4) SubPlan   ->  Index Scan using ztitles_pkey on ztitles  (cost=0.00..4.05
rows=1 width=4)  


But what I also find odd is, look at the components:

cdinfo=# explain select muzenbr from ztitles where title = 'Mulan' ;
NOTICE:  QUERY PLAN:
Index Scan using ztitles_title_ndx on ztitles  (cost=0.00..7.08 rows=1
width=4)  

cdinfo=# explain select trackid from zsong where muzenbr in ( 1,2,3,4,5)
;
NOTICE:  QUERY PLAN:
Index Scan using zsong_muzenbr_ndx, zsong_muzenbr_ndx,
zsong_muzenbr_ndx, zsong_muzenbr_ndx, zsong_muzenbr_ndx on zsong 
(cost=0.00..392.66 rows=102 width=4) 


Now, given the two components, each with very low costs, it chooses to
do a sequential scan on the table. I don't get it. I have have been
having no end of problems with Postgres' optimizer. It just seems to be
brain dead at times. It is a huge point of frustration to me. I am tied
to postgres in my current project, and I fear that I will not be able to
implement certain features because of this sort of behavior.


-- 
http://www.mohawksoft.com


Re: SQL 'in' vs join.

От
Hannu Krosing
Дата:
mlw wrote:
> 
> Hannu Krosing wrote:
> >
> > mlw wrote:
> > >
> > > Why is a "select * from table1 where field in (select field from table2
> > > where condition )"
> > >
> > > is so dramatically bad compared to:
> > >
> > > "select * from table1, table2 where table1.field = table2.field and
> > > condition"
> > >
> > > I can't understand why the first query isn't optimized better than the
> > > second one. The 'in' query forces a full table scan (it shouldn't) and
> > > the second one uses the indexes. Does anyone know why?
> >
> > Its not done yet, and probably hsomewhat difficult to do in a general
> > fashion
> >
> > > I know I am no SQL guru, but my gut tells me that the 'in' operator
> > > should be far more efficient than a join.
> > >
> > > Here are the actual queries:
> > >
> > > cdinfo=# explain select trackid from zsong where muzenbr in (select
> > > muzenbr from ztitles where title = 'Mulan') ;
> >
> > try
> >
> > explain
> >  select trackid
> >    from zsong
> >   where muzenbr in (
> >          select muzenbr
> >            from ztitles
> >           where title = 'Mulan'
> >             and ztitles.muzenbr=zsong.muzenbr
> >   );
> >
> > this should hint the current optimizer to do the right thing;
> >
> > -----------------
> > Hannu

did you have indexes on both ztitles.muzenbr and zsong.muzenbr ?

--------------
Hannu


Re: SQL 'in' vs join.

От
Don Baccus
Дата:
At 08:37 AM 11/30/00 -0500, mlw wrote:
>> mlw wrote:
>> >
>> > Why is a "select * from table1 where field in (select field from table2
>> > where condition )"
>> >
>> > is so dramatically bad compared to:
>> >
>> > "select * from table1, table2 where table1.field = table2.field and
>> > condition"

>Now, given the two components, each with very low costs, it chooses to
>do a sequential scan on the table. I don't get it. I have have been
>having no end of problems with Postgres' optimizer. It just seems to be
>brain dead at times. It is a huge point of frustration to me. I am tied
>to postgres in my current project, and I fear that I will not be able to
>implement certain features because of this sort of behavior.

But but but ...

Not only is the join faster, but it is more readable and cleaner SQL as
well.  I would never write the query in its first form.  I'd change the
second one slightly to "select table1.* from ...", though, since those
are apparently the only fields you want.

The optimizer should do a better job on your first query, sure, but why
don't you like writing joins?




- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


Re: SQL 'in' vs join.

От
Tom Lane
Дата:
Don Baccus <dhogaza@pacifier.com> writes:
> The optimizer should do a better job on your first query, sure, but why
> don't you like writing joins?

The join wouldn't give quite the same answers.  If there are multiple
rows in table2 matching a particular table1 row, then a join would give
multiple copies of the table1 row, whereas the WHERE foo IN (sub-select)
way would give only one copy.  SELECT DISTINCT can't be used to fix
this, because that would eliminate legitimate duplicates from identical
table1 rows.

Now that the executor understands about multiple join rules (for
OUTER JOIN support), I've been thinking about inventing a new join rule
that says "at most one output row per left-hand row" --- this'd be sort
of the opposite of the LEFT OUTER JOIN rule, "at least one output row
per left-hand row" --- and then transforming IN (sub-select) clauses 
that appear at the top level of WHERE into this kind of join.  Won't
happen for 7.1, though.
        regards, tom lane


Re: SQL 'in' vs join.

От
Don Baccus
Дата:
At 10:52 AM 11/30/00 -0500, Tom Lane wrote:
>Don Baccus <dhogaza@pacifier.com> writes:
>> The optimizer should do a better job on your first query, sure, but why
>> don't you like writing joins?
>
>The join wouldn't give quite the same answers.  If there are multiple
>rows in table2 matching a particular table1 row, then a join would give
>multiple copies of the table1 row, whereas the WHERE foo IN (sub-select)
>way would give only one copy.  SELECT DISTINCT can't be used to fix
>this, because that would eliminate legitimate duplicates from identical
>table1 rows.

Hmmm...I was presuming that "field" was a primary key of table1, so
such duplicates wouldn't exist (and SELECT DISTINCT would weed out
duplicates from table2 if "field" isn't a primary key of table2, i.e.
if table2 has a many-to-one relationship to table1).  For many-to-many
relationships yes, you're right, the "in" version returns a different
result.

>Now that the executor understands about multiple join rules (for
>OUTER JOIN support), I've been thinking about inventing a new join rule
>that says "at most one output row per left-hand row" --- this'd be sort
>of the opposite of the LEFT OUTER JOIN rule, "at least one output row
>per left-hand row" --- and then transforming IN (sub-select) clauses 
>that appear at the top level of WHERE into this kind of join.  Won't
>happen for 7.1, though.

Same trick could be used for some classes of queries which do a SELECT DISTINCT
on the results of a join, too ...



- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


RE: SQL 'in' vs join.

От
"Andrew Snow"
Дата:
> Now, given the two components, each with very low costs, it chooses to
> do a sequential scan on the table. I don't get it. 


Read the FAQ?

http://www.postgresql.org/docs/faq-english.html#4.23
"4.23) Why are my subqueries using IN so slow?")


- Andrew




Re: SQL 'in' vs join.

От
Tom Lane
Дата:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Now that the executor understands about multiple join rules (for
>> OUTER JOIN support), I've been thinking about inventing a new join rule
>> that says "at most one output row per left-hand row" --- this'd be sort
>> of the opposite of the LEFT OUTER JOIN rule, "at least one output row
>> per left-hand row" --- and then transforming IN (sub-select) clauses 
>> that appear at the top level of WHERE into this kind of join.  Won't
>> happen for 7.1, though.

> Of course, we will have the query tree redesign for 7.2, right, make
> that unnecessary.

No, I see that as part of the query tree redesign.  You'd still need
executor support as above, but what remains to be seen is how hard is it
for the planner to do the transformation I so blithely posited ... and
do we need to change the querytree structure to make it easier?
        regards, tom lane


Re: SQL 'in' vs join.

От
Bruce Momjian
Дата:
> Don Baccus <dhogaza@pacifier.com> writes:
> > The optimizer should do a better job on your first query, sure, but why
> > don't you like writing joins?
> 
> The join wouldn't give quite the same answers.  If there are multiple
> rows in table2 matching a particular table1 row, then a join would give
> multiple copies of the table1 row, whereas the WHERE foo IN (sub-select)
> way would give only one copy.  SELECT DISTINCT can't be used to fix
> this, because that would eliminate legitimate duplicates from identical
> table1 rows.
> 
> Now that the executor understands about multiple join rules (for
> OUTER JOIN support), I've been thinking about inventing a new join rule
> that says "at most one output row per left-hand row" --- this'd be sort
> of the opposite of the LEFT OUTER JOIN rule, "at least one output row
> per left-hand row" --- and then transforming IN (sub-select) clauses 
> that appear at the top level of WHERE into this kind of join.  Won't
> happen for 7.1, though.

Of course, we will have the query tree redesign for 7.2, right, make
that unnecessary.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026