Обсуждение: JOINs ... how I hate them ...
Okay, went back through teh archives, as I know that Tom provided a solution for this before, and found it at: http://archives.postgresql.org/pgsql-sql/2001-06/msg00329.php Plain and simple ... makes perfect sense ... doesn't work in v7.2, or, at least, not as I'm expecting it to ... I've broken what I'm trying to do down the the *basest* component I can: explain SELECT p.uid, p.handle FROM gender_f pgf JOIN profiles p ON (pgf.uid = p.uid) ; Which explains out as: Hash Join (cost=1354.80..45297.83 rows=75664 width=27) -> Seq Scan on profiles p (cost=0.00..35707.69 rows=485969 width=19)-> Hash (cost=1165.64..1165.64 rows=75664 width=8) -> Seq Scan on gender_f pgf (cost=0.00..1165.64 rows=75664width=8) Now, profiles has uid as its primary KEY, and there are no duplicates in gender_f ... so, as my HashJoin points out, I should have 75664 results returned ... that is expected ... and the SeqScan on gender_f is expected ... but the SeqScan on profiles is what I would hope to get rid of ... get uid from gender_f, find corresponding entry in profiles ... its only ever goign to pull out 75664 out of 485969 records from profiles, so why would it seqscan *through* profiles for each and every UID? Now, if I go to the next level that I'm trying to pull together: explain SELECT p.uid, p.handle FROM ( orientation_c poc JOIN gender_f pgf USING ( uid ) ) JOIN profiles p ON (pgf.uid =p.uid) ; It still explains, what I think, is wrong: Hash Join (cost=6023.92..47537.10 rows=75664 width=35) -> Seq Scan on profiles p (cost=0.00..35707.69 rows=485969 width=19)-> Hash (cost=5834.76..5834.76 rows=75664 width=16) -> Merge Join (cost=0.00..5834.76 rows=75664 width=16) -> Index Scan using poc_uid on orientation_c poc (cost=0.00..2807.82 rows=126477 width=8) -> Index Scan using pgf_uid on gender_f pgf (cost=0.00..1575.79 rows=75664 width=8) The MergeJoin between poc/pgf will only return 12000 records, and since it is a 1:1 relationship between each of those tables, there will *only* be 12000 records pulled from profiles ... yet its doing a SeqScan through all 485k records for each of those UIDs? This is after I've performed a VACUUM ANALYZE ... The final query itself is: SELECT p.uid, p.profiles_handle FROM ( ( profiles_orientation_c poc JOIN profiles_gender_f pgf USING ( uid ) ) JOIN iwantu_profilesp USING (uid ) ) LEFT JOIN iwantu_last_login ll USING ( uid ); Which explains as: Hash Join (cost=31636.40..78239.34 rows=75664 width=43) -> Hash Join (cost=6023.92..47537.10 rows=75664 width=35) -> Seq Scan on iwantu_profiles p (cost=0.00..35707.69 rows=485969 width=19) -> Hash (cost=5834.76..5834.76 rows=75664width=16) -> Merge Join (cost=0.00..5834.76 rows=75664 width=16) -> Index Scanusing poc_uid on profiles_orientation_c poc (cost=0.00..2807.82 rows=126477 width=8) -> Index Scanusing pgf_uid on profiles_gender_f pgf (cost=0.00..1575.79 rows=75664 width=8) -> Hash (cost=7955.64..7955.64 rows=485964width=8) -> Seq Scan on iwantu_last_login ll (cost=0.00..7955.64 rows=485964 width=8) EXPLAIN So, poc&pgf are MergeJoin's, leaving me with 12000 records again ... then there is the SeqScan/HashJoin wiht profiles, which will leave me with 12000 records, but with more information ... but, again, for each of *those* 12000 records, its doing a SeqScan on last_login's 485k records, instead of using the index ... again, like pgf and poc, there is only one record for every uid, so we aren't dealing with duplicates ... Now, if I 'set enable_seqscan=false;' and do the exact same explain, it definitely comes more in line with what I'd like to see, as far as index usage is concerned: Nested Loop (cost=0.00..546759.46 rows=75664 width=43) -> Nested Loop (cost=0.00..272274.75 rows=75664 width=35) -> Merge Join (cost=0.00..5834.76 rows=75664 width=16) -> Index Scan using poc_uid on profiles_orientation_cpoc (cost=0.00..2807.82 rows=126477 width=8) -> Index Scan using pgf_uid on profiles_gender_fpgf (cost=0.00..1575.79 rows=75664 width=8) -> Index Scan using iwantu_profiles_uid on iwantu_profilesp (cost=0.00..3.51 rows=1 width=19) -> Index Scan using ill_uid on iwantu_last_login ll (cost=0.00..3.62rows=1 width=8)
On Thu, 2002-02-07 at 16:23, Marc G. Fournier wrote: > > > > The MergeJoin between poc/pgf will only return 12000 records, and since it > is a 1:1 relationship between each of those tables, there will *only* be > 12000 records pulled from profiles ... yet its doing a SeqScan through all > 485k records for each of those UIDs? > > This is after I've performed a VACUUM ANALYZE ... > > The final query itself is: > > SELECT p.uid, p.profiles_handle > FROM ( ( profiles_orientation_c poc JOIN profiles_gender_f pgf USING ( uid ) )> JOIN iwantu_profiles pUSING (uid ) ) LEFT JOIN iwantu_last_login ll USING ( uid ); > > Which explains as: > > Hash Join (cost=31636.40..78239.34 rows=75664 width=43) > -> Hash Join (cost=6023.92..47537.10 rows=75664 width=35) > -> Seq Scan on iwantu_profiles p (cost=0.00..35707.69 rows=485969 width=19) > -> Hash (cost=5834.76..5834.76 rows=75664 width=16) > -> Merge Join (cost=0.00..5834.76 rows=75664 width=16) > -> Index Scan using poc_uid on profiles_orientation_c poc (cost=0.00..2807.82 rows=126477 width=8) > -> Index Scan using pgf_uid on profiles_gender_f pgf (cost=0.00..1575.79 rows=75664 width=8) > -> Hash (cost=7955.64..7955.64 rows=485964 width=8) > -> Seq Scan on iwantu_last_login ll (cost=0.00..7955.64 rows=485964 width=8) > > EXPLAIN > > So, poc&pgf are MergeJoin's, leaving me with 12000 records again ... then > there is the SeqScan/HashJoin wiht profiles, which will leave me with > 12000 records, but with more information ... but, again, for each of > *those* 12000 records, its doing a SeqScan on last_login's 485k records, > instead of using the index ... again, like pgf and poc, there is only one > record for every uid, so we aren't dealing with duplicates ... I recently sped up a somewhat similar query from 15 sec to < 1 sec by rewriting it to use a subselect: SELECT p.uid, p.profiles_handle FROM profiles_orientation_c poc, profiles_gender_f pgf (select uid, profiles_handle from iwantu_profiles ip where ip.uid = pgf.uid ) p WHERE poc.uid = pgf.uid If you need something from iwantu_last_login it should go into that subselect as well That tricked my case to do the small join first. ----------------- Hannu
In Rdb (for instance) you can edit the plan if you want. (Oracle too, IIRC -- but I never have edited a plan in Oracle) Sure, it opens a big can of worms, but it would be nice for someone technically inclined to be able to fix a plan if they know better than the SQL compiler did.
"Marc G. Fournier" <scrappy@hub.org> writes: > explain SELECT p.uid, p.handle > FROM gender_f pgf JOIN profiles p ON (pgf.uid = p.uid) ; > Which explains out as: > Hash Join (cost=1354.80..45297.83 rows=75664 width=27) > -> Seq Scan on profiles p (cost=0.00..35707.69 rows=485969 width=19) > -> Hash (cost=1165.64..1165.64 rows=75664 width=8) > -> Seq Scan on gender_f pgf (cost=0.00..1165.64 rows=75664 width=8) > Now, profiles has uid as its primary KEY, and there are no > duplicates in gender_f ... so, as my HashJoin points out, I should have 75664 > results returned ... that is expected ... and the SeqScan on gender_f is > expected ... but the SeqScan on profiles is what I would hope to get rid > of ... Um, why? Looks like a perfectly reasonable plan to me. > get uid from gender_f, find corresponding entry in profiles ... I'm not convinced that 75000 indexscan probes would be faster than a sequential scan across that table. You could probably force the issue with "set enable_hashjoin to off" (and maybe also "set enable_mergejoin to off") and then see what the plan is and what the actual timing is. (EXPLAIN ANALYZE should be real helpful here.) > explain SELECT p.uid, p.handle > FROM ( orientation_c poc JOIN gender_f pgf USING ( uid ) ) JOIN profiles p ON (pgf.uid = p.uid) ; > Hash Join (cost=6023.92..47537.10 rows=75664 width=35) > -> Seq Scan on profiles p (cost=0.00..35707.69 rows=485969 width=19) > -> Hash (cost=5834.76..5834.76 rows=75664 width=16) > -> Merge Join (cost=0.00..5834.76 rows=75664 width=16) > -> Index Scan using poc_uid on orientation_c poc (cost=0.00..2807.82 rows=126477 width=8) > -> Index Scan using pgf_uid on gender_f pgf (cost=0.00..1575.79 rows=75664 width=8) > The MergeJoin between poc/pgf will only return 12000 records, and since it > is a 1:1 relationship between each of those tables, there will *only* be > 12000 records pulled from profiles ... Hmm, it thinks that there will be 75664 not 12000 records out of that join. Why the discrepancy? Could we see the pg_stats data for these tables? > ... but, again, for each of > *those* 12000 records, its doing a SeqScan on last_login's 485k records, > instead of using the index No, certainly *not* "for each record". It's a hash join, so it only reads each table once. > Now, if I 'set enable_seqscan=false;' and do the exact same explain, it > definitely comes more in line with what I'd like to see, as far as index > usage is concerned: And what's the actual runtime come out to be? We definitely should standardize on asking for EXPLAIN ANALYZE results in bad-plan discussions, now that we have that capability. regards, tom lane