Обсуждение: Unable to get acceptable performance from EXCEPT
=# select count(*) from ref_old;count -------10595 (1 row) =# select count(*) from ref_new;count -------22997 (1 row) =# select ref_id from ref_old except select ref_id from ref_new; Takes over 10 minutes, probably closer to half an hour. I've also tried using 'NOT IN ( select ref_id from ref_new )' ref_id is an int4, this is on Postgresql 7.0. This confuses me because the way I'd plan to execute this query would be something like this: (pseudo code) result retval; sort(ref_old); sort(ref_new); i = k = 0; while (i < count(ref_old)) {while(ref_old[i] > ref_new[k]) k++;while(ref_old[i] == ref_new[k]) i++;while(ref_old[i]< ref_new[k]) store(&retval, ref_old[i++]); } return (retval); I can't imagine this algorithm would take over 10 minutes on my hardware. Can anyone shed some light on what's going on here? Is there a way to formulate my SQL to get Postgresql to follow this algorithm? thanks, -- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] "I have the heart of a child; I keep it in a jar on my desk."
Alfred Perlstein <bright@wintelcom.net> writes: > =# select ref_id from ref_old except select ref_id from ref_new; > Takes over 10 minutes, probably closer to half an hour. > I've also tried using 'NOT IN ( select ref_id from ref_new )' Yup. EXCEPT is effectively translated to a NOT IN, if I recall correctly, and neither IN ( sub-select ) nor NOT IN ( sub-select ) are implemented very efficiently. Basically you get O(N^2) behavior because the inner select is rescanned for each outer tuple. We have a TODO list item to try to be smarter about this... > Is there a way to formulate my SQL to get Postgresql to follow > this algorithm [ kind of like a mergejoin ] No, but you could try select ref_id from ref_old where not exists (select ref_id from ref_new where ref_id = ref_old.ref_id); which would at least be smart enough to consider using an index on ref_new(ref_id) instead of a sequential scan. regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [000510 16:22] wrote: > Alfred Perlstein <bright@wintelcom.net> writes: > > =# select ref_id from ref_old except select ref_id from ref_new; > > Takes over 10 minutes, probably closer to half an hour. > > I've also tried using 'NOT IN ( select ref_id from ref_new )' > > Yup. EXCEPT is effectively translated to a NOT IN, if I recall > correctly, and neither IN ( sub-select ) nor NOT IN ( sub-select ) > are implemented very efficiently. Basically you get O(N^2) behavior > because the inner select is rescanned for each outer tuple. > > We have a TODO list item to try to be smarter about this... > > > Is there a way to formulate my SQL to get Postgresql to follow > > this algorithm [ kind of like a mergejoin ] > > No, but you could try > > select ref_id from ref_old where not exists > (select ref_id from ref_new where ref_id = ref_old.ref_id); > > which would at least be smart enough to consider using an index > on ref_new(ref_id) instead of a sequential scan. Which cuts the query time down to less than a second! thanks! Ready for the evil magic? select distinct(o.ref_id) from ref_link o where o.stat_date < '2000-04-26 12:12:41-07' AND not exists ( select n.ref_id from ref_link n where n.stat_date >= '2000-04-26 12:12:41-07' AND n.ref_id = o.ref_id ) ; Thanks a ton. -- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] "I have the heart of a child; I keep it in a jar on my desk."