Discussion: Improving NOT IN
It's a fairly common case to want to improve a query along the lines of TableA intersect ~TableB. We can write this as

  select * from tableA where key not in (select key from tableB);

or we can get more fancy:

  select tableA.*
  from tableA left outer join tableB on tableA.key = tableB.key
  where tableB.key is null;

I've worked out a new join method that will improve performance over and above the second case. This is effective where the referenced table (tableB) is fairly large and the join columns are discrete; currently that mostly means they're integers. The plan seeks to *prove* that there are no matches, rather than taking the exhaustive join approach taken currently.

First we need to show that the referenced table's PK values are a fully continuous sequence of integers with no gaps. Once this has been proved, we can then use that fact to scan the FK table using the values of the min and max PK to see if any outliers exist. There is no actual comparison of the values, just a proof that none is required.

I'll describe this using SQL statements, which execute as SeqScans of the PK and then FK tables. There is no preparatory step - no building a sort table or preparing a hash table - so these SQL statements always execute faster than the fastest current plan. Most importantly, there is no step that consumes large amounts of memory, so the case where two tables are very large performs much, much better.

1. Scan referenced table
   a) select max(aid), min(aid) from accounts;
   b) select count(*) from accounts;
   Sometimes this is faster as two queries when the table has a PK.

2. Decision step
   If max - min + 1 == count then we have a contiguous range, and because we know we have a discrete datatype we can now *prove* that there are no missing values from the set bounded by the min and the max. We can then use that directly in a new query:

3. a) Scan referencing table
      select aid from history where aid > ? or aid < ?;
      using parameters of max and min from step 1
   b) normal query

Steps 1 & 2 can fail to find a contiguous range, in which case we need to fall back to an existing query plan. So there is only a small overhead in the case where we run the first query, fail to use the optimisation at all, and need to fall back to an existing query. We can estimate whether this will happen by estimating the row count of the table and seeing if that compares favourably with the expected number of values if the whole min-max range is actually present. The min/max query uses the Primary Key index (which must always be present) so takes very little time. So overall this looks like a win in certain common cases, and not a particular loss in any case.

Try this SQL:

1. select max(aid), min(aid) from accounts;
2. select count(*) from accounts;
3. select aid from history
   where aid > (select max(aid) from accounts)
      or aid < (select min(aid) from accounts)
   limit 1;

against

  alter table history add foreign key (aid) references accounts;

I get (1) about 0.2 secs, (2) 6 secs, (3) 9 secs, against 27 secs for the ALTER TABLE, using work_mem = 64MB and data that fits in memory.

We could implement the new SQL directly within ALTER TABLE, or we could actually create this as a new plan type that would then allow the existing SQL to perform better in specific cases. I've not seen such a plan on any other RDBMS and think it might be completely new; I'm calling it a Proof Join, for want of a better description. The preparatory steps are completely excluded, hence the x2 speedup. For larger referenced tables the performance improvement could be much more.

ISTM that even though this is a special case it is actually a common one, so would be worth optimising for. Ideas stage at the moment: thoughts?

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
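The three steps above can be sketched end to end. This is an illustrative sketch only: it uses Python's sqlite3 with an in-memory database standing in for PostgreSQL, and the `accounts`/`history`/`aid` names from the example; the contiguity test is max - min + 1 == count for a gap-free integer range.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE accounts (aid INTEGER PRIMARY KEY);
    CREATE TABLE history  (aid INTEGER);
""")
# Referenced table: a fully contiguous PK range 1..1000.
con.executemany("INSERT INTO accounts VALUES (?)", [(i,) for i in range(1, 1001)])
# Referencing table: two in-range values and one outlier.
con.executemany("INSERT INTO history VALUES (?)", [(7,), (500,), (1001,)])

# Step 1: scan the referenced table for min, max and count.
mx, mn = con.execute("SELECT max(aid), min(aid) FROM accounts").fetchone()
cnt, = con.execute("SELECT count(*) FROM accounts").fetchone()

# Step 2: decision -- a contiguous integer range satisfies max - min + 1 == count,
# which *proves* no value between min and max is missing.
if mx - mn + 1 == cnt:
    # Step 3a: only values outside [min, max] can violate the FK.
    outliers = con.execute(
        "SELECT aid FROM history WHERE aid > ? OR aid < ?", (mx, mn)).fetchall()
else:
    # Step 3b: proof failed; fall back to the exhaustive plan.
    outliers = con.execute(
        "SELECT aid FROM history WHERE aid NOT IN (SELECT aid FROM accounts)").fetchall()

print(outliers)  # [(1001,)] -- found without comparing any in-range value
```

No in-range value is ever compared against the referenced table; the proof makes the comparison unnecessary, which is the point of the sketch.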
"Simon Riggs" <simon@2ndquadrant.com> writes: > First we need to show that the referenced table's PK values are a fully > continuous sequence of integers with no gaps. Since that is unlikely to be the case, I can't see that this is worth implementing... > I'll describe this using SQL statements, which execute as SeqScans of > the PK and then FK tables. There is no preparatory step - no building a > sort table or preparing a hash table, so these SQL statements always > execute faster than the fastest current plan. Except that when you fail to prove it, as you usually will, you have wasted a complete seqscan of the table, and still have to fall back on a regular plan. If the thing were amenable to falling out fairly quickly on proof failure, it would be better, but AFAICS you don't know anything until you've completed the whole scan. I think the NOT IN optimization that *would* be of use is to automatically transform the NOT IN representation to an outer-join-with-null-test type of operation, so as to give us a wider choice of join methods. However, I'm not sure about correct handling of NULLs on the RHS in such a scenario. The existing hashed-IN code has to jump through some really ugly hoops to give spec-compliant answers with NULLs. BTW, your sketch fails in the presence of NULLs on the RHS ... regards, tom lane
On Tue, 2007-01-30 at 17:34 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > First we need to show that the referenced table's PK values are a fully
> > continuous sequence of integers with no gaps.
>
> Since that is unlikely to be the case, I can't see that this is worth
> implementing...

Integers are typically used as keys...

> > I'll describe this using SQL statements, which execute as SeqScans of
> > the PK and then FK tables. There is no preparatory step - no building a
> > sort table or preparing a hash table, so these SQL statements always
> > execute faster than the fastest current plan.
>
> Except that when you fail to prove it, as you usually will, you have
> wasted a complete seqscan of the table, and still have to fall back on
> a regular plan. If the thing were amenable to falling out fairly
> quickly on proof failure, it would be better, but AFAICS you don't know
> anything until you've completed the whole scan.

Have some faith, please. It's fairly straightforward to estimate whether the number of rows is approximately right to make the scan worthwhile. On large queries it seems worth the risk; we might even store the answer as part of stats, so we'd know not to bother with the test in future.

> BTW, your sketch fails in the presence of NULLs on the RHS ...

It certainly does, but the typical query has a PK there, so no NULLs. One of the main use cases is the ALTER TABLE ... ADD FK case. As I said, we could just code that with altered SQL, or we could add a new plan. Anyway, it seemed like the right time to log the thought.

> I think the NOT IN optimization that *would* be of use is to
> automatically transform the NOT IN representation to an
> outer-join-with-null-test type of operation, so as to give us a wider
> choice of join methods. However, I'm not sure about correct handling
> of NULLs on the RHS in such a scenario. The existing hashed-IN code
> has to jump through some really ugly hoops to give spec-compliant
> answers with NULLs.

Yeah, NOT IN with NULLs is ... bizarre. What would be wrong with checking for a NOT NULL constraint? That's how other planners cope with it. Or are you thinking about the lack of plan invalidation?

ISTM straightforward to do a search for an ANDed set of IS NOT NULL constraints. I've not found another server that does that, even though it seems like a straightforward win. Let me think on that.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Tue, 2007-01-30 at 17:34 -0500, Tom Lane wrote: >> Since that is unlikely to be the case, I can't see that this is worth >> implementing... > Integers are typically used as keys... Yeah, in the form of sequences, so you have a hole for every failed insert. If the key isn't coming from a sequence then there's still not any very good reason to suppose it's exactly contiguous. People do delete entries. > What would be wrong with checking for a NOT NULL constraint? Thats how > other planners cope with it. Or are you thinking about lack of plan > invalidation? Yup, without that, depending on constraints for plan correctness is pretty risky. Basically what I see here is a whole lot of work and new executor infrastructure for something that will be a win in a very narrow use-case and a significant loss the rest of the time. I think there are more productive ways to spend our development effort. regards, tom lane
On Tue, 2007-01-30 at 18:06 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > What would be wrong with checking for a NOT NULL constraint? Thats how
> > other planners cope with it. Or are you thinking about lack of plan
> > invalidation?
>
> Yup, without that, depending on constraints for plan correctness is
> pretty risky.
>
> Basically what I see here is a whole lot of work and new executor
> infrastructure for something that will be a win in a very narrow
> use-case and a significant loss the rest of the time. I think there
> are more productive ways to spend our development effort.

For that part of the email, I was talking about your ideas on NOT IN. Checking for the explicit exclusion of NULLs is worthwhile with or without plan invalidation.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
--On Tuesday, January 30, 2007 23:24:40 +0000 Simon Riggs <simon@2ndquadrant.com> wrote:

>> Basically what I see here is a whole lot of work and new executor
>> infrastructure for something that will be a win in a very narrow
>> use-case and a significant loss the rest of the time. I think there
>> are more productive ways to spend our development effort.

Maybe one should not aim for the special case of continuous sequences. It might be better to have a fast look-up data structure for row existence. The advantage over the usual indexes is that only existence, and no other information, must be saved, so a bit-field is really possible. Even 100 million rows would fit in about 12 MB.

So, instead of trying to find a sequence, find (or guess, and later correct your bitfield) the minimum, and then set the bits as you encounter rows. During the join, test whether the bit you want to join to exists and, voila, depending on whether you used IN or NOT IN, decide what to do. This data structure could be used everywhere that only existence is important and no columns of the table are selected.

Possibly the bit-field should allow large gaps to be represented more efficiently: if you have a 32-bit index column, make a 256-entry top-level array pointing to bitfields representing the ranges 0x00000000-0x00ffffff, 0x01000000-0x01ffffff, ... Each such bitfield needs 2 MB; the pointers are negligible. Now large holes in the sequence don't waste too much space, and the minimum need not be known.

Regards,
Jens Schicke
-- 
Jens Schicke		j.schicke@asco.de
asco GmbH		http://www.asco.de
Mittelweg 7		Tel 0531/3906-127
38106 Braunschweig	Fax 0531/3906-400
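Jens's two-level layout - a 256-entry top-level directory over lazily allocated 2 MB bitfields, each covering 2^24 keys - can be sketched in a few lines. This is an illustrative sketch only; the class name and API are hypothetical, not anything in PostgreSQL:

```python
class ExistenceBitmap:
    """Sparse two-level bit-field over a 32-bit key space. A 256-entry
    top-level directory points at chunks covering 2**24 keys each; a chunk
    (2 MB) is only allocated when a key in its range is first seen, so
    large holes in the key space cost nothing."""
    CHUNK_BITS = 24                       # each chunk covers 0x01000000 keys
    CHUNK_BYTES = (1 << CHUNK_BITS) // 8  # 2 MB per allocated chunk

    def __init__(self):
        self.chunks = [None] * 256        # top-level directory

    def add(self, key):
        hi, lo = key >> self.CHUNK_BITS, key & ((1 << self.CHUNK_BITS) - 1)
        if self.chunks[hi] is None:
            self.chunks[hi] = bytearray(self.CHUNK_BYTES)
        self.chunks[hi][lo >> 3] |= 1 << (lo & 7)

    def __contains__(self, key):
        hi, lo = key >> self.CHUNK_BITS, key & ((1 << self.CHUNK_BITS) - 1)
        chunk = self.chunks[hi]
        return chunk is not None and bool(chunk[lo >> 3] & (1 << (lo & 7)))

# Build the structure from the referenced table's keys, then probe it
# during the join instead of comparing values.
bm = ExistenceBitmap()
for aid in (1, 2, 0x01000005, 0xFFFFFFFF):
    bm.add(aid)
print(3 in bm, 0x01000005 in bm)  # False True
```

Only the chunks actually touched are allocated, which is how large gaps are represented cheaply without knowing the minimum in advance.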
On Tue, 2007-01-30 at 17:34 -0500, Tom Lane wrote:
> I think the NOT IN optimization that *would* be of use is to
> automatically transform the NOT IN representation to an
> outer-join-with-null-test type of operation, so as to give us a wider
> choice of join methods. However, I'm not sure about correct handling
> of NULLs on the RHS in such a scenario. The existing hashed-IN code
> has to jump through some really ugly hoops to give spec-compliant
> answers with NULLs.

ISTM that we can handle this neatly by looking for a WHERE clause that specifically excludes NULLs in the NOT IN, i.e. a query of the form

  select ... from LHS
  where key NOT IN (select key from RHS where key is not null)

can be optimised to

  select ... from LHS left outer join RHS on LHS.key = RHS.key
  where RHS.key IS NULL;

This rewards people who understand the spec-compliant behaviour and ensure their coding is watertight in the presence of NULLs. We can extend that behaviour later, when we have plan invalidation, to also pick up NOT NULL constraints on the keys of the RHS table. Doing it this way round is much more useful, since not all tables have NOT NULL constraints on their join columns, so it is slightly wider in its usefulness than just checking constraints. It's also very annoying in this specific case to not have any way for the SQL developer to pass information to the planner - and I don't mean hints.

This would be similar to pull_up_IN_clauses().

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
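When the subquery explicitly excludes NULLs as above, the two forms really do agree, even with NULLs present in the RHS table. An illustrative check of the rewrite, using Python's sqlite3 as a stand-in and hypothetical `lhs`/`rhs` tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE lhs (key INTEGER);
    CREATE TABLE rhs (key INTEGER);
    INSERT INTO lhs VALUES (1), (2), (3);
    INSERT INTO rhs VALUES (1), (NULL);
""")

# Original form: the "key is not null" filter removes the RHS NULL,
# so NOT IN behaves like a plain anti-join.
not_in = con.execute(
    "SELECT key FROM lhs "
    "WHERE key NOT IN (SELECT key FROM rhs WHERE key IS NOT NULL)").fetchall()

# Rewritten form: anti-join via LEFT OUTER JOIN ... IS NULL.
rewritten = con.execute(
    "SELECT lhs.key FROM lhs LEFT OUTER JOIN rhs ON lhs.key = rhs.key "
    "WHERE rhs.key IS NULL").fetchall()

print(sorted(not_in) == sorted(rewritten))  # True
```

Without the IS NOT NULL filter the equivalence breaks (the NOT IN form would return no rows), which is why the rewrite keys on that clause rather than applying unconditionally.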