Обсуждение: Index correlation versus multi-column indexes
I looked into Maxim Boguk's complaint here: http://archives.postgresql.org/pgsql-general/2009-02/msg01226.php in which the planner preferred to use an index despite the column being searched on being a lower-order column in that index. It turns out that the reason the planner is preferring the "wrong" index is that that index has a very high indexCorrelation score, evidently because its first column is well correlated with the table ordering. This causes cost_index to compute very low estimated heap access costs, outweighing the increased index access costs due to the index's relatively poor match to the query. Now we already knew that btcostestimate's estimate of index correlation was pretty bogus for multicolumn indexes. However, I now realize that there's another issue here as well, which would apply even if the index ordering correlation estimate were perfect. If you look closely at what cost_index is doing with the number, you'll realize that it is effectively assuming that high index correlation means that an indexscan returns TIDs that are adjacent or nearly so in the heap. Even given a perfect match of index and heap order, this fails to hold when the indexscan quals contain constraints on lower-order index columns, because we'll be skipping sections of the index in such cases. So apparently we need to rethink this, and derate the correlation effect somehow when there are constraints on non-first columns. I'm not entirely sure what the model ought to be. Thoughts? regards, tom lane
On Fri, 2009-02-27 at 13:25 -0500, Tom Lane wrote: > So apparently we need to rethink this, and derate the correlation effect > somehow when there are constraints on non-first columns. I'm not > entirely sure what the model ought to be. Thoughts? This seems similar to the problem of estimating correlation for a GiST index (as I recall you mentioned before that we should be tracking correlation per-index rather than per-attribute). Unless we get significantly smarter about what "correlation" means, I think its only purpose is for very simple range scans. And, as you point out, a selective predicate on a non-first attribute means that it's not really a range scan. I don't see an easy solution to this other than just saying that a predicate on a second attribute is not a range scan at all, unless the predicate is not very selective. Regards,Jeff Davis