Discussion: statistics horribly broken for row-wise comparison
It looks like for row-wise comparison, only the first column is used for
generating the expected row count.  This can lead to bad plans in some cases.

Test case (takes seconds to minutes depending on hardware):

create table range as
  select v as id,
         v % 500 as key,
         now() + ((random() * 1000) || 'days')::interval as ts
  from generate_series(1,10000000) v;

create index range_idx on range(key, ts);

explain analyze
  select * from range
  where (key, ts) >= (222, '7/11/2009')
    and (key, ts) <= (222, '7/12/2009')
  order by key, ts;

Result (cut down a bit):

 Sort  (cost=469723.46..475876.12 rows=2461061 width=16)
       (actual time=0.054..0.056 rows=13 loops=1)
   Sort Key: key, ts
   Sort Method:  quicksort  Memory: 25kB

Note the expected vs. actual row count.  Varying the ts parameters changes
the expected rows, but varying the key does not.  For this test case the
returned plan is OK, but obviously the planner will freak out and drop to a
seq scan or other nefarious things depending on circumstances.

I confirmed this on 8.2 and HEAD (a month old or so).

merlin
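The size of the mis-estimate can be worked out from the EXPLAIN output above; a quick arithmetic sketch (just the numbers from the plan, not planner code):

```python
# Numbers taken from the EXPLAIN ANALYZE output above.
expected_rows = 2_461_061   # planner's row estimate
actual_rows = 13            # rows actually returned
total_rows = 10_000_000     # rows in the test table

implied_selectivity = expected_rows / total_rows   # what the planner assumed
actual_selectivity = actual_rows / total_rows      # what really qualified

print(f"planner selectivity: {implied_selectivity:.4f}")           # 0.2461
print(f"actual selectivity:  {actual_selectivity:.8f}")
print(f"overestimate factor: {expected_rows / actual_rows:.0f}x")  # 189312x
```

An estimate off by five orders of magnitude is more than enough to flip the planner's choice between an index scan and a seq scan.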
On Mon, Mar 2, 2009 at 4:43 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> It looks like for row-wise comparison, only the first column is used
> for generating the expected row count.  This can lead to bad plans in
> some cases.
>
> Test case (takes seconds to minutes depending on hardware):
>
> create table range as select v as id, v % 500 as key, now() +
> ((random() * 1000) || 'days')::interval as ts from
> generate_series(1,10000000) v;
>
> create index range_idx on range(key, ts);
>
> explain analyze select * from range where (key, ts) >= (222, '7/11/2009')
> and (key, ts) <= (222, '7/12/2009')
> order by key, ts;
>
> result (cut down a bit)
> Sort  (cost=469723.46..475876.12 rows=2461061 width=16) (actual
> time=0.054..0.056 rows=13 loops=1)
>   Sort Key: key, ts
>   Sort Method:  quicksort  Memory: 25kB
>
> note the row count expected vs. got.  Varying the ts parameters
> changes the expected rows, but varying the key does not.  Note for the

Oops, that's backwards: the expected rows depend on key (the first
column in the index) only.

merlin
Merlin Moncure <mmoncure@gmail.com> writes:
> It looks like for row-wise comparison, only the first column is used
> for generating the expected row count.

[ shrug... ]  Short of multi-column statistics, it's hard to see how to
do better.

			regards, tom lane
On Mon, Mar 2, 2009 at 8:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> It looks like for row-wise comparison, only the first column is used
>> for generating the expected row count.
>
> [ shrug... ]  Short of multi-column statistics, it's hard to see how to
> do better.

Hm... why can't you just multiply the range estimates for the fields
together when doing an operation over the key?  For example, in this case
if the planner estimates 10% of rows for key and 5% of matches for ts,
just multiply 0.1 by 0.05 and get 0.005 when you fall into the row
operation case.  This would give a reasonably accurate answer... formally
correct, even.  All the information is there, or am I missing something
(not knowing all the inner workings of the planner, I certainly might be)?

IOW, I don't see this as a 'not enough statistics' problem, more of a
'looking at the statistics wrong for a multi-column index range operation'
problem.  Equality works correctly, as it always has.  This is a kind of
stats loophole introduced when we got the ability to correctly do these
types of operations in 8.2.

There's no workaround that I see to this problem short of disabling
seq_scan.  The classic form of this query when looking for only one 'key'
(my problem case):

select * from range where key = x and ts between a and b;

usually gives a plain index scan, which can be really undesirable.

merlin
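The proposal above is the textbook independence assumption for combining per-column selectivities; a minimal sketch using the hypothetical 10%/5% figures from the message:

```python
# Hypothetical per-column selectivity estimates from the message above.
sel_key = 0.10   # fraction of rows the planner expects to match on key
sel_ts = 0.05    # fraction of rows the planner expects to match on ts

# Under an independence assumption, the combined selectivity of
# "key-condition AND ts-condition" is simply the product.
combined = sel_key * sel_ts

print(combined)  # 0.005, i.e. 0.5% of the table
```

As the follow-up message notes, the product is only "formally correct" when the two columns really are statistically independent; for correlated columns it can under- or over-estimate badly.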
Merlin Moncure <mmoncure@gmail.com> writes:
> On Mon, Mar 2, 2009 at 8:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Merlin Moncure <mmoncure@gmail.com> writes:
>>> It looks like for row-wise comparison, only the first column is used
>>> for generating the expected row count.
>>
>> [ shrug... ]  Short of multi-column statistics, it's hard to see how to
>> do better.

> hm... Why can't you just multiply the range estimates for the fields
> together when doing an operation over the key?

Because it would be the wrong answer, except in the uncommon case where
the field values are completely independent (at least, I would expect
that to be uncommon when people have multicolumn indexes on them).

In the case at hand, I think you're barking up the wrong tree anyway.
It's much more likely that what we need to do is fix
clauselist_selectivity to recognize range conditions involving
RowCompareExprs.

			regards, tom lane
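For context on the suggested fix: for ordinary scalar clauses, clauselist_selectivity already pairs up `col > a AND col < b` into a single range estimate rather than multiplying the two inequalities. A simplified sketch of that pairing idea (not PostgreSQL's actual code; the 0.005 floor mirrors its DEFAULT_RANGE_INEQ_SEL constant):

```python
def range_selectivity(sel_gt: float, sel_lt: float) -> float:
    """Combine the selectivities of 'col > a' (sel_gt) and 'col < b'
    (sel_lt) into one estimate for 'a < col < b'.  Since the two
    conditions overlap, P(a < col < b) = sel_gt + sel_lt - 1, clamped
    to a default when the bounds appear to cross."""
    DEFAULT_RANGE_INEQ_SEL = 0.005  # fallback for an implausible range
    s = sel_gt + sel_lt - 1.0
    return s if s > 0 else DEFAULT_RANGE_INEQ_SEL

# Example: 'col > a' passes 30% of rows and 'col < b' passes 80%,
# so the window between a and b covers about 10% of rows.
print(range_selectivity(0.30, 0.80))
```

The bug report above amounts to this pairing logic not firing when the inequalities are RowCompareExprs instead of plain column comparisons, so each row comparison is estimated from its leading column alone.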
On Tue, Mar 3, 2009 at 3:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Because it would be the wrong answer, except in the uncommon case where
> the field values are completely independent (at least, I would expect
> that to be uncommon when people have multicolumn indexes on them).

Actually, I think it's *more* likely to be true for multicolumn indexes.
If the two columns were correlated then you wouldn't need the multicolumn
index, since you could just use one of the columns alone.  The case where
people create multicolumn btree indexes is usually when they have a
"major"-"minor" relationship between the two columns, so for each value
of the major key there is a whole range of minor keys.

Sure, multi-column statistics would be nice though.

-- 
greg
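The "major"-"minor" argument can be checked numerically with synthetic data (a sketch with made-up numbers, not from the thread): when every (major, minor) combination is equally represented, the columns are independent and the product rule lands on the true count exactly.

```python
# Synthetic "major"/"minor" table: every (key, minor) combination appears
# exactly once, so the two columns are fully independent by construction.
N_KEYS, N_MINOR = 500, 1000
rows = [(i % N_KEYS, i // N_KEYS) for i in range(N_KEYS * N_MINOR)]

# Predicate in the spirit of the thread: key = 222 AND minor in [100, 200).
actual = sum(1 for k, m in rows if k == 222 and 100 <= m < 200)

# Per-column selectivities, combined with the independence assumption.
sel_key = sum(1 for k, _ in rows if k == 222) / len(rows)       # 1/500
sel_minor = sum(1 for _, m in rows if 100 <= m < 200) / len(rows)  # 1/10
estimate = sel_key * sel_minor * len(rows)

print(actual, round(estimate))  # both 100: the product rule is exact here
```

With correlated columns (say, minor derived from key) the same product can be arbitrarily far off, which is the caveat raised earlier in the thread.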