Обсуждение: Stats for multi-column indexes
I know the idea has come up a few times to do cross-column statistics to improve plans when the data distributions are dependent. I found a couple references in the archives: http://archives.postgresql.org/pgsql-hackers/2006-09/msg02118.php http://archives.postgresql.org/pgsql-hackers/2006-08/msg00924.php We can already keep stats for a functional index. Is there a reason we can't keep stats for a multi-column index? Out of the two situations outlined by Jim Nasby here: http://archives.postgresql.org/pgsql-hackers/2006-08/msg00948.php It would not help with joins, but it would help with columns that are referred to as a group (e.g. a=1 AND b<20). Regards,Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > We can already keep stats for a functional index. Is there a reason we > can't keep stats for a multi-column index? The questions that need to be answered are (1) what stats are you gonna collect, and (2) exactly what are you going to do with them when you have 'em? All the previous discussions have stalled on the question of how to avoid trying to collect stats about an exponentially large number of column combinations; we've never even reached the question of what stats we'd actually want given that a particular combination has been determined to be interesting. Perhaps that's a trivial question, but it's been a mighty long time since I took statistics ... regards, tom lane
On Mon, 2007-03-19 at 21:24 -0400, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > We can already keep stats for a functional index. Is there a reason we > > can't keep stats for a multi-column index? > > The questions that need to be answered are (1) what stats are you gonna > collect, and (2) exactly what are you going to do with them when you > have 'em? > > All the previous discussions have stalled on the question of how to > avoid trying to collect stats about an exponentially large number of > column combinations; we've never even reached the question of what > stats we'd actually want given that a particular combination has been > determined to be interesting. Perhaps that's a trivial question, > but it's been a mighty long time since I took statistics ... > I know we can't keep stats on every combination of columns. My initial idea would be to only keep stats about a multi-column index (and probably optional for those, too). My thinking was that we could keep a histogram (and MCVs, etc.) of the non-scalar key in the multi-column index. That would provide the data the planner needs to answer a query like "WHERE a = 1 and b < 1000" if a and b are dependent and you have an index on (a,b). It seemed within reach to me initially because I could use a functional index (in which the function turns multiple values into a comparable scalar) and postgresql would index that and keep stats. And when it has those stats, it makes the correct plan. Of course, I have to litter the SQL with unnecessary function calls (so that it can use the functional index), which makes this undesirable. AndrewSN pointed out on IRC that keeping a histogram of non-scalar values is not as easy as I thought, because PostgreSQL doesn't allow arrays of composite types, among other problems. Is this a worthwhile area of exploration? Regards,Jeff Davis
Jeff Davis wrote: > > I know we can't keep stats on every combination of columns. My initial > idea would be to only keep stats about a multi-column index (and > probably optional for those, too). > Maybe you would want to keep single column indexes too, so that (more) accurate estimates for bitmap-and type plans are possible. Cheers Mark
On Tue, 2007-03-20 at 14:14 +1200, Mark Kirkwood wrote: > Jeff Davis wrote: > > > > I know we can't keep stats on every combination of columns. My initial > > idea would be to only keep stats about a multi-column index (and > > probably optional for those, too). > > > > Maybe you would want to keep single column indexes too, so that (more) > accurate estimates for bitmap-and type plans are possible. We should allow the DBA to specify which groups of cols to keep statistics on, if there is no index on that group. That solves the combinatorial explosion problem. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote: > On Tue, 2007-03-20 at 14:14 +1200, Mark Kirkwood wrote: >> Jeff Davis wrote: >>> I know we can't keep stats on every combination of columns. My initial >>> idea would be to only keep stats about a multi-column index (and >>> probably optional for those, too). >>> >> Maybe you would want to keep single column indexes too, so that (more) >> accurate estimates for bitmap-and type plans are possible. > > We should allow the DBA to specify which groups of cols to keep > statistics on, if there is no index on that group. > > That solves the combinatorial explosion problem. This is one hint I think everyone can agree on. Being able to say that values in different columns are related just gives the planner more information to work with. -- Richard Huxton Archonet Ltd
Richard Huxton wrote: > Simon Riggs wrote: > >On Tue, 2007-03-20 at 14:14 +1200, Mark Kirkwood wrote: > >>Jeff Davis wrote: > >>>I know we can't keep stats on every combination of columns. My initial > >>>idea would be to only keep stats about a multi-column index (and > >>>probably optional for those, too). > >>> > >>Maybe you would want to keep single column indexes too, so that (more) > >>accurate estimates for bitmap-and type plans are possible. > > > >We should allow the DBA to specify which groups of cols to keep > >statistics on, if there is no index on that group. > > > >That solves the combinatorial explosion problem. > > This is one hint I think everyone can agree on. Being able to say that > values in different columns are related just gives the planner more > information to work with. It was also suggested that column pairs in FK relationship could be automatically enabled. So you don't need to specify those manually. Now, the hard question is deciding what to keep track of. I don't think MCV makes much sense, because what's the MCV of two columns? Some sort of correlation index would seem to make certain sense. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote: > Now, the hard question is deciding what to keep track of. I don't > think MCV makes much sense, because what's the MCV of two columns? The combination that occurs most often. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Alvaro Herrera <alvherre@commandprompt.com> writes: > It was also suggested that column pairs in FK relationship could be > automatically enabled. So you don't need to specify those manually. Actually, I think you don't particularly need stats for that in most cases --- if the planner simply took note that the FK relationship exists, it would know that each row of the FK side joins to exactly one row of the PK side, which in typical cases is sufficient. regards, tom lane
Tom, > Actually, I think you don't particularly need stats for that in most > cases --- if the planner simply took note that the FK relationship > exists, it would know that each row of the FK side joins to exactly > one row of the PK side, which in typical cases is sufficient. Is it? What about the other direction? Currently, doesn't the planner assume that the rowcount relationship is 1 to ( child total rows / parent total rows) ? That's ok for tables with relatively even distribution, but not for skewed ones. --Josh Berkus
On Tue, 2007-03-20 at 09:03 +0000, Simon Riggs wrote: > We should allow the DBA to specify which groups of cols to keep > statistics on, if there is no index on that group. > > That solves the combinatorial explosion problem. > I think it would be a good first step if we could just keep stats on multiple columns in the same table. If we can do more than that, great. We could probably keep stats on multiple columns across different tables, but I don't know how those statistics should be used. Using statistics to estimate joins seems like a tricky problem. Maybe it's already solved with known algorithms? Regards,Jeff Davis
On Tue, 2007-03-20 at 18:12 +0100, Josh Berkus wrote: > Tom, > > > Actually, I think you don't particularly need stats for that in most > > cases --- if the planner simply took note that the FK relationship > > exists, it would know that each row of the FK side joins to exactly > > one row of the PK side, which in typical cases is sufficient. > > Is it? What about the other direction? Currently, doesn't the planner > assume that the rowcount relationship is 1 to ( child total rows / > parent total rows) ? That's ok for tables with relatively even > distribution, but not for skewed ones. > In theory, the PK constrains the available values of the FK, but doesn't provide any additional information about the relationship between the columns. However, in practice there is limited space to store MCVs and limited accuracy to n_distinct. So there may be a reason to store more information, but I don't know what we'd store. Do we have reports of bad estimates by the planner in this situation? Regards,Jeff Davis
Josh Berkus <josh@agliodbs.com> writes: >> Actually, I think you don't particularly need stats for that in most >> cases --- if the planner simply took note that the FK relationship >> exists, it would know that each row of the FK side joins to exactly >> one row of the PK side, which in typical cases is sufficient. > Is it? What about the other direction? I recall that we had decided at the Greenplum meeting last year that we could use a better heuristic if we noted that a join was being done on an FK-and-PK combination, but I don't recall the details right at the moment. Did anyone take notes? regards, tom lane
On Mon, Mar 19, 2007 at 06:55:56PM -0700, Jeff Davis wrote: > On Mon, 2007-03-19 at 21:24 -0400, Tom Lane wrote: > > Jeff Davis <pgsql@j-davis.com> writes: > > > We can already keep stats for a functional index. Is there a reason we > > > can't keep stats for a multi-column index? > > > > The questions that need to be answered are (1) what stats are you gonna > > collect, and (2) exactly what are you going to do with them when you > > have 'em? > > > > All the previous discussions have stalled on the question of how to > > avoid trying to collect stats about an exponentially large number of > > column combinations; we've never even reached the question of what > > stats we'd actually want given that a particular combination has been > > determined to be interesting. Perhaps that's a trivial question, > > but it's been a mighty long time since I took statistics ... > > > > I know we can't keep stats on every combination of columns. My initial > idea would be to only keep stats about a multi-column index (and > probably optional for those, too). > > My thinking was that we could keep a histogram (and MCVs, etc.) of the > non-scalar key in the multi-column index. That would provide the data > the planner needs to answer a query like "WHERE a = 1 and b < 1000" if a > and b are dependent and you have an index on (a,b). <snip> > AndrewSN pointed out on IRC that keeping a histogram of non-scalar > values is not as easy as I thought, because PostgreSQL doesn't allow > arrays of composite types, among other problems. I don't think the array problem is that big a deal, since PostgreSQL doesn't enforce array dimensions at all. You can just make the arrays for multi-column stats 2 dimensional, though handling indexes with different data types among the columns would be a bit tricky... right now the only choice I can think of would be to require that values could be cast to and from text and just store text in the array. Though obviously it'd be better to just allow arrays of composite types... The other challenge is that you can't make all the same assumptions with a multi-field histogram that you can with a single-field one. For example, if this is our index: a b - - 1 1 1 2 ... 1 1000 2 500 2 501 ... 3 5000 The histogram would likely position the buckets such that 1,1000 and 2,500 would fall within one bucket, which means the planner has no idea that b doesn't exceed 1000 when a is 1. I'm not sure how big of an issue that is in reality, though, because the planner does know that the bucket can only represent so many rows. It might be worth coming up with a different means to store the histogram for the multi-column case. > Is this a worthwhile area of exploration? ISTM it trips people up often enough to make it worth at least exploring... -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
"Jim C. Nasby" <jim@nasby.net> writes: > It might be worth coming up with a different means to store the > histogram for the multi-column case. A separate array for each column involved seems a whole lot less fragile than pretending we can handle mixed-type arrays. We probably need a different catalog anyway, or at least a reimagining of pg_statistic, since it doesn't hold more than one value of staattnum per row. regards, tom lane
On Tue, 2007-03-20 at 18:12, Josh Berkus wrote: > Tom, > > > Actually, I think you don't particularly need stats for that in most > > cases --- if the planner simply took note that the FK relationship > > exists, it would know that each row of the FK side joins to exactly > > one row of the PK side, which in typical cases is sufficient. > > Is it? What about the other direction? Currently, doesn't the planner > assume that the rowcount relationship is 1 to ( child total rows / > parent total rows) ? That's ok for tables with relatively even > distribution, but not for skewed ones. Wouldn't that be improved if the MCVs/histogram of the FK column are taken into account ? Considering that the FK part is unique, the skewness in the relationship is completely determined by the FK parts histogram. That would give at least a lower/upper bound and MCVs to the relationship. Cheers, Csaba.
This should read: > Considering that the FK part is unique, the ^^PK^^ > skewness in the relationship is completely determined by the FK part's > histogram. That would give at least a lower/upper bound and MCVs to the > relationship. Cheers, Csaba.