Re: multivariate statistics v14
От | Tomas Vondra |
---|---|
Тема | Re: multivariate statistics v14 |
Дата | |
Msg-id | 66a87077-0ad5-8a93-296d-f74ad0df5538@2ndquadrant.com обсуждение исходный текст |
Ответ на | Re: multivariate statistics v14 (Alvaro Herrera <alvherre@2ndquadrant.com>) |
Список | pgsql-hackers |
On 03/21/2016 04:34 AM, Alvaro Herrera wrote: > Another skim on 0002: > > reference.sgml is missing a call to &alterStatistic. > > ObjectProperty[] contains a comment that the ACL is "same as relation", > but is that still correct, given that now stats may be related to more > than one relation? Do we even know what the rules for ACLs on > cross-relation stats are? One very simple way to get around this is to > dictate that all the rels must have the same owner. Perhaps we're not > considering the multi-relation case yet? As I wrote in response to Robert's message, I don't think we need ACLs for statistics - the user should be able to use them when they can access all the underlying relations (in a query). For ALTER STATISTICS the (owner || superuser) check should be enough, right? > > We have this FIXME comment in do_analyze_rel: > > + * FIXME This sample sizing is mostly OK when computing stats for > + * individual columns, but when computing multi-variate stats > + * for multivariate stats (histograms, mcv, ...) it's rather > + * insufficient. For stats on multiple columns / complex stats > + * we need larger sample sizes, because we need to build more > + * detailed stats (more MCV items / histogram buckets) to get > + * good accuracy. Maybe it'd be appropriate to use samples > + * proportional to the table (say, 0.5% - 1%) instead of a > + * fixed size might be more appropriate. Also, this should be > + * bound to the requested statistics size - e.g. number of MCV > + * items or histogram buckets should require several sample > + * rows per item/bucket (so the sample should be k*size). > > Maybe this merits more discussion. Right now we have an upper bound on > how much to scan for analyze; if we introduce the idea of scanning a > percentage of the relation, the time to analyze very large relations > could increase significantly. Do we have an idea of what to do for > this? For instance, a rule that would make me comfortable would say to > scan a sample 3x the current size when you have a mvstats on 3 columns; > then the size of fraction to scan is still bounded. But does that > actually work? From the wording of this comment, I assume you don't > actually know. Yeah. I think more discussion is needed, because I myself am not sure the FIXME is actually correct. For now I think we're OK with using the same logic as statistics on a single column (300 * target). > > In this block (CreateStatistics) > + /* look for duplicities */ > + for (i = 0; i < numcols; i++) > + for (j = 0; j < numcols; j++) > + if ((i != j) && (attnums[i] == attnums[j])) > + ereport(ERROR, > + (errcode(ERRCODE_UNDEFINED_COLUMN), > + errmsg("duplicate column name in statistics definition"))); > > isn't it easier to have the inner loop go from i+1 to numcols? It probably is. > > I wonder if this is sensible with multi-relation statistics: > + /* > + * Store a dependency too, so that statistics are dropped on DROP TABLE > + */ > + parentobject.classId = RelationRelationId; > + parentobject.objectId = ObjectIdGetDatum(RelationGetRelid(rel)); > + parentobject.objectSubId = 0; > + childobject.classId = MvStatisticRelationId; > + childobject.objectId = statoid; > + childobject.objectSubId = 0; > > I suppose the idea is to drop the stats if any of the rels they are for > is dropped. What do you mean by sensible? I mean, we don't support multiple tables at this point (except for choosing a syntax that should allow that), but the code assumes a single relation on a few places (like this one). > > Right after that you create a dependency on the schema. Is that > necessary? Since you have the dependency on the relation, the stats > would be dropped by recursion. Hmmmm, that's probably right. Also, now that I think about it, it probably gets broken after ALTER STATISTICS ... SET SCHEMA, because the code does not remove the old dependency (and does not create a new one). > > Why are you #include'ing builtins.h everywhere? Stupidity. > > RelationGetMVStatList() needs a comment. OK. > > Please get rid of common.h. It's totally unlike the way we structure > our header files. We don't keep headers in src/backend; they're all in > src/include. One reason is that the latter gets installed as a whole in > include/server, which this file will not be. This file may be necessary > to build some extensions in the future, for example. OK, I'll rework that and move it to src/include/. > > In mvstats.h, please mark function prototypes as "extern". > > Many files need a pgindent pass. OK. -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
В списке pgsql-hackers по дате отправления: