Discussion: analyze.c
Hi!

About analyze.c:
If taken out of vacuum, couldn't it be completely taken out of pg? Say, to an external program? What's the big reason not to do that? I know that there is some code in analyze.c (like comparing) that uses other parts of pg, but that seems to be easily fixed.

I'm leaning toward the implementation of end-biased histograms. There is an introductory reference in the IEEE Data Engineering Bulletin, September 1995 (available on the Microsoft Research site).

Best Regards,
Tiago
Tiago Antão <tra@fct.unl.pt> writes:
> About analyze.c:
> If taken out of vacuum, couldn't it be completely taken out of pg? Say,
> to an external program?

Not if you want to do anything useful with it --- direct access to the database is only possible within the context of a backend, because of all the locking, buffering, etc. behavior that you must adhere to.

> What's the big reason not to do that? I know that
> there is some code in analyze.c (like comparing) that uses other parts of
> pg, but that seems to be easily fixed.

Are you proposing not to do any comparisons? It will be interesting to see how you can compute a histogram without any idea of equality or ordering. But if you want that, then you still need the function-call manager as well as the type-specific comparison routines for every datatype that you might be asked to operate on (don't forget user-defined types here).

In short, I doubt you can build a useful analyze-engine that's significantly smaller than a full backend.

Besides, having ANALYZE available as a regular SQL command is just too useful to want to see it moved out to some outside program that would have to be run separately.

> I'm leaning toward the implementation of end-biased histograms. There is
> an introductory reference in the IEEE Data Engineering Bulletin, September
> 1995 (available on the Microsoft Research site).

Sounds interesting. Can you give us an exact URL?

regards, tom lane
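The point about type-specific comparison routines can be illustrated with a minimal sketch. It uses Python's `decimal.Decimal` as a stand-in for a numeric datatype, and the sample values and helper names (`count_by_text`, `count_by_type`) are hypothetical; none of this is PostgreSQL code. It only shows that two values can be equal under the type's own comparison while their text forms differ, so counting by text may split one value into several.

```python
from collections import Counter
from decimal import Decimal

# Hypothetical column values as they might arrive in text form.
raw_values = ["1.0", "1.00", "2.5", "1.0"]

def count_by_text(values):
    """Group values using plain string equality."""
    return Counter(values)

def count_by_type(values):
    """Group values using the type's own equality (Decimal as a stand-in)."""
    return Counter(Decimal(v) for v in values)

by_text = count_by_text(raw_values)
by_type = count_by_type(raw_values)

print(len(by_text))  # number of distinct values under string comparison
print(len(by_type))  # number of distinct values under type-aware comparison
```

Whether the difference matters depends on the datatype: for types whose text output is canonical, the two counts would agree.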
On Wed, 23 Aug 2000, Tom Lane wrote:

> > What's the big reason not to do that? I know that
> > there is some code in analyze.c (like comparing) that uses other parts of
> > pg, but that seems to be easily fixed.
>
> Are you proposing not to do any comparisons? It will be interesting to
> see how you can compute a histogram without any idea of equality or
> ordering. But if you want that, then you still need the function-call
> manager as well as the type-specific comparison routines for every
> datatype that you might be asked to operate on (don't forget
> user-defined types here).

I forgot user-defined data types :-(, but regarding histograms I think the code can be made external (at least for testing purposes):

1. I was not suggesting not to do any comparisons, but I think the only comparison I need is equality. I don't need order, as I don't need to calculate mins or maxes on the data itself to make a histogram (I just need mins and maxes on frequencies).
2. The mapping to text guarantees that I have a way of knowing about equality regardless of type (PQgetvalue always returns char* and pg_statistics keeps a "text" anyway).

But at least anything relating to order has to be in.

> > I'm leaning toward the implementation of end-biased histograms. There is
> > an introductory reference in the IEEE Data Engineering Bulletin, September
> > 1995 (available on the Microsoft Research site).
>
> Sounds interesting. Can you give us an exact URL?

http://www.research.microsoft.com/research/db/debull/default.htm

BTW, you can get access to SIGMOD CDs with lots of goodies for a very low price (at least in 1999 it was a bargain); check out ACM membership for SIGMOD.

I've been reading something about implementation of histograms, and, AFAIK, in practice "histogram" is just a cool name for no more than:

1. top ten with frequency for each
2. the same for the bottom ten
3. average for the rest

I'm writing code to get this info (outside pg for now, for testing purposes).

Best Regards,
Tiago

PS - again: I'm just starting, so some of my comments may be completely dumb.
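The three-part summary described in this message (most frequent values, least frequent values, an average for the rest) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code mentioned in the message: the function name `end_biased_histogram`, the parameter `k`, and the returned dictionary layout are all hypothetical. It relies only on equality of the values (via hashing); ordering is applied to the frequencies, never to the values themselves.

```python
from collections import Counter

def end_biased_histogram(values, k=10):
    """Summarize values as top-k, bottom-k, and an average for the rest.

    Frequencies are returned as fractions of the total row count.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {"top": [], "bottom": [], "rest_avg": 0.0}

    # Distinct values sorted by descending frequency.
    ranked = counts.most_common()
    top = ranked[:k]
    # Pick the least frequent values from what is left,
    # so that no value is reported in both lists.
    remaining = ranked[k:]
    bottom = remaining[-k:]
    middle = remaining[:-k]

    # Average frequency of the values that are not stored individually.
    if middle:
        rest_avg = sum(c for _, c in middle) / len(middle) / total
    else:
        rest_avg = 0.0

    return {
        "top": [(v, c / total) for v, c in top],
        "bottom": [(v, c / total) for v, c in bottom],
        "rest_avg": rest_avg,
    }

# Small usage example with k=1 so that all three parts are non-empty.
sample = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"] * 2 + ["e"] * 1
hist = end_biased_histogram(sample, k=1)
print(hist)
```

Because only equality is needed here, values could in principle be compared in text form, subject to the caveat that text equality and type-specific equality may not always agree.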
> > > I'm leaning toward the implementation of end-biased histograms. There is
> > > an introductory reference in the IEEE Data Engineering Bulletin, september
> > > 1995 (available on microsoft research site).
> >
> > Sounds interesting. Can you give us an exact URL?
>
> http://www.research.microsoft.com/research/db/debull/default.htm
>
> BTW, you can get access to SIGMOD CDs with lots of goodies for a very low
> price (at least in 1999 it was a bargain), check out ACM membership for
> sigmod.

Thanks. I will look into that. SIGMOD has some real valuable stuff.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> Hi!
>
> About analyze.c:
> If taken out vacuum, couldn't it be completly taken out of pg? Say,
> to an external program? What's the big reason not to do that? I know that
> there is some code in analyze.c (like comparing) that uses other parts of
> pg, but that seems to be easily fixed.
>
> I'm leaning toward the implementation of end-biased histograms. There is
> an introductory reference in the IEEE Data Engineering Bulletin, september
> 1995 (available on microsoft research site).

Why take it out of the backend? Seems like a real pain, especially when you realize what functions it would have to call.

Also, keep in mind that the current analyze generates perfect estimates for columns containing only two unique values, and columns containing only unique values. All other cases generate imperfect statistics.

--
  Bruce Momjian
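One way to see why the two-value and all-unique cases can come out exact while others do not is the sketch below. It assumes, purely for illustration, that the only statistics kept per column are the most common value, its frequency, and the number of distinct values, and that the leftover frequency is spread evenly over the other values. That model, the function name `estimate_selectivity`, and its parameters are assumptions, not a description of what analyze.c actually stores.

```python
def estimate_selectivity(value, most_common, most_common_freq, n_distinct):
    """Estimate the fraction of rows equal to `value`.

    Hypothetical model: only the most common value, its frequency,
    and the number of distinct values are known.
    """
    if value == most_common:
        return most_common_freq
    if n_distinct <= 1:
        return 0.0
    # Spread the remaining frequency evenly over the other distinct values.
    return (1.0 - most_common_freq) / (n_distinct - 1)

# Two distinct values: the second frequency is fully determined by the first.
two_valued = estimate_selectivity("f", "t", 0.7, 2)

# All values unique (4 rows): every value has the same frequency.
all_unique = estimate_selectivity(3, 1, 0.25, 4)

# Three skewed values with true frequencies 0.7 / 0.25 / 0.05:
# the even spread gives the same estimate for "b" and "c", so both are off.
skewed_b = estimate_selectivity("b", "a", 0.7, 3)

print(two_valued, all_unique, skewed_b)
```

Under this model the first two estimates match the true frequencies, while the skewed case shows the kind of error that a richer histogram is meant to reduce.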
> BTW, you can get access to SIGMOD CDs with lots of goodies for a very low
> price (at least in 1999 it was a bargain), check out ACM membership for
> sigmod.
>
> I've been reading something about implementation of histograms, and,
> AFAIK, in practice histograms is just a cool name for no more than:
> 1. top ten with frequency for each
> 2. the same for top ten worse
> 3. average for the rest

I wonder if just increasing the number of buckets in analyze.c would help?

--
  Bruce Momjian