Discussion: Estimating number of distinct values.
Hello hackers,

I will be pleased if somebody (first of all Robert) can comment on the "strange" results of distinct-value estimation described below. There is the following code in analyze.c:

    /*----------
     * Estimate the number of distinct values using the estimator
     * proposed by Haas and Stokes in IBM Research Report RJ 10025:
     *        n*d / (n - f1 + f1*n/N)
     * where f1 is the number of distinct values that occurred
     * exactly once in our sample of n rows (from a total of N),
     * and d is the total number of distinct values in the sample.
     * This is their Duj1 estimator; the other estimators they
     * recommend are considerably more complex, and are numerically
     * very unstable when n is much smaller than N.
     *
     * In this calculation, we consider only non-nulls.  We used to
     * include rows with null values in the n and N counts, but that
     * leads to inaccurate answers in columns with many nulls, and
     * it's intuitively bogus anyway considering the desired result is
     * the number of distinct non-null values.
     *
     * Overwidth values are assumed to have been distinct.
     *----------
     */
    int         f1 = ndistinct - nmultiple + toowide_cnt;
    int         d = f1 + nmultiple;
    double      n = samplerows - null_cnt;
    double      N = totalrows * (1.0 - stats->stanullfrac);
    double      stadistinct;

    /* N == 0 shouldn't happen, but just in case ... */
    if (N > 0)
        stadistinct = (n * d) / ((n - f1) + f1 * n / N);
    else
        stadistinct = 0;

In my case there are no null values:

    totalrows = 48980014
    samplerows = 30000
    ndistinct = 29135
    nmultiple = 800
    null_cnt = 0
    stats->stanullfrac = 0
    toowide_cnt = 0

This formula gives:

    stadistinct = 503391.66267850093

A "naive" estimate of the number of distinct values, d*N/n, gives 47567756, which is much closer to the real number of distinct values and is almost 100 times (!!!) larger than the Duj1 estimate. The real number of distinct values in this dataset is about 10 million.

For some reason, sampling with random blocks and the Vitter algorithm produces worse results than just examining the first 30000 rows of the table: in that case the number of distinct values in the sample is 6835, and d/n*N is about 11 million, which is very close to the real number of distinct values. Certainly this can be a consequence of the data distribution in this particular table.

I have read the cited article, and the other estimators do look really difficult to implement. But the Duj1 result in this case is genuinely confusing. Maybe instead of sampling-based estimation we could use streaming-based estimation, for example the HyperLogLog algorithm already present in Postgres?
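For reference, the arithmetic is easy to reproduce with a short standalone C sketch (illustrative code, not part of analyze.c). Plugging the quoted counters into the Duj1 formula and into the naive scale-up, it prints a Duj1 estimate of roughly 520 thousand (slightly off the 503391.66 quoted above, so the counters shown were presumably rounded or taken from a different run) against a naive estimate of about 47.5 million, reproducing the roughly hundredfold gap:

    #include <stdio.h>

    int
    main(void)
    {
        /* counters quoted in the message above */
        double      totalrows = 48980014;
        double      samplerows = 30000;
        int         ndistinct = 29135;
        int         nmultiple = 800;
        int         toowide_cnt = 0;
        int         null_cnt = 0;

        /* same derivations as in analyze.c; stanullfrac = 0 here */
        int         f1 = ndistinct - nmultiple + toowide_cnt;
        int         d = f1 + nmultiple;
        double      n = samplerows - null_cnt;
        double      N = totalrows;

        printf("Duj1 estimate: %f\n", (n * d) / ((n - f1) + f1 * n / N));
        printf("naive d*N/n:   %f\n", d * N / n);
        return 0;
    }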
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
> I will be pleased if somebody (first of all Robert) can comment on the
> "strange" results of distinct-value estimation.

Estimating the number of distinct values from a small sample is a hard problem; every algorithm is going to blow it in some cases.

> In my case there are no null values:
> totalrows = 48980014
> samplerows = 30000
> ndistinct = 29135
> nmultiple = 800

You seem to be using the default statistics target. Possibly raising that would give better answers for this column.

> Maybe instead of sampling-based estimation we could use streaming-based
> estimation, for example the HyperLogLog algorithm already present in
> Postgres?

Maybe, but I'd be really surprised if HLL were any sort of magic bullet. I think it's intended to make an ndistinct estimate with just a small amount of state preserved as rows are scanned, which is an admirable goal but not one that ANALYZE particularly needs. Adopting such a constraint when we don't need it does not sound like a recipe for getting a better final answer.

            regards, tom lane
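For reference, the per-column statistics target can be raised like this (table and column names here are hypothetical). ANALYZE samples 300 * target rows, so the default target of 100 is exactly what produced the 30000-row sample quoted above:

    ALTER TABLE my_table ALTER COLUMN my_column SET STATISTICS 1000;
    ANALYZE my_table;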
On Wed, Oct 24, 2018 at 10:07 AM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
> The real number of distinct values in this dataset is about 10 million.
> For some reason, sampling with random blocks and the Vitter algorithm
> produces worse results than just examining the first 30000 rows of the
> table:
It is a known problem with our sampling method that once a block is chosen, it then oversamples rows from that block [1]. So you get too many blocks with zero rows sampled or with two or more rows sampled, and too few with exactly one row sampled. If rows with the same value are clustered together in the same blocks, this will find too many duplicates and badly skew the Duj1 estimate, because we are feeding it a biased sample.
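To make the effect concrete, here is a self-contained C sketch. It is not the server's actual two-stage Vitter sampling; the table layout and the deliberately exaggerated whole-block sampling are illustrative assumptions. Every value occurs exactly twice, with both copies in the same block, so Duj1 on a uniform row sample recovers roughly the true 500000 distinct values, while a clustered sample sees nearly every value twice and the estimate collapses toward the sample's own distinct count:

    #include <stdio.h>
    #include <stdlib.h>

    #define NROWS           1000000     /* rows in the toy table */
    #define ROWS_PER_BLOCK  100
    #define NBLOCKS         (NROWS / ROWS_PER_BLOCK)
    #define NVALUES         (NROWS / 2) /* every value occurs exactly twice */
    #define SAMPLE          30000

    /* rows 2i and 2i+1 share value i, so duplicates never straddle a block */
    static int
    row_value(int row)
    {
        return row / 2;
    }

    /* Duj1, the same formula as in analyze.c */
    static double
    duj1(double n, double N, int d, int f1)
    {
        return (N > 0) ? (n * d) / ((n - f1) + f1 * n / N) : 0;
    }

    static double
    estimate(const int *sample, int n)
    {
        static int  freq[NVALUES];
        int         i, d = 0, f1 = 0;

        for (i = 0; i < NVALUES; i++)
            freq[i] = 0;
        for (i = 0; i < n; i++)
            freq[row_value(sample[i])]++;
        for (i = 0; i < NVALUES; i++)
        {
            if (freq[i] == 1)
                f1++;
            if (freq[i] >= 1)
                d++;
        }
        return duj1(n, NROWS, d, f1);
    }

    int
    main(void)
    {
        static int  sample[SAMPLE];
        static char taken[NROWS];
        int         i, j, k = 0;

        srand(42);

        /* (a) uniform row sampling without replacement, by redrawing on
         * collision (assumes RAND_MAX >= NROWS - 1, true on e.g. glibc) */
        for (i = 0; i < SAMPLE; i++)
        {
            int         row;

            do
                row = rand() % NROWS;
            while (taken[row]);
            taken[row] = 1;
            sample[i] = row;
        }
        printf("uniform rows: Duj1 ~ %.0f (true %d)\n",
               estimate(sample, SAMPLE), NVALUES);

        /* (b) exaggerated cluster sampling: take whole random blocks */
        for (i = 0; i < SAMPLE / ROWS_PER_BLOCK; i++)
        {
            int         block = rand() % NBLOCKS;

            for (j = 0; j < ROWS_PER_BLOCK; j++)
                sample[k++] = block * ROWS_PER_BLOCK + j;
        }
        printf("whole blocks: Duj1 ~ %.0f (true %d)\n",
               estimate(sample, SAMPLE), NVALUES);
        return 0;
    }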
Tomas was working on a patch to make the sampling truly random [2], but I think he abandoned it to work on multivariate statistics instead. It is hard to tell whether the IO implications of no longer reading sampled blocks in physical order would be acceptable, because everyone has different hardware, data, and ideas of what is acceptable.
Cheers,
Jeff