Re: Improving N-Distinct estimation by ANALYZE

From: Josh Berkus
Subject: Re: Improving N-Distinct estimation by ANALYZE
Date:
Msg-id: 43BCBB6D.4020609@agliodbs.com
In reply to: Re: Improving N-Distinct estimation by ANALYZE  (Trent Shipley <tshipley@deru.com>)
Responses: Re: Improving N-Distinct estimation by ANALYZE  (Josh Berkus <josh@agliodbs.com>)
           Re: Improving N-Distinct estimation by ANALYZE  (Rod Taylor <pg@rbt.ca>)
           Re: Improving N-Distinct estimation by ANALYZE  (Greg Stark <gsstark@mit.edu>)
List: pgsql-hackers

Trent,

> Sorry to interupt.  The discussion is interesting, but I need some help to 
> follow along.

Thought-out commentary is welcome.


> Is "replace the algorithm" the same as saying "contextually use some estimate 
> of D that is not Chaudhuri?

Yes.  I favor a block-based approach like Brutlag, largely because it 
allows us to increase the sample size without dramatically increasing I/O.
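
To make that concrete, here's a rough Python sketch of block-level 
sampling (illustration only -- this is not Brutlag's exact method and 
not ANALYZE's code; the read_block callback is hypothetical):

    import random

    def block_sample(read_block, n_blocks, blocks_wanted):
        # read_block(i) -> list of rows in block i (hypothetical callback)
        chosen = random.sample(range(n_blocks), blocks_wanted)
        rows = []
        for b in chosen:
            # one I/O per block, but every row in the block joins the
            # sample, so sampled rows grow much faster than reads do
            rows.extend(read_block(b))
        return rows

Row-level sampling of S rows touches roughly S distinct blocks; reading 
whole blocks gets dozens or hundreds of rows per read, at the cost of 
having to correct for within-block clustering of values.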

> So Chaudhuri's estimate of D is appropriate (and is working) when making 
> decisions about joins.

Some kinds of joins.   It avoids, for example, risky use of nested loops.

>>However,   PostgreSQL now has a whole set of hash operations and other
>>query types for which a
>>too-low estimate of D causes query lockup.
> 
> 
> Why?  

Two specific examples, both of which I've encountered in the field:

1) A too-low D will cause an aggregate query to use a hashagg that is 
larger than memory, resulting in swapping (or disk spill, once that's 
fixed), which makes the query very much slower than if the hashagg had 
not been used (see the back-of-envelope numbers after this list).

2) much too-low D will cause the planner to pick a seq scan when it's 
not necessary, resulting in increased query times.
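
Back-of-envelope numbers for case 1 (the sizes here are invented purely 
for illustration):

    work_mem    = 64 * 1024 * 1024   # 64 MB budget the planner sizes the hashagg against
    entry_size  = 64                 # assumed bytes per hash-table entry

    estimated_D = 50_000             # what a too-low n-distinct tells the planner
    actual_D    = 5_000_000          # what the column actually contains

    print(estimated_D * entry_size)  # ~3 MB:   planner thinks the hashagg fits easily
    print(actual_D * entry_size)     # ~320 MB: the real hash table blows past work_mem

The plan is chosen around the 3 MB figure; the 320 MB reality is what 
swaps (or spills) at execution time.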


> Do you *really* want the median estimate in these case?  Are you certain you 
> do not want something with the opposite behavior of Chaudhuri's estimate so 
> that for small sample sizes the bias is toward a high estimate of D? 
> (Converges on D from the right instead of the left.)
> 
> Chaudhuri's <-----D------------------> needed
> Estimate                               estimate

Hmmm.  Yeah, I see what you mean.  True, the ideal approach would be to 
determine for each query operation whether a too-low D or a too-high D 
was more risky, and then use the more conservative number.  However, 
that would complicate the query planner enough that I think Tom would 
leave us. :-p


> These statements are at odds with my admittedly basic understanding of 
> statistics.  Isn't the power of a sample more related to the absolute size of 
> the sample than the sample as fraction of the population?  Why not just pick 
> a smallish sample size, say about 3000, and apply it to all the tables, even 
> the ones with just a single row (modify appropriately from block sampling).

Nope, it's definitely proportional.  As a simple example, a sample of 
500 rows in a table of 1,000 rows should yield stats estimates with 90%+ 
accuracy.  But a sample of 500 rows in a 600,000,000-row table is so 
small as to be nearly useless; it's quite possible to get all the same 
value in a random sample of < 0.1% even on a column with a D/N of 
0.001.  If you look at the papers cited, almost all researchers more 
recent than Chaudhuri use a proportional sample size.
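
A quick simulation shows how bad this gets (the skew below is invented 
for illustration: one value covers 99.9% of the rows, the other 0.1% is 
spread across about 600,000 rare values, so D/N is still roughly 0.001):

    import random

    TRIALS, SAMPLE = 10_000, 500
    all_same = 0
    for _ in range(TRIALS):
        # each sampled row is the dominant value with probability 0.999,
        # otherwise one of 599,999 rare values
        seen = {0 if random.random() < 0.999 else random.randrange(1, 600_000)
                for _ in range(SAMPLE)}
        all_same += (len(seen) == 1)

    print(all_same / TRIALS)   # roughly 0.999**500, i.e. about 0.61

So well over half the time a fixed 500-row sample sees exactly one 
value, and no estimator, however clever, can report anything but D = 1 
from that.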

And we're taking the fixed-sample-size approach now, and it's not 
working very well for large tables.

--Josh Berkus

