On Distributions In 7.2.1
From | Mark kirkwood |
---|---|
Subject | On Distributions In 7.2.1 |
Date | |
Msg-id | 1020313069.2325.19.camel@spikey.slithery.org |
Responses | Re: On Distributions In 7.2.1 (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-general |
Dear list,

I thought I would follow up my last posting on this topic by looking at 7.2.1, as there is a new estimator for distinct values (see src/backend/commands/analyze.c, ~line 1010 onwards).

The specimen table is the same as was used in the 7.2 post... to recap briefly:

dbbench3=> \d fact0
          Table "fact0"
 Column |  Type   | Modifiers
--------+---------+-----------
 d0key  | integer |
 d1key  | integer |
 d2key  | integer |
 val    | integer |
 filler | text    |
Unique keys: fact0_pk

dbbench3=> select count(*) from fact0;
  count
---------
 3000000
(1 row)

dbbench3=> select count(distinct d0key) from fact0;
 count
-------
  3000
(1 row)

The data is generated with each value occurring the same number of times, so the frequency of any particular value is:

dbbench3=> select 3000.0/3000000.0;
 ?column?
----------
    0.001
(1 row)

The d0key values are uniformly distributed through the table, so it should be a "good" test of the statistical sampling routines used by ANALYZE:

dbbench3=> alter table fact0 alter d0key set statistics 10;
ALTER
dbbench3=> analyze fact0;
ANALYZE
dbbench3=> select tablename,attname,n_distinct,most_common_freqs
           from pg_stats where attname='d0key';

 fact0 | d0key | 3128 | {0.00266667,0.00233333,0.002,0.00166667,0.00166667,
 0.00166667,0.00166667,0.00166667,0.00166667,0.00166667}
(1 row)

dbbench3=> alter table fact0 alter d0key set statistics 100;
ALTER
dbbench3=> analyze fact0;
ANALYZE
dbbench3=> select tablename,attname,n_distinct,most_common_freqs
           from pg_stats where attname='d0key';

 fact0 | d0key | 3000 | {0.0007,0.0007,0.0007,0.0007,0.000666667,0.000666667,
 0.000666667,0.000666667,0.000666667,0.000666667,0.000633333,0.000633333,
 0.000633333,0.000633333,0.000633333,0.000633333,0.000633333,0.000633333,
 0.000633333,0.000633333,0.000633333,0.000633333,0.000633333,0.000633333,
 0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,
 0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,0.000566667,0.000566667,
 0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,
 0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,
 0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,
 0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,
 0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,
 0.000566667,0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,
 0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,
 0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,
 0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,
 0.000533333,0.000533333,0.000533333}
(1 row)

So the number of distinct values is estimated extremely accurately, even with only 10 quantiles (in 7.2 it was overestimated by a factor of 10).

There is slightly odd behaviour in that the estimated frequencies decrease as the number of quantiles increases (the same as in 7.2 - same code here?). The frequencies are estimated best at about 30 quantiles: all 30 are between 0.0009 and 0.001 in this case.

I am wondering if this is caused by my example not having any "real" most common values (they are all as common as each other). I am going to fiddle with my data generation script, skew the distribution, and see what effect that has.

best wishes

Mark
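[As a postscript, for anyone curious what the new estimator in analyze.c is doing: the comments there attribute the n_distinct formula to Haas and Stokes, n*d / (n - f1 + f1*n/N), where n is the number of sampled rows, d the distinct values seen in the sample, f1 the number of values seen exactly once, and N the table's row count. Below is a minimal Python sketch of that formula applied to the same uniform data as above - my own illustration under those assumptions, not the actual C code; the sample size of 300 * statistics_target rows mirrors what I believe ANALYZE uses.]

```python
# Sketch of the Haas-Stokes "Duj1" distinct-value estimator, as described
# in the comments of src/backend/commands/analyze.c (my reading, not the
# actual C code):  n*d / (n - f1 + f1*n/N)
import random
from collections import Counter

def duj1_estimate(sample, total_rows):
    """Estimate the number of distinct values in a column of total_rows
    rows, given a list of sampled values from that column."""
    n = len(sample)
    counts = Counter(sample)
    d = len(counts)                                  # distinct in sample
    f1 = sum(1 for c in counts.values() if c == 1)   # seen exactly once
    if f1 == n:
        # Every sampled value was unique: assume the column is unique.
        return total_rows
    return n * d / (n - f1 + f1 * n / total_rows)

# Mirror the post's setup: 3,000,000 rows, 3000 distinct d0key values,
# each occurring 1000 times. With statistics target 100, ANALYZE samples
# roughly 300 * 100 = 30,000 rows (an assumption about the sample size).
random.seed(42)
N = 3_000_000
table = [i % 3000 for i in range(N)]      # uniformly distributed d0key
sample = random.sample(table, 30_000)
print(round(duj1_estimate(sample, N)))    # close to the true 3000
```

With a uniform distribution and ~10 expected occurrences of each value in the sample, f1 is tiny, so the estimate lands almost exactly on d - which matches the very accurate n_distinct = 3000 seen above, and explains why 7.2.1 does so much better here than 7.2's estimator did.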