Обсуждение: strange index behaviour with different statistics target
So, I had a query that uses a postgis geometry index and the planner was underestimating the number of rows it would return. Because of this, the planner was choosing the geometry index over a compound index on the other columns in the WHERE clause. So, I thought, let me increase the stats target for that geometry column. I did, and I got a different (and better) plan, but when I looked at the estimates for the simplified query against the geometry column alone, I noticed that neither the cost nor the estimated rows changed: oitest=# explain ANALYZE SELECT * FROM "blips" WHERE ((ST_Contains(blips.shape, '0101000020E610000049111956F1EB55C0A8E49CD843F34440')) ); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using index_blips_on_shape_gist on blips (cost=0.00..7.33 rows=1 width=13804) (actual time=0.113..745.394 rows=2827 loops=1) Index Cond: (shape && '0101000020E610000049111956F1EB55C0A8E49CD843F34440'::geometry) Filter: ((shape && '0101000020E610000049111956F1EB55C0A8E49CD843F34440'::geometry) AND _st_contains(shape, '0101000020E610000049111956F1EB55C0A8E49CD843F34440'::geometry)) Total runtime: 745.977 ms (4 rows) Time: 747.199 ms oitest=# ALTER TABLE blips ALTER COLUMN shape SET statistics 1000; ALTER TABLE Time: 0.478 ms oitest=# ANALYZE ; ANALYZE Time: 7727.097 ms oitest=# explain ANALYZE SELECT * FROM "blips" WHERE ((ST_Contains(blips.shape, '0101000020E610000049111956F1EB55C0A8E49CD843F34440')) ); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using index_blips_on_shape_gist on blips (cost=0.00..7.33 rows=1 width=13761) (actual time=0.117..755.781 rows=2827 loops=1) Index Cond: (shape && '0101000020E610000049111956F1EB55C0A8E49CD843F34440'::geometry) Filter: ((shape && '0101000020E610000049111956F1EB55C0A8E49CD843F34440'::geometry) AND _st_contains(shape, '0101000020E610000049111956F1EB55C0A8E49CD843F34440'::geometry)) Total runtime: 756.396 ms (4 rows) The width changed slightly, but the cost is 7.33 in both. So, now I thought how could that have changed the plan? Did the other parts of the plan estimate change? So I pulled the shape column out of the where clause and left the others: oitest=# ALTER TABLE blips ALTER COLUMN shape SET statistics 100; ALTER TABLE Time: 0.475 ms oitest=# ANALYZE ; ANALYZE Time: 1225.325 ms oitest=# explain ANALYZE SELECT * FROM "blips" WHERE (blips."content_id" = 2410268 AND blips."content_type" = E'Story'); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------ Index Scan using index_blips_on_content_type_and_content_id on blips (cost=0.00..9.01 rows=2 width=13924) (actual time=0.026..0.027 rows=2 loops=1) Index Cond: (((content_type)::text = 'Story'::text) AND (content_id = 2410268)) Total runtime: 0.046 ms (3 rows) Time: 1.111 ms oitest=# ALTER TABLE blips ALTER COLUMN shape SET statistics 1000; ALTER TABLE Time: 0.506 ms oitest=# ANALYZE ; ANALYZE Time: 7785.496 ms oitest=# explain ANALYZE SELECT * FROM "blips" WHERE (blips."content_id" = 2410268 AND blips."content_type" = E'Story'); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------- Index Scan using index_blips_on_content_id on blips (cost=0.00..7.29 rows=1 width=13761) (actual time=0.013..0.014 rows=2 loops=1) Index Cond: (content_id = 2410268) Filter: ((content_type)::text = 'Story'::text) Total runtime: 0.034 ms (4 rows) Time: 1.007 ms So, my question is, should changing the stats target on the shape column affect the stats for the content_id and content_type columns? Also, why does the index on content_id win out over the compound index on (content_type, content_id)? "index_blips_on_content_id" btree (content_id) "index_blips_on_content_type_and_content_id" btree (content_type, content_id) -- Jeff Frost, Owner <jeff@frostconsultingllc.com> Frost Consulting, LLC http://www.frostconsultingllc.com/ Phone: 916-647-6411 FAX: 916-405-4032
Jeff Frost <jeff@frostconsultingllc.com> writes: > So, my question is, should changing the stats target on the shape column > affect the stats for the content_id and content_type columns? It would change the size of the sample for the table, which might improve the accuracy of the stats. IIRC you'd still get the same number of histogram entries and most-common-values for the other columns, but they might be more accurate. > Also, why does the index on content_id win out over the compound index > on (content_type, content_id)? It's deciding (apparently correctly, from the explain results) that the larger index isn't increasing the selectivity enough to be worth its extra search cost. I suppose content_type = 'Story' isn't very selective in this table? regards, tom lane
On Tue, 13 Jan 2009, Tom Lane wrote: > Jeff Frost <jeff@frostconsultingllc.com> writes: >> So, my question is, should changing the stats target on the shape column >> affect the stats for the content_id and content_type columns? > > It would change the size of the sample for the table, which might > improve the accuracy of the stats. IIRC you'd still get the same number > of histogram entries and most-common-values for the other columns, but > they might be more accurate. Why would they be more accurate? Do they somehow correlate with the other column's histogram and most-common-values when the stats target is increased on that column? The planner is choosing a plan I like for the query, I'm just trying to understand why it's doing that since the planner thinks the gist index is going to give it a single row (vs the 2827 rows it actually gets) and the fact that the cost didn't change for perusing the gist index. I guess I was expecting the estimated rowcount and cost for perusing the gist index to go up and when it didn't I was pleasantly surprised to find I got a plan I wanted anyway. > >> Also, why does the index on content_id win out over the compound index >> on (content_type, content_id)? > > It's deciding (apparently correctly, from the explain results) that the > larger index isn't increasing the selectivity enough to be worth its > extra search cost. I suppose content_type = 'Story' isn't very > selective in this table? Ah! You're right, especially with this content_id! -- Jeff Frost, Owner <jeff@frostconsultingllc.com> Frost Consulting, LLC http://www.frostconsultingllc.com/ Phone: 916-647-6411 FAX: 916-405-4032
Jeff Frost <jeff@frostconsultingllc.com> writes: > On Tue, 13 Jan 2009, Tom Lane wrote: >> It would change the size of the sample for the table, which might >> improve the accuracy of the stats. IIRC you'd still get the same number >> of histogram entries and most-common-values for the other columns, but >> they might be more accurate. > Why would they be more accurate? They'd be drawn from a larger sample of the table rows. If we need a random sample of N rows for the largest stats target among the columns, we use all those rows for deriving the stats for the other columns too. > The planner is choosing a plan I like for the query, I'm just trying to > understand why it's doing that since the planner thinks the gist index is > going to give it a single row (vs the 2827 rows it actually gets) and the fact > that the cost didn't change for perusing the gist index. You'd need to ask the postgis guys whether they have an estimator for ST_Contains that actually does anything useful. I haven't the foggiest what the state of their stats support is. [ looks again at the plan... ] Actually it looks like the estimator for && is what's at issue. Estimators are attached to operators not functions. regards, tom lane
On Tue, 13 Jan 2009, Tom Lane wrote: > Jeff Frost <jeff@frostconsultingllc.com> writes: >> On Tue, 13 Jan 2009, Tom Lane wrote: >>> It would change the size of the sample for the table, which might >>> improve the accuracy of the stats. IIRC you'd still get the same number >>> of histogram entries and most-common-values for the other columns, but >>> they might be more accurate. > >> Why would they be more accurate? > > They'd be drawn from a larger sample of the table rows. If we need a > random sample of N rows for the largest stats target among the columns, > we use all those rows for deriving the stats for the other columns too. Oh, ok, thanks Tom. That makes sense now. > >> The planner is choosing a plan I like for the query, I'm just trying to >> understand why it's doing that since the planner thinks the gist index is >> going to give it a single row (vs the 2827 rows it actually gets) and the fact >> that the cost didn't change for perusing the gist index. > > You'd need to ask the postgis guys whether they have an estimator for > ST_Contains that actually does anything useful. I haven't the foggiest > what the state of their stats support is. > > [ looks again at the plan... ] Actually it looks like the estimator > for && is what's at issue. Estimators are attached to operators not > functions. Thanks, I'll see if I can dig up some info on that and/or post to the postgis list if I can't turn anything up. -- Jeff Frost, Owner <jeff@frostconsultingllc.com> Frost Consulting, LLC http://www.frostconsultingllc.com/ Phone: 916-647-6411 FAX: 916-405-4032