Discussion:
Hi, I'm new to tuning PostgreSQL and I have a query that gets slower after
I run a vacuum analyze.  I believe it uses a Hash Join before the analyze
and a Nested Loop IN Join after.  It seems the Nested Loop IN Join estimates
the correct number of rows, but underestimates the amount of time required.
I am curious why the vacuum analyze makes it slower and if that gives any
clues as to which parameter I should be tuning.

BTW, I know the query could be re-structured more cleanly to remove the
sub-selects, but that doesn't impact the performance.

thanks,
Jeff

Welcome to psql 8.1.5, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

plm_demo=# explain analyze SELECT count(*) AS count_all FROM symptoms WHERE ( 1=1 and symptoms.id in (select symptom_id from symptom_reports sr where 1=1 and sr.user_id in (select id from users where disease_id=1))) ;
                                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=366.47..366.48 rows=1 width=0) (actual time=125.093..125.095 rows=1 loops=1)
   ->  Hash Join  (cost=362.41..366.38 rows=36 width=0) (actual time=124.162..124.859 rows=106 loops=1)
         Hash Cond: ("outer".id = "inner".symptom_id)
         ->  Seq Scan on symptoms  (cost=0.00..3.07 rows=107 width=4) (actual time=0.032..0.295 rows=108 loops=1)
         ->  Hash  (cost=362.25..362.25 rows=67 width=4) (actual time=124.101..124.101 rows=106 loops=1)
               ->  HashAggregate  (cost=361.58..362.25 rows=67 width=4) (actual time=123.628..123.854 rows=106 loops=1)
                     ->  Hash IN Join  (cost=35.26..361.41 rows=67 width=4) (actual time=9.767..96.372 rows=13074 loops=1)
                           Hash Cond: ("outer".user_id = "inner".id)
                           ->  Seq Scan on symptom_reports sr  (cost=0.00..259.65 rows=13165 width=8) (actual time=0.029..33.359 rows=13074 loops=1)
                           ->  Hash  (cost=35.24..35.24 rows=11 width=4) (actual time=9.696..9.696 rows=1470 loops=1)
                                 ->  Bitmap Heap Scan on users  (cost=2.04..35.24 rows=11 width=4) (actual time=0.711..6.347 rows=1470 loops=1)
                                       Recheck Cond: (disease_id = 1)
                                       ->  Bitmap Index Scan on users_disease_id_index  (cost=0.00..2.04 rows=11 width=0) (actual time=0.644..0.644 rows=2378 loops=1)
                                             Index Cond: (disease_id = 1)
 Total runtime: 134.045 ms
(15 rows)

plm_demo=# vacuum analyze;
VACUUM
plm_demo=# analyze;
ANALYZE
plm_demo=# explain analyze SELECT count(*) AS count_all FROM symptoms WHERE ( 1=1 and symptoms.id in (select symptom_id from symptom_reports sr where 1=1 and sr.user_id in (select id from users where disease_id=1))) ;
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=586.47..586.48 rows=1 width=0) (actual time=3441.385..3441.386 rows=1 loops=1)
   ->  Nested Loop IN Join  (cost=149.05..586.26 rows=85 width=0) (actual time=54.517..3441.115 rows=106 loops=1)
         Join Filter: ("outer".id = "inner".symptom_id)
         ->  Seq Scan on symptoms  (cost=0.00..3.08 rows=108 width=4) (actual time=0.007..0.273 rows=108 loops=1)
         ->  Hash IN Join  (cost=149.05..603.90 rows=13074 width=4) (actual time=0.078..24.503 rows=3773 loops=108)
               Hash Cond: ("outer".user_id = "inner".id)
               ->  Seq Scan on symptom_reports sr  (cost=0.00..258.74 rows=13074 width=8) (actual time=0.003..9.044 rows=3773 loops=108)
               ->  Hash  (cost=145.38..145.38 rows=1470 width=4) (actual time=7.608..7.608 rows=1470 loops=1)
                     ->  Seq Scan on users  (cost=0.00..145.38 rows=1470 width=4) (actual time=0.006..4.353 rows=1470 loops=1)
                           Filter: (disease_id = 1)
 Total runtime: 3441.452 ms
(11 rows)
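A quick way to confirm that the Nested Loop IN Join choice itself is what
got slower, rather than anything about the data, is to disable nested loops
for the session and re-run the same EXPLAIN ANALYZE.  This is only a
diagnostic sketch using the tables from this thread, not something that was
actually run here; enable_nestloop is a standard planner setting and should
be reset afterwards.

    -- Diagnostic sketch: steer the planner away from nested loops for this
    -- session only, then re-run the problem query and compare runtimes.
    SET enable_nestloop = off;

    EXPLAIN ANALYZE
    SELECT count(*) AS count_all
    FROM symptoms
    WHERE symptoms.id IN (SELECT symptom_id
                          FROM symptom_reports sr
                          WHERE sr.user_id IN (SELECT id FROM users
                                               WHERE disease_id = 1));

    -- Restore the default so other queries in the session are unaffected.
    RESET enable_nestloop;

If the hash-based plan comes back near the original 134 ms, the regression
is purely a plan-choice problem rather than a data problem.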
Jeff Cole <cole.jeff@gmail.com> writes:
> Hi, I'm new to tuning PostgreSQL and I have a query that gets slower
> after I run a vacuum analyze.  I believe it uses a Hash Join before
> the analyze and a Nested Loop IN Join after.  It seems the Nested
> Loop IN Join estimates the correct number of rows, but underestimates
> the amount of time required.  I am curious why the vacuum analyze
> makes it slower and if that gives any clues as to which parameter I
> should be tuning.

Hm, the cost for the upper nestloop is way less than you would expect
given that the HASH IN join is going to have to be repeated 100+ times.
I think this must be due to a very low "join_in_selectivity" estimate
but I'm not sure why you are getting that, especially seeing that the
rowcount estimates aren't far off.  Can you show us the pg_stats rows
for symptoms.id and symptom_reports.symptom_id?

			regards, tom lane
On Mar 5, 2007, at 8:54 PM, Tom Lane wrote:
> Hm, the cost for the upper nestloop is way less than you would expect
> given that the HASH IN join is going to have to be repeated 100+ times.
> I think this must be due to a very low "join_in_selectivity" estimate
> but I'm not sure why you are getting that, especially seeing that the
> rowcount estimates aren't far off.  Can you show us the pg_stats rows
> for symptoms.id and symptom_reports.symptom_id?

Hi Tom, thanks for the response.  Here are the pg_stats.  I think I
understand what the stats say, but I don't know what to conclude from them.

plm_stage=# select * from pg_stats where tablename = 'symptoms' and attname = 'id';
 schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs |          histogram_bounds           | correlation
------------+-----------+---------+-----------+-----------+------------+------------------+-------------------+-------------------------------------+-------------
 public     | symptoms  | id      |         0 |         4 |         -1 |                  |                   | {1,11,24,34,46,57,71,85,95,106,117} |    0.451606

plm_stage=# select * from pg_stats where tablename = 'symptom_reports' and attname = 'symptom_id';
 schemaname |    tablename    |  attname   | null_frac | avg_width | n_distinct |    most_common_vals    |                                most_common_freqs                                     |          histogram_bounds           | correlation
------------+-----------------+------------+-----------+-----------+------------+------------------------+---------------------------------------------------------------------------------------+-------------------------------------+-------------
 public     | symptom_reports | symptom_id |         0 |         4 |         80 | {3,2,4,1,5,8,9,7,10,6} | {0.094,0.0933333,0.0933333,0.092,0.0913333,0.0903333,0.0866667,0.0843333,0.084,0.08}   | {12,18,24,30,38,44,51,57,91,91,114} |   0.0955925

And Ismo, I followed your suggestion to re-write the SQL more cleanly, and
you are right it was faster, so that is certainly a solution.  Although I am
still curious why my original query slowed down after the vacuum analyze.

In any case, here is the explain analyze from the new query.  Compare that
to the 3441.452 ms of the old query after the analyze (and 134.045 ms before
the analyze):

plm_stage=# explain analyze SELECT count(distinct s.id) AS count_all FROM symptoms s ,symptom_reports sr,users u WHERE s.id=sr.symptom_id and sr.user_id=u.id and u.disease_id in (1);
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=770.05..770.06 rows=1 width=4) (actual time=176.749..176.751 rows=1 loops=1)
   ->  Hash Join  (cost=89.43..737.50 rows=13020 width=4) (actual time=7.762..142.063 rows=13038 loops=1)
         Hash Cond: ("outer".symptom_id = "inner".id)
         ->  Hash Join  (cost=86.09..538.86 rows=13020 width=4) (actual time=7.277..89.293 rows=13038 loops=1)
               Hash Cond: ("outer".user_id = "inner".id)
               ->  Seq Scan on symptom_reports sr  (cost=0.00..257.38 rows=13038 width=8) (actual time=0.003..30.499 rows=13038 loops=1)
               ->  Hash  (cost=82.41..82.41 rows=1471 width=4) (actual time=7.261..7.261 rows=1471 loops=1)
                     ->  Seq Scan on users u  (cost=0.00..82.41 rows=1471 width=4) (actual time=0.006..4.133 rows=1471 loops=1)
                           Filter: (disease_id = 1)
         ->  Hash  (cost=3.07..3.07 rows=107 width=4) (actual time=0.469..0.469 rows=107 loops=1)
               ->  Seq Scan on symptoms s  (cost=0.00..3.07 rows=107 width=4) (actual time=0.007..0.247 rows=107 loops=1)
 Total runtime: 176.842 ms
(12 rows)
Jeff Cole <cole.jeff@gmail.com> writes:
> Hi Tom, thanks for the response.  Here are the pg_stats.  I think I
> understand what the stats say, but I don't know what to conclude from
> them.

OK, the symptom_id row claims there are only 80 distinct values of
symptom_id in symptom_reports.  This is a bit low (looks like the true
state of affairs is that all but 2 of the 108 entries of symptoms are
represented in symptom_reports), but it's not horridly off considering
that you're using the rather low default statistics_target.  What happens
is that the planner expects that on average only 80 rows of the inner
join will need to be scanned to find a match for a given symptoms.id,
and this makes the nestloop look cheap.  However, per your previous
EXPLAIN ANALYZE:

>    ->  Nested Loop IN Join  (cost=149.05..586.26 rows=85 width=0) (actual time=54.517..3441.115 rows=106 loops=1)
>          Join Filter: ("outer".id = "inner".symptom_id)
>          ->  Seq Scan on symptoms  (cost=0.00..3.08 rows=108 width=4) (actual time=0.007..0.273 rows=108 loops=1)
>          ->  Hash IN Join  (cost=149.05..603.90 rows=13074 width=4) (actual time=0.078..24.503 rows=3773 loops=108)

the *actual* average number of rows scanned is 3773.  I'm not sure why
this should be --- is it possible that the distribution of keys in
symptom_reports is wildly uneven?  This could happen if all of the
physically earlier rows in symptom_reports contain the same small set
of symptom_ids, but the stats don't seem to indicate such a skew.

			regards, tom lane
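Since the rather low default statistics_target is part of the picture, here
is a minimal sketch of raising the per-column target for symptom_id and
re-analyzing.  The value 100 is only an example (the 8.1 default sample
target is 10), and this step was not actually taken in the thread.

    -- Sample more rows for symptom_id so n_distinct and the
    -- most-common-values list are based on a larger sample.
    ALTER TABLE symptom_reports ALTER COLUMN symptom_id SET STATISTICS 100;
    ANALYZE symptom_reports;

    -- Re-check what the planner now believes about the column.
    SELECT n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'symptom_reports' AND attname = 'symptom_id';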
On Mar 6, 2007, at 11:40 AM, Tom Lane wrote:
> the *actual* average number of rows scanned is 3773.  I'm not sure why
> this should be --- is it possible that the distribution of keys in
> symptom_reports is wildly uneven?  This could happen if all of the
> physically earlier rows in symptom_reports contain the same small set
> of symptom_ids, but the stats don't seem to indicate such a skew.

Hi Tom, you are correct, the distribution is uneven...  In the 13k
symptom_reports rows, there are 105 distinct symptom_ids.  But the first
8k symptom_reports rows only have 10 distinct symptom_ids.  Could this
cause the problem, and would there be anything I could do to address it?

Thanks for all your help, I appreciate it.

-Jeff
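A sketch of how one could measure that skew directly from psql, assuming a
plain sequential scan returns rows in roughly physical order (SQL does not
guarantee any order without ORDER BY, so this is only an approximation):

    -- Distinct symptom_ids among roughly the first 8000 rows in on-disk
    -- order, versus the whole table.
    SELECT count(DISTINCT symptom_id) AS distinct_in_first_8k
    FROM (SELECT symptom_id FROM symptom_reports LIMIT 8000) AS first_chunk;

    SELECT count(DISTINCT symptom_id) AS distinct_overall
    FROM symptom_reports;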
Jeff Cole <cole.jeff@gmail.com> writes:
> Hi Tom, you are correct, the distribution is uneven...  In the 13k
> symptom_reports rows, there are 105 distinct symptom_ids.  But the
> first 8k symptom_reports rows only have 10 distinct symptom_ids.
> Could this cause the problem and would there be anything I could do
> to address it?

Ah-hah, yeah, that explains it.  Is it worth your time to deliberately
randomize the order of the rows in symptom_reports?  It wasn't clear
whether this query is actually something you need to optimize.  You
might have other queries that benefit from the rows being in nonrandom
order, so I'm not entirely sure that this is a good thing to do ...

			regards, tom lane
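For reference, a sketch of one way such a randomization could be done:
rebuild the table contents in random order via a scratch table.  This
assumes nothing references symptom_reports by foreign key and that the
application is quiet while it runs; it was not actually done in this thread.

    BEGIN;
    -- Copy the rows out in random order ...
    CREATE TABLE symptom_reports_shuffled AS
        SELECT * FROM symptom_reports ORDER BY random();

    -- ... and put them back in that order.  Indexes, constraints and
    -- defaults on symptom_reports stay in place; only the physical row
    -- order changes.
    TRUNCATE symptom_reports;
    INSERT INTO symptom_reports SELECT * FROM symptom_reports_shuffled;
    DROP TABLE symptom_reports_shuffled;
    COMMIT;

    -- Refresh statistics afterwards.
    ANALYZE symptom_reports;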
Hi Tom, thanks for the great job getting to the core of this problem...
I would say I'm not sure I want to randomize the rows (I'm not really even
sure how to do it without truncating the table and re-adding the records
in a random order).  I think for the moment I will either a) re-write the
query per Ismo's suggestion, or b) wait until more data comes into that
table, potentially kicking the query planner into not using the Nested
Loop anymore.

Anyway, thanks again, I appreciate it...

-Jeff

On Mar 7, 2007, at 11:37 AM, Tom Lane wrote:
> Jeff Cole <cole.jeff@gmail.com> writes:
>> Hi Tom, you are correct, the distribution is uneven...  In the 13k
>> symptom_reports rows, there are 105 distinct symptom_ids.  But the
>> first 8k symptom_reports rows only have 10 distinct symptom_ids.
>> Could this cause the problem and would there be anything I could do
>> to address it?
>
> Ah-hah, yeah, that explains it.  Is it worth your time to deliberately
> randomize the order of the rows in symptom_reports?  It wasn't clear
> whether this query is actually something you need to optimize.  You
> might have other queries that benefit from the rows being in nonrandom
> order, so I'm not entirely sure that this is a good thing to do ...
>
> 			regards, tom lane