Обсуждение: Simple query: how to optimize
I have two tables, one is called 'users' the other is 'user_activity'. The 'users' table simply contains the users in the system there is about 30,000 rows. The 'user_activity' table stores the activities the user has taken. This table has about 430,000 rows and also (notably) has a column which tracks the type of activity. 90% of the table is type 7 which indicates the user logged into the system. I am trying to write a simple query that returns the last time each user logged into the system. This is how the query looks at the moment: SELECT u.user_id, MAX(ua.activity_date) FROM pp_users u LEFT OUTER JOIN user_activity ua ON (u.user_id = ua.user_id AND ua.user_activity_type_id = 7) WHERE u.userstatus_id <> 4 AND age(u.joined_date) < interval '30 days' GROUP BY u.user_id The above query takes about 5 seconds but I'm wondering how it can be optimized. When the query is formatted as above it does use an index on the user_id column of the user_activity table... but the cost is huge (cost=0.00..1396700.80). I have tried formatting it another way with a sub-query but it takes about the same amount to completed: SELECT u.user_id, ua.last FROM pp_users u LEFT OUTER JOIN (SELECT max(activity_date) as last, user_id FROM user_activity WHERE user_activity_type_id = 7 GROUP BY user_id) as ua ON (u.user_id = ua.user_id) WHERE u.userstatus_id <> 4 AND age(u.joined_date) < interval '30 days' Can anybody offer any pointers on this scenario? Regards, Collin
On October 28, 2005 2:54 PM Collin Peters wrote: > I have two tables, one is called 'users' the other is 'user_activity'. ... > I am trying to write a simple query that returns the last time each > user logged into the system. This is how the query looks at the > moment: > > SELECT u.user_id, MAX(ua.activity_date) > FROM pp_users u > LEFT OUTER JOIN user_activity ua ON (u.user_id = ua.user_id AND > ua.user_activity_type_id = 7) > WHERE u.userstatus_id <> 4 > AND age(u.joined_date) < interval '30 days' > GROUP BY u.user_id You're first joining against the entire user table, then filtering out the users you don't need. Instead, filter out the users you don't need first, then do the join: SELECT users.user_id, MAX(ua.activity_date) FROM (SELECT u.user_id FROM pp_users u WHERE u.userstatus_id <> 4 AND age(u.joined_date) < interval '30 days' ) users LEFT OUTER JOIN user_activity ua ON (users.user_id = ua.user_id AND ua.user_activity_type_id = 7) GROUP BY users.user_id (disclaimer: I haven't actually tried this sql)
These two queries execute at exactly the same speed. When I run run EXPLAIN on them both they return the *exact* same query plan as well. I find this strange... but it is also kind of what I expected from reading up on various things. I am under the impression the postgresql will break up your query and run it as it sees best. So in the case of these two queries... it seems it is actually almost converting one into the other. Maybe I am wrong. Is there a good resource list somewhere for postgresql query optimization? There are entire books devoted to the subject for oracle but I can't find more than a few small articles on postgresql query optimizations on the web. Regards, Collin On 10/28/05, Roger Hand <RHand@kailea.com> wrote: > > SELECT u.user_id, MAX(ua.activity_date) > > FROM pp_users u > > LEFT OUTER JOIN user_activity ua ON (u.user_id = ua.user_id AND > > ua.user_activity_type_id = 7) > > WHERE u.userstatus_id <> 4 > > AND age(u.joined_date) < interval '30 days' > > GROUP BY u.user_id > > You're first joining against the entire user table, then filtering out the users > you don't need. > > Instead, filter out the users you don't need first, then do the join: > > SELECT users.user_id, MAX(ua.activity_date) > FROM > (SELECT u.user_id > FROM pp_users u > WHERE u.userstatus_id <> 4 > AND age(u.joined_date) < interval '30 days' > ) users > LEFT OUTER JOIN user_activity ua > ON (users.user_id = ua.user_id > AND ua.user_activity_type_id = 7) > GROUP BY users.user_id > > (disclaimer: I haven't actually tried this sql) >
A little bit more on my last post that I forget to mention. The two queries run at the same speed and have the same plan only if I have an index on the user_activity.user_id column. Otherwise they run at different speeds. The query you gave me actually runs slower without the index. All this is making my head spin!! :O On 10/28/05, Collin Peters <cadiolis@gmail.com> wrote: > These two queries execute at exactly the same speed. When I run run > EXPLAIN on them both they return the *exact* same query plan as well. > I find this strange... but it is also kind of what I expected from > reading up on various things. I am under the impression the > postgresql will break up your query and run it as it sees best. So > in the case of these two queries... it seems it is actually almost > converting one into the other. Maybe I am wrong. > > Is there a good resource list somewhere for postgresql query > optimization? There are entire books devoted to the subject for > oracle but I can't find more than a few small articles on postgresql > query optimizations on the web. > > Regards, > Collin > > On 10/28/05, Roger Hand <RHand@kailea.com> wrote: > > > SELECT u.user_id, MAX(ua.activity_date) > > > FROM pp_users u > > > LEFT OUTER JOIN user_activity ua ON (u.user_id = ua.user_id AND > > > ua.user_activity_type_id = 7) > > > WHERE u.userstatus_id <> 4 > > > AND age(u.joined_date) < interval '30 days' > > > GROUP BY u.user_id > > > > You're first joining against the entire user table, then filtering out the users > > you don't need. > > > > Instead, filter out the users you don't need first, then do the join: > > > > SELECT users.user_id, MAX(ua.activity_date) > > FROM > > (SELECT u.user_id > > FROM pp_users u > > WHERE u.userstatus_id <> 4 > > AND age(u.joined_date) < interval '30 days' > > ) users > > LEFT OUTER JOIN user_activity ua > > ON (users.user_id = ua.user_id > > AND ua.user_activity_type_id = 7) > > GROUP BY users.user_id > > > > (disclaimer: I haven't actually tried this sql) > > >
On Fri, Oct 28, 2005 at 03:40:40PM -0700, Roger Hand wrote: > You're first joining against the entire user table, then filtering out the users > you don't need. That's just wrong, sorry -- the planner is perfectly able to push the WHERE down before the join. I'd guess the problem is the age() query; age() doesn't really return what you'd expect, and I don't think it can use an index easily (I might be wrong here, though). Instead, try something like WHERE u.joined_date >= current_date - interval '30 days' except that if you're running pre-8.0, you might want to precalculate the right-hand side on the client. I couldn't see EXPLAIN ANALYZE of your query, BTW -- having it would be useful. /* Steinar */ -- Homepage: http://www.sesse.net/
Postgres is somewhat speed-challenged on aggregate functions. The most-repeated work-around would be something like: SELECT u.user_id, (SELECT activity_date FROM user_activity WHERE user_activity.user_id = pp_users.user_id AND user_activity_type_id = 7 ORDER BY activity_date DESC LIMIT 1) FROM pp_users u WHERE u.userstatus_id <> 4 AND age(u.joined_date) < interval '30 days' (code above is untested) I've read that aggregate functions are improved in the 8.1 code. I'm running 8.1beta3 on one machine but haven't experimented to verify the claimed improvements. Martin Nickel "Collin Peters" <cadiolis@gmail.com> wrote in message news:df01c91b0510281453v5c7ed502rfb3757e886046607@mail.gmail.com... >I have two tables, one is called 'users' the other is 'user_activity'. > The 'users' table simply contains the users in the system there is > about 30,000 rows. The 'user_activity' table stores the activities > the user has taken. This table has about 430,000 rows and also > (notably) has a column which tracks the type of activity. 90% of the > table is type 7 which indicates the user logged into the system. > > I am trying to write a simple query that returns the last time each > user logged into the system. This is how the query looks at the > moment: > > SELECT u.user_id, MAX(ua.activity_date) > FROM pp_users u > LEFT OUTER JOIN user_activity ua ON (u.user_id = ua.user_id AND > ua.user_activity_type_id = 7) > WHERE u.userstatus_id <> 4 > AND age(u.joined_date) < interval '30 days' > GROUP BY u.user_id > > The above query takes about 5 seconds but I'm wondering how it can be > optimized. When the query is formatted as above it does use an index > on the user_id column of the user_activity table... but the cost is > huge (cost=0.00..1396700.80). > > I have tried formatting it another way with a sub-query but it takes > about the same amount to completed: > > SELECT u.user_id, ua.last > FROM pp_users u > LEFT OUTER JOIN (SELECT max(activity_date) as last, user_id FROM > user_activity WHERE user_activity_type_id = 7 GROUP BY user_id) as ua > ON (u.user_id = ua.user_id) > WHERE u.userstatus_id <> 4 > AND age(u.joined_date) < interval '30 days' > > Can anybody offer any pointers on this scenario? > > Regards, > Collin > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly >