Discussion: Performance problem with table containing a lot of text (blog)
Hello!

Some background info: we have a blog table that contains about eight million blog entries. The average length of an entry is 1200 characters. Because each 8k page can accommodate only a few entries, every query that involves several entries causes several random seeks to disk. We are having problems with queries like:

1) give me a list of months when I have written something
2) give me the ids of entries I have written in month X of year X
3) give me the number of blog entries my friends have written since last time

Clustering would probably decrease random seeks, but it is not an option: it locks the table, the operation would take "some" time, and it would have to be repeated periodically to maintain the clustering.

I guess the file system cache gets filled with the text contents of blog entries, although they are totally useless for queries like these. The contents of individual blog entries are cached in memcached at the application level anyway, so there's rarely any need to fetch them from the database.

It would be nice if I could flag a column to be toasted always, regardless of its length. Because there isn't such an option, maybe I should create a separate table for the blog text content. Does anybody have better ideas for this? :)

Thanks!

P.S. Here's a plan for query #3. Users can have several bookmark groups they are following, and a user can limit the visibility of an entry to some of his/her bookmark groups. Those are not any kind of bottleneck anyway...
Sort  (cost=34112.60..34117.94 rows=2138 width=14)
  Sort Key: count(*), upper((u.nick)::text)
  ->  HashAggregate  (cost=33962.28..33994.35 rows=2138 width=14)
        ->  Nested Loop  (cost=8399.95..33946.24 rows=2138 width=14)
              ->  Nested Loop  (cost=8399.95..9133.16 rows=90 width=22)
                    ->  HashAggregate  (cost=8399.95..8402.32 rows=237 width=8)
                          ->  Nested Loop  (cost=0.00..8395.99 rows=792 width=8)
                                ->  Index Scan using user_bookmark_uid on user_bookmark ub  (cost=0.00..541.39 rows=2368 width=12)
                                      Index Cond: (uid = 256979)
                                ->  Index Scan using user_bookmark_group_pkey on user_bookmark_group bg  (cost=0.00..3.30 rows=1 width=4)
                                      Index Cond: ("outer".bookmark_group_id = bg.bookmark_group_id)
                                      Filter: (("type" >= 0) AND ("type" <= 1) AND (trace_blog = 'y'::bpchar))
                    ->  Index Scan using users_uid_accepted_only on users u  (cost=0.00..3.06 rows=1 width=14)
                          Index Cond: (u.uid = "outer".marked_uid)
              ->  Index Scan using blog_entry_uid_beid on blog_entry be  (cost=0.00..275.34 rows=24 width=8)
                    Index Cond: ((be.uid = "outer".marked_uid) AND (COALESCE("outer".last_seen_blog_entry_id, 0) < be.blog_entry_id))
                    Filter: ((visibility = 'p'::bpchar) AND ((status = 'p'::bpchar) OR (status = 'l'::bpchar)) AND ((bookmark_group_id IS NULL) OR (subplan)))
                    SubPlan
                      ->  Index Scan using user_bookmark_pkey on user_bookmark fub  (cost=0.00..3.42 rows=1 width=0)
                            Index Cond: ((bookmark_group_id = $0) AND (marked_uid = 256979))

P.S. That particular user has quite many unread entries, though...

Kari Lavikka - tuner@bdb.fi - (050) 380 3808
Kari Lavikka wrote:
> Hello!
>
> Some background info: we have a blog table that contains about eight
> million blog entries. The average length of an entry is 1200 characters.
> Because each 8k page can accommodate only a few entries, every query
> that involves several entries causes several random seeks to disk. We
> are having problems with queries like:
>
> 1) give me a list of months when I have written something
> 2) give me the ids of entries I have written in month X of year X
> 3) give me the number of blog entries my friends have written since
> last time

I didn't see your schema, but couldn't these problems be solved by storing the article id, owner id, and blog date in a separate table? It seems that if you don't actually need the content of the blogs, all of those questions could be answered by querying a very simple table with minimal I/O overhead.
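A minimal sketch of that approach (the table and column names here are assumptions for illustration, not the poster's actual schema):

```sql
-- Hypothetical metadata-only table: a few bytes per row, so many
-- entries fit on each 8k page and queries 1-3 touch far fewer pages.
CREATE TABLE blog_entry_meta (
    blog_entry_id  integer PRIMARY KEY,     -- article id
    uid            integer NOT NULL,        -- owner id
    created        timestamp NOT NULL       -- blog date
);
CREATE INDEX blog_entry_meta_uid_created ON blog_entry_meta (uid, created);

-- Query 1: months when user 12345 has written something
SELECT DISTINCT date_trunc('month', created) AS month
FROM blog_entry_meta
WHERE uid = 12345
ORDER BY month;

-- Query 2: ids of entries written in August 2007
SELECT blog_entry_id
FROM blog_entry_meta
WHERE uid = 12345
  AND created >= '2007-08-01' AND created < '2007-09-01';
```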
On Tue, 28 Aug 2007, Dan Harris wrote:
> I didn't see your schema, but couldn't these problems be solved by
> storing the article id, owner id, and blog date in a separate table?
> It seems that if you don't actually need the content of the blogs, all
> of those questions could be answered by querying a very simple table
> with minimal I/O overhead.

Yes. I was suggesting this as an option, but I'm wondering if there are other solutions.

Kari Lavikka - tuner@bdb.fi - (050) 380 3808
Kari Lavikka wrote:
> It would be nice if I could flag a column to be toasted always,
> regardless of its length.

The idea of being able to set the toast threshold per column was discussed during 8.3 development, but no patch was produced, IIRC. We might do that in the future. If you're willing to compile from source, you can lower TOAST_TUPLE_THRESHOLD.

You could also use ALTER TABLE ... ALTER COLUMN ... SET STORAGE EXTERNAL to force the long blog entries to be stored in the toast table instead of being compressed in the main table. Values smaller than TOAST_TUPLE_THRESHOLD (2k by default?) still wouldn't be toasted, though, so it might not make much difference.

> Because there isn't such an option, maybe I should create a separate
> table for the blog text content. Does anybody have better ideas for
> this? :)

That's probably the easiest solution. You can put a view on top of the two tables to hide the split from the application.

> P.S. Here's a plan for query #3. Users can have several bookmark groups
> they are following, and a user can limit the visibility of an entry to
> some of his/her bookmark groups. Those are not any kind of bottleneck
> anyway...

If the user_bookmark table is not clustered by uid, I'm surprised the planner didn't choose a bitmap index scan. Which version of PostgreSQL is this?

PS. EXPLAIN ANALYZE is much more helpful than plain EXPLAIN.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
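For the archives, those two suggestions look roughly like this (the `blog_entry`, `blog_entry_text`, and column names are made up for illustration):

```sql
-- Force the content column out of line into the toast table,
-- uncompressed (EXTERNAL = out-of-line, no compression):
ALTER TABLE blog_entry ALTER COLUMN content SET STORAGE EXTERNAL;

-- Or split the text into its own table and hide the split behind a view:
CREATE TABLE blog_entry_text (
    blog_entry_id integer PRIMARY KEY REFERENCES blog_entry,
    content       text NOT NULL
);

CREATE VIEW blog_entry_v AS
SELECT be.blog_entry_id, be.uid, be.created, bet.content
FROM blog_entry be
JOIN blog_entry_text bet USING (blog_entry_id);
```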
On Wed, 29 Aug 2007, Heikki Linnakangas wrote:
> The idea of being able to set the toast threshold per column was
> discussed during 8.3 development, but no patch was produced IIRC. We
> might do that in the future. If you're willing to compile from source,
> you can lower TOAST_TUPLE_THRESHOLD.

We are currently using Postgres 8.1 but have to upgrade to 8.2 shortly; the new version fixes some vacuum problems. I always compile Postgres from source.

Maybe I have to do some calculations, because that setting affects all tables and databases. Most of our text/varchar columns are quite short, but setting the threshold too low causes excessive seeks into toast tables... right?

>> Because there isn't such an option, maybe I should create a separate
>> table for blog text content. Does anybody have better ideas for this? :)
>
> That's probably the easiest solution. You can put a view on top of them
> to hide it from the application.

Yeah.

> If the user_bookmark table is not clustered by uid, I'm surprised the
> planner didn't choose a bitmap index scan.

Drumroll... the indexes are:

    "user_bookmark_pkey" PRIMARY KEY, btree (bookmark_group_id, marked_uid), tablespace "lun3"
    "user_bookmark_marked_uid" btree (marked_uid)
    "user_bookmark_uid" btree (uid) CLUSTER, tablespace "lun3"

Queries are mostly like "give me all of my bookmarked friends in all of my bookmark groups" and rarely the opposite, "give me all users who have bookmarked me". I have clustered the table using uid to minimize random page fetches.

- Kari
Kari Lavikka wrote:
> We are currently using Postgres 8.1 but have to upgrade to 8.2 shortly;
> the new version fixes some vacuum problems.
>
> I always compile Postgres from source. Maybe I have to do some
> calculations, because that setting affects all tables and databases.
> Most of our text/varchar columns are quite short, but setting the
> threshold too low causes excessive seeks into toast tables... right?

Right. If you have trouble finding the right balance, you can also use ALTER TABLE ... ALTER COLUMN ... SET STORAGE PLAIN to force the other columns not to be toasted.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
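In other words (table and column names assumed for illustration):

```sql
-- Keep a short column inline and uncompressed, so a lowered
-- TOAST_TUPLE_THRESHOLD doesn't push it out to the toast table:
ALTER TABLE users ALTER COLUMN nick SET STORAGE PLAIN;
```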