Discussion: ts_rank seems very slow (140 ranked documents / second on my machine)
Hello,

I'm testing PostgreSQL full-text search on a table containing 1,000,000 documents. The average document length is 5,700 characters. Performance is good, and very similar to what I can get with Xapian, as long as I don't use ts_rank. But response time collapses if I use ts_rank to select the 50 best matching documents.

This is the table and index definition:

    create table posts_1000000 (
        id serial primary key,
        document_vector tsvector
    );
    create index index_posts_documents_1000000
        on posts_1000000 using gin(document_vector);

This is the query without ts_rank (the word 'crare' matches 5% of documents):

    select id
    from posts_1000000
    where to_tsquery('english', 'crare') @@ document_vector
    limit 50;

    Limit (cost=0.00..27.93 rows=50 width=4) (actual time=0.303..12.559 rows=50 loops=1)
      Output: id
      ->  Seq Scan on posts_1000000 (cost=0.00..27472.51 rows=49184 width=4) (actual time=0.299..12.451 rows=50 loops=1)
            Output: id
            Filter: ('''crare'''::tsquery @@ document_vector)
    Total runtime: 12.642 ms

Now, this is the query using ts_rank:

    select id
    from posts_1000000
    where to_tsquery('english', 'crare') @@ document_vector
    order by ts_rank_cd(document_vector, to_tsquery('english', 'crare'), 32) desc
    limit 50;

    Limit (cost=29229.33..29229.45 rows=50 width=22) (actual time=355516.233..355516.339 rows=50 loops=1)
      Output: id
      ->  Sort (cost=29229.33..29352.29 rows=49184 width=22) (actual time=355516.230..355516.268 rows=50 loops=1)
            Output: id
            Sort Key: (ts_rank_cd(document_vector, '''crare'''::tsquery, 32))
            Sort Method: top-N heapsort  Memory: 27kB
            ->  Seq Scan on posts_1000000 (cost=0.00..27595.47 rows=49184 width=22) (actual time=0.251..355389.367 rows=49951 loops=1)
                  Output: id
                  Filter: ('''crare'''::tsquery @@ document_vector)
    Total runtime: 355535.063 ms

The ranking is very slow: 140 ranked documents per second on my machine! I suspect this is because ts_rank needs to read document_vector, and because that column is stored in the TOAST table, it triggers a random access for each matching row. Am I correct?
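If it helps, my guess about TOAST can be checked directly. This is just a sketch using the table name above; pg_relation_size and pg_total_relation_size are available in 8.4:

```sql
-- Rough check: how much of posts_1000000 lives outside the main heap.
-- pg_relation_size counts only the heap; pg_total_relation_size also
-- counts indexes and the TOAST table, so the difference is an upper
-- bound on the TOAST footprint.
SELECT pg_size_pretty(pg_relation_size('posts_1000000'))        AS heap_size,
       pg_size_pretty(pg_total_relation_size('posts_1000000')
                      - pg_relation_size('posts_1000000'))      AS toast_plus_indexes;
```

A large toast_plus_indexes value relative to heap_size would be consistent with each ts_rank call paying for a random TOAST fetch.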
Is this the expected behavior? Is there a way to reduce the execution time?

I use PostgreSQL 8.4 with shared_buffers = 256 MB and work_mem = 256 MB.

Thanks for your help and advice.

--
Nicolas Grilly
Garden
+33 1 45 72 48 78 - office
+33 6 03 00 25 34 - mobile
www.gardentechno.com - Web development & data analytics
www.vocationcity.com - Web recruitment platform
I don't see your query uses index :)

On Tue, 12 Jul 2011, Nicolas Grilly wrote:
> Hello,
>
> I'm testing PostgreSQL full-text search on a table containing
> 1.000.000 documents. Document average length is 5.700 chars.
> [...]
> The ranking is very slow: 140 ranked documents / second on my
> machine!
>
> I'm afraid this is because ts_rank needs to read document_vector, and
> because that column is stored in TOAST table, it triggers a random
> access for each matching row. Am I correct? Is it the expected
> behavior? Is there a way to reduce the execution time?

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
On Tue, Jul 12, 2011 at 22:25, Oleg Bartunov <oleg@sai.msu.su> wrote:
> I don't see your query uses index :)

Yes, I know. :)

I ran VACUUM ANALYZE and re-ran the query, but the output of EXPLAIN ANALYZE stays exactly the same: no index used. Any idea why?

By the way, is ts_rank supposed to use a GIN index when one is available?
On Tue, 12 Jul 2011, Nicolas Grilly wrote:
> I ran VACUUM ANALYZE and re-ran the query, but the output of EXPLAIN
> ANALYZE stays exactly the same: no index used.
>
> Any idea why?

There is a problem with estimating the cost of scanning a GIN index in versions < 9.1, so you can set enable_seqscan=off, or try 9.1, which is in beta3 now.

> By the way, is ts_rank supposed to use a GIN index when one is available?

No, I see no benefit :)

Regards,
Oleg
Hi Oleg and all,

On Wed, Jul 13, 2011 at 08:16, Oleg Bartunov <oleg@sai.msu.su> wrote:
> there is problem with estimating of cost scanning gin index in < 9.1
> versions, so you can set enable_seqscan=off;
> or try 9.1 which beta3 now.

I re-ran my queries using enable_seqscan=off.

Now the first query, without ts_rank, uses the GIN index:

    set enable_seqscan=off;
    explain analyze select id
    from posts_1000000
    where to_tsquery('english', 'crare') @@ document_vector
    limit 50;

    Limit (cost=42290.12..42306.31 rows=50 width=4) (actual time=16.259..16.412 rows=50 loops=1)
      ->  Bitmap Heap Scan on posts_1000000 (cost=42290.12..57877.02 rows=48152 width=4) (actual time=16.256..16.344 rows=50 loops=1)
            Recheck Cond: ('''crare'''::tsquery @@ document_vector)
            ->  Bitmap Index Scan on index_posts_documents_1000000 (cost=0.00..42278.08 rows=48152 width=0) (actual time=13.265..13.265 rows=49951 loops=1)
                  Index Cond: ('''crare'''::tsquery @@ document_vector)
    Total runtime: 16.484 ms

But the second query, the one that uses ts_rank, is still very slow... Any idea why? Is ts_rank efficient enough to find the best 50 matches among 50,000 documents?
    set enable_seqscan=off;
    explain analyze select id
    from posts_1000000
    where to_tsquery('english', 'crare') @@ document_vector
    order by ts_rank_cd(document_vector, to_tsquery('english', 'crare'), 32) desc
    limit 50;

    Limit (cost=59596.98..59597.10 rows=50 width=22) (actual time=296212.052..296212.257 rows=50 loops=1)
      ->  Sort (cost=59596.98..59717.36 rows=48152 width=22) (actual time=296186.928..296187.007 rows=50 loops=1)
            Sort Key: (ts_rank_cd(document_vector, '''crare'''::tsquery, 32))
            Sort Method: top-N heapsort  Memory: 27kB
            ->  Bitmap Heap Scan on posts_1000000 (cost=42290.12..57997.40 rows=48152 width=22) (actual time=70.861..296059.515 rows=49951 loops=1)
                  Recheck Cond: ('''crare'''::tsquery @@ document_vector)
                  ->  Bitmap Index Scan on index_posts_documents_1000000 (cost=0.00..42278.08 rows=48152 width=0) (actual time=24.922..24.922 rows=49951 loops=1)
                        Index Cond: ('''crare'''::tsquery @@ document_vector)
    Total runtime: 296220.493 ms

>> By the way, is ts_rank supposed to use a GIN index when one is
>> available?
>
> no, I see no benefit :)

OK. But what is the solution to improve ts_rank execution time? Am I doing something wrong?

Thanks for your help,

Nicolas
On Wed, 13 Jul 2011, Nicolas Grilly wrote:
> But the second query, the one that uses ts_rank, is still very slow...
> Any idea why? Is ts_rank efficient enough to find the best 50 matches
> among 50,000 documents?

I didn't notice: reading 40K tuples in random order takes a long time, and this is a problem for any database. Can you measure the time to read all the documents found?

:( The only solution I see is to store enough information for ranking in the index.

Oleg
On Wed, Jul 13, 2011 at 17:36, Oleg Bartunov <oleg@sai.msu.su> wrote:
> I didn't notice: reading 40K tuples in random order takes a long time,
> and this is a problem for any database. Can you measure the time to
> read all the documents found?

As you asked, I measured the time required to read all documents.

For reference, after having dropped the operating system cache, my machine can read a 1 GB file in 20 seconds, that is 50 MB/second.

Here are the stats for table posts_1000000:

    Table size:       117 MB
    TOAST table size: 8356 MB
    Index size:       1720 MB

I forced PostgreSQL to read all documents using the following query, which doesn't involve ts_rank:

    explain analyze select sum(length(document_vector)) from posts_1000000;

    Aggregate (cost=27472.52..27472.53 rows=1 width=18) (actual time=346952.556..346952.557 rows=1 loops=1)
      ->  Seq Scan on posts_1000000 (cost=0.00..24975.01 rows=999001 width=18) (actual time=0.023..1793.523 rows=999001 loops=1)
    Total runtime: 346952.595 ms

Then I ran a similar query that involves ts_rank:

    explain analyze select sum(ts_rank_cd(document_vector, to_tsquery('english', 'crare'), 32)) from posts_1000000;

    Aggregate (cost=27472.52..27472.53 rows=1 width=18) (actual time=373713.957..373713.958 rows=1 loops=1)
      ->  Seq Scan on posts_1000000 (cost=0.00..24975.01 rows=999001 width=18) (actual time=20.045..1847.897 rows=999001 loops=1)
    Total runtime: 373714.031 ms

The first query ran in 347 seconds; the second one in 374 seconds. Conclusion: there is no significant overhead in the ts_rank function itself. It's slow because ts_rank has to read, in random order, 40,000 tsvectors stored in the TOAST table. The slow execution time looks like a direct consequence of storing tsvectors in the TOAST table...

> :( The only solution I see is to store enough information for ranking
> in the index.

Is this the expected behavior? How can I improve it?

Thanks,

Nicolas
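One workaround I'm considering, sketched below: rank only a capped subset of the matches, trading ranking quality for fewer random TOAST reads. The inner limit of 5000 is an arbitrary number chosen for illustration, not a recommendation:

```sql
-- Sketch: cap how many rows ts_rank_cd has to detoast.
-- The inner query returns the first 5000 matches in whatever order the
-- scan produces them, so the final top 50 is only the best of that
-- sample, not of all ~50,000 matches.
select id
from (
    select id, document_vector
    from posts_1000000
    where to_tsquery('english', 'crare') @@ document_vector
    limit 5000
) as candidates
order by ts_rank_cd(document_vector, to_tsquery('english', 'crare'), 32) desc
limit 50;
```

This only makes the problem roughly 10x smaller rather than fixing it, and the sample is not guaranteed to contain the globally best documents.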
Reopening a six-month-old discussion... Are there any plans to improve ts_rank performance by making it able to use GIN/GiST indexes?
From re-reading Oleg's answers, I understand that full-text search indexes do not contain enough information for ts_rank. Because of that, ts_rank has to read the tsvector column, which is stored in the TOAST table, and that triggers a random read for each ranked document.
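Until the index carries ranking information, the only ordering that avoids TOAST entirely seems to be one based on a small precomputed column in the heap row. A sketch, with a hypothetical static_rank column and a deliberately crude proxy score (this is not equivalent to ts_rank_cd, which is query-dependent):

```sql
-- Hypothetical workaround: a query-independent score stored inline,
-- so ORDER BY never touches the TOASTed tsvector.
alter table posts_1000000 add column static_rank real default 0;

-- Any cheap, precomputable proxy works here; length(tsvector)
-- (the number of distinct lexemes) is used purely as an example.
update posts_1000000 set static_rank = length(document_vector);

select id
from posts_1000000
where to_tsquery('english', 'crare') @@ document_vector
order by static_rank desc
limit 50;
```

The obvious drawback is that the ordering no longer reflects how well each document matches the particular query.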
Cheers,
Nicolas Grilly
On Wed, Jul 13, 2011 at 18:55, Nicolas Grilly <nicolas@gardentechno.com> wrote:
> The first query ran in 347 seconds; the second one in 374 seconds.
> Conclusion: there is no significant overhead in the ts_rank function
> itself. It's slow because ts_rank has to read in random order 40 000
> ts_vector stored in TOAST table. The slow execution time looks like
> a direct consequence of storing ts_vector in TOAST table...
>
>> :( The only solution I see is to store enough information for ranking in index.
>
> Is it the expected behavior? How can I improve that?