Re: text search: restricting the number of parsed words in headline generation
From | Bruce Momjian |
---|---|
Subject | Re: text search: restricting the number of parsed words in headline generation |
Date | |
Msg-id | 20130124205452.GB21914@momjian.us |
In reply to | Re: text search: restricting the number of parsed words in headline generation (Sushant Sinha <sushant354@gmail.com>) |
List | pgsql-hackers |
On Wed, Aug 15, 2012 at 11:09:18PM +0530, Sushant Sinha wrote:
> I will do the profiling and present the results.

Sushant, do you have any profiling results on this issue from August?

---------------------------------------------------------------------------

>
> On Wed, 2012-08-15 at 12:45 -0400, Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > Is this a TODO?
> >
> > AFAIR nothing's been done about the speed issue, so yes.  I didn't
> > like the idea of creating a user-visible knob when the speed issue
> > might be fixable with internal algorithm improvements, but we never
> > followed up on this in either fashion.
> >
> > regards, tom lane
> >
> > >
> > > ---------------------------------------------------------------------------
> > >
> > > On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
> > >> Sushant Sinha <sushant354@gmail.com> writes:
> > >>> Doesn't this force the headline to be taken from the first N words of
> > >>> the document, independent of where the match was?  That seems rather
> > >>> unworkable, or at least unhelpful.
> > >>
> > >>> In headline generation function, we don't have any index or knowledge of
> > >>> where the match is. We discover the matches by first tokenizing and then
> > >>> comparing the matches with the query tokens. So it is hard to do
> > >>> anything better than first N words.
> > >>
> > >> After looking at the code in wparser_def.c a bit more, I wonder whether
> > >> this patch is doing what you think it is.  Did you do any profiling to
> > >> confirm that tokenization is where the cost is?  Because it looks to me
> > >> like the match searching in hlCover() is at least O(N^2) in the number
> > >> of tokens in the document, which means it's probably the dominant cost
> > >> for any long document.  I suspect that your patch helps not so much
> > >> because it saves tokenization costs as because it bounds the amount of
> > >> effort spent in hlCover().
> > >>
> > >> I haven't tried to do anything about this, but I wonder whether it
> > >> wouldn't be possible to eliminate the quadratic blowup by saving more
> > >> state across the repeated calls to hlCover().  At the very least, it
> > >> shouldn't be necessary to find the last query-token occurrence in the
> > >> document from scratch on each and every call.
> > >>
> > >> Actually, this code seems probably flat-out wrong: won't every
> > >> successful call of hlCover() on a given document return exactly the same
> > >> q value (end position), namely the last token occurrence in the
> > >> document?  How is that helpful?
> > >>
> > >> regards, tom lane
> > >>
> > >> --
> > >> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> > >> To make changes to your subscription:
> > >> http://www.postgresql.org/mailpref/pgsql-hackers
> > >
> > > --
> > >   Bruce Momjian  <bruce@momjian.us>        http://momjian.us
> > >   EnterpriseDB                             http://enterprisedb.com
> > >
> > >   + It's impossible for everything to be true. +
> > >
> > >
> > > --
> > > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> > > To make changes to your subscription:
> > > http://www.postgresql.org/mailpref/pgsql-hackers
>
>

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
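
Illustrative note on the quadratic cost discussed in the quoted 2011 message: the following is a toy Python model, not the code in wparser_def.c. It assumes a simplified search that visits each query-token match in turn. The names `find_last_match`, `scan_naive` and `scan_cached` are hypothetical. The naive variant rescans the whole document for the last query-token occurrence on every step, which is the pattern the message objects to. The cached variant computes that position once and keeps it across steps.

```python
def find_last_match(tokens, query):
    """Return (index of the last token found in query, tokens examined)."""
    last = -1
    for i, tok in enumerate(tokens):
        if tok in query:
            last = i
    return last, len(tokens)


def scan_naive(tokens, query):
    """Visit every match, recomputing the last-match position each step."""
    work = 0          # rough count of tokens examined
    starts = []
    pos = 0
    while pos < len(tokens):
        last, cost = find_last_match(tokens, query)  # full rescan every time
        work += cost
        if last < pos:
            break     # no match at or after pos
        while tokens[pos] not in query:
            pos += 1
            work += 1
        starts.append(pos)
        pos += 1
    return starts, work


def scan_cached(tokens, query):
    """Same traversal, but the last-match position is computed only once."""
    last, work = find_last_match(tokens, query)
    starts = []
    pos = 0
    while pos <= last:
        while tokens[pos] not in query:
            pos += 1
            work += 1
        starts.append(pos)
        pos += 1
    return starts, work


if __name__ == "__main__":
    doc = ["a", "b"] * 500          # hypothetical document: 1000 tokens
    naive_starts, naive_work = scan_naive(doc, {"a"})
    cached_starts, cached_work = scan_cached(doc, {"a"})
    print(naive_starts == cached_starts, naive_work, cached_work)
```

Both variants report the same match positions. In this model the naive work count grows with document length times the number of matches, while the cached one stays linear. Whether the real hlCover() cost behaves this way would still need the profiling requested above.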