Re: Using FTI-Search (likely a more general

Поиск
Список
Период
Сортировка
От Oleg Bartunov
Тема Re: Using FTI-Search (likely a more general
Дата
Msg-id Pine.GSO.4.44.0207291848080.29303-100000@ra.sai.msu.su
обсуждение исходный текст
Ответ на Re: Using FTI-Search (likely a more general runtime-puzzle)  ("Markus Wollny" <Markus.Wollny@computec.de>)
Список pgsql-general
Markus,

version of PostgreSQL ? Also, could you try just

select txt2txtidx(text); where 'text' is a sample entry from your table.

    Oleg

On Mon, 29 Jul 2002, Markus Wollny wrote:

> Hi!
>
> I am about to evaluate the tsearch module from contrib, but as yet I
> haven't succeeded in getting it to run.
>
> I followed instructions on installation of tsearch; the new datatype and
> the corresponding functions seem all to be there. So I want to try it
> out on some existing data. In table article I want to set up a full-text
> index on column text. After adding the column textindex of type txtidx,
> I try to fill that column with data, but this just crashes the backend
> immediately:
>
> pcgames=# update article set textindex=txt2txtidx(text);
> pqReadData() -- backend closed the channel unexpectedly.
>         This probably means the backend terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>
> The logfile doesn't really give much information on what exactly
> happened:
>
> 2002-07-29 13:37:20 [22127]  DEBUG:  database system is ready
> 2002-07-29 13:37:46 [22124]  DEBUG:  server process (pid 22137) was
> terminated by signal 11
> 2002-07-29 13:37:46 [22124]  DEBUG:  terminating any other active server
> processes
> 2002-07-29 13:37:46 [22124]  DEBUG:  all server processes terminated;
> reinitializing shared memory and semaphores
> 2002-07-29 13:37:46 [22139]  FATAL 1:  The database system is starting
> up
> 2002-07-29 13:37:46 [22138]  DEBUG:  database system was interrupted at
> 2002-07-29 13:37:20 CEST
> 2002-07-29 13:37:46 [22138]  DEBUG:  checkpoint record is at E/3B785168
> 2002-07-29 13:37:46 [22138]  DEBUG:  redo record is at E/3B785168; undo
> record is at 0/0; shutdown TRUE
> 2002-07-29 13:37:46 [22138]  DEBUG:  next transaction id: 179915804;
> next oid: 95552489
> 2002-07-29 13:37:46 [22138]  DEBUG:  database system was not properly
> shut down; automatic recovery in progress
> 2002-07-29 13:37:46 [22138]  DEBUG:  ReadRecord: record with zero length
> at E/3B7851A8
> 2002-07-29 13:37:46 [22138]  DEBUG:  redo is not required
> 2002-07-29 13:37:48 [22138]  DEBUG:  database system is ready
>
> Any ideas?
>
> Regards,
>
>     Markus
>
> > -----Ursprungliche Nachricht-----
> > Von: Oleg Bartunov [mailto:oleg@sai.msu.su]
> > Gesendet: Mittwoch, 24. Juli 2002 21:20
> > An: Markus Wollny
> > Betreff: Re: AW: [GENERAL] Using FTI-Search (likely a more
> > general runtime-puzzle)
> >
> >
> > On Wed, 24 Jul 2002, Markus Wollny wrote:
> >
> > > Hi!
> > >
> > > No I haven't as yet - does it provide other/better/more
> > features than
> > > the fti-module?
> > >
> >
> > Sure, it's designed mostly for full text search and comes from
> > Information Retrieval. There is a README file with some intro.
> >
> > > Regards,
> > >
> > >    Markus
> > >
> > >     -----UrsprУМngliche Nachricht-----
> > >     Von: Oleg Bartunov
> > >     Gesendet: Mi 24.07.2002 19:37
> > >     An: Markus Wollny
> > >     Cc: pgsql-general@postgresql.org
> > >     Betreff: Re: [GENERAL] Using FTI-Search (likely a more general
> > > runtime-puzzle)
> > >
> > >
> > >
> > >     Markus,
> > >
> > >     have you tried our contrib/tsearch module ?
> > >
> > >             Oleg
> > >     On Wed, 24 Jul 2002, Markus Wollny wrote:
> > >
> > >     > Hello!
> > >     >
> > >     > I have implemented the full text index feature from /contrib
> > > and so far
> > >     > the results look quite promising. We use it to index several
> > > columns of
> > >     > an article table. The main column is called 'text'. An example
> > > for a
> > >     > search would be to retrieve all the articles with "grand theft
> > > auto" in
> > >     > them somewhere (not necessarily one string but individual
> > > words).
> > >     >
> > >     >
> > >
> > --------------------------------------------------------------
> > ----------
> > >     > --------------------
> > >     > Case 1: Our Good Old Fashioned Search (TM) over the
> > > article-table:
> > >     >
> > >     > ANALYZE select distinct (p1.article_id)
> > >     > from article p1, article p2, article p3
> > >     > where lower(p1.text) LIKE '%grand%'
> > >     > AND lower(p2.text) LIKE '%theft%'
> > >     > AND lower(p3.text) LIKE '%auto%'
> > >     > AND p1.article_id=p2.article_id
> > >     > and p1.article_id=p3.article_id
> > >     >
> > >     > Unique  (cost=3067.03..3067.04 rows=1 width=12) (actual
> > >     > time=1455.25..1455.44 rows=30 loops=1)
> > >     >   ->  Sort  (cost=3067.03..3067.03 rows=1 width=12) (actual
> > >     > time=1455.24..1455.29 rows=30 loops=1)
> > >     >         ->  Nested Loop  (cost=1098.52..3067.02 rows=1
> > > width=12) (actual
> > >     > time=1211.54..1453.82 rows=30 loops=1)
> > >     >               ->  Hash Join  (cost=1098.52..2197.47 rows=6
> > > width=8)
> > >     > (actual time=1121.32..1309.23 rows=30 loops=1)
> > >     >                     ->  Seq Scan on article p2
> > > (cost=0.00..1098.43
> > >     > rows=35 width=4) (actual time=352.03..539.28 rows=30 loops=1)
> > >     >                     ->  Hash  (cost=1098.43..1098.43 rows=35
> > > width=4)
> > >     > (actual time=766.21..766.21 rows=0 loops=1)
> > >     >                           ->  Seq Scan on article p3
> > >     > (cost=0.00..1098.43 rows=35 width=4) (actual
> > > time=13.63..763.73 rows=411
> > >     > loops=1)
> > >     >               ->  Index Scan using idx0_article on article p1
> > >     > (cost=0.00..140.79 rows=1 width=4) (actual time=4.79..4.80
> > > rows=1
> > >     > loops=30)
> > >     > Total runtime: 1456.00 msec
> > >     >
> > >     > This is the time, fti needs to beat in order to be of any use
> > > to us.
> > >     >
> > >     >
> > >
> > --------------------------------------------------------------
> > ----------
> > >     > --------------------
> > >     >
> > >     > Case 2:
> > >     > Searching the FTI-Table:
> > >     > EXPLAIN ANALYZE select distinct (p.article_id)
> > >     > from article p, article_fti f1, article_fti f2, article_fti f3
> > >     > where f1.string ='grand' and f2.string ='theft' and f3.string
> > > ='auto'
> > >     > and p.oid=f1.id and p.oid=f2.id and p.oid=f3.id;
> > >     >
> > >     >
> > >     > Unique  (cost=39200652.79..39647427.57 rows=17870991 width=20)
> > > (actual
> > >     > time=52506.81..52506.81 rows=0 loops=1)
> > >     >   ->  Sort  (cost=39200652.79..39200652.79 rows=178709913
> > > width=20)
> > >     > (actual time=52506.81..52506.81 rows=0 loops=1)
> > >     >         ->  Merge Join  (cost=222181.40..313718.83
> > > rows=178709913
> > >     > width=20) (actual time=52505.69..52505.69 rows=0 loops=1)
> > >     >               ->  Merge Join  (cost=147175.62..170464.70
> > > rows=1855421
> > >     > width=8) (actual time=37084.36..37084.36 rows=1 loops=1)
> > >     >                     ->  Sort  (cost=73587.81..73587.81
> > > rows=19264
> > >     > width=4) (actual time=22423.27..22423.27 rows=1 loops=1)
> > >     >                           ->  Seq Scan on article_fti f2
> > >     > (cost=0.00..72216.86 rows=19264 width=4) (actual
> > > time=18297.22..22422.20
> > >     > rows=97 loops=1)
> > >     >                     ->  Sort  (cost=73587.81..73587.81
> > > rows=19264
> > >     > width=4) (actual time=14661.04..14661.04 rows=1 loops=1)
> > >     >                           ->  Seq Scan on article_fti f3
> > >     > (cost=0.00..72216.86 rows=19264 width=4) (actual
> > > time=1344.87..14659.59
> > >     > rows=159 loops=1)
> > >     >               ->  Sort  (cost=75005.78..75005.78 rows=19264
> > > width=12)
> > >     > (actual time=15421.32..15421.32 rows=0 loops=1)
> > >     >                     ->  Hash Join  (cost=1080.86..73634.84
> > > rows=19264
> > >     > width=12) (actual time=15421.10..15421.10 rows=0 loops=1)
> > >     >                           ->  Seq Scan on article_fti f1
> > >     > (cost=0.00..72216.86 rows=19264 width=4) (actual
> > > time=4879.59..14306.97
> > >     > rows=350 loops=1)
> > >     >                           ->  Hash  (cost=1063.29..1063.29
> > > rows=7029
> > >     > width=8) (actual time=1109.79..1109.79 rows=0 loops=1)
> > >     >                                 ->  Seq Scan on article p
> > >     > (cost=0.00..1063.29 rows=7029 width=8) (actual
> > > time=12.20..1076.46
> > >     > rows=7029 loops=1)
> > >     > Total runtime: 52507.92 msec
> > >     >
> > >     > This is way too long. I don't know why it doesn't use the
> > > indexes I gave
> > >     > to it... If someone has got some idea as to what can be done
> > > to optimize
> > >     > query-planner decisions, I'd be happy to comply. But anyway, I
> > > don't
> > >     > worry about this too much and just force it to use them, once
> > > I know its
> > >     > decision for seqential scan is wrong:
> > >     >
> > >     >
> > >
> > --------------------------------------------------------------
> > ----------
> > >     > --------------------
> > >     >
> > >     > Case 3:
> > >     > Searching the FTI-Table forcing use of indexes:
> > >     > SET ENABLE_SEQSCAN=OFF;
> > >     > EXPLAIN ANALYZE select distinct (p.article_id)
> > >     > from article p, article_fti f1, article_fti f2, article_fti f3
> > >     > where f1.string ='grand' and f2.string ='theft' and f3.string
> > > ='auto'
> > >     > and p.oid=f1.id and p.oid=f2.id and p.oid=f3.id;
> > >     > SET ENABLE_SEQSCAN=ON;
> > >     >
> > >     > NOTICE:  QUERY PLAN:
> > >     >
> > >     > Unique  (cost=5341.68..5342.61 rows=37 width=20) (actual
> > >     > time=40.52..41.83 rows=65 loops=1)
> > >     >   ->  Sort  (cost=5341.68..5341.68 rows=372 width=20) (actual
> > >     > time=40.52..40.96rows=376 loops=1)
> > >     >         ->  Nested Loop  (cost=3067.05..5325.79 rows=372
> > > width=20)
> > >     > (actual time=14.02..38.38 rows=376 loops=1)
> > >     >               ->  Merge Join  (cost=3067.05..3084.32 rows=372
> > > width=12)
> > >     > (actualtime=13.95..20.45 rows=376 loops=1)
> > >     >                     ->  Merge Join  (cost=2044.70..2054.37
> > > rows=502
> > >     > width=8) (actual time=5.75..7.81 rows=176 loops=1)
> > >     >                           ->  Sort  (cost=1022.35..1022.35
> > > rows=678
> > >     > width=4) (actual time=2.31..2.44 rows=97 loops=1)
> > >     >                                 ->  Index Scan using
> > >     > article_fti_string_idx on article_fti f2  (cost=0.00..990.47
> > > rows=678
> > >     > width=4) (actual time=0.16..1.74 rows=97 loops=1)
> > >     >                           ->  Sort  (cost=1022.35..1022.35
> > > rows=678
> > >     > width=4) (actual time=3.42..3.66 rows=194 loops=1)
> > >     >                                 ->  Index Scan using
> > >     > article_fti_string_idx on article_fti f3  (cost=0.00..990.47
> > > rows=678
> > >     > width=4) (actual time=0.13..2.53 rows=159 loops=1)
> > >     >                     ->  Sort  (cost=1022.35..1022.35 rows=678
> > > width=4)
> > >     > (actual time=8.18..8.89 rows=532 loops=1)
> > >     >                           ->  Index Scan using
> > > article_fti_string_idx on
> > >     > article_fti f1  (cost=0.00..990.47 rows=678 width=4) (actual
> > >     > time=0.13..5.98 rows=350 loops=1)
> > >     >               ->  Index Scan using article_oid_idx on article
> > > p
> > >     > (cost=0.00..6.01 rows=1 width=8) (actual time=0.03..0.03
> > > rows=1
> > >     > loops=376)
> > >     > Total runtime: 42.41 msec
> > >     >
> > >     > Not that's much more like it :) This query returns exactly 65
> > > numbers.
> > >     > But really we want not only the article-IDs, but also some
> > > info on the
> > >     > article as well:
> > >     >
> > >     >
> > >
> > --------------------------------------------------------------
> > ----------
> > >     > --------------------
> > >     >
> > >     > Case 3b:
> > >     > Retrieving additional article-info:
> > >     >
> > >     > SET ENABLE_SEQSCAN=OFF;
> > >     > EXPLAIN ANALYZE
> > >     > SELECT article_id, site_id, article_type, topstory, headline,
> > > published
> > >     > from article where article_id in (
> > >     > select distinct (p.article_id)
> > >     > from article p, article_fti f1, article_fti f2, article_fti f3
> > >     > where f1.string ='grand' and f2.string ='theft' and f3.string
> > > ='auto'
> > >     > and p.oid=f1.id and p.oid=f2.id and p.oid=f3.id);
> > >     > SET ENABLE_SEQSCAN=ON;
> > >     >
> > >     > NOTICE:  QUERY PLAN:
> > >     >
> > >     > Seq Scan on article  (cost=100000000.00..137645255.53
> > > rows=3523
> > >     > width=77) (actual time=1756.30..3720.44 rows=65 loops=1)
> > >     >   SubPlan
> > >     >     ->  Materialize  (cost=5342.61..5342.61 rows=37 width=20)
> > > (actual
> > >     > time=0.17..0.28 rows=65 loops=7046)
> > >     >           ->  Unique  (cost=5341.68..5342.61 rows=37 width=20)
> > > (actual
> > >     > time=1182.07..1183.29 rows=65 loops=1)
> > >     >                 ->  Sort  (cost=5341.68..5341.68 rows=372
> > > width=20)
> > >     > (actual time=1182.06..1182.52 rows=376 loops=1)
> > >     >                       ->  Nested Loop  (cost=3067.05..5325.79
> > > rows=372
> > >     > width=20) (actual time=318.55..1179.04 rows=376 loops=1)
> > >     >                             ->  Merge Join
> > > (cost=3067.05..3084.32
> > >     > rows=372 width=12) (actual time=300.97..314.39 rows=376
> > > loops=1)
> > >     >                                   ->  Merge Join
> > > (cost=2044.70..2054.37
> > >     > rows=502 width=8) (actual time=159.24..163.45 rows=176
> > > loops=1)
> > >     >                                         ->  Sort
> > > (cost=1022.35..1022.35
> > >     > rows=678 width=4) (actual time=71.23..71.45 rows=97 loops=1)
> > >     >                                               ->  Index Scan
> > > using
> > >     > article_fti_string_idx on article_fti f2  (cost=0.00..990.47
> > > rows=678
> > >     > width=4) (actual time=1.46..69.08 rows=97 loops=1)
> > >     >                                         ->  Sort
> > > (cost=1022.35..1022.35
> > >     > rows=678 width=4) (actual time=87.97..88.39 rows=194 loops=1)
> > >     >                                               ->  Index Scan
> > > using
> > >     > article_fti_string_idx on article_fti f3  (cost=0.00..990.47
> > > rows=678
> > >     > width=4) (actual time=0.19..85.06 rows=159 loops=1)
> > >     >                                   ->  Sort
> > > (cost=1022.35..1022.35
> > >     > rows=678 width=4) (actual time=141.70..143.03 rows=532
> > > loops=1)
> > >     >                                         ->  Index Scan using
> > >     > article_fti_string_idx on article_fti f1  (cost=0.00..990.47
> > > rows=678
> > >     > width=4) (actual time=0.17..138.95 rows=350 loops=1)
> > >     >                             ->  Index Scan using
> > > article_oid_idx on
> > >     > article p  (cost=0.00..6.01 rows=1 width=8) (actual
> > > time=1.65..2.27
> > >     > rows=1 loops=376)
> > >     > Total runtime: 3722.74 msec
> > >     >
> > >     > Whoa! Why is that? How come this takes so extremely much
> > > longer?
> > >     > Because...
> > >     >
> > >     >
> > >
> > --------------------------------------------------------------
> > ----------
> > >     > --------------------
> > >     >
> > >     > Case 3c:
> > >     > If I just select all the ids, I get a total runtime of  42.41
> > > msec. And
> > >     > if I do a select ... where article_id in (all the results from
> > > case3):
> > >     >
> > >     > explain analyze SELECT article_id, site_id, article_type,
> > > topstory,
> > >     > headline, published from article where article_id in
> > >     >
> > >
> > (8965,8966,9701,10204,11114,11788,11941,12339,12685,12696,1278
> > 4,13324,14
> > >     >
> > >
> > 049,30982,31697,31924,31941,31944,32470,36810,37950,38913,3973
> > 6,40470,41
> > >     >
> > >
> > 330,45651,47050,47997,49692,51253,51730,51817,52111,52233,5265
> > 3,52770,53
> > >     >
> > >
> > 090,54410,54812,54890,55310,55974,56051,56590,56610,56630,5695
> > 0,57210,57
> > >     >
> > >
> > 690,59670,60951,62630,62650,63070,63130,63551,63570,65691,6571
> > 0,65771,66
> > >     > 050,66490,66531,67030,67110);
> > >     >
> > >     > NOTICE:  QUERY PLAN:
> > >     >
> > >     > Index Scan using idx0_article, idx0_article, idx0_article,
> > > idx0_article,
> > >     > idx0_article, idx0_article, idx0_article, idx0_article,
> > > idx0_article,
> > >     > idx0_article, idx0_article, idx0_article, idx0_article,
> > > idx0_article,
> > >     > idx0_article, idx0article, idx0_article, idx0_article,
> > > idx0_article,
> > >     > idx0_article, idx0_article, idx0_article, idx0_article,
> > > idx0_article,
> > >     > idx0_article, idx0_article, idx0_article, idx0_article,
> > > idx0_article,
> > >     > idx0_article, idx0_article, idx0_article, idx0_article,
> > > idx0_article,
> > >     > idx0_article, idx0_article, idx0_article, idx0_article,
> > > idx0_article,
> > >     > idx0_article, idx0_article, idx0_article, idx0_article,
> > > idx0_article,
> > >     > idx0_article, idx0_article, idx0_article, idx0_article,
> > > idx0_article,
> > >     > idx0_article, idx0_article, idx0_article, idx0_article,
> > > idx0_article,
> > >     > idx0_article, idx0_article, idx0_article, idx0_article,
> > > idx0_article,
> > >     > idx0_article, idx0_article, idx0_article, idx0_article,
> > > idx0_article,
> > >     > idx0_article on article  (cost=0.00..378.61 rows=67 width=77)
> > > (actual
> > >     > time=0.08..5.17 rows=65 loops=1)
> > >     > Total runtime: 6.15 msec
> > >     >
> > >     > I get just another 6.15msec.
> > >     >
> > >     > Now I do think that the total runtime of 3b should really be
> > > 3+3c, which
> > >     > amounts to roughly 50msec, but not nearly 4 seconds. What am I
> > > doing
> > >     > wrong?
> > >     >
> > >     > Thanks for your advice!
> > >     >
> > >     > Regards,
> > >     >
> > >     >       Markus
> > >     >
> > >     > ---------------------------(end of
> > > broadcast)---------------------------
> > >     > TIP 1: subscribe and unsubscribe commands go to
> > > majordomo@postgresql.org
> > >     >
> > >
> > >             Regards,
> > >                     Oleg
> > >     _____________________________________________________________
> > >     Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> > >     Sternberg Astronomical Institute, Moscow University (Russia)
> > >     Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> > >     phone: +007(095)939-16-83, +007(095)939-23-83
> > >
> > >
> > >
> >
> >     Regards,
> >         Oleg
> > _____________________________________________________________
> > Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> > Sternberg Astronomical Institute, Moscow University (Russia)
> > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> > phone: +007(095)939-16-83, +007(095)939-23-83
> >
> >
>

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


В списке pgsql-general по дате отправления:

Предыдущее
От: Andrew Sullivan
Дата:
Сообщение: Re: query slow problem
Следующее
От: Oleg Bartunov
Дата:
Сообщение: Re: tsearch - Regression tests fail