Обсуждение: Full text indexing (and errors!)

Поиск
Список
Период
Сортировка

Full text indexing (and errors!)

От
"Mitch Vincent"
Дата:
I posted this to -hackers but thought it might be of some use to people here
as well. I look forward to any and all comments.. Thanks!

----------------------------------

I finally finished the fulltextindex ( in contrib/fulltextindex) and am
wondering if I might be able to do something that I haven't already to make
these queries a bit faster.. There is a lot of data here, I know, but as it
is right now it is faster for me to use grep (or agrep) to search the files
and put the matching files into a temporay table and qualify results on
that... I had hoped this would be the case :-)

Lets see.. (Most of this is in contrib/fulltextindex but I'll repeat it
here)

I have 2 tables.
   Table "applicants_resumes" Attribute  |  Type   | Modifier
-------------+---------+----------app_id      | integer |resume_text | text    |
Index: resumes_oid_index

And
   Table "resumes_fti"Attribute |     Type     | Modifier
-----------+--------------+----------string    | varchar(255) |id        | oid          |
Index: resume_fti_index

The resumes_fti table holds the broken up chunks of text (duh) ..

I also did the  CLUSTER resumes_fti_index on resumes_fti

When I run a query I get :


ipa=# explain select c.* from applicants c, resumes_fti f1 where f1.string ~
'^engineer' and f1.id = c.oid;
NOTICE:  QUERY PLAN:

Hash Join  (cost=918.11..7969.43 rows=56443 width=607) ->  Seq Scan on applicants c  (cost=0.00..1331.56 rows=9856
width=603)->  Hash  (cost=5.00..5.00 rows=168041 width=4)       ->  Index Scan using resume_fti_index on resumes_fti
f1
(cost=0.00..5.00 rows=168041 width=4)

EXPLAIN
ipa=#


query: select c.* from applicants c, resumes_fti f1 where f1.string ~
'^engineer' and f1.id = c.oid;
ProcessQuery
! system usage stats:
!       304.304711 elapsed 9.111341 user 7.992034 system sec
!       [9.172874 user 8.012545 sys total]
!       53309/1599 [53411/1599] filesystem blocks in/out
!       0/295 [0/605] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [4/5] messages rcvd/sent
!       53168/226 [53255/232] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks:      58171 read,          0 written, buffer hit rate
= 14.19%
!       Local  blocks:          0 read,          0 written, buffer hit rate
= 0.00%
!       Direct blocks:          0 read,          0 written
CommitTransactionCommand
proc_exit(0)
shmem_exit(0)
exit(0)


... As you can see, really really slow. So I thought baout creating an index
on 'string' in resumes_fti (makes since as this is suppose to be a full text
index) -- Note that's not listed in the README which was very odd to me. I
don't understand the point in breaking all the text out into chunks if you
don't then index it. The way it is after you get done with the README is
just a broken up version of what you already had (a bunch of text fields).
It was my understanding that the fulltextindex was meant to get around the
inability to index a full text field by breaking the full text field up into
smaller varchar() fields which could them be indexed...

Anyway, I tried --

ipa=# CREATE INDEX "rstring_fti_index" on resumes_fti(lower(string));
FATAL 1:  Memory exhausted in AllocSetAlloc()
pqReadData() -- backend closed the channel unexpectedly.       This probably means the backend terminated abnormally
  before or while processing the request.
 
The connection to the server was lost. Attempting reset: Succeeded.
ipa=#

... and as you can see it messed up pretty badly after about 15 minutes.
This machine is a PII400 with 256 Megs of RAM and 520 megs of swap running
PostgreSQL 7.0..

Sorry about the length of this post and thanks for any
pointers/comments/ideas..

-Mitch Vincent











Re: Full text indexing (and errors!)

От
Tom Lane
Дата:
"Mitch Vincent" <mitch@venux.net> writes:
> I have 2 tables.

>     Table "applicants_resumes"
>   Attribute  |  Type   | Modifier
> -------------+---------+----------
>  app_id      | integer |
>  resume_text | text    |
> Index: resumes_oid_index

Uh, the query you show below is on "applicants" not on this table ...
is there an index on applicants' OID column?

> ... As you can see, really really slow. So I thought baout creating an index
> on 'string' in resumes_fti (makes since as this is suppose to be a full text
> index) -- Note that's not listed in the README which was very odd to
> me.

Isn't that what resume_fti_index is?

> ipa=# CREATE INDEX "rstring_fti_index" on resumes_fti(lower(string));
> FATAL 1:  Memory exhausted in AllocSetAlloc()

lower(text) leaks memory, so I guess this would happen with a large
enough table :-(.  There are plans afoot to improve matters in 7.1 ...

However, since fti.c lowercases everything it puts into the fti table,
I don't see a need for doing another lower() operation in the index
definition.

As far as I can tell, you're already OK on the string search, since
you are getting an indexscan on resumes_fti.  The hash join might
not be such a bright idea though.  I suspect the reason for that
choice is the large estimate for the number of rows matched by the
f1.string ~ '^engineer' condition (168041 which seems like a lot).
How big are these tables really?  Have you done a 'vacuum analyze'
on them?
        regards, tom lane


Re: Full text indexing (and errors!)

От
"Mitch Vincent"
Дата:
> Uh, the query you show below is on "applicants" not on this table ...
> is there an index on applicants' OID column?

No, there is not an index on the applicant OID column.. Since I'm getting
all the records from the applicants table where the string I search for is
in the resumes_fti table, I didn't think and index like that would help
(since I'm qualifying the results based on rows in another table). Am I
wrong in thinking that?

> Isn't that what resume_fti_index is?

Indeed, I'm a complete dork. Disregard the error below, you explained it
nicely :-)

> As far as I can tell, you're already OK on the string search, since
> you are getting an indexscan on resumes_fti.  The hash join might
> not be such a bright idea though.  I suspect the reason for that
> choice is the large estimate for the number of rows matched by the
> f1.string ~ '^engineer' condition (168041 which seems like a lot).
> How big are these tables really?  Have you done a 'vacuum analyze'
> on them?

I have done a vacuum analyze.

select count(app_id) from applicants_resumes;
count
-------14673
(1 row)


select count(id) from resumes_fti;
 count
----------33462249
(1 row)

In a word. Huge :-)

-Mitch








Re: Full text indexing (and errors!)

От
Tom Lane
Дата:
"Mitch Vincent" <mitch@venux.net> writes:
>> is there an index on applicants' OID column?

> No, there is not an index on the applicant OID column.. Since I'm getting
> all the records from the applicants table where the string I search for is
> in the resumes_fti table, I didn't think and index like that would help
> (since I'm qualifying the results based on rows in another table). Am I
> wrong in thinking that?

If the pattern match is reasonably selective then I'd think that the
best plan would probably be an indexscan on resumes_fti (using the
pattern operator to select rows) and then a nestloop join against the
applicant table using an inner indexscan on OID.  In English: look up
the entries in resumes_fti that match the pattern, and then use the OIDs
to look up the applicants entries ;-).  But it doesn't work without the
index on OID.

> select count(app_id) from applicants_resumes;
>  14673
> select count(id) from resumes_fti;
>  33462249

Hmm.  So the selectivity being estimated for the pattern match is
168041/33462249 or about 0.005 ... which is not huge but we'd
probably like it to be smaller.  What do you get from the standard
statistical query:

select attname,attdisbursion,s.*
from pg_statistic s, pg_attribute a, pg_class c
where starelid = c.oid and attrelid = c.oid and staattnum = attnum
and relname = 'resumes_fti';
        regards, tom lane


Re: Full text indexing (and errors!)

От
"Mitch Vincent"
Дата:
 attname | attdisbursion | starelid | staattnum | staop | stanullfrac |
stacommonfrac | stacommonval | staloval | stahival
---------+---------------+----------+-----------+-------+-------------+-----
----------+--------------+----------+----------string  |    0.00208943 |  1161760 |         1 |  1066 |           0 |
0.0100436 | on           | 00       | zzzid      |   3.40795e-05 |  1161760 |         2 |   609 |           0 |
0.000170281 | 9807369      | 7647538  | 41122350
(2 rows)

There ya go!


----- Original Message -----
From: Tom Lane <tgl@sss.pgh.pa.us>
To: Mitch Vincent <mitch@venux.net>
Cc: <pgsql-sql@postgresql.org>
Sent: Sunday, May 21, 2000 2:28 PM
Subject: Re: [SQL] Full text indexing (and errors!)


> "Mitch Vincent" <mitch@venux.net> writes:
> >> is there an index on applicants' OID column?
>
> > No, there is not an index on the applicant OID column.. Since I'm
getting
> > all the records from the applicants table where the string I search for
is
> > in the resumes_fti table, I didn't think and index like that would help
> > (since I'm qualifying the results based on rows in another table). Am I
> > wrong in thinking that?
>
> If the pattern match is reasonably selective then I'd think that the
> best plan would probably be an indexscan on resumes_fti (using the
> pattern operator to select rows) and then a nestloop join against the
> applicant table using an inner indexscan on OID.  In English: look up
> the entries in resumes_fti that match the pattern, and then use the OIDs
> to look up the applicants entries ;-).  But it doesn't work without the
> index on OID.
>
> > select count(app_id) from applicants_resumes;
> >  14673
> > select count(id) from resumes_fti;
> >  33462249
>
> Hmm.  So the selectivity being estimated for the pattern match is
> 168041/33462249 or about 0.005 ... which is not huge but we'd
> probably like it to be smaller.  What do you get from the standard
> statistical query:
>
> select attname,attdisbursion,s.*
> from pg_statistic s, pg_attribute a, pg_class c
> where starelid = c.oid and attrelid = c.oid and staattnum = attnum
> and relname = 'resumes_fti';
>
> regards, tom lane
>



Re: Full text indexing (and errors!)

От
Tom Lane
Дата:
"Mitch Vincent" <mitch@venux.net> writes:
>  attname | attdisbursion | starelid | staattnum | staop | stanullfrac |
> stacommonfrac | stacommonval | staloval | stahival
> ---------+---------------+----------+-----------+-------+-------------+-----
> ----------+--------------+----------+----------
>  string  |    0.00208943 |  1161760 |         1 |  1066 |           0 |
> 0.0100436 | on           | 00       | zzz

Hmm, so the most common word is "on" accounting for about 1% of the
entries.  Although I don't think that stacommonfrac directly affects
this particular query plan, it'd still be a good idea to try to push it
down.  fti.c has a provision for ignoring "stop words", but its stopword
list seems to be empty by default.  You might want to throw in "the" and
"on" and any other noisewords you're unlikely to want to search for.
That should help reduce the size of the fti table, too...

Actually ... waitasec.  stacommonfrac *does* affect this query plan in
7.0 release.  Before you do anything else, try enabling the new LIKE
estimator code (see contrib/likeplanning/ for details) and see what
sort of plan you get then.  The estimated selectivity should go *way*
down, and that ought to change the plan.

You'd still be well advised to get rid of as many stop words as you can.
        regards, tom lane