Обсуждение: Indexes not used in 7.1RC4: Bug?

Поиск

Список

Период

Сортировка

Indexes not used in 7.1RC4: Bug?

От

Alvar Freude

Дата:

10 апреля 2001 г., 09:41:17

Hi,

I have the following table, containing about 570000 Rows, but some
indexes are not used, on 7.1RC4, freshly vacuumed (analyse). It was the
same at least in 7.1RC1


 --- SNIP ---

      CREATE TABLE access_log(
         site_id     int2        NOT NULL DEFAULT 0,
         access_time timestamp   NOT NULL DEFAULT NOW(),
         time_taken  interval    NOT NULL,
         remote_ip   inet        NOT NULL,
         method_num  int2        NOT NULL,
         url_id      int4        NOT NULL REFERENCES urls(id),
         referer_id  int4                 REFERENCES referer(id),
         browser_id  int4                 REFERENCES browser(id),
         status      int2        NOT NULL DEFAULT 0,
         bytes       int4        NOT NULL DEFAULT 0,
         content_id  int2        NOT NULL REFERENCES content_types(id),

         https_flag  boolean     NOT NULL DEFAULT 'f',

         session_id  char(32),
         user_id     int4                 REFERENCES users(id),
         uname       varchar(255),

         note_id     int4

         );

      CREATE INDEX site_idx         ON access_log(site_id);
      CREATE INDEX access_time_idx  ON access_log(access_time);
      CREATE INDEX time_taken_idx   ON access_log(time_taken);
      CREATE INDEX remote_ip_idx    ON access_log(remote_ip);
      CREATE INDEX method_idx       ON access_log(method_num);
      CREATE INDEX url_idx          ON access_log(url_id);
      CREATE INDEX referer_idx      ON access_log(referer_id);
      CREATE INDEX browser_idx      ON access_log(browser_id);
      CREATE INDEX status_idx       ON access_log(status);
      CREATE INDEX bytes_idx        ON access_log(bytes);
      CREATE INDEX content_idx      ON access_log(content_id);
      CREATE INDEX https_idx        ON access_log(https_flag);
      CREATE INDEX session_idx      ON access_log(session_id);
      CREATE INDEX user_id_idx      ON access_log(user_id);
      CREATE INDEX user_idx         ON access_log(uname);
      CREATE INDEX note_idx         ON access_log(note_id);


 --- SNAP ---


url_idx seems OK:

  logger=# EXPLAIN SELECT * FROM access_log WHERE url_id = 1000;
  Index Scan using url_idx on access_log
     (cost=0.00..3618.92 rows=1002 width=89)



But the others not:

  logger=# EXPLAIN SELECT * FROM access_log WHERE method_num = 0;
  Seq Scan on access_log  (cost=0.00..16443.71 rows=559371 width=89)

  logger=# EXPLAIN SELECT * FROM access_log WHERE browser_id = 500;
  Seq Scan on access_log  (cost=0.00..16443.71 rows=7935 width=89)

  logger=# EXPLAIN SELECT * FROM access_log WHERE content_id = 20;
  Seq Scan on access_log  (cost=0.00..16443.71 rows=20579 width=89)

....


And very strange:

  logger=# EXPLAIN SELECT * FROM access_log WHERE access_time >
           '2001-04-10 10:10:10';
  Index Scan using access_time_idx on access_log
           (cost=0.00..10605.12 rows=3251 width=89)

  logger=# EXPLAIN SELECT * FROM access_log WHERE access_time >
           '2001-04-08 10:10:10';
  Seq Scan on access_log  (cost=0.00..16443.71 rows=152292 width=89)



Indexes are also not used in Subselects:


  logger=# EXPLAIN SELECT * FROM access_log WHERE url_id in (1);
  Index Scan using url_idx on access_log
           (cost=0.00..3618.92 rows=1002 width=89)

  logger=# EXPLAIN SELECT * FROM access_log WHERE url_id in (1,2,3);
  Index Scan using url_idx, url_idx, url_idx on access_log
           (cost=0.00..10871.79 rows=3000 width=89)


But:

  logger=# EXPLAIN SELECT * FROM access_log
                           WHERE url_id IN (SELECT 1);
  Seq Scan on access_log  (cost=0.00..16443.71 rows=572537 width=89)
  SubPlan
    ->  Materialize  (cost=0.00..0.00 rows=0 width=0)
          ->  Result  (cost=0.00..0.00 rows=0 width=0)



Indexes are also not used for remote_ip, ORDER BY access_time
(timestamp), ORDER BY time_taken (interval), status, method_num etc. The
only I found where indexes are used is url_id!

hmmm, any hints? Bug?


Postgres configuration is default for all optimizations (geqo_...);
others:

  sort_mem       = 1024
  shared_buffers = 512


Tested on Linux and Win2K/Cygwin.


For the hackers: explain verbose follows:

logger=# EXPLAIN VERBOSE SELECT * FROM access_log WHERE url_id = 1000;
NOTICE:  QUERY DUMP:

{ INDEXSCAN :startup_cost 0.00 :total_cost 3618.92 :rows 1002 :width 89
:qptargetlist ({ TARGETENTRY :resdom { RESDOM :resno 1 :restype 21
:restypmod -1 :resname site_id :reskey 0 :reskeyop 0 :ressortgroupref 0
:resjunk false } :expr { VAR :varno 1 :varattno 1 :vartype 21 :vartypmod
-1  :varlevelsup 0 :varnoold 1 :varoattno 1}} { TARGETENTRY :resdom {
RESDOM :resno 2 :restype 1184 :restypmod -1 :resname access_time :reskey
0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1
:varattno 2 :vartype 1184 :vartypmod -1  :varlevelsup 0 :varnoold 1
:varoattno 2}} { TARGETENTRY :resdom { RESDOM :resno 3 :restype 1186
:restypmod -1 :resname time_taken :reskey 0 :reskeyop 0 :ressortgroupref
0 :resjunk false } :expr { VAR :varno 1 :varattno 3 :vartype 1186
:vartypmod -1  :varlevelsup 0 :varnoold 1 :varoattno 3}} { TARGETENTRY
:resdom { RESDOM :resno 4 :restype 869 :restypmod -1 :resname remote_ip
:reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR
:varno 1 :varattno 4 :vartype 869 :vartypmod -1  :varlevelsup 0
:varnoold 1 :varoattno 4}} { TARGETENTRY :resdom { RESDOM :resno 5
:restype 21 :restypmod -1 :resname method_num :reskey 0 :reskeyop 0
:ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 5
:vartype 21 :vartypmod -1  :varlevelsup 0 :varnoold 1 :varoattno 5}} {
TARGETENTRY :resdom { RESDOM :resno 6 :restype 23 :restypmod -1 :resname
url_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr {
VAR :varno 1 :varattno 6 :vartype 23 :vartypmod -1  :varlevelsup 0
:varnoold 1 :varoattno 6}} { TARGETENTRY :resdom { RESDOM :resno 7
:restype 23 :restypmod -1 :resname referer_id :reskey 0 :reskeyop 0
:ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 7
:vartype 23 :vartypmod -1  :varlevelsup 0 :varnoold 1 :varoattno 7}} {
TARGETENTRY :resdom { RESDOM :resno 8 :restype 23 :restypmod -1 :resname
browser_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false }
:expr { VAR :varno 1 :varattno 8 :vartype 23 :vartypmod -1  :varlevelsup
0 :varnoold 1 :varoattno 8}} { TARGETENTRY :resdom { RESDOM :resno 9
:restype 21 :restypmod -1 :resname status :reskey 0 :reskeyop 0
:ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 9
:vartype 21 :vartypmod -1  :varlevelsup 0 :varnoold 1 :varoattno 9}} {
TARGETENTRY :resdom { RESDOM :resno 10 :restype 23 :restypmod -1
:resname bytes :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false }
:expr { VAR :varno 1 :varattno 10 :vartype 23 :vartypmod -1
:varlevelsup 0 :varnoold 1 :varoattno 10}} { TARGETENTRY :resdom {
RESDOM :resno 11 :restype 21 :restypmod -1 :resname content_id :reskey 0
:reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1
:varattno 11 :vartype 21 :vartypmod -1  :varlevelsup 0 :varnoold 1
:varoattno 11}} { TARGETENTRY :resdom { RESDOM :resno 12 :restype 16
:restypmod -1 :resname https_flag :reskey 0 :reskeyop 0 :ressortgroupref
0 :resjunk false } :expr { VAR :varno 1 :varattno 12 :vartype 16
:vartypmod -1  :varlevelsup 0 :varnoold 1 :varoattno 12}} { TARGETENTRY
:resdom { RESDOM :resno 13 :restype 1042 :restypmod 36 :resname
session_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false }
:expr { VAR :varno 1 :varattno 13 :vartype 1042 :vartypmod 36
:varlevelsup 0 :varnoold 1 :varoattno 13}} { TARGETENTRY :resdom {
RESDOM :resno 14 :restype 23 :restypmod -1 :resname user_id :reskey 0
:reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1
:varattno 14 :vartype 23 :vartypmod -1  :varlevelsup 0 :varnoold 1
:varoattno 14}} { TARGETENTRY :resdom { RESDOM :resno 15 :restype 1043
:restypmod 259 :resname uname :reskey 0 :reskeyop 0 :ressortgroupref 0
:resjunk false } :expr { VAR :varno 1 :varattno 15 :vartype 1043
:vartypmod 259  :varlevelsup 0 :varnoold 1 :varoattno 15}} { TARGETENTRY
:resdom { RESDOM :resno 16 :restype 23 :restypmod -1 :resname note_id
:reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR
:varno 1 :varattno 16 :vartype 23 :vartypmod -1  :varlevelsup 0
:varnoold 1 :varoattno 16}}) :qpqual <> :lefttree <> :righttree <>
:extprm () :locprm () :initplan <> :nprm 0  :scanrelid 1 :indxid (
1870492) :indxqual (({ EXPR :typeOid 16  :opType op :oper { OPER :opno
96 :opid 65 :opresulttype 16 } :args ({ VAR :varno 1 :varattno 1
:vartype 23 :vartypmod -1  :varlevelsup 0 :varnoold 1 :varoattno 6} {
CONST :consttype 23 :constlen 4 :constbyval true :constisnull false
:constvalue  4 [ -24 3 0 0 ] })})) :indxqualorig (({ EXPR :typeOid 16
:opType op :oper { OPER :opno 96 :opid 65 :opresulttype 16 } :args ({
VAR :varno 1 :varattno 6 :vartype 23 :vartypmod -1  :varlevelsup 0
:varnoold 1 :varoattno 6} { CONST :consttype 23 :constlen 4 :constbyval
true :constisnull false :constvalue  4 [ -24 3 0 0 ] })})) :indxorderdir
1 }
NOTICE:  QUERY PLAN:

Index Scan using url_idx on access_log  (cost=0.00..3618.92 rows=1002
width=89)

EXPLAIN



logger=# EXPLAIN VERBOSE SELECT * FROM access_log WHERE referer_id =
1000;
NOTICE:  QUERY DUMP:

{ SEQSCAN :startup_cost 0.00 :total_cost 16443.71 :rows 11715 :width 89
:qptargetlist ({ TARGETENTRY :resdom { RESDOM :resno 1 :restype 21
:restypmod -1 :resname site_id :reskey 0 :reskeyop 0 :ressortgroupref 0
:resjunk false } :expr { VAR :varno 1 :varattno 1 :vartype 21 :vartypmod
-1  :varlevelsup 0 :varnoold 1 :varoattno 1}} { TARGETENTRY :resdom {
RESDOM :resno 2 :restype 1184 :restypmod -1 :resname access_time :reskey
0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1
:varattno 2 :vartype 1184 :vartypmod -1  :varlevelsup 0 :varnoold 1
:varoattno 2}} { TARGETENTRY :resdom { RESDOM :resno 3 :restype 1186
:restypmod -1 :resname time_taken :reskey 0 :reskeyop 0 :ressortgroupref
0 :resjunk false } :expr { VAR :varno 1 :varattno 3 :vartype 1186
:vartypmod -1  :varlevelsup 0 :varnoold 1 :varoattno 3}} { TARGETENTRY
:resdom { RESDOM :resno 4 :restype 869 :restypmod -1 :resname remote_ip
:reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR
:varno 1 :varattno 4 :vartype 869 :vartypmod -1  :varlevelsup 0
:varnoold 1 :varoattno 4}} { TARGETENTRY :resdom { RESDOM :resno 5
:restype 21 :restypmod -1 :resname method_num :reskey 0 :reskeyop 0
:ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 5
:vartype 21 :vartypmod -1  :varlevelsup 0 :varnoold 1 :varoattno 5}} {
TARGETENTRY :resdom { RESDOM :resno 6 :restype 23 :restypmod -1 :resname
url_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr {
VAR :varno 1 :varattno 6 :vartype 23 :vartypmod -1  :varlevelsup 0
:varnoold 1 :varoattno 6}} { TARGETENTRY :resdom { RESDOM :resno 7
:restype 23 :restypmod -1 :resname referer_id :reskey 0 :reskeyop 0
:ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 7
:vartype 23 :vartypmod -1  :varlevelsup 0 :varnoold 1 :varoattno 7}} {
TARGETENTRY :resdom { RESDOM :resno 8 :restype 23 :restypmod -1 :resname
browser_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false }
:expr { VAR :varno 1 :varattno 8 :vartype 23 :vartypmod -1  :varlevelsup
0 :varnoold 1 :varoattno 8}} { TARGETENTRY :resdom { RESDOM :resno 9
:restype 21 :restypmod -1 :resname status :reskey 0 :reskeyop 0
:ressortgroupref 0 :resjunk false } :expr { VAR :varno 1 :varattno 9
:vartype 21 :vartypmod -1  :varlevelsup 0 :varnoold 1 :varoattno 9}} {
TARGETENTRY :resdom { RESDOM :resno 10 :restype 23 :restypmod -1
:resname bytes :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false }
:expr { VAR :varno 1 :varattno 10 :vartype 23 :vartypmod -1
:varlevelsup 0 :varnoold 1 :varoattno 10}} { TARGETENTRY :resdom {
RESDOM :resno 11 :restype 21 :restypmod -1 :resname content_id :reskey 0
:reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1
:varattno 11 :vartype 21 :vartypmod -1  :varlevelsup 0 :varnoold 1
:varoattno 11}} { TARGETENTRY :resdom { RESDOM :resno 12 :restype 16
:restypmod -1 :resname https_flag :reskey 0 :reskeyop 0 :ressortgroupref
0 :resjunk false } :expr { VAR :varno 1 :varattno 12 :vartype 16
:vartypmod -1  :varlevelsup 0 :varnoold 1 :varoattno 12}} { TARGETENTRY
:resdom { RESDOM :resno 13 :restype 1042 :restypmod 36 :resname
session_id :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false }
:expr { VAR :varno 1 :varattno 13 :vartype 1042 :vartypmod 36
:varlevelsup 0 :varnoold 1 :varoattno 13}} { TARGETENTRY :resdom {
RESDOM :resno 14 :restype 23 :restypmod -1 :resname user_id :reskey 0
:reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno 1
:varattno 14 :vartype 23 :vartypmod -1  :varlevelsup 0 :varnoold 1
:varoattno 14}} { TARGETENTRY :resdom { RESDOM :resno 15 :restype 1043
:restypmod 259 :resname uname :reskey 0 :reskeyop 0 :ressortgroupref 0
:resjunk false } :expr { VAR :varno 1 :varattno 15 :vartype 1043
:vartypmod 259  :varlevelsup 0 :varnoold 1 :varoattno 15}} { TARGETENTRY
:resdom { RESDOM :resno 16 :restype 23 :restypmod -1 :resname note_id
:reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR
:varno 1 :varattno 16 :vartype 23 :vartypmod -1  :varlevelsup 0
:varnoold 1 :varoattno 16}}) :qpqual ({ EXPR :typeOid 16  :opType op
:oper { OPER :opno 96 :opid 65 :opresulttype 16 } :args ({ VAR :varno 1
:varattno 7 :vartype 23 :vartypmod -1  :varlevelsup 0 :varnoold 1
:varoattno 7} { CONST :consttype 23 :constlen 4 :constbyval true
:constisnull false :constvalue  4 [ -24 3 0 0 ] })}) :lefttree <>
:righttree <> :extprm () :locprm () :initplan <> :nprm 0  :scanrelid 1 }
NOTICE:  QUERY PLAN:

Seq Scan on access_log  (cost=0.00..16443.71 rows=11715 width=89)

EXPLAIN



Thanks & Ciao

  Alvar

--
AGI
Magirusstrasse 21B, 70469 Stuttgart
Fon +49 (0)711.228 74-50, Fax +49 (0)711.228 74-88
+++news+++news+++news+++
Beste Image-Website 2001 kommt von AGI
http://www.agi.de/tagebuch
http://www.agi.com/diary (english)

Re: [HACKERS] Indexes not used in 7.1RC4: Bug?

От

Stephan Szabo

Дата:

10 апреля 2001 г., 15:16:28

> url_idx seems OK:
>
>   logger=# EXPLAIN SELECT * FROM access_log WHERE url_id = 1000;
>   Index Scan using url_idx on access_log
>      (cost=0.00..3618.92 rows=1002 width=89)
>
>
>
> But the others not:
>
>   logger=# EXPLAIN SELECT * FROM access_log WHERE method_num = 0;
>   Seq Scan on access_log  (cost=0.00..16443.71 rows=559371 width=89)

In this case you'd need to coerce the 0 to int2 in any case, but it
figures that most of your rows are returned so a sequence scan will
be faster.  For an index scan, it's got to seek around the heap file
doing random access to determine if rows are visible to your transaction
which is more expensive than sequential reads, so at some point the
optimizer will guess the sequence scan to be faster.

>   logger=# EXPLAIN SELECT * FROM access_log WHERE browser_id = 500;
>   Seq Scan on access_log  (cost=0.00..16443.71 rows=7935 width=89)
>
>   logger=# EXPLAIN SELECT * FROM access_log WHERE content_id = 20;
>   Seq Scan on access_log  (cost=0.00..16443.71 rows=20579 width=89)

Not sure on these two, it probably is estimating less disk access
for doing the sequence scan, my guess would be that the break point
is probably somewhere between the 1000 and 8000 row point for the
two queries.  And I believe the second was an int2, so you'll need
to cast the 20.

> And very strange:
>
>   logger=# EXPLAIN SELECT * FROM access_log WHERE access_time >
>            '2001-04-10 10:10:10';
>   Index Scan using access_time_idx on access_log
>            (cost=0.00..10605.12 rows=3251 width=89)
>
>   logger=# EXPLAIN SELECT * FROM access_log WHERE access_time >
>            '2001-04-08 10:10:10';
>   Seq Scan on access_log  (cost=0.00..16443.71 rows=152292 width=89)

Same as above, for 3000 rows it thinks index scan will be faster,
for 152000 rows the sequence scan.

> But:
>
>   logger=# EXPLAIN SELECT * FROM access_log
>                            WHERE url_id IN (SELECT 1);
>   Seq Scan on access_log  (cost=0.00..16443.71 rows=572537 width=89)
>   SubPlan
>     ->  Materialize  (cost=0.00..0.00 rows=0 width=0)
>           ->  Result  (cost=0.00..0.00 rows=0 width=0)

My guess is it doesn't realize that SELECT 1 is a constant that it
can use the index for.  IN (subselect) isn't handled very well right
now.

> Indexes are also not used for remote_ip, ORDER BY access_time
> (timestamp), ORDER BY time_taken (interval), status, method_num etc. The
> only I found where indexes are used is url_id!

Any of the int2s will require explicit casting of a constant in order
to use the index.  I'm not sure on the others.

Re: Indexes not used in 7.1RC4: Bug?

От

Alvar Freude

Дата:

10 апреля 2001 г., 15:43:07

Thomas Lockhart wrote:
>
> The parser does not know that your int4 constant "0" can be represented
> as an int2. Try
>
>   SELECT * FROM access_log WHERE method_num = int2 '0';

hmmm, but its still a sequentiell scan:

  logger=# explain SELECT * FROM access_log
                           WHERE method_num = int2 '0';
  Seq Scan on access_log  (cost=0.00..16443.71 rows=559371 width=89)

But:
Now I realised: the number of rows! :)
If I make "WHERE method_num = int2 '2', then the index is used,
interesting -- so it seems that the optimizer uses the value of the
WHERE clause to check what might be faster and guesses, that an index
scan is more overhead and slower. Nice!

> For the other cases, PostgreSQL is estimating the query cost to be lower
> with a sequential scan.

hm, OK, but I guess, that he is estimating wrong ;)

After re-reading the using-explain chapter in the docs I guess I
understand the problems of estimating the number of rows ...

Do you have any hints how to optimize the ..._cost-Values?

Perhaps it is possible to write a test program, which checks out some
good ..._cost-Values -- I'm volunteer, but I guess it should possible
for this to force some optimizer results to measure the real time some
different methods cost.

> For the "SELECT 1" subselect case, it may be that the optimizer does not cheat and > determine that there will be
only
> one row returned, or that the query can be reformulated to use a simple
> constant.

yes, it was only an example -- i hope nobody is really so stupid and
uses a "select 1" subselect ;)

It might be an optimization, that the hole subselect is performed before
the outer select is called, so the result of the subselect can be used
in the query planer.

Ciao
  Alvar

--
AGI
Magirusstrasse 21B, 70469 Stuttgart
Fon +49 (0)711.228 74-50, Fax +49 (0)711.228 74-88
+++news+++news+++news+++
Beste Image-Website 2001 kommt von AGI
http://www.agi.de/tagebuch
http://www.agi.com/diary (english)

Re: Indexes not used in 7.1RC4: Bug?

От

Thomas Lockhart

Дата:

10 апреля 2001 г., 22:40:55

> I have the following table, containing about 570000 Rows, but some
> indexes are not used, on 7.1RC4, freshly vacuumed (analyse). It was the
> same at least in 7.1RC1
>       CREATE TABLE access_log(
>          access_time timestamp   NOT NULL DEFAULT NOW(),
>          method_num  int2        NOT NULL,
>          url_id      int4        NOT NULL REFERENCES urls(id),
>          );
>       CREATE INDEX method_idx       ON access_log(method_num);
>       CREATE INDEX url_idx          ON access_log(url_id);
> url_idx seems OK:
> But the others not:
>   logger=# EXPLAIN SELECT * FROM access_log WHERE method_num = 0;
>   Seq Scan on access_log  (cost=0.00..16443.71 rows=559371 width=89)

The parser does not know that your int4 constant "0" can be represented
as an int2. Try

  SELECT * FROM access_log WHERE method_num = int2 '0';

(note the type coersion on the constant; there are other ways of
specifying the same thing).

For the other cases, PostgreSQL is estimating the query cost to be lower
with a sequential scan. For the "SELECT 1" subselect case, it may be
that the optimizer does not cheat and determine that there will be only
one row returned, or that the query can be reformulated to use a simple
constant.

HTH

                     - Thomas

Speaking of Indexing... (Text indexing)

От

Poet/Joshua Drake

Дата:

10 апреля 2001 г., 23:02:21

Good day,

I've been experimenting a bit with Full Text Indexing in PostgreSQL. I
have found several conflicting sites various places on the net pertaining
to whether or not PostgreSQL supports FTI, and I was hoping I could find
an authoritative answer here - I tried searching the website's archives,
but the search seems to be having some problems.

At any rate, I am running a CVS snapshot of 7.1, and I have been trying to
create a full text index on a series of resumes. Some of these exceed 8k
in size, which is no longer a storage problem of course with 7.1, but I
seem to have run into the wicked 8k once again. Specifically:

ERROR:  index_formtuple: data takes 9344 bytes, max is 8191

Furthermore, after trying to just index on a 8191-character long substring
of the resume, I run into the following:

ERROR:  btree: index item size 3948 exceeds maximum 2713

The only way I could actually get the index created was to substring the
body of the resumes down to 2k. I also later tried using HASH rather than
BTREE, which worked, but none of these solutions really appreciably
increased performance in the way we were hoping.

Are these known and accepted limitations of the current 7.1
implementation, or am I doing something terribly wrong? ;)
On Tue, 10 Apr 2001, Thomas Lockhart wrote:

>> I have the following table, containing about 570000 Rows, but some
>> indexes are not used, on 7.1RC4, freshly vacuumed (analyse). It was the
>> same at least in 7.1RC1
>>       CREATE TABLE access_log(
>>          access_time timestamp   NOT NULL DEFAULT NOW(),
>>          method_num  int2        NOT NULL,
>>          url_id      int4        NOT NULL REFERENCES urls(id),
>>          );
>>       CREATE INDEX method_idx       ON access_log(method_num);
>>       CREATE INDEX url_idx          ON access_log(url_id);
>> url_idx seems OK:
>> But the others not:
>>   logger=# EXPLAIN SELECT * FROM access_log WHERE method_num = 0;
>>   Seq Scan on access_log  (cost=0.00..16443.71 rows=559371 width=89)
>
>The parser does not know that your int4 constant "0" can be represented
>as an int2. Try
>
>  SELECT * FROM access_log WHERE method_num = int2 '0';
>
>(note the type coersion on the constant; there are other ways of
>specifying the same thing).
>
>For the other cases, PostgreSQL is estimating the query cost to be lower
>with a sequential scan. For the "SELECT 1" subselect case, it may be
>that the optimizer does not cheat and determine that there will be only
>one row returned, or that the query can be reformulated to use a simple
>constant.
>
>HTH
>
>                     - Thomas
>
>---------------------------(end of broadcast)---------------------------
>TIP 4: Don't 'kill -9' the postmaster
>

--
--
<COMPANY>CommandPrompt    - http://www.commandprompt.com    </COMPANY>
<PROJECT>OpenDocs, LLC.    - http://www.opendocs.org    </PROJECT>
<PROJECT>LinuxPorts     - http://www.linuxports.com     </PROJECT>
<WEBMASTER>LDP        - http://www.linuxdoc.org    </WEBMASTER>
--
Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.
--

Re: Speaking of Indexing... (Text indexing)

От

Joel Burton

Дата:

11 апреля 2001 г., 01:14:13

On Tue, 10 Apr 2001, Poet/Joshua Drake wrote:

> I've been experimenting a bit with Full Text Indexing in PostgreSQL. I
> have found several conflicting sites various places on the net pertaining
> to whether or not PostgreSQL supports FTI, and I was hoping I could find
> an authoritative answer here - I tried searching the website's archives,
> but the search seems to be having some problems.
>
> At any rate, I am running a CVS snapshot of 7.1, and I have been trying to
> create a full text index on a series of resumes. Some of these exceed 8k
> in size, which is no longer a storage problem of course with 7.1, but I
> seem to have run into the wicked 8k once again. Specifically:

Joshua --

CREATE INDEX ... creates an index on a field, allowing for faster
searches, *if* you're looking to match the first part of that text string.
So, if I have a table of movie titles, creating an index on column title
will allow for faster searches if my criteria is something like
title='Toto Les Heros' (or like 'Toto%' or such), but not (AFAIK) for
title ~ 'Les' or title LIKE '%Les%'. The index doesn't help here.

For these long fields you have, you probably want to search for a word in
the field, not match the start of the field. A regular index isn't your
answer.

There is a full text indexing solution in the contrib/ directory of the
source. It essentially creates a new table w/every occurence of every word
fragment, with a reference back to the row that contains it. Searching
against this is indexed, and is speedy. The only downside is that you will
have a *large* table holding the full text index.

More help can be found in the README file in contrib/fulltextindex

HTH,
--
Joel Burton   <jburton@scw.org>
Director of Information Systems, Support Center of Washington

Re: Speaking of Indexing... (Text indexing)

От

Thomas Lockhart

Дата:

11 апреля 2001 г., 01:41:48

> Furthermore, after trying to just index on a 8191-character long substring
> of the resume, I run into the following:
> ERROR:  btree: index item size 3948 exceeds maximum 2713
> The only way I could actually get the index created was to substring the
> body of the resumes down to 2k. I also later tried using HASH rather than
> BTREE, which worked, but none of these solutions really appreciably
> increased performance in the way we were hoping.
>
> Are these known and accepted limitations of the current 7.1
> implementation, or am I doing something terribly wrong? ;)

Hmm. I'm pretty sure that a single index on the entire contents of a
resume *as a single field* is close to useless. And an index on an 8k
piece is also useless. Presumably you really want an index covering each
significant word of each resume, in which case you would not run into
the 4k limit (or 2k limit? it is documented somewhere) on the size of an
*index* field (which is still a limitation on PostgreSQL built with the
standard 8k block size. Of course, you can build with a larger block
size).

hth

                   - Thomas

Re: Speaking of Indexing... (Text indexing)

От

"Mitch Vincent"

Дата:

11 апреля 2001 г., 14:44:51

> Hmm. I'm pretty sure that a single index on the entire contents of a
> resume *as a single field* is close to useless. And an index on an 8k
> piece is also useless. Presumably you really want an index covering each
> significant word of each resume, in which case you would not run into
> the 4k limit (or 2k limit? it is documented somewhere) on the size of an
> *index* field (which is still a limitation on PostgreSQL built with the
> standard 8k block size. Of course, you can build with a larger block
> size).

Just an FYI..

I asked the other day and someone (Tom?) told me it was about 2k..

-Mitch

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Indexes not used in 7.1RC4: Bug?