Обсуждение: Function-based index not used in a simple query

Поиск
Список
Период
Сортировка

Function-based index not used in a simple query

От
Rostislav Opocensky
Дата:
Hi all,

I have run into a problem trying to optimize a select from a single table.  
Due to the nature of the queries our frontends issue most of the time, we
have created an index that reduces the processing time a lot.  The problem
(index not being used) arose when I tried to do a select with non-constant
parameters to the WHERE condition.

The table definition looks like this:
CREATE TABLE vals (    timestamp       timestamp   NOT NULL,    agent           varchar(15) NOT NULL,    var
varchar(64) NOT NULL,    val             text        NOT NULL);
 

The index definition is based on a user-defined function:
CREATE FUNCTION trunc_to_day(timestamp) RETURNS timestamp AS 'DECLARE tstamp ALIAS FOR $1;BEGIN    RETURN
date_trunc(''day'',tstamp);END;' LANGUAGE 'plpgsql';
 
CREATE INDEX vals_days    ON vals (trunc_to_day(timestamp) timestamp_ops);

A typical query looks like this (additional conditions removed from the
WHERE condition as well as additional GROUP BY and ORDER BY clauses):
SELECT *    FROM vals    WHERE trunc_to_day(timestamp) = '28.5.2000';

Explain on this query produces:
Index Scan using vals_days on vals  (cost=0.00..8.16 rows=10 width=44)

Now, when I try to do the same with a slightly changed comparison expression
in the WHERE clause, an optimizer decides to sequentially scan the table:
SELECT *    FROM vals    WHERE trunc_to_day(timestamp) = trunc_to_day('28.5.2000');
Seq Scan on vals  (cost=0.00..27.50 rows=10 width=44)

Actually, the problem first appeared in a stored procedure:
CREATE FUNCTION detector(timestamp, varchar) RETURNS float AS 'DECLARE    check_time  ALIAS FOR $1;    check_agent
ALIASFOR $2;
 
    from_time timestamp;    to_time   timestamp;    from_day  timestamp;    to_day    timestamp;    rssi_var
vars.var%TYPE;   avg_rssi  float;BEGIN    from_time = check_time;    from_day  = trunc_to_day(from_time);
 
    to_time   = check_time + ''1 day''::interval;    to_day    = trunc_to_day(to_time);
    SELECT INTO rssi_var var || ''%''        FROM vars        WHERE name = ''brzAvrgRssi'';
    SELECT INTO avg_rssi AVG(val::float)        FROM vals        WHERE trunc_to_day(timestamp) BETWEEN from_day AND
to_dayAND            timestamp BETWEEN from_time AND to_time AND            agent = check_agent AND            var LIKE
rssi_var;
    IF avg_rssi IS NULL THEN        RAISE EXCEPTION ''There are no values for % and %.'',
check_time,check_agent;    END IF;
 
    RETURN avg_rssi;END;' LANGUAGE 'plpgsql';

Sorry for a lengthy listing, but I didn't want to omit something important
accidentally.  It seems the optimizer chooses to seq-scan whenever there is
anything else than a simple string constant in the WHERE condition.

After reading the thread ``Index not used in functions in 7.0'' two weeks ago
in this list, I have experimented with typecasting extensively, trying to add
``::timestamp'' wherever possible to the query, but with no success.

The PostgreSQL version is 7.0.0 on i686-pc-linux-gnu, compiled by gcc
egcs-2.91.66.

Thanks in advance for any advices!
Orbis

-- 
Rostislav Opocensky <orbis@pictus.org> <orbis@unreal.cz> +420 411 825144
Unreal Technology sro., Dobrin 118, 41301 Roudnice n. L. +420 411 825111



Re: Function-based index not used in a simple query

От
Tom Lane
Дата:
Rostislav Opocensky <orbis@pictus.org> writes:
>     SELECT *
>         FROM vals
>         WHERE trunc_to_day(timestamp) = trunc_to_day('28.5.2000');
>
>     Seq Scan on vals  (cost=0.00..27.50 rows=10 width=44)

The problem here is that the optimizer will only consider an indexscan
for a clause that looks like index_key OP constant.  It doesn't think
that trunc_to_day('28.5.2000') is a constant, because you haven't told
it that it can pre-evaluate that function call --- and for all it knows,
the function might generate different values when evaluated at different
rows of the table, rendering an indexscan incorrect.  (Consider a
function like random() ...)  If you mark trunc_to_day() as a "cachable"
function then the system will know it's safe to collapse the righthand
side to a constant:

CREATE FUNCTION trunc_to_day(timestamp) RETURNS timestamp AS 'DECLARE tstamp ALIAS FOR $1;BEGIN    RETURN
date_trunc(''day'',tstamp);END;' LANGUAGE 'plpgsql'WITH (isCachable);
 

This is a little bit of a cheat for this function, because what
isCachable actually implies is that the function's value depends
on the presented arguments *and nothing else*.  I believe date_trunc
depends on the current TZ setting, so it's not really cachable.
You could possibly get away with it for your application, but you
might run into problems like this:

1. You write trunc_to_day('28.5.2000') in a table's column  default expression, stored rule, or some such.  The system
decidesit can fold that expression down to a simple constant,  giving you (say) 2000-05-20 00:00-04 --- ie, midnight in
whatever timezone you are using.  That value gets stored in  the column or rule definition.
 

2. You then access the database while running with a different  TZ setting.  You wonder why you don't get the answers
youexpected.
 

Come to think of it, your index on trunc_to_day() is fundamentally
risky because different backends running with different TZ settings
would generate different index entries for the "same" data entry.
For example, if you make an entry dated '28.5.2000' while running in
one timezone, and then try to query with "= trunc_to_day('28.5.2000')"
while running in another timezone, you won't match that entry if the
index is used, because the timestamp generated on-the-fly from
trunc_to_day('28.5.2000') won't equal the one stored in the index.

(Hmm ... I wonder if that suggests that we shouldn't allow indexes
to be built using functions that are not isCachable?)

You could avoid some of these risks by having trunc_to_day return
a 'date' rather than a 'timestamp', thereby removing at least some
of the timezone dependency from the index definition.
        regards, tom lane


Re: Function-based index not used in a simple query

От
Rostislav Opocensky
Дата:
On Tue, 30 May 2000, Tom Lane wrote:

> The problem here is that the optimizer will only consider an indexscan
> for a clause that looks like index_key OP constant.  It doesn't think
> that trunc_to_day('28.5.2000') is a constant, because you haven't told
> it that it can pre-evaluate that function call --- and for all it knows,
[...]

Dear Tom,

thank you for the precise explanation for my problem!

I'll consider having my index function return a `date'.  Still one thing
remains unclear to me: why the optimizer doesn't use an indexscan in the
stored procedure I have attached to my previous post.  The condition looks
like WHERE trunc_to_day(timestamp) BETWEEN var1 AND var2.  var1 and var2
get their values from calling the `volatile' function trunc_to_day, but
from then on, their values can't be changed during the execution of the
query.  Is it possible to give the optimizer a hint about it?

Best regards    Orbis

-- 
Rostislav Opocensky <orbis@pictus.org> <orbis@unreal.cz> +420 411 825144
Unreal Technology sro., Dobrin 118, 41301 Roudnice n. L. +420 411 825111







Re: Function-based index not used in a simple query

От
Tom Lane
Дата:
Rostislav Opocensky <orbis@pictus.org> writes:
> On Tue, 30 May 2000, Tom Lane wrote:
>> The problem here is that the optimizer will only consider an indexscan
>> for a clause that looks like index_key OP constant.  It doesn't think

> I'll consider having my index function return a `date'.  Still one thing
> remains unclear to me: why the optimizer doesn't use an indexscan in the
> stored procedure I have attached to my previous post.  The condition looks
> like WHERE trunc_to_day(timestamp) BETWEEN var1 AND var2.  var1 and var2
> get their values from calling the `volatile' function trunc_to_day, but
> from then on, their values can't be changed during the execution of the
> query.  Is it possible to give the optimizer a hint about it?

Hmm, actually the optimizer should/does regard those as constants within
subsequent queries (internally they are Params instead of Consts, but
that's supposed to be OK).  What I find here is that the optimizer does
consider an indexscan for this query, but there's a bug in its
selectivity estimation routine that causes it not to recognize the
BETWEEN clause as being a range restriction --- and that means it
produces a fairly high cost estimate for the indexscan.  I still got
an indexscan plan for a small test table, but on a larger table you
might not get one.

I've applied the attached patch for 7.0.1 --- if you are in a hurry,
you may care to apply it to your local copy.  It just tweaks the range-
query recognizer to accept Param as well as Const nodes.

            regards, tom lane


*** src/backend/optimizer/path/clausesel.c.orig    Tue May 30 00:26:44 2000
--- src/backend/optimizer/path/clausesel.c    Wed May 31 11:38:53 2000
***************
*** 120,129 ****
          Selectivity s2;

          /*
!          * See if it looks like a restriction clause with a constant. (If
!          * it's not a constant we can't really trust the selectivity!) NB:
!          * for consistency of results, this fragment of code had better
!          * match what clause_selectivity() would do.
           */
          if (varRelid != 0 || NumRelids(clause) == 1)
          {
--- 120,131 ----
          Selectivity s2;

          /*
!          * See if it looks like a restriction clause with a Const or Param
!          * on one side.  (Anything more complicated than that might not
!          * behave in the simple way we are expecting.)
!          *
!          * NB: for consistency of results, this fragment of code had better
!          * match what clause_selectivity() would do in the cases it handles.
           */
          if (varRelid != 0 || NumRelids(clause) == 1)
          {
***************
*** 134,174 ****

              get_relattval(clause, varRelid,
                            &relidx, &attno, &constval, &flag);
!             if (relidx != 0 && (flag & SEL_CONSTANT))
              {
                  /* if get_relattval succeeded, it must be an opclause */
!                 Oid            opno = ((Oper *) ((Expr *) clause)->oper)->opno;
!                 RegProcedure oprrest = get_oprrest(opno);

!                 if (!oprrest)
!                     s2 = (Selectivity) 0.5;
!                 else
!                     s2 = restriction_selectivity(oprrest, opno,
!                                                  getrelid(relidx,
!                                                           root->rtable),
!                                                  attno,
!                                                  constval, flag);
!
!                 /*
!                  * If we reach here, we have computed the same result that
!                  * clause_selectivity would, so we can just use s2 if it's
!                  * the wrong oprrest.  But if it's the right oprrest, add
!                  * the clause to rqlist for later processing.
!                  */
!                 switch (oprrest)
                  {
!                     case F_SCALARLTSEL:
!                         addRangeClause(&rqlist, clause, flag, true, s2);
!                         break;
!                     case F_SCALARGTSEL:
!                         addRangeClause(&rqlist, clause, flag, false, s2);
!                         break;
!                     default:
!                         /* Just merge the selectivity in generically */
!                         s1 = s1 * s2;
!                         break;
                  }
-                 continue;        /* drop to loop bottom */
              }
          }
          /* Not the right form, so treat it generically. */
--- 136,183 ----

              get_relattval(clause, varRelid,
                            &relidx, &attno, &constval, &flag);
!             if (relidx != 0)
              {
                  /* if get_relattval succeeded, it must be an opclause */
!                 Var           *other;

!                 other = (flag & SEL_RIGHT) ? get_rightop((Expr *) clause) :
!                     get_leftop((Expr *) clause);
!                 if (IsA(other, Const) || IsA(other, Param))
                  {
!                     Oid        opno = ((Oper *) ((Expr *) clause)->oper)->opno;
!                     RegProcedure oprrest = get_oprrest(opno);
!
!                     if (!oprrest)
!                         s2 = (Selectivity) 0.5;
!                     else
!                         s2 = restriction_selectivity(oprrest, opno,
!                                                      getrelid(relidx,
!                                                               root->rtable),
!                                                      attno,
!                                                      constval, flag);
!
!                     /*
!                      * If we reach here, we have computed the same result that
!                      * clause_selectivity would, so we can just use s2 if it's
!                      * the wrong oprrest.  But if it's the right oprrest, add
!                      * the clause to rqlist for later processing.
!                      */
!                     switch (oprrest)
!                     {
!                         case F_SCALARLTSEL:
!                             addRangeClause(&rqlist, clause, flag, true, s2);
!                             break;
!                         case F_SCALARGTSEL:
!                             addRangeClause(&rqlist, clause, flag, false, s2);
!                             break;
!                         default:
!                             /* Just merge the selectivity in generically */
!                             s1 = s1 * s2;
!                             break;
!                     }
!                     continue;    /* drop to loop bottom */
                  }
              }
          }
          /* Not the right form, so treat it generically. */

short query becomes long

От
mikeo
Дата:
hi, we have a weird situation here.  we have a table of approx. 10k rows
representing accumulated activity by specific customers.  as information 
is gathered those customers rows are updated.  the number of rows does not
increase unless we get a new customer so that is not a factor.  the table 
is defined as follows:
     Table "account_summary_02" Attribute  |    Type     | Modifier
-------------+-------------+----------bill_br_id  | bigint      | not nullcust_id     | varchar(15) | not nullbtn_id
 | varchar(15) | not nullln_id       | varchar(15) | not nullct_key      | float8      | not nullas_quantity | float8
  | not nullas_charges  | float8      | not nullas_count    | float8      | not null
 
Index: account_summary_02_unq_idx

the index is on the first 5 columns.  here's the situation.  after about
50,000
updates, which fly right along, the process begins to really bog down.  we
perform 
a vacuum analzye and it speeds right up again.  my question is, is there a way
to perform these updates, potentially 500k to 1 million in a day, without
having 
to vacuum so frequently?  maybe some setting or parameter to be changed?
the update
query is doing an index scan.  

mikeo 



Re: short query becomes long

От
Tom Lane
Дата:
mikeo <mikeo@spectrumtelecorp.com> writes:
> after about 50,000 updates, which fly right along, the process begins
> to really bog down.  we perform a vacuum analzye and it speeds right
> up again.

A plain "vacuum" should do the job in a bit less time.  The problem
is you need to reclaim the space occupied by deleted versions of rows.
No way around that with the current storage manager: vacuum is the
only way to get rid of the wasted space.
        regards, tom lane