Обсуждение: Bad row estimation with indexed func returning bool

Поиск
Список
Период
Сортировка

Bad row estimation with indexed func returning bool

От
Jehan-Guillaume de Rorthais
Дата:
Hi,

I faced a correlation problem on a query today and tried the usual trick
consisting of using an functional index and rewriting the query to use it.

However, after writing the function, indexing it and rewriting the query, I
faced an optimizer behavior I was not expecting. I wrote a short scenario to
reproduce it on current HEAD:

  CREATE TABLE correl AS SELECT (i-1)%2 AS i, i%2 AS j
  FROM generate_series(1,100000) AS i;

  ANALYZE correl ;

  EXPLAIN ANALYZE SELECT * FROM correl WHERE i = 1 AND j = 1;

  -- Seq Scan on correl (rows=25000) (rows=0)
  --    Filter: ((i = 1) AND (j = 1))
  --    Rows Removed by Filter: 100000
  -- Planning time: 0.356 ms
  -- Execution time: 21.937 ms

  CREATE FUNCTION fix_correl(int, int) RETURNS bool AS
    'BEGIN RETURN $1 = 1 AND $2 = 1; END '
    IMMUTABLE
    CALLED ON NULL INPUT
    LANGUAGE plpgsql;

  CREATE INDEX ON correl ( fix_correl(i, j) );

  ANALYZE correl ;

  EXPLAIN ANALYZE SELECT * FROM correl WHERE fix_correl(i, j);

  -- Index Scan using correl_fix_correl_idx on correl  (rows=33333) (rows=0)
  --    Index Cond: (fix_correl(i, j) = true)
  --    Filter: fix_correl(i, j)
  -- Planning time: 0.421 ms
  -- Execution time: 0.102 ms


Using a function returning integer work as expected:

  CREATE FUNCTION fix_correl_add(int, int) RETURNS int AS
    'BEGIN RETURN $1 + $2 ; END '
    IMMUTABLE
    CALLED ON NULL INPUT
    LANGUAGE plpgsql;

  CREATE INDEX ON correl ( fix_correl_add( i, j ) );

  ANALYZE correl ;

  EXPLAIN ANALYZE SELECT * FROM correl WHERE fix_correl_add( i, j ) = 2;
  -- Index Scan using correl_fix_correl_add_idx on correl  (rows=1) (rows=0)
  --    Index Cond: (fix_correl_add(i, j) = 2)
  -- Planning time: 0.462 ms
  -- Execution time: 0.102 ms


It works works as expected with a simple index on (i + j) with no function, but
I wanted to have the same conditions in both tests.

Why does the optimizer behave differently in both cases? Why do it add a Filter
when index scan-ing on correl_fix_correl_idx indexing booleans?

Please, find the complete scenario in attachment.

Regards,
--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com

Вложения

Re: Bad row estimation with indexed func returning bool

От
Tom Lane
Дата:
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> writes:
> I faced a correlation problem on a query today and tried the usual trick
> consisting of using an functional index and rewriting the query to use it.

The core reason this isn't doing anything useful is that
clause_selectivity() is hard-wired to estimate the selectivity of a
top-level WHERE clause that is a function call as 0.3333333, no matter
what:
   else if (is_funcclause(clause))   {       /*        * This is not an operator, so we guess at the selectivity. THIS
ISA        * HACK TO GET V4 OUT THE DOOR.  FUNCS SHOULD BE ABLE TO HAVE        * SELECTIVITIES THEMSELVES.       -- JMH
7/9/92       */       s1 = (Selectivity) 0.3333333;   }
 

Adding per-function selectivity estimators, as Joe was presumably
envisioning, would be a sufficiently large amount of work that it's not
too surprising nobody's gotten around to it in twenty-three years.  (The
infrastructure maybe wouldn't be so bad, but where would the estimators
themselves come from, especially for user-defined functions?)

However, in the case at hand, the complaint basically is why aren't we
treating the boolean function expression like a boolean variable, and
looking to see if there are stats available for it, like this other
bit in clause_selectivity:
           /*            * A Var at the top of a clause must be a bool Var. This is            * equivalent to the
clausereln.attribute = 't', so we compute            * the selectivity as if that is what we have.            */
  s1 = restriction_selectivity(root,                                        BooleanEqualOperator,
                list_make2(var,                                                   makeBoolConst(true,
                                             false)),                                        InvalidOid,
                       varRelid);
 

Indeed you could argue that this ought to be the fallback behavior for
*any* unhandled case, not just function expressions.  Not sure if we'd
need to restrict it to single-relation expressions or not.

The implication of doing it like this would be that the default estimate
in the absence of any matching stats would be 0.5 (since eqsel defaults
to 1/ndistinct, and get_variable_numdistinct will report 2.0 for any
boolean-type expression it has no stats for).  That's not a huge change
from the existing 0.3333333 estimate, which seems pretty unprincipled
anyway ... but it would probably be enough to annoy people if we did it in
stable branches.  So I'd be inclined to propose changing this in HEAD and
maybe 9.5, but not further back.  (For non-function expressions, 0.5 is
the default already, so those would not change behavior.)

Comments?
        regards, tom lane



Re: Bad row estimation with indexed func returning bool

От
Tom Lane
Дата:
I wrote:
> However, in the case at hand, the complaint basically is why aren't we
> treating the boolean function expression like a boolean variable, and
> looking to see if there are stats available for it, like this other
> bit in clause_selectivity:

>             /*
>              * A Var at the top of a clause must be a bool Var. This is
>              * equivalent to the clause reln.attribute = 't', so we compute
>              * the selectivity as if that is what we have.
>              */
>             s1 = restriction_selectivity(root,
>                                          BooleanEqualOperator,
>                                          list_make2(var,
>                                                     makeBoolConst(true,
>                                                                   false)),
>                                          InvalidOid,
>                                          varRelid);

> Indeed you could argue that this ought to be the fallback behavior for
> *any* unhandled case, not just function expressions.  Not sure if we'd
> need to restrict it to single-relation expressions or not.

> The implication of doing it like this would be that the default estimate
> in the absence of any matching stats would be 0.5 (since eqsel defaults
> to 1/ndistinct, and get_variable_numdistinct will report 2.0 for any
> boolean-type expression it has no stats for).  That's not a huge change
> from the existing 0.3333333 estimate, which seems pretty unprincipled
> anyway ... but it would probably be enough to annoy people if we did it in
> stable branches.  So I'd be inclined to propose changing this in HEAD and
> maybe 9.5, but not further back.  (For non-function expressions, 0.5 is
> the default already, so those would not change behavior.)

I experimented with the attached patch.  The change in the default
estimate for a function results in just one change in the standard
regression test results, so far as I can find.

Comments, objections?

            regards, tom lane

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index dcac1c1..c862b70 100644
*** a/src/backend/optimizer/path/clausesel.c
--- b/src/backend/optimizer/path/clausesel.c
***************
*** 14,20 ****
   */
  #include "postgres.h"

- #include "catalog/pg_operator.h"
  #include "nodes/makefuncs.h"
  #include "optimizer/clauses.h"
  #include "optimizer/cost.h"
--- 14,19 ----
*************** clause_selectivity(PlannerInfo *root,
*** 568,585 ****
          if (var->varlevelsup == 0 &&
              (varRelid == 0 || varRelid == (int) var->varno))
          {
!             /*
!              * A Var at the top of a clause must be a bool Var. This is
!              * equivalent to the clause reln.attribute = 't', so we compute
!              * the selectivity as if that is what we have.
!              */
!             s1 = restriction_selectivity(root,
!                                          BooleanEqualOperator,
!                                          list_make2(var,
!                                                     makeBoolConst(true,
!                                                                   false)),
!                                          InvalidOid,
!                                          varRelid);
          }
      }
      else if (IsA(clause, Const))
--- 567,574 ----
          if (var->varlevelsup == 0 &&
              (varRelid == 0 || varRelid == (int) var->varno))
          {
!             /* Use the restriction selectivity function for a bool Var */
!             s1 = boolvarsel(root, (Node *) var, varRelid);
          }
      }
      else if (IsA(clause, Const))
*************** clause_selectivity(PlannerInfo *root,
*** 680,704 ****
          if (IsA(clause, DistinctExpr))
              s1 = 1.0 - s1;
      }
-     else if (is_funcclause(clause))
-     {
-         /*
-          * This is not an operator, so we guess at the selectivity. THIS IS A
-          * HACK TO GET V4 OUT THE DOOR.  FUNCS SHOULD BE ABLE TO HAVE
-          * SELECTIVITIES THEMSELVES.       -- JMH 7/9/92
-          */
-         s1 = (Selectivity) 0.3333333;
-     }
- #ifdef NOT_USED
-     else if (IsA(clause, SubPlan) ||
-              IsA(clause, AlternativeSubPlan))
-     {
-         /*
-          * Just for the moment! FIX ME! - vadim 02/04/98
-          */
-         s1 = (Selectivity) 0.5;
-     }
- #endif
      else if (IsA(clause, ScalarArrayOpExpr))
      {
          /* Use node specific selectivity calculation function */
--- 669,674 ----
*************** clause_selectivity(PlannerInfo *root,
*** 766,771 ****
--- 736,752 ----
                                  jointype,
                                  sjinfo);
      }
+     else
+     {
+         /*
+          * For anything else, see if we can consider it as a boolean variable.
+          * This only works if it's an immutable expression in Vars of a single
+          * relation; but there's no point in us checking that here because
+          * boolvarsel() will do it internally, and return a suitable default
+          * selectivity (0.5) if not.
+          */
+         s1 = boolvarsel(root, clause, varRelid);
+     }

      /* Cache the result if possible */
      if (cacheable)
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 72bc502..a643c6f 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
***************
*** 105,110 ****
--- 105,111 ----
  #include "access/sysattr.h"
  #include "catalog/index.h"
  #include "catalog/pg_collation.h"
+ #include "catalog/pg_operator.h"
  #include "catalog/pg_opfamily.h"
  #include "catalog/pg_statistic.h"
  #include "catalog/pg_type.h"
*************** icnlikesel(PG_FUNCTION_ARGS)
*** 1440,1445 ****
--- 1441,1479 ----
  }

  /*
+  *        boolvarsel        - Selectivity of Boolean variable.
+  *
+  * This can actually be called on any expression involving only Vars of
+  * the specified relation.  If there are statistics about the Var or
+  * expression (the latter can happen, if it's indexed) then we'll produce
+  * a real estimate; otherwise we default to 0.5.
+  */
+ Selectivity
+ boolvarsel(PlannerInfo *root, Node *arg, int varRelid)
+ {
+     VariableStatData vardata;
+     double        selec;
+
+     examine_variable(root, arg, varRelid, &vardata);
+     if (HeapTupleIsValid(vardata.statsTuple))
+     {
+         /*
+          * A boolean variable V is equivalent to the clause V = 't', so we
+          * compute the selectivity as if that is what we have.
+          */
+         selec = var_eq_const(&vardata, BooleanEqualOperator,
+                              BoolGetDatum(true), false, true);
+     }
+     else
+     {
+         /* Punt and estimate 0.5 */
+         selec = 0.5;
+     }
+     ReleaseVariableStats(vardata);
+     return selec;
+ }
+
+ /*
   *        booltestsel        - Selectivity of BooleanTest Node.
   */
  Selectivity
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index b3d8017..a7433d9 100644
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern Datum icregexnejoinsel(PG_FUNCTIO
*** 164,169 ****
--- 164,170 ----
  extern Datum nlikejoinsel(PG_FUNCTION_ARGS);
  extern Datum icnlikejoinsel(PG_FUNCTION_ARGS);

+ extern Selectivity boolvarsel(PlannerInfo *root, Node *arg, int varRelid);
  extern Selectivity booltestsel(PlannerInfo *root, BoolTestType booltesttype,
              Node *arg, int varRelid,
              JoinType jointype, SpecialJoinInfo *sjinfo);
diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out
index 9d3540f..88dcca3 100644
*** a/src/test/regress/expected/rowsecurity.out
--- b/src/test/regress/expected/rowsecurity.out
*************** EXPLAIN (COSTS OFF) SELECT * FROM docume
*** 186,204 ****
  (7 rows)

  EXPLAIN (COSTS OFF) SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle);
!                               QUERY PLAN
! ----------------------------------------------------------------------
   Hash Join
!    Hash Cond: (category.cid = document.cid)
!    ->  Seq Scan on category
     ->  Hash
!          ->  Subquery Scan on document
!                Filter: f_leak(document.dtitle)
!                ->  Seq Scan on document document_1
!                      Filter: (dlevel <= $0)
!                      InitPlan 1 (returns $0)
!                        ->  Index Scan using uaccount_pkey on uaccount
!                              Index Cond: (pguser = "current_user"())
  (11 rows)

  -- only owner can change policies
--- 186,204 ----
  (7 rows)

  EXPLAIN (COSTS OFF) SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle);
!                            QUERY PLAN
! ----------------------------------------------------------------
   Hash Join
!    Hash Cond: (document.cid = category.cid)
!    ->  Subquery Scan on document
!          Filter: f_leak(document.dtitle)
!          ->  Seq Scan on document document_1
!                Filter: (dlevel <= $0)
!                InitPlan 1 (returns $0)
!                  ->  Index Scan using uaccount_pkey on uaccount
!                        Index Cond: (pguser = "current_user"())
     ->  Hash
!          ->  Seq Scan on category
  (11 rows)

  -- only owner can change policies

Re: Bad row estimation with indexed func returning bool

От
Tom Lane
Дата:
I wrote:
>> The implication of doing it like this would be that the default estimate
>> in the absence of any matching stats would be 0.5 (since eqsel defaults
>> to 1/ndistinct, and get_variable_numdistinct will report 2.0 for any
>> boolean-type expression it has no stats for).  That's not a huge change
>> from the existing 0.3333333 estimate, which seems pretty unprincipled
>> anyway ... but it would probably be enough to annoy people if we did it in
>> stable branches.  So I'd be inclined to propose changing this in HEAD and
>> maybe 9.5, but not further back.  (For non-function expressions, 0.5 is
>> the default already, so those would not change behavior.)

> I experimented with the attached patch.  The change in the default
> estimate for a function results in just one change in the standard
> regression test results, so far as I can find.

After more thought I concluded that changing the default behavior is
something not to do without a lot more testing than I have time for now.
So I modified the patch to preserve the old default estimate for
statistics-less function calls, and pushed it.
        regards, tom lane