Thread: Re: [GENERAL] Problem (bug?) with like

Re: [GENERAL] Problem (bug?) with like

From: Tom Lane
Date:
bombadil@wanadoo.es writes:
> 1) # explain SELECT * from v_cliente where nombre like '%DA%';

> Merge Join  (cost=54763.50..62874.36 rows=413980 width=183)
>   ->  Sort  (cost=16238.44..16238.44 rows=54 width=131)
>         ->  Seq Scan on cliente c  (cost=0.00..16236.88 rows=54 width=131)
>   ->  Sort  (cost=38525.06..38525.06 rows=20097 width=74)
>         ->  Subquery Scan d  (cost=891.91..37088.66 rows=20097 width=74)
>               ->  Hash Join  (cost=891.91..37088.66 rows=20097 width=74)
>                   ...

> 2) explain SELECT * from v_cliente where nombre like '%DAVID%';

> Nested Loop  (cost=891.91..61414.58 rows=7638 width=183)
>   ->  Seq Scan on cliente c  (cost=0.00..16236.88 rows=1 width=131)
>   ->  Subquery Scan d  (cost=891.91..37088.66 rows=20097 width=74)
>         ->  Hash Join  (cost=891.91..37088.66 rows=20097 width=74)
>             ... [same subplan as above]

The problem here is that the planner is being way too optimistic about
the selectivity of LIKE '%DAVID%' --- notice the estimate that only
one matching row will be found in cliente, rather than 54 as with '%DA%'.
So it chooses a plan that avoids the sort overhead needed for an
efficient merge join with the other tables.  That would be a win if
there were only one matching row, but as soon as there are lots, it's
a big loss, because the subquery to join the other tables gets redone
for every matching row :-(

>> Also, how many rows are there really that match '%DA%' and '%DAVID%'?

>  1)   2672 rows    -> 3.59 sec.
>  2)   257 rows     -> 364.69 sec.

I am thinking that the rules for selectivity of LIKE patterns probably
need to be modified.  Presently the code assumes that a long constant
string has probability of occurrence proportional to the product of the
probabilities of the individual letters.  That might be true in a random
world, but people don't search for random strings.  I think we need to
back off the selectivity estimate by some large factor to account for
the fact that the pattern being searched for is probably not random.
Anyone have ideas how to do that?
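
In outline, the current logic amounts to something like the following
sketch (illustrative only, not the actual selfuncs.c code):

    #include <stdio.h>

    #define FIXED_CHAR_SEL 0.04    /* current assumption: about 1/25 */

    /*
     * Each literal character in the pattern multiplies the selectivity
     * by FIXED_CHAR_SEL, as if characters occurred independently at
     * random; wildcards are simply skipped in this simplified sketch.
     */
    static double
    like_selectivity(const char *pattern)
    {
        double      sel = 1.0;

        for (; *pattern; pattern++)
        {
            if (*pattern == '%' || *pattern == '_')
                continue;
            sel *= FIXED_CHAR_SEL;
        }
        return sel;
    }

    int
    main(void)
    {
        printf("%%DA%%    -> %g\n", like_selectivity("%DA%"));    /* 0.0016 */
        printf("%%DAVID%% -> %g\n", like_selectivity("%DAVID%")); /* 1.024e-07 */
        return 0;
    }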

            regards, tom lane


Re: [GENERAL] Problem (bug?) with like

From: Bruce Momjian
Date:
> I am thinking that the rules for selectivity of LIKE patterns probably
> need to be modified.  Presently the code assumes that a long constant
> string has probability of occurrence proportional to the product of the
> probabilities of the individual letters.  That might be true in a random
> world, but people don't search for random strings.  I think we need to
> back off the selectivity estimate by some large factor to account for
> the fact that the pattern being searched for is probably not random.
> Anyone have ideas how to do that?

But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
different from the first, so our optimizer may be fine with that.  Is it
only when the strings get longer that we lose specificity?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: [GENERAL] Problem (bug?) with like

From: Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
> different from the first, so our optimizer may be fine with that.  Is it
> only when the strings get longer that we lose specificity?

Yeah, I don't think that the estimates are bad for one or two
characters.  But the estimate gets real small real fast as you
increase the number of match characters in the LIKE pattern.
We need to slow that down some.

            regards, tom lane

Re: [GENERAL] Problem (bug?) with like

From: Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
> > different from the first, so our optimizer may be fine with that.  Is it
> > only when the strings get longer that we lose specificity?
>
> Yeah, I don't think that the estimates are bad for one or two
> characters.  But the estimate gets real small real fast as you
> increase the number of match characters in the LIKE pattern.
> We need to slow that down some.

Yea, maybe a log base 2 decrease:

    1 char    1x
    2 char    2x
    4 char    3x
    8 char    4x
    16 char   5x
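
That is, a backoff factor of floor(log2(n)) + 1 for n pattern
characters; as a toy illustration (not a patch):

    #include <math.h>
    #include <stdio.h>

    /*
     * Hypothetical log-base-2 backoff: n pattern characters reduce the
     * estimate by a factor of floor(log2(n)) + 1, far gentler than the
     * current 25x-per-character behavior.
     */
    static double
    log2_backoff(int nchars)
    {
        return floor(log2((double) nchars)) + 1.0;
    }

    int
    main(void)
    {
        int         n;

        for (n = 1; n <= 16; n *= 2)
            printf("%2d char -> %.0fx\n", n, log2_backoff(n));
        return 0;
    }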

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: [GENERAL] Problem (bug?) with like

From: Hannu Krosing
Date:

Tom Lane wrote:

>Bruce Momjian <pgman@candle.pha.pa.us> writes:
>
>>But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
>>different from the first, so our optimizer may be fine with that.  Is it
>>only when the strings get longer that we lose specificity?
>>
>
>Yeah, I don't think that the estimates are bad for one or two
>characters.  But the estimate gets real small real fast as you
>increase the number of match characters in the LIKE pattern.
>We need to slow that down some.
>
Could we just assign weights to the first few characters and then
consider only those first few characters when determining the
probability of finding a match?

If someone searches for '%New York City%', we have quite good reason to
believe that there are some of those in there, so said weights should
factor in the fact that one usually searches for strings that do exist.

Another option would be to gather statistics not only on individual
letters but on bi- or trigraphs.  Then as a next step we could
implement proper trigraph indexes ;)
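
Gathering trigraph statistics could start with something like this toy
sketch (just the idea, nothing like real backend code):

    #include <stdio.h>
    #include <string.h>

    /*
     * Toy sketch: emit every trigraph (3-character window) of a string.
     * Real statistics gathering would count trigraph frequencies across
     * a sample of column values, not just print them.
     */
    static void
    print_trigraphs(const char *s)
    {
        size_t      len = strlen(s);
        size_t      i;

        for (i = 0; i + 3 <= len; i++)
            printf("'%.3s'\n", s + i);
    }

    int
    main(void)
    {
        print_trigraphs("New York City");
        return 0;
    }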

----------------
Hannu



Re: [GENERAL] Problem (bug?) with like

From: Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
> > different from the first, so our optimizer may be fine with that.  Is it
> > only when the strings get longer that we lose specificity?
>
> Yeah, I don't think that the estimates are bad for one or two
> characters.  But the estimate gets real small real fast as you
> increase the number of match characters in the LIKE pattern.
> We need to slow that down some.

See my earlier email about the 50% idea for LIKE.  Do ordinary string
comparisons also have this problem?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: [GENERAL] Problem (bug?) with like

From: Bruce Momjian
Date:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
> > > different from the first, so our optimizer may be fine with that.  Is it
> > > only when the strings get longer that we lose specificity?
> >
> > Yeah, I don't think that the estimates are bad for one or two
> > characters.  But the estimate gets real small real fast as you
> > increase the number of match characters in the LIKE pattern.
> > We need to slow that down some.
>
> Yea, maybe a log base 2 decrease:
>
>     1 char    1x
>     2 char    2x
>     4 char    3x
>     8 char    4x
>     16 char   5x

I did a little research on this.  I think the problem is that ordinary
characters are assumed to randomly appear in a character string, while
in practice, if the string has already been specified like 'DAV', there
are very few additional characters that can follow it and make sense.

Looking at backend/utils/adt/selfuncs.c, I see this:

    #define FIXED_CHAR_SEL    0.04    /* about 1/25 */
...
    sel *= FIXED_CHAR_SEL;

which means every additional character reduces the selectivity by 96%.
This seems much too restrictive to me.  Because of the new optimizer
buckets, we do have good statistics on the leading character, but
additional characters drastically reduce selectivity.  I think perhaps a
number like 0.50 or 50% may be correct.

That would be a table like this:

     1 char    2x
     2 char    4x
     4 char    16x
     8 char    256x
     16 char   65536x

which is more restrictive than I initially suggested above but less
restrictive than we have now.
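
To put numbers on that comparison: n literal characters shrink the
estimate by a factor of (1/FIXED_CHAR_SEL)^n, so the two constants
compare like this (a throwaway test program):

    #include <math.h>
    #include <stdio.h>

    /*
     * Reduction factor implied by a per-character selectivity: n literal
     * characters shrink the estimate by (1/sel)^n.  Compare the current
     * 0.04 with the proposed 0.50.
     */
    static double
    reduction(double sel, int nchars)
    {
        return pow(1.0 / sel, (double) nchars);
    }

    int
    main(void)
    {
        int         n;

        for (n = 1; n <= 16; n *= 2)
            printf("%2d char: 0.04 -> %.0fx, 0.50 -> %.0fx\n",
                   n, reduction(0.04, n), reduction(0.50, n));
        return 0;
    }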

Should we assume additional characters are indeed randomly appearing in
the string?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: [GENERAL] Problem (bug?) with like

From: Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Do ordinary string
> comparisons also have this problem?

No, only LIKE and regex matching.
        regards, tom lane


Re: [GENERAL] Problem (bug?) with like

From: Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
> > different from the first, so our optimizer may be fine with that.  Is it
> > only when the strings get longer that we lose specificity?
>
> Yeah, I don't think that the estimates are bad for one or two
> characters.  But the estimate gets real small real fast as you
> increase the number of match characters in the LIKE pattern.
> We need to slow that down some.

OK, I think I have the proper value for FIXED_CHAR_SEL.  It is currently
0.04, or about 1/25, meaning the letters are assumed to be random,
though this is not usually the case.

If we assume our optimizer buckets have given us a reasonable value for
the first character, say an 'F', then there are only a few valid
characters that can follow it, at least in English: the vowels and a few
consonants.  Given that next character, again only a few characters can
validly follow it, and so on.  To my thinking, two real-world characters
carry about the same selectivity as one random character, which makes
0.20 the proper value for FIXED_CHAR_SEL, because 0.20 * 0.20 equals the
current one-character value of 0.04.

Added to TODO:

  * Change FIXED_CHAR_SEL to 0.20 from 0.04 to give better selectivity (Bruce)

If people think there is a better value for this, please chime in.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: [GENERAL] Problem (bug?) with like

From: Bruce Momjian
Date:
> The problem here is that the planner is being way too optimistic about
> the selectivity of LIKE '%DAVID%' --- notice the estimate that only
> one matching row will be found in cliente, rather than 54 as with '%DA%'.
> So it chooses a plan that avoids the sort overhead needed for an
> efficient merge join with the other tables.  That would be a win if
> there were only one matching row, but as soon as there are lots, it's
> a big loss, because the subquery to join the other tables gets redone
> for every matching row :-(
>
> >> Also, how many rows are there really that match '%DA%' and '%DAVID%'?
>
> >  1)   2672 rows    -> 3.59 sec.
> >  2)   257 rows     -> 364.69 sec.
>
> I am thinking that the rules for selectivity of LIKE patterns probably
> need to be modified.  Presently the code assumes that a long constant
> string has probability of occurrence proportional to the product of the
> probabilities of the individual letters.  That might be true in a random
> world, but people don't search for random strings.  I think we need to
> back off the selectivity estimate by some large factor to account for
> the fact that the pattern being searched for is probably not random.
> Anyone have ideas how to do that?

Let's apply the new 0.20 value for FIXED_CHAR_SEL to the example above.
For DA and DAVID we get:

    1) DA     0.20 ^ 2 = 0.04
    2) DAVID  0.20 ^ 5 = 0.00032

Dividing these two gives:

    0.04 / 0.00032 = 125

while the actual row counts reported above give:

    2672 / 257 = ~10.4

The old 0.04 value gives:

    0.04 ^ 2 = 0.0016
    0.04 ^ 5 = 0.0000001024
    0.0016 / 0.0000001024 = 15625

Clearly the ratio implied by the 0.20 value is about 10x too large,
while the ratio implied by 0.04 is about 1000x too large.  Because this
was a contrived example, and because some fields contain more random
text than DAVID, I think 0.20 is the proper value.
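
A throwaway test program to re-check that arithmetic:

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        /*
         * Estimated selectivity ratio between '%DA%' (2 characters) and
         * '%DAVID%' (5 characters) under each per-character constant,
         * versus the ratio actually observed (2672 vs. 257 rows).
         */
        double      est_20 = pow(0.20, 2) / pow(0.20, 5);   /* 125 */
        double      est_04 = pow(0.04, 2) / pow(0.04, 5);   /* 15625 */
        double      actual = 2672.0 / 257.0;                /* ~10.4 */

        printf("0.20 ratio:   %g\n", est_20);
        printf("0.04 ratio:   %g\n", est_04);
        printf("actual ratio: %g\n", actual);
        return 0;
    }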

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026