"Peter J. Holzer" <hjp-pgsql@hjp.at> writes:
> we'll consider just three columns, which we unimaginatively call a, b,
> and c. There are also three indexes:
> t_a_idx btree (a) WHERE a IS NOT NULL
> t_b_idx btree (b) WHERE b IS NOT NULL
> t_a_b_idx btree (a, b) WHERE a IS NOT NULL AND b IS NOT NULL
> Nowe I have a query
> select c from t where a='A' and b='B';
> This uses t_b_idx, not - as I expected - t_a_b_idx.
> The distribution of values in columns a and b is quite different: a has
> 10 different values of similar frequency (and no null values). b has
> only a single non-null value which with a frequency of about 1 %.
> So I definitely understand why it would prefer t_b_idx to t_a_idx, but
> certainly t_a_b_idx should be even better?
Not necessarily --- t_a_b_idx is (presumably) physically bigger than
t_b_idx, which makes it more expensive to search. The additional
selectivity gain apparently doesn't outweigh that.
> If I create an index with the columns swapped:
> t_b_a_idx btree (b, a) WHERE b IS NOT NULL and a IS NOT NULL
> this index will be used.
Hmm. Probably that has something to do with a calculation about
the selectivity of the leading index column, ie do you have to
scan 10% of the index or 1% of the index. It's not taking the
partial-index filter into account in that, I suspect, which skews
the results in this case --- but that would be hard to account
for accurately. Anyway I can't get excited about optimizing for
a single non-null value.
regards, tom lane