Discussion: Naive handling of inequalities by nbtree initial positioning code
Suppose that I create the following index on the tenk1 table from the regression tests:

create index on tenk1 (two, four, hundred, thousand, tenthous);

Now the following query will be able to use index quals for each column that appears in my composite index:

select * from tenk1
where
  two = 1
  and four = 3
  and hundred = 91
  and thousand = 891
  and tenthous = 1891;

The query returns one row, and touches 3 buffers/pages (according to EXPLAIN ANALYZE with buffers). The overheads here make perfect sense: there's one root page access, one leaf page access, and a single heap page access. Clearly the nbtree initial positioning code is able to descend to the exact leaf page (and page offset) where the first possible match could be found. Pretty standard stuff.

But if I rewrite this query to use an inequality, the picture changes. If I replace "four = 3" with "four > 2", I get a query that is very similar to the original (and that returns the same single row):

select * from tenk1
where
  two = 1
  and four > 2
  and hundred = 91
  and thousand = 891
  and tenthous = 1891;

This time our query touches 16 buffers/pages. That's a total of 15 index pages accessed, of which 14 are leaf pages. We'll needlessly plow through an extra 13 leaf pages before finally arriving at the first leaf page that might *actually* have a real match for us.

We can and should find a way for the second query to descend to the same leaf page directly, so that the physical access patterns match those that we saw with the first query. Only the first query can use an insertion scan key with all 4 attributes filled in to find its initial scan position. The second query uses an insertion scan key with values set for the first 2 index columns (on two and four) only. EXPLAIN offers no hint that this is what happens -- the "Index Cond:" shown for each query is practically identical.
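The gap between the two initial descents is easy to reproduce outside the server. Below is a minimal Python simulation (my own sketch, not PostgreSQL code) of the two start positions, assuming tenk1's usual column definitions from the regression suite (two = i % 2, four = i % 4, hundred = i % 100, thousand = i % 1000, tenthous = i for row i); the names full_pos and prefix_pos are mine:

```python
from bisect import bisect_left

# Index tuples for the 5-column index, using tenk1's column definitions
# from the regression suite: two = i % 2, four = i % 4, hundred = i % 100,
# thousand = i % 1000, tenthous = i (for 0 <= i < 10000).
keys = sorted((i % 2, i % 4, i % 100, i % 1000, i % 10000)
              for i in range(10000))

# First query: all four "=" quals go into the insertion scan key
# (keysz=4, nextkey=false), so the descent lands on the exact position.
full_pos = bisect_left(keys, (1, 3, 91, 891))

# Second query: the "four > 2" inequality truncates the insertion scan key
# after its first two attributes (keysz=2, nextkey=true) -- i.e. find the
# first tuple whose (two, four) prefix is strictly greater than (1, 2).
prefix_pos = bisect_left(keys, (1, 2, float("inf")))

skipped = full_pos - prefix_pos
print(f"index tuples between the two start positions: {skipped}")
```

In this simulation the truncated-key descent starts 2,280 index tuples before the full-key descent does; those are the tuples filling the extra leaf pages that the "bad" query plows through.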
It seems to me that Markus Winand had a very good point when he complained that we don't expose this difference directly (e.g., by identifying which columns appear in "Access predicates" and which columns are merely "Index filter predicates") [1]. That would make these kinds of issues a lot more obvious.

The nbtree code is already capable of tricks that are close enough to what I'm thinking of here. Currently, nbtree's initial positioning code will set BTScanInsertData.nextkey=false for the first query (where BTScanInsertData.keysz=4), and BTScanInsertData.nextkey=true for the second query (where BTScanInsertData.keysz=2 right now). So the second query I came up with does at least manage to locate the leaf page where "four = 3" tuples begin, even today -- its "four > 2" inequality is at least "handled efficiently".

The inefficiencies come from how nbtree handles the remaining two index columns when building an insertion scan key for our initial descent. nbtree will treat the inequality as making it unsafe to include further values for the remaining two attributes, which is the real source of the extra leaf page scans (though of course the later attributes are still usable as search-type scan keys). But it's *not* unsafe to position ourselves on the right leaf page from the start. Not really.

All that it would take to fix the problem is per-attribute BTScanInsertData.nextkey values. There is no reason why "nextkey" semantics should only work for the last attribute in the insertion scan key. Under this scheme, _bt_first() would be taught to set up the insertion scan key with (say) nextkey=true for the "four > 2" attribute, and nextkey=false for the other 3 attributes (since we do that whenever >= or = are used). It would probably also make sense to generalize this approach to handle (say) a third query that had a "four < 2" inequality, but otherwise matched the first two queries. So we wouldn't literally use multiple "nextkey" fields to do this.
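To make the per-attribute nextkey idea concrete, here is a toy Python model (my own sketch, not nbtree code) of a comparison routine where the "four > 2" attribute gets strictly-greater semantics while the other attributes keep ordinary equality semantics. Note the caveat in the follow-up message: this simple scheme only stays well-defined here because a single distinct "four" value (3) populates the "two = 1, four > 2" region of the data.

```python
from bisect import bisect_left

keys = sorted((i % 2, i % 4, i % 100, i % 1000, i % 10000)
              for i in range(10000))   # tenk1-style index tuples

KEY = (1, 2, 91, 891)  # two = 1, four > 2, hundred = 91, thousand = 891
INEQ_ATTR = 1          # attribute with "nextkey" (strictly-greater) semantics

def before_key(tup):
    """Does tup sort before the insertion scan key?  The inequality
    attribute treats equal values as "before" (nextkey semantics); the
    remaining attributes compare normally."""
    for attr, keyval in enumerate(KEY):
        if attr == INEQ_ATTR:
            if tup[attr] <= keyval:
                return True        # still inside the "four <= 2" region
            continue               # four > 2: later attributes still apply
        if tup[attr] != keyval:
            return tup[attr] < keyval
    return False

# Binary search for the first tuple not before the key (a stand-in for
# the _bt_search/_bt_binsrch descent).
lo, hi = 0, len(keys)
while lo < hi:
    mid = (lo + hi) // 2
    if before_key(keys[mid]):
        lo = mid + 1
    else:
        hi = mid

print(keys[lo])   # the exact tuple a full 4-attribute descent would find
```

In this data set the search lands on exactly the position that the all-equality query reaches directly. Had "two = 1" tuples with several distinct "four" values beyond 2 existed, the tuples would no longer be partitioned by before_key, which is the problem taken up next.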
The most general approach seems to require that we teach insertion scan key routines like _bt_compare() about "goback" semantics, which must also work at the attribute granularity. So we'd probably replace both "nextkey" and "goback" with something new and more general.

I already wrote a patch (still in the CF queue) to teach nbtree insertion scan keys about "goback" semantics [2] (whose use would still be limited to backwards scans), so that we'd avoid needlessly accessing extra pages in so-called boundary cases (which seems like a much less important problem than the one I've highlighted here). That existing patch already removed code in _bt_first that handled "stepping back" once we're on the leaf level. ISTM that the right place to do stuff like that is in routines like _bt_search, _bt_binsrch, and _bt_compare -- not in _bt_first. The code around _bt_compare seems like it would benefit from having more of this sort of context. Having the full picture matters both when searching internal pages and leaf pages.

Thoughts? Was this issue discussed at some point in the past?

[1] https://use-the-index-luke.com/sql/explain-plan/postgresql/filter-predicates
[2] https://www.postgresql.org/message-id/flat/CAH2-Wz=XPzM8HzaLPq278Vms420mVSHfgs9wi5tjFKHcapZCEw@mail.gmail.com

--
Peter Geoghegan
On Sun, Aug 13, 2023 at 5:50 PM Peter Geoghegan <pg@bowt.ie> wrote:
> select * from tenk1
> where
>   two = 1
>   and four = 3
>   and hundred = 91
>   and thousand = 891
>   and tenthous = 1891;
>
> The query returns one row, and touches 3 buffers/pages (according to
> EXPLAIN ANALYZE with buffers). The overheads here make perfect sense:
> there's one root page access, one leaf page access, and a single heap
> page access. Clearly the nbtree initial positioning code is able to
> descend to the exact leaf page (and page offset) where the first
> possible match could be found. Pretty standard stuff.

I probably should have made this first query use "four >= 3" instead of using "four = 3" (while still using "four > 2" for the second, "bad" query). The example works a bit better that way because now the queries are logically equivalent, and yet still have this big disparity. (We get 4 buffer hits for the "good" >= query, but 16 buffer hits for the equivalent "bad" > query.)

--
Peter Geoghegan
On Sun, Aug 13, 2023 at 5:50 PM Peter Geoghegan <pg@bowt.ie> wrote:
> All that it would take to fix the problem is per-attribute
> BTScanInsertData.nextkey values. There is no reason why "nextkey"
> semantics should only work for the last attribute in the insertion
> scan key. Under this scheme, _bt_first() would be taught to set up the
> insertion scan key with (say) nextkey=true for the "four > 2"
> attribute, and nextkey=false for the other 3 attributes (since we do
> that whenever >= or = are used). It would probably also make sense to
> generalize this approach to handle (say) a third query that had a
> "four < 2" inequality, but otherwise matched the first two queries. So
> we wouldn't literally use multiple "nextkey" fields to do this.

Actually, that can't work when there are a huge number of index tuples with the same values for "four" (enough to span many internal pages). So we'd need specialized knowledge of the data type (probably from an opclass support function) to transform "four > 2" into "four >= 3" up front. Alternatively, we could get the same effect from an initial index probe. The latter approach would be needed for continuous data types, where the transformation isn't possible at all.

The probing approach could work by finding an initial position in the same way as we currently locate an initial leaf page -- the way that I complained about earlier on, but with an important twist. Once we'd established that the first "four" value in the index > 2 really was 3 (or whatever it turned out to be), we could fill that value into a new insertion scan key. It would then be possible to do another descent of the index, skipping over most of the leaf pages that we'll access needlessly right now. (Actually, we'd only do all that when it seemed likely to allow us to skip a significant number of intermediate leaf pages -- which is what we saw in my test case.)

This is directly related to skip scan.
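A sketch of that two-descent ("probe, then re-descend") strategy, in a toy Python model of the index (tenk1-style tuples in sorted order; my own illustration, not actual nbtree code):

```python
from bisect import bisect_left

keys = sorted((i % 2, i % 4, i % 100, i % 1000, i % 10000)
              for i in range(10000))   # tenk1-style index tuples

# Descent 1 (the probe): locate the first tuple strictly past the
# (two, four) = (1, 2) prefix -- exactly what the truncated insertion
# scan key with nextkey=true finds today.
probe = bisect_left(keys, (1, 2, float("inf")))
assert keys[probe][0] == 1            # still inside the two = 1 region
next_four = keys[probe][1]            # first "four" value that is > 2

# Descent 2: "four > 2" has been hardened into "four = next_four", so a
# full 4-attribute insertion scan key can now be used, skipping the
# intermediate leaf pages in one go.
final = bisect_left(keys, (1, next_four, 91, 891))
print(next_four, keys[final])
```

The "dense" alternative described above would skip the probe entirely: an opclass support function would increment 2 directly to 3, leaving only the second descent.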
The former approach is more or less what the MDAM paper calls "dense" access (which is naturally limited to discrete data types like integer), while the latter probing approach is what it calls "sparse" access. Skip scan performs this process repeatedly, most of the time, but we'd only skip once here.

In fact, if my example had used (say) "four > 1" instead, then it would have made sense to skip multiple times -- not just once, after an initial descent. Because then we'd have had to consider matches for both "two=1 and four=2" and "two=1 and four=3" (there aren't any "two=1 and four=4" matches, so we'd then be done).

In fact, had there been no mention of the "four" column in the query whatsoever (which is how we tend to think of skip scan), then a decent implementation of skip scan would effectively behave as if the query had been written "two=1 and four > -inf and ...", while following the same general approach. (Or "two=1 and four < +inf and ...", if this was a similar-looking backwards scan.)

--
Peter Geoghegan
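For the "four > 1" variant, the sparse-access flavour of that repeated skipping might look like this in a toy Python model of the index (again my own illustration, not nbtree code):

```python
from bisect import bisect_left, bisect_right

keys = sorted((i % 2, i % 4, i % 100, i % 1000, i % 10000)
              for i in range(10000))   # tenk1-style index tuples

matches = []
# Start just past the (two, four) = (1, 1) prefix, i.e. "two = 1 and four > 1".
pos = bisect_left(keys, (1, 1, float("inf")))
while pos < len(keys) and keys[pos][0] == 1:
    four = keys[pos][1]               # next distinct "four" value present
    # Re-descend with the full insertion scan key for this "four" value.
    lo = bisect_left(keys, (1, four, 91, 891, 1891))
    hi = bisect_right(keys, (1, four, 91, 891, 1891))
    matches.extend(keys[lo:hi])
    # Skip past the rest of this (two, four) group in a single step.
    pos = bisect_left(keys, (1, four, float("inf")))

print(matches)
```

With tenk1's data only four = 3 actually occurs in the two = 1 region (two = 1 forces an odd "four"), so the loop body runs once here; a dense implementation would instead probe the candidate values 2, 3, 4, ... in turn, and the no-inequality case degenerates to starting from "four > -inf" exactly as described above.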