PoC: prefetching index leaf pages (for inserts)

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема PoC: prefetching index leaf pages (for inserts)
Дата
Msg-id 8081aa62-4032-927c-8411-548e1b87ed5b@enterprisedb.com
обсуждение исходный текст
Ответы Re: PoC: prefetching index leaf pages (for inserts)  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Список pgsql-hackers
Hi,

Some time ago I started a thread about prefetching heap pages during
index scans [1]. That however only helps when reading rows, not when
inserting them.

Imagine inserting a row into a table with many indexes - we insert the
row into the table, and then into the indexes one by one, synchronously.
We walk the index, determine the appropriate leaf page, read it from
disk into memory, insert the index tuple, and then do the same thing for
the next index.

If there are many indexes, and there's not much correlation with the
table, this may easily results in I/O happening synchronously with queue
depth 1. Hard to make it even slower ...

This can be a problem even with a modest number of indexes - imagine
bulk-loading data into a table using COPY (say, in 100-row batches).
Inserting the rows into heap happens in a bulk, but the indexes are
still modified in a loop, as if for single-row inserts. Not great.

The with multiple connections the concurrent I/O may be generated that
way, but for low-concurrency workloads (e.g. batch jobs) that may not
really work.

I had an idea what we might do about this - we can walk the index,
almost as if we're inserting the index tuple, but only the "inner"
non-leaf pages. And instead of descending to the leaf page, we just
prefetch it. The non-leaf pages are typically <1% of the index, and hot,
so likely already cached, so not worth prefetching those.

The attached patch does a PoC of this. It adds a new AM function
"amprefetch", with an implementation for btree indexes, mimicking the
index lookup, except that it only prefetches the leaf page as explained
a bit earlier.

In the executor, this is wrapped in ExecInsertPrefetchIndexes() which
gets called in various places right before ExecInsertPrefetchIndexes().
I thought about doing that in ExecInsertPrefetchIndexes() directly, but
that would not work for COPY, where we want to issue the prefetches for
the whole batch, not for individual tuples.

This may need various improvements - the prefetch duplicates a couple
steps that could be expensive (e.g. evaluation of index predicates,
forming index tuples, and so on). Would be nice to improve this, but
good enough for PoC I think.

Another gap is lack of incremental prefetch (ramp-up). We just prefetch
all the indexes, for all tuples. But I think that's OK. We know we'll
need those pages, and the number is fairly limited.

There's a GUC enable_insert_prefetch, that can be used to enable this
insert prefetching.

I did a simple test on two machines - one with SATA SSD RAID, one with
NVMe SSD. In both cases the data (table+indexes) are an order of
magnitude larger than RAM. The indexes are on UUID, so pretty random and
there's no correlation. Then batches of 100, 1000 and 10000 rows are
inserted, with/without the prefetching.

With 5 indexes, the results look like this:

SATA SSD RAID
-------------

 rows     no prefetch        prefetch
  100      176.872 ms       70.910 ms
 1000     1035.056 ms      590.495 ms
10000     8494.836 ms     3216.206 ms


NVMe
----

 rows     no prefetch        prefetch
  100      133.365 ms       72.899 ms
 1000     1572.379 ms      829.298 ms
10000    11889.143 ms     3621.981 ms


Not bad, I guess. Cutting the time to ~30% is nice.

The fewer the indexes, the smaller the difference (with 1 index there is
almost no difference), of course.


regards


[1] https://commitfest.postgresql.org/45/4351/

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Vik Fearing
Дата:
Сообщение: Re: Why is hot_standby_feedback off by default?
Следующее
От: Tomas Vondra
Дата:
Сообщение: Re: BRIN minmax multi - incorrect distance for infinite timestamp/date