Thread: Lots of read activity on index only scan

Lots of read activity on index only scan

From: "Peter J. Holzer"
Date:
This is triggered by Hans-Jürgen Schönig's blog post on B-Tree vs. BRIN
indexes here:
https://www.cybertec-postgresql.com/en/btree-vs-brin-2-options-for-indexing-in-postgresql-data-warehouses/

He creates a rather large table with a sorted and an unsorted column
plus two indexes:

CREATE TABLE t_demo (id_sorted int8, id_random int8);
INSERT INTO t_demo
       SELECT id, hashtext(id::text)
       FROM generate_series(1, 2000000000) AS id;
create index on t_demo (id_sorted);
create index on t_demo (id_random);

(Actually, his table is even larger, but I didn't have enough free space
on my test server and didn't want to extend the file system just for
this test. Anyway. That's large enough that the setup takes longer than
a cup of coffee, so maybe you want to go even lower.)
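
One step not shown above, but essential for genuine index-only scans, is
vacuuming after the bulk load so that the visibility map gets set (as
comes up later in the thread, the table was in fact vacuumed before the
tests). Something like:

VACUUM ANALYZE t_demo;

Without that, the scans below would report nonzero heap fetches.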

The table is now about 82 GB, and the indexes 42 GB and 40 GB,
respectively.
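
For reference, sizes like these can be read back with the built-in size
functions (the index names are the ones PostgreSQL generated
automatically; they also appear in the EXPLAIN output below):

SELECT pg_size_pretty(pg_relation_size('t_demo'))               AS table_size,
       pg_size_pretty(pg_relation_size('t_demo_id_sorted_idx')) AS sorted_idx_size,
       pg_size_pretty(pg_relation_size('t_demo_id_random_idx')) AS random_idx_size;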

Then select about ten million values via each index:

wds=> select count(*) from t_demo where id_sorted between 10000000 and 20000000;
╔══════════╗
║  count   ║
╟──────────╢
║ 10000001 ║
╚══════════╝
(1 row)

Time: 349.186 ms

wds=> select count(*) from t_demo where id_random between 10000000 and 31475500;
╔══════════╗
║  count   ║
╟──────────╢
║ 10000028 ║
╚══════════╝
(1 row)

Time: 1296.489 ms (00:01.296)

The second is a bit slower but (on this server with lots of RAM) not
that much. However, if you look at the EXPLAIN output, the difference is
striking (wide output, sorry):

wds=> explain (analyze, buffers) select count(*) from t_demo where id_sorted between 10000000 and 20000000;

                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=203279.97..203279.98 rows=1 width=8) (actual time=545.388..559.523 rows=1 loops=1)
   Buffers: shared hit=27336
   ->  Gather  (cost=203279.76..203279.97 rows=2 width=8) (actual time=545.201..559.504 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=27336
         ->  Partial Aggregate  (cost=202279.76..202279.77 rows=1 width=8) (actual time=508.272..508.274 rows=1 loops=3)
               Buffers: shared hit=27336
               ->  Parallel Index Only Scan using t_demo_id_sorted_idx on t_demo  (cost=0.58..191374.26 rows=4362197 width=0) (actual time=0.075..352.469 rows=3333334 loops=3)
                     Index Cond: ((id_sorted >= 10000000) AND (id_sorted <= 20000000))
                     Heap Fetches: 0
                     Buffers: shared hit=27336
 Planning:
   Buffers: shared hit=5
 Planning Time: 0.255 ms
 JIT:
   Functions: 11
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 1.217 ms, Inlining 0.000 ms, Optimization 0.762 ms, Emission 12.777 ms, Total 14.756 ms
 Execution Time: 560.422 ms

wds=> explain (analyze, buffers) select count(*) from t_demo where id_random between 10000000 and 31475500;

                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=173323.64..173323.65 rows=1 width=8) (actual time=1513.163..1520.166 rows=1 loops=1)
   Buffers: shared hit=9995216
   ->  Gather  (cost=173323.42..173323.63 rows=2 width=8) (actual time=1513.017..1520.154 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=9995216
         ->  Partial Aggregate  (cost=172323.42..172323.43 rows=1 width=8) (actual time=1492.693..1492.694 rows=1 loops=3)
               Buffers: shared hit=9995216
               ->  Parallel Index Only Scan using t_demo_id_random_idx on t_demo  (cost=0.58..162927.68 rows=3758297 width=0) (actual time=0.058..1338.110 rows=3333343 loops=3)
                     Index Cond: ((id_random >= 10000000) AND (id_random <= 31475500))
                     Heap Fetches: 0
                     Buffers: shared hit=9995216
 Planning Time: 0.186 ms
 JIT:
   Functions: 11
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 1.204 ms, Inlining 0.000 ms, Optimization 0.634 ms, Emission 11.510 ms, Total 13.348 ms
 Execution Time: 1521.068 ms

Both do a parallel index only scan. Both perform 0 heap fetches.
But one reads 27336 buffers (or about 22 bytes per index entry, which
sounds reasonable) while the other reads 9995216 buffers (or almost one
full buffer per row). Why? The entries should be dense in the index in
both cases and since it's an index only scan (and explain says there
were 0 heap fetches) I would not expect extra accesses. Where do these
buffer reads come from?
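
Spelling out the arithmetic behind those numbers: 27336 buffers × 8192
bytes ≈ 224 MB spread over roughly 10^7 index entries is about 22 bytes
per entry, consistent with reading each relevant leaf page about once.
9995216 buffers for the same roughly 10^7 rows works out to almost
exactly one buffer access per row.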

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


Re: Lots of read activity on index only scan

From: Tom Lane
Date:
"Peter J. Holzer" <hjp-pgsql@hjp.at> writes:
> Both do a parallel index only scan. Both perform 0 heap fetches.
> But one reads 27336 buffers (or about 22 bytes per index entry, which
> sounds reasonable) while the other reads 9995216 buffers (or almost one
> full buffer per row). Why? The entries should be dense in the index in
> both cases and since it's an index only scan (and explain says there
> were 0 heap fetches) I would not expect extra accesses. Where do these
> buffer reads come from?

An "index only" scan is only that if the table's all-visible map is
fully set.  Did you vacuum this table after building it, or wait
long enough for autovacuum to do so?
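
Whether the visibility map is fully set can also be checked directly
with the pg_visibility contrib extension; a quick sketch against the
example table:

CREATE EXTENSION pg_visibility;
-- count heap pages not yet marked all-visible; 0 means the VM is fully set
SELECT count(*) FROM pg_visibility_map('t_demo') WHERE NOT all_visible;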

            regards, tom lane



Re: Lots of read activity on index only scan

From: Peter Geoghegan
Date:
On Fri, Nov 18, 2022 at 12:46 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> Both do a parallel index only scan. Both perform 0 heap fetches.
> But one reads 27336 buffers (or about 22 bytes per index entry, which
> sounds reasonable) while the other reads 9995216 buffers (or almost one
> full buffer per row). Why? The entries should be dense in the index in
> both cases and since it's an index only scan (and explain says there
> were 0 heap fetches) I would not expect extra accesses. Where do these
> buffer reads come from?

The index-only scan processes one index leaf page at a time. When there
is naturally a high correlation (or some other kind of clustering) in
how we access VM pages, we can perform visibility checks for many index
tuples per VM page accessed. This is a less severe problem here than it
would be with an equivalent pair of plain index scans, simply because
there are so few VM pages relative to heap pages. But it's more or less
an analogous problem. You're really noticing it here because these
index scans have very low selectivity -- which is kinda unusual in most
environments.
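
To put rough numbers on that: the visibility map tracks each heap page
with two bits, so one 8 kB VM page covers roughly 8192 × 8 / 2 = 32768
heap pages (ignoring the small page header), i.e. about 256 MB of heap.
An 82 GB table therefore has only a few hundred VM pages in total; what
differs between the two plans is how many times each of those pages
must be looked at.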

Also worth bearing in mind that it's unusual to have a perfectly
random and uniformly distributed clustering of index tuples, which is
what the index built via hashing exhibits. Even a v4 UUID index could
easily have plenty of duplicates, which would probably do
significantly better on the metric you've focussed on.


--
Peter Geoghegan



Re: Lots of read activity on index only scan

From: "Peter J. Holzer"
Date:
On 2022-11-18 15:59:46 -0500, Tom Lane wrote:
> "Peter J. Holzer" <hjp-pgsql@hjp.at> writes:
> > Both do a parallel index only scan. Both perform 0 heap fetches.
> > But one reads 27336 buffers (or about 22 bytes per index entry, which
> > sounds reasonable) while the other reads 9995216 buffers (or almost one
> > full buffer per row). Why? The entries should be dense in the index in
> > both cases and since it's an index only scan (and explain says there
> > were 0 heap fetches) I would not expect extra accesses. Where do these
> > buffer reads come from?
>
> An "index only" scan is only that if the table's all-visible map is
> fully set.

Yes, but shouldn't that be reflected in the heap fetches? Or am I
misunderstanding what that means?

Plus:

> Did you vacuum this table after building it, or wait long enough for
> autovacuum to do so?

Both actually. I built the table this morning and did all the tests
again just before posting. And I explicitly vacuumed it during this
second set of tests. I should have mentioned that, sorry. Also possibly
relevant: This is on Pg 15.1 (PGDG build for Ubuntu).

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


Re: Lots of read activity on index only scan

From: "Peter J. Holzer"
Date:
On 2022-11-18 13:09:16 -0800, Peter Geoghegan wrote:
> On Fri, Nov 18, 2022 at 12:46 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> > Both do a parallel index only scan. Both perform 0 heap fetches.
> > But one reads 27336 buffers (or about 22 bytes per index entry, which
> > sounds reasonable) while the other reads 9995216 buffers (or almost one
> > full buffer per row). Why? The entries should be dense in the index in
> > both cases and since it's an index only scan (and explain says there
> > were 0 heap fetches) I would not expect extra accesses. Where do these
> > buffer reads come from?
>
> The index-only scan processes an index leaf page at a time.

There should be about 27000 of them, same as for the other index, right?

> When there is naturally a high correlation (or some kind of
> clustering) in how we access VM pages, we'll naturally be able to do
> more visibility checks covering more index tuples per VM page
> accessed.

So you are saying that these are accesses to the visibility map, not the
base table?

Hmm, that would explain why all the accesses are hits. The whole table
definitely doesn't fit into shared_buffers.
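
For what it's worth, the two sizes are easy to compare:

SHOW shared_buffers;
SELECT pg_size_pretty(pg_table_size('t_demo'));

(pg_table_size includes all forks of the table, among them the
visibility map, which at a few hundred 8 kB pages is tiny and easily
stays cached.)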

> Also worth bearing in mind that it's unusual to have a perfectly
> random and uniformly distributed clustering of index tuples,

Sure. This is a highly contrived example.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


Re: Lots of read activity on index only scan

From: Peter Geoghegan
Date:
On Fri, Nov 18, 2022 at 1:50 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> There should be about 27000 of them, same as for the other index, right?

There aren't that many. The point I'm making is that you can access
each VM page approximately once (and check relatively many index
tuples' TIDs all in one go), or many times. The total number of VM
pages may be constant, but the access patterns are quite different
owing to differences in how the data is clustered in each index.

> > When there is naturally a high correlation (or some kind of
> > clustering) in how we access VM pages, we'll naturally be able to do
> > more visibility checks covering more index tuples per VM page
> > accessed.
>
> So you are saying that these are accesses to the visibility map, not the
> base table?

Yes. I see "Heap Fetches: 0" for both plans, and both queries scan the
same table and approximately the same number of index pages. So VM
accesses are the only explanation that makes any sense.
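
The buffer counts also add up: 9995216 shared hits is roughly the
27000-odd index pages plus about one VM page pin per index tuple for
the ~10^7 tuples scanned. With randomly ordered TIDs, two consecutive
index entries almost never fall into the heap range covered by the
currently pinned VM page, so the scan re-pins a VM page for nearly
every tuple; with sorted TIDs, one pinned VM page answers long runs of
checks.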

> > Also worth bearing in mind that it's unusual to have a perfectly
> > random and uniformly distributed clustering of index tuples,
>
> Sure. This is a highly contrived example.

FWIW I think that it could be a lot less bad, even with indexes that
you'd think would be almost as bad as the bad one from your test case.
Even things that appear to be random aren't usually nearly as random
as what you've shown.

-- 
Peter Geoghegan