Обсуждение: plan_create_index_workers doesn't account for TOAST

Поиск
Список
Период
Сортировка

plan_create_index_workers doesn't account for TOAST

От
"Jonathan S. Katz"
Дата:
Hi,

plan_create_index_workers[1] does not consider the amount of tuples 
existing in TOAST pages when determining the number of parallel workers 
to use for a build. The estimation comes from estimate_rel_size[2], 
which in this case, will just take the value from rel->rd_rel->relpages.

We probably don't notice this much with B-trees, given a B-tree is 
typically used for data that does not require toasting. However, this 
becomes more visible when working on custom index access methods that 
implement their own parallel build strategy.

For example, pgvector[3] provides its own data types and index access 
method for indexing vector data. Vectors can get quite large fairly 
quickly, e.g. a 768-dimensional vector takes up 8 + 4*768 = 3080 bytes 
on disk, which quickly clears the default TOAST tuple threshold.

In a recent patch proposal to allow for building indexes in parallel[4], 
I performed a few experiments on how many parallel workers would be 
spawned when indexing 1,000,000 (1MM) 768-dim vectors, both with 
EXTEDNED (default) and PLAIN storage. In all cases, I allowed for leader 
participation, but the leader is not considered in 
plan_create_index_workers.

With EXTENDED, plan_create_index_workers recommended 2 workers. The 
build time was ~2x faster than the serial build.

With PLAIN, plan_create_index_workers recommended 4 workers. The build 
time was **~3X faster** than the serial build.

(I've been doing more detailed, less hand-waivy performance testing, but 
I wanted to provide directional numbers here)

It seems like we're leaving some performance for columns with TOASTed 
data that require indexing, so I wanted to propose allowing the pages in 
TOASTed tables to be considered when we're trying to index a column with 
TOASTed attributes.

Thanks,

Jonathan

[1] 

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/optimizer/plan/planner.c;hb=refs/heads/master#l6734
[2] 

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/optimizer/util/plancat.c;hb=refs/heads/master#l1117
[3] https://github.com/pgvector/pgvector
[4] https://github.com/pgvector/pgvector/commits/parallel-index-build

Вложения

Re: plan_create_index_workers doesn't account for TOAST

От
"Jonathan S. Katz"
Дата:
On 6/29/23 10:12 AM, Jonathan S. Katz wrote:
> Hi,
> 
> plan_create_index_workers[1] does not consider the amount of tuples 
> existing in TOAST pages when determining the number of parallel workers 
> to use for a build. The estimation comes from estimate_rel_size[2], 
> which in this case, will just take the value from rel->rd_rel->relpages.
> 
> We probably don't notice this much with B-trees, given a B-tree is 
> typically used for data that does not require toasting. However, this 
> becomes more visible when working on custom index access methods that 
> implement their own parallel build strategy.
> 
> For example, pgvector[3] provides its own data types and index access 
> method for indexing vector data. Vectors can get quite large fairly 
> quickly, e.g. a 768-dimensional vector takes up 8 + 4*768 = 3080 bytes 
> on disk, which quickly clears the default TOAST tuple threshold.
> 
> In a recent patch proposal to allow for building indexes in parallel[4], 
> I performed a few experiments on how many parallel workers would be 
> spawned when indexing 1,000,000 (1MM) 768-dim vectors, both with 
> EXTEDNED (default) and PLAIN storage. In all cases, I allowed for leader 
> participation, but the leader is not considered in 
> plan_create_index_workers.
> 
> With EXTENDED, plan_create_index_workers recommended 2 workers. The 
> build time was ~2x faster than the serial build.
> 
> With PLAIN, plan_create_index_workers recommended 4 workers. The build 
> time was **~3X faster** than the serial build.
> 
> (I've been doing more detailed, less hand-waivy performance testing, but 
> I wanted to provide directional numbers here)
> 
> It seems like we're leaving some performance for columns with TOASTed 
> data that require indexing, so I wanted to propose allowing the pages in 
> TOASTed tables to be considered when we're trying to index a column with 
> TOASTed attributes.

Just to add to this: there is a lever to get more parallel workers by 
setting "min_parallel_table_scan_size" to a lower value, which does help 
in this case. However, it does mask the fact that a large chunk of the 
data required to build the index exists in the TOAST table, which is not 
intuitive to a user who rarely has to use tuning parameters.

Thanks,

Jonathan


Вложения