Thread: Parallel CREATE INDEX for BRIN indexes

Parallel CREATE INDEX for BRIN indexes

From
Tomas Vondra
Date:
Hi,

Here's a WIP patch allowing parallel CREATE INDEX for BRIN indexes. The
infrastructure (starting workers etc.) is "inspired" by the BTREE code
(i.e. copied from that and massaged a bit to call brin stuff).

 _bt_begin_parallel -> _brin_begin_parallel
 _bt_end_parallel -> _brin_end_parallel
 _bt_parallel_estimate_shared -> _brin_parallel_estimate_shared
 _bt_leader_participate_as_worker -> _brin_leader_participate_as_worker
 _bt_parallel_scan_and_sort -> _brin_parallel_scan_and_build

This is mostly mechanical stuff - setting up the parallel workers,
starting the scan etc.

The tricky part is how to divide the work between workers and how we
combine the partial results. For BTREE we simply let each worker read
a subset of the table (using a parallel scan), sort it and then do a
merge sort on the partial results.

For BRIN it's a bit different, because the index essentially splits
the table into smaller ranges and treats them independently. So the
easiest way is to organize the table scan so that each range gets
processed by exactly one worker. Each worker writes the index tuples
into a temporary file, and then when all workers are done the leader
reads them and writes them into the index.

The problem is that a parallel scan assigns a mostly random subset of
the table to each worker - there's no guarantee that a BRIN page range
gets processed by a single worker.


0001 does that in a bit of a silly way - instead of doing a single
large scan, each worker does a sequence of TID range scans (see
_brin_parallel_scan_and_build), and BrinShared has fields used to track
which ranges were already assigned to workers. A bit cumbersome, but it
works pretty well.
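
For illustration, the claiming logic in each worker is conceptually
something like the sketch below (simplified; the BrinShared field names
here are hypothetical, see the patch for the real ones):

    /* each worker claims one page range at a time */
    for (;;)
    {
        BlockNumber rangeStart;
        ItemPointerData mintid,
                    maxtid;

        /* grab the next unassigned range from the shared state */
        SpinLockAcquire(&brinshared->mutex);
        rangeStart = brinshared->nextRangeStart;
        brinshared->nextRangeStart += brinshared->pagesPerRange;
        SpinLockRelease(&brinshared->mutex);

        if (rangeStart >= nblocks)
            break;

        /* TID range scan covering exactly this page range */
        ItemPointerSet(&mintid, rangeStart, 1);
        ItemPointerSet(&maxtid,
                       rangeStart + brinshared->pagesPerRange - 1,
                       MaxOffsetNumber);
        table_rescan_tidrange(scan, &mintid, &maxtid);

        /* ... summarize the range, emit one BRIN tuple ... */
    }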

0002 replaces the TID range scan sequence with a single parallel scan,
modified to assign "chunks" in multiples of pagesPerRange.


In both cases _brin_end_parallel then reads the summaries from worker
files, and adds them into the index. In 0001 this is fairly simple,
although we could do one more improvement and sort the ranges by range
start to make the index nicer (and possibly a bit more efficient). This
should be simple, because the per-worker results are already sorted like
that (so a merge sort in _brin_end_parallel would be enough).

For 0002 it's a bit more complicated, because with a single parallel
scan brinbuildCallbackParallel can't tell whether a range is assigned
to a different worker or is empty. And we want to generate summaries
for empty ranges in the index. We could either skip such ranges during
the index build, and then add empty summaries in _brin_end_parallel (if
needed), or add them and then merge them using "union".


I just realized there's a third option to do this - we could just do a
regular parallel scan (with no particular regard to pagesPerRange), and
then do "union" when merging results from workers. It doesn't require
the sequence of TID scans, and the union would also handle the empty
ranges. The per-worker results might be much larger, though, because
each worker might produce up to the "full" BRIN index.
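
A rough sketch of what the leader's merge loop would look like in that
scheme (union_tuples as in the existing brin.c summarization code; the
next_worker_tuple and flush_summary helpers are made-up placeholders):

    BlockNumber prevblkno = InvalidBlockNumber;
    BrinMemTuple *dtup = NULL;
    BrinTuple  *btup;

    while ((btup = next_worker_tuple()) != NULL)
    {
        if (dtup != NULL && btup->bt_blkno == prevblkno)
        {
            /* another summary for the same range - merge via "union" */
            union_tuples(bdesc, dtup, btup);
            continue;
        }

        /* new range - flush the previous summary, deform the new one */
        if (dtup != NULL)
            flush_summary(dtup, prevblkno);

        dtup = brin_deform_tuple(bdesc, btup, NULL);
        prevblkno = btup->bt_blkno;
    }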


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: Parallel CREATE INDEX for BRIN indexes

From
Matthias van de Meent
Date:
On Thu, 8 Jun 2023 at 14:55, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> Here's a WIP patch allowing parallel CREATE INDEX for BRIN indexes. The
> infrastructure (starting workers etc.) is "inspired" by the BTREE code
> (i.e. copied from that and massaged a bit to call brin stuff).

Nice work.

> In both cases _brin_end_parallel then reads the summaries from worker
> files, and adds them into the index. In 0001 this is fairly simple,
> although we could do one more improvement and sort the ranges by range
> start to make the index nicer (and possibly a bit more efficient). This
> should be simple, because the per-worker results are already sorted like
> that (so a merge sort in _brin_end_parallel would be enough).

I see that you manually built the passing and sorting of tuples
between workers, but can't we use the parallel tuplesort
infrastructure for that? It already has similar features in place and
would improve code commonality.

> For 0002 it's a bit more complicated, because with a single parallel
> scan brinbuildCallbackParallel can't decide if a range is assigned to a
> different worker or empty. And we want to generate summaries for empty
> ranges in the index. We could either skip such range during index build,
> and then add empty summaries in _brin_end_parallel (if needed), or add
> them and then merge them using "union".
>
>
> I just realized there's a third option to do this - we could just do
> regular parallel scan (with no particular regard to pagesPerRange), and
> then do "union" when merging results from workers. It doesn't require
> the sequence of TID scans, and the union would also handle the empty
> ranges. The per-worker results might be much larger, though, because
> each worker might produce up to the "full" BRIN index.

Would it be too much effort to add a 'min_chunk_size' argument to
table_beginscan_parallel (or ParallelTableScanDesc) that defines the
minimum granularity of block ranges to be assigned to each process? I
think that would be the most elegant solution that would require
relatively little effort: table_block_parallelscan_nextpage already
does parallel management of multiple chunk sizes, and I think this
modification would fit quite well in that code.
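
Roughly, after the existing chunk-size computation, something like
this (phs_min_chunk_size being the hypothetical new field):

    /*
     * Round the chunk size down to a multiple of the minimum
     * granularity, but never below the minimum itself.
     */
    if (pbscan->phs_min_chunk_size > 0)
        pbscanwork->phsw_chunk_size =
            Max(pbscanwork->phsw_chunk_size -
                pbscanwork->phsw_chunk_size % pbscan->phs_min_chunk_size,
                pbscan->phs_min_chunk_size);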

Kind regards,

Matthias van de Meent



Re: Parallel CREATE INDEX for BRIN indexes

From
Tomas Vondra
Date:

On 7/4/23 23:53, Matthias van de Meent wrote:
> On Thu, 8 Jun 2023 at 14:55, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> Hi,
>>
>> Here's a WIP patch allowing parallel CREATE INDEX for BRIN indexes. The
>> infrastructure (starting workers etc.) is "inspired" by the BTREE code
>> (i.e. copied from that and massaged a bit to call brin stuff).
> 
> Nice work.
> 
>> In both cases _brin_end_parallel then reads the summaries from worker
>> files, and adds them into the index. In 0001 this is fairly simple,
>> although we could do one more improvement and sort the ranges by range
>> start to make the index nicer (and possibly a bit more efficient). This
>> should be simple, because the per-worker results are already sorted like
>> that (so a merge sort in _brin_end_parallel would be enough).
> 
> I see that you manually built the passing and sorting of tuples
> between workers, but can't we use the parallel tuplesort
> infrastructure for that? It already has similar features in place and
> improves code commonality.
> 

Maybe. I wasn't that familiar with what parallel tuplesort can and can't
do, and the little I knew I managed to forget since I wrote this patch.
Which similar features do you have in mind?

The workers are producing the results in "start_block" order, so if they
pass that to the leader, it probably can do the usual merge sort.

>> For 0002 it's a bit more complicated, because with a single parallel
>> scan brinbuildCallbackParallel can't decide if a range is assigned to a
>> different worker or empty. And we want to generate summaries for empty
>> ranges in the index. We could either skip such range during index build,
>> and then add empty summaries in _brin_end_parallel (if needed), or add
>> them and then merge them using "union".
>>
>>
>> I just realized there's a third option to do this - we could just do
>> regular parallel scan (with no particular regard to pagesPerRange), and
>> then do "union" when merging results from workers. It doesn't require
>> the sequence of TID scans, and the union would also handle the empty
>> ranges. The per-worker results might be much larger, though, because
>> each worker might produce up to the "full" BRIN index.
> 
> Would it be too much effort to add a 'min_chunk_size' argument to
> table_beginscan_parallel (or ParallelTableScanDesc) that defines the
> minimum granularity of block ranges to be assigned to each process? I
> think that would be the most elegant solution that would require
> relatively little effort: table_block_parallelscan_nextpage already
> does parallel management of multiple chunk sizes, and I think this
> modification would fit quite well in that code.
> 

I'm confused. Isn't that pretty much exactly what 0002 does? I mean,
that passes pagesPerRange to table_parallelscan_initialize(), so that
each pagesPerRange chunk is assigned to a single worker.

The trouble I described above is that the scan returns tuples, and the
consumer has no idea what the chunk size was or how many other workers
there are. Imagine you get a tuple from block 1, and then a tuple from
block 1000. Does that mean that the blocks in between are empty or that
they were processed by some other worker?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel CREATE INDEX for BRIN indexes

From
Matthias van de Meent
Date:
On Wed, 5 Jul 2023 at 00:08, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 7/4/23 23:53, Matthias van de Meent wrote:
> > On Thu, 8 Jun 2023 at 14:55, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> Hi,
> >>
> >> Here's a WIP patch allowing parallel CREATE INDEX for BRIN indexes. The
> >> infrastructure (starting workers etc.) is "inspired" by the BTREE code
> >> (i.e. copied from that and massaged a bit to call brin stuff).
> >
> > Nice work.
> >
> >> In both cases _brin_end_parallel then reads the summaries from worker
> >> files, and adds them into the index. In 0001 this is fairly simple,
> >> although we could do one more improvement and sort the ranges by range
> >> start to make the index nicer (and possibly a bit more efficient). This
> >> should be simple, because the per-worker results are already sorted like
> >> that (so a merge sort in _brin_end_parallel would be enough).
> >
> > I see that you manually built the passing and sorting of tuples
> > between workers, but can't we use the parallel tuplesort
> > infrastructure for that? It already has similar features in place and
> > improves code commonality.
> >
>
> Maybe. I wasn't that familiar with what parallel tuplesort can and can't
> do, and the little I knew I managed to forget since I wrote this patch.
> Which similar features do you have in mind?
>
> The workers are producing the results in "start_block" order, so if they
> pass that to the leader, it probably can do the usual merge sort.
>
> >> For 0002 it's a bit more complicated, because with a single parallel
> >> scan brinbuildCallbackParallel can't decide if a range is assigned to a
> >> different worker or empty. And we want to generate summaries for empty
> >> ranges in the index. We could either skip such range during index build,
> >> and then add empty summaries in _brin_end_parallel (if needed), or add
> >> them and then merge them using "union".
> >>
> >>
> >> I just realized there's a third option to do this - we could just do
> >> regular parallel scan (with no particular regard to pagesPerRange), and
> >> then do "union" when merging results from workers. It doesn't require
> >> the sequence of TID scans, and the union would also handle the empty
> >> ranges. The per-worker results might be much larger, though, because
> >> each worker might produce up to the "full" BRIN index.
> >
> > Would it be too much effort to add a 'min_chunk_size' argument to
> > table_beginscan_parallel (or ParallelTableScanDesc) that defines the
> > minimum granularity of block ranges to be assigned to each process? I
> > think that would be the most elegant solution that would require
> > relatively little effort: table_block_parallelscan_nextpage already
> > does parallel management of multiple chunk sizes, and I think this
> > modification would fit quite well in that code.
> >
>
> I'm confused. Isn't that pretty much exactly what 0002 does? I mean,
> that passes pagesPerRange to table_parallelscan_initialize(), so that
> each pagesPerRange is assigned to a single worker.

Huh, I overlooked that one... Sorry for that.

> The trouble I described above is that the scan returns tuples, and the
> consumer has no idea what was the chunk size or how many other workers
> are there. Imagine you get a tuple from block 1, and then a tuple from
> block 1000. Does that mean that the blocks in between are empty or that
> they were processed by some other worker?

If the unit of work for parallel table scans is the index's
pages_per_range, then I think we can just fill in expected-but-missing
ranges as 'empty' in the parallel leader during index loading, like
the first of the two solutions you proposed.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)



Re: Parallel CREATE INDEX for BRIN indexes

From
Tomas Vondra
Date:

On 7/5/23 10:44, Matthias van de Meent wrote:
> On Wed, 5 Jul 2023 at 00:08, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>>
>>
>> On 7/4/23 23:53, Matthias van de Meent wrote:
>>> On Thu, 8 Jun 2023 at 14:55, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Here's a WIP patch allowing parallel CREATE INDEX for BRIN indexes. The
>>>> infrastructure (starting workers etc.) is "inspired" by the BTREE code
>>>> (i.e. copied from that and massaged a bit to call brin stuff).
>>>
>>> Nice work.
>>>
>>>> In both cases _brin_end_parallel then reads the summaries from worker
>>>> files, and adds them into the index. In 0001 this is fairly simple,
>>>> although we could do one more improvement and sort the ranges by range
>>>> start to make the index nicer (and possibly a bit more efficient). This
>>>> should be simple, because the per-worker results are already sorted like
>>>> that (so a merge sort in _brin_end_parallel would be enough).
>>>
>>> I see that you manually built the passing and sorting of tuples
>>> between workers, but can't we use the parallel tuplesort
>>> infrastructure for that? It already has similar features in place and
>>> improves code commonality.
>>>
>>
>> Maybe. I wasn't that familiar with what parallel tuplesort can and can't
>> do, and the little I knew I managed to forget since I wrote this patch.
>> Which similar features do you have in mind?
>>
>> The workers are producing the results in "start_block" order, so if they
>> pass that to the leader, it probably can do the usual merge sort.
>>
>>>> For 0002 it's a bit more complicated, because with a single parallel
>>>> scan brinbuildCallbackParallel can't decide if a range is assigned to a
>>>> different worker or empty. And we want to generate summaries for empty
>>>> ranges in the index. We could either skip such range during index build,
>>>> and then add empty summaries in _brin_end_parallel (if needed), or add
>>>> them and then merge them using "union".
>>>>
>>>>
>>>> I just realized there's a third option to do this - we could just do
>>>> regular parallel scan (with no particular regard to pagesPerRange), and
>>>> then do "union" when merging results from workers. It doesn't require
>>>> the sequence of TID scans, and the union would also handle the empty
>>>> ranges. The per-worker results might be much larger, though, because
>>>> each worker might produce up to the "full" BRIN index.
>>>
>>> Would it be too much effort to add a 'min_chunk_size' argument to
>>> table_beginscan_parallel (or ParallelTableScanDesc) that defines the
>>> minimum granularity of block ranges to be assigned to each process? I
>>> think that would be the most elegant solution that would require
>>> relatively little effort: table_block_parallelscan_nextpage already
>>> does parallel management of multiple chunk sizes, and I think this
>>> modification would fit quite well in that code.
>>>
>>
>> I'm confused. Isn't that pretty much exactly what 0002 does? I mean,
>> that passes pagesPerRange to table_parallelscan_initialize(), so that
>> each pagesPerRange is assigned to a single worker.
> 
> Huh, I overlooked that one... Sorry for that.
> 
>> The trouble I described above is that the scan returns tuples, and the
>> consumer has no idea what was the chunk size or how many other workers
>> are there. Imagine you get a tuple from block 1, and then a tuple from
>> block 1000. Does that mean that the blocks in between are empty or that
>> they were processed by some other worker?
> 
> If the unit of work for parallel table scans is the index's
> pages_per_range, then I think we can just fill in expected-but-missing
> ranges as 'empty' in the parallel leader during index loading, like
> the first of the two solutions you proposed.
> 

Right, I think that's the right solution.

Or rather the only solution, because the other idea (generating the
empty ranges in the workers) relies on the workers knowing when to
generate them. But I don't think the workers have the necessary
information.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel CREATE INDEX for BRIN indexes

From
Matthias van de Meent
Date:
On Wed, 5 Jul 2023 at 00:08, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 7/4/23 23:53, Matthias van de Meent wrote:
> > On Thu, 8 Jun 2023 at 14:55, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> Hi,
> >>
> >> Here's a WIP patch allowing parallel CREATE INDEX for BRIN indexes. The
> >> infrastructure (starting workers etc.) is "inspired" by the BTREE code
> >> (i.e. copied from that and massaged a bit to call brin stuff).
> >
> > Nice work.
> >
> >> In both cases _brin_end_parallel then reads the summaries from worker
> >> files, and adds them into the index. In 0001 this is fairly simple,
> >> although we could do one more improvement and sort the ranges by range
> >> start to make the index nicer (and possibly a bit more efficient). This
> >> should be simple, because the per-worker results are already sorted like
> >> that (so a merge sort in _brin_end_parallel would be enough).
> >
> > I see that you manually built the passing and sorting of tuples
> > between workers, but can't we use the parallel tuplesort
> > infrastructure for that? It already has similar features in place and
> > improves code commonality.
> >
>
> Maybe. I wasn't that familiar with what parallel tuplesort can and can't
> do, and the little I knew I managed to forget since I wrote this patch.
> Which similar features do you have in mind?

I was referring to the feature that is "emitting a single sorted run
of tuples at the leader backend based on data gathered in parallel
worker backends". It manages the sort state, on-disk runs etc. so that
you don't have to manage that yourself.

Adding a new storage format for what is effectively a logical tape
(logtape.{c,h}) and manually merging it seems like a lot of changes if
that functionality is readily available, standardized and optimized in
sortsupport; and adds an additional place to manually go through for
disk-related changes like TDE.
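
For reference, the general shape of a parallel tuplesort (as nbtsort.c
uses it) is roughly the following - tuplesort_begin_xxx stands in for
whichever begin routine the index method would need:

    /* worker (and leader participating as worker): sort a local run */
    coordinate = palloc0(sizeof(SortCoordinateData));
    coordinate->isWorker = true;
    coordinate->nParticipants = -1;
    coordinate->sharedsort = sharedsort;
    sortstate = tuplesort_begin_xxx(..., coordinate, TUPLESORT_NONE);
    /* ... tuplesort_puttuple calls ... */
    tuplesort_performsort(sortstate);   /* flushes the sorted run */
    tuplesort_end(sortstate);

    /* leader, once all workers are done: merge the runs */
    coordinate->isWorker = false;
    coordinate->nParticipants = nparticipants;
    sortstate = tuplesort_begin_xxx(..., coordinate, TUPLESORT_NONE);
    tuplesort_performsort(sortstate);
    /* ... tuplesort_gettuple calls, in fully sorted order ... */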

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)



Re: Parallel CREATE INDEX for BRIN indexes

From
Tomas Vondra
Date:

On 7/5/23 16:33, Matthias van de Meent wrote:
> ...
>
>> Maybe. I wasn't that familiar with what parallel tuplesort can and can't
>> do, and the little I knew I managed to forget since I wrote this patch.
>> Which similar features do you have in mind?
> 
> I was referring to the feature that is "emitting a single sorted run
> of tuples at the leader backend based on data gathered in parallel
> worker backends". It manages the sort state, on-disk runs etc. so that
> you don't have to manage that yourself.
> 
> Adding a new storage format for what is effectively a logical tape
> (logtape.{c,h}) and manually merging it seems like a lot of changes if
> that functionality is readily available, standardized and optimized in
> sortsupport; and adds an additional place to manually go through for
> disk-related changes like TDE.
> 

Here's a new version of the patch, with three main changes:

1) Adoption of the parallel scan approach, instead of the homegrown
solution with a sequence of TID scans. This is mostly what the 0002
patch did, except for fixing a bug - the parallel scan has a
"ramp-down" close to the end, and this needs to consider the chunk
size too.


2) Switches to the parallel tuplesort, as proposed. This turned out to
be easier than I expected - most of the work was in adding methods to
tuplesortvariants.c to allow reading/writing BrinTuple items. The main
limitation is that we need to pass around the length of the tuple
(AFAICS it's not in the BrinTuple itself). I'm not entirely sure about
the memory management aspect of this, and maybe there's a more elegant
solution.

Overall it seems to work - the brin.c code is heavily based on how
nbtsort.c does parallel builds for btree, so hopefully it's fine. At
some point I got a bit confused about which spool to create/use, but it
seems to work.


3) Handling of empty ranges - I ended up ignoring empty ranges in
workers (i.e. those are not written to the tuplesort), and instead the
leader fills them in when reading data from the shared tuplesort.


One thing I was wondering about is whether it might be better to allow
the workers to process overlapping ranges, and then let the leader
merge the summaries. That would mean we might not need the tableam.c
changes at all, but the leader would need to do more work (although the
BRIN indexes tend to be fairly small). The main reason that got me
thinking about this is that we have pretty much no tests for the union
procedures, because triggering them is really difficult. But for
parallel index builds that'd be much more common.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: Parallel CREATE INDEX for BRIN indexes

From
Matthias van de Meent
Date:
On Thu, 6 Jul 2023 at 16:13, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 7/5/23 16:33, Matthias van de Meent wrote:
> > ...
> >
> >> Maybe. I wasn't that familiar with what parallel tuplesort can and can't
> >> do, and the little I knew I managed to forget since I wrote this patch.
> >> Which similar features do you have in mind?
> >
> > I was referring to the feature that is "emitting a single sorted run
> > of tuples at the leader backend based on data gathered in parallel
> > worker backends". It manages the sort state, on-disk runs etc. so that
> > you don't have to manage that yourself.
> >
> > Adding a new storage format for what is effectively a logical tape
> > (logtape.{c,h}) and manually merging it seems like a lot of changes if
> > that functionality is readily available, standardized and optimized in
> > sortsupport; and adds an additional place to manually go through for
> > disk-related changes like TDE.
> >
>
> Here's a new version of the patch, with three main changes:

Thanks! I've done a review of the patch, and most of it looks good.
Some places need cleanup and polish, others need more documentation,
and there are some other issues, but so far it's looking OK.

> One thing I was wondering about is whether it might be better to allow
> the workers to process overlapping ranges, and then let the leader to
> merge the summaries. That would mean we might not need the tableam.c
> changes at all, but the leader would need to do more work (although the
> BRIN indexes tend to be fairly small). The main reason that got me
> thinking about this is that we have pretty much no tests for the union
> procedures, because triggering that is really difficult. But for
> parallel index builds that'd be much more common.

Hmm, that's a good point. I don't mind either way, but it would add
overhead in the leader to do all of that merging - especially when you
configure pages_per_range > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE, as we'd
need to merge up to parallel_workers tuples per range. That could be a
significant overhead.

... thinks a bit.

Hmm, but with the current P_S_M_C_S of 8192 blocks that's quite
unlikely to be a serious problem - the per-backend I/O saved by keeping
such large ranges on a single backend presumably has much more impact
than merging n_parallel_tasks max-sized brin tuples. So, seems fine
with me.

Review follows below.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)

-----------

> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c

> +    BrinShared       *brinshared;

Needs some indentation fixes.

> +    int            bs_reltuples;
> [...]
> +    state->bs_reltuples += reltuples;

My IDE warns me that reltuples is a double. Looking deeper into the
value, it contains the number of live tuples in the table, so this
conversion may not result in a meaningful value for tables with >=2^31
live tuples. Tables > 56GB could begin to get affected by this.

> +    int            bs_worker_id;

This variable seems to be unused.

> +    BrinSpool  *bs_spool;
> +    BrinSpool  *bs_spool_out;

Are both used? If so, could you add comments why we have two spools
here, instead of only one?

> +/*
> + * A version of the callback, used by parallel index builds. The main difference
> + * is that instead of writing the BRIN tuples into the index, we write them into
> + * a shared tuplestore file, and leave the insertion up to the leader (which may

+ ... shared tuplesort, and ...

> brinbuildCallbackParallel(...)
> +    while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1)

shouldn't this be an 'if' now?

> +        while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1)
> +            state->bs_currRangeStart += state->bs_pagesPerRange;

Is there a reason why you went with iterative addition instead of a
single divide-and-multiply like the following?:

+        state->bs_currRangeStart += state->bs_pagesPerRange *
((thisblock - state->bs_currRangeStart) / state->bs_pagesPerRange);

> diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
> [...]
> -table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
> +table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan, BlockNumber chunk_factor)
> [...]
> -    /* compare phs_syncscan initialization to similar logic in initscan */
> +    bpscan->phs_chunk_factor = chunk_factor;
> +    /* compare phs_syncscan initialization to similar logic in initscan
> +     *
> +     * Disable sync scans if the chunk factor is set (valid block number).
> +     */

I think this needs some pgindent or other style work, both on comment
style and line lengths

> diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
> [...]
> +            Assert(false); (x3)

I think these can be cleaned up, right?

> diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
> [...]
> + * Computing BrinTuple size with only the tuple is difficult, so we want to track
> + * the length for the tuple referenced by SortTuple. That's what BrinSortTuple is meant
> + * to do - it's essentially a BrinTuple prefixed by length. We only write the
> + * BrinTuple to the logtapes, though.

Why don't we write the full BrinSortTuple to disk? Doesn't that make more sense?

> +    tuplesort_puttuple_common(state, &stup,
> +                              base->sortKeys &&
> +                              base->sortKeys->abbrev_converter &&
> +                              !stup.isnull1);

Can't this last argument just be inlined, based on knowledge that we
don't use sortKeys in brin?

> +comparetup_index_brin(const SortTuple *a, const SortTuple *b,
> +                      Tuplesortstate *state)
> +{
> +    BrinTuple  *tuple1;
> [...]
> +    tuple1 = &((BrinSortTuple *) a)->tuple;
> [...]

I'm fairly sure that this cast (and its neighbour) is incorrect and
should be the following instead:

+    tuple1 = &((BrinSortTuple *) (a->tuple))->tuple;

Additionally, I think the following would be a better approach here,
as we wouldn't need to do pointer-chasing:

+ static int
+ comparetup_index_brin(const SortTuple *a, const SortTuple *b,
+                      Tuplesortstate *state)
+ {
+    Assert(TuplesortstateGetPublic(state)->haveDatum1);
+
+    if (DatumGetUInt32(a->datum1) > DatumGetUInt32(b->datum1))
+        return 1;
+    if (DatumGetUInt32(a->datum1) < DatumGetUInt32(b->datum1))
+        return -1;
+     /* silence compilers */
+    return 0;
+ }

---

Thanks for working on this!



Re: Parallel CREATE INDEX for BRIN indexes

From
Tomas Vondra
Date:

On 7/11/23 23:11, Matthias van de Meent wrote:
> On Thu, 6 Jul 2023 at 16:13, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 7/5/23 16:33, Matthias van de Meent wrote:
>>> ...
>>>
>>>> Maybe. I wasn't that familiar with what parallel tuplesort can and can't
>>>> do, and the little I knew I managed to forget since I wrote this patch.
>>>> Which similar features do you have in mind?
>>>
>>> I was referring to the feature that is "emitting a single sorted run
>>> of tuples at the leader backend based on data gathered in parallel
>>> worker backends". It manages the sort state, on-disk runs etc. so that
>>> you don't have to manage that yourself.
>>>
>>> Adding a new storage format for what is effectively a logical tape
>>> (logtape.{c,h}) and manually merging it seems like a lot of changes if
>>> that functionality is readily available, standardized and optimized in
>>> sortsupport; and adds an additional place to manually go through for
>>> disk-related changes like TDE.
>>>
>>
>> Here's a new version of the patch, with three main changes:
> 
> Thanks! I've done a review on the patch, and most looks good. Some
> places need cleanup and polish, some others more documentations, and
> there are some other issues, but so far it's looking OK.
> 
>> One thing I was wondering about is whether it might be better to allow
>> the workers to process overlapping ranges, and then let the leader to
>> merge the summaries. That would mean we might not need the tableam.c
>> changes at all, but the leader would need to do more work (although the
>> BRIN indexes tend to be fairly small). The main reason that got me
>> thinking about this is that we have pretty much no tests for the union
>> procedures, because triggering that is really difficult. But for
>> parallel index builds that'd be much more common.
> 
> Hmm, that's a good point. I don't mind either way, but it would add
> overhead in the leader to do all of that merging - especially when you
> configure pages_per_range > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE as we'd
> need to merge up to parallel_workers tuples. That could be a
> significant overhead.
> 
> ... thinks a bit.
> 
> Hmm, but with the current P_S_M_C_S of 8192 blocks that's quite
> unlikely to be a serious problem - the per-backend IO saved of such
> large ranges on a single backend has presumably much more impact than
> the merging of n_parallel_tasks max-sized brin tuples. So, seems fine
> with me.
> 

As for PARALLEL_SEQSCAN_MAX_CHUNK_SIZE, the last patch actually
considers the chunk_factor (i.e. pages_per_range) *after* doing

    pbscanwork->phsw_chunk_size = Min(pbscanwork->phsw_chunk_size,
                                      PARALLEL_SEQSCAN_MAX_CHUNK_SIZE);

so even with (pages_per_range > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE) it
would not need to merge anything.

Now, that might have been a bad idea and PARALLEL_SEQSCAN_MAX_CHUNK_SIZE
should be considered. In which case this *has* to do the union, even if
only for the rare corner case.

But I don't think that's a major issue - it's pretty certain that
summarizing the tuples is way more expensive than merging the
summaries. Which is what matters for Amdahl's law ...


> Review follows below.
> 
> Kind regards,
> 
> Matthias van de Meent
> Neon (https://neon.tech/)
> 
> -----------
> 
>> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
> 
>> +    BrinShared       *brinshared;
> 
> Needs some indentation fixes.
> 
>> +    int            bs_reltuples;
>> [...]
>> +    state->bs_reltuples += reltuples;
> 
> My IDE warns me that reltuples is a double. Looking deeper into the
> value, it contains the number of live tuples in the table, so this
> conversion may not result in a meaningful value for tables with >=2^31
> live tuples. Tables > 56GB could begin to get affected by this.
> 
>> +    int            bs_worker_id;
> 
> This variable seems to be unused.
> 
>> +    BrinSpool  *bs_spool;
>> +    BrinSpool  *bs_spool_out;
> 
> Are both used? If so, could you add comments why we have two spools
> here, instead of only one?
> 

OK, I admit I'm not sure both are actually necessary. I was struggling
to get it working with just one spool, because when the leader
participates as a worker, it needs to both summarize some of the chunks
(and put the tuples somewhere). And then it also needs to consume the
final output.

Maybe it's just a case of cargo cult programming - I was mostly copying
stuff from the btree index build, trying to make it work, and then with
two spools it started working.

>> +/*
>> + * A version of the callback, used by parallel index builds. The main difference
>> + * is that instead of writing the BRIN tuples into the index, we write them into
>> + * a shared tuplestore file, and leave the insertion up to the leader (which may
> 
> + ... shared tuplesort, and ...
> 
>> brinbuildCallbackParallel(...)
>> +    while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1)
> 
> shouldn't this be an 'if' now?
> 

Hmmm, probably ... that way we'd skip the empty ranges.

>> +        while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1)
>> +            state->bs_currRangeStart += state->bs_pagesPerRange;
> 
> Is there a reason why you went with iterative addition instead of a
> single divide-and-multiply like the following?:
> 
> +        state->bs_currRangeStart += state->bs_pagesPerRange *
> ((thisblock - state->bs_currRangeStart) / state->bs_pagesPerRange);
> 

Probably laziness ... You're right, the divide-and-multiply seems simpler.

>> diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
>> [...]
>> -table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
>> +table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan, BlockNumber chunk_factor)
>> [...]
>> -    /* compare phs_syncscan initialization to similar logic in initscan */
>> +    bpscan->phs_chunk_factor = chunk_factor;
>> +    /* compare phs_syncscan initialization to similar logic in initscan
>> +     *
>> +     * Disable sync scans if the chunk factor is set (valid block number).
>> +     */
> 
> I think this needs some pgindent or other style work, both on comment
> style and line lengths
> 

Right.

>> diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
>> [...]
>> +            Assert(false); (x3)
> 
> I think these can be cleaned up, right?
> 

Duh! Absolutely, this shouldn't have been in the patch at all. I only
added those to quickly identify places that got the tuplesort into an
unexpected state (much easier with a coredump and a backtrace).

>> diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
>> [...]
>> + * Computing BrinTuple size with only the tuple is difficult, so we want to track
> >> + * the length for the tuple referenced by SortTuple. That's what BrinSortTuple is meant
>> + * to do - it's essentially a BrinTuple prefixed by length. We only write the
>> + * BrinTuple to the logtapes, though.
> 
> Why don't we write the full BrinSortTuple to disk? Doesn't that make more sense?
> 

Not sure I understand. We ultimately do, because we write

    (length + BrinTuple)

and BrinSortTuple is exactly that. But if we write BrinSortTuple, we
would end up writing the length for that too, no?
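
Concretely, the writetup/readtup pair does roughly this (simplified,
field names approximate):

    /* write: the length word, then just the BrinTuple */
    LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
    LogicalTapeWrite(tape, &bstup->tuple, bstup->tuplen);

    /* read: rebuild the in-memory BrinSortTuple around the BrinTuple */
    LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
    bstup = palloc(offsetof(BrinSortTuple, tuple) + tuplen);
    bstup->tuplen = tuplen;
    LogicalTapeReadExact(tape, &bstup->tuple, tuplen);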

Or maybe I just don't understand how that would make things simpler.

>> +    tuplesort_puttuple_common(state, &stup,
>> +                              base->sortKeys &&
>> +                              base->sortKeys->abbrev_converter &&
>> +                              !stup.isnull1);
> 
> Can't this last argument just be inlined, based on knowledge that we
> don't use sortKeys in brin?
> 

What does "inlined" mean for an argument? But yeah, I guess it might be
just set to false. And we should probably have an assert that there are
no sortKeys.

>> +comparetup_index_brin(const SortTuple *a, const SortTuple *b,
>> +                      Tuplesortstate *state)
>> +{
>> +    BrinTuple  *tuple1;
>> [...]
>> +    tuple1 = &((BrinSortTuple *) a)->tuple;
>> [...]
> 
> I'm fairly sure that this cast (and it's neighbour) is incorrect and
> should be the following instead:
> 
> +    tuple1 = &((BrinSortTuple *) (a->tuple))->tuple;
> 
> Additionally, I think the following would be a better approach here,
> as we wouldn't need to do pointer-chasing:
> 

Uh, right. This only works because 'tuple' happens to be the first field
in SortTuple.

> + static int
> + comparetup_index_brin(const SortTuple *a, const SortTuple *b,
> +                      Tuplesortstate *state)
> + {
> +    Assert(TuplesortstateGetPublic(state)->haveDatum1);
> +
> +    if (DatumGetUInt32(a->datum1) > DatumGetUInt32(b->datum1))
> +        return 1;
> +    if (DatumGetUInt32(a->datum1) < DatumGetUInt32(b->datum1))
> +        return -1;
> +     /* silence compilers */
> +    return 0;
> + }
> 

Good idea! I forgot we're guaranteed to have datum1.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel CREATE INDEX for BRIN indexes

From
Matthias van de Meent
Date:
On Fri, 14 Jul 2023 at 15:57, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 7/11/23 23:11, Matthias van de Meent wrote:
>> On Thu, 6 Jul 2023 at 16:13, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> One thing I was wondering about is whether it might be better to allow
>>> the workers to process overlapping ranges, and then let the leader to
>>> merge the summaries. That would mean we might not need the tableam.c
>>> changes at all, but the leader would need to do more work (although the
>>> BRIN indexes tend to be fairly small). The main reason that got me
>>> thinking about this is that we have pretty much no tests for the union
>>> procedures, because triggering that is really difficult. But for
>>> parallel index builds that'd be much more common.
>>
>> Hmm, that's a good point. I don't mind either way, but it would add
>> overhead in the leader to do all of that merging - especially when you
>> configure pages_per_range > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE as we'd
>> need to merge up to parallel_workers tuples. That could be a
>> significant overhead.
>>
>> ... thinks a bit.
>>
>> Hmm, but with the current P_S_M_C_S of 8192 blocks that's quite
>> unlikely to be a serious problem - the per-backend IO saved of such
>> large ranges on a single backend has presumably much more impact than
>> the merging of n_parallel_tasks max-sized brin tuples. So, seems fine
>> with me.
>>
>
> As for PARALLEL_SEQSCAN_MAX_CHUNK_SIZE, the last patch actually
> considers the chunk_factor (i.e. pages_per_range) *after* doing
>
>     pbscanwork->phsw_chunk_size = Min(pbscanwork->phsw_chunk_size,
>                                       PARALLEL_SEQSCAN_MAX_CHUNK_SIZE);
>
> so even with (pages_per_range > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE) it
> would not need to merge anything.
>
> Now, that might have been a bad idea and PARALLEL_SEQSCAN_MAX_CHUNK_SIZE
> should be considered. In which case this *has* to do the union, even if
> only for the rare corner case.
>
> But I don't think that's a major issue - it's pretty sure summarizing
> the tuples is way more expensive than merging the summaries. Which is
> what matters for Amdahl's law ...

Agreed.

>>> +    BrinSpool  *bs_spool;
>>> +    BrinSpool  *bs_spool_out;
>>
>> Are both used? If so, could you add comments why we have two spools
>> here, instead of only one?
>>
>
> OK, I admit I'm not sure both are actually necessary. I was struggling
> getting it working with just one spool, because when the leader
> participates as a worker, it needs to both summarize some of the chunks
> (and put the tuples somewhere). And then it also needs to consume the
> final output.
>
> Maybe it's just a case of cargo cult programming - I was mostly copying
> stuff from the btree index build, trying to make it work, and then with
> two spools it started working.

Two spools seem to be necessary in a participating leader, but the two
spools have non-overlapping lifetimes. In the btree code two pairs of
spools are actually used (for unique indexes): you can see the pairs
being allocated both in _bt_leader_participate_as_worker (called from
_bt_begin_parallel, from _bt_spools_heapscan) and in
_bt_spools_heapscan.

> >> diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
> >> [...]
> >> + * Computing BrinTuple size with only the tuple is difficult, so we want to track
> >> + * the length for the tuple referenced by SortTuple. That's what BrinSortTuple is meant
> >> + * to do - it's essentially a BrinTuple prefixed by length. We only write the
> >> + * BrinTuple to the logtapes, though.
> >
> > Why don't we write the full BrinSortTuple to disk? Doesn't that make more sense?
> >
>
> Not sure I understand. We ultimately do, because we write
>
>     (length + BrinTuple)
>
> and BrinSortTuple is exactly that. But if we write BrinSortTuple, we
> would end up writing length for that too, no?
>
> Or maybe I just don't understand how would that make things simpler.

I don't quite understand the intricacies of the tape storage format
yet (specifically, I'm continuously getting confused by the len -=
sizeof(int)), so you might well be correct.

My comment was written based on just the comment's contents, which
claims "we can't easily recompute the length, so we store it with the
tuple in memory. However, we don't store the length when we write it
to the tape", which seems self-contradictory.

> >> +    tuplesort_puttuple_common(state, &stup,
> >> +                              base->sortKeys &&
> >> +                              base->sortKeys->abbrev_converter &&
> >> +                              !stup.isnull1);
> >
> > Can't this last argument just be inlined, based on knowledge that we
> > don't use sortKeys in brin?
> >
>
> What does "inlined" mean for an argument? But yeah, I guess it might be
> just set to false. And we should probably have an assert that there are
> no sortKeys.

"inlined", "precomputed", "const-ified"? I'm not super good at
vocabulary. But, indeed, thanks.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Parallel CREATE INDEX for BRIN indexes

From
Tomas Vondra
Date:
Hi,

here's an updated patch, addressing the review comments, and reworking
how the work is divided between the workers & leader etc.

0001 is just v2, rebased to current master

0002 and 0003 address most of the issues, in particular it

  - removes the unnecessary spool
  - fixes bs_reltuples type to double
  - a couple comments are reworded to be clearer
  - changes loop/condition in brinbuildCallbackParallel
  - removes asserts added for debugging
  - fixes cast in comparetup_index_brin
  - 0003 then simplifies comparetup_index_brin
  - I haven't inlined the tuplesort_puttuple_common parameter
    (didn't seem worth it)

0004 reworks how the work is divided between workers and combined by the
leader. It undoes the tableam.c changes that attempted to divide the
relation into chunks matching the BRIN ranges, and instead merges the
results in the leader (using the BRIN "union" function).

I haven't done any indentation fixes yet.

I did fairly extensive testing, using pageinspect to compare indexes
built with/without parallelism. More testing is needed, but it seems to
work fine (with other opclasses and so on).

In general I'm quite happy with the current state, and I believe it's
fairly close to being committable.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: Parallel CREATE INDEX for BRIN indexes

From
Matthias van de Meent
Date:
On Wed, 8 Nov 2023 at 12:03, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> here's an updated patch, addressing the review comments, and reworking
> how the work is divided between the workers & leader etc.

Thanks!

> In general I'm quite happy with the current state, and I believe it's
> fairly close to be committable.

Are you planning on committing the patches separately, or squashed? I
won't have much time this week for reviewing the patch, and it seems
like these patches are incremental, so some guidance on what you want
to be reviewed would be appreciated.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Parallel CREATE INDEX for BRIN indexes

From
Tomas Vondra
Date:

On 11/12/23 10:38, Matthias van de Meent wrote:
> On Wed, 8 Nov 2023 at 12:03, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> Hi,
>>
>> here's an updated patch, addressing the review comments, and reworking
>> how the work is divided between the workers & leader etc.
> 
> Thanks!
> 
>> In general I'm quite happy with the current state, and I believe it's
>> fairly close to be committable.
> 
> Are you planning on committing the patches separately, or squashed? I
> won't have much time this week for reviewing the patch, and it seems
> like these patches are incremental, so some guidance on what you want
> to be reviewed would be appreciated.
> 

Definitely squashed. I only kept them separate to make it more obvious
what the changes are.

If you need more time for a review, I can certainly wait. No rush.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel CREATE INDEX for BRIN indexes

From
Matthias van de Meent
Date:
On Wed, 8 Nov 2023 at 12:03, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> here's an updated patch, addressing the review comments, and reworking
> how the work is divided between the workers & leader etc.
>
> 0001 is just v2, rebased to current master
>
> 0002 and 0003 address most of the issues, in particular it
>
>   - removes the unnecessary spool
>   - fixes bs_reltuples type to double
>   - a couple comments are reworded to be clearer
>   - changes loop/condition in brinbuildCallbackParallel
>   - removes asserts added for debugging
>   - fixes cast in comparetup_index_brin
>   - 0003 then simplifies comparetup_index_brin
>   - I haven't inlined the tuplesort_puttuple_common parameter
>     (didn't seem worth it)

OK, thanks

> 0004 Reworks how the work is divided between workers and combined by the
> leader. It undoes the tableam.c changes that attempted to divide the
> relation into chunks matching the BRIN ranges, and instead merges the
> results in the leader (using the BRIN "union" function).

That's OK.

> I haven't done any indentation fixes yet.
>
> I did fairly extensive testing, using pageinspect to compare indexes
> built with/without parallelism. More testing is needed, but it seems to
> work fine (with other opclasses and so on).

After code-only review, here are some comments:

> +++ b/src/backend/access/brin/brin.c
> [...]
> +/* Magic numbers for parallel state sharing */
> +#define PARALLEL_KEY_BRIN_SHARED        UINT64CONST(0xA000000000000001)
> +#define PARALLEL_KEY_TUPLESORT            UINT64CONST(0xA000000000000002)

These shm keys use the same constants also in use in
access/nbtree/nbtsort.c. While this shouldn't be an issue in normal
operations, I'd prefer if we didn't actively introduce conflicting
identifiers when we still have significant amounts of unused values
remaining.

> +#define PARALLEL_KEY_QUERY_TEXT            UINT64CONST(0xA000000000000003)

This is the fourth definition of a PARALLEL%_KEY_QUERY_TEXT, the
others being in access/nbtree/nbtsort.c (value 0xA000000000000004, one
more than brin's), backend/executor/execParallel.c
(0xE000000000000008), and PARALLEL_VACUUM_KEY_QUERY_TEXT (0x3) (though
I've not checked that their uses are exactly the same, I'd expect at
least btree to match mostly, if not fully, 1:1).
I think we could probably benefit from a less ad-hoc sharing of query
texts. I don't think that needs to happen specifically in this patch,
but I think it's something to keep in mind in future efforts.

> +_brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
> [...]
> +    BrinSpool  *spool = state->bs_spool;
> [...]
> +    if (!state)
> +        return;

I think the assignment to spool should be moved to below this
condition, as _brin_begin_parallel calls this with state=NULL when it
can't launch parallel workers, which will cause issues here.
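
That is, simply:

    if (!state)
        return;

    /* only dereference state once we know the build went parallel */
    spool = state->bs_spool;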

> +    state->bs_numtuples = brinshared->indtuples;

My IDE complains about bs_numtuples being an integer. This is a
pre-existing issue, but still valid: we can hit an overflow on tables
with pages_per_range=1 and relsize >= 2^31 pages. Extremely unlikely,
but problematic nonetheless.

> +     * FIXME This probably needs some memory management fixes - we're reading
> +     * tuples from the tuplesort, we're allocating an emty tuple, and so on.
> +     * Probably better to release this memory.

This should probably be resolved.

I also noticed that this is likely to execute `union_tuples` many
times when pages_per_range is coprime with the parallel table scan's
block stride (or when for other reasons we have many tuples returned
for each range); and this `union_tuples` internally allocates and
deletes its own memory context for its deserialization of the 'b'
tuple. I think we should just pass a scratch context instead, so that
we don't have the overhead of continuously creating and then deleting
the same memory context.
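
Something like this (sketch):

    /* set up once, before merging the worker results */
    rangeCxt = AllocSetContextCreate(CurrentMemoryContext,
                                     "brin union",
                                     ALLOCSET_DEFAULT_SIZES);

    /* per merged range: run the union in the scratch context, reset */
    oldCxt = MemoryContextSwitchTo(rangeCxt);
    union_tuples(bdesc, memtuple, btup);
    MemoryContextSwitchTo(oldCxt);
    MemoryContextReset(rangeCxt);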

> +++ b/src/backend/catalog/index.c
> [...]
> -        indexRelation->rd_rel->relam == BTREE_AM_OID)
> +        (indexRelation->rd_rel->relam == BTREE_AM_OID ||
> +         indexRelation->rd_rel->relam == BRIN_AM_OID))

I think this needs some more effort. I imagine a new
IndexAmRoutine->amcanbuildparallel is more appropriate than this
hard-coded list - external indexes may want to utilize the parallel
index creation planner facility, too.
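
Sketch of what I have in mind (the existing hardcoded check lives in
index_build):

    /* in brinhandler(), next to the other capability flags */
    amroutine->amcanbuildparallel = true;

    /* in index_build(), replacing the hardcoded list of AMs */
    if (indexRelation->rd_indam->amcanbuildparallel)
        indexInfo->ii_ParallelWorkers =
            plan_create_index_workers(RelationGetRelid(heapRelation),
                                      RelationGetRelid(indexRelation));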


Some notes:
As a project PostgreSQL seems to be trying to move away from
hardcoding heap into everything in favour of the more AM-agnostic
'table'. I suggest replacing all mentions of "heap" in the arguments
with "table", to reduce the work future maintainers need to do to fix
this. Even when this AM is mostly targetted towards the heap AM, other
AMs (such as those I've heard of that were developed internally at
EDB) use the same block-addressing that heap does, and should thus be
compatible with BRIN. Thus, "heap" is not a useful name here.

There are 2 new mentions of "tuplestore" in the patch, while the
structure used is a tuplesort: one in form_and_spill_tuple, and one in
brinbuildCallbackParallel. Please update those comments.

That's it for code review. I'll do some performance comparisons and
testing soon, too.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Parallel CREATE INDEX for BRIN indexes

From
Tomas Vondra
Date:
On 11/20/23 20:48, Matthias van de Meent wrote:
> On Wed, 8 Nov 2023 at 12:03, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> Hi,
>>
>> here's an updated patch, addressing the review comments, and reworking
>> how the work is divided between the workers & leader etc.
>>
>> 0001 is just v2, rebased to current master
>>
>> 0002 and 0003 address most of the issues, in particular it
>>
>>   - removes the unnecessary spool
>>   - fixes bs_reltuples type to double
>>   - a couple comments are reworded to be clearer
>>   - changes loop/condition in brinbuildCallbackParallel
>>   - removes asserts added for debugging
>>   - fixes cast in comparetup_index_brin
>>   - 0003 then simplifies comparetup_index_brin
>>   - I haven't inlined the tuplesort_puttuple_common parameter
>>     (didn't seem worth it)
> 
> OK, thanks
> 
>> 0004 Reworks how the work is divided between workers and combined by the
>> leader. It undoes the tableam.c changes that attempted to divide the
>> relation into chunks matching the BRIN ranges, and instead merges the
>> results in the leader (using the BRIN "union" function).
> 
> That's OK.
> 
>> I haven't done any indentation fixes yet.
>>
>> I did fairly extensive testing, using pageinspect to compare indexes
>> built with/without parallelism. More testing is needed, but it seems to
>> work fine (with other opclasses and so on).
> 
> After code-only review, here are some comments:
> 
>> +++ b/src/backend/access/brin/brin.c
>> [...]
>> +/* Magic numbers for parallel state sharing */
>> +#define PARALLEL_KEY_BRIN_SHARED        UINT64CONST(0xA000000000000001)
>> +#define PARALLEL_KEY_TUPLESORT            UINT64CONST(0xA000000000000002)
> 
> These shm keys use the same constants also in use in
> access/nbtree/nbtsort.c. While this shouldn't be an issue in normal
> operations, I'd prefer if we didn't actively introduce conflicting
> identifiers when we still have significant amounts of unused values
> remaining.
> 

Hmmm. Is there some rule of thumb for how to pick these key values? I
see nbtsort.c uses the 0xA prefix, execParallel.c uses 0xE, while
parallel.c ended up using 0xFFFFFFFFFFFF as the prefix. I've used 0xB,
simply because BRIN also starts with B.
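
That is:

    /* Magic numbers for parallel state sharing, 0xB for BRIN */
    #define PARALLEL_KEY_BRIN_SHARED    UINT64CONST(0xB000000000000001)
    #define PARALLEL_KEY_TUPLESORT      UINT64CONST(0xB000000000000002)
    #define PARALLEL_KEY_QUERY_TEXT     UINT64CONST(0xB000000000000003)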

>> +#define PARALLEL_KEY_QUERY_TEXT            UINT64CONST(0xA000000000000003)
> 
> This is the fourth definition of a PARALLEL%_KEY_QUERY_TEXT, the
> others being in access/nbtree/nbtsort.c (value 0xA000000000000004, one
> more than brin's), backend/executor/execParallel.c
> (0xE000000000000008), and PARALLEL_VACUUM_KEY_QUERY_TEXT (0x3) (though
> I've not checked that their uses are exactly the same, I'd expect at
> least btree to match mostly, if not fully, 1:1).
> I think we could probably benefit from a less ad-hoc sharing of query
> texts. I don't think that needs to happen specifically in this patch,
> but I think it's something to keep in mind in future efforts.
> 

I'm afraid I don't quite get what you mean by "ad hoc sharing of query
texts". Are you saying we shouldn't propagate the query text to the
parallel workers? Why? Or what's the proper solution?

>> +_brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
>> [...]
>> +    BrinSpool  *spool = state->bs_spool;
>> [...]
>> +    if (!state)
>> +        return;
> 
> I think the assignment to spool should be moved to below this
> condition, as _brin_begin_parallel calls this with state=NULL when it
> can't launch parallel workers, which will cause issues here.
> 

Good catch! I wonder if we have tests that might trigger this, say by
setting max_parallel_maintenance_workers > 0 while no workers are
allowed.

>> +    state->bs_numtuples = brinshared->indtuples;
> 
> My IDE complains about bs_numtuples being an integer. This is a
> pre-existing issue, but still valid: we can hit an overflow on tables
> with pages_per_range=1 and relsize >= 2^31 pages. Extremely unlikely,
> but problematic nonetheless.
> 

True. I think I've been hesitant to make this a double because it
seems a bit weird to do +1 with a double, and at some point (d == d+1).
But this seems safe - we're guaranteed to be far away from that
threshold.

>> +     * FIXME This probably needs some memory management fixes - we're reading
>> +     * tuples from the tuplesort, we're allocating an emty tuple, and so on.
>> +     * Probably better to release this memory.
> 
> This should probably be resolved.
> 

AFAICS that comment is actually inaccurate/stale, sorry about that. The
code actually allocates (and resets) a single memtuple, and also
emptyTuple. So I think that's OK, I've removed the comment.

> I also noticed that this is likely to execute `union_tuples` many
> times when pages_per_range is coprime with the parallel table scan's
> block stride (or when we for other reasons have many tuples returned
> for each range); and this `union_tuples` internally allocates and
> deletes its own memory context for its deserialization of the 'b'
> tuple. I think we should just pass a scratch context instead, so that
> we don't have the overhead of continously creating then deleting the
> same memory context

Perhaps. Looking at the code, isn't it a bit strange how union_tuples
uses the context? It creates the context, calls brin_deform_tuple in
that context, but then the rest of the function (including datumCopy and
similar stuff) happens in the caller's context ...

However, I don't think the number of union_tuples calls is likely to be
very high, especially for large tables. Because we split the table into
2048 chunks, and then cap the chunk size at 8192. For large tables
(where this matters) we're likely close to 8192.
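
To illustrate, here's a standalone approximation of that chunk-size
heuristic (the actual tableam code also rounds the size to a power of
two, which doesn't change the point):

    #include <stdint.h>
    #include <stdio.h>

    /* split the table into ~2048 chunks, cap each chunk at 8192 blocks */
    static uint32_t
    chunk_size(uint32_t nblocks)
    {
        uint32_t    sz = nblocks / 2048;

        if (sz < 1)
            sz = 1;
        if (sz > 8192)
            sz = 8192;
        return sz;
    }

    int
    main(void)
    {
        printf("%u\n", chunk_size(262144));     /* 2GB table -> 128 blocks */
        printf("%u\n", chunk_size(67108864));   /* 512GB table -> 8192 blocks */
        return 0;
    }

So with 8kB pages the 8192 cap only kicks in above ~128GB.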

> 
>> +++ b/src/backend/catalog/index.c
>> [...]
>> -        indexRelation->rd_rel->relam == BTREE_AM_OID)
>> +        (indexRelation->rd_rel->relam == BTREE_AM_OID ||
>> +         indexRelation->rd_rel->relam == BRIN_AM_OID))
> 
> I think this needs some more effort. I imagine a new
> IndexAmRoutine->amcanbuildparallel is more appropriate than this
> hard-coded list - external indexes may want to utilize the parallel
> index creation planner facility, too.
> 

Good idea. I added the IndexAmRoutine flag and used it here.
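
For the record, that makes the check in index.c AM-agnostic, roughly
like this (a sketch, not necessarily the exact committed hunk):

    -        (indexRelation->rd_rel->relam == BTREE_AM_OID ||
    -         indexRelation->rd_rel->relam == BRIN_AM_OID))
    +        indexRelation->rd_indam->amcanbuildparallel)

with each AM that supports parallel builds opting in from its handler:

    amroutine->amcanbuildparallel = true;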

> 
> Some notes:
> As a project PostgreSQL seems to be trying to move away from
> hardcoding heap into everything in favour of the more AM-agnostic
> 'table'. I suggest replacing all mentions of "heap" in the arguments
> with "table", to reduce the work future maintainers need to do to fix
> this. Even when this AM is mostly targetted towards the heap AM, other
> AMs (such as those I've heard of that were developed internally at
> EDB) use the same block-addressing that heap does, and should thus be
> compatible with BRIN. Thus, "heap" is not a useful name here.
> 

I'm not against doing that, but I'd prefer to do that in a separate
patch. There's a bunch of preexisting heap references, and I don't
want to introduce inconsistency (patch using table, old code heap) nor
do I want to tweak unrelated code.

> There are 2 new mentions of "tuplestore" in the patch, while the
> structure used is tuplesort: one on form_and_spill_tuple, and one on
> brinbuildCallbackParallel. Please update those comments.
> 
> That's it for code review. I'll do some performance comparisons and
> testing soon, too.
> 

Thanks! Attached is a patch squashing the previous version into a single
v3 commit, with fixes for your review in a separate commit.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: Parallel CREATE INDEX for BRIN indexes

From
Matthias van de Meent
Date:
Hi,

On Wed, 22 Nov 2023 at 20:16, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 11/20/23 20:48, Matthias van de Meent wrote:
>> On Wed, 8 Nov 2023 at 12:03, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> Hi,
>>>
>>> here's an updated patch, addressing the review comments, and reworking
>>> how the work is divided between the workers & leader etc.
>>>
>>
>> After code-only review, here are some comments:
>>
>>> +++ b/src/backend/access/brin/brin.c
>>> [...]
>>> +/* Magic numbers for parallel state sharing */
>>> +#define PARALLEL_KEY_BRIN_SHARED        UINT64CONST(0xA000000000000001)
>>> +#define PARALLEL_KEY_TUPLESORT            UINT64CONST(0xA000000000000002)
>>
>> These shm keys use the same constants also in use in
>> access/nbtree/nbtsort.c. While this shouldn't be an issue in normal
>> operations, I'd prefer if we didn't actively introduce conflicting
>> identifiers when we still have significant amounts of unused values
>> remaining.
>>
>
> Hmmm. Is there some rule of thumb for how to pick these key values?

None that I know of.
There is a warning in the various places that define these constants
that they take care not to conflict with plan nodes' node_id: parallel
plan execution uses plain plan node IDs as keys, and as node_id is
int-sized, any manually created key with a value < 2^32 must be sure
it can't be used while a parallel plan executes in the same backend.
But apart from that one case, I can't find a convention, no.

>>> +#define PARALLEL_KEY_QUERY_TEXT            UINT64CONST(0xA000000000000003)
>>
>> This is the fourth definition of a PARALLEL%_KEY_QUERY_TEXT, the
>> others being in access/nbtree/nbtsort.c (value 0xA000000000000004, one
>> more than brin's), backend/executor/execParallel.c
>> (0xE000000000000008), and PARALLEL_VACUUM_KEY_QUERY_TEXT (0x3) (though
>> I've not checked that their uses are exactly the same, I'd expect at
>> least btree to match mostly, if not fully, 1:1).
>> I think we could probably benefit from a less ad-hoc sharing of query
>> texts. I don't think that needs to happen specifically in this patch,
>> but I think it's something to keep in mind in future efforts.
>>
>
> I'm afraid I don't quite get what you mean by "ad hoc sharing of query
> texts". Are you saying we shouldn't propagate the query text to the
> parallel workers? Why? Or what's the proper solution?

What I mean is that we have several different keys that all look like
they contain the debug query string, and always for the same debugging
purposes. For debugging, I think it'd be useful to use one well-known
key, rather than N well-known keys in each of the N parallel
subsystems.

But as mentioned, it doesn't need to happen in this patch, as that'd
increase scope beyond brin/index ams.

>>> +    state->bs_numtuples = brinshared->indtuples;
>>
>> My IDE complains about bs_numtuples being an integer. This is a
>> pre-existing issue, but still valid: we can hit an overflow on tables
>> with pages_per_range=1 and relsize >= 2^31 pages. Extremely unlikely,
>> but problematic nonetheless.
>>
>
> True. I think I've been hesitant to make this a double because it seems
> a bit weird to do +1 with a double, and at some point we'd hit (d == d+1). But
> this seems safe, we're guaranteed to be far away from that threshold.

Yes, ignoring practical constraints like page space, we "only" have
bitspace for 2^48 tuples in each (non-partitioned) relation, so
double's 53 significant bits should be more than enough to count
tuples.
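
For instance (standalone):

    #include <stdio.h>

    int
    main(void)
    {
        double      d = 9007199254740992.0; /* 2^53 */

        printf("%d\n", d + 1 == d);     /* 1: the +1 is already lost */
        printf("%d\n", d - 1 == d);     /* 0: still exact below 2^53 */
        return 0;
    }

and 2^48 is comfortably below that threshold.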

>> I also noticed that this is likely to execute `union_tuples` many
>> times when pages_per_range is coprime with the parallel table scan's
>> block stride (or when we for other reasons have many tuples returned
>> for each range); and this `union_tuples` internally allocates and
>> deletes its own memory context for its deserialization of the 'b'
>> tuple. I think we should just pass a scratch context instead, so that
>> we don't have the overhead of continuously creating then deleting the
>> same memory context.
>
> Perhaps. Looking at the code, isn't it a bit strange how union_tuples
> uses the context? It creates the context, calls brin_deform_tuple in
> that context, but then the rest of the function (including datumCopy and
> similar stuff) happens in the caller's context ...

The union operator may leak (lots of) memory, so I think it makes
sense to keep a context around that can be reset after we've extracted
the merge result.

> However, I don't think the number of union_tuples calls is likely to be
> very high, especially for large tables. Because we split the table into
> 2048 chunks, and then cap the chunk size at 8192. For large tables
> (where this matters) we're likely close to 8192.

I agree that the merging part of the index creation is the last part,
and usually has no high impact on the total performance of the reindex
operation, but in memory-constrained environments releasing and then
requesting the same chunk of memory over and over again just isn't
great.
Also note that parallel scan chunk sizes decrease when we're about to
hit the end of the table, and that a different AM may have different
ideas about scanning a table in parallel; it could very well decide to
use striped assignments exclusively, as opposed to on-demand chunk
allocations; both increasing the chance that brin's page ranges are
processed by more than one backend.

>> As a project PostgreSQL seems to be trying to move away from
>> hardcoding heap into everything in favour of the more AM-agnostic
>> 'table'. I suggest replacing all mentions of "heap" in the arguments
>> with "table", to reduce the work future maintainers need to do to fix
>> this.
>
> I'm not against doing that, but I'd prefer to do that in a separate
> patch. There's a bunch of preexisting heap references, and I don't
> want to introduce inconsistency (patch using table, old code heap) nor
> do I want to tweak unrelated code.

Sure.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Parallel CREATE INDEX for BRIN indexes

From
Tomas Vondra
Date:
On 11/23/23 13:33, Matthias van de Meent wrote:
> Hi,
> 
> On Wed, 22 Nov 2023 at 20:16, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 11/20/23 20:48, Matthias van de Meent wrote:
>>> On Wed, 8 Nov 2023 at 12:03, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> here's an updated patch, addressing the review comments, and reworking
>>>> how the work is divided between the workers & leader etc.
>>>>
>>>
>>> After code-only review, here are some comments:
>>>
>>>> +++ b/src/backend/access/brin/brin.c
>>>> [...]
>>>> +/* Magic numbers for parallel state sharing */
>>>> +#define PARALLEL_KEY_BRIN_SHARED        UINT64CONST(0xA000000000000001)
>>>> +#define PARALLEL_KEY_TUPLESORT            UINT64CONST(0xA000000000000002)
>>>
>>> These shm keys use the same constants also in use in
>>> access/nbtree/nbtsort.c. While this shouldn't be an issue in normal
>>> operations, I'd prefer if we didn't actively introduce conflicting
>>> identifiers when we still have significant amounts of unused values
>>> remaining.
>>>
>>
>> Hmmm. Is there some rule of thumb for how to pick these key values?
> 
> None that I know of.
> There is a warning in the various places that define these constants
> that they take care not to conflict with plan nodes' node_id: parallel
> plan execution uses plain plan node IDs as keys, and as node_id is
> int-sized, any manually created key with a value < 2^32 must be sure
> it can't be used while a parallel plan executes in the same backend.
> But apart from that one case, I can't find a convention, no.
> 

OK, in that case 0xB is fine.

>>>> +#define PARALLEL_KEY_QUERY_TEXT            UINT64CONST(0xA000000000000003)
>>>
>>> This is the fourth definition of a PARALLEL%_KEY_QUERY_TEXT, the
>>> others being in access/nbtree/nbtsort.c (value 0xA000000000000004, one
>>> more than brin's), backend/executor/execParallel.c
>>> (0xE000000000000008), and PARALLEL_VACUUM_KEY_QUERY_TEXT (0x3) (though
>>> I've not checked that their uses are exactly the same, I'd expect at
>>> least btree to match mostly, if not fully, 1:1).
>>> I think we could probably benefit from a less ad-hoc sharing of query
>>> texts. I don't think that needs to happen specifically in this patch,
>>> but I think it's something to keep in mind in future efforts.
>>>
>>
>> I'm afraid I don't quite get what you mean by "ad hoc sharing of query
>> texts". Are you saying we shouldn't propagate the query text to the
>> parallel workers? Why? Or what's the proper solution?
> 
> What I mean is that we have several different keys that all look like
> they contain the debug query string, and always for the same debugging
> purposes. For debugging, I think it'd be useful to use one well-known
> key, rather than N well-known keys in each of the N parallel
> subsystems.
> 
> But as mentioned, it doesn't need to happen in this patch, as that'd
> increase scope beyond brin/index ams.
> 

Agreed.

>>> I also noticed that this is likely to execute `union_tuples` many
>>> times when pages_per_range is coprime with the parallel table scan's
>>> block stride (or when we for other reasons have many tuples returned
>>> for each range); and this `union_tuples` internally allocates and
>>> deletes its own memory context for its deserialization of the 'b'
>>> tuple. I think we should just pass a scratch context instead, so that
>>> we don't have the overhead of continuously creating then deleting the
>>> same memory context.
>>
>> Perhaps. Looking at the code, isn't it a bit strange how union_tuples
>> uses the context? It creates the context, calls brin_deform_tuple in
>> that context, but then the rest of the function (including datumCopy and
>> similar stuff) happens in the caller's context ...
> 
> The union operator may leak (lots of) memory, so I think it makes
> sense to keep a context around that can be reset after we've extracted
> the merge result.
> 

But does the current code actually achieve that? It does create a "brin
union" context, but then it only does this:

    /* Use our own memory context to avoid retail pfree */
    cxt = AllocSetContextCreate(CurrentMemoryContext,
                                "brin union",
                                ALLOCSET_DEFAULT_SIZES);
    oldcxt = MemoryContextSwitchTo(cxt);
    db = brin_deform_tuple(bdesc, b, NULL);
    MemoryContextSwitchTo(oldcxt);

Surely that does not limit the amount of memory used by the actual union
functions in any way?

>> However, I don't think the number of union_tuples calls is likely to be
>> very high, especially for large tables. Because we split the table into
> 2048 chunks, and then cap the chunk size at 8192. For large tables
>> (where this matters) we're likely close to 8192.
> 
> I agree that the merging part of the index creation is the last part,
> and usually has no high impact on the total performance of the reindex
> operation, but in memory-constrained environments releasing and then
> requesting the same chunk of memory over and over again just isn't
> great.

OK, I'll take a look at the scratch context you suggested.

My point however was we won't actually do that very often, because on
large tables the BRIN ranges are likely smaller than the parallel scan
chunk size, so few overlaps. OTOH if the table is small, or if the BRIN
ranges are large, there'll be few of them.

> Also note that parallel scan chunk sizes decrease when we're about to
> hit the end of the table, and that a different AM may have different
> ideas about scanning a table in parallel; it could very well decide to
> use striped assignments exclusively, as opposed to on-demand chunk
> allocations; both increasing the chance that brin's page ranges are
> processed by more than one backend.
> 

Yeah, but the ramp-up and ramp-down should have negligible impact, IMO.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel CREATE INDEX for BRIN indexes

From
Matthias van de Meent
Date:
On Thu, 23 Nov 2023 at 14:35, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 11/23/23 13:33, Matthias van de Meent wrote:
>> The union operator may leak (lots of) memory, so I think it makes
>> sense to keep a context around that can be reset after we've extracted
>> the merge result.
>>
>
> But does the current code actually achieve that? It does create a "brin
> union" context, but then it only does this:
>
>     /* Use our own memory context to avoid retail pfree */
>     cxt = AllocSetContextCreate(CurrentMemoryContext,
>                                 "brin union",
>                                 ALLOCSET_DEFAULT_SIZES);
>     oldcxt = MemoryContextSwitchTo(cxt);
>     db = brin_deform_tuple(bdesc, b, NULL);
>     MemoryContextSwitchTo(oldcxt);
>
> Surely that does not limit the amount of memory used by the actual union
> functions in any way?

Oh, yes, of course. For some reason I thought that covered the calls
to the union operator function too, but it indeed only covers
deserialization. I do think it is still worthwhile to not do the
create/delete cycle, but won't hold the patch back for that.

>>> However, I don't think the number of union_tuples calls is likely to be
>>> very high, especially for large tables. Because we split the table into
>>> 2048 chunks, and then cap the chunk size at 8192. For large tables
>>> (where this matters) we're likely close to 8192.
>>
>> I agree that the merging part of the index creation is the last part,
>> and usually has no high impact on the total performance of the reindex
>> operation, but in memory-constrained environments releasing and then
>> requesting the same chunk of memory over and over again just isn't
>> great.
>
> OK, I'll take a look at the scratch context you suggested.
>
> My point however was we won't actually do that very often, because on
> large tables the BRIN ranges are likely smaller than the parallel scan
> chunk size, so few overlaps. OTOH if the table is small, or if the BRIN
> ranges are large, there'll be few of them.

That's true, so maybe I'm concerned about something that amounts to
only marginal gains.

I noticed that the v4 patch doesn't yet update the documentation in
indexam.sgml with am->amcanbuildparallel.
Once that is included and reviewed I think this will be ready, unless
you want to address any of my comments upthread (that I marked with
'not in this patch') in this patch.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Parallel CREATE INDEX for BRIN indexes

From
Tomas Vondra
Date:
On 11/28/23 16:39, Matthias van de Meent wrote:
> On Thu, 23 Nov 2023 at 14:35, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> On 11/23/23 13:33, Matthias van de Meent wrote:
>>> The union operator may leak (lots of) memory, so I think it makes
>>> sense to keep a context around that can be reset after we've extracted
>>> the merge result.
>>>
>>
>> But does the current code actually achieve that? It does create a "brin
>> union" context, but then it only does this:
>>
>>     /* Use our own memory context to avoid retail pfree */
>>     cxt = AllocSetContextCreate(CurrentMemoryContext,
>>                                 "brin union",
>>                                 ALLOCSET_DEFAULT_SIZES);
>>     oldcxt = MemoryContextSwitchTo(cxt);
>>     db = brin_deform_tuple(bdesc, b, NULL);
>>     MemoryContextSwitchTo(oldcxt);
>>
>> Surely that does not limit the amount of memory used by the actual union
>> functions in any way?
> 
> Oh, yes, of course. For some reason I thought that covered the calls
> to the union operator function too, but it indeed only covers
> deserialization. I do think it is still worthwhile to not do the
> create/delete cycle, but won't hold the patch back for that.
> 

I think the union_tuples() changes are better left for a separate patch.

>>>> However, I don't think the number of union_tuples calls is likely to be
>>>> very high, especially for large tables. Because we split the table into
>>>> 2048 chunks, and then cap the chunk size at 8192. For large tables
>>>> (where this matters) we're likely close to 8192.
>>>
>>> I agree that the merging part of the index creation is the last part,
>>> and usually has no high impact on the total performance of the reindex
>>> operation, but in memory-constrained environments releasing and then
>>> requesting the same chunk of memory over and over again just isn't
>>> great.
>>
>> OK, I'll take a look at the scratch context you suggested.
>>
>> My point however was we won't actually do that very often, because on
>> large tables the BRIN ranges are likely smaller than the parallel scan
>> chunk size, so few overlaps. OTOH if the table is small, or if the BRIN
>> ranges are large, there'll be few of them.
> 
> That's true, so maybe I'm concerned about something that amounts to
> only marginal gains.
> 

However, after thinking about this a bit more, I think we actually do
need to do something about the memory management when merging tuples.
AFAIK the general assumption was that union_tuple() only runs for a
single range, and then the whole context gets freed. But the way the
merging was implemented, it all runs in a single context. And while a
single union_tuple() may not need a lot of memory, in total it may be
annoying. I just added a palloc(1MB) into union_tuples and ended up with
~160MB allocated in the PortalContext on just a 2GB table. In practice the
memory will grow more slowly, but not great :-/

The attached 0003 patch adds a memory context that's reset after
producing a merged BRIN tuple for each page range.
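
The shape of it is roughly this (a sketch of the idea rather than the
exact diff - merge_one_range() is a placeholder for the deform / union
/ insert logic):

    MemoryContext cxt,
                oldcxt;
    int         i;

    cxt = AllocSetContextCreate(CurrentMemoryContext,
                                "brin union",
                                ALLOCSET_DEFAULT_SIZES);

    for (i = 0; i < nranges; i++)
    {
        oldcxt = MemoryContextSwitchTo(cxt);
        merge_one_range(i);     /* leaks land in cxt, not PortalContext */
        MemoryContextSwitchTo(oldcxt);

        /* one cheap reset per merged range, no create/delete cycle */
        MemoryContextReset(cxt);
    }

    MemoryContextDelete(cxt);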

> I noticed that the v4 patch doesn't yet update the documentation in
> indexam.sgml with am->amcanbuildparallel.

Should be fixed by 0002. I decided to add a simple note to ambuild(),
not sure if something more is needed.

> Once that is included and reviewed I think this will be ready, unless
> you want to address any of my comments upthread (that I marked with
> 'not in this patch') in this patch.
> 

Thanks. I believe the attached version addresses it. There's also 0004
with some indentation tweaks per pgindent.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: Parallel CREATE INDEX for BRIN indexes

From
Matthias van de Meent
Date:
On Tue, 28 Nov 2023 at 18:59, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 11/28/23 16:39, Matthias van de Meent wrote:
> > On Thu, 23 Nov 2023 at 14:35, Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >> On 11/23/23 13:33, Matthias van de Meent wrote:
> >>> The union operator may leak (lots of) memory, so I think it makes
> >>> sense to keep a context around that can be reset after we've extracted
> >>> the merge result.
> >>>
> >>
> >> But does the current code actually achieve that? It does create a "brin
> >> union" context, but then it only does this:
> >>
> >>     /* Use our own memory context to avoid retail pfree */
> >>     cxt = AllocSetContextCreate(CurrentMemoryContext,
> >>                                 "brin union",
> >>                                 ALLOCSET_DEFAULT_SIZES);
> >>     oldcxt = MemoryContextSwitchTo(cxt);
> >>     db = brin_deform_tuple(bdesc, b, NULL);
> >>     MemoryContextSwitchTo(oldcxt);
> >>
> >> Surely that does not limit the amount of memory used by the actual union
> >> functions in any way?
> >
> > Oh, yes, of course. For some reason I thought that covered the calls
> > to the union operator function too, but it indeed only covers
> > deserialization. I do think it is still worthwhile to not do the
> > create/delete cycle, but won't hold the patch back for that.
> >
>
> I think the union_tuples() changes are better left for a separate patch.
>
> >>>> However, I don't think the number of union_tuples calls is likely to be
> >>>> very high, especially for large tables. Because we split the table into
> >>>> 2048 chunks, and then cap the chunk size at 8192. For large tables
> >>>> (where this matters) we're likely close to 8192.
> >>>
> >>> I agree that the merging part of the index creation is the last part,
> >>> and usually has no high impact on the total performance of the reindex
> >>> operation, but in memory-constrained environments releasing and then
> >>> requesting the same chunk of memory over and over again just isn't
> >>> great.
> >>
> >> OK, I'll take a look at the scratch context you suggested.
> >>
> >> My point however was we won't actually do that very often, because on
> >> large tables the BRIN ranges are likely smaller than the parallel scan
> >> chunk size, so few overlaps. OTOH if the table is small, or if the BRIN
> >> ranges are large, there'll be few of them.
> >
> > That's true, so maybe I'm concerned about something that amounts to
> > only marginal gains.
> >
>
> However, after thinking about this a bit more, I think we actually do
> need to do something about the memory management when merging tuples.
> AFAIK the general assumption was that union_tuple() only runs for a
> single range, and then the whole context gets freed.

Correct, but it also is (or should be) assumed that union_tuple
will be called several times in the same context to fix repeated
concurrent updates. Presumably, that only happens rarely, but it's
something that should be kept in mind regardless.

> But the way the
> merging was implemented, it all runs in a single context. And while a
> single union_tuple() may not need a lot of memory, in total it may be
> annoying. I just added a palloc(1MB) into union_tuples and ended up with
> ~160MB allocated in the PortalContext on just a 2GB table. In practice the
> memory will grow more slowly, but not great :-/
>
> The attached 0003 patch adds a memory context that's reset after
> producing a merged BRIN tuple for each page range.

Looks good.

This also made me think a bit more about how we're working with the
tuples. With your latest patch, we always deserialize and re-serialize
the sorted brin tuples, just in case the next tuple will also be a
BRIN tuple of the same page range. Could we save some of that
deserialization time by optimistically expecting that we're not going
to need to merge the tuple and only store a copy of it locally?
See attached 0002; this saves some cycles in common cases.

The v20231128 version of the patchset (as squashed, attached v5-0001)
looks good to me.

Kind regards,

Matthias van de Meent
Neon (http://neon.tech)

Attachments

Re: Parallel CREATE INDEX for BRIN indexes

From
Tomas Vondra
Date:
On 11/29/23 15:42, Matthias van de Meent wrote:
> On Tue, 28 Nov 2023 at 18:59, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 11/28/23 16:39, Matthias van de Meent wrote:
>>> On Thu, 23 Nov 2023 at 14:35, Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>> On 11/23/23 13:33, Matthias van de Meent wrote:
>>>>> The union operator may leak (lots of) memory, so I think it makes
>>>>> sense to keep a context around that can be reset after we've extracted
>>>>> the merge result.
>>>>>
>>>>
>>>> But does the current code actually achieve that? It does create a "brin
>>>> union" context, but then it only does this:
>>>>
>>>>     /* Use our own memory context to avoid retail pfree */
>>>>     cxt = AllocSetContextCreate(CurrentMemoryContext,
>>>>                                 "brin union",
>>>>                                 ALLOCSET_DEFAULT_SIZES);
>>>>     oldcxt = MemoryContextSwitchTo(cxt);
>>>>     db = brin_deform_tuple(bdesc, b, NULL);
>>>>     MemoryContextSwitchTo(oldcxt);
>>>>
>>>> Surely that does not limit the amount of memory used by the actual union
>>>> functions in any way?
>>>
>>> Oh, yes, of course. For some reason I thought that covered the calls
>>> to the union operator function too, but it indeed only covers
>>> deserialization. I do think it is still worthwhile to not do the
>>> create/delete cycle, but won't hold the patch back for that.
>>>
>>
>> I think the union_tuples() changes are better left for a separate patch.
>>
>>>>>> However, I don't think the number of union_tuples calls is likely to be
>>>>>> very high, especially for large tables. Because we split the table into
>>>>>> 2048 chunks, and then cap the chunk size at 8192. For large tables
>>>>>> (where this matters) we're likely close to 8192.
>>>>>
>>>>> I agree that the merging part of the index creation is the last part,
>>>>> and usually has no high impact on the total performance of the reindex
>>>>> operation, but in memory-constrained environments releasing and then
>>>>> requesting the same chunk of memory over and over again just isn't
>>>>> great.
>>>>
>>>> OK, I'll take a look at the scratch context you suggested.
>>>>
>>>> My point however was we won't actually do that very often, because on
>>>> large tables the BRIN ranges are likely smaller than the parallel scan
>>>> chunk size, so few overlaps. OTOH if the table is small, or if the BRIN
>>>> ranges are large, there'll be few of them.
>>>
>>> That's true, so maybe I'm concerned about something that amounts to
>>> only marginal gains.
>>>
>>
>> However, after thinking about this a bit more, I think we actually do
>> need to do something about the memory management when merging tuples.
>> AFAIK the general assumption was that union_tuple() only runs for a
>> single range, and then the whole context gets freed.
> 
> Correct, but it also is (or should be) assumed that union_tuple
> will be called several times in the same context to fix repeated
> concurrent updates. Presumably, that only happens rarely, but it's
> something that should be kept in mind regardless.
> 

In theory, yes. But union_tuples() is used only in summarize_range(),
and that only processes a single page range.

>> But the way the
>> merging was implemented, it all runs in a single context. And while a
>> single union_tuple() may not need a lot of memory, in total it may be
>> annoying. I just added a palloc(1MB) into union_tuples and ended up with
>> ~160MB allocated in the PortalContext on just a 2GB table. In practice the
>> memory will grow more slowly, but not great :-/
>>
>> The attached 0003 patch adds a memory context that's reset after
>> producing a merged BRIN tuple for each page range.
> 
> Looks good.
> 
> This also made me think a bit more about how we're working with the
> tuples. With your latest patch, we always deserialize and re-serialize
> the sorted brin tuples, just in case the next tuple will also be a
> BRIN tuple of the same page range. Could we save some of that
> deserialization time by optimistically expecting that we're not going
> to need to merge the tuple and only store a copy of it locally?
> See attached 0002; this saves some cycles in common cases.
> 

Good idea!

> The v20231128 version of the patchset (as squashed, attached v5-0001)
> looks good to me.
> 

Cool. I'll put this through a bit more stress testing, and then I'll get
it pushed.

Thanks for the reviews!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel CREATE INDEX for BRIN indexes

From
Tomas Vondra
Date:
On 11/29/23 15:52, Tomas Vondra wrote:
>> ...
>>
>> This also made me think a bit more about how we're working with the
>> tuples. With your latest patch, we always deserialize and re-serialize
>> the sorted brin tuples, just in case the next tuple will also be a
>> BRIN tuple of the same page range. Could we save some of that
>> deserialization time by optimistically expecting that we're not going
>> to need to merge the tuple and only store a copy of it locally?
>> See attached 0002; this saves some cycles in common cases.
>>
> 
> Good idea!
> 

FWIW there's a bug in this part of the optimization:

------------------
+    if (memtuple == NULL)
+        memtuple = brin_deform_tuple(state->bs_bdesc, btup,
+                                     memtup_holder);
+
     union_tuples(state->bs_bdesc, memtuple, btup);
     continue;
------------------

The deforming should use prevbtup, otherwise union_tuples() just combines
two copies of the same tuple.
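
That is, assuming the surrounding loop stays as is, the block should
read:

------------------
+    if (memtuple == NULL)
+        memtuple = brin_deform_tuple(state->bs_bdesc, prevbtup,
+                                     memtup_holder);
+
     union_tuples(state->bs_bdesc, memtuple, btup);
     continue;
------------------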

Which however brings me to the bigger issue with this - my stress test
found this issue pretty quickly, but then I spent quite a bit of time
trying to find what went wrong. I find this reworked code pretty hard to
understand, and not necessarily because of how it's written. The problem
is that the same loop tries to juggle multiple pieces of information with
different lifespans, and so on. I find it really hard to reason about
how it behaves ...

I did try to measure how much it actually saves, but none of the tests I
did actually found measurable improvement. So I'm tempted to just not
include this part, and accept that we may deserialize some of the tuples
unnecessarily.

Did you actually observe measurable improvements in some cases?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel CREATE INDEX for BRIN indexes

From
Matthias van de Meent
Date:
On Wed, 29 Nov 2023 at 18:55, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 11/29/23 15:52, Tomas Vondra wrote:
> >> ...
> >>
> >> This also made me think a bit more about how we're working with the
> >> tuples. With your latest patch, we always deserialize and re-serialize
> >> the sorted brin tuples, just in case the next tuple will also be a
> >> BRIN tuple of the same page range. Could we save some of that
> >> deserialization time by optimistically expecting that we're not going
> >> to need to merge the tuple and only store a copy of it locally?
> >> See attached 0002; this saves some cycles in common cases.
> >>
> >
> > Good idea!
> >
>
> FWIW there's a bug in this part of the optimization:
>
> ------------------
> +    if (memtuple == NULL)
> +        memtuple = brin_deform_tuple(state->bs_bdesc, btup,
> +                                     memtup_holder);
> +
>      union_tuples(state->bs_bdesc, memtuple, btup);
>      continue;
> ------------------
>
> The deforming should use prevbtup, otherwise union_tuples() just combines
> two copies of the same tuple.

Good point. There were some more issues as well, fixes are attached.

> Which however brings me to the bigger issue with this - my stress test
> found this issue pretty quickly, but then I spent quite a bit of time
> trying to find what went wrong. I find this reworked code pretty hard to
> understand, and not necessarily because of how it's written. The problem
> is that the same loop tries to juggle multiple pieces of information with
> different lifespans, and so on. I find it really hard to reason about
> how it behaves ...

Yeah, it'd be nice if we had a peek option for sortsupport; that'd
improve context handling.

> I did try to measure how much it actually saves, but none of the tests I
> did actually found measurable improvement. So I'm tempted to just not
> include this part, and accept that we may deserialize some of the tuples
> unnecessarily.
>
> Did you actually observe measurable improvements in some cases?

The improvements would mostly stem from brin indexes with multiple
(potentially compressed) by-ref types, as they go through more complex
and expensive code to deserialize, requiring separate palloc() and
memcpy() calls each.
For single-column and by-value types the improvements are expected to
be negligible, because there is no meaningful difference between
copying a single by-ref value and copying its container; the
additional work done for each tuple is marginal for those.

For an 8-column BRIN index ((sha256((id)::text::bytea)::text),
(sha256((id+1)::text::bytea)::text),
(sha256((id+2)::text::bytea)::text), ...) instrumented with 0003 I
measured a difference of 10x less time spent in the main loop of
_brin_end_parallel, from ~30ms to 3ms when dealing with 55k 1-block
ranges. It's not a lot, but worth at least something, I guess?

The attached patch fixes the issue that you called out.
It also further updates _brin_end_parallel: the final 'write empty
tuples' loop is never hit and is thus removed, because if there were
any tuples in the spool we'd have filled the empty ranges at the end
of the main loop, and if there were no tuples in the spool then the
memtuple would still be at its original initialized value of 0 thus
resulting in a constant false condition. I also updated some comments.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachments

Re: Parallel CREATE INDEX for BRIN indexes

From
Tomas Vondra
Date:
On 11/29/23 21:30, Matthias van de Meent wrote:
> On Wed, 29 Nov 2023 at 18:55, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 11/29/23 15:52, Tomas Vondra wrote:
>>>> ...
>>>>
>>>> This also made me think a bit more about how we're working with the
>>>> tuples. With your latest patch, we always deserialize and re-serialize
>>>> the sorted brin tuples, just in case the next tuple will also be a
>>>> BRIN tuple of the same page range. Could we save some of that
>>>> deserialization time by optimistically expecting that we're not going
>>>> to need to merge the tuple and only store a copy of it locally?
>>>> See attached 0002; this saves some cycles in common cases.
>>>>
>>>
>>> Good idea!
>>>
>>
>> FWIW there's a bug in this part of the optimization:
>>
>> ------------------
>> +    if (memtuple == NULL)
>> +        memtuple = brin_deform_tuple(state->bs_bdesc, btup,
>> +                                     memtup_holder);
>> +
>>      union_tuples(state->bs_bdesc, memtuple, btup);
>>      continue;
>> ------------------
>>
>> The deforming should use prevbtup, otherwise union_tuples() just combines
>> two copies of the same tuple.
> 
> Good point. There were some more issues as well, fixes are attached.
> 
>> Which however brings me to the bigger issue with this - my stress test
>> found this issue pretty quickly, but then I spent quite a bit of time
>> trying to find what went wrong. I find this reworked code pretty hard to
>> understand, and not necessarily because of how it's written. The problem
>> is that the same loop tries to juggle multiple pieces of information with
>> different lifespans, and so on. I find it really hard to reason about
>> how it behaves ...
> 
> Yeah, it'd be nice if we had a peek option for sortsupport; that'd
> improve context handling.
> 
>> I did try to measure how much it actually saves, but none of the tests I
>> did actually found measurable improvement. So I'm tempted to just not
>> include this part, and accept that we may deserialize some of the tuples
>> unnecessarily.
>>
>> Did you actually observe measurable improvements in some cases?
> 
> The improvements would mostly stem from brin indexes with multiple
> (potentially compressed) by-ref types, as they go through more complex
> and expensive code to deserialize, requiring separate palloc() and
> memcpy() calls each.
> For single-column and by-value types the improvements are expected to
> be negligible, because there is no meaningful difference between
> copying a single by-ref value and copying its container; the
> additional work done for each tuple is marginal for those.
> 
> For an 8-column BRIN index ((sha256((id)::text::bytea)::text),
> (sha256((id+1)::text::bytea)::text),
> (sha256((id+2)::text::bytea)::text), ...) instrumented with 0003 I
> measured a difference of 10x less time spent in the main loop of
> _brin_end_parallel, from ~30ms to 3ms when dealing with 55k 1-block
> ranges. It's not a lot, but worth at least something, I guess?
> 

It is something, but I can't really convince myself it's worth the extra
code complexity. It's a somewhat extreme example, and the parallelism
certainly saves much more than this.

> The attached patch fixes the issue that you called out.
> It also further updates _brin_end_parallel: the final 'write empty
> tuples' loop is never hit and is thus removed, because if there were
> any tuples in the spool we'd have filled the empty ranges at the end
> of the main loop, and if there were no tuples in the spool then the
> memtuple would still be at its original initialized value of 0 thus
> resulting in a constant false condition. I also updated some comments.
> 

Ah, right. I'll take a look tomorrow, but I guess I didn't realize we
insert the empty ranges in the main loop, because we're already looking
at the *next* summary.

But I think the idea was to insert empty ranges if there's a chunk of
empty ranges at the end of the table, after the last tuple the index
build reads. But I'm not sure that can actually happen ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel CREATE INDEX for BRIN indexes

From
Matthias van de Meent
Date:
On Wed, 29 Nov 2023 at 21:56, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 11/29/23 21:30, Matthias van de Meent wrote:
>> On Wed, 29 Nov 2023 at 18:55, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>> I did try to measure how much it actually saves, but none of the tests I
>>> did actually found measurable improvement. So I'm tempted to just not
>>> include this part, and accept that we may deserialize some of the tuples
>>> unnecessarily.
>>>
>>> Did you actually observe measurable improvements in some cases?
>>
>> The improvements would mostly stem from brin indexes with multiple
>> (potentially compressed) by-ref types, as they go through more complex
>> and expensive code to deserialize, requiring separate palloc() and
>> memcpy() calls each.
>> For single-column and by-value types the improvements are expected to
>> be negligible, because there is no meaningful difference between
>> copying a single by-ref value and copying its container; the
>> additional work done for each tuple is marginal for those.
>>
>> For an 8-column BRIN index ((sha256((id)::text::bytea)::text),
>> (sha256((id+1)::text::bytea)::text),
>> (sha256((id+2)::text::bytea)::text), ...) instrumented with 0003 I
>> measured a difference of 10x less time spent in the main loop of
>> _brin_end_parallel, from ~30ms to 3ms when dealing with 55k 1-block
>> ranges. It's not a lot, but worth at least something, I guess?
>>
>
> It is something, but I can't really convince myself it's worth the extra
> code complexity. It's a somewhat extreme example, and the parallelism
> certainly saves much more than this.

True. For this, I usually keep in mind that the docs on multi-column
indexes still recommend using 1 N-column brin index over N 1-column
brin indexes (assuming the same storage parameters), so multi-column
BRIN indexes should not be considered uncommon:

"The only reason to have multiple BRIN indexes instead of one
multicolumn BRIN index on a single table is to have a different
pages_per_range storage parameter."

Note that most of the time in my example index is spent in creating
the actual tuples due to the use of hashing for data generation; for
index or plain to-text formatting the improvement is much more
pronounced: If I use an 8-column index (id::text, id, ...), index
creation takes ~500ms with 4+ workers. Of this, deforming takes some
20ms, though when skipping the deforming step (i.e., with my patch) it
takes ~3.5ms. That's 3% shaved off the build time when the index
shape is beneficial.

> > The attached patch fixes the issue that you called out.
> > It also further updates _brin_end_parallel: the final 'write empty
> > tuples' loop is never hit and is thus removed, because if there were
> > any tuples in the spool we'd have filled the empty ranges at the end
> > of the main loop, and if there were no tuples in the spool then the
> > memtuple would still be at its original initialized value of 0 thus
> > resulting in a constant false condition. I also updated some comments.
> >
>
> Ah, right. I'll take a look tomorrow, but I guess I didn't realize we
> insert the empty ranges in the main loop, because we're already looking
> at the *next* summary.

Yes, merging adds some significant complexity here. I don't think we
can easily get around that though...

> But I think the idea was to insert empty ranges if there's a chunk of
> empty ranges at the end of the table, after the last tuple the index
> build reads. But I'm not sure that can actually happen ...

This would be trivial to construct with partial indexes; e.g. WHERE
(my_pk IS NULL) would consist of exclusively empty ranges.
I don't see a lot of value in partial BRIN indexes, but I may be
overlooking something.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Parallel CREATE INDEX for BRIN indexes

From
Tomas Vondra
Date:
On 11/29/23 23:59, Matthias van de Meent wrote:
> On Wed, 29 Nov 2023 at 21:56, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 11/29/23 21:30, Matthias van de Meent wrote:
>>> On Wed, 29 Nov 2023 at 18:55, Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>> I did try to measure how much it actually saves, but none of the tests I
>>>> did actually found measurable improvement. So I'm tempted to just not
>>>> include this part, and accept that we may deserialize some of the tuples
>>>> unnecessarily.
>>>>
>>>> Did you actually observe measurable improvements in some cases?
>>>
>>> The improvements would mostly stem from brin indexes with multiple
>>> (potentially compressed) by-ref types, as they go through more complex
>>> and expensive code to deserialize, requiring separate palloc() and
>>> memcpy() calls each.
>>> For single-column and by-value types the improvements are expected to
>>> be negligible, because there is no meaningful difference between
>>> copying a single by-ref value and copying its container; the
>>> additional work done for each tuple is marginal for those.
>>>
>>> For an 8-column BRIN index ((sha256((id)::text::bytea)::text),
>>> (sha256((id+1)::text::bytea)::text),
>>> (sha256((id+2)::text::bytea)::text), ...) instrumented with 0003 I
>>> measured a difference of 10x less time spent in the main loop of
>>> _brin_end_parallel, from ~30ms to 3ms when dealing with 55k 1-block
>>> ranges. It's not a lot, but worth at least something, I guess?
>>>
>>
>> It is something, but I can't really convince myself it's worth the extra
>> code complexity. It's a somewhat extreme example, and the parallelism
>> certainly saves much more than this.
> 
> True. For this, I usually keep in mind that the docs on multi-column
> indexes still recommend using 1 N-column brin index over N 1-column
> brin indexes (assuming the same storage parameters), so multi-column
> BRIN indexes should not be considered uncommon:
> 
> "The only reason to have multiple BRIN indexes instead of one
> multicolumn BRIN index on a single table is to have a different
> pages_per_range storage parameter."
> 
> Note that most of the time in my example index is spent in creating
> the actual tuples due to the use of hashing for data generation; for
> index or plain to-text formatting the improvement is much more
> pronounced: If I use an 8-column index (id::text, id, ...), index
> creation takes ~500ms with 4+ workers. Of this, deforming takes some
> 20ms, though when skipping the deforming step (i.e., with my patch) it
> takes ~3.5ms. That's 3% shaved off the build time when the index
> shape is beneficial.
> 

That's all true, and while 3.5% is not something to ignore, my POV is
that the parallelism speeds this up from ~2000ms to ~500ms. Yes, it
would be great to shave off the extra 1% (relative to the original
duration). But I don't have a great idea how to code that in a way
that is readable, and I don't want to stall the patch indefinitely
because of a comparatively small improvement.

Therefore I propose we get the simpler code committed and leave this as
a future improvement.

>>> The attached patch fixes the issue that you called out.
>>> It also further updates _brin_end_parallel: the final 'write empty
>>> tuples' loop is never hit and is thus removed, because if there were
>>> any tuples in the spool we'd have filled the empty ranges at the end
>>> of the main loop, and if there were no tuples in the spool then the
>>> memtuple would still be at its original initialized value of 0 thus
>>> resulting in a constant false condition. I also updated some comments.
>>>
>>
>> Ah, right. I'll take a look tomorrow, but I guess I didn't realize we
>> insert the empty ranges in the main loop, because we're already looking
>> at the *next* summary.
> 
> Yes, merging adds some significant complexity here. I don't think we
> can easily get around that though...
> 
>> But I think the idea was to insert empty ranges if there's a chunk of
>> empty ranges at the end of the table, after the last tuple the index
>> build reads. But I'm not sure that can actually happen ...
> 
> This would be trivial to construct with partial indexes; e.g. WHERE
> (my_pk IS NULL) would consist of exclusively empty ranges.
> I don't see a lot of value in partial BRIN indexes, but I may be
> overlooking something.
> 

Oh, I haven't even thought about partial BRIN indexes! I'm sure for
those it's even more important to actually fill in the empty ranges,
otherwise we end up scanning the whole supposedly filtered-out part of
the table. I'll do some testing with that.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel CREATE INDEX for BRIN indexes

From
Matthias van de Meent
Date:
On Thu, 30 Nov 2023 at 01:10, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 11/29/23 23:59, Matthias van de Meent wrote:
>> On Wed, 29 Nov 2023 at 21:56, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> On 11/29/23 21:30, Matthias van de Meent wrote:
>>>> On Wed, 29 Nov 2023 at 18:55, Tomas Vondra
>>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>> I did try to measure how much it actually saves, but none of the tests I
>>>>> did actually found measurable improvement. So I'm tempted to just not
>>>>> include this part, and accept that we may deserialize some of the tuples
>>>>> unnecessarily.
>>>>>
>>>>> Did you actually observe measurable improvements in some cases?
>>>>
>>>> The improvements would mostly stem from brin indexes with multiple
>>>> (potentially compressed) by-ref types, as they go through more complex
>>>> and expensive code to deserialize, requiring separate palloc() and
>>>> memcpy() calls each.
>>>> For single-column and by-value types the improvements are expected to
>>>> be negligible, because there is no meaningful difference between
>>>> copying a single by-ref value and copying its container; the
>>>> additional work done for each tuple is marginal for those.
>>>>
>>>> For an 8-column BRIN index ((sha256((id)::text::bytea)::text),
>>>> (sha256((id+1)::text::bytea)::text),
>>>> (sha256((id+2)::text::bytea)::text), ...) instrumented with 0003 I
>>>> measured a difference of 10x less time spent in the main loop of
>>>> _brin_end_parallel, from ~30ms to 3ms when dealing with 55k 1-block
>>>> ranges. It's not a lot, but worth at least something, I guess?
>>>>
>>>
>>> It is something, but I can't really convince myself it's worth the extra
>>> code complexity. It's a somewhat extreme example, and the parallelism
>>> certainly saves much more than this.
>>
>> True. For this, I usually keep in mind that the docs on multi-column
>> indexes still recommend using 1 N-column brin index over N 1-column
>> brin indexes (assuming the same storage parameters), so multi-column
>> BRIN indexes should not be considered uncommon:
>>
>> "The only reason to have multiple BRIN indexes instead of one
>> multicolumn BRIN index on a single table is to have a different
>> pages_per_range storage parameter."
>>
>> Note that most of the time in my example index is spent in creating
>> the actual tuples due to the use of hashing for data generation; for
>> index or plain to-text formatting the improvement is much more
>> pronounced: If I use an 8-column index (id::text, id, ...), index
>> creation takes ~500ms with 4+ workers. Of this, deforming takes some
>> 20ms, though when skipping the deforming step (i.e., with my patch) it
>> takes ~3.5ms. That's 3% shaved off the build time when the index
>> shape is beneficial.
>>
>
> That's all true, and while 3.5% is not something to ignore, my POV is
> that the parallelism speeds this up from ~2000ms to ~500ms. Yes, it
> would be great to shave off the extra 1% (relative to the original
> duration). But I don't have a great idea how to code that in a way
> that is readable, and I don't want to stall the patch indefinitely
> because of a comparatively small improvement.
>
> Therefore I propose we get the simpler code committed and leave this as
> a future improvement.

That's fine with me, it is one reason why I kept it as a separate patch file.

>>>> The attached patch fixes the issue that you called out.
>>>> It also further updates _brin_end_parallel: the final 'write empty
>>>> tuples' loop is never hit and is thus removed, because if there were
>>>> any tuples in the spool we'd have filled the empty ranges at the end
>>>> of the main loop, and if there were no tuples in the spool then the
>>>> memtuple would still be at its original initialized value of 0 thus
>>>> resulting in a constant false condition. I also updated some comments.
>>>>
>>>
>>> Ah, right. I'll take a look tomorrow, but I guess I didn't realize we
>>> insert the empty ranges in the main loop, because we're already looking
>>> at the *next* summary.
>>
>> Yes, merging adds some significant complexity here. I don't think we
>> can easily get around that though...
>>
>>> But I think the idea was to insert empty ranges if there's a chunk of
>>> empty ranges at the end of the table, after the last tuple the index
>>> build reads. But I'm not sure that can actually happen ...
>>
>> This would be trivial to construct with partial indexes; e.g. WHERE
>> (my_pk IS NULL) would consist of exclusively empty ranges.
>> I don't see a lot of value in partial BRIN indexes, but I may be
>> overlooking something.
>>
>
> Oh, I haven't even thought about partial BRIN indexes! I'm sure for
> those it's even more important to actually fill in the empty ranges,
> otherwise we end up scanning the whole supposedly filtered-out part of
> the table. I'll do some testing with that.

I just ran some more tests in less favorable environments, and it
looks like I hit a bug:

% SET max_parallel_workers = 0;
% CREATE INDEX ... USING brin (...);
ERROR:  cannot update tuples during a parallel operation

Fix attached in 0002.
In 0003 I add the mentioned backfilling of empty ranges at the end of
the table. I added it for both normal and parallel index builds, as
normal builds apparently also didn't have this yet.

Kind regards,

Matthias van de Meent

Attachments

Re: Parallel CREATE INDEX for BRIN indexes

From
Tomas Vondra
Date:
On 11/30/23 18:47, Matthias van de Meent wrote:
> ...
>
> I just ran some more tests in less favorable environments, and it
> looks like I hit a bug:
> 
> % SET max_parallel_workers = 0;
> % CREATE INDEX ... USING brin (...);
> ERROR:  cannot update tuples during a parallel operation
> 
> Fix attached in 0002.

Yeah, that's a bug, thanks for the fix. Just jumping to a "cleanup"
label seems a bit cleaner (if that can be said about using goto), so I
tweaked the patch to do that instead.
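
Roughly this shape (a sketch of the error path only, not the full
function):

    LaunchParallelWorkers(pcxt);

    /* if no workers could be launched, back out of parallel mode */
    if (pcxt->nworkers_launched == 0)
        goto cleanup;

    /* ... set up shared state, leader participation, etc. ... */

    return;

cleanup:
    DestroyParallelContext(pcxt);
    ExitParallelMode();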

> In 0003 I add the mentioned backfilling of empty ranges at the end of
> the table. I added it for both normal and parallel index builds, as
> normal builds apparently also didn't have this yet.
> 

Right. I was thinking about doing that too, but you beat me to it. I
don't want to bury this in the main patch adding parallel builds, it's
not really related to parallel CREATE INDEX. And it'd be weird to have
this for parallel builds first, so I rebased it as 0001.

As for the backfilling, I think we need to simplify the code a bit. We
have three places doing essentially the same thing (one for serial
builds, two for parallel builds). That's unnecessarily verbose, and
makes it harder to understand the code. But more importantly, the three
places are not doing exactly the same thing - some increment the current
range before, some do it at the end of the loop, etc. I got confused by this
multiple times.

So 0004 simplifies this - the backfilling is done by a function called
from all the places. The main complexity is in ensuring all three places
have the same concept of how to specify the range (of ranges) to fill.

Note: The serial build might have two places too, but the main loop in
brinbuildCallback() does it range by range. It's a bit less efficient as
it can't use the pre-built empty tuple easily, but that's fine IMO.
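
The helper is shaped roughly like this (a sketch - the real function
also needs to be careful about the BlockNumber arithmetic discussed
below):

    /*
     * Insert the pre-built empty tuple for all page ranges starting in
     * [rangeStart, rangeEnd); both bounds are multiples of
     * pagesPerRange.
     */
    static void
    brin_fill_empty_ranges(BrinBuildState *state,
                           BlockNumber rangeStart, BlockNumber rangeEnd)
    {
        BlockNumber blkno;

        for (blkno = rangeStart; blkno < rangeEnd;
             blkno += state->bs_pagesPerRange)
        {
            /* the empty tuple is built once and reused for every range */
            brin_doinsert(state->bs_irel, state->bs_pagesPerRange,
                          state->bs_rmAccess, &state->bs_currentInsertBuf,
                          blkno, state->bs_emptyTuple,
                          state->bs_emptyTupleLen);
        }
    }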


skipping the last page range?
-----------------------------

I noticed you explicitly skipped backfilling the empty tuple for the last
page range. Can you explain? I suspect the idea was that the user
activity would trigger building the tuple once that page range is
filled, but we don't really know if the table receives any changes. It
might easily be just a static table, in which case the last range would
remain unsummarized. If this is the right thing to do, the serial build
should do that too probably ...

But I don't think that's the correct thing to do - I think CREATE INDEX
is expected to always build a complete index, so my version always
builds an index for all table pages.


BlockNumber overflows
---------------------

The one thing that I'm not quite sure is correct is whether this handles
overflows/underflows correctly. I mean, imagine you have a huge table
that's almost 0xFFFFFFFF blocks, pages_per_range is prime, and the last
range ends less than pages_per_range from 0xFFFFFFFF. Then this

    blkno += pages_per_range;

can overflow, and might start inserting index tuples again (so we'd end
up with a duplicate).

I do think the current patch does this correctly, but AFAICS this is a
pre-existing issue ...
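
To make the wraparound concrete (a standalone illustration, not the
patch):

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t BlockNumber;

    int
    main(void)
    {
        BlockNumber pages_per_range = 113;  /* prime, doesn't divide 2^32 */
        BlockNumber blkno = 4294967280u;    /* 38008560 * 113, last range start */

        blkno += pages_per_range;   /* wraps around */
        printf("%u\n", blkno);      /* prints 97 */

        /*
         * So a loop condition like (blkno < nblocks) is satisfied again,
         * and the build would start summarizing ranges a second time.
         */
        return 0;
    }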

Anyway, while working on this / stress-testing it, I realized there's a
bug in how we allocate the emptyTuple. It's allocated lazily, but it can
easily happen in the per-range context we introduced last week. It needs
to be allocated in the context covering the whole index build.

I think the best way to do that is per 0006, i.e. allocate it in the
BrinBuildState, along with the appropriate memory context.

Obviously, all of this (0002-0006) should be squashed into a single
commit, I only keep it separate to make it clearer what changed.


stress-testing script
---------------------

I'm also attaching the bash script I use to stress test this - it's just
a loop that creates a somewhat random table (different number of rows,
distinct values, ...), maybe deletes some of it, creates an index
(possibly partial), and then does various checks on it (checks number of
ranges, queries the table, etc.). It's somewhat primitive but it turned
out to be very capable in triggering bugs in BlockNumber arithmetic,
emptyTuple allocations, etc.
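
In outline, the loop looks something like this (a rough re-sketch of the
idea, not the attached script itself):

    #!/bin/bash
    # stress loop: random table shape, random deletes, random index
    while true; do
        rows=$((RANDOM % 100000 + 1))
        ppr=$((RANDOM % 32 + 1))

        psql -q -c "DROP TABLE IF EXISTS t" \
                -c "CREATE TABLE t (a int)" \
                -c "INSERT INTO t SELECT (random() * 1000)::int
                    FROM generate_series(1, $rows)" \
                -c "DELETE FROM t WHERE random() < 0.5" \
                -c "VACUUM t" \
                -c "CREATE INDEX t_brin_idx ON t USING brin (a)
                    WITH (pages_per_range = $ppr)" || exit 1

        # compare query results with and without using the index
        psql -qt -c "SET enable_seqscan = off;
                     SELECT count(*) FROM t WHERE a = 10" > /tmp/idx.out
        psql -qt -c "SET enable_bitmapscan = off;
                     SELECT count(*) FROM t WHERE a = 10" > /tmp/seq.out
        diff /tmp/idx.out /tmp/seq.out || exit 1
    done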


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Вложения

Re: Parallel CREATE INDEX for BRIN indexes

От
Matthias van de Meent
Дата:
On Sun, 3 Dec 2023 at 17:46, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> On 11/30/23 18:47, Matthias van de Meent wrote:
> > ...
> >
> > I just ran some more tests in less favorable environments, and it
> > looks like I hit a bug:
> >
> > % SET max_parallel_workers = 0;
> > % CREATE INDEX ... USING brin (...);
> > ERROR:  cannot update tuples during a parallel operation
> >
> > Fix attached in 0002.
>
> Yeah, that's a bug, thanks for the fix. Just jumping to a "cleanup"
> label seems a bit cleaner (if that can be said about using goto), so I
> tweaked the patch to do that instead.

Good point, I agree that's cleaner.

> > In 0003 I add the mentioned backfilling of empty ranges at the end of
> > the table. I added it for both normal and parallel index builds, as
> > normal builds apparently didn't have this yet.
> >
>
> Right. I was thinking about doing that too, but you beat me to it. I
> don't want to bury this in the main patch adding parallel builds, it's
> not really related to parallel CREATE INDEX. And it'd be weird to have
> this for parallel builds first, so I rebased it as 0001.

OK.

> As for the backfilling, I think we need to simplify the code a bit.
>
> So 0004 simplifies this - the backfilling is done by a function called
> from all the places. The main complexity is in ensuring all three places
> have the same concept of how to specify the range (of ranges) to fill.

Good points, +1. However, the simplification in 0005 breaks that with
an underflow:

> @@ -1669,6 +1672,19 @@ initialize_brin_buildstate(Relation idxRel, BrinRevmap *revmap,
>     state->bs_worker_id = 0;
>     state->bs_spool = NULL;
>
> +    /*
> +     * Calculate the start of the last page range. Page numbers are 0-based,
> +     * so to get the index of the last page we need to subtract one. Then the
> +     * integer division gives us the proper 0-based range index.
> +     */
> +    state->bs_maxRangeStart = ((tablePages - 1) / pagesPerRange) * pagesPerRange;

When the table is empty, this will try to fill all potential ranges up
to InvalidBlockNo's range, which is obviously invalid. It also breaks
the regression tests, as shown in CFBot.

> skipping the last page range?
> -----------------------------
>
> I noticed you explicitly skipped backfilling the empty tuple for the last
> page range. Can you explain? I suspect the idea was that the user
> activity would trigger building the tuple once that page range is
> filled, but we don't really know if the table receives any changes. It
> might easily be just a static table, in which case the last range would
> remain unsummarized. If this is the right thing to do, the serial build
> should probably do that too ...
>
> But I don't think that's the correct thing to do - I think CREATE INDEX
> is expected to always build a complete index, so my version always
> builds an index for all table pages.

Hmm. My idea here is to create an index that is closer to what you get
when you hit the insertion path with aminsert. This isn't 1:1 how the
index builds ranges during (re)index when there is data for that
range, but I thought it to be a close enough analog. Either way, I
don't mind it adding an empty range for the last range if that's
considered useful.

> BlockNumber overflows
> ---------------------
>
> The one thing that I'm not quite sure is correct is whether this handles
> overflows/underflows correctly. I mean, imagine you have a huge table
> that's almost 0xFFFFFFFF blocks, pages_per_range is prime, and the last
> range ends less than pages_per_range from 0xFFFFFFFF. Then this
>
>     blkno += pages_per_range;
>
> can overflow, and might start inserting index tuples again (so we'd end
> up with a duplicate).
>
> I do think the current patch does this correctly, but AFAICS this is a
> pre-existing issue ...

Yes, I know I've flagged this at least once before. IIRC, the response
back then was that it's a very unlikely issue, as you'd have to extend
the relation to at least the first block of the last range, which
would currently be InvalidBlockNo - 131072 + 1, or just shy of 32TB of
data at 8kB BLCKSZ. That's not exactly a common use case, and BRIN
range ID wraparound is likely the least of your worries at that point.

> Anyway, while working on this / stress-testing it, I realized there's a
> bug in how we allocate the emptyTuple. It's allocated lazily, so the
> allocation can easily happen in the per-range context we introduced last
> week. It needs to be allocated in the context covering the whole index build.

Yeah, I hadn't tested with (very) sparse datasets yet.

> I think the best way to do that is per 0006, i.e. allocate it in the
> BrinBuildState, along with the appropriate memory context.

That fix looks fine to me.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Parallel CREATE INDEX for BRIN indexes

От
Tomas Vondra
Дата:

On 12/4/23 16:00, Matthias van de Meent wrote:
> On Sun, 3 Dec 2023 at 17:46, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>> On 11/30/23 18:47, Matthias van de Meent wrote:
>>> ...
>>>
>>> I just ran some more tests in less favorable environments, and it
>>> looks like I hit a bug:
>>>
>>> % SET max_parallel_workers = 0;
>>> % CREATE INDEX ... USING brin (...);
>>> ERROR:  cannot update tuples during a parallel operation
>>>
>>> Fix attached in 0002.
>>
>> Yeah, that's a bug, thanks for the fix. Just jumping to a "cleanup"
>> label seems a bit cleaner (if that can be said about using goto), so I
>> tweaked the patch to do that instead.
> 
> Good point, I agree that's cleaner.
> 
>>> In 0003 I add the mentioned backfilling of empty ranges at the end of
>>> the table. I added it for both normal and parallel index builds, as
>>> normal builds apparently didn't have this yet.
>>>
>>
>> Right. I was thinking about doing that too, but you beat me to it. I
>> don't want to bury this in the main patch adding parallel builds, it's
>> not really related to parallel CREATE INDEX. And it'd be weird to have
>> this for parallel builds first, so I rebased it as 0001.
> 
> OK.
> 
>> As for the backfilling, I think we need to simplify the code a bit.
>>
>> So 0004 simplifies this - the backfilling is done by a function called
>> from all the places. The main complexity is in ensuring all three places
>> have the same concept of how to specify the range (of ranges) to fill.
> 
> Good points, +1. However, the simplification in 0005 breaks that with
> an underflow:
> 
>> @@ -1669,6 +1672,19 @@ initialize_brin_buildstate(Relation idxRel, BrinRevmap *revmap,
>>     state->bs_worker_id = 0;
>>     state->bs_spool = NULL;
>>
>> +    /*
>> +     * Calculate the start of the last page range. Page numbers are 0-based,
>> +     * so to get the index of the last page we need to subtract one. Then the
>> +     * integer division gives us the proper 0-based range index.
>> +     */
>> +    state->bs_maxRangeStart = ((tablePages - 1) / pagesPerRange) * pagesPerRange;
> 
> When the table is empty, this will try to fill all potential ranges up
> to InvalidBlockNo's range, which is obviously invalid. It also breaks
> the regression tests, as shown in CFBot.
> 

Whoooops! You're right, ofc. If it's empty, we should use 0 instead.
That's what we do now anyway, BRIN will have the first range even for
empty tables.
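
Something like this, presumably - just a sketch of the guarded
computation, reusing the names from the quoted hunk:

    /* an empty table has no "last" page, so fall back to range 0 */
    if (tablePages == 0)
        state->bs_maxRangeStart = 0;
    else
        state->bs_maxRangeStart = ((tablePages - 1) / pagesPerRange) *
                                  pagesPerRange;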

>> skipping the last page range?
>> -----------------------------
>>
>> I noticed you explicitly skipped backfilling the empty tuple for the last
>> page range. Can you explain? I suspect the idea was that the user
>> activity would trigger building the tuple once that page range is
>> filled, but we don't really know if the table receives any changes. It
>> might easily be just a static table, in which case the last range would
>> remain unsummarized. If this is the right thing to do, the serial build
>> should probably do that too ...
>>
>> But I don't think that's the correct thing to do - I think CREATE INDEX
>> is expected to always build a complete index, so my version always
>> builds an index for all table pages.
> 
> Hmm. My idea here is to create an index that is closer to what you get
> when you hit the insertion path with aminsert. This isn't 1:1 how the
> index builds ranges during (re)index when there is data for that
> range, but I thought it to be a close enough analog. Either way, I
> don't mind it adding an empty range for the last range if that's
> considered useful.
> 

I understand, but I'm not sure if keeping this consistency with aminsert
has any material benefit. It's not like we do that now, I think (for
empty tables we already build the first index range).

>> BlockNumber overflows
>> ---------------------
>>
>> The one thing that I'm not quite sure is correct is whether this handles
>> overflows/underflows correctly. I mean, imagine you have a huge table
>> that's almost 0xFFFFFFFF blocks, pages_per_range is prime, and the last
>> range ends less than pages_per_range from 0xFFFFFFFF. Then this
>>
>>     blkno += pages_per_range;
>>
>> can overflow, and might start inserting index tuples again (so we'd end
>> up with a duplicate).
>>
>> I do think the current patch does this correctly, but AFAICS this is a
>> pre-existing issue ...
> 
> Yes, I know I've flagged this at least once before. IIRC, the response
> back then was that it's a very unlikely issue, as you'd have to extend
> the relation to at least the first block of the last range, which
> would currently be InvalidBlockNo - 131072 + 1, or just shy of 32TB of
> data at 8kB BLCKSZ. That's not exactly a common use case, and BRIN
> range ID wraparound is likely the least of your worries at that point.
> 

Probably true, but it seems somewhat careless and untidy ...

>> Anyway, while working on this / stress-testing it, I realized there's a
>> bug in how we allocate the emptyTuple. It's allocated lazily, so the
>> allocation can easily happen in the per-range context we introduced last
>> week. It needs to be allocated in the context covering the whole index build.
> 
> Yeah, I hadn't tested with (very) sparse datasets yet.
> 

I haven't actually checked what the failing cases look like, but I don't
think the data needs to be particularly sparse. AFAIK it's just that the
script deletes a chunk of the data somewhere in the table and/or creates
a partial index.

>> I think the best way to do that is per 0006, i.e. allocate it in the
>> BrinBuildState, along with the appropriate memory context.
> 
> That fix looks fine to me.
> 

Thanks!


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel CREATE INDEX for BRIN indexes

От
Tomas Vondra
Дата:
Hi,

I've pushed the first two parts (backfill of empty ranges for serial
builds, allowing parallelism) after a bit more cleanup: adding a simple
pageinspect test to 0001, improving comments, and making some minor
adjustments.

I ended up removing the protections against BlockNumber overflows and
moving them into a separate WIP patch. I still think we should probably
reconsider the position that we don't need to worry about issues so
close to the 32TB boundary, but it seemed rather weird to fix only the
new bits and leave the existing issues in place.

I'm attaching that as a WIP patch, but I don't know if/when I'll get
back to this.


Thanks for the reviews/reworks/ideas!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Вложения

Re: Parallel CREATE INDEX for BRIN indexes

От
Tomas Vondra
Дата:
Hi,

While cleaning up some unnecessary bits of the code and slightly
inaccurate comments, I ran into a failure when the parallel scan (used
by the parallel build) happened to be a synchronized scan. When the scan
did not start on page 0, the parallel callback failed to correctly
handle tuples after wrapping around to the start of the table.

AFAICS the extensive testing I did during development did not detect
this because strictly speaking the index was "correct" (as in not
returning incorrect results in queries), just less efficient (missing
some ranges, and some ranges being "wider" than necessary). Or perhaps
the tests happened to not trigger synchronized scans.

Should be fixed by 1ccab5038eaf261f. It took me ages to realize what the
problem is, and I initially suspected there's some missing coordination
between the workers/leader, or something.

So I started comparing the code to btree, which is where it originated,
and I realized there's indeed one difference - the BRIN code only does
half the work with the workersdonecv variable. The workers do correctly
update the count and notify the leader, but the leader never waits for
the count to be 0. That is, there's nothing like _bt_parallel_heapscan.

I wonder whether this actually is a problem, considering the differences
between the flow in BRIN and BTREE. In particular, the "leader" does the
work in _brin_end_parallel() after WaitForParallelWorkersToFinish(). So
it's not like there might be a worker still processing data, I think.

But now that I think about it, maybe it's not such a great idea to do
this kind of work in _brin_end_parallel(). Maybe it should just do the
stuff related to termination of workers etc., and the merging of results
should happen elsewhere - earlier in brinbuild()? Then it'd make sense
to have something like _bt_parallel_heapscan ...
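
That function would be pretty simple - here's a sketch mirroring
_bt_parallel_heapscan (the BrinShared/BrinLeader field names follow the
patch, and the wait event is just a plausible pick, not a decision):

    /*
     * Sleep on workersdonecv until every participant has reported that
     * it finished its part of the parallel heap scan.
     */
    static void
    _brin_parallel_heapscan(BrinBuildState *state)
    {
        BrinShared *brinshared = state->bs_leader->brinshared;

        for (;;)
        {
            SpinLockAcquire(&brinshared->mutex);
            if (brinshared->nparticipantsdone ==
                brinshared->scantuplesortstates)
            {
                SpinLockRelease(&brinshared->mutex);
                break;
            }
            SpinLockRelease(&brinshared->mutex);

            ConditionVariableSleep(&brinshared->workersdonecv,
                                   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
        }

        ConditionVariableCancelSleep();
    }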


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel CREATE INDEX for BRIN indexes

От
Andres Freund
Дата:
Hi,

While preparing a differential code coverage report between 16 and HEAD, one
thing that stands out is the parallel brin build code. Neither on
coverage.postgresql.org nor locally is that code reached during our tests.

https://coverage.postgresql.org/src/backend/access/brin/brin.c.gcov.html#2333

Greetings,

Andres Freund



Re: Parallel CREATE INDEX for BRIN indexes

От
Tomas Vondra
Дата:
On 4/13/24 10:36, Andres Freund wrote:
> Hi,
> 
> While preparing a differential code coverage report between 16 and HEAD, one
> thing that stands out is the parallel brin build code. Neither on
> coverage.postgresql.org nor locally is that code reached during our tests.
>

Thanks for pointing this out, it's definitely something that I need to
improve (admittedly, should have been part of the patch). I'll also look
into eliminating the difference between BTREE and BRIN parallel builds,
mentioned in my last message in this thread.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel CREATE INDEX for BRIN indexes

От
Tomas Vondra
Дата:
On 4/13/24 11:19, Tomas Vondra wrote:
> On 4/13/24 10:36, Andres Freund wrote:
>> Hi,
>>
>> While preparing a differential code coverage report between 16 and HEAD, one
>> thing that stands out is the parallel brin build code. Neither on
>> coverage.postgresql.org nor locally is that code reached during our tests.
>>
> 
> Thanks for pointing this out, it's definitely something that I need to
> improve (admittedly, should have been part of the patch). I'll also look
> into eliminating the difference between BTREE and BRIN parallel builds,
> mentioned in my last message in this thread.
> 

Here's a couple of patches adding a test for parallel CREATE INDEX with
BRIN. The actual test is 0003/0004 - I added the test to pageinspect,
because that allows cross-checking the index against one built without
parallelism, which I think is better than just doing CREATE INDEX
without properly verifying it produces correct results.
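
The core of the cross-check can be as simple as comparing the two
indexes with pageinspect in both directions - a sketch with made-up
index names (the real test is more careful about which pages it reads):

    -- summaries present in the serial index but not the parallel one
    -- (the same check is then repeated with the two indexes swapped)
    SELECT * FROM brin_page_items(get_raw_page('brin_serial_idx', 2),
                                  'brin_serial_idx')
    EXCEPT
    SELECT * FROM brin_page_items(get_raw_page('brin_parallel_idx', 2),
                                  'brin_parallel_idx');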

It's not entirely trivial because for some opclasses (e.g. minmax-multi)
the results depend on the order in which values are added, and order in
which summaries from different workers are merged.

Funnily enough, while adding the test, I ran into two pre-existing bugs.
One is that brin_bloom_union forgot to update the number of bits set in
the bitmap; the other is that 6bcda4a721 changed PG_DETOAST_DATUM to
the _PACKED version, which does the wrong thing here. Both are mostly
harmless - they only affect the output function, which is unused outside
pageinspect. No impact on query correctness etc.

The test needs a bit more work to make sure it works on 32-bit machines
etc., which I think may affect the available space on a page and, in
turn, the minmax-multi summaries. But I'll take care of this early next
week.



-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Вложения

Re: Parallel CREATE INDEX for BRIN indexes

От
Tomas Vondra
Дата:
I've pushed this, including backpatching the two fixes. I've reduced the
amount of data needed by the test, and made sure it works on 32-bit
systems too
(I was a bit worried it might be sensitive to that, but that seems not
to be the case).

There's still the question of maybe removing the differences between the
BTREE and BRIN code for parallel builds, which I mentioned in [1]. That's
more of a cosmetic issue, but I'll add it as an open item for myself.


regards


[1]
https://www.postgresql.org/message-id/3733d042-71e1-6ae6-5fac-00c12db62db6%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel CREATE INDEX for BRIN indexes

От
Alexander Lakhin
Дата:
Hello Tomas,

14.04.2024 20:09, Tomas Vondra wrote:
> I've pushed this, including backpatching the two fixes. I've reduced the
> amount of data needed by the test, and made sure it works on 32-bit
> systems too
> (I was a bit worried it might be sensitive to that, but that seems not
> to be the case).

I've discovered that this test addition brings some instability to the test.
With the following pageinspect/Makefile modification:
-REGRESS = page btree brin gin gist hash checksum oldextversions
+REGRESS = page btree brin $(shell printf 'brin %.0s' `seq 99`) gin gist hash checksum oldextversions

echo "autovacuum_naptime = 1" > /tmp/temp.config
TEMP_CONFIG=/tmp/temp.config make -s check -C contrib/pageinspect
fails for me as below:
...
ok 17        - brin                                      127 ms
not ok 18    - brin                                      140 ms
ok 19        - brin                                      125 ms
...
# 4 of 107 tests failed.

The following change:
-CREATE TABLE brin_parallel_test (a int, b text, c bigint) WITH (fillfactor=40);
+CREATE TEMP TABLE brin_parallel_test (a int, b text, c bigint) WITH (fillfactor=40);
(similar to e2933a6e1) makes the test pass reliably for me.

Best regards,
Alexander



Re: Parallel CREATE INDEX for BRIN indexes

От
Tomas Vondra
Дата:

On 4/15/24 08:00, Alexander LAW wrote:
> Hello Tomas,
> 
> 14.04.2024 20:09, Tomas Vondra wrote:
>> I've pushed this, including backpatching the two fixes. I've reduced the
>> amount of data needed by the test, and made sure it works on 32-bit
>> systems too
>> (I was a bit worried it might be sensitive to that, but that seems not
>> to be the case).
> 
> I've discovered that this test addition brings some instability to the
> test.
> With the following pageinspect/Makefile modification:
> -REGRESS = page btree brin gin gist hash checksum oldextversions
> +REGRESS = page btree brin $(shell printf 'brin %.0s' `seq 99`) gin gist
> hash checksum oldextversions
> 
> echo "autovacuum_naptime = 1" > /tmp/temp.config
> TEMP_CONFIG=/tmp/temp.config make -s check -C contrib/pageinspect
> fails for me as below:
> ...
> ok 17        - brin                                      127 ms
> not ok 18    - brin                                      140 ms
> ok 19        - brin                                      125 ms
> ...
> # 4 of 107 tests failed.
> 
> The following change
> -CREATE TABLE brin_parallel_test (a int, b text, c bigint) WITH
> (fillfactor=40);
> +CREATE TEMP TABLE brin_parallel_test (a int, b text, c bigint) WITH
> (fillfactor=40);
> (similar to e2933a6e1) makes the test pass reliably for me.
> 

Thanks! This reproduces the issue for me.

I believe this happens because the test does "DELETE + VACUUM" to
generate "gaps" in the table, to get empty ranges in the BRIN index. I
guess what's happening is that something (autovacuum, or more likely
something else) blocks the explicit VACUUM from cleaning some of the
pages with deleted tuples, and then the cleanup happens shortly after,
between building the serial/parallel indexes. That would explain the
differences reported by the regression test.

When I thought about this while writing the test, my reasoning was that
even if the explicit vacuum occasionally fails to clean something, it
should affect all the indexes equally - which is why I wrote the test to
compare the results using EXCEPT rather than checking the exact output.

I'm not a huge fan of temporary tables in regression tests, because the
table disappears at the end, making it impossible to inspect the data
after a failure. The only other option I could think of was disabling
autovacuum on the table, but that does not seem to prevent the failures.

I'll try a bit more to make this work without the temp table.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel CREATE INDEX for BRIN indexes

От
Tomas Vondra
Дата:
On 4/15/24 10:18, Tomas Vondra wrote:
> ...
>
> I'll try a bit more to make this work without the temp table.
> 

Considering the earlier discussion in e2933a6e1, I think making the
table TEMP is the best fix, so I'll do that. Thanks for remembering that
change, Alexander!

Attached is the cleanup I thought about doing earlier in this patch [1]
to make the code more like btree. The diff might make it seem like a big
change, but it really just moves the merge code into a separate function
and makes it use the condition variable. I still believe the old code is
correct, but this seems like an improvement, so I plan to push this soon
and resolve the open item.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Вложения

Re: Parallel CREATE INDEX for BRIN indexes

От
Tomas Vondra
Дата:
On 4/15/24 20:35, Tomas Vondra wrote:
> On 4/15/24 10:18, Tomas Vondra wrote:
>> ...
>>
>> I'll try a bit more to make this work without the temp table.
>>
> 
> Considering the earlier discussion in e2933a6e1, I think making the
> table TEMP is the best fix, so I'll do that. Thanks for remembering that
> change, Alexander!
> 

D'oh! I pushed this fix to stabilize the test earlier today, but I just
realized it unfortunately makes the test useless. The idea of the test
was to build BRIN indexes with/without parallelism, and check that the
indexes are exactly the same.

The instability comes from the deletes, which I added to get "empty"
ranges in the table; the deleted rows may not be cleaned up in time for
the CREATE INDEX commands, depending on what else is happening. A
TEMPORARY table does not have this issue (as observed in e2933a6e1), but
there's the minor problem that plan_create_index_workers() does this:

  /*
   * Determine if it's safe to proceed.
   *
   * Currently, parallel workers can't access the leader's temporary
   * tables. Furthermore, any index predicate or index expressions must
   * be parallel safe.
   */
  if (heap->rd_rel->relpersistence == RELPERSISTENCE_TEMP ||
    !is_parallel_safe(root, (Node *) RelationGetIndexExpressions(index)) ||
    !is_parallel_safe(root, (Node *) RelationGetIndexPredicate(index)))
  {
    parallel_workers = 0;
    goto done;
  }

That is, no parallel index builds on temporary tables. Which means the
test is not actually testing anything :-( Much more stable, but not very
useful for finding issues.

I think the best way to stabilize the test is to just not delete the
rows. That means we won't have any "empty" ranges (runs of pages without
any tuples).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel CREATE INDEX for BRIN indexes

От
Tomas Vondra
Дата:
On 4/16/24 22:33, Tomas Vondra wrote:
> On 4/15/24 20:35, Tomas Vondra wrote:
>> On 4/15/24 10:18, Tomas Vondra wrote:
>
> ...
> 
> That is, no parallel index builds on temporary tables. Which means the
> test is not actually testing anything :-( Much more stable, but not very
> useful for finding issues.
> 
> I think the best way to stabilize the test is to just not delete the
> rows. That means we won't have any "empty" ranges (runs of pages without
> any tuples).
> 

I just pushed a revert and a patch to stabilize the test in a different
way - Matthias mentioned to me off-list that DELETE is not the only way
to generate empty ranges in a BRIN index, because partial indexes have
the same effect. After playing with that a bit, it seems to work fine
(works with parallel builds, not affected by cleanup), so I did it that
way.
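
For reference, the idea looks roughly like this (a sketch with made-up
names and predicate, not the committed test): ranges whose rows fail the
index predicate get summarized as empty, without any dependence on when
VACUUM cleans things up:

    -- ranges with no qualifying rows end up with "empty" summaries
    CREATE INDEX brin_test_parallel_idx ON brin_parallel_test
        USING brin (a) WITH (pages_per_range = 2)
        WHERE a BETWEEN 100 AND 200;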


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel CREATE INDEX for BRIN indexes

От
Tomas Vondra
Дата:
On 4/15/24 20:35, Tomas Vondra wrote:
> ...
>
> Attached is the cleanup I thought about doing earlier in this patch [1]
> to make the code more like btree. The diff might make it seem like a big
> change, but it really just moves the merge code into a separate function
> and makes it use the condition variable. I still believe the old code is
> correct, but this seems like an improvement, so I plan to push this soon
> and resolve the open item.
>

I've now pushed this cleanup patch, after rewording the commit message a
little bit, etc. I believe this resolves the open item tracking this, so
I've moved it to the "resolved" part.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company