Re: Parallel CREATE INDEX for BRIN indexes

From: Tomas Vondra
Subject: Re: Parallel CREATE INDEX for BRIN indexes
Date:
Msg-id: 5384d42e-1961-c2fd-1955-1a5a99fa1234@enterprisedb.com
In reply to: Re: Parallel CREATE INDEX for BRIN indexes  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses: Re: Parallel CREATE INDEX for BRIN indexes  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
List: pgsql-hackers
On 11/23/23 13:33, Matthias van de Meent wrote:
> Hi,
> 
> On Wed, 22 Nov 2023 at 20:16, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 11/20/23 20:48, Matthias van de Meent wrote:
>>> On Wed, 8 Nov 2023 at 12:03, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> here's an updated patch, addressing the review comments, and reworking
>>>> how the work is divided between the workers & leader etc.
>>>>
>>>
>>> After code-only review, here are some comments:
>>>
>>>> +++ b/src/backend/access/brin/brin.c
>>>> [...]
>>>> +/* Magic numbers for parallel state sharing */
>>>> +#define PARALLEL_KEY_BRIN_SHARED        UINT64CONST(0xA000000000000001)
>>>> +#define PARALLEL_KEY_TUPLESORT            UINT64CONST(0xA000000000000002)
>>>
>>> These shm keys use the same constants also in use in
>>> access/nbtree/nbtsort.c. While this shouldn't be an issue in normal
>>> operations, I'd prefer if we didn't actively introduce conflicting
>>> identifiers when we still have significant amounts of unused values
>>> remaining.
>>>
>>
>> Hmmm. Is there some rule of thumb how to pick these key values?
> 
> None that I know of.
> There is a warning in the various places that define these constants:
> they take care not to conflict with plan node IDs, because parallel
> plan execution uses plain plan node IDs as keys. Since node_id is
> int-sized, anyone picking a key value < 2^32 by hand must make sure it
> can't collide with a plan node ID in a parallel backend.
> But apart from that one case, I can't find a convention, no.
> 

OK, in that case 0xB is fine.

>>>> +#define PARALLEL_KEY_QUERY_TEXT            UINT64CONST(0xA000000000000003)
>>>
>>> This is the fourth definition of a PARALLEL%_KEY_QUERY_TEXT, the
>>> others being in access/nbtree/nbtsort.c (value 0xA000000000000004, one
>>> more than brin's), backend/executor/execParallel.c
>>> (0xE000000000000008), and PARALLEL_VACUUM_KEY_QUERY_TEXT (0x3) (though
>>> I've not checked that their uses are exactly the same, I'd expect at
>>> least btree to match mostly, if not fully, 1:1).
>>> I think we could probably benefit from a less ad-hoc sharing of query
>>> texts. I don't think that needs to happen specifically in this patch,
>>> but I think it's something to keep in mind in future efforts.
>>>
>>
>> I'm afraid I don't quite get what you mean by "ad hoc sharing of query
>> texts". Are you saying we shouldn't propagate the query text to the
>> parallel workers? Why? Or what's the proper solution?
> 
> What I mean is that we have several different keys that all look like
> they contain the debug query string, and always for the same debugging
> purposes. For debugging, I think it'd be useful to use one well-known
> key, rather than N well-known keys in each of the N parallel
> subsystems.
> 
> But as mentioned, it doesn't need to happen in this patch, as that'd
> increase scope beyond brin/index ams.
> 

Agreed.

>>> I also noticed that this is likely to execute `union_tuples` many
>>> times when pages_per_range is coprime with the parallel table scan's
>>> block stride (or when we for other reasons have many tuples returned
>>> for each range); and this `union_tuples` internally allocates and
>>> deletes its own memory context for its deserialization of the 'b'
>>> tuple. I think we should just pass a scratch context instead, so that
>>> we don't have the overhead of continuously creating then deleting the
>>> same memory context.
>>
>> Perhaps. Looking at the code, isn't it a bit strange how union_tuples
>> uses the context? It creates the context, calls brin_deform_tuple in
>> that context, but then the rest of the function (including datumCopy and
>> similar stuff) happens in the caller's context ...
> 
> The union operator may leak (lots of) memory, so I think it makes
> sense to keep a context around that can be reset after we've extracted
> the merge result.
> 

But does the current code actually achieve that? It does create a "brin
union" context, but then it only does this:

    /* Use our own memory context to avoid retail pfree */
    cxt = AllocSetContextCreate(CurrentMemoryContext,
                                "brin union",
                                ALLOCSET_DEFAULT_SIZES);
    oldcxt = MemoryContextSwitchTo(cxt);
    db = brin_deform_tuple(bdesc, b, NULL);
    MemoryContextSwitchTo(oldcxt);

Surely that does not limit the amount of memory used by the actual union
functions in any way?
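For illustration, one way to get the bounding you describe would be to
run the whole merge inside a caller-supplied scratch context and reset
it between calls. This is only a sketch of the idea (the scratch_cxt
parameter and the reset placement are hypothetical, not the current
code):

```c
/* Hypothetical sketch, not the committed code: the deform *and* the
 * per-attribute union functions both run inside a caller-supplied
 * scratch context, so their allocations are actually bounded.
 * Anything that must survive is datumCopy'd back into the caller's
 * context, and the caller resets scratch_cxt between calls instead of
 * creating and deleting a fresh context every time. */
oldcxt = MemoryContextSwitchTo(scratch_cxt);
db = brin_deform_tuple(bdesc, b, NULL);

/* ... call the opclass union support functions here, still in
 * scratch_cxt, copying any surviving datums out with datumCopy() ... */

MemoryContextSwitchTo(oldcxt);
MemoryContextReset(scratch_cxt);
```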

>> However, I don't think the number of union_tuples calls is likely to be
>> very high, especially for large tables, because we split the table into
>> 2048 chunks and cap the chunk size at 8192 blocks. For large tables
>> (where this matters) we're likely close to 8192.
> 
> I agree that the merging part of the index creation is the last part,
> and usually has no high impact on the total performance of the reindex
> operation, but in memory-constrained environments releasing and then
> requesting the same chunk of memory over and over again just isn't
> great.

OK, I'll take a look at the scratch context you suggested.

My point however was that we won't actually do that very often: on
large tables the BRIN ranges are likely smaller than the parallel scan
chunk size, so there are few overlaps. OTOH if the table is small, or if
the BRIN ranges are large, there will be few ranges to begin with.

> Also note that parallel scan chunk sizes decrease when we're about to
> hit the end of the table, and that a different AM may have different
> ideas about scanning a table in parallel; it could very well decide to
> use striped assignments exclusively, as opposed to on-demand chunk
> allocations; both increasing the chance that brin's page ranges are
> processed by more than one backend.
> 

Yeah, but the ramp-up and ramp-down should have negligible impact, IMO.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


