Discussion: Feature proposal: immutable/sealed partitions (and maybe tables, too)

Feature proposal: immutable/sealed partitions (and maybe tables, too)

From: Levi Aul
Date:
My company runs some large OLAP data warehouses with append-only, time-partitioned datasets. Our workloads involve aggregations and joins, and query the data in ways not amenable to constraint-exclusion; and we serve a large number of these queries concurrently from a single DB.

In other words, our workload is inherently one that acquires "way too many locks." Our largest performance bottleneck, according to pg_wait_sampling, is the LockManager itself. Despite most of our queries spending only milliseconds actually executing, they often spend seconds during planning waiting to acquire hundreds of access-share locks.
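
For reference, the profile that points at this is just an aggregation over pg_wait_sampling's sampled wait events; a minimal sketch (assuming the pg_wait_sampling extension and its pg_wait_sampling_profile view are installed) looks like:

    -- Aggregate sampled waits by wait event; "LockManager" at the top of
    -- the LWLock class indicates contention on the shared lock manager.
    SELECT event_type, event, sum(count) AS samples
      FROM pg_wait_sampling_profile
     WHERE event_type = 'LWLock'
     GROUP BY event_type, event
     ORDER BY samples DESC;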

Given that our datasets are append-only, all our partitions for each table save for the one "active" one (the one for the current time period) are effectively immutable. No DML-triggered writes will occur to these. I think this is pretty common in data-warehouse use-cases of PG.

If PG could avoid the need to acquire the locks for these effectively-immutable partitions, then the remaining number of tables would be low enough to fit into the per-backend fast-path lock slots, and so avoid LockManager contention. I believe this could be a large optimization not just for our use-case, but in a number of other high-concurrency OLAP use-cases.
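
(You can see the spill-over directly in pg_locks: its fastpath column distinguishes locks held in the per-backend slots from those that had to go through the shared lock manager. A sketch:)

    -- Relation locks with fastpath = false overflowed the per-backend
    -- fast-path slots into the shared lock manager, whose partition
    -- LWLocks are what surface as LockManager waits.
    SELECT pid, fastpath, count(*) AS relation_locks
      FROM pg_locks
     WHERE locktype = 'relation'
     GROUP BY pid, fastpath
     ORDER BY pid, fastpath;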

My proposal for how this "lock elision under large numbers of immutable partitions" could be accomplished:

1. Add some DDL statement to mark partitions as sealed/unsealed (e.g. ALTER TABLE ... SEAL PARTITION foo; a fuller sketch follows this list).
2. When query-planning DML against a partition or a partitioned table, treat a sealed partition as if it had an always-false check constraint.
3. Define a "locking group" abstraction, where many entities can register themselves under the same lock, such that access to all members of the locking group requires only acquiring the single locking-group lock. All sealed partitions of the same table would share a locking group.
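
To make that concrete, the DDL might look something like this (entirely hypothetical syntax; none of it exists today, and the table names are illustrative):

    -- Seal a closed time period: DML against it becomes a constraint
    -- violation, and it joins the table's sealed-partitions locking group.
    ALTER TABLE events SEAL PARTITION events_2021;

    -- Unsealing is administrative DDL, and is the rare operation that
    -- would take the locking-group lock in exclusive mode.
    ALTER TABLE events UNSEAL PARTITION events_2021;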

Under such a setup, querying a time-based partitioned table with one active (unsealed) partition would only ever require acquiring, at most, two locks — the one for the active partition, and the one for the sealed-partitions locking group.

The trade-off is that acquiring an exclusive lock on a table's sealed-partitions locking group becomes much more expensive than acquiring one on a single partition would have been. But this isn't a problem in practice, because the hot-path operations that take an exclusive lock (DML writes) are disallowed against sealed partitions. The only time the locking-group lock would need to be acquired exclusively would be to change the group's membership, which is an administrative DDL operation.

Besides being useful operationally, such a mechanism would also be helpful on the business-logic level, since partition sealing turns accidental insertion of new data into anything but the active partition(s) into a constraint violation. (Currently, achieving this requires maintaining a separate trigger on each sealed partition.)
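
For comparison, this is the shape of what has to be maintained per-partition today (a sketch; table and function names are illustrative):

    -- One shared trigger function...
    CREATE FUNCTION reject_sealed_dml() RETURNS trigger AS $$
    BEGIN
      RAISE EXCEPTION 'partition % is sealed', TG_TABLE_NAME;
    END;
    $$ LANGUAGE plpgsql;

    -- ...but one trigger per historical partition, forever.
    CREATE TRIGGER events_2021_sealed
      BEFORE INSERT OR UPDATE OR DELETE ON events_2021
      FOR EACH STATEMENT EXECUTE FUNCTION reject_sealed_dml();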

And, with knowledge of the administrative intent for a table to be immutable, further operational optimizations could be performed. A few off the top of my head:

1. Running CLUSTER or VACUUM (FULL, FREEZE) after the partition is marked as immutable could rewrite the table using an implicit "heap_immutable" access method (still reported as "heap"), which would drop the xmin column (as everything in a sealed table is guaranteed to be always-visible), and thus remove the table from consideration for xid-wraparound-protection rewriting. Such partitions would then require a rewrite back to "heap" if unsealed.
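
(The closest approximation available today is an aggressive freeze, which advances relfrozenxid but still leaves the xmin column physically on disk; a sketch, with an illustrative partition name:)

    -- Freeze every tuple in the closed partition, visiting all pages.
    VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) events_2021;

    -- age(relfrozenxid) should now be small, deferring (though not
    -- permanently eliminating) anti-wraparound vacuums.
    SELECT relname, age(relfrozenxid)
      FROM pg_class
     WHERE relname = 'events_2021';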

2. Alternatively, such storage-rewriting DDL statements could switch the table — and its indices — over to an entirely different access method, which would store the data and indices in "perfect" packed forms, to maximize read performance while also minimizing disk usage.

3. ANALYZE could have an (otherwise-impractical) "EXACT" argument, to populate statistics with exact aggregate values, requiring reading all rows rather than sampling rows. This could pre-bake table-level aggregates for most columns, like having a single, table-sized BRIN block-range.
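
(Hypothetical syntax, next to what can be approximated today by raising the sample size to its maximum; names are illustrative:)

    -- Proposed, does not exist: exact statistics from a full scan.
    ANALYZE (EXACT) events_2021;

    -- Today's nearest approximation: a larger, but still sampled, target.
    ALTER TABLE events_2021 ALTER COLUMN person_id SET STATISTICS 10000;
    ANALYZE events_2021;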

If this concept of "marking as sealed" were extended to tables rather than only partitions, then further work could be done to optimize bulk loads. For example, CREATE TABLE AS ... SEALED could avoid generating WAL for the table as it is populated, treating it as UNLOGGED during population. After creation, the entire finalized/sealed table's backing files could either be passed directly to archive_command / sent directly to WAL receivers, or split and streamed as retrospective WAL segments (each segment containing a single "put this 16MB of data into this file at this position" op).
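
(The nearest existing behavior is the wal_level = minimal optimization, where a table created by the same transaction that populates it skips per-tuple WAL and is fsynced at commit instead. But that setting is cluster-wide and forfeits archiving and replication entirely, which is exactly what the retrospective-WAL idea above would preserve. A sketch, with illustrative names:)

    -- Requires wal_level = minimal and max_wal_senders = 0 in
    -- postgresql.conf; the load then writes no per-row WAL.
    BEGIN;
    CREATE TABLE events_2021_packed AS
      SELECT * FROM events_2021 ORDER BY happened_at;
    COMMIT;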

Re: Feature proposal: immutable/sealed partitions (and maybe tables, too)

From: Ron
Date:
By "SEALED", do you mean "READ ONLY"?

On 9/6/22 14:39, Levi Aul wrote:
> My company runs some large OLAP data warehouses with append-only,
> time-partitioned datasets. Our workloads involve aggregations and joins,
> and query the data in ways not amenable to constraint-exclusion; and we
> serve a large number of these queries concurrently from a single DB.
>
> [...]

-- 
Angular momentum makes the world go 'round.



Re: Feature proposal: immutable/sealed partitions (and maybe tables, too)

From: David Rowley
Date:
On Wed, 7 Sept 2022 at 07:40, Levi Aul <levi@covalenthq.com> wrote:
> In other words, our workload is inherently one that acquires "way too many
> locks." Our largest performance bottleneck, according to pg_wait_sampling,
> is the LockManager itself. Despite most of our queries spending only
> milliseconds actually executing, they often spend seconds during planning
> waiting to acquire hundreds of access-share locks.

It would be good to have a better understanding of the problem you're
facing.  There have been many changes to table partitioning since
PostgreSQL 10 and many of those changes affect the number of
partitions which are locked for certain classes of queries.

It would be good to know the following:

1. Which version of PostgreSQL are you having this problem with, and;
2. Example of queries you're having this problem with.

If you share that information we may be able to inform you about
features/performance improvements in newer versions which help with
the problem you're facing.

You mention "constraint-exclusion"; that's no longer how we perform
partition pruning, and hasn't been since (if I remember correctly)
PostgreSQL 11. Perhaps you're using PG10?

David



Re: Feature proposal: immutable/sealed partitions (and maybe tables, too)

From: Levi Aul
Date:
We're using Postgres 14.5. I meant partition pruning.

To be clear, this isn't a bug report. There is no bug; everything is working exactly as it should. The partitions are not being pruned because the workload consists of OLAP aggregations that fetch a small number of rows spread across all partitions in the set, relying for speed on an index that isn't prefixed with the partitioning key (nor can it be).

I'll try to avoid the particulars of the business domain (it gets clumsy because there are a lot of jargon collisions, with tables named things like "transactions"), but you can think of it abstractly as follows: we have a table holding a CQRS event-stream; we are trying to discover all events related to a particular person, where events "related to a person" are related either directly (by an indexed field of the event) or indirectly (by a foreign-key reference to the event from a row in a second partitioned table — let's call it "event attributes" — where the person is an indexed field of the event-attribute). We construct/linearize/uniquify a time-ordered rowset of all such related events, and then we reduce/aggregate some value fields from those events. This query requires index scans of all N partitions of events + all N partitions of event_attributes.
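
Abstractly, the query shape is something like the following (all names invented for illustration; the real person indexes are not prefixed by the partition key, so all N partitions of both tables get index-scanned, and therefore locked):

    SELECT e.happened_at, e.payload
      FROM events e
     WHERE e.person_id = $1              -- directly-related events
    UNION                                -- uniquify the combined rowset
    SELECT e.happened_at, e.payload
      FROM events e
      JOIN event_attributes ea ON ea.event_id = e.id
     WHERE ea.person_id = $1             -- indirectly-related events
     ORDER BY happened_at;               -- then reduce/aggregate over this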

This workload is performing exactly how you'd expect it to perform (i.e. badly) given Postgres's current operational pragmatics around partition locking. The only way it could possibly perform better, is if Postgres didn't have to acquire N shared-access locks in order to index-scan N partitions. And the only way that could work, is if Postgres could make some assumption about the locking behavior of the partitions. Thus my feature proposal.

To be clear, I'm more interested in discussing the feature proposal than in solving the immediate problem. The immediate problem has an obvious, if painful, solution: merging the historical partitions into a single huge historical partition per table. The feature proposal, meanwhile, has the potential to solve many of our current business-level problems if the "further optimizations" I mentioned can be made.

Also, to be clear, I'm interested in implementing the feature I've proposed myself. I've read the relevant parts of the Postgres codebase and feel confident I can make the required changes. I would just like to have a design discussion around the shape the feature should take, with Postgres stakeholders, before I go to all that effort to build something they might not accept upstream.

On Tue, Sep 6, 2022 at 5:53 PM David Rowley <dgrowleyml@gmail.com> wrote:
> [...]


--

Levi Aul, CTO, Covalent

Re: Feature proposal: immutable/sealed partitions (and maybe tables, too)

From: David Rowley
Date:
On Wed, 7 Sept 2022 at 13:33, Levi Aul <levi@covalenthq.com> wrote:
> To be clear, this isn't a bug report. There is no bug; everything is
> working exactly as it should. The partitions are not being pruned because
> the workload consists of OLAP aggregations that fetch a small number of
> rows spread across all partitions in the set, relying for speed on an
> index that isn't prefixed with the partitioning key (nor can it be).

Probably the -hackers mailing list might be a better place to discuss
design ideas for new features.  -general is more for general help with
using the software, not hacking on it.

The main reason individual partitions need to be locked is because
they can still be referenced by queries directly as if they were just
a normal table. To get around that we'd either need to have the
locking groups, as you describe, or remove the ability to access the
partition directly, not through the top-level partitioned table.  The
ship has probably sailed on the latter one, but it probably could be
done as an opt-in feature if the former was too difficult or
impractical.

FWIW, I'm not quite seeing why you need "sealed" partitions for the
group locking idea.  I understand the other parts you mentioned about
conversion to a table AM which is more optimized for non-transactional
workloads, but that seems like a different problem that you're mixing
in, which adds complexity to the whole thing. If so, it might be better
to keep it separate, so as not to confuse or complicate your
explanation of the problem and proposed solution.

I'd suggest posting to -hackers and stating that your queries can't
make use of partition pruning and that currently all partitions are
being locked and you believe that this is a bottleneck.  Some examples
of perf output to show how large the locking overhead is. Extra points
for hacking up some crude code so we don't obtain the partition locks
to show what the performance could be if we didn't lock all the
partitions. That'll help show you have a worthy cause. FWIW, I'm
surprised that executor startup / shutdown for a plan which accesses a
large number of partitions is not drowning out the locking overheads.
As far as I knew, this problem was only visible when run-time
partition pruning removed the large majority of the Append/MergeAppend
subnodes and made executor startup/shutdown significantly faster.

David