Discussion: When to use PARTITION BY HASH?


When to use PARTITION BY HASH?

From:
Oleksandr Shulgin
Date:
Hi!

I was reading up on declarative partitioning[1] and I'm not sure what could be a possible application of Hash partitioning.

Is anyone actually using it?  What are typical use cases?  What benefits does such a partitioning scheme provide?

On its face, it seems that it can only give you a number of tables which are smaller than the un-partitioned one, but I fail to see how it would provide any of the potential advantages listed in the documentation.

With a reasonable hash function, the distribution of rows across partitions should be more or less equal, so I wouldn't expect any of the following to hold true:
- "...most of the heavily accessed rows of the table are in a single partition or a small number of partitions."
- "Bulk loads and deletes can be accomplished by adding or removing partitions...",
etc.

That *might* turn out to be the case with a small number of distinct values in the partitioning column(s), but then why rely on hash assignment instead of using PARTITION BY LIST in the first place?
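
For concreteness, here is a minimal sketch of the two schemes I'm comparing (all names are made up):

CREATE TABLE events (
    account_id bigint NOT NULL,
    payload    text
) PARTITION BY HASH (account_id);

CREATE TABLE events_p0 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_p2 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_p3 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- versus spelling out the assignment per key value:
CREATE TABLE events_by_list (
    account_id bigint NOT NULL,
    payload    text
) PARTITION BY LIST (account_id);

CREATE TABLE events_by_list_p1 PARTITION OF events_by_list
    FOR VALUES IN (1, 2, 3);

Regards,
--
Alex

[1] https://www.postgresql.org/docs/12/ddl-partitioning.html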

Re: When to use PARTITION BY HASH?

From:
MichaelDBA
Date:
Hi,

I use it quite often, since I'm dealing with partitioning keys that have
high cardinality, i.e., a high number of distinct values.  If your
cardinality is very high, but your spacing between values is not
uniform, HASH will balance your partitioned tables naturally.  If your
spacing between values is consistent, perhaps RANGE partitioning would
be better.
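
For example, a sketch with made-up names (the exact layout will of
course depend on your data):

-- high-cardinality, unevenly spaced keys: hash balances the tables
CREATE TABLE measurements (
    device_id bigint NOT NULL,
    reading   numeric
) PARTITION BY HASH (device_id);

CREATE TABLE measurements_p0 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- ...and so on for remainders 1 through 3

-- evenly spaced keys: ranges stay balanced on their own
CREATE TABLE readings (
    taken_at timestamptz NOT NULL,
    reading  numeric
) PARTITION BY RANGE (taken_at);

CREATE TABLE readings_2020_06 PARTITION OF readings
    FOR VALUES FROM ('2020-06-01') TO ('2020-07-01');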

Regards,
Michael Vitale

Oleksandr Shulgin wrote on 6/2/2020 1:17 PM:
> Hi!
>
> I was reading up on declarative partitioning[1] and I'm not sure what 
> could be a possible application of Hash partitioning.
>
> Is anyone actually using it? What are typical use cases?  What 
> benefits does such a partitioning scheme provide?
>
> On its face, it seems that it can only give you a number of tables 
> which are smaller than the un-partitioned one, but I fail to see how 
> it would provide any of the potential advantages listed in the 
> documentation.
>
> With a reasonable hash function, the distribution of rows across 
> partitions should be more or less equal, so I wouldn't expect any of 
> the following to hold true:
> - "...most of the heavily accessed rows of the table are in a single 
> partition or a small number of partitions."
> - "Bulk loads and deletes can be accomplished by adding or removing 
> partitions...",
> etc.
>
> That *might* turn out to be the case with a small number of distinct 
> values in the partitioning column(s), but then why rely on hash 
> assignment instead of using PARTITION BY LIST in the first place?
>
> Regards,
> -- 
> Alex
>
> [1] https://www.postgresql.org/docs/12/ddl-partitioning.html
>




Re: When to use PARTITION BY HASH?

From:
"David G. Johnston"
Date:
On Tue, Jun 2, 2020 at 10:17 AM Oleksandr Shulgin <oleksandr.shulgin@zalando.de> wrote:
That *might* turn out to be the case with a small number of distinct values in the partitioning column(s), but then why rely on hash assignment instead of using PARTITION BY LIST in the first place?


Why the cross-posting? (-performance is oriented toward problem solving, not theory, so -general is the one and only PostgreSQL list this should have been sent to)

Anyway, quoting the documentation you linked to:

"When choosing how to partition your table, it's also important to consider what changes may occur in the future. For example, if you choose to have one partition per customer and you currently have a small number of large customers, consider the implications if in several years you instead find yourself with a large number of small customers. In this case, it may be better to choose to partition by HASH and choose a reasonable number of partitions rather than trying to partition by LIST and hoping that the number of customers does not increase beyond what it is practical to partition the data by."

Hashing does indeed preclude some of the benefits and introduces others.

I suspect that having a hash function that turns its input into a different output and checking for equality on the output would be better than trying to "OR" a partition list together in order to combine multiple inputs onto the same table.
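
A sketch of what I mean, with hypothetical names:

-- with LIST you enumerate ("OR" together) the keys per partition,
-- and maintain those lists as customers come and go:
CREATE TABLE orders_by_list (customer_id bigint, total numeric)
    PARTITION BY LIST (customer_id);
CREATE TABLE orders_by_list_p1 PARTITION OF orders_by_list
    FOR VALUES IN (101, 205, 333);

-- with HASH the hash function combines many customers onto each table:
CREATE TABLE orders_by_hash (customer_id bigint, total numeric)
    PARTITION BY HASH (customer_id);
CREATE TABLE orders_by_hash_p0 PARTITION OF orders_by_hash
    FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE orders_by_hash_p1 PARTITION OF orders_by_hash
    FOR VALUES WITH (MODULUS 2, REMAINDER 1);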

David J.


Re: When to use PARTITION BY HASH?

From:
Michel Pelletier
Date:


On Tue, Jun 2, 2020 at 10:17 AM Oleksandr Shulgin <oleksandr.shulgin@zalando.de> wrote:
Hi!

I was reading up on declarative partitioning[1] and I'm not sure what could be a possible application of Hash partitioning.

Is anyone actually using it?  What are typical use cases?  What benefits does such a partitioning scheme provide?

On its face, it seems that it can only give you a number of tables which are smaller than the un-partitioned one, but I fail to see how it would provide any of the potential advantages listed in the documentation.

I'm sure there will be many delightful answers to your question, and I look forward to them!  From my point of view, hash partitioning is very useful for spreading out high insert/update load.  Yes, it's true you end up with many smaller tables instead of one big one, but remember the indexes are (often) tree data structures.  Smaller trees are faster than bigger trees, so by making the indexes smaller you make them faster.  Since the planner knows to examine only the specific index it needs, this ends up being a lot faster.

Postgres can also parallelize queries on partitions.  This is different from a parallel sequential scan, which can also happen per-partition, so there are multiple levels of parallel opportunity.

And last that I can think of, you can put the different partitions in different tablespaces, improving the total IO bandwidth.
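
Something along these lines (a sketch; the tablespace names and paths are made up):

CREATE TABLESPACE disk1 LOCATION '/mnt/disk1/pgdata';
CREATE TABLESPACE disk2 LOCATION '/mnt/disk2/pgdata';

CREATE TABLE metrics (id bigint, v numeric) PARTITION BY HASH (id);

CREATE TABLE metrics_p0 PARTITION OF metrics
    FOR VALUES WITH (MODULUS 2, REMAINDER 0) TABLESPACE disk1;
CREATE TABLE metrics_p1 PARTITION OF metrics
    FOR VALUES WITH (MODULUS 2, REMAINDER 1) TABLESPACE disk2;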

-Michel



With a reasonable hash function, the distribution of rows across partitions should be more or less equal, so I wouldn't expect any of the following to hold true:
- "...most of the heavily accessed rows of the table are in a single partition or a small number of partitions."
- "Bulk loads and deletes can be accomplished by adding or removing partitions...",
etc.

That *might* turn out to be the case with a small number of distinct values in the partitioning column(s), but then why rely on hash assignment instead of using PARTITION BY LIST in the first place?


Re: When to use PARTITION BY HASH?

From:
Stephen Frost
Date:
Greetings,

Please don't cross post to multiple lists without any particular reason
for doing so- pick whichever list makes sense and post to that.

* Oleksandr Shulgin (oleksandr.shulgin@zalando.de) wrote:
> I was reading up on declarative partitioning[1] and I'm not sure what could
> be a possible application of Hash partitioning.

Yeah, I tend to agree with this.

> Is anyone actually using it?  What are typical use cases?  What benefits
> does such a partitioning scheme provide?

I'm sure folks are using it but that doesn't make it a good solution.

> On its face, it seems that it can only give you a number of tables which
> are smaller than the un-partitioned one, but I fail to see how it would
> provide any of the potential advantages listed in the documentation.

Having smaller tables can be helpful when it comes to dealing with
things like VACUUM (particularly since, even though we can avoid having
to scan the entire heap, we have to go through the indexes in order to
clean them up, and generally larger tables have larger indexes),
however...

> With a reasonable hash function, the distribution of rows across partitions
> should be more or less equal, so I wouldn't expect any of the following to
> hold true:
> - "...most of the heavily accessed rows of the table are in a single
> partition or a small number of partitions."
> - "Bulk loads and deletes can be accomplished by adding or removing
> partitions...",
> etc.
>
> That *might* turn out to be the case with a small number of distinct values
> in the partitioning column(s), but then why rely on hash assignment instead
> of using PARTITION BY LIST in the first place?

You're entirely correct with this- there's certainly no small number of
situations where you end up with a 'hot' partition when using hashing
(which is true in other RDBMS's too, of course...) and that ends up
being pretty painful to deal with.

Also, you're right that you don't get to do bulk load/drop when using
hash partitioning, which is absolutely one of the largest benefits to
partitioning in the first place, so, yeah, its usefulness is rather
limited.  Better to do your own partitioning based on the actual usage
patterns that you know about and the database's hash function certainly
doesn't.

Thanks,

Stephen

Attachments

Re: When to use PARTITION BY HASH?

From:
Oleksandr Shulgin
Date:
On Tue, Jun 2, 2020 at 7:47 PM Stephen Frost <sfrost@snowman.net> wrote:

Please don't cross post to multiple lists without any particular reason
for doing so- pick whichever list makes sense and post to that.

Sorry for the trouble, I should've checked it more carefully.
When posting I did think it may be relevant to the performance list as well.

At the same time, wouldn't it make sense to document this policy explicitly?
/me resists the urge of cross-posting to pgsql-www ;)

Cheers,
--
Alex

Re: When to use PARTITION BY HASH?

From:
Oleksandr Shulgin
Date:
(sticking to pgsql-general)

On Tue, Jun 2, 2020 at 7:45 PM Michel Pelletier <pelletier.michel@gmail.com> wrote:

On Tue, Jun 2, 2020 at 10:17 AM Oleksandr Shulgin <oleksandr.shulgin@zalando.de> wrote:

I was reading up on declarative partitioning[1] and I'm not sure what could be a possible application of Hash partitioning.

Is anyone actually using it?  What are typical use cases?  What benefits does such a partitioning scheme provide?

On its face, it seems that it can only give you a number of tables which are smaller than the un-partitioned one, but I fail to see how it would provide any of the potential advantages listed in the documentation.

 
From my point of view, hash partitioning is very useful for spreading out high insert/update load.

Do you also assign the partitions to different tablespaces as you've hinted below, or do you see a performance improvement from partitioning alone?  How does that work?  Does it give better results than using a RAID to spread the disk IO, for example?

Yes, it's true you end up with many smaller tables instead of one big one, but remember the indexes are (often) tree data structures.  Smaller trees are faster than bigger trees, so by making the indexes smaller you make them faster.  Since the planner knows to examine only the specific index it needs, this ends up being a lot faster.

That sounds logical, but can it be demonstrated?  If the index(es) fit in memory fully, it doesn't make a measurable difference, I guess?

With hash partitioning you are not expected, in general, to end up with a small number of partitions being accessed more heavily than the rest.  So your indexes will also not fit into memory.

I have the feeling that using a hash function to distribute rows simply contradicts the basic assumption of when you would think of partitioning your table at all: that is to make sure the most active part of the table and indexes is small enough to be cached in memory.

Regards,
--
Alex

Re: When to use PARTITION BY HASH?

From:
Jeff Janes
Date:
On Wed, Jun 3, 2020 at 7:55 AM Oleksandr Shulgin <oleksandr.shulgin@zalando.de> wrote:

With hash partitioning you are not expected, in general, to end up with a small number of partitions being accessed more heavily than the rest.  So your indexes will also not fit into memory.

I have the feeling that using a hash function to distribute rows simply contradicts the basic assumption of when you would think of partitioning your table at all: that is to make sure the most active part of the table and indexes is small enough to be cached in memory.

While hash partitioning doesn't appeal to me, I think this may be overly pessimistic.  It would not be all that unusual for your customers to take turns being highly active and less active.  Especially if you do occasional bulk loads all with the same customer_id for any given load, for example.  So while you might not have a permanently hot partition, you could have partitions which are hot in turn.  Of course you could get the same benefit (and probably better) with list or range partitioning rather than hash, but then you have to maintain those lists or ranges when you add new customers.

Cheers,

Jeff

Re: When to use PARTITION BY HASH?

From:
Oleksandr Shulgin
Date:
On Thu, Jun 4, 2020 at 4:32 PM Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Jun 3, 2020 at 7:55 AM Oleksandr Shulgin <oleksandr.shulgin@zalando.de> wrote:

With hash partitioning you are not expected, in general, to end up with a small number of partitions being accessed more heavily than the rest.  So your indexes will also not fit into memory.

I have the feeling that using a hash function to distribute rows simply contradicts the basic assumption of when you would think of partitioning your table at all: that is to make sure the most active part of the table and indexes is small enough to be cached in memory.

While hash partitioning doesn't appeal to me, I think this may be overly pessimistic.  It would not be all that unusual for your customers to take turns being highly active and less active.  Especially if you do occasional bulk loads all with the same customer_id for any given load, for example.

For a bulk load you'd likely want to go with an empty partition w/o indexes and build them later, after loading the tuples.  While it might not be possible with any given partitioning scheme either, using hash partitioning most certainly precludes that.
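
The pattern I have in mind is roughly this (a sketch, assuming a sales table range-partitioned on a date column; the names and the file path are made up):

-- build the new partition's contents outside the partitioned table
CREATE TABLE sales_2020_06 (LIKE sales INCLUDING DEFAULTS);
COPY sales_2020_06 FROM '/tmp/sales_june.csv' WITH (FORMAT csv);

-- index only after the data is loaded, then attach
CREATE INDEX ON sales_2020_06 (customer_id);
ALTER TABLE sales ATTACH PARTITION sales_2020_06
    FOR VALUES FROM ('2020-06-01') TO ('2020-07-01');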
 
So while you might not have a permanently hot partition, you could have partitions which are hot in turn.  Of course you could get the same benefit (and probably better) with list or range partitioning rather than hash, but then you have to maintain those lists or ranges when you add new customers.

Why is LRU eviction from the shared buffers and the OS disk cache not good enough to handle this?

This actually applies to any partitioning scheme: the hot dataset could be recognized by these caching layers.  Does it not happen in practice?

--
Alex

Re: When to use PARTITION BY HASH?

From:
Jeff Janes
Date:
On Fri, Jun 5, 2020 at 6:12 AM Oleksandr Shulgin <oleksandr.shulgin@zalando.de> wrote:
On Thu, Jun 4, 2020 at 4:32 PM Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Jun 3, 2020 at 7:55 AM Oleksandr Shulgin <oleksandr.shulgin@zalando.de> wrote:

With hash partitioning you are not expected, in general, to end up with a small number of partitions being accessed more heavily than the rest.  So your indexes will also not fit into memory.

I have the feeling that using a hash function to distribute rows simply contradicts the basic assumption of when you would think of partitioning your table at all: that is to make sure the most active part of the table and indexes is small enough to be cached in memory.

While hash partitioning doesn't appeal to me, I think this may be overly pessimistic.  It would not be all that unusual for your customers to take turns being highly active and less active.  Especially if you do occasional bulk loads all with the same customer_id for any given load, for example.

For a bulk load you'd likely want to go with an empty partition w/o indexes and build them later, after loading the tuples. 

That only works if the bulk load is starting from zero.  If you are adding a million rows to something that already has 100 million, you would probably spend more time rebuilding the indexes than you saved by dropping them.  And of course to go with an empty partition, you have to be using partitioning of some kind to start with; and then you need to be futzing around creating/detaching and indexing and attaching.  With hash partitioning, you might get much of the benefit with none of the futzing.
 
So while you might not have a permanently hot partition, you could have partitions which are hot in turn.  Of course you could get the same benefit (and probably better) with list or range partitioning rather than hash, but then you have to maintain those lists or ranges when you add new customers.

Why is LRU eviction from the shared buffers and the OS disk cache not good enough to handle this?

Data density.  If the rows are spread out randomly throughout the table, the density of currently relevant tuples per MB of RAM is much lower than if they are in partitions which align with current relevance.  Of course you could CLUSTER the table on what would otherwise be the partition key, but clustered tables don't stay clustered, while partitioned ones stay partitioned.  Also, clustering the table wouldn't help with the relevant data density in the indexes (other than the index being clustered on, or other ones highly correlated with that one).  This can be particularly important for index maintenance and with HDD, as the OS disk cache is in my experience pretty bad at deciding when to write dirty blocks which have been handed to it, versus retain them in the hopes they will be re-dirtied soon, or have adjacent blocks dirtied and then combined into one write.
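
For reference, the one-shot clustering I mean (table and index names are made up):

CLUSTER orders USING orders_customer_id_idx;
-- rows are now physically ordered by customer_id, but only until new
-- inserts and updates scatter them again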
 

This actually applies to any partitioning scheme: the hot dataset could be recognized by these caching layers.  Does it not happen in practice?

Caching only happens at the page level, not the tuple level.  So if your hot tuples are interspersed with cold ones, you can get poor caching effectiveness.

Cheers,

Jeff

Re: When to use PARTITION BY HASH?

From:
Michel Pelletier
Date:
On Wed, Jun 3, 2020 at 4:55 AM Oleksandr Shulgin <oleksandr.shulgin@zalando.de> wrote:

Do you also assign the partitions to different tablespaces as you've hinted below, or do you see a performance improvement from partitioning alone?  How does that work?  Does it give better results than using a RAID to spread the disk IO, for example?

In general you could find write throughput improvements from all three, partitioning, tablespacing, and disk striping.  It depends on your problem.   Hash partitioning is a common feature in other databases as well. The hash strategy works for many distributed access patterns.


Yes, it's true you end up with many smaller tables instead of one big one, but remember the indexes are (often) tree data structures.  Smaller trees are faster than bigger trees, so by making the indexes smaller you make them faster.  Since the planner knows to examine only the specific index it needs, this ends up being a lot faster.

That sounds logical, but can it be demonstrated?  If the index(es) fit in memory fully, it doesn't make a measurable difference, I guess?

Well, let's take a step back here and look at the question: hash partitioning exists in Postgres, is it useful?  While I appreciate the need to see a fact demonstrated, and generally avoid argument by authority, it is true that many of the very smartest database people in the world conceived of, discussed, implemented and documented this feature for us.  It stands to reason that it is useful, or it wouldn't exist.  So maybe this is more about finding, or needing, better partitioning documentation.


With hash partitioning you are not expected, in general, to end up with a small number of partitions being accessed more heavily than the rest.  So your indexes will also not fit into memory.

Indexes are not (usually) constant time structures, they take more time the bigger they get.  So partitioned indexes will be smaller, quicker to insert into, and quicker to vacuum, and also gain possible pruning advantages on query when you split them up.  If the planner can, knowing the key, exclude all but one partition, it won't even look at the other tables, so if you hash partition by primary key, you reduce the search space to 1/N immediately. 
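
A quick way to see the pruning, assuming a table hash-partitioned on its primary key (names made up):

EXPLAIN SELECT * FROM accounts WHERE account_id = 42;
-- the plan shows a scan of a single partition, e.g. accounts_p3; the
-- other N-1 partitions are pruned and never touched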

Indexes with high update activity also suffer from a problem called "index bloat" where sparse "holes" get punched in the buckets of btree indexes by updates and deletes.  These holes are minimized by vacuuming, but the bigger the index gets, the harder that process is to maintain.  Smaller indexes suffer less from index bloat, and remedying the situation is easier because you can reindex partitions independently of each other.  You're not just reducing the query load to an Nth, you're reducing the maintenance load.
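
And the remedy can be applied one partition at a time (the index name here is made up):

REINDEX TABLE accounts_p3;
-- or, on Postgres 12 and later, without blocking writes:
REINDEX INDEX CONCURRENTLY accounts_p3_pkey;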

I have the feeling that using a hash function to distribute rows simply contradicts the basic assumption of when you would think of partitioning your table at all: that is to make sure the most active part of the table and indexes is small enough to be cached in memory.

I think you might be framing this with a specific data pattern in mind; not all data sets have a "most active" or power-law distribution.  For example, I work with a lot of commercial airline position data that services both real-time queries and ad-hoc analytical queries over arbitrary airframe identifiers.  There is no advantage in trying to have a "most active" data strategy because all airframes in the air at any given time are by definition most active.  A medium-sized drone may send out as many pings as a jumbo jet in a given interval of time.


-Michel
 

Regards,
--
Alex

Re: When to use PARTITION BY HASH?

From:
Ron
Date:
On 6/5/20 8:51 AM, Jeff Janes wrote:
On Fri, Jun 5, 2020 at 6:12 AM Oleksandr Shulgin <oleksandr.shulgin@zalando.de> wrote:
[snip]
For a bulk load you'd likely want to go with an empty partition w/o indexes and build them later, after loading the tuples. 

That only works if the bulk load is starting from zero.  If you are adding a million rows to something that already has 100 million, you would probably spend more time rebuilding the indexes than you saved by dropping them.

It's too bad that Postgres doesn't have "deferred index updates" during bulk (but still transactional) loads, where the index nodes are updated en masse every "commit count" number of rows.  That's really useful in this situation, but I've only seen it in one legacy RDBMS.

--
Angular momentum makes the world go 'round.

Re: When to use PARTITION BY HASH?

From:
Oleksandr Shulgin
Date:
On Sat, Jun 6, 2020 at 6:14 PM Michel Pelletier <pelletier.michel@gmail.com> wrote:

Well, let's take a step back here and look at the question: hash partitioning exists in Postgres, is it useful?  While I appreciate the need to see a fact demonstrated, and generally avoid argument by authority, it is true that many of the very smartest database people in the world conceived of, discussed, implemented and documented this feature for us.  It stands to reason that it is useful, or it wouldn't exist.  So maybe this is more about finding, or needing, better partitioning documentation.

Fair point.

I've found the original commit adding this feature in version 11:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1aba8e651ac3e37e1d2d875842de1e0ed22a651e

It says:

"Hash partitioning is useful when you want to partition a growing data
set evenly.  This can be useful to keep table sizes reasonable, which
makes maintenance operations such as VACUUM faster, or to enable
partition-wise join."

It also includes a link to discussion, though that starts in the middle of a long thread.

However, these threads mostly argue about implementation details, and it's not easy to find a discussion of the motivation for supporting this particular partitioning scheme.
I guess it was quite obvious to the participants at that point already.

With hash partitioning you are not expected, in general, to end up with a small number of partitions being accessed more heavily than the rest.  So your indexes will also not fit into memory.

Indexes are not (usually) constant time structures, they take more time the bigger they get.  So partitioned indexes will be smaller, quicker to insert into, and quicker to vacuum, and also gain possible pruning advantages on query when you split them up.  If the planner can, knowing the key, exclude all but one partition, it won't even look at the other tables, so if you hash partition by primary key, you reduce the search space to 1/N immediately. 

Indexes with high update activity also suffer from a problem called "index bloat" where sparse "holes" get punched in the buckets of btree indexes by updates and deletes.  These holes are minimized by vacuuming, but the bigger the index gets, the harder that process is to maintain.  Smaller indexes suffer less from index bloat, and remedying the situation is easier because you can reindex partitions independently of each other.  You're not just reducing the query load to an Nth, you're reducing the maintenance load.

Thanks for taking your time to explain it in detail.  Though I do not tend to believe the insert/scan performance benefit is measurable without trying it, I do see the benefits for maintenance.

I have the feeling that using a hash function to distribute rows simply contradicts the basic assumption of when you would think of partitioning your table at all: that is to make sure the most active part of the table and indexes is small enough to be cached in memory.

I think you might be framing this with a specific data pattern in mind; not all data sets have a "most active" or power-law distribution.

I'm just referring to the first bullet-point in the docs:

"Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. The partitioning substitutes for leading columns of indexes, reducing index size and making it more likely that the heavily-used parts of the indexes fit in memory."

I think it does not apply to hash partitioning in the general case.

--
Alex

Re: When to use PARTITION BY HASH?

From:
Ron
Date:
On 6/8/20 3:40 AM, Oleksandr Shulgin wrote:
[snip]
It says:

"Hash partitioning is useful when you want to partition a growing data
set evenly.  This can be useful to keep table sizes reasonable, which
makes maintenance operations such as VACUUM faster, or to enable
partition-wise join."

How does hashed (meaning "random"?) distribution of records make partition-wise joins more efficient?

Or -- since I interpret that as having to do with "locality of data" -- am I misunderstanding the meaning of "partition-wise joins"?

--
Angular momentum makes the world go 'round.

Re: When to use PARTITION BY HASH?

From:
David Rowley
Date:
On Tue, 9 Jun 2020 at 01:07, Ron <ronljohnsonjr@gmail.com> wrote:
>
> On 6/8/20 3:40 AM, Oleksandr Shulgin wrote:
> [snip]
>
> I've found the original commit adding this feature in version 11:
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1aba8e651ac3e37e1d2d875842de1e0ed22a651e
> It says:
>
> "Hash partitioning is useful when you want to partition a growing data
> set evenly.  This can be useful to keep table sizes reasonable, which
> makes maintenance operations such as VACUUM faster, or to enable
> partition-wise join."
>
>
> How does hashed (meaning "random"?) distribution of records make partition-wise joins more efficient?

Hash partitioning certainly does not mean putting the tuple in some
random partition. It means putting the tuple in the partition with the
correct remainder value after dividing the hash value by the largest
partition modulus.
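
You can check the routing yourself with the internal helper that
partition constraints use (the table name is made up, and the key
value must be cast to the partition key's type):

-- true if a bigint key of 42 would be routed to the partition declared
-- WITH (MODULUS 4, REMAINDER 0) of the hash-partitioned table "events"
SELECT satisfies_hash_partition('events'::regclass, 4, 0, 42::bigint);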

> Or -- since I interpret that as having to do with "locality of data" --
> am I misunderstanding the meaning of "partition-wise joins"?

If it was not a partitioned table before then partition-wise joins
wouldn't be possible.  Having partition-wise joins could make joining
two identically partitioned tables faster. We need only look in the
corresponding partition on the other side of the join for join
partners for each tuple. For hash joins, hash tables can be smaller,
which can mean not having to batch, and possibly having the hash table
fit better into a CPU cache. For merge joins and sorts, having the
data partially pre-sorted in chunks means fewer operations for qsort,
which can result in speedups.
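
A sketch of how to see it, assuming two tables hash-partitioned
identically on customer_id (names made up):

SET enable_partitionwise_join = on;  -- it is off by default

EXPLAIN SELECT *
FROM orders o
JOIN customers c USING (customer_id);
-- the plan joins matching partition pairs (orders_p0 with
-- customers_p0, and so on) instead of the two whole tables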

David