Discussion: Postgresql partitioning - single hot table or distributed
Hi,

we are considering database partitioning as a possible solution to some performance issues we are having with our database, and we are trying to decide on a partitioning scheme. We have a moderately write-heavy application (approx 50 inserts per second, with writes outnumbering reads by roughly 5:1), and the table in question looks something like this:

-------------------------------------------------------------------------------
column name : id      | value  | server_id   | created_at
column type : integer | string | integer     | timestamp with time zone
other info  : pk      |        | fk, indexed | indexed
-------------------------------------------------------------------------------

Our initial thoughts on partitioning were to partition by date using the created_at column, with a separate partition for each month; however, the vast majority of our inserts would be for 'now', so we would be almost entirely writing to the partition for the current month. Other month partitions might get occasional updates, but this would be a relatively infrequent occurrence.

Alternatively, we wondered about partitioning by the server_id foreign key, using for example the modulo of the foreign key id. This would give us a finite number of partitions (rather than the potentially unbounded date option), and would likely cause writes to be much more evenly distributed between the partitions.

Does anyone have an idea which would likely be the better choice: the single hot table getting most of the inserts, which might mean any indexes stay fully in memory, or dividing the writes more evenly over all of our partitions?

Many thanks for any advice.

Sam
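[For readers prototyping this, the month-based scheme described above might look like the following inheritance-based sketch, in the style available in PostgreSQL 8.x. The table name `events` and the FK target `servers` are illustrative, since the post does not name the table; note that child tables do not inherit the parent's primary key, foreign keys, or indexes, so those must be re-created per partition.]

```sql
-- Parent table (matches the schema described in the post; names assumed)
CREATE TABLE events (
    id         serial PRIMARY KEY,
    value      text,
    server_id  integer REFERENCES servers (id),
    created_at timestamptz NOT NULL DEFAULT now()
);

-- One child partition per month; the CHECK constraint is what the
-- planner's constraint exclusion uses to skip irrelevant partitions.
CREATE TABLE events_2010_06 (
    CHECK (created_at >= '2010-06-01' AND created_at < '2010-07-01')
) INHERITS (events);

-- Indexes are not inherited, so each partition needs its own.
CREATE INDEX events_2010_06_created_at ON events_2010_06 (created_at);
CREATE INDEX events_2010_06_server_id  ON events_2010_06 (server_id);
```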
On Tue, Jun 29, 2010 at 4:00 PM, sam mulube <sam.mulube@gmail.com> wrote:
> Alternatively we wondered about partitioning by the server_id foreign
> key, using for example the modulo of the foreign key id. This would
> give us a finite number of partitions (rather than the potentially
> unbounded date option), and would likely cause writes to be much more
> evenly distributed between the partitions.

Do you ever delete? If so, what is the delete pattern?

I have several large tables split into 100 partitions, which keeps each of them in the O(10 million) row size range, and that has proven to be ideal for our use. We insert into the partition directly, and select from it whenever possible, avoiding the constraint exclusion step. The only time we get bad performance is when doing a join that cannot use constraint exclusion, and that turns out to be pretty expensive.
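[The modulo scheme with direct inserts that Vick describes could be sketched like this, again using the assumed `events` table name from the post and 100 partitions as in his setup; the application computes the partition name from `server_id % 100` and targets it directly, bypassing any routing trigger.]

```sql
-- 100 partitions keyed on (server_id % 100); partition 0 shown.
CREATE TABLE events_p00 (
    CHECK (server_id % 100 = 0)
) INHERITS (events);

-- Inserting directly into the computed partition skips both a routing
-- trigger on the parent and the constraint-exclusion planning step.
-- Here the application has computed 300 % 100 = 0, so partition 0.
INSERT INTO events_p00 (value, server_id) VALUES ('some value', 300);
```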
Hi Vick,

Currently we aren't deleting anything due to business requirements, though at some point we will have to start deleting some data. I suspect when we do it won't be as simple as just dropping the oldest data; some customers will have data that we want to keep permanently, while others will be fine to drop after 30 days.

Inserting directly into the specific partition is interesting, but if you're going to go down that route, aren't you starting to implement the partitioning yourself in application code? In that case, what benefit does keeping the Postgresql partitioning in place actually give you?

Thanks for the reply.

Sam

On 30 June 2010 02:39, Vick Khera <vivek@khera.org> wrote:
> On Tue, Jun 29, 2010 at 4:00 PM, sam mulube <sam.mulube@gmail.com> wrote:
>> Alternatively we wondered about partitioning by the server_id foreign
>> key, using for example the modulo of the foreign key id. This would
>> give us a finite number of partitions (rather than the potentially
>> unbounded date option), and would likely cause writes to be much more
>> evenly distributed between the partitions.
>
> Do you ever delete? If so, what is the delete pattern?
>
> I have several large tables split into 100 partitions which keeps them
> in the O(10million) row size range each, and that has proven to be
> ideal for our use. We insert into the partition directly, and select
> from it whenever possible avoiding the constraint exclusion step. The
> only time we get bad performance is when doing a join that cannot run
> constraint exclusion, and that turns out to be pretty expensive.
On Wed, Jun 30, 2010 at 10:20 AM, sam mulube <sam.mulube@gmail.com> wrote:
> Inserting directly into the specific partition is interesting, but if
> you're going to go down that route then aren't you starting to
> implement the partitioning yourself in application code. In that case
> what benefit does keeping the Postgresql partitioning in place
> actually give you?

Your benefit comes in very fast lookups when constraint exclusion (or an index query) is working for you. You also get simplified syntax for doing joins against the entire data set. If, however, you only ever search on the ID, then there is really not much benefit other than having smaller indexes.

Smaller indexes are great when you have to reindex. Instead of taking, say, 15 minutes to reindex the one big table, you take 10-15 seconds per partition, which means your applications can move along with only a few seconds' delay while you reindex, instead of being blocked.

For one of my tables I could have used the trigger method for routing inserts, because those come in ones and twos, but for the main table it is much more efficient to compute the partition up front, since I do a *lot* of inserts in a big batch.

The drawback to partitioning by an ID number using modulo is that for constraint exclusion to work you have to actually add something like "AND (my_id % 42) = 0" to the query to match the constraint. The exclusion is not done by executing the constraint, but by proving the constraint will hold true for the WHERE clause. My reasoning was: if I'm going to add that to my select queries, I might as well just pick the table by doing that arithmetic in my application up front. It is faster than having the DB do the constraint proof for each of the 100 partitions.
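[The point about having to repeat the modulo expression can be illustrated with a pair of queries against the assumed 100-partition `events` layout; the planner excludes partitions only by proving a CHECK constraint from the WHERE clause, not by evaluating it against the data.]

```sql
-- Without the modulo term the planner cannot prove anything about
-- (server_id % 100), so it must plan a scan of every child partition:
SELECT * FROM events WHERE server_id = 4242;

-- Adding the (logically redundant) modulo term lets the planner prove
-- that only the partition with CHECK (server_id % 100 = 42) can match,
-- excluding the other 99:
SELECT * FROM events WHERE server_id = 4242 AND server_id % 100 = 42;
```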
Vick Khera wrote:
> The drawback to partitioning by an ID number using modulo is that for
> constraint exclusion to work you have to actually add something like
> "AND (my_id % 42) = 0" to match the constraint. The exclusion is not
> done by executing the constraint, but by proving the constraint will
> hold true for the WHERE clause.

Vick's "Case study of partitioning a large table in Postgres 8" presentation at http://www.mailermailer.com/labs/presentations/index.rwp covers this topic in more detail. The important thing to realize is that your partitioning scheme must respect the limitations of the constraint exclusion code in the query planner to be most useful. You really need to consider not just the structure of the data, but what the queries against it look like, to make that call.

If you can hack the application to include the modulo bit in every query, that might be a sufficient breakdown for you. But you have to consider what that looks like from the query side, given what the optimizer knows how to do, not just from the perspective of the table structure. Building a quick prototype and using EXPLAIN ANALYZE on common queries is invaluable here for an early investigation of any potential redesign.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
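[A minimal version of the prototype-and-verify loop Greg suggests, assuming the `events` table and monthly partitions sketched earlier in the thread; the key thing to check in the plan output is which child tables actually get scanned.]

```sql
-- Make sure the planner is allowed to use CHECK constraints for pruning
-- (default setting and accepted values vary between 8.x releases).
SET constraint_exclusion = on;

-- Run a representative query and inspect the plan: if exclusion is
-- working, only events_2010_06 should appear as a scanned relation;
-- if every child partition shows up, the scheme needs rethinking.
EXPLAIN ANALYZE
SELECT count(*)
FROM events
WHERE created_at >= '2010-06-01' AND created_at < '2010-07-01';
```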
On Tue, Jun 29, 2010 at 4:00 PM, sam mulube <sam.mulube@gmail.com> wrote:
> Hi,
>
> we are considering database partitioning as a possible solution to
> some performance issues we are having with our database, and we are
> trying to decide on a partitioning scheme. We have a moderately write
> heavy application (approx 50 inserts per second, with writes
> outnumbering reads by roughly 5:1), and the table in question looks
> something like this:
>
> -------------------------------------------------------------------------------
> column name : id      | value  | server_id   | created_at
> column type : integer | string | integer     | timestamp with time zone
> other info  : pk      |        | fk, indexed | indexed
> -------------------------------------------------------------------------------
>
> Or initial thoughts on partitioning was to partition by date using the
> created_at column, with a separate partition for each month; however
> the vast majority of our inserts would be for 'now', so we would be
> almost entirely writing to the partition for the current month. Other
> month partitions might get occasional updates, but this would be a
> relatively infrequent occurrence.
>
> Alternatively we wondered about partitioning by the server_id foreign
> key, using for example the modulo of the foreign key id. This would
> give us a finite number of partitions (rather than the potentially
> unbounded date option), and would likely cause writes to be much more
> evenly distributed between the partitions.
>
> Does anyone have any likely idea which would be the better choice. The
> single hot table getting most of the inserts, which might mean any
> indexes are fully in memory, or dividing the writes more evenly over
> all of our partitions?

Are most of your selects for now back to now - 1 day or so as well? If so, then look at having one big partition for historical data and one small one for the last day.
Every X hours, run a cron job that moves everything in the current partition over to the old archive partition(s).
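[That cron job could be sketched as a single transaction, assuming the hypothetical partition names `events_current` and `events_archive` for the two-partition layout suggested above; doing the copy and delete in one transaction keeps readers of the parent table from seeing rows vanish or duplicate mid-move.]

```sql
BEGIN;

-- Copy rows older than the hot window into the archive partition...
INSERT INTO events_archive
SELECT * FROM events_current
WHERE created_at < now() - interval '1 day';

-- ...then remove them from the hot partition in the same transaction.
DELETE FROM events_current
WHERE created_at < now() - interval '1 day';

COMMIT;
```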