Discussion: Guidance Requested - Bulk Inserting + Queries
Experts,

Quick summary: data can now be inserted very quickly via COPY + removing indexes, but is there a design or some tricks to still allow someone to query while the partition is still active and 'hot'?

- Postgres 9.1
- Windows 7 (64-bit), although this is just for the current test and could vary depending on situation
- We have 4 main tables with daily partitions
- Each table/partition has multiple indexes on it
- Streaming logs from client machines into our server app, which processes the logs and tries to shove all that data into these daily partitions as fast as it can
- Using COPY, and removed the original primary key unique constraints to try to get it to be as fast as possible (some duplicates are possible)
- Will remove duplicates in a later step (disregard for this post)

We now found (thanks Andres and Snow-Man in #postgresql) that in our tests, after the indexes get too large, performance drops significantly and our system limps forward due to disk reads (presumably for the indexes). If we remove the indexes, performance for our entire sample test is great and everything is written to PostgreSQL very quickly. This allows us to shove lots and lots of data in (for production, possibly 100 GB or a TB per day!).

My question is: what possible routes can I take where we can have both fast inserts (with indexes removed until the end of the day), but still allow a user to query against today's data? Is this even possible? One idea would be to have hourly tables for today and re-add indexes as soon as we can. Another possible solution might be to stream the data to another "reader" Postgres instance that has indexes, although I'm not very versed in replication.

Any ideas would be greatly appreciated.

Thanks!

Ben

--
Benjamin Johnson
http://getcarbonblack.com/ | @getcarbonblack
cell: 312.933.3612 | @chicagoben
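[Editor's note: the "COPY + removing indexes" cycle described above can be sketched roughly as below. The table name, index names, and CSV path are hypothetical, and COPY FROM a file assumes the file is readable by the server process; this is an illustration, not the poster's actual code.]

```python
# Sketch of the daily bulk-load cycle described above: drop the
# partition's indexes, COPY the data in, and recreate the indexes
# once the partition goes "cold". All names/paths are made up.

def load_cycle_sql(partition, index_defs, copy_source):
    """Return the SQL statements for one drop/COPY/recreate cycle."""
    stmts = []
    # 1. Drop each index so COPY only has to write heap pages.
    for name, _cols in index_defs:
        stmts.append("DROP INDEX IF EXISTS %s;" % name)
    # 2. Bulk-load with COPY (much faster than per-row INSERTs).
    stmts.append("COPY %s FROM '%s' WITH (FORMAT csv);"
                 % (partition, copy_source))
    # 3. Rebuild the indexes in one sequential pass at end of day.
    for name, cols in index_defs:
        stmts.append("CREATE INDEX %s ON %s (%s);"
                     % (name, partition, cols))
    return stmts

indexes = [("events_20111130_ts_idx", "event_timestamp"),
           ("events_20111130_hash_idx", "event_hash")]
for s in load_cycle_sql("events_20111130", indexes, "/tmp/events.csv"):
    print(s)
```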
> We now found (thanks Andres and Snow-Man in #postgresql) that in our
> tests, after the indexes get too large performance drops significantly
> and our system limps forward due to disk reads (presumably for the
> indexes). If we remove the indexes, performance for our entire sample
> test is great and everything is written to postgresql very quickly.

It's usually the fact that the data you index is "random", as opposed to, say, an always-incrementing value (could be a timestamp, or a sequence), that leads to insert problems with btrees.

> My question is, what possible routes can I take where we can have both
> fast inserts (with indexes removed until the end of the day), but still
> allow a user to query against today's data? Is this even possible? One
> idea would be possibly have hourly tables for today and as soon as we
> can try to re-add indexes.

Yep, that's the only way I've found: use smaller partitions. That leads to slower reads (due to the fact that you have to visit more indexes to read the same amount of data), but you'll get faster writes.

> Another possible solution might be to stream
> the data to another "reader" postgres instance that has indexes,
> although I'm not very versed in replication.

I don't think you can do that.

Another option you have is to use SSDs instead of HDs for the indexes only (that is, having the indexes in a separate tablespace on SSDs). The problem is that your disks usually can't keep up with the number of random writes it takes to update N "random values" btrees; SSDs might help.

Can you post some numbers, such as # of indexes, # of rows you're trying to insert per hour, etc.?
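[Editor's note: for illustration, the "indexes in a separate tablespace of SSDs" idea above would look roughly like this. The tablespace name, mount point, table, and index names are all hypothetical.]

```python
# Hypothetical DDL for putting indexes (only) on an SSD-backed
# tablespace, while the heap stays on the default (HD) tablespace.
stmts = [
    # One-time setup: a tablespace backed by the SSD mount point.
    "CREATE TABLESPACE ssd_idx LOCATION '/mnt/ssd/pg_indexes';",
    # New indexes can be created directly in the SSD tablespace:
    "CREATE INDEX events_hash_idx ON events (event_hash) "
    "TABLESPACE ssd_idx;",
    # An existing index can also be moved after the fact:
    "ALTER INDEX events_ts_idx SET TABLESPACE ssd_idx;",
]
for s in stmts:
    print(s)
```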
We're trying to split the current day into hourly tables so that the size of the indexes that are popular is much lower, and therefore we can support more rows across the day. We're also using numerics where we could be using bigints, so we're going to work on that as well and see how much smaller we can get it. Once a daily table is no longer "today", we will remove duplicates, so we can combine that step with rolling up the hourly tables into one daily table.
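[Editor's note: on Postgres 9.1 (pre-declarative-partitioning), the hourly scheme above would be built with inheritance plus a CHECK constraint per child, so that constraint exclusion can prune hours a query can't match. A minimal sketch, with hypothetical table/column names:]

```python
# Sketch of 9.1-style inheritance partitioning: one child table per
# hour of the day, each with a CHECK constraint on the timestamp so
# constraint exclusion can skip irrelevant hours. Names are made up.
from datetime import datetime, timedelta

def hourly_partition_sql(parent, day, hour):
    """DDL for one hourly child of a daily table (PG 9.1 inheritance)."""
    start = datetime.strptime(day, "%Y-%m-%d") + timedelta(hours=hour)
    end = start + timedelta(hours=1)
    child = "%s_h%02d" % (parent, hour)
    return (
        "CREATE TABLE %s (\n"
        "    CHECK (event_timestamp >= '%s' AND event_timestamp < '%s')\n"
        ") INHERITS (%s);" % (child, start, end, parent)
    )

# One statement per hour; 24 of these per daily table.
print(hourly_partition_sql("events_20111130", "2011-11-30", 9))
```

Constraint exclusion (`constraint_exclusion = partition`) only helps queries whose WHERE clause constrains `event_timestamp`, so queries on other columns still visit every hourly index.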
In a *small* test (1-2 orders of magnitude smaller than some potential customer environments), the cumulative size of the daily indexes is 3.6 GB and that's for only about half of the test.
We're talking 4 different daily partitioned tables with each table having 1 - 6 indexes (yes, a lot!).
I'll post another update when I have it.
Thanks Leonardo.
On 11/30/2011 10:17 AM, Leonardo Francalanci wrote:
>> We now found (thanks Andres and Snow-Man in #postgresql) that in our
>> tests, after the indexes get too large performance drops significantly
>> and our system limps forward due to disk reads (presumably for the
>> indexes). If we remove the indexes, performance for our entire sample
>> test is great and everything is written to postgresql very quickly.
> It's usually the fact that the data you index is "random" as opposed to,
> say, an always incremented value (could be a timestamp, or a sequence)
> that leads to insert problems with btrees.
>> My question is, what possible routes can I take where we can have both
>> fast inserts (with indexes removed until the end of the day), but still
>> allow a user to query against today's data? Is this even possible? One
>> idea would be possibly have hourly tables for today and as soon as we
>> can try to re-add indexes.
> Yep, that's the only way I've found: use smaller partitions. That leads
> to slower reads (due to the fact that you have to visit more indexes to
> read the same amount of data). But you'll get faster writes.
>
>> Another possible solution might be to stream
>> the data to another "reader" postgres instance that has indexes,
>> although I'm not very versed in replication.
> I don't think you can do that.
> Another option that you have is to use ssd instead of HD for the indexes
> only (that is, having the indexes in a separate tablespace of ssds). The
> problem is that your disks usually can't keep up with the number of
> random writes it takes to update N "random values" btrees; ssd might help.
> Can you post some numbers, such as # of indexes, # of rows you're trying
> to insert per hour etc etc?
>
>
--
Benjamin Johnson
http://getcarbonblack.com/ | @getcarbonblack
cell: 312.933.3612
On Wed, Nov 30, 2011 at 7:27 AM, Benjamin Johnson <benjamin.johnson@getcarbonblack.com> wrote:
> Experts,
>
> Quick Summary: data can now be inserted very quickly via COPY + removing
> indexes, but is there a design or some tricks to still allow someone to
> query while the partition is still active and 'hot' ?
>
> - Postgres 9.1
> - Windows 7 (64-bit) , although this is just for the current test and
> could vary depending on situation
> - We have 4 main tables with daily partitions

How long are the daily partitions kept for?

> - Each table/partition has multiple indexes on it
> - Streaming logs from client machines into our server app which
> processes the logs and tries to shove all that data into these daily
> partitions as fast as it can.

Why shove it in as fast as you can? If you want to both read and write at the same time, then focusing first only on writing and worrying about reading as an afterthought seems like the wrong thing to do.

> - Using COPY and removed original primary key unique constraints to try
> to get it to be as fast as possible (some duplicates are possible)
> - Will remove duplicates in a later step (disregard for this post)
>
> We now found (thanks Andres and Snow-Man in #postgresql) that in our
> tests, after the indexes get too large performance drops significantly
> and our system limps forward due to disk reads (presumably for the
> indexes).

How many hours' worth of data can be loaded into the new partition before the performance knee hits?

After the knee, how does the random disk read activity you see compare to the maximum random disk reads your IO system can support? How many COPYs were you doing at the same time?

During this test, was there background select activity going on, or was the system only used for COPY?

> If we remove the indexes, performance for our entire sample
> test is great and everything is written to postgresql very quickly.
> This allows us to shove lots and lots of data in (for production
> possibly 100 GB or a TB per day!)

How much do you need to shove in per day? If you need to insert it, and index it, and run queries, and deal with maintenance of the older partitions, then you will need a lot of spare capacity, relative to just inserting, to do all of those things. Do you have windows where there is less insert activity in which other things can get done?

Cheers,

Jeff
Jeff,

Sorry for the delayed response. Please see (some) answers inline.

On 12/1/2011 9:06 AM, Jeff Janes wrote:
> On Wed, Nov 30, 2011 at 7:27 AM, Benjamin Johnson
> <benjamin.johnson@getcarbonblack.com> wrote:
>> Experts,
>>
>> Quick Summary: data can now be inserted very quickly via COPY + removing
>> indexes, but is there a design or some tricks to still allow someone to
>> query while the partition is still active and 'hot' ?
>>
>> - Postgres 9.1
>> - Windows 7 (64-bit) , although this is just for the current test and
>> could vary depending on situation
>> - We have 4 main tables with daily partitions
> How long are the daily partitions kept for?

We want this to be user-configurable, but ideally 30 - 90 days, possibly longer for (order of magnitude) smaller customers.

>> - Each table/partition has multiple indexes on it
>> - Streaming logs from client machines into our server app which
>> processes the logs and tries to shove all that data into these daily
>> partitions as fast as it can.
> Why shove it in as fast as you can? If you want to both read and
> write at the same time, then focusing first only on writing and
> worrying about reading as an afterthought seems like the wrong thing
> to do.

You're probably correct in that we need to think about the entire system as a whole. We're concerned with getting the data from our host-based clients to our server, where it is processed and stored. Because our system is essentially a giant logging service for your enterprise, most of the time we're collecting data and writing it. The main times it will be read are when some security incident occurs, but between those we expect it to be very write-heavy. We're probably most concerned with write performance because we were originally seeing poor times and were worried about how well it would scale. We've improved it a lot, so we might just need to take a step back and see what else we can do for the overall system.
>> - Using COPY and removed original primary key unique constraints to try
>> to get it to be as fast as possible (some duplicates are possible)
>> - Will remove duplicates in a later step (disregard for this post)
>>
>> We now found (thanks Andres and Snow-Man in #postgresql) that in our
>> tests, after the indexes get too large performance drops significantly
>> and our system limps forward due to disk reads (presumably for the
>> indexes).
> How many hours worth of data can be loaded into the new partition
> before the performance knee hits?

In simulations, if I try to simulate the amount of data a large customer would send, it is just about an hour's worth of data before the indexes get to be several gigabytes in size and performance really goes downhill -- the "knee", if you will.

> After the knee, how does the random disk read activity you see compare
> to the maximum random disk reads your IO system can support? How many
> COPYs were you doing at the same time?

I don't have exact statistics, but we had 4 writer threads all doing COPY into 4 tables as fast as they receive data. The system is very much NOT ideal -- a Windows 7 developer-class workstation with (one) 7200 RPM hard drive. I want to find bottlenecks in this system and then see what real servers can handle. (We're a small company and only now are starting to be able to invest in dev/test servers.)

> During this test, was there background select activity going on, or
> was the system only used for COPY?

I pretty much stripped it entirely down to just doing the writes. Data was coming in over HTTP to a Python web stack, but that was pretty much just passing these logfiles to the (C++) writer threads.

>> If we remove the indexes, performance for our entire sample
>> test is great and everything is written to postgresql very quickly.
>> This allows us to shove lots and lots of data in (for production
>> possibly 100 GB or a TB per day!)
> How much do you need to shove in per day?
> If you need to insert it,
> and index it, and run queries, and deal with maintenance of the older
> partitions, then you will need a lot of spare capacity, relative to
> just inserting, to do all of those things. Do you have windows where
> there is less insert activity in which other things can get done?

That's something we keep asking ourselves. Right now it's about 10 MB per client per day. Some customers want 50,000 clients, which would be 500 GB per day if my math is correct. We know we will never handle this with a single server, but we want to get up as high as we can (say 5,000 - 10,000 clients) before saying that our customers have to add more hardware.

> Cheers,
>
> Jeff

We managed to sort of get around the issue by having hourly tables inherit from our daily tables. This makes our indexes smaller, and the writes in our tests don't seem to hit this same limit (at least so far). I have a couple of follow-up questions:

1) Would it be acceptable to have, let's say, 60 daily partitions where each of those has 24 hourly partitions? Would it be better to "roll up" the hourly tables into their respective daily table after a day or two (so that the data is now old and mostly unchanged) and then remove the hourly tables?

2) Some of our indexes are on an identifier that is a hash of some event attributes, so it's basically a random BIGINT. We believe part of the problem is that each row could be in an entirely different location in the index, thus causing lots of seeking and thrashing. Would something like making it a multi-column index of (event_timestamp, originally_indexed_column) be better, so that rows are closer in proximity to other events coming in around the same time? I have to admit that I don't really know how indexes are stored/paged.

3) Does anyone else have similar systems where they have a ton of data coming in that they also want to query? Any tips you can provide, or alternative designs?
Once the data is in, it will 99.9% of the time (100% of the time for some tables) be static. Part of the issue is that the user wants to be able to search based on all sorts of attributes -- this leads to lots of indexes and more disk/memory usage when writing.

Ben

--
Benjamin Johnson
http://getcarbonblack.com/ | @getcarbonblack
cell: 312.933.3612
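[Editor's note: question 2 above -- why a random-BIGINT index thrashes while a timestamp-led index doesn't -- can be illustrated with a DB-free toy model. We bucket each new key's position in an existing sorted key space into imaginary fixed-size "leaf pages" and count how many distinct pages a batch of inserts touches; all sizes here are made up.]

```python
# Toy model of btree insert locality: a batch of random hash keys
# scatters writes across the whole index, while monotonically
# increasing (timestamp-led) keys all land at the right-hand edge.
import bisect
import random

random.seed(42)
PAGE = 100  # keys per imaginary leaf page

# An existing index of 100,000 random bigint-ish keys.
existing = sorted(random.randrange(10**12) for _ in range(100_000))

def pages_touched(index_keys, new_keys):
    """Distinct imaginary leaf pages hit by inserting new_keys."""
    return len({bisect.bisect_left(index_keys, k) // PAGE
                for k in new_keys})

# 1,000 random hash keys vs. 1,000 always-increasing keys.
random_batch = [random.randrange(10**12) for _ in range(1000)]
increasing_batch = [10**12 + i for i in range(1000)]

print(pages_touched(existing, random_batch))      # hundreds of pages
print(pages_touched(existing, increasing_batch))  # a single page
```

With a spinning disk, each of those "pages" is potentially a random read/write, which is the thrashing described in the thread.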
On Wed, Dec 21, 2011 at 6:30 PM, Benjamin Johnson <benjamin.johnson@getcarbonblack.com> wrote:
> Jeff,
>
> Sorry for the delayed response. Please see (some) answers inline.
>
> On 12/1/2011 9:06 AM, Jeff Janes wrote:
>> On Wed, Nov 30, 2011 at 7:27 AM, Benjamin Johnson
>> Why shove it in as fast as you can? If you want to both read and
>> write at the same time, then focusing first only on writing and
>> worrying about reading as an afterthought seems like the wrong thing
>> to do.
> You're probably correct in that we need to think about the entire system
> as a whole. We're concerned with getting the data
> from our host-based to our server where it is processed and stored.
> Because our system is essentially a giant logging service for
> your enterprise, most of the time we're collecting data and writing it.
> The main times it will be read is when some security incident
> occurs, but between those we expect it to be very write heavy.

OK, that does have an interesting flavor, in which the typical row will be read about zero times, but you can't predict in advance which ones are more likely to ever be queried.

Do you know how necessary all your indexes are for supporting the queries? If the queries are relatively rare, maybe you could support them simply with seq scans on unindexed tables/partitions, at least on the leading-edge partitions.

>> How many hours worth of data can be loaded into the new partition
>> before the performance knee hits?
> In simulations, if I try to simulate the amount of data a large customer
> would send, then it is just about an hour worth of data before the indexes
> get to be several gigabytes in size and performance really goes downhill
> -- the "knee" if you will.

So having hourly partitions with live indexes might be cutting it pretty close. Once something pushes you over the edge into degraded performance, you would never be able to recover.
>> After the knee, how does the random disk read activity you see compare
>> to the maximum random disk reads your IO system can support? How many
>> COPYs were you doing at the same time?
> I don't have exact statistics, but we had 4 writer threads all doing
> copy into 4 tables as fast as they receive data.

Are they receiving data at the rate they would naturally? I.e. does it take an hour to simulate an hour's worth of data?

If they go into different tables, then they are going into different indices and so are all competing with each other for cache space for the index leaf blocks (rather than sharing that cache space, as they might if they were going into the same table). So you run out of cache space, and your performance collapses at one fourth the total size it would if you made them take turns. Of course, if you make them take turns, you have to either throttle or buffer their data retrieval. There is also the question of how often you would have to rotate turns, and how long it would take to exchange the buffers upon a turn rotation. (There are stupid OS tricks you can pull outside of PG to help that process along, but trying to coordinate that would be painful.)

> The system is very much NOT ideal -- Windows 7 Developer-Class
> Workstation with (one) 7200 RPM Harddrive. I want to find bottlenecks
> in this system and then see what real servers can handle. (We're a small
> company and only now are starting to be able to invest in dev/test servers.)

I think you said that for loading into large-grained partitions with live indexes, the bottleneck was the random reads needed to pull in the leaf blocks. In that case, if you change to RAID with striping, you should be able to scale with the effective number of spindles, provided you have enough parallel copies going on to keep each spindle busy with its own random read.
Of course those parallel copies would make the RAM issues worse, but by saying large-grained partitions I mean that you've already given up on the notion of having the indices fit in RAM, so at that point you might as well get the spindle scaling.

...

> We managed to sort of get around the issue by having hourly tables
> inherit from our daily tables. This makes our indexes smaller and the
> writes in our tests don't
> seem to hit this same limit (at least so far.) I have a couple
> follow-up questions:
>
> 1) Would it be acceptable to have let's say 60 daily partitions and then
> each of those has 24 hourly partitions?

It sounds like each client gets their own hardware, but if each client can have several thousand customers, how is that handled? All dumped into one giant table partitioned on time, or does each customer get their own table? 60*24*thousands would certainly add up!

If it is just 60*24, it will certainly slow down your queries some (the ones not using constraint exclusion, anyway), as each one has to do a lookup in 1440 btrees, but if queries are fast enough then they are fast enough. It should be pretty easy to test, if you know the types of queries you will be seeing.

> Would it be better to after a
> day or two (so that data is now old and mostly unchanged) "rollup" the
> hourly tables into their respective daily table and then remove the
> hourly tables?

That would generate an awful lot of extraneous IO (mostly sequential rather than random, so more efficient, but still IO), which is going to compete with the rest of the IO going on, in order to solve a problem that you don't yet know that you have.

> 2) Some of our indexes are on an identifier that is a hash of some event
> attributes, so it's basically a random BIGINT. We believe part of the
> problem is that each row could be in an entirely different location in
> the index thus causing lots of seeking and thrashing.
> Would doing
> something like having our index become a multi-column index by doing
> (event_timestamp, originally_index_column) be better so that they closer
> in proximity to other events coming in around the same time? I have to
> admit that I don't really know how indexes are stored / paged.

What if you just drop this index but keep the others while loading? If dropping just that index has a big effect, then changing it as you describe would almost certainly help on the loading. But would the new index still efficiently support the same queries that the old one did? I.e., could all queries based on the hash code be reformulated to query on both the exact timestamp and the hash code? Otherwise you would be throwing the baby out with the bath water.

> 3) Does anyone else have similar systems where they have a ton of data
> coming in that they also want to query? Any tips you can provide or
> alternative designs? Once the data is in, it will 99.9% of the time
> (100% of the time for some tables) be static. Part of the issue is that
> the user wants to be able to search based on all sorts of attributes --
> this leads to lots of indexes and more disk/memory usage when writing.

Have you experimentally verified that all of the indexes really are needed to get acceptable query performance? I tend to err on the side of adding more indices just in case they might be useful, but you already know you have a problem caused by index maintenance, so defaulting to not having them until you have proof that they are needed might be better in that case.

Cheers,

Jeff
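[Editor's note: the query-reformulation caveat above can be made concrete. Below, the index change and the required query rewrite are sketched with hypothetical table/column names; a query on the hash alone cannot use the leading timestamp column, so it only works if the caller also knows (at least a range for) the timestamp.]

```python
# Hypothetical sketch of the multi-column index and the query
# rewrite it would require. All names are made up.
new_index = (
    "CREATE INDEX events_ts_hash_idx "
    "ON events (event_timestamp, event_hash);"
)

# A query on the hash alone can't use the leading column of the
# new index and would fall back to a scan...
old_query = "SELECT * FROM events WHERE event_hash = %(h)s;"

# ...so it must also constrain event_timestamp, which is only
# possible when a time range for the event is known.
new_query = (
    "SELECT * FROM events "
    "WHERE event_timestamp BETWEEN %(t0)s AND %(t1)s "
    "AND event_hash = %(h)s;"
)
print(new_index)
print(new_query)
```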