Discussion: Hardware for writing/updating 12,000,000 rows per hour


Hardware for writing/updating 12,000,000 rows per hour

From
Arya F
Date:
Would it be possible to achieve 12,000,000 writes/updates on a single server? If so what kind of hardware should I be looking for?

Re: Hardware for writing/updating 12,000,000 rows per hour

From
Ron
Date:
On 7/26/19 2:56 PM, Arya F wrote:
> Would it be possible to achieve 12,000,000 writes/updates on a single 
> server? If so what kind of hardware should I be looking for?

That's only 3,333 modifications/second.  How big are your records?

-- 
Angular momentum makes the world go 'round.



Re: Hardware for writing/updating 12,000,000 rows per hour

From
Arya F
Date:
At most about 2000 characters.

On Fri, Jul 26, 2019 at 3:03 PM Ron <ronljohnsonjr@gmail.com> wrote:
On 7/26/19 2:56 PM, Arya F wrote:
> Would it be possible to achieve 12,000,000 writes/updates on a single
> server? If so what kind of hardware should I be looking for?

That's only 3,333 modifications/second.  How big are your records?

--
Angular momentum makes the world go 'round.


Re: Hardware for writing/updating 12,000,000 rows per hour

From
Tom Lane
Date:
[ please don't top-post ]

Arya F <arya6000@gmail.com> writes:
> On Fri, Jul 26, 2019 at 3:03 PM Ron <ronljohnsonjr@gmail.com> wrote:
>> On 7/26/19 2:56 PM, Arya F wrote:
>>> Would it be possible to achieve 12,000,000 writes/updates on a single
>>> server? If so what kind of hardware should I be looking for?

>> That's only 3,333 modifications/second.  How big are your records?

> At most about 2000 characters.

Do you need 3K independent commits per second?  Or can you batch them?
Even just turning off synchronous_commit would move the goalposts
pretty far in terms of the storage hardware you'll need for this.

            regards, tom lane
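
As a rough sketch of those two ideas, batching plus a relaxed synchronous_commit, here is what it might look like from a psycopg2 application; the connection string, the "events" table and its columns are made up for illustration:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")        # hypothetical connection string
    cur = conn.cursor()

    # Session-level setting: a crash may lose the last few transactions,
    # but data reported as committed is never corrupted.
    cur.execute("SET synchronous_commit = off")

    # Insert one batch of rows instead of committing each row on its own.
    batch = [(i, "payload %d" % i) for i in range(5000)]
    cur.executemany("INSERT INTO events (id, payload) VALUES (%s, %s)", batch)

    conn.commit()    # one commit for the whole batch instead of 5000 commits
    cur.close()
    conn.close()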



Re: Hardware for writing/updating 12,000,000 rows per hour

From
Arya F
Date:
On Fri, Jul 26, 2019 at 3:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
[ please don't top-post ]

Arya F <arya6000@gmail.com> writes:
> On Fri, Jul 26, 2019 at 3:03 PM Ron <ronljohnsonjr@gmail.com> wrote:
>> On 7/26/19 2:56 PM, Arya F wrote:
>>> Would it be possible to achieve 12,000,000 writes/updates on a single
>>> server? If so what kind of hardware should I be looking for?

>> That's only 3,333 modifications/second.  How big are your records?

> At most about 2000 characters.

Do you need 3K independent commits per second?  Or can you batch them?
Even just turning off synchronous_commit would move the goalposts
pretty far in terms of the storage hardware you'll need for this.

                        regards, tom lane

I think I can modify my application to do a batch update. Right now the server has an HDD and it really can't handle a lot of updates and inserts per second. Would changing to a regular SSD be able to easily do 3000 updates per second? 
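
One way such a batch update might look with psycopg2 is to send many (id, value) pairs in a single UPDATE ... FROM (VALUES ...) statement instead of thousands of single-row UPDATEs; the "items" table and its columns here are invented:

    import psycopg2
    from psycopg2.extras import execute_values

    conn = psycopg2.connect("dbname=mydb")        # hypothetical connection string
    cur = conn.cursor()

    # Many changed rows, applied in pages of up to 1000 per statement.
    changes = [(1, "new text 1"), (2, "new text 2"), (3, "new text 3")]

    execute_values(
        cur,
        """
        UPDATE items
           SET payload = v.payload
          FROM (VALUES %s) AS v(id, payload)
         WHERE items.id = v.id
        """,
        changes,
        page_size=1000,
    )
    conn.commit()
    cur.close()
    conn.close()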

Re: Hardware for writing/updating 12,000,000 rows per hour

From
Alvaro Herrera
Date:
On 2019-Jul-26, Arya F wrote:

> I think I can modify my application to do a batch update. Right now the
> server has an HDD and it really can't handle a lot of updates and inserts
> per second. Would changing to a regular SSD be able to easily do 3000
> updates per second?

That's a pretty hard question in isolation -- you need to consider how
many indexes are there to update, whether the updated columns are
indexed or not, what the datatypes are, how much locality of access
you'll have ... I'm probably missing some other important factors.  (Of
course, you'll have to tune various PG server settings to find your
sweet spot.)

I suggest that you should be measuring instead of trying to guess.  A
reasonably cheap way is to rent a machine somewhere with the type of
hardware you think you'll need, and run your workload there for long
enough, making sure to carefully observe important metrics such as table
size, accumulated bloat, checkpoint regime, overall I/O activity, and so
on.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
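
In that spirit, a small monitoring sketch: while the rented machine runs the real workload, sample a few of the server's statistics views once a minute so that table growth, dead tuples (bloat), checkpoint behaviour and commit rate can be reviewed afterwards. The database name is made up; the views and columns are the standard pg_stat ones:

    import time
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")        # hypothetical connection string
    conn.autocommit = True
    cur = conn.cursor()

    for _ in range(60):                           # one hour, one sample per minute
        cur.execute("SELECT pg_database_size(current_database())")
        db_bytes = cur.fetchone()[0]

        cur.execute("SELECT coalesce(sum(n_dead_tup), 0) FROM pg_stat_user_tables")
        dead_tuples = cur.fetchone()[0]

        cur.execute("SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter")
        ckpt_timed, ckpt_req = cur.fetchone()

        cur.execute("SELECT sum(xact_commit) FROM pg_stat_database")
        commits = cur.fetchone()[0]

        print("size=%s dead=%s ckpt_timed=%s ckpt_req=%s commits=%s"
              % (db_bytes, dead_tuples, ckpt_timed, ckpt_req, commits))
        time.sleep(60)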



RE: Hardware for writing/updating 12,000,000 rows per hour

From
farjad.farid
Date:
With this kind of design requirement, it is worth considering hardware "failure & recovery". Even SSDs can and do fail.


It is not just a matter of speed. RAID disks of some kind, depending on the budget, are worth the effort.



-----Original Message-----
From: Alvaro Herrera <alvherre@2ndquadrant.com>
Sent: 2019 July 26 22:39
To: Arya F <arya6000@gmail.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>; Ron <ronljohnsonjr@gmail.com>; pgsql-general@lists.postgresql.org
Subject: Re: Hardware for writing/updating 12,000,000 rows per hour

On 2019-Jul-26, Arya F wrote:

> I think I can modify my application to do a batch update. Right now
> the server has an HDD and it really can't handle a lot of updates and
> inserts per second. Would changing to a regular SSD be able to easily
> do 3000 updates per second?

That's a pretty hard question in isolation -- you need to consider how many indexes are there to update, whether the updated columns are indexed or not, what the datatypes are, how much locality of access you'll have ... I'm probably missing some other important factors.  (Of course, you'll have to tune various PG server settings to find your sweet spot.)

I suggest that you should be measuring instead of trying to guess.  A reasonably cheap way is to rent a machine somewhere with the type of hardware you think you'll need, and run your workload there for long enough, making sure to carefully observe important metrics such as table size, accumulated bloat, checkpoint regime, overall I/O activity, and so on.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Hardware for writing/updating 12,000,000 rows per hour

From
Arya F
Date:


On Sat, Jul 27, 2019 at 11:49 AM farjad.farid <farjad.farid@checknetworks.com> wrote:
With this kind of design requirement, it is worth considering hardware "failure & recovery". Even SSDs can and do fail.

It is not just a matter of speed. RAID disks of some kind, depending on the budget, are worth the effort.



-----Original Message-----
From: Alvaro Herrera <alvherre@2ndquadrant.com>
Sent: 2019 July 26 22:39
To: Arya F <arya6000@gmail.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>; Ron <ronljohnsonjr@gmail.com>; pgsql-general@lists.postgresql.org
Subject: Re: Hardware for writing/updating 12,000,000 rows per hour

On 2019-Jul-26, Arya F wrote:

> I think I can modify my application to do a batch update. Right now
> the server has an HDD and it really can't handle a lot of updates and
> inserts per second. Would changing to a regular SSD be able to easily
> do 3000 updates per second?

That's a pretty hard question in isolation -- you need to consider how many indexes are there to update, whether the updated columns are indexed or not, what the datatypes are, how much locality of access you'll have ... I'm probably missing some other important factors.  (Of course, you'll have to tune various PG server settings to find your sweet spot.)

I suggest that you should be measuring instead of trying to guess.  A reasonably cheap way is to rent a machine somewhere with the type of hardware you think you'll need, and run your workload there for long enough, making sure to carefully observe important metrics such as table size, accumulated bloat, checkpoint regime, overall I/O activity, and so on.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Hi Farjad

I was thinking of having physical or logical replication. Or is having RAID a must if I don't want to lose data? 

RE: Hardware for writing/updating 12,000,000 rows per hour

From
farjad.farid
Date:

Hi Arya,

 

It is not clear what the budget is or why there is so much data. Is this a real-time system, e.g. 24/7 operation? Even if each row takes up just 50 bytes, that is a lot of data moving in and out of your CPUs/memory/hard disk, any one of which could fail.

 

Personally I would recommend analyzing the software for any pattern that might help you reduce the use of the hard disk. Push the data as much as possible to memory, and only overflow to the hard disk (if at all possible). I know it might be difficult, but it could save you a lot of downtime/maintenance. Also double-check the reliability of the hard disk vs. the SSD. Not all SSDs (or motherboards, or memory) are server grade. Use a hardware RAID card, not a software-based one, as software RAID degrades overall performance.

 

As a minimum I would recommend a RAID configuration, bearing in mind the budget. Also check the card manufacturer's reliability figures. Reliability of all components matters. Short-term cost savings could cost a lot more in this kind of situation. At least make the request; if it is rejected and things go wrong, you can point out that you had made the request.

 

What about backing up the data?

 

Good luck.

 

 

 

 

From: Arya F <arya6000@gmail.com>
Sent: 2019 July 27 17:56
To: farjad.farid <farjad.farid@checknetworks.com>
Cc: Alvaro Herrera <alvherre@2ndquadrant.com>; Tom Lane <tgl@sss.pgh.pa.us>; Ron <ronljohnsonjr@gmail.com>; pgsql-general@lists.postgresql.org
Subject: Re: Hardware for writing/updating 12,000,000 rows per hour

 

 

 

On Sat, Jul 27, 2019 at 11:49 AM farjad.farid <farjad.farid@checknetworks.com> wrote:

With this kind of design requirement, it is worth considering hardware "failure & recovery". Even SSDs can and do fail.

It is not just a matter of speed. RAID disks of some kind, depending on the budget, are worth the effort.



-----Original Message-----
From: Alvaro Herrera <alvherre@2ndquadrant.com>
Sent: 2019 July 26 22:39
To: Arya F <arya6000@gmail.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>; Ron <ronljohnsonjr@gmail.com>; pgsql-general@lists.postgresql.org
Subject: Re: Hardware for writing/updating 12,000,000 rows per hour

On 2019-Jul-26, Arya F wrote:

> I think I can modify my application to do a batch update. Right now
> the server has an HDD and it really can't handle a lot of updates and
> inserts per second. Would changing to a regular SSD be able to easily
> do 3000 updates per second?

That's a pretty hard question in isolation -- you need to consider how many indexes are there to update, whether the updated columns are indexed or not, what the datatypes are, how much locality of access you'll have ... I'm probably missing some other important factors.  (Of course, you'll have to tune various PG server settings to find your sweet spot.)

I suggest that you should be measuring instead of trying to guess.  A reasonably cheap way is to rent a machine somewhere with the type of hardware you think you'll need, and run your workload there for long enough, making sure to carefully observe important metrics such as table size, accumulated bloat, checkpoint regime, overall I/O activity, and so on.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

 

 

Hi Farjad

I was thinking of having physical or logical replication. Or is having RAID a must if I don't want to lose data? 

Re: Hardware for writing/updating 12,000,000 rows per hour

From
Neil
Date:

On Jul 27, 2019, at 11:55 AM, Arya F <arya6000@gmail.com> wrote:

On Sat, Jul 27, 2019 at 11:49 AM farjad.farid <farjad.farid@checknetworks.com> wrote:
With this kind of design requirement, it is worth considering hardware "failure & recovery". Even SSDs can and do fail.

It is not just a matter of speed. RAID disks of some kind, depending on the budget, are worth the effort.

-----Original Message-----
From: Alvaro Herrera <alvherre@2ndquadrant.com>
Sent: 2019 July 26 22:39
To: Arya F <arya6000@gmail.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>; Ron <ronljohnsonjr@gmail.com>; pgsql-general@lists.postgresql.org
Subject: Re: Hardware for writing/updating 12,000,000 rows per hour

On 2019-Jul-26, Arya F wrote:

> I think I can modify my application to do a batch update. Right now
> the server has an HDD and it really can't handle a lot of updates and
> inserts per second. Would changing to a regular SSD be able to easily
> do 3000 updates per second?

That's a pretty hard question in isolation -- you need to consider how many indexes are there to update, whether the updated columns are indexed or not, what the datatypes are, how much locality of access you'll have ... I'm probably missing some other important factors.  (Of course, you'll have to tune various PG server settings to find your sweet spot.)

I suggest that you should be measuring instead of trying to guess.  A reasonably cheap way is to rent a machine somewhere with the type of hardware you think you'll need, and run your workload there for long enough, making sure to carefully observe important metrics such as table size, accumulated bloat, checkpoint regime, overall I/O activity, and so on.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Hi Farjad

I was thinking of having physical or logical replication. Or is having RAID a must if I don’t want to lose data? 

Arya,

Answering the question of what hardware would work better requires a lot of thought. I would suggest that you hire someone who can do the calculations to get you in the ballpark and provide other pros and cons.  Then test with rented or cloud equipment like Álvaro suggested.  The following are some basic back-of-the-envelope calculations to show you the thought process; the numbers presented here are probably way off and need to be based on your own hardware and research.

The first thing you need to understand is how much data you are going to be moving around.  You mentioned '12,000,000 writes/updates per hour' and the size of the tuples getting updated at about 2k.  That is about 3,333 per second (as someone else already noted). That translates to about 3333 * 2000 = 6.6 MB/sec of basic data movement if they are all inserts, not including index updates, replication, vacuum, etc.  If they are all updates then you can double that.  So let's say they are 1/2 updates; that means basic data movement is 2x for the updates and 1x for the inserts, which changes it to 9.9 MB/sec.  Let's say the index keys total 200 bytes; then we have 3,330 * 200 * 1.5 (half updates, half inserts) = 0.99 MB/sec.  If you have an existing system, all of these things can be measured.

This brings the total to a minimum of 10.9 MB/sec of sustained operation, not including WAL, OS, vacuum processing, etc., and provided the data is processed evenly over time, which it never is.  This will start to tax a standard HDD, since it has to handle the OS, the WAL log (about the same I/O as the database), and the database itself, which probably puts the sustained usage above 22 MB/sec against a single drive handling a little over 40 MB/sec (and most drives do not perform at their rating). Because the data is not processed evenly over time, you also need to account for peaks and multiply this rate by a factor (which is impossible to predict without knowing your data access pattern).  So, based on the fact that your HDD is not handling it, we already suspect that there is more going on than you have described and that a single hard drive is nowhere near adequate.
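
Written out as a tiny script, so the assumptions are easy to replace with your own numbers, the same back-of-the-envelope arithmetic looks roughly like this (the figures above round slightly differently):

    # All inputs below are assumptions taken from this thread.
    rows_per_hour   = 12_000_000
    rows_per_second = rows_per_hour / 3600.0                              # ~3,333

    row_bytes       = 2000             # "at most about 2000 characters"
    update_fraction = 0.5              # assume half updates, half inserts
    write_factor    = 2 * update_fraction + 1 * (1 - update_fraction)     # 1.5

    heap_mb_s  = rows_per_second * row_bytes * write_factor / 1e6         # ~10 MB/sec
    index_keys = 200                   # assumed index-entry bytes per row
    index_mb_s = rows_per_second * index_keys * write_factor / 1e6        # ~1 MB/sec

    total_mb_s     = heap_mb_s + index_mb_s        # ~11 MB/sec, before WAL/OS/vacuum
    data_gb_per_hr = total_mb_s * 3600 / 1000      # ~40 GB of changed data per hour

    print(round(rows_per_second), round(total_mb_s, 1), round(data_gb_per_hr, 1))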

Now let's think about architecture.

Best practice would say to have the OS on one drive, the WAL log on another drive, and the database on a third drive.  The reason is that you can probably get 40+ MB/sec on each HDD. Of course, server-grade hardware with 15,000 RPM HDDs would perform better.  If you only have one drive, then the OS, the WAL log, and the database are competing for the one resource.  With 3 drives, depending on the I/O architecture (not a laptop), you can probably get 3 times the I/O throughput, or 40+ MB/sec on each HDD in our contrived example.
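
On the database side of that layout, individual tables and indexes can be moved to another drive with tablespaces, while the WAL directory is normally given its own drive when the cluster is created (initdb --waldir) or by relocating pg_wal. A rough sketch with made-up paths and object names; CREATE TABLESPACE needs superuser rights and a directory that already exists and is owned by the postgres OS user:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=postgres")   # hypothetical connection string
    conn.autocommit = True     # CREATE TABLESPACE cannot run inside a transaction block
    cur = conn.cursor()

    cur.execute("CREATE TABLESPACE fast_disk LOCATION '/mnt/data2/pg_tblspc'")
    cur.execute("ALTER TABLE items SET TABLESPACE fast_disk")      # rewrites the table
    cur.execute("ALTER INDEX items_pkey SET TABLESPACE fast_disk")
    cur.close()
    conn.close()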

The other major help is memory. If you can fit the whole database in memory, or at least the part that is being used regularly, then slow I/O from an HDD is not so important for update operations.  But again that depends on the data access patterns: if you are updating the same tuple over and over, then it can stay in memory most of the time, which reduces database HDD I/O.  You still have WAL I/O.  Let's say this is about 39 GB/hour (10.9 MB/sec * 3600).  So with about 39 GB of data, holding most of the database in memory would probably require 48 GB or 64 GB of main memory.  Of course there are many reasons why this would work with less or need more. Additional memory might allow you to continue to use HDDs, depending on access patterns.  Sometimes more memory is cheaper than more or faster disks.

Now about SSDs.  SSDs should easily give you 6 to 10 times the throughput, provided the computer is designed for SSDs.  This means that you might, based on the back-of-the-envelope calcs here, get by with one SSD.  However, from a reliability standpoint I don't think anyone would recommend that.

The other issue you need to consider is how many CPU cores you should have.  This won't affect I/O throughput, but it will determine how many simultaneous connections can be processing at one time.  It might be that CPU starvation is limiting the data rate of your current hardware.  So if you have more CPUs, the sustained or peak data rates might go up, depending on how you arrived at the '12,000,000 writes/updates per hour'.

None of these ideas here consider reliability, which would determine whether the 3 drives are bare, mirrored, RAID, or JBODs.  Also note that any form of RAID can reduce the throughput and have other reliability problems if not correctly engineered.  Data loss is a completely different set of issues and can be handled in many ways.

RAID may provide reliability if your storage fails, but may not provide reliability to the customer if the computer itself fails, the power supply fails, or the network fails.  If customer-facing reliability is critical, then I would expect you to have duplicate hardware with failover.  In this case, for example, it is not completely crazy to run the OS drive as a single bare drive, no RAID.  If it fails, the system just switches to the failover hardware while you fix it.  If you cannot afford this type of redundancy, then you might have a single system with redundant power supplies, mirrored drives, etc.  It all depends on your risk profile, the amount of money you have to spend, and the technical knowledge you have to manage the system.  This last part is very important when a failure occurs.  You don't want to have to learn while you are in a failure or recovery situation. The more complex your system is in order to support high availability, the more knowledge you need.

If you are unsure of these issues, which are pretty basic system admin and hardware design issues, then you should get someone with experience to help you.

Neil






RE: Hardware for writing/updating 12,000,000 rows per hour

From
farjad.farid
Date:

Hi Arya,

Probably the easiest solution is to use cloud computing with a dedicated network.

 

Good luck.

 

 

 

 
