Discussion: Can anyone explain these pgbench results?
Hi,

Below are some results of running pgbench on a machine that is doing nothing else than running PostgreSQL with pgbench. The strange thing is that the results are *constantly alternating* between high (750-850 transactions) and low (50-80 transactions), no matter how many tests I run. If I wait a long time (> 5 minutes) after running the test, I always get a high score, followed by a low one, followed by a high one, a low one, etc.

I was expecting a low(ish) score on the first run (because the tables are not loaded into the cache yet), followed by continuously high(ish) scores, but not an alternating pattern. I also did not expect so much difference, given the hardware I have (dual Opteron, 4 GB memory, 3Ware SATA RAID5 with 5 disks, separate swap and pg_log disks).

Anyone any idea?

Results of pgbench:

postgres@panoramix:/usr/lib/postgresql/8.1/bin$ ./pgbench -c 10 -t 150 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 10
number of transactions per client: 150
number of transactions actually processed: 1500/1500
tps = 50.651705 (including connections establishing)
tps = 50.736338 (excluding connections establishing)

postgres@panoramix:/usr/lib/postgresql/8.1/bin$ ./pgbench -c 10 -t 150 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 10
number of transactions per client: 150
number of transactions actually processed: 1500/1500
tps = 816.972995 (including connections establishing)
tps = 836.951755 (excluding connections establishing)

postgres@panoramix:/usr/lib/postgresql/8.1/bin$ ./pgbench -c 10 -t 150 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 10
number of transactions per client: 150
number of transactions actually processed: 1500/1500
tps = 42.924294 (including connections establishing)
tps = 42.986747 (excluding connections establishing)

postgres@panoramix:/usr/lib/postgresql/8.1/bin$ ./pgbench -c 10 -t 150 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 10
number of transactions per client: 150
number of transactions actually processed: 1500/1500
tps = 730.651970 (including connections establishing)
tps = 748.538852 (excluding connections establishing)

TIA

Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl
On Mon, Mar 06, 2006 at 04:29:49PM +0100, Joost Kraaijeveld wrote:
> Below are some results of running pgbench on a machine that
> is doing nothing else than running PostgreSQL with pgbench. The
> strange thing is that the results are *constantly alternating* between
> high (750-850 transactions) and low (50-80 transactions), no matter how
> many tests I run. If I wait a long time (> 5 minutes) after running
> the test, I always get a high score, followed by a low one, followed
> by a high one, a low one, etc.

The default checkpoint_timeout is 300 seconds (5 minutes). Is it a coincidence that the "long time" between fast results is about the same? What's your setting? Are your test results more consistent if you execute CHECKPOINT between them?

--
Michael Fuhr
Hi Michael,

Thanks for your response.

Michael Fuhr wrote:
> The default checkpoint_timeout is 300 seconds (5 minutes). Is it
> a coincidence that the "long time" between fast results is about the
> same?

I have not measured the "long wait time", but I can run multiple tests in 3 minutes: the fast test lasts 3 secs, the slow one 40 secs (see below). During the tests there is not much activity on the partition where the logfiles are (other controller and disk than the database and swap).

postgres@panoramix:/usr/lib/postgresql/8.1/bin$ time ./pgbench -c 10 -t 150 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 10
number of transactions per client: 150
number of transactions actually processed: 1500/1500
tps = 531.067258 (including connections establishing)
tps = 541.694790 (excluding connections establishing)

real    0m2.892s
user    0m0.105s
sys     0m0.145s

postgres@panoramix:/usr/lib/postgresql/8.1/bin$ time ./pgbench -c 10 -t 150 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 10
number of transactions per client: 150
number of transactions actually processed: 1500/1500
tps = 37.064000 (including connections establishing)
tps = 37.114023 (excluding connections establishing)

real    0m40.531s
user    0m0.088s
sys     0m0.132s

> What's your setting?

Default.

> Are your test results more consistent
> if you execute CHECKPOINT between them?

Could you tell me how I could do that?
Groeten,

Joost Kraaijeveld
On Mon, Mar 06, 2006 at 07:46:05PM +0100, Joost Kraaijeveld wrote:
> Michael Fuhr wrote:
> > What's your setting?
>
> Default.

Have you tweaked postgresql.conf at all? If so, what non-default settings are you using?

> > Are your test results more consistent
> > if you execute CHECKPOINT between them?
>
> Could you tell me how I could do that?

Connect to the database as a superuser and execute a CHECKPOINT statement.

http://www.postgresql.org/docs/8.1/interactive/sql-checkpoint.html

From the command line you could do something like

  psql -c checkpoint
  pgbench -c 10 -t 150 test
  psql -c checkpoint
  pgbench -c 10 -t 150 test
  psql -c checkpoint
  pgbench -c 10 -t 150 test

--
Michael Fuhr
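Michael's alternation of CHECKPOINT and pgbench can also be wrapped in a small script. This is only a sketch: the `run_benchmarks` helper is hypothetical, and it assumes a pgbench database named "test" and psql/pgbench on the PATH, as in the thread.

```shell
# Sketch of the checkpoint-between-runs idea from this thread. The
# run_benchmarks helper is hypothetical; it assumes a pgbench database
# named "test" and psql/pgbench in the PATH, as in the original posts.
run_benchmarks() {
  n=${1:-3}                          # number of pgbench runs
  i=0
  while [ "$i" -lt "$n" ]; do
    psql -c checkpoint test          # flush dirty pages before each run
    pgbench -c 10 -t 150 test        # 10 clients, 150 transactions each
    i=$((i + 1))
  done
}
```

Calling `run_benchmarks 3` should then produce runs with comparable tps figures, if checkpoints are indeed the cause of the alternation.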
[Please copy the mailing list on replies.]

On Mon, Mar 06, 2006 at 09:38:20PM +0100, Joost Kraaijeveld wrote:
> Michael Fuhr wrote:
> > Have you tweaked postgresql.conf at all? If so, what non-default
> > settings are you using?
>
> Yes, I have tweaked the following settings:
>
> shared_buffers = 40000
> work_mem = 512000
> maintenance_work_mem = 512000
> max_fsm_pages = 40000
> effective_cache_size = 131072

Are you sure you need work_mem that high? How did you decide on that value? Are all other settings at their defaults? No changes to the write-ahead log (WAL) or background writer (bgwriter) settings?

What version of PostgreSQL are you running? The paths in your original message suggest 8.1.x.

> > psql -c checkpoint
> > pgbench -c 10 -t 150 test
> > psql -c checkpoint
> > pgbench -c 10 -t 150 test
> > psql -c checkpoint
> > pgbench -c 10 -t 150 test
>
> OK, that leads to a consistent high score. I also noticed that
> "psql -c checkpoint" results in I/O on the database partition but
> not on the partition that has the logfiles (pg_xlog directory). Do
> you know if that is how it should be?

A checkpoint updates the database files with the data from the write-ahead log; you're seeing those writes to the database partition. The postmaster does checkpoints every checkpoint_timeout seconds (default 300) or every checkpoint_segments log segments (default 3); it also uses a background writer to trickle pages to the database files between checkpoints so the checkpoints don't have as much work to do. I've been wondering if your pgbench runs are being affected by that background activity; the fact that you get consistently good performance after forcing a checkpoint suggests that that might be the case.

If you run pgbench several times without intervening checkpoints, do your postmaster logs have any messages like "checkpoints are occurring too frequently"?
It might be useful to increase checkpoint_warning up to the value of checkpoint_timeout and then see if you get any such messages during pgbench runs. If checkpoints are happening a lot more often than every checkpoint_timeout seconds, then try increasing checkpoint_segments (assuming you have the disk space). After doing so, restart the database and run pgbench several times without intervening checkpoints and see if performance is more consistent.

Note that tuning PostgreSQL for pgbench performance might be irrelevant for your actual needs unless your usage patterns happen to resemble what pgbench does.

--
Michael Fuhr
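In postgresql.conf terms, the diagnostic setup suggested here might look like the fragment below. The specific value for checkpoint_segments is only illustrative (it is the value Joost ends up trying later in this thread); the right number depends on available WAL disk space.

```
# postgresql.conf (PostgreSQL 8.1) -- diagnostic sketch, values illustrative
checkpoint_warning = 300    # warn whenever checkpoints occur < 300 s apart
                            # (raised to match checkpoint_timeout)
checkpoint_timeout = 300    # default: checkpoint at least every 5 minutes
checkpoint_segments = 10    # default is 3; more segments mean fewer forced
                            # checkpoints, at the cost of extra WAL space
```

Per Michael's advice, restart the database after changing these before re-running pgbench.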
Hi Michael,

Michael Fuhr wrote:
> > > Have you tweaked postgresql.conf at all? If so, what non-default
> > > settings are you using?
> >
> > Yes, I have tweaked the following settings:
> >
> > shared_buffers = 40000
> > work_mem = 512000
> > maintenance_work_mem = 512000
> > max_fsm_pages = 40000
> > effective_cache_size = 131072
>
> Are you sure you need work_mem that high? How did you decide on
> that value?

I have used http://www.powerpostgresql.com/Downloads/annotated_conf_80.html, expecting that the differences between 8.0 and 8.1 do not invalidate the recommendations. I have checked with (some) of my (large) queries and adjusted upward until I had no temp files in PGDATA/base/DB_OID/pgsql_tmp. (The warning about

> Are all other settings at their defaults?

Yep.

> No changes to the write ahead log (WAL) or background writer (bgwriter) settings?

No, because the aforementioned document explicitly states that it has recommendations on these subjects.

> What version of PostgreSQL are you running? The paths in your
> original message suggest 8.1.x.

Debian Etch's 8.1.0-3.

> A checkpoint updates the database files with the data from the
> write-ahead log; you're seeing those writes to the database partition.
> [...]

OK, thanks. To be sure that I understand it correctly:

1. Every update/insert is first written to a WAL log file, which is in the PGDATA/pg_xlog directory.
2. Routinely the background writer then writes the changes to the PGDATA/base/DB_OID/ directory.
3. The postmaster forces a checkpoint after 300 secs or if the log segments are full (whichever comes first?), so that the WAL log files are empty (I assume those are the changes the background writer has not yet written since the last forced checkpoint?).

> If you run pgbench several times without intervening checkpoints,
> do your postmaster logs have any messages like "checkpoints are
> occurring too frequently"? It might be useful to increase
> checkpoint_warning up to the value of checkpoint_timeout and then
> see if you get any such messages during pgbench runs. If checkpoints
> are happening a lot more often than every checkpoint_timeout seconds
> then try increasing checkpoint_segments (assuming you have the disk
> space). After doing so, restart the database and run pgbench several
> times without intervening checkpoints and see if performance is
> more consistent.

I will try that this day.

> Note that tuning PostgreSQL for pgbench performance might be
> irrelevant for your actual needs unless your usage patterns happen
> to resemble what pgbench does.

The advantage of using pgbench is a repeatable short command that leads to something that shows up in actual real-world usage. My problem is with the raw performance of my disk array (3Ware 9500S-8 SATA RAID5 controller with 5 disks). I am having *very* serious performance problems if I do large updates on my databases. E.g. an update of 1 (boolean) column in a table (update prototype.customers set deleted = false) that has 368915 records lasts forever (> 3500 secs). The only noticeable disk activity during such an update is on the disk/partition that has the PGDATA/base/DB_OID/ directory (/dev/sdc, the 3Ware 9500S-8 RAID5 array). There is *no* noticeable disk activity on the disk/partition that has the PGDATA/pg_xlog directory (/dev/sdb, on a Sil 3114 on-board SATA controller). The throughput during the update is ~ 2 MB/sec. The throughput during a large file copy or running bonnie (a benchmark) is > 40 MB/sec.
My primary goal is to understand the differences (and then sue the guilty ones ;-)), and then maybe either learn to live with it or find a solution. The number of write operations during the update is ~ 2000/sec. I suspect that the RAID card cannot handle a lot of small write operations (with fsync?) in a short time without a performance penalty (and yes, the write cache on the controller is enabled).

Groeten,

Joost Kraaijeveld
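One crude way to capture those iostat figures for the duration of the slow statement is sketched below. The `monitor_update` helper and the log path are hypothetical; the UPDATE statement, database name, and device name sdc are taken from this thread, and sysstat's iostat is assumed to be installed.

```shell
# Hypothetical helper: sample extended iostat statistics while the
# long-running UPDATE from this thread executes. Assumes psql and
# sysstat's iostat are installed; the log path is arbitrary.
monitor_update() {
  dev=${1:-sdc}                              # device holding PGDATA/base
  iostat -x 5 > "/tmp/iostat_${dev}.log" &   # sample every 5 seconds
  pid=$!
  psql -c "update prototype.customers set deleted = false" test
  kill "$pid"                                # stop sampling
}
```

Afterwards /tmp/iostat_sdc.log holds the per-device utilisation and await numbers covering exactly the lifetime of the statement.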
Hi Michael,

Michael Fuhr wrote:
> If you run pgbench several times without intervening checkpoints,
> do your postmaster logs have any messages like "checkpoints are
> occurring too frequently"? [...] If checkpoints
> are happening a lot more often than every checkpoint_timeout seconds
> then try increasing checkpoint_segments (assuming you have the disk
> space). After doing so, restart the database and run pgbench several
> times without intervening checkpoints and see if performance is
> more consistent.

I got the "checkpoints are occurring too frequently" message. Increasing the number of checkpoint_segments from the default 3 to 10 resulted in more tests without a performance penalty (~ 5-6 tests). The performance penalty is also a little less. It takes several minutes for the background writer to catch up.

This will solve my problems at the customer's site (they do not run so many sales transactions per second), but not my own problem while converting the old database to a new database :-(. Maybe I should invest in other hardware or rearrange my RAID5 into a RAID10 (or 50???).

Groeten,

Joost Kraaijeveld
On Mon, Mar 06, 2006 at 04:29:49PM +0100, Joost Kraaijeveld wrote:
> I was expecting a low(ish) score on the first run (because the tables
> are not loaded into the cache yet), followed by continuously high(ish)
> scores, but not an alternating pattern. I also did not expect so much
> difference, given the hardware I have (dual Opteron, 4 GB memory,
> 3Ware SATA RAID5 with 5 disks, separate swap and pg_log disks).

On a side note: RAID5 and databases generally don't mix well.

Most people find that pg_xlog will live happily with the OS; it's the data files that need the most bandwidth. If you start swapping, performance will tank to the point that it's unlikely that swap being on separate disks will help at all. Better off to just keep it with the OS and use the disks for the database tables.

Speaking of 'disks', what's your exact layout? Do you have a 5-drive RAID5 for the OS and the database, 1 drive for swap and 1 drive for pg_xlog?

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
Jim C. Nasby wrote:
> Speaking of 'disks', what's your exact layout? Do you have a 5 drive
> raid5 for the OS and the database, 1 drive for swap and 1 drive for
> pg_xlog?

On a Sil SATA 3114 controller:
/dev/sda: OS + swap
/dev/sdb: /var with pg_xlog

On the 3Ware 9500S-8, 5-disk array:
/dev/sdc: the database (and, very safe, my MP3 collection ;-))

As I wrote in one of my posts to Michael, I suspect that the card is not handling the amount of write operations as well as I expected. I wonder if anyone else sees the same characteristics with this kind of card.

Groeten,

Joost Kraaijeveld
On Tue, Mar 07, 2006 at 08:49:30PM +0100, Joost Kraaijeveld wrote:
> On a Sil SATA 3114 controller:
> /dev/sda: OS + swap
> /dev/sdb: /var with pg_xlog
>
> On the 3Ware 9500S-8, 5-disk array:
> /dev/sdc: the database (and, very safe, my MP3 collection ;-))
>
> As I wrote in one of my posts to Michael, I suspect that the card is
> not handling the amount of write operations as well as I expected. I
> wonder if anyone else sees the same characteristics with this kind
> of card.

Well, the problem is that you're using RAID5, which has a huge write overhead. You're unlikely to get good performance with it.

Also, it sounds like sda and sdb are not mirrored. If that's the case, you have no protection from a drive failure taking out your entire database, because you'd lose pg_xlog.

If you want better performance, your best bets are to either set up RAID10 or, if you don't care about the data, just go to RAID0.

--
Jim C. Nasby
Jim C. Nasby wrote:
> Well, the problem is that you're using RAID5, which has a huge write
> overhead. You're unlikely to get good performance with it.

Apparently. But I had no idea that the performance hit would be that big.

Running bonnie or copying a large file with dd shows that the card can do 30-50 MB/sec. Running a large update on my PostgreSQL database, however, shows a throughput of ~ 2 MB/sec, doing between ~ 2300 and 2500 writes/second (average), with a utilisation of almost always 100%, large await times (almost always > 700), and large io-wait percentages (> 50%), all measured with iostat.

> Also, it sounds like sda and sdb are not mirrored. If that's the case,
> you have no protection from a drive failure taking out your entire
> database, because you'd lose pg_xlog.
>
> If you want better performance your best bets are to either
> setup RAID10 or if you don't care about the data, just go to RAID0.

Because it is just my development machine, I think I will opt for the last option. More disk space left.

Groeten,

Joost Kraaijeveld
On Tue, Mar 07, 2006 at 09:15:37PM +0100, Joost Kraaijeveld wrote:
> Jim C. Nasby wrote:
> > Well, the problem is that you're using RAID5, which has a huge write
> > overhead. You're unlikely to get good performance with it.
>
> Apparently. But I had no idea that the performance hit would be that big.
>
> Running bonnie or copying a large file with dd shows that the card can
> do 30-50 MB/sec. Running a large update on my PostgreSQL database,
> however, shows a throughput of ~ 2 MB/sec [...]

While there are some issues with PostgreSQL not getting as close to the theoretical maximum as a dd bs=8k (you did match the block size to PostgreSQL's page size, right? :)), a bigger issue in this case is that better cards are able to remove much or all of the RAID5 write penalty when you're doing a large sequential write, because they will just blow entire stripes down to disk. This is very different from doing more random I/O. And it's also very possible that if you use a block size that's smaller than the stripe size, the controller won't be able to pick up on that.

In any case, RAID0 will absolutely be the fastest performance you can get.

--
Jim C. Nasby
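Jim's aside about matching dd's block size to PostgreSQL's 8 kB page can be tried as below. The output path and the small count are illustrative only; a meaningful throughput test should target the actual array and write far more data than fits in any cache.

```shell
# Sequential-write sanity check with a block size equal to PostgreSQL's
# 8 kB page size. The file path and count are illustrative; a real test
# should write to the RAID array and use a much larger count.
OUT=/tmp/dd_8k_test.bin
dd if=/dev/zero of="$OUT" bs=8k count=128 2>/dev/null   # 128 x 8 kB = 1 MiB
wc -c < "$OUT"                                          # prints 1048576
```

Dropping the `2>/dev/null` makes dd print its own throughput summary, which is the number to compare against the ~2 MB/sec seen during the UPDATE.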