Discussion: General performance/load issue
Hello everyone,

I'm having some trouble with a PostgreSQL server. We're using PG as the
database backend for a very big website (lots of data and much traffic).

The issue : about one hour after a restart, the server suddenly becomes slow
(queries stop responding), load rises (>20 instead of 1), and iowait rises
(20 to 70%).

Version : 9.0.5
Server  : Dual Xeon X5650 (24 cores total)
Memory  : 48 GB
Disks   : SSD

Top when overloaded :

21537 postgres  20  0 6420m  899m  892m D  22  1.9  0:01.86 postgres
21534 postgres  20  0 6554m  1.3g  1.1g D  14  2.7  0:01.72 postgres
21518 postgres  20  0 6419m  2.1g  2.1g D  10  4.6  0:02.10 postgres
21038 postgres  20  0 6716m  3.8g  3.5g D  10  8.0  0:45.46 postgres
21103 postgres  20  0 6571m  3.7g  3.6g D   9  7.8  0:32.19 postgres
21079 postgres  20  0 6575m  3.8g  3.7g D   8  8.1  0:36.39 postgres
21359 postgres  20  0 6563m  1.9g  1.8g D   4  4.1  0:09.10 postgres
21422 postgres  20  0 6563m  1.9g  1.7g D   4  3.9  0:08.34 postgres
19656 postgres  20  0 6727m  5.3g  5.0g D   3 11.1  1:58.25 postgres
21418 postgres  20  0 6685m  2.1g  1.9g D   2  4.5  0:12.42 postgres
21413 postgres  20  0 6693m  2.1g  1.8g D   2  4.4  0:11.06 postgres
21541 postgres  20  0 6421m  719m  711m D   1  1.5  0:00.48 postgres
14044 postgres  20  0 6418m  755m  750m D   1  1.6  0:04.71 postgres
21326 postgres  20  0 6685m  2.2g  2.0g D   1  4.7  0:15.82 postgres
21031 postgres  20  0 6688m  3.6g  3.4g D   1  7.7  0:44.18 postgres
21055 postgres  20  0 6575m  4.1g  3.9g D   1  8.6  0:39.11 postgres
21357 postgres  20  0 6693m  2.3g  2.0g D   1  4.9  0:13.74 postgres

Memory when overloaded :

~ # free -m
             total       used       free     shared    buffers     cached
Mem:         48339      47087       1251          0        248      38720
-/+ buffers/cache:       8118      40220
Swap:         8190          1       8189

postgresql.conf :

max_connections = 50
shared_buffers = 12G
temp_buffers = 40MB
work_mem = 128MB
maintenance_work_mem = 256MB
effective_cache_size = 12GB
max_files_per_process = 8192
fsync = off
checkpoint_segments = 256
checkpoint_timeout = 30min
checkpoint_completion_target = 0.9
seq_page_cost = 2.0
random_page_cost = 2.0

Did I do anything wrong? Any idea?

Regards,

Gaëtan
On 24 November 2011, 14:51, Gaëtan Allart wrote:
> Hello everyone,
>
> I'm having some trouble with a PostgreSQL server. We're using PG as the
> database backend for a very big website (lots of data and much traffic).
>
> The issue : about one hour after a restart, the server suddenly becomes
> slow (queries stop responding), load rises (>20 instead of 1), and iowait
> rises (20 to 70%).
>
> Version : 9.0.5
> Server  : Dual Xeon X5650 (24 cores total)
> Memory  : 48 GB
> Disks   : SSD
>
> Top when overloaded :

Top is not the most useful tool here, I guess. Use "iotop" (it will show you
which processes are doing the I/O) and tools like vmstat / iostat.

> postgresql.conf :
>
> max_connections = 50
> shared_buffers = 12G
> temp_buffers = 40MB
> work_mem = 128MB
> maintenance_work_mem = 256MB
> max_files_per_process = 8192
> checkpoint_segments = 256
> checkpoint_timeout = 30min
> checkpoint_completion_target = 0.9

Fine. Let's look at the options that seem suspicious.

> effective_cache_size = 12GB

Why have you set it like this? According to the "free" output you've posted,
the cache holds about 38 GB, so why just 12 GB here? There are possible
reasons for that, but I don't think they apply here.

> fsync = off

A really bad idea. I guess your data are worthless to you, right?

> seq_page_cost = 2.0
> random_page_cost = 2.0

Eh? First of all, what really matters is the relative value of those two
settings, and it's a good habit to leave seq_page_cost = 1.0 and change just
the other values. Plus random I/O is not as cheap as sequential I/O even on
SSD drives, so I'd recommend something like this:

seq_page_cost = 1.0
random_page_cost = 2.0 (or maybe 1.5)

Anyway, this needs to be tested properly - watch the performance and tune if
needed.

> Did I do anything wrong? Any idea?

Not sure. My guess is you're getting bitten by a checkpoint. We need to know
a few more details.
1) What are dirty_background_ratio / dirty_ratio (see the /proc/sys/vm/
   directory)?

2) Enable log_checkpoints in postgresql.conf and see how checkpoints
   correlate with the bad performance.

3) Check which processes are responsible for the I/O (use iotop).

Tomas
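For reference, the three checks can be run like this from a shell (a sketch
only; the data directory path is a placeholder, and iotop may need to be
installed separately):

```shell
# 1) Kernel writeback thresholds (percent of reclaimable memory)
cat /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio

# 2) In postgresql.conf set:  log_checkpoints = on
#    then reload (this parameter does not require a restart):
pg_ctl reload -D /path/to/data    # data directory is a placeholder

# 3) Show only the processes actually doing I/O
iotop -o
```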
Hi Tomas,

I will be using iotop ;)
Right now, most I/O comes from the "postgres: wal writer process".

-> effective_cache_size : okay, I'll raise it to 32 GB.
-> fsync : changed to on ;)
-> seq_page_cost / random_page_cost : I'll run tests. Thanks.
-> dirty :

cat /proc/sys/vm/dirty_ratio
20
cat /proc/sys/vm/dirty_background_ratio
10

Thanks a lot Tomas. You're really helpful!

Gaëtan

On 24/11/11 15:09, "Tomas Vondra" <tv@fuzzy.cz> wrote:
[quoted message trimmed]
On 24 November 2011, 15:27, Gaëtan Allart wrote:
> Hi Tomas,
>
> I will be using iotop ;)
> Right now, most I/O comes from the "postgres: wal writer process".

What do you mean by "most I/O" - how much data is it writing? Is there a
vacuum running at the same time? What other processes are doing I/O? Post a
few lines of "iostat -x 5" so we know what kind of I/O we're dealing with.

> -> dirty :
>
> cat /proc/sys/vm/dirty_ratio
> 20
> cat /proc/sys/vm/dirty_background_ratio
> 10

This means the cache may hold up to about 3.2 GB of dirty data before the
system starts writing it out in the background, and about 6.4 GB before
processes can no longer use the write cache.

What about /proc/sys/vm/dirty_expire_centisecs?

Is there something interesting in the postgresql log? Autovacuum maybe?

Tomas
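The arithmetic behind those 3.2 GB / 6.4 GB figures, as a sketch (the kernel
applies the ratios to reclaimable memory, assumed here to be roughly 32 GB on
this box; the exact base varies between kernel versions):

```shell
cacheable_mb=32768   # assumption: ~32 GB of reclaimable memory
bg_ratio=10          # dirty_background_ratio
fg_ratio=20          # dirty_ratio

# Background writeback (pdflush) starts above this much dirty data:
echo "background: $((cacheable_mb * bg_ratio / 100)) MB"   # ~3.2 GB
# Writers are throttled into synchronous writes above this:
echo "foreground: $((cacheable_mb * fg_ratio / 100)) MB"   # ~6.5 GB
```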
On Thu, Nov 24, 2011 at 9:09 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 24 November 2011, 14:51, Gaëtan Allart wrote:
>> Hello everyone,
>>
>> I'm having some trouble with a PostgreSQL server. We're using PG as the
>> database backend for a very big website (lots of data and much traffic).
>>
>> The issue : about one hour after a restart, the server suddenly becomes
>> slow (queries stop responding), load rises (>20 instead of 1), and
>> iowait rises (20 to 70%).
>>
>> Version : 9.0.5
>> Server  : Dual Xeon X5650 (24 cores total)
>> Memory  : 48 GB
>> Disks   : SSD
>>
>> Top when overloaded :
>
> Top is not the most useful tool here, I guess. Use "iotop" (it will show
> you which processes are doing the I/O) and tools like vmstat / iostat.
>
>> postgresql.conf :
>>
>> max_connections = 50
>> shared_buffers = 12G
>> temp_buffers = 40MB
>> work_mem = 128MB
>> maintenance_work_mem = 256MB
>> max_files_per_process = 8192
>> checkpoint_segments = 256
>> checkpoint_timeout = 30min
>> checkpoint_completion_target = 0.9
>
> Fine. Let's look at the options that seem suspicious.

I think you missed some suspicious settings... I'd recommend setting
shared_buffers to 8 GB, and I'd likely reduce checkpoint_segments to 30 and
set the checkpoint timeout back to 5 minutes. Everything about the way this
server is configured (including those vm settings) is pushing it towards
delaying the WAL/buffer/checkpoint work as long as possible, which matches
the pattern of good performance initially followed by a period of poor
performance and heavy I/O.

On a side note, I'd guess your work_mem is probably too high. 50
(connections) x 128 (MB work_mem) x 2 (sorts per query) = 12.8 GB RAM,
which is about 25% of total RAM on the box. That doesn't necessarily mean
game over, but it seems like it wouldn't be that hard to end up thrashing
when set up that way. YMMV.

Robert Treat
conjecture: xzilla.net
consulting: omniti.com
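Robert's worst-case estimate, spelled out (the factor of 2 sorts per query is
a rule of thumb he assumes, not a hard bound - a single complex query can
allocate work_mem several times over):

```shell
connections=50      # max_connections
work_mem_mb=128     # work_mem
sorts_per_query=2   # assumed rule of thumb

echo "worst case: $((connections * work_mem_mb * sorts_per_query)) MB"
# 12800 MB, i.e. ~12.5 GB -- roughly a quarter of the 48 GB box
```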
On 24 November 2011, 16:39, Robert Treat wrote:
> On Thu, Nov 24, 2011 at 9:09 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> On 24 November 2011, 14:51, Gaëtan Allart wrote:
>>> postgresql.conf :
>>>
>>> max_connections = 50
>>> shared_buffers = 12G
>>> temp_buffers = 40MB
>>> work_mem = 128MB
>>> maintenance_work_mem = 256MB
>>> max_files_per_process = 8192
>>> checkpoint_segments = 256
>>> checkpoint_timeout = 30min
>>> checkpoint_completion_target = 0.9
>>
>> Fine. Let's look at the options that seem suspicious.
>
> I think you missed some suspicious settings... I'd recommend setting
> shared_buffers to 8 GB, and I'd likely reduce checkpoint_segments to 30
> and set the checkpoint timeout back to 5 minutes. Everything about the
> way this server is configured (including those vm settings) is pushing
> it towards delaying the WAL/buffer/checkpoint work as long as possible,
> which matches the pattern of good performance initially followed by a
> period of poor performance and heavy I/O.

Yes, checkpoints were my first thought too. OTOH the OP reported that most
of the I/O is caused by the WAL writer - and that's not exactly the part
that does the work during a checkpoint. Plus WAL writes can't really be
postponed, as WAL is usually O_DIRECT and fsynced, right?

You're right that the writes are postponed, but I generally see that as a
good thing when combined with spread checkpoints. And even with those vm
settings (about 3.2 GB for background writes), I wouldn't expect this
behaviour (because pages in the page cache usually expire after 30 seconds).
Say 100% of the shared buffers are dirty and need to be written. You have
27 minutes (30 * 0.9) to do that - that means about 8 MB/s. With a
30-second expiry there might be about 240 MB accumulated before pdflush
starts writing the data to the SSD. And the SSD can surely handle more than
50 MB/s. So why the long delay? The question is what else is going on there.

But all this is just guessing - I want to see the log_checkpoints messages,
iostat results etc.

> On a side note, I'd guess your work_mem is probably too high. 50
> (connections) x 128 (MB work_mem) x 2 (sorts per query) = 12.8 GB RAM,
> which is about 25% of total RAM on the box. That doesn't necessarily mean
> game over, but it seems like it wouldn't be that hard to end up thrashing
> when set up that way. YMMV.

This is one of the reasons why effective_cache_size should be lower than
32GB, probably ...

Tomas
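Tomas's spread-checkpoint estimate, made explicit (it assumes the
pathological case where the entire 12 GB of shared_buffers is dirty):

```shell
shared_buffers_mb=12288                        # 12 GB, all assumed dirty
checkpoint_timeout_s=$((30 * 60))              # checkpoint_timeout = 30min
spread_s=$((checkpoint_timeout_s * 9 / 10))    # completion_target 0.9 -> 1620 s

echo "required write rate: ~$((shared_buffers_mb / spread_s)) MB/s"
# ~7-8 MB/s -- trivial for an SSD, hence the puzzlement about the stalls
```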
Hi Robert,

Thanks for your help as well.

You're right about checkpoints: the server runs pretty well after a start,
then runs into heavy I/O. I've changed these settings, and also reduced
work_mem a little and effective_cache_size, by the way:

LOG:  parameter "work_mem" changed to "96MB"
LOG:  parameter "effective_cache_size" changed to "24GB"
LOG:  parameter "checkpoint_segments" changed to "40"
LOG:  parameter "checkpoint_timeout" changed to "5min"

Apparently, it's been running fine since I made the first changes
recommended by Tomas. Let's wait for a couple of hours again to confirm
this.

Gaëtan

On 24/11/11 16:39, "Robert Treat" <rob@xzilla.net> wrote:
[quoted message trimmed]
Tomas,

I've enabled logging of checkpoints. I'm waiting for the next I/O crisis...

Gaëtan

On 24/11/11 17:02, "Tomas Vondra" <tv@fuzzy.cz> wrote:
[quoted message trimmed]
On 24 November 2011, 17:02, Tomas Vondra <tv@fuzzy.cz> wrote:
> [quoted message trimmed]
>
> This is one of the reasons why effective_cache_size should be lower than
> 32GB, probably ...

According to the 'free' output, 38 GB is what is there right now.
effective_cache_size is just informative, so you can set it to 1 TB without
memory issues. And it covers OS cache + PG cache.

There is not enough information yet to be sure about what's happening. The
log_checkpoints output will help for sure.

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation
Finally, it "crashed" again... :-(

Here's the output of iotop while the database was inaccessible :

32361 be/4 postgres   0.00 B/s 0.00 B/s 0.00 % 99.99 % postgres: mydb mydb host(34847) idle
32244 be/4 postgres 163.48 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb mydb host(34660) SELECT
32045 be/4 postgres   7.78 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb mydb host(33765) SELECT
32158 be/4 postgres   7.78 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb mydb host(34112) SELECT
32242 be/4 postgres   7.78 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb mydb host(34632) SELECT
32372 be/4 postgres   0.00 B/s 0.00 B/s 0.00 % 99.99 % postgres: mydb mydb host(38858) idle in transaction
32231 be/4 postgres  15.57 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb mydb host(38602) SELECT
28811 be/4 postgres   3.89 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb mydb host(40594) SELECT
32190 be/4 postgres   3.89 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb mydb host(38497) SELECT

And the latest logs :

LOG:  checkpoint complete: wrote 3192 buffers (0.2%); 0 transaction log file(s) added, 1 removed, 0 recycled; write=262.636 s, sync=135.456 s, total=416.630 s
LOG:  checkpoint complete: wrote 716 buffers (0.0%); 0 transaction log file(s) added, 1 removed, 0 recycled; write=165.497 s, sync=17.111 s, total=193.199 s

WARNING:  pgstat wait timeout
STATEMENT:  SELECT 'commit',sum(pg_stat_get_db_xact_commit(oid)) FROM pg_database WHERE datname=$1
            UNION ALL
            SELECT 'rollback',sum(pg_stat_get_db_xact_rollback(oid)) FROM pg_database WHERE datname=$2

LOG:  checkpoint complete: wrote 699 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=203.023 s, sync=119.037 s, total=335.012 s
LOG:  checkpoint complete: wrote 348 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=120.412 s, sync=0.020 s, total=120.435 s

Does this help?
Gaëtan

On 24/11/11 17:02, "Tomas Vondra" <tv@fuzzy.cz> wrote:
[quoted message trimmed]
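What stands out in those checkpoint lines is the sync phase. A throwaway
one-liner (illustrative only; the log text is copied from the message above)
shows the proportion:

```shell
line='LOG:  checkpoint complete: wrote 3192 buffers (0.2%); 0 transaction log file(s) added, 1 removed, 0 recycled; write=262.636 s, sync=135.456 s, total=416.630 s'

echo "$line" | awk '{
  for (i = 1; i <= NF; i++) {          # pick out the key=value timing fields
    if ($i ~ /^sync=/)  { sub(/^sync=/,  "", $i); s = $i }
    if ($i ~ /^total=/) { sub(/^total=/, "", $i); t = $i }
  }
  printf "sync took %.1f of %.1f s (%.0f%%)\n", s, t, 100 * s / t
}'
# -> sync took 135.5 of 416.6 s (33%)
```

A sync phase eating a third of a six-minute checkpoint, for a few dozen MB
of buffers, is the anomaly the rest of the thread chases.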
On 24 November 2011, 20:38, Gaëtan Allart <gaetan@nexylan.com> wrote:
> Finally, it "crashed" again... :-(
>
> Here's the output of iotop while the database was inaccessible :
>
> [iotop output trimmed]
>
> And the latest logs :
>
> LOG:  checkpoint complete: wrote 3192 buffers (0.2%); 0 transaction log
> file(s) added, 1 removed, 0 recycled; write=262.636 s, sync=135.456 s,
> total=416.630 s
> [further log lines trimmed]
>
> Does this help?

Yes. It seems you have an issue with your checkpoint sync times. It is
fixed in 9.1 and backported to 9.0 here:
http://projects.2ndquadrant.com/backports

It is possible you have other problems that explain the issue you have. An
immediate solution, before trying a patch, is to reduce your shared_buffers
setting to something very low, like 1 GB.

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation
On 24 November 2011, 23:19, Cédric Villemain wrote:
> On 24 November 2011, 20:38, Gaëtan Allart <gaetan@nexylan.com> wrote:
>> Finally, it "crashed" again... :-(
>>
>> Here's the output of iotop while the database was inaccessible :
>>
>> 32361 be/4 postgres   0.00 B/s 0.00 B/s 0.00 % 99.99 % postgres: mydb mydb host(34847) idle
>> 32244 be/4 postgres 163.48 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb mydb host(34660) SELECT
>> 32045 be/4 postgres   7.78 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb mydb host(33765) SELECT
>> 32158 be/4 postgres   7.78 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb mydb host(34112) SELECT
>> 32242 be/4 postgres   7.78 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb mydb host(34632) SELECT
>> 32372 be/4 postgres   0.00 B/s 0.00 B/s 0.00 % 99.99 % postgres: mydb mydb host(38858) idle in transaction
>> 32231 be/4 postgres  15.57 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb mydb host(38602) SELECT
>> 28811 be/4 postgres   3.89 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb mydb host(40594) SELECT
>> 32190 be/4 postgres   3.89 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb mydb host(38497) SELECT

This iotop output is rather incomplete - it does not show which of the
processes actually did the I/O. The fact that a process accounts for 99% of
the I/O time and yet performs almost no reads or writes is a bit strange.

Use "iotop -o" and post the result.

>> Does this help?
>
> Yes. It seems you have an issue with your checkpoint sync times. It is
> fixed in 9.1 and backported to 9.0 here:
> http://projects.2ndquadrant.com/backports

People generally don't want to apply backports on their own, especially
when it's a production server and when it's unclear whether it actually
fixes the issue they have. I'm not sure about that.

> It is possible you have other problems that explain the issue you have.
> An immediate solution, before trying a patch, is to reduce your
> shared_buffers setting to something very low, like 1 GB.

Well, using low shared_buffers was common especially before 8.3, when
spread checkpoints were not available. It prevents the I/O overload that
occurs when the database suddenly decides to write out all of the dirty
buffers. But he's on 9.0, so he already has spread checkpoints.

Plus the number of buffers he's writing is negligible - usually about 700
buffers (6 MB), and 3192 buffers (25 MB) at most. That surely should not be
a problem for the SSD he's using.

Tomas
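The buffer-to-megabyte conversion behind those figures, for reference (8 kB
is the default PostgreSQL block size, BLCKSZ):

```shell
block_kb=8   # default BLCKSZ = 8 kB
for buffers in 716 3192; do
  echo "$buffers buffers = $((buffers * block_kb / 1024)) MB"
done
# ~5 MB and ~24 MB respectively -- tiny amounts for sync phases
# that nonetheless ran for minutes in the posted logs
```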
Le 25 novembre 2011 11:25, Tomas Vondra <tv@fuzzy.cz> a écrit : > On 24 Listopad 2011, 23:19, Cédric Villemain wrote: >> Le 24 novembre 2011 20:38, Gaëtan Allart <gaetan@nexylan.com> a écrit : >>> Finally, it "crashed" againŠ :-( >>> >>> Here's the output of iotop while databased was inaccessible : >>> >>> 32361 be/4 postgres 0.00 B/s 0.00 B/s 0.00 % 99.99 % postgres: >>> mydb >>> mydb host(34847) idle >>> 32244 be/4 postgres 163.48 K/s 0.00 B/s 0.00 % 99.99 % postgres: >>> mydb >>> mydb host(34660) SELECT >>> 32045 be/4 postgres 7.78 K/s 0.00 B/s 0.00 % 99.99 % postgres: >>> mydb >>> mydb host(33765) SELECT >>> 32158 be/4 postgres 7.78 K/s 0.00 B/s 0.00 % 99.99 % postgres: >>> mydb >>> mydb host(34112) SELECT >>> 32242 be/4 postgres 7.78 K/s 0.00 B/s 0.00 % 99.99 % postgres: >>> mydb >>> mydb host(34632) SELECT >>> 32372 be/4 postgres 0.00 B/s 0.00 B/s 0.00 % 99.99 % postgres: >>> mydb >>> mydb host(38858) idle in transaction >>> 32231 be/4 postgres 15.57 K/s 0.00 B/s 0.00 % 99.99 % postgres: >>> mydb >>> mydb host(38602) SELECT >>> 28811 be/4 postgres 3.89 K/s 0.00 B/s 0.00 % 99.99 % postgres: >>> mydb >>> mydb host(40594) SELECT >>> 32190 be/4 postgres 3.89 K/s 0.00 B/s 0.00 % 99.99 % postgres: >>> mydb >>> mydb host(38497) SELECT > > This iotop output is rather incomplete - it does not show which of the > processes actually did the I/O. The fact that a process does 99% of the > I/O and yet performs no reads or writes is a bit strange. > > Use "iotop -o" and post the result. > >>> Does this help? >> >> yes. >> It seem you have an issue with your checkpoint syncing time, it is >> fixed in 9.1 and backported in 9.0 here : >> http://projects.2ndquadrant.com/backports > > People generally don't want to apply backports on their own, especially > when it's a production server and when it's unclear it actually fixes the > issue they have. I'm not sure about that. 
I agree that most people don't want to do that themselves, but if it happens to be the solution, they can proceed or ask someone to do it. People want to see their production system back to a normal situation. The limited information here is not enough to be sure, but the checkpoint sync times are clear: the sync times are not right. It is very probable that compacting the fsyncs will help, but it is not certain yet that it is required.

>> It is possible you have other problems that explain the issue you
>> have. An immediate solution before trying a patch is to reduce your
>> shared_buffers setting to something very low, like 1GB.
>
> Well, keeping shared_buffers low was mostly a workaround from before
> 8.3, when spread checkpoints were not available. It prevents the I/O
> overload when the database suddenly decides to write all of the dirty
> buffers. But he's on 9.0 (so he already has spread checkpoints).

It is a different animal here.

> Plus the number of buffers he's writing is negligible - usually about
> 700 buffers (6MB), 3192 buffers (25MB) at most. That surely should not
> be a problem for the SSD he's using.

See the blog entry from Greg Smith:
http://blog.2ndquadrant.com/en/2011/06/backporting-and-checkpoint-tro.html

And the slides of his talk at PGCon 2011:
http://www.2ndquadrant.com/static/2quad/media/pdfs/talks/WriteStuff-PGCon2011.pdf

I was just pointing out that there are known issues in this area, with known solutions.

Getting more information on vacuum and bgwriter activity should help too.

Gaëtan, do you still have critical performance issues, or is the server back to a normal/loaded (but not overloaded) situation?

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation
On 25 November 2011, 12:43, Cédric Villemain wrote:
> People want to see their production system back to a normal situation.
> The limited information here is not enough to be sure, but the
> checkpoint sync times are clear: the sync times are not right.
> It is very probable that compacting the fsyncs will help, but it is not
> certain yet that it is required.

Yes, the sync times are quite crazy. Especially given the tiny number of buffers to be written and the fact that the SSD should handle random I/O quite well.
> See the blog entry from Greg Smith:
> http://blog.2ndquadrant.com/en/2011/06/backporting-and-checkpoint-tro.html
>
> And the slides of his talk at PGCon 2011:
> http://www.2ndquadrant.com/static/2quad/media/pdfs/talks/WriteStuff-PGCon2011.pdf
>
> I was just pointing out that there are known issues in this area, with
> known solutions.

Thanks for the links, interesting stuff. Still, my impression is that the SSD is stressed by something else, and the high fsync values during a checkpoint are merely a symptom. So fixing the checkpoint (using the backpatch) won't actually fix the issue. But I'm just guessing here.

> Getting more information on vacuum and bgwriter activity should help too.

Yes, that'd be nice. Gaëtan, can you post the bgwriter-related options from postgresql.conf and two snapshots of pg_stat_bgwriter (say 5 minutes apart, collected while the db is slow)? A complete 'iotop -o' output would be nice too.

BTW what filesystem are we talking about? What mount options have you used?

What about /proc/sys/vm/dirty_expire_centisecs?

Tomas
Hello Tomas and Cédric,

Right now, the server is not all right. Load is above 30 and queries are slow like hell.

Here's the complete iotop. Note the 71 MB/s writes (apparently on SELECT queries).

Total DISK READ: 633.35 K/s | Total DISK WRITE: 71.06 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
27352 be/4 postgres 15.64 K/s 86.01 K/s 0.00 % 99.99 % postgres: database database 176.31.228.6(38816) SELECT
20226 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 99.99 % postgres: database database 176.31.228.6(34166) SELECT
26950 be/4 postgres 23.46 K/s 0.00 B/s 0.00 % 82.14 % postgres: database database 46.105.104.205(40820) SELECT
23160 be/4 postgres 3.91 K/s 0.00 B/s 0.00 % 81.14 % postgres: database database 46.105.104.205(58091) SELECT
29184 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 79.17 % postgres: database database 46.105.104.205(51047) SELECT
27271 be/4 postgres 23.46 K/s 234.58 K/s 0.00 % 77.15 % postgres: database database 46.105.104.205(42315) SELECT
28224 be/4 postgres 0.00 B/s 0.00 B/s 0.00 % 74.09 % postgres: database database 46.105.104.205(49871) SELECT
27450 be/4 postgres 23.46 K/s 0.00 B/s 0.00 % 70.12 % postgres: database database 46.105.104.205(49096) SELECT
29206 be/4 postgres 121.20 K/s 0.00 B/s 0.00 % 62.99 % postgres: database database 176.31.228.6(39787) SELECT
8202 be/4 postgres 129.02 K/s 0.00 B/s 0.00 % 60.59 % postgres: database database 91.121.89.14(57291) SELECT
29209 be/4 postgres 3.91 K/s 0.00 B/s 0.00 % 55.75 % postgres: database database 176.31.228.6(39807) idle
27046 be/4 postgres 3.91 K/s 0.00 B/s 0.00 % 52.35 % postgres: database database 46.105.104.205(41285) SELECT
29217 be/4 postgres 54.73 K/s 0.00 B/s 0.00 % 50.18 % postgres: database database 46.105.104.205(51178) SELECT
3249 be/4 root 3.91 K/s 320.59 K/s 0.00 % 40.31 % [kjournald]
28918 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 34.83 % postgres: database database 46.105.104.205(50164) SELECT
26856 be/4 postgres 7.82 K/s 234.58 K/s 0.00 % 34.30 % postgres: database database 46.105.104.205(40589) SELECT
29205 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 34.17 % postgres: database database 176.31.228.6(39784) SELECT
4175 be/4 root 0.00 B/s 0.00 B/s 0.00 % 1.84 % [flush-8:0]
28905 be/4 postgres 0.00 B/s 0.00 B/s 0.00 % 0.02 % postgres: database database 46.105.104.205(50125) SELECT
28919 be/4 postgres 0.00 B/s 0.00 B/s 0.00 % 0.01 % postgres: database database 46.105.104.205(50167) SELECT
19807 be/4 postgres 3.91 K/s 0.00 B/s 0.00 % 0.00 % postgres: database database 176.31.228.6(33598) SELECT
28703 be/4 postgres 0.00 B/s 86.01 K/s 0.00 % 0.00 % postgres: database database 176.31.228.6(39260) SELECT
29183 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 0.00 % postgres: database database 46.105.104.205(51044) SELECT

Here is dirty_expire_centisecs :

cat /proc/sys/vm/dirty_expire_centisecs
3000

Bgwriter configuration is default :

#bgwriter_delay = 200ms # 10-10000ms between rounds
#bgwriter_lru_maxpages = 100 # 0-1000 max buffers written/round
#bgwriter_lru_multiplier = 2.0 # 0-10.0 multiplier on buffers scanned/round

Is there anything I can provide to help you?

Thanks again,

Gaëtan
On 25 November 2011 23:47, Gaëtan Allart <gaetan@nexylan.com> wrote:
> Right now, the server is not all right. Load is above 30 and queries are
> slow like hell.
> [...]
> Is there anything I can provide to help you?

The checkpoint logs and the output of pg_stat_bgwriter (as asked by Tomas).

It is probable that shared_buffers is too small for your workload (expected); do you still have issues with checkpoint sync duration?

You should increase shared_buffers (you can iterate) *and* try to avoid the long checkpoint syncs (keep shared_buffers <= 8GB, as already mentioned upthread).

The output you provided does not show everything, and the SELECTs do not look like they are writing that much. Which processes are writing?

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation
Here are the latest checkpoint logs :

LOG: checkpoint complete: wrote 842 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=168.970 s, sync=0.005 s, total=168.977 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 318 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=63.818 s, sync=0.006 s, total=63.825 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 744 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=147.035 s, sync=0.006 s, total=147.043 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 108 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=35.410 s, sync=14.921 s, total=54.811 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 393 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=88.835 s, sync=43.210 s, total=135.728 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 914 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=148.162 s, sync=14.249 s, total=170.481 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 202 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=53.152 s, sync=0.004 s, total=53.159 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 897 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=42.414 s, sync=1.175 s, total=58.957 s
LOG: checkpoint starting: shutdown immediate
LOG: checkpoint complete: wrote 666 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.027 s, sync=1.600 s, total=1.630 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 627 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=125.856 s, sync=0.006 s, total=125.864 s
LOG: checkpoint starting: time

And pg_stat_bgwriter :

postgres=# select *, now() from pg_stat_bgwriter;
 checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc |              now
-------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------+-------------------------------
               388 |              13 |             494948 |       4306591 |            13555 |         7458743 |    7835244602 | 2011-11-26 00:43:47.232924+01
(1 row)

postgres=# select *, now() from pg_stat_bgwriter;
 checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc |              now
-------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------+-------------------------------
               389 |              13 |             501802 |       4352198 |            13809 |         7469220 |    7839778941 | 2011-11-26 00:49:00.680779+01
(1 row)

The processes that were writing were "SELECT" queries against the database.

Gaëtan
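Spikes like the 43-second sync in the checkpoint logs above are easy to pick out mechanically. A small illustrative helper (not any official PostgreSQL tool; the regex is keyed to the 9.0-style log format shown above):

```python
import re

# Pull buffers written and the write/sync/total durations out of a
# 9.0-style "checkpoint complete" log line. Illustrative only; the
# regex is keyed to the log format shown in this thread.
CHECKPOINT_RE = re.compile(
    r"checkpoint complete: wrote (\d+) buffers.*"
    r"write=([\d.]+) s, sync=([\d.]+) s, total=([\d.]+) s"
)

def parse_checkpoint(line):
    m = CHECKPOINT_RE.search(line)
    if m is None:
        return None  # e.g. a "checkpoint starting" line
    return {"buffers": int(m.group(1)),
            "write_s": float(m.group(2)),
            "sync_s": float(m.group(3)),
            "total_s": float(m.group(4))}

line = ("LOG:  checkpoint complete: wrote 393 buffers (0.0%); "
        "0 transaction log file(s) added, 0 removed, 0 recycled; "
        "write=88.835 s, sync=43.210 s, total=135.728 s")
print(parse_checkpoint(line))
```

Run over a few hours of logs, this makes it obvious whether the long sync phases correlate with the slow periods.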
On 26 November 2011, 00:17, Cédric Villemain wrote:
> On 25 November 2011 23:47, Gaëtan Allart <gaetan@nexylan.com> wrote:
>> Right now, the server is not all right. Load is above 30 and queries
>> are slow like hell.
>>
>> Here's the complete iotop. Note the 71 MB/s writes (apparently on
>> SELECT queries).
>>
>> Total DISK READ: 633.35 K/s | Total DISK WRITE: 71.06 M/s
>> 27352 be/4 postgres 15.64 K/s 86.01 K/s 0.00 % 99.99 % postgres: database database 176.31.228.6(38816) SELECT
>> 20226 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 99.99 % postgres: database database 176.31.228.6(34166) SELECT
>> 26950 be/4 postgres 23.46 K/s 0.00 B/s 0.00 % 82.14 % postgres: database database 46.105.104.205(40820) SELECT
>> 23160 be/4 postgres 3.91 K/s 0.00 B/s 0.00 % 81.14 % postgres: database database 46.105.104.205(58091) SELECT
>> 29184 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 79.17 % postgres: database database 46.105.104.205(51047) SELECT

Gaëtan, you keep deleting the interesting rows for some reason. None of the rows you've posted writes more than a few KB/s - what about the rows that actually write those 71 MB/s?

>> Here is dirty_expire_centisecs :
>>
>> cat /proc/sys/vm/dirty_expire_centisecs
>> 3000

OK. That's the default value and it's usually too high (just like the ratios), but don't change it until we find out what's wrong.

>> Bgwriter configuration is default :
>>
>> #bgwriter_delay = 200ms # 10-10000ms between rounds
>> #bgwriter_lru_maxpages = 100 # 0-1000 max buffers written/round
>> #bgwriter_lru_multiplier = 2.0 # 0-10.0 multiplier on buffers scanned/round

OK.

>> Is there anything I can provide to help you?
>
> The checkpoint logs and the output of pg_stat_bgwriter (as asked by
> Tomas).
>
> It is probable that shared_buffers is too small for your workload
> (expected); do you still have issues with checkpoint sync duration?

I don't think that's the case.
Too small shared buffers usually cause a lot of reads (especially when all the queries are SELECTs, as here), but he has problems with writes. And according to the pg_stat_bgwriter output Gaëtan posted, the checkpoints wrote about 54MB, the bgwriter about 370MB, and the backends about 80MB (during the 5 minutes between the snapshots). So I'm really wondering where those 70MB/s of writes come from.

Two things just crossed my mind. The first one is hint bits - these may cause a SELECT to write a lot of data. But I guess this is included in the pg_stat_bgwriter stats.

The second one is on-disk sorting - this happens when a query needs to sort so much data that it can't be done in work_mem, so the data is pushed to disk, and AFAIK it's not included in pg_stat_bgwriter. But he has work_mem set to 128MB, so it's rather unlikely.

Gaëtan, can you verify that the queries that write the most data to disk are not performing any huge sorts etc.? (Use iotop to see which postgres process is writing a lot of data, then look up its PID in pg_stat_activity to find out which query it's executing.)

Tomas
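Those back-of-the-envelope MB figures can be reproduced from the two pg_stat_bgwriter snapshots posted upthread, again assuming the default 8 kB block size; a sketch:

```python
# Difference the two pg_stat_bgwriter snapshots posted upthread
# (taken roughly 5 minutes apart) and convert buffer counts to MB
# written during the interval. Illustrative arithmetic only.
BLOCK_SIZE = 8192  # default BLCKSZ, in bytes

snap1 = {"buffers_checkpoint": 494948,
         "buffers_clean": 4306591,
         "buffers_backend": 7458743}   # taken at 00:43:47
snap2 = {"buffers_checkpoint": 501802,
         "buffers_clean": 4352198,
         "buffers_backend": 7469220}   # taken at 00:49:00

interval_s = 313  # 00:43:47 -> 00:49:00

total_mb = 0.0
for source in snap1:
    mb = (snap2[source] - snap1[source]) * BLOCK_SIZE / (1024 * 1024)
    total_mb += mb
    print(f"{source}: {mb:.0f} MB in {interval_s} s")
print(f"total: {total_mb:.0f} MB (~{total_mb / interval_s:.1f} MB/s)")
```

This gives roughly 54 MB from checkpoints, ~356 MB from the bgwriter and ~82 MB from backends - in line with Tomas's rough figures, and well under 2 MB/s combined, nowhere near the 70 MB/s iotop reports. That supports the suspicion that the heavy writes are not going through shared buffers at all.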
A better view of iotop :

TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
16974 be/4 postgres 46.33 K/s 0.00 B/s 0.00 % 7.21 % postgres: database database 46.105.111.92(54930) idle
383 be/4 postgres 7.72 K/s 0.00 B/s 0.00 % 1.56 % postgres: database database 46.105.104.205(51706) idle in transaction
15934 be/4 postgres 0.00 B/s 38.61 K/s 0.00 % 0.04 % postgres: wal writer process
31487 be/4 postgres 0.00 B/s 11.32 M/s 0.00 % 0.00 % postgres: database database 46.105.104.205(48616) SELECT
29541 be/4 postgres 0.00 B/s 11.52 M/s 0.00 % 0.00 % postgres: database database 46.105.104.205(40675) SELECT
31419 be/4 postgres 0.00 B/s 5.99 M/s 0.00 % 0.00 % postgres: database database 46.105.104.205(48476) SELECT
31875 be/4 postgres 0.00 B/s 15.23 M/s 0.00 % 0.00 % postgres: database database 46.105.104.205(50228) SELECT
30985 be/4 postgres 0.00 B/s 10.55 M/s 0.00 % 0.00 % postgres: database database 46.105.104.205(47672) SELECT
30902 be/4 postgres 0.00 B/s 6.04 M/s 0.00 % 0.00 % postgres: database database 176.31.228.6(43295) SELECT
30903 be/4 postgres 0.00 B/s 4.79 M/s 0.00 % 0.00 % postgres: database database 46.105.104.205(47565) SELECT
15933 be/4 postgres 0.00 B/s 77.22 K/s 0.00 % 0.00 % postgres: writer process

As you can see, the SELECTs are writing heavily to disk.

At this moment, pg_stat_activity shows many, many queries running at the same time. It looks like all queries suddenly start writing to disk - not one particular query - which makes me think of a buffer issue or something.
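Tomas's earlier suggestion - match the heavy writers against pg_stat_activity - can be mechanized on output like the above. A hypothetical sketch (the parsing is keyed to this iotop layout and `heavy_writers` is not a standard tool):

```python
import re

# Extract (pid, write rate in MB/s) from iotop-style rows so the PIDs
# can then be looked up in pg_stat_activity. Keyed to the iotop layout
# shown above; illustrative only, not a standard tool.
ROW_RE = re.compile(
    r"^\s*(\d+)\s+\S+\s+postgres\s+"   # TID, PRIO, USER
    r"[\d.]+\s+[KMB]/s\s+"             # DISK READ (ignored)
    r"([\d.]+)\s+([KMB])/s"            # DISK WRITE rate and unit
)
MB_PER_UNIT = {"B": 1 / (1024 * 1024), "K": 1 / 1024, "M": 1.0}

def heavy_writers(lines, min_mb_s=1.0):
    """Return (pid, MB/s written) for rows above the threshold, busiest first."""
    found = []
    for row in lines:
        m = ROW_RE.match(row)
        if not m:
            continue
        write_mb = float(m.group(2)) * MB_PER_UNIT[m.group(3)]
        if write_mb >= min_mb_s:
            found.append((int(m.group(1)), write_mb))
    return sorted(found, key=lambda t: -t[1])

sample = [
    "31487 be/4 postgres 0.00 B/s 11.32 M/s 0.00 % 0.00 % postgres: database database 46.105.104.205(48616) SELECT",
    "31875 be/4 postgres 0.00 B/s 15.23 M/s 0.00 % 0.00 % postgres: database database 46.105.104.205(50228) SELECT",
    "15933 be/4 postgres 0.00 B/s 77.22 K/s 0.00 % 0.00 % postgres: writer process",
]
print(heavy_writers(sample))
```

Each returned PID can then be checked in pg_stat_activity to see the offending statement (on 9.0 the relevant columns are procpid and current_query; they were renamed in later releases).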
This is an example of a query writing 10 MB/s :

SELECT COUNT(*) FROM (SELECT DISTINCT "table"."id" AS "id", "table"."flux_id" AS "flux_id", "table"."locale_id" AS "locale_id", "table"."url_article" AS "url_article", "table"."original_url" AS "original_url", "table"."name" AS "name", "table"."description" AS "description", "table"."content" AS "content", "table"."permis" AS "permis", "table"."reviewed" AS "reviewed", "table"."author_id" AS "author_id", "table"."poster_id" AS "poster_id", "table"."post_date" AS "post_date", "table"."edit_date" AS "edit_date", "table"."add_date" AS "add_date", "table"."comments_open" AS "comments_open", "table"."site_id" AS "site_id", "table"."is_local" AS "is_local", "table"."status" AS "status", "table"."visits" AS "visits", "table"."votes" AS "votes", "table"."score" AS "score", "arti

Checkpoint logs still show very long write times :

LOG: checkpoint complete: wrote 92 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=49.622 s, sync=6.510 s, total=63.625 s

Couldn't this be a hardware issue?

Best regards,

Gaëtan

Le 26/11/11 04:42, « Tomas Vondra » <tv@fuzzy.cz> a écrit :

>Dne 26.11.2011 00:17, Cédric Villemain napsal(a):
>> Le 25 novembre 2011 23:47, Gaëtan Allart <gaetan@nexylan.com> a écrit :
>>> Hello Tomas and Cédric,
>>>
>>> Right now, the server is not all right. Load is above 30 and queries are
>>> slow like hell.
>>>
>>> Here's the complete iotop. Note the 71 MB/s writes (apparently on SELECT
>>> queries).
>>> >>> Total DISK READ: 633.35 K/s | Total DISK WRITE: 71.06 M/s >>> TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND >>> >>> >>> 27352 be/4 postgres 15.64 K/s 86.01 K/s 0.00 % 99.99 % postgres: >>> database database 176.31.228.6(38816) SELECT >>> 20226 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 99.99 % postgres: >>> database database 176.31.228.6(34166) SELECT >>> 26950 be/4 postgres 23.46 K/s 0.00 B/s 0.00 % 82.14 % postgres: >>> database database 46.105.104.205(40820) SELECT >>> 23160 be/4 postgres 3.91 K/s 0.00 B/s 0.00 % 81.14 % postgres: >>> database database 46.105.104.205(58091) SELECT >>> 29184 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 79.17 % postgres: >>> database database 46.105.104.205(51047) SELECT > >Gaetan, you keep deleting the interesting rows for some reason. None of >the rows you've posted writes more than a few KB/s - what about the rows >that actually write those 71MB/s? > >>> Here is dirty_expire_centisecs : >>> >>> cat /proc/sys/vm/dirty_expire_centisecs >>> 3000 > >OK. That's a default value and it's usually too high (just like the >ratios), but don't change it until we find out what's wrong. > >>> Bgwriter configuration is default : >>> >>> #bgwriter_delay = 200ms # 10-10000ms between rounds >>> #bgwriter_lru_maxpages = 100 # 0-1000 max buffers written/round >>> #bgwriter_lru_multiplier = 2.0 # 0-10.0 multipler on buffers >>>scanned/round > >OK. > >>> Is there anything I can provide to help you ? >> >> the checkpoints logs and the output of pg_stat_bgwriter (as asked by >>Tomas). >> >> It is probable that shared_buffers are too small for your workload >> (expected), do you still have issue with checkpoint sync duration ? > >I don't think that's the case. Too small shared buffers usually cause a >lot of reads (especially when all the queries are SELECTs as here), but >he has problems with writes. 
> >And according to the pg_stat_bgwriter Gaetan posted, the checkpoints >wrote about 54MB, bgwriter about 370MB and backends about 80MB (during >the 5 minutes between snapshots). > >So I'm really wondering where those 70MB/s of writes come from. > >Two things just crossed my mind. The first one are hint bits - this may >cause a SELECT to write a lot of data. But I guess this is included in >the pg_stat_bgwriter stats. > >The second one is on-disk sorting - this happens when a query needs to >sort so much data it can't be done in work_mem, so the data are pushed >to the disk, and AFAIK it's not included into the pg_stat_bgwriter. > >But he has work_mem set to 128MB so it's rather unlikely. Gaetan, can >you verify that those queries that write the most data to the disk are >not performing any huge sorts etc? (See iotop which postgres process is >writing a lot of data and use the PID and pg_stat_activity to find out >which query it's executing.) > >Tomas > >-- >Sent via pgsql-general mailing list (pgsql-general@postgresql.org) >To make changes to your subscription: >http://www.postgresql.org/mailpref/pgsql-general
On 26 Listopad 2011, 10:45, Gaëtan Allart wrote: > A better view of iotop : > > TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND > 31875 be/4 postgres 0.00 B/s 15.23 M/s 0.00 % 0.00 % postgres: > database database 46.105.104.205(50228) SELECT > 30985 be/4 postgres 0.00 B/s 10.55 M/s 0.00 % 0.00 % postgres: > database database 46.105.104.205(47672) SELECT > > As you can see, SELECTS are writing very much on the disk. > > At this moment, pg_stat_activity show many many queries running at the > same time. > It looks like that all queries are suddenly writing on disk and not a > particular one, making me think of a buffer issue or something. No, I don't think this is an issue with the size of shared buffers. That'd influence the pg_stat_bgwriter - the buffers_backend would grow much faster, and it's not the case. So I'm guessing it's either hint bits or sorting. If I had to, I'd probably guess about the hint bits - see for example this thread http://archives.postgresql.org/pgsql-performance/2008-05/msg00182.php it more or less behaves like what you described so far. 
> This is an example of a query writing 10 MB/s :
> [...]

Post the EXPLAIN ANALYZE output of this query (use explain.depesz.com to post it).

> Checkpoint logs still show very long write times :
>
> LOG: checkpoint complete: wrote 92 buffers (0.0%); 0 transaction log
> file(s) added, 0 removed, 1 recycled; write=49.622 s, sync=6.510 s,
> total=63.625 s

No, that's fine - that's what spread checkpoints do. Once checkpoint_timeout expires, the system decides to perform a checkpoint, i.e. it has to write all dirty (modified) shared buffers to the disk. But it knows the next checkpoint will happen in checkpoint_timeout, so it has about 5 minutes to write all the data.

So it says something like 'I need to write 540MB and want to write that in 270 seconds (5 minutes * completion_target), so I'll write at 2MB/s'. So the write phase is expected to take a long time. But it has to sync the data at the end, and that's where the problems usually happen - so the important thing is 'sync', and that improved significantly. 6 seconds is not great, but it's not a big issue I guess.

> Couldn't this be a hardware issue?

I don't think so.
The problem is that the queries write a lot of data, and the SSD can't help in this case. SSDs provide much better random performance (compared to spinners), but the sequential performance is not much better. So pushing 70MB/s to the SSD may fully utilize it.

Have you tested the basic performance (with bonnie++, dd, etc.) before going into production? There are some common SSD-related mistakes (e.g. not aligning the partitions properly) that may easily cause a 30% performance loss. Fixing that won't solve the problem, although you could get a bit better performance.

Tomas
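To gather the EXPLAIN ANALYZE output Tomas asks for, one option is the parenthesized syntax introduced in 9.0, whose BUFFERS option also reports blocks read and written per plan node - exactly what is needed to see where the writes come from. This is only a sketch; the "..." elides the rest of the original query:

```sql
-- BUFFERS (available since 9.0) adds shared/temp block counts per node.
-- The /* ... */ and FROM ... stand for the elided parts of the query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM (SELECT DISTINCT "table"."id" AS "id" /* ... */ FROM ...) AS sub;
```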
On Fri, Nov 25, 2011 at 6:48 PM, Gaëtan Allart <gaetan@nexylan.com> wrote:
> Here are the latest checkpoint logs :
>
> LOG: checkpoint complete: wrote 842 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=168.970 s, sync=0.005 s, total=168.977 s
> LOG: checkpoint starting: time
> LOG: checkpoint complete: wrote 318 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=63.818 s, sync=0.006 s, total=63.825 s
> LOG: checkpoint starting: time
> LOG: checkpoint complete: wrote 744 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=147.035 s, sync=0.006 s, total=147.043 s
> LOG: checkpoint starting: time
> LOG: checkpoint complete: wrote 108 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=35.410 s, sync=14.921 s, total=54.811 s
> LOG: checkpoint starting: time
> LOG: checkpoint complete: wrote 393 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=88.835 s, sync=43.210 s, total=135.728 s
> LOG: checkpoint starting: time
> LOG: checkpoint complete: wrote 914 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=148.162 s, sync=14.249 s, total=170.481 s
> LOG: checkpoint starting: time
> LOG: checkpoint complete: wrote 202 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=53.152 s, sync=0.004 s, total=53.159 s
> LOG: checkpoint starting: time
> LOG: checkpoint complete: wrote 897 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=42.414 s, sync=1.175 s, total=58.957 s
> LOG: checkpoint starting: shutdown immediate
> LOG: checkpoint complete: wrote 666 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.027 s, sync=1.600 s, total=1.630 s
> LOG: checkpoint starting: time
> LOG: checkpoint complete: wrote 627 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=125.856 s, sync=0.006 s, total=125.864 s
> LOG: checkpoint starting: time
>
> And pg_stat_bgwriter :
>
> postgres=# select *,now() from pg_stat_bgwriter;
>  checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc |              now
> -------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------+-------------------------------
>                388 |              13 |             494948 |       4306591 |            13555 |         7458743 |    7835244602 | 2011-11-26 00:43:47.232924+01
> (1 row)
>
> postgres=# select *,now() from pg_stat_bgwriter;
>  checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc |              now
> -------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------+-------------------------------
>                389 |              13 |             501802 |       4352198 |            13809 |         7469220 |    7839778941 | 2011-11-26 00:49:00.680779+01
> (1 row)
>
> Processes that were writing were "SELECT" queries against database.

What filesystem is involved? What is the underlying disk layout (you said it's SSDs, but what type, how many, and in what configuration)? Also, how is Postgres set up on top of the disks (all of $PGDATA and the OS on one volume? Split up?)

Also, how many active connections do you typically have? Can you reduce your sort mem to something like 4MB, and set log_temp_files to 0?

Robert Treat
conjecture: xzilla.net
consulting: omniti.com
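Robert's two suggestions can be applied without a restart; a sketch using his proposed values:

```sql
-- In postgresql.conf (both settings are reloadable, no restart needed):
--   work_mem = 4MB          -- down from 128MB, per Robert's suggestion
--   log_temp_files = 0      -- log every temporary file, regardless of size
-- then reload:
SELECT pg_reload_conf();

-- work_mem can also be lowered for a single session while testing:
SET work_mem = '4MB';
```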
Hello Robert,

I'm having around 30 active connections (average).

As far as the disk layout is concerned, everything's on the same disk (RAID 1 with 2 SSDs).

Gaëtan

Le 26/11/11 15:25, « Robert Treat » <rob@xzilla.net> a écrit :

> [...]
Rahh :/

It's getting worse and worse :/ The database has to be restarted every 2 hours, causing much traffic loss :/

As far as the server is concerned, it was running great 7 days ago and had been running like this for months. I really don't get why it suddenly went "I/O-ing"…

Here's the current postgresql.conf :

shared_buffers = 6GB                  # min 128kB (change requires restart)
temp_buffers = 40MB                   # min 800kB (change requires restart)
work_mem = 96MB                       # min 64kB
maintenance_work_mem = 256MB          # min 1MB
max_stack_depth = 2MB                 # min 100kB
effective_cache_size = 32GB
max_files_per_process = 8192          # min 25 (change requires restart)
fsync = on                            # turns forced synchronization on or off
synchronous_commit = off              # immediate fsync at commit
wal_buffers = 16MB                    # min 32kB
checkpoint_segments = 40              # in logfile segments, min 1, 16MB each
checkpoint_timeout = 5min             # range 30s-1h
checkpoint_completion_target = 0.9    # checkpoint target duration, 0.0 - 1.0
seq_page_cost = 1.0                   # measured on an arbitrary scale
random_page_cost = 2.0                # same scale as above

Isn't there anything I can do to keep my database up & running, even with bad performance?

Filesystem is ext3, running over a hardware RAID-1 config.

Gaëtan

Le 26/11/11 15:12, « Tomas Vondra » <tv@fuzzy.cz> a écrit :

> [...]
Uhm… I'm seeing dozens and dozens of temporary file creations in the logs :

LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.1425", size 25340
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.195", size 2720340
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp13772.3495", size 24724
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp13772.2674", size 2712452
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.1850", size 25284
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.822", size 2717464
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.1050", size 25060
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.501", size 2700248
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp13772.3768", size 23156
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp13772.2132", size 2713204
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.1886", size 23744
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.311", size 2829600
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp13772.3849", size 25088
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp13772.2352", size 2770352
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.1267", size 25592
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.765", size 2803744
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp13772.3775", size 24444
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp13772.2686", size 2858836
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.1094", size 24948
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.106", size 2800140
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.1399", size 23912
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.211", size 2761788
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp13772.3747", size 27188
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp13772.2639", size 2749672
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.1053", size 24276
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.452", size 2948712

Does this help?

Gaëtan

Le 26/11/11 17:47, « Gaëtan Allart » <gaetan@nexylan.com> a écrit :

> [...]
Dne 26.11.2011 18:08, Gaëtan Allart napsal(a): > Uhm… > > I'm seeing dozens and dozens of temporary file creations in logs : > > LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.1399", size 23912 > LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.211", size > 2761788 > LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp13772.3747", size 27188 > LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp13772.2639", size > 2749672 > LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.1053", size 24276 > LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15388.452", size > 2948712 Please, add timestamp to log_line_prefix so that we know how frequently that happens. Anyway the output is a bit strange. It's writing a lot of temp files that are significantly smaller (about 3MB) than work_mem (128MB). Tomas
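Adding the timestamp Tomas asks for is a reload-only change; a minimal sketch (the '%t [%p] ' prefix is an assumed choice, not spelled out in the thread):

```sql
-- In postgresql.conf (takes effect on config reload):
--   log_line_prefix = '%t [%p] '   -- timestamp and backend PID on each line
-- Verify what is currently in effect:
SHOW log_line_prefix;
```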
Tomas Vondra <tv@fuzzy.cz> writes: > Anyway the output is a bit strange. It's writing a lot of temp files > that are significantly smaller (about 3MB) than work_mem (128MB). The most obvious guess about what's doing that is a hash join that has a drastic overestimate of how many rows it has to hash, so that it's batching the join unnecessarily. AFAIR, sorts don't spill to disk until they've actually reached work_mem, but hash join works differently. regards, tom lane
Dne 26.11.2011 17:47, Gaëtan Allart napsal(a):
> Rahh :/
>
> It's getting worse and worse :/ Database has to be restarted every 2 hours
> causing much traffic loss :/
>
> As far as the server is concerned, it was running great 7 days ago and had
> been running like this for months. I really don't get why it suddenly went
> "I/O-ing"…

Given the info provided so far, we can rule out checkpoint issues. These sudden changes in performance happen when the stats are off or when vacuum is not able to maintain all the tables.

1) Post EXPLAIN ANALYZE of the queries that cause a lot of writes.

2) How much did the database grow recently?

3) Is the autovacuum running fine? Check the logs, and maybe set log_autovacuum_min_duration = 100 or a different value. Set the log_line_prefix (add '%t').

4) Check pg_stat_all_tables for:
- large n_dead_tup values (especially compared to n_live_tup)
- tables with a lot of changes (n_tup_ins + n_tup_upd + n_tup_del + n_tup_hot_upd) that were not vacuumed / analyzed recently

Tomas
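Check (4) can be sketched as a single query against pg_stat_all_tables (all of these columns exist on 9.0; the LIMIT is an arbitrary choice):

```sql
-- Tables with the most dead rows and the heaviest churn, plus when they
-- were last vacuumed/analyzed.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       n_tup_ins + n_tup_upd + n_tup_del + n_tup_hot_upd AS churn,
       last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_all_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
```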
On 26.11.2011 19:08, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> Anyway the output is a bit strange. It's writing a lot of temp files
>> that are significantly smaller (about 3MB) than work_mem (128MB).
>
> The most obvious guess about what's doing that is a hash join that has
> a drastic overestimate of how many rows it has to hash, so that it's
> batching the join unnecessarily. AFAIR, sorts don't spill to disk until
> they've actually reached work_mem, but hash join works differently.

OK, so it's probably not sorting. What about other nodes that use a hash table - aggregates etc.? Would those spill to the disk in case of overestimation?

Gaetan did not post the whole query, but it starts like this:

SELECT COUNT(*) FROM (SELECT DISTINCT "table"."id" AS "id",
"table"."flux_id" AS "flux_id", "table"."locale_id" AS "locale_id", ...

so it obviously uses a DISTINCT clause, probably evaluated using a hash aggregate. And this is allegedly one of the queries that write a lot.

Tomas
Actually, this is because I changed sort_mem to 4MB, as asked by Robert.

I removed this setting.

Gaëtan

Le 26/11/11 18:58, « Tomas Vondra » <tv@fuzzy.cz> a écrit :

> [...]
On 27 Listopad 2011, 10:45, Gaëtan Allart wrote:
> Actually, this is because I changed sort_mem to 4 Mb as asked by Robert.
>
> I removed this setting..

This is exactly why it's recommended not to make any changes until it's reasonably certain what the cause is. Did increasing work_mem again solve the issues? Is the database running fine? If not, post the iotop output again so we can see what the impact was. And post the EXPLAIN ANALYZE of the query, so we can see if the estimates are OK.

Anyway, Robert was right - work_mem=128MB is rather large and should be decreased. Use something like a binary search to find the right value - start with 64MB and watch the log for temp file messages. If there are a lot of messages, choose the upper interval and set (64+128)/2 = 96MB; otherwise 32MB. And so on. That should lead you to the minimal workable work_mem value.

Tomas
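The binary search Tomas describes can be run without touching postgresql.conf by overriding work_mem per session while watching the server log for "temporary file" messages (assuming log_temp_files = 0 is set, as Robert suggested earlier). The values below are just the starting points from the thread:

```sql
-- Try the midpoint, run the workload, check the log for temp files...
SET work_mem = '64MB';
-- ...temp files logged? move up: (64+128)/2
SET work_mem = '96MB';
-- ...none logged? move down instead:
SET work_mem = '32MB';
```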