Discussion: weird performance problem
Hi all,

We have been operating a 1.5 GB PostgreSQL database for a year, and for nearly a month we have been having problems. Usually everything is fine: queries execute quickly even when they are complicated, but sometimes, for about half an hour, we see a general slowdown.

The server is a dedicated quad Xeon with 4 GB of RAM, a RAID 1 array for the system and a RAID 10 array for the PostgreSQL data. We have very few updates/inserts/deletes during the day. The PostgreSQL version is 7.4.8.

- the database is vacuumed and analyzed regularly (but we are not using autovacuum) and usually performs pretty well;
- I/O is OK and the database fits entirely in RAM (see the top and iostat output below);
- CPUs are usually 25% idle; load never really grows and its maximum is below 5;
- I attached two plans for a simple query: one from when the server is fast, the other from during a slowdown. It is exactly the same plan, but as you can see, the time to fetch the first row from the indexes is quite high in the slow case;
- during a slowdown, queries that usually take 500 ms can take up to 60 seconds (and sometimes even more);
- we have up to 130 permanent connections from our Apache servers during a slowdown, as we have a lot of Apache children waiting for a response.

I also attached vmstat output. Context switches are quite high, but I'm not sure it qualifies as a context switch storm, or that it is the cause of our problem.

Thanks for any advice or idea to help us understand this problem and hopefully solve it.

Regards,

--
Guillaume

[root@bd root]# iostat 10

Device:     tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda        7.20         0.00        92.00          0        920
sda1       0.00         0.00         0.00          0          0
sda2       6.40         0.00        78.40          0        784
sda3       0.00         0.00         0.00          0          0
sda4       0.00         0.00         0.00          0          0
sda5       0.00         0.00         0.00          0          0
sda6       0.80         0.00        13.60          0        136
sdb        5.00         0.00       165.60          0       1656
sdb1       5.00         0.00       165.60          0       1656

Device:     tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda        1.30         0.00        20.80          0        208
sda1       0.00         0.00         0.00          0          0
sda2       0.70         0.00         9.60          0         96
sda3       0.00         0.00         0.00          0          0
sda4       0.00         0.00         0.00          0          0
sda5       0.00         0.00         0.00          0          0
sda6       0.60         0.00        11.20          0        112
sdb        5.40         0.00       173.60          0       1736
sdb1       5.40         0.00       173.60          0       1736

Device:     tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda        2.20         0.00        28.00          0        280
sda1       0.00         0.00         0.00          0          0
sda2       2.20         0.00        28.00          0        280
sda3       0.00         0.00         0.00          0          0
sda4       0.00         0.00         0.00          0          0
sda5       0.00         0.00         0.00          0          0
sda6       0.00         0.00         0.00          0          0
sdb        5.20         0.00       171.20          0       1712
sdb1       5.20         0.00       171.20          0       1712

Plan during a fast period:

                          QUERY PLAN
----------------------------------------------------------------------
 Nested Loop  (cost=0.00..13.52 rows=2 width=1119) (actual time=0.154..0.167 rows=1 loops=1)
   ->  Index Scan using pk_newslang on newslang nl  (cost=0.00..3.87 rows=1 width=1004) (actual time=0.053..0.055 rows=1 loops=1)
         Index Cond: (((codelang)::text = 'FRA'::text) AND (3498704 = numnews))
   ->  Nested Loop Left Join  (cost=0.00..9.61 rows=2 width=119) (actual time=0.050..0.059 rows=1 loops=1)
         ->  Index Scan using pk_news on news n  (cost=0.00..3.31 rows=2 width=98) (actual time=0.021..0.023 rows=1 loops=1)
               Index Cond: (numnews = 3498704)
         ->  Index Scan using pk_photo on photo p  (cost=0.00..3.14 rows=1 width=25) (actual time=0.021..0.025 rows=1 loops=1)
               Index Cond: (p.numphoto = "outer".numphoto)
 Total runtime: 0.320 ms
(9 rows)

Plan during a slowdown:

                          QUERY PLAN
----------------------------------------------------------------------
 Nested Loop  (cost=0.00..13.52 rows=2 width=1119) (actual time=155.286..155.305 rows=1 loops=1)
   ->  Index Scan using pk_newslang on newslang nl  (cost=0.00..3.87 rows=1 width=1004) (actual time=44.575..44.579 rows=1 loops=1)
         Index Cond: (((codelang)::text = 'FRA'::text) AND (3498704 = numnews))
   ->  Nested Loop Left Join  (cost=0.00..9.61 rows=2 width=119) (actual time=110.648..110.660 rows=1 loops=1)
         ->  Index Scan using pk_news on news n  (cost=0.00..3.31 rows=2 width=98) (actual time=0.169..0.174 rows=1 loops=1)
               Index Cond: (numnews = 3498704)
         ->  Index Scan using pk_photo on photo p  (cost=0.00..3.14 rows=1 width=25) (actual time=110.451..110.454 rows=1 loops=1)
               Index Cond: (p.numphoto = "outer".numphoto)
 Total runtime: 155.514 ms
(9 rows)

top output:

 17:08:41  up 19 days, 15:16,  1 user,  load average: 4.03, 4.26, 4.36
288 processes: 285 sleeping, 3 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total   59.0%    0.0%    8.8%   0.2%     0.0%    0.0%   31.9%
           cpu00   52.3%    0.0%   13.3%   0.9%     0.0%    0.0%   33.3%
           cpu01   65.7%    0.0%    7.6%   0.0%     0.0%    0.0%   26.6%
           cpu02   58.0%    0.0%    7.6%   0.0%     0.0%    0.0%   34.2%
           cpu03   60.0%    0.0%    6.6%   0.0%     0.0%    0.0%   33.3%
Mem:  3857224k av, 3495880k used,  361344k free,       0k shrd,   92160k buff
      2374048k actv,  463576k in_d,   37708k in_c
Swap: 4281272k av,   25412k used, 4255860k free                  2173392k cached

[root@bd root]# vmstat 10
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 4  0  25412 250160  91736 2170944    0    1     3     3    4     3  5  3  3  0
 7  0  25412 247160  91736 2171024    0    0     0   116  588  4483 65  4 31  0
 3  0  25412 235456  91752 2171132    0    0     0   129  491  3670 70  4 26  0
 5  0  25412 233696  91760 2171216    0    0     0   152  530  4768 61  4 34  0
 5  0  25412 233248  91768 2171232    0    0     0   183  624  5379 59  5 36  0
 9  0  25412 195332  91788 2171304    0    0     0   127  541  4811 58  5 37  0
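Since the slowdowns come and go, continuously logging timestamped vmstat/iostat samples makes the bad half-hours much easier to analyse after the fact. A minimal sketch of such a sampler (the log path and the 30 s interval are arbitrary choices, not anything from this setup):

#!/bin/sh
# Append a timestamped snapshot of system activity every 30 seconds.
while true; do
    echo "=== $(date) ==="
    vmstat 1 2 | tail -1    # second line is the current 1 s sample
    iostat 1 2              # second report is the current 1 s sample
    sleep 30
done >> /var/log/perf-snapshots.log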
On Thu, Nov 17, 2005 at 06:47:09PM +0100, Guillaume Smet wrote:
> queries execute quickly even when they are complicated, but
> sometimes, for about half an hour, we see a general slowdown.

Is it exactly half an hour? What changes at the time that happens (i.e. what else happens on the machine)? Is this a time, for example, when logrotate is killing your I/O with file moves?

A

--
Andrew Sullivan | ajs@crankycanuck.ca
I remember when computers were frustrating because they *did* exactly
what you told them to. That actually seems sort of quaint now.
        --J.D. Baldwin
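A quick way to test the logrotate hypothesis is to compare the slowdown windows against the machine's cron schedule. A sketch, assuming the stock Red Hat locations for cron and the logrotate state file:

# When do cron-driven jobs (including logrotate, via cron.daily) fire?
cat /etc/crontab
ls -l /etc/cron.hourly /etc/cron.daily /etc/cron.weekly
# logrotate records when it last rotated each log:
cat /var/lib/logrotate.status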
Andrew,

Andrew Sullivan wrote:
> Is it exactly half an hour? What changes at the time that happens
> (i.e. what else happens on the machine)? Is this a time, for
> example, when logrotate is killing your I/O with file moves?

No, it's not exactly half an hour. It just slows down for some time (10, 20, 30 minutes) and then is OK again. It happens several times a day. I checked whether other processes were running during the slowdowns, but that is not the case.

There is not really any difference between the good and bad periods (apart from the number of connections, because the db is too slow): load stays at 4 or 5, iowait stays at 0%, there is still idle CPU, and we still have free memory. I can't figure out what the limiting factor is, or why there is idle CPU left over.

I forgot to give our non-default postgresql.conf parameters:

shared_buffers = 28800
sort_mem = 32768
vacuum_mem = 32768
max_fsm_pages = 350000
max_fsm_relations = 2000
checkpoint_segments = 16
effective_cache_size = 270000
random_page_cost = 2

Thanks for your help

--
Guillaume
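Several of these settings use different units (8 kB pages vs. kB), so the absolute sizes are easy to misread. A quick conversion, using 7.4's units and the default 8 kB block size:

# shared_buffers and effective_cache_size are counted in 8 kB pages;
# sort_mem and vacuum_mem are in kB.
echo "shared_buffers:       $((28800 * 8 / 1024)) MB"       # 225 MB
echo "effective_cache_size: $((270000 * 8 / 1024)) MB"      # ~2109 MB
echo "sort_mem:             $((32768 / 1024)) MB per sort"  # 32 MB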
> I forgot to give our non-default postgresql.conf parameters:
> shared_buffers = 28800
> sort_mem = 32768
> vacuum_mem = 32768
> max_fsm_pages = 350000
> max_fsm_relations = 2000
> checkpoint_segments = 16
> effective_cache_size = 270000
> random_page_cost = 2

Isn't sort_mem quite high? Remember that sort_mem is allocated for each sort, not for each connection. Mine is 4096 (4 MB). My effective_cache_size is set to 27462.

What OS are you running?

regards
Claus
On Fri, Nov 18, 2005 at 12:35:06AM +0100, Guillaume Smet wrote:
> sort_mem = 32768

I would be very suspicious of that much memory for sort. Please see the docs for what that does. That is the amount that _each sort_ can allocate before spilling to disk. If some set of your users are causing complicated queries with, say, four sorts apiece, then each user is potentially allocating 4x that much memory. That's going to wreak havoc on your disk buffers (which are tricky to monitor on most systems, and impossible on some).

This'd be the first knob I'd twiddle, for sure.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
It is above all style through which power defers to reason.
        --J. Robert Oppenheimer
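To put numbers on that warning using figures from this thread (130 connections during a slowdown, sort_mem = 32768 kB, and Andrew's hypothetical four sorts per query):

# Worst-case sort memory under those assumptions: more than four
# times the 4 GB in the machine.
echo "$((130 * 4 * 32)) MB"   # 16640 MB, roughly 16 GB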
"Guillaume Smet" <guillaume.smet@openwide.fr> wrote > [root@bd root]# iostat 10 > > Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn > sda 7.20 0.00 92.00 0 920 > sda1 0.00 0.00 0.00 0 0 > sda2 6.40 0.00 78.40 0 784 > sda3 0.00 0.00 0.00 0 0 > sda4 0.00 0.00 0.00 0 0 > sda5 0.00 0.00 0.00 0 0 > sda6 0.80 0.00 13.60 0 136 > sdb 5.00 0.00 165.60 0 1656 > sdb1 5.00 0.00 165.60 0 1656 > > Nested Loop (cost=0.00..13.52 rows=2 width=1119) (actual > time=155.286..155.305 rows=1 loops=1) > -> Index Scan using pk_newslang on newslang nl (cost=0.00..3.87 rows=1 > width=1004) (actual time=44.575..44.579 rows=1 loops=1) > Index Cond: (((codelang)::text = 'FRA'::text) AND (3498704 = > numnews)) > -> Nested Loop Left Join (cost=0.00..9.61 rows=2 width=119) (actual > time=110.648..110.660 rows=1 loops=1) > -> Index Scan using pk_news on news n (cost=0.00..3.31 rows=2 > width=98) (actual time=0.169..0.174 rows=1 loops=1) > Index Cond: (numnews = 3498704) > -> Index Scan using pk_photo on photo p (cost=0.00..3.14 rows=1 > width=25) (actual time=110.451..110.454 rows=1 loops=1) > Index Cond: (p.numphoto = "outer".numphoto) > Total runtime: 155.514 ms > Someone is doing a massive *write* at this time, which makes your query *read* quite slow. Can you find out which process is doing write? Regards, Qingqing
Andrew,

> I would be very suspicious of that much memory for sort. Please see
> the docs for what that does. That is the amount that _each sort_ can
> allocate before spilling to disk. If some set of your users are
> causing complicated queries with, say, four sorts apiece, then each
> user is potentially allocating 4x that much memory. That's going to
> wreak havoc on your disk buffers (which are tricky to monitor on
> most systems, and impossible on some).

Yes, we do indeed have complicated queries; that's why we set sort_mem so high. I'll see if we can set it lower for the next few days, and whether that improves our performance.

Thanks for your help.

--
Guillaume
Qingqing Zhou wrote:
> Someone is doing a massive *write* at this time, which makes your
> *reads* quite slow. Can you find out which process is doing the
> writing?

The indexes should be in memory, so I wouldn't expect a massive write to slow down the select queries. sdb is the RAID 10 array dedicated to our data, so the postgresql process is the only one writing to it. I'll check which write queries are running, because there should be very few updates/inserts on our db during the day.

From a four-day log analysis, I have the following counts:

SELECT  403,964
INSERT      574
UPDATE      393
DELETE       26

So a massive write during the day would not be normal.

Thanks for your help

--
Guillaume
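Counts like these can be reproduced straight from the server log. A sketch, assuming log_statement = true and a plain-file log named serverlog (both assumptions); 7.4 writes lines of the form "LOG:  statement: SELECT ...":

# Tally statement types over the retained log.
grep -ioE 'statement: *(select|insert|update|delete)' serverlog \
  | awk '{print toupper($2)}' | sort | uniq -c | sort -rn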
If I understand your HW config correctly, all of the pg stuff is on the same RAID 10 set? If so, give WAL its own dedicated RAID 10 set. This looks like the old problem of everything stalling while WAL is being committed to HD.

This concept works for other tables as well. If you have tables that both want service at the same time, disk arm contention will drag performance through the floor when they are on the same HW set. Profile your HD access and put tables that want to be accessed at the same time on different HD sets, even if you have to buy more HW to do it.

Ron

At 04:56 AM 11/22/2005, Guillaume Smet wrote:
>The indexes should be in memory, so I wouldn't expect a massive write
>to slow down the select queries. sdb is the RAID 10 array dedicated
>to our data, so the postgresql process is the only one writing to it.
>I'll check which write queries are running, because there should be
>very few updates/inserts on our db during the day.
>
>From a four-day log analysis, I have the following counts:
>SELECT  403,964
>INSERT      574
>UPDATE      393
>DELETE       26
>
>So a massive write during the day would not be normal.
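If checkpoint stalls are the problem, 7.4's checkpoint_warning parameter is a cheap way to spot them, with the caveat that it only flags checkpoints triggered by xlog traffic. A sketch (the log file name is an assumption):

# In postgresql.conf, warn when checkpoints happen less than 30 s apart:
#     checkpoint_warning = 30
# After a slow period, look for the warning in the log:
grep -i 'checkpoints are occurring too frequently' serverlog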
Ron wrote:
> If I understand your HW config correctly, all of the pg stuff is on
> the same RAID 10 set?

No, the system and the WAL are on a RAID 1 array and the data is on its own RAID 10 array. As I said earlier, there are only a few writes to the database, so I'm not really sure the WAL can be a limitation: IIRC, it's only used for writes, isn't it? Wouldn't you expect some iowait if the database were waiting on the WAL? We _never_ have any iowait on this server, yet our CPUs are still 30-40% idle. A typical top on this server looks like:

 15:22:39  up 24 days, 13:30,  2 users,  load average: 3.86, 3.96, 3.99
156 processes: 153 sleeping, 3 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total   50.6%    0.0%    4.7%   0.0%     0.6%    0.0%   43.8%
           cpu00   47.4%    0.0%    3.1%   0.3%     1.5%    0.0%   47.4%
           cpu01   43.7%    0.0%    3.7%   0.0%     0.5%    0.0%   51.8%
           cpu02   58.9%    0.0%    7.7%   0.0%     0.1%    0.0%   33.0%
           cpu03   52.5%    0.0%    4.1%   0.0%     0.1%    0.0%   43.0%
Mem:  3857224k av, 3307416k used,  549808k free,       0k shrd,   80640k buff
      2224424k actv,  482552k in_d,   49416k in_c
Swap: 4281272k av,   10032k used, 4271240k free                  2602424k cached

As you can see, we don't swap, we have free memory, and we have all our data cached (our database size is 1.5 GB). Context switches are between 10,000 and 20,000 per second.

> This concept works for other tables as well. If you have tables that
> both want service at the same time, disk arm contention will drag
> performance through the floor when they are on the same HW set.
> Profile your HD access and put tables that want to be accessed at
> the same time on different HD sets, even if you have to buy more HW
> to do it.

I use iostat and I can only see a little write activity and no read activity on both RAID arrays.

--
Guillaume
Claus and Andrew,

Claus Guttesen wrote:
> Isn't sort_mem quite high? Remember that sort_mem is allocated for
> each sort, not for each connection. Mine is 4096 (4 MB). My
> effective_cache_size is set to 27462.

I tested sort_mem from 4096 to 32768 (4096, 8192, 16384, 32768) this afternoon, and 32768 is definitely the best value for us. We still have free memory with it, we don't swap, and queries are generally faster (I log all queries taking more than 500 ms, and the log grows far faster with lower values of sort_mem).

> What OS are you running?

# cat /etc/redhat-release
CentOS release 3.6 (Final)

so it's a RHEL 3 update 6.

# uname -a
Linux our.host 2.4.21-37.ELsmp #1 SMP Wed Sep 28 14:05:46 EDT 2005 i686 i686 i386 GNU/Linux

# cat /proc/cpuinfo   (4 identical CPUs; last entry shown)
processor       : 3
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) MP CPU 2.20GHz
stepping        : 6
cpu MHz         : 2192.976
cache size      : 512 KB

# cat /proc/meminfo
        total:    used:    free:  shared: buffers:   cached:
Mem:  3949797376 3478376448 471420928       0 83410944 2679156736
Swap: 4384022528    9797632 4374224896
MemTotal:      3857224 kB
MemFree:        460372 kB
MemShared:           0 kB
Buffers:         81456 kB
Cached:        2610764 kB

HTH

Regards,

--
Guillaume
At 09:26 AM 11/22/2005, Guillaume Smet wrote:
>Ron wrote:
>>If I understand your HW config correctly, all of the pg stuff is on
>>the same RAID 10 set?
>
>No, the system and the WAL are on a RAID 1 array and the data is on
>its own RAID 10 array.

As has been noted many times around here, put the WAL on its own dedicated HDs. You don't want any head movement on those HDs.

>As I said earlier, there are only a few writes to the database, so
>I'm not really sure the WAL can be a limitation: IIRC, it's only used
>for writes, isn't it?

When you reach a WAL checkpoint, pg commits WAL data to HD... and does almost nothing else until that commit is done.

>Wouldn't you expect some iowait if the database were waiting on the
>WAL? We _never_ have any iowait on this server, yet our CPUs are
>still 30-40% idle.

_Something_ is doing long bursts of write I/O on sdb and sdb1 every 30 minutes or so according to your previous posts. Profile your DBMS and find out what.

>A typical top on this server looks like:
> [top output snipped; see previous message]
>
>As you can see, we don't swap, we have free memory, and we have all
>our data cached (our database size is 1.5 GB). Context switches are
>between 10,000 and 20,000 per second.

That's actually a reasonably high CS rate. Again, why?

>>This concept works for other tables as well. If you have tables that
>>both want service at the same time, disk arm contention will drag
>>performance through the floor when they are on the same HW set.
>>Profile your HD access and put tables that want to be accessed at
>>the same time on different HD sets, even if you have to buy more HW
>>to do it.
>
>I use iostat and I can only see a little write activity and no read
>activity on both RAID arrays.

Remember it's not just the overall amount, it's _when_ and _where_ the write activity takes place. If you have almost no write activity, but whenever it happens it all happens to the same place, with multiple things contending for the same HDs, your performance during that time will be poor. Since the behavior you are describing fits that cause very well, I'd see if you can verify that's what's going on.

Ron
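A minimal version of that profiling is to sample, with timestamps, which processes are blocked on disk while the write bursts hit sdb. A sketch (the log path is arbitrary; state D in ps output means uninterruptible sleep, almost always disk wait; if iostat -x is not supported on this kernel, plain iostat still gives per-device tps):

#!/bin/sh
# Every 10 s, record extended device stats and any D-state processes,
# so write bursts on sdb can be tied to a specific process.
while true; do
    date
    iostat -x 1 1
    ps -eo pid,stat,wchan:25,args | awk '$2 ~ /D/'
    sleep 10
done >> /var/log/io-trace.log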
Ron,

First of all, thanks for your time.

> As has been noted many times around here, put the WAL on its own
> dedicated HDs. You don't want any head movement on those HDs.

Yep, I know that. We just assumed it was not so important for a nearly read-only database, which is wrong according to you.

> _Something_ is doing long bursts of write I/O on sdb and sdb1 every
> 30 minutes or so according to your previous posts.

It's not every 30 minutes. It's a 20-30 minute slowdown, 3-4 times a day, when we have a high load. Anyway, apart from this temporary problem, we have idle CPU all day long even though we never have any iowait (and nearly no writes) and the server is loaded enough to use all 4 CPUs. I don't think that's normal.

It's not a very good idea in general, but do you think we can set fsync to off for a few minutes to check whether the WAL really is our problem? A simple reload of the configuration seems to take it into account, so can we disable it temporarily even while the database is running? If so, I think we'll test it, and if it solves our problem we'll ask our customer to buy two new disks for a dedicated RAID 1 array for pg_xlog.

> That's actually a reasonably high CS rate. Again, why?

I'm not so surprised, considering what I have read about Xeon multiprocessors, pg 7.4 and the famous context switch storm. We are planning a migration to 8.1 to (hopefully) solve this problem. Perhaps our problem is due to that high CS rate.

--
Guillaume
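For what it's worth, fsync appears to be a reload-time (SIGHUP-context) setting in 7.4, which matches the observation above that a reload takes it into account. A sketch of the experiment, with the obvious caveat that fsync = false risks data loss if the machine crashes during the window (the data directory path and database name are assumptions):

# 1. In postgresql.conf:  fsync = false
# 2. Tell the postmaster to re-read its configuration:
pg_ctl reload -D /var/lib/pgsql/data
# 3. Confirm the running value:
psql -d mydb -c "SHOW fsync;"
# 4. Revert (fsync = true) the same way when the test window is over.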
At 10:26 AM 11/22/2005, Guillaume Smet wrote:
>Ron,
>
>First of all, thanks for your time.

Happy to help.

>>As has been noted many times around here, put the WAL on its own
>>dedicated HDs. You don't want any head movement on those HDs.
>
>Yep, I know that. We just assumed it was not so important for a
>nearly read-only database, which is wrong according to you.

It's just good practice with pg that pg_xlog should always get its own dedicated HD set. OTOH, I'm not at all convinced, given the scant evidence so far, that it is the primary problem here; particularly since, if I understand you correctly, pg_xlog is not on sdb or sdb1 where you are having the write storm.

>>_Something_ is doing long bursts of write I/O on sdb and sdb1 every
>>30 minutes or so according to your previous posts.
>
>It's not every 30 minutes. It's a 20-30 minute slowdown, 3-4 times a
>day, when we have a high load.

Thanks for the correction, and I apologize for the misunderstanding. Clearly the first step is to instrument sdb and sdb1 so that you understand exactly what is being accessed and written on them.

Possibilities that come to mind:

a) Are some of your sorts requiring more than 32 MB during high load? If sorts do not usually require HD temp files and suddenly do, you could get behavior like yours.

b) Are you doing too many 32 MB sorts during high load? Same comment as above.

c) Are you doing some sort of data updates or appends during high load that do not normally occur?

d) Are you constantly doing "a little" write I/O that turns into a write storm under high load because of queuing issues?

Put in the scaffolding needed to trace _exactly_ what's happening on sdb and sdb1 throughout the day, and then examine the traces over a day, a few days, and a week. I'll bet you will notice some patterns that will be helpful in identifying what's going on.

Ron
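If (a) or (b) is what's happening, the spills should be directly visible in the filesystem. A sketch, assuming PGDATA is /var/lib/pgsql/data and that this 7.4 install keeps sort temp files in per-database pgsql_tmp directories (an assumption worth verifying against the actual on-disk layout):

# Any files appearing here during a slow period indicate a sort that
# exceeded sort_mem and spilled to disk.
watch -n 5 'ls -l /var/lib/pgsql/data/base/*/pgsql_tmp/ 2>/dev/null'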