Thread: High SYS CPU - need advise
On 11/14/12 1:13 PM, Vlad wrote:
> Postgresql 9.1.6.
> Postgres usually has 400-500 connected clients, most of them are idle.
> Database is over 1000 tables (across 5 namespaces), taking ~150Gb on disk.

that's a really high client connection count for an 8 core system. I'd consider implementing a connection pool (like pgbouncer), and rewriting your client applications to connect, do a transaction, disconnect, so the actual number of postgres connections is much lower, say in the 16-48 range.

--
john r pierce                            N 37, W 122
santa cruz ca                            mid-left coast
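For illustration, a minimal pgbouncer setup along these lines might look like the sketch below; the database name, paths, and pool sizes are placeholders, not values from this thread:

```ini
; pgbouncer.ini sketch -- hypothetical names/paths, tune pool sizes to your load
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction       ; one server connection per active transaction
default_pool_size = 32        ; keeps real postgres backends in the 16-48 range
max_client_conn = 500         ; the 400-500 app clients connect here instead
```

Clients then connect to port 6432 instead of 5432; note that transaction pooling is incompatible with session-level features such as prepared statements.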
On 11/14/12 1:34 PM, Vlad wrote:
> thanks for your feedback. While implementing connection pooling would
> make resource utilization more efficient, I don't think it's the root
> of my problem. Most of the connected clients are at IDLE. When I do
>
> select * from pg_stat_activity where current_query not like '%IDLE%';
>
> I only see several active queries at any given time.

what about during these spikes?

--
john r pierce                            N 37, W 122
santa cruz ca                            mid-left coast
On Wed, Nov 14, 2012 at 4:08 PM, John R Pierce <pierce@hogranch.com> wrote:
> On 11/14/12 1:34 PM, Vlad wrote:
>> thanks for your feedback. While implementing connection pooling would
>> make resource utilization more efficient, I don't think it's the root
>> of my problem. Most of the connected clients are at IDLE. When I do
>>
>> select * from pg_stat_activity where current_query not like '%IDLE%';
>>
>> I only see several active queries at any given time.
>
> what about during these spikes?

Yeah -- if you are seeing a lot of queries pile up during these times,
it's time to explore a connection pooler, because it will keep system
load manageable during these situations.

things to check:
*) blocked queries (pg_locks/pg_stat_activity)
*) i/o wait; in particular, is the linux page cache flushing?
*) query stampede: we will want to measure TPS leading into the stall
and out of it.
*) anything else running on the box?
*) you have a large number of tables -- maybe this is statistics-file
related?
*) let's log checkpoints to see if there is a correlation with the stall
*) nice to have vmstat/iostat during/before/after the stall. for
example, a massive spike of context switches during the stall could
point to an o/s scheduler issue.

merlin
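The first item on that list can be scripted; below is a hedged sketch of a lock-waiter query against the 9.1-era catalogs (where pg_stat_activity still exposes procpid/current_query). The join is simplified and can miss some lock types, so treat it as a starting point rather than a complete blocker detector:

```sql
-- sketch: sessions waiting on a lock, paired with sessions holding a
-- conflicting granted lock on the same object (simplified join)
SELECT w.pid            AS waiting_pid,
       wa.current_query AS waiting_query,
       h.pid            AS holding_pid,
       ha.current_query AS holding_query
FROM pg_locks w
JOIN pg_stat_activity wa ON wa.procpid = w.pid
JOIN pg_locks h
  ON h.granted
 AND h.pid <> w.pid
 AND h.locktype = w.locktype
 AND h.relation      IS NOT DISTINCT FROM w.relation
 AND h.transactionid IS NOT DISTINCT FROM w.transactionid
JOIN pg_stat_activity ha ON ha.procpid = h.pid
WHERE NOT w.granted;
```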
On Thu, Nov 15, 2012 at 11:50 AM, Vlad <marchenko@gmail.com> wrote:
> there is no big spike of queries that causes this; queries come in at a
> relatively stable pace. It's just that the higher the rate of incoming
> queries, the more likely this is to happen. yes, when a stall happens,
> the active queries pile up - but that's the result of a stall (the
> server reacts slowly to a keypress, not to mention query execution),
> not the cause.
>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
>  r  b swpd   free  buff    cache  si so  bi  bo     in   cs us sy id wa st
>  1  0    0 279240 12016 14431964   0  0  32   0 197852 4299 15  9 76  0  0
>  4  0    0 225984 12024 14419696   0  0   0  64 197711 5158 11  9 79  1  0
>  0  0    0 260112 12024 14413636   0  0  48   0 196708 4618 17 10 73  0  0
>  6  0    0 233936 12024 14375784   0  0 104   0 179861 4884 19 17 64  0  0
> 30  0    0 224904 12024 14354812   0  0   8   0  51088 1205  9 86  5  0  0
> 72  0    0 239144 12024 14333852   0  0 144   0  45601  542  2 98  0  0  0
> 78  0    0 224840 12024 14328536   0  0   0   0  38732  481  2 94  5  0  0
> 22  1    0 219072 12032 14250652   0  0 136 100  47323 1231  9 90  1  0  0

hm. well, we can definitely rule out i/o. I reviewed your last posting,
and you said:

"Out of the top 50 processes in top, 48 of them are postmasters, one is
syslog, and one is psql. Each of the postmasters has a high %CPU, the
top ones being 80% and higher, the rest being anywhere between 30% -
60%. Would postmaster 'queries' that are running attribute to the sys
CPU usage, or should they be under the 'us' CPU usage?"

Is this still true? Can we capture strace from one of the high %
postmasters to see if there are any clues there? Maybe we've uncovered
some type of weird spinlock contention issue. How large is your
database (or at least the portion of it that's commonly used)? Would
you characterize your queries as mostly small lookups, scans, or a mix?

merlin
On Thu, Nov 15, 2012 at 2:20 PM, Vlad <marchenko@gmail.com> wrote:
> Merlin,
>
> this is not my report; it's probably from a thread that I referenced as
> having common symptoms. Here is info about my db:
>
> Postgresql 9.1.6.
> Postgres usually has 400-500 connected clients, most of them are idle.
> Database is over 1000 tables (across 5 namespaces), taking ~150Gb on disk.
> We have mostly select statements (joins across a few tables), using
> indexes and resulting in a small number of records returned.
>
> So mostly small lookups across a few tables joining by a primary key /
> index. Also, I just performed an experiment - I switched our app over
> to using a hot PITR standby instead of the master and it experienced
> the same problem. So since the PITR db can only perform read-only
> queries, there are no write locks (except maybe when the standby is
> replaying wal records from the master?); nevertheless SYS CPU jumped
> sky high.

yeah. ok, next steps:
*) can you confirm that a postgres process is using high cpu (according
to top) during stall time
*) if so, please strace that process and save some of the log
*) you're using a 'bleeding edge' kernel, so we must be suspicious of
a regression there, particularly in the scheduler.
*) I am suspicious of a spinlock issue. so, if we can't isolate the
problem, is running a hand-compiled postgres a possibility (for lock
stats)?
*) what is the output of this:
cat /proc/sys/vm/zone_reclaim_mode

merlin
On Thu, Nov 15, 2012 at 2:44 PM, Vlad <marchenko@gmail.com> wrote:
>> yeah. ok, next steps:
>> *) can you confirm that a postgres process is using high cpu
>> (according to top) during stall time
>
> yes, CPU is spread across a lot of postmasters
>
>   PID USER  PR NI  VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
> 29863 pgsql 20  0 3636m 102m 36m R 19.1  0.3 0:01.33 postmaster
> 30277 pgsql 20  0 3645m 111m 37m R 16.8  0.3 0:01.27 postmaster
> 11966 pgsql 20  0 3568m  22m 15m R 15.1  0.1 0:00.66 postmaster
>  8073 pgsql 20  0 3602m  60m 26m S 13.6  0.2 0:00.77 postmaster
> 29780 pgsql 20  0 3646m 115m 43m R 13.6  0.4 0:01.13 postmaster
> 11865 pgsql 20  0 3606m  61m 23m S 12.8  0.2 0:01.87 postmaster
> 29379 pgsql 20  0 3603m  70m 30m R 12.8  0.2 0:00.80 postmaster
> 29727 pgsql 20  0 3616m  77m 31m R 12.5  0.2 0:00.81 postmaster
>
>> *) if so, please strace that process and save some of the log
>
> https://dl.dropbox.com/u/109778/stall_postmaster.log

ok, excellent. reviewing the log, this immediately caught my eye:

recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
recvfrom(8, "\327\327\nl\231LD\211\346\243@WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
64, 0, NULL, NULL) = 64
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)
semop(41713721, {{2, 1, 0}}, 1) = 0
lseek(295, 0, SEEK_END) = 0
lseek(296, 0, SEEK_END) = 8192

this is definitely pointing to a spinlock issue. see: s_lock.c. what
you are seeing here is a backend getting caught in a spin loop via a
stuck spinlock. 0-9 = 10 times.
did you by any chance check the logs? did any backends restart? we are
looking for this:

elog(PANIC, "stuck spinlock (%p) detected at %s:%d",
     lock, file, line);

Anything else going on in the log? The way to enable lock status is
via the LWLOCK_STATS macro. But before doing anything, please check
the logs for error/panic.

merlin
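Checking the log for the PANIC is a one-liner; here is a runnable sketch (the log path and sample lines are made up for demonstration - in practice you would point grep at the real server log):

```shell
# create a throwaway sample log so the grep below runs as-is;
# substitute your actual postgres log file in practice
log=/tmp/demo_postgresql.log
printf '%s\n' \
  'LOG:  checkpoint starting: time' \
  'PANIC:  stuck spinlock (0x7f3a2c001234) detected at s_lock.c:243' > "$log"
grep -c 'stuck spinlock' "$log"
```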
On Thu, Nov 15, 2012 at 3:49 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> ok, excellent. reviewing the log, this immediately caught my eye:
>
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> [...]
> select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)
> semop(41713721, {{2, 1, 0}}, 1) = 0
>
> this is definitely pointing to a spinlock issue. see: s_lock.c. what
> you are seeing here is a backend getting caught in a spin loop via a
> stuck spinlock. 0-9 = 10 times. did you by any chance check the logs?
> did any backends restart? we are looking for this:
>
> elog(PANIC, "stuck spinlock (%p) detected at %s:%d",
>      lock, file, line);
>
> Anything else going on in the log?

ah, scratch that. reading the code again, s_lock.c allows for up to 1
second (that is, you can delay 1000 times), so the panic is unlikely to
be in the log. also, spinlock waits shouldn't increase by exactly 1000
us each time. investigating...

merlin
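The "exactly 1000 us" observation can be checked mechanically. Here is a small Python sketch that pulls the timeout sequence out of the strace excerpt quoted above (the regex assumes strace's default formatting of select()):

```python
import re

# the ten timed-out select() calls quoted from the strace log above
STRACE_SAMPLE = "\n".join(
    "select(0, NULL, NULL, NULL, {0, %d000}) = 0 (Timeout)" % n
    for n in (1, 1, 1, 2, 3, 4, 6, 7, 8, 9)
)

def select_timeouts_ms(log):
    """Return the microsecond timeout of each select() call, in milliseconds."""
    return [int(m.group(1)) // 1000
            for m in re.finditer(r"select\(0, NULL, NULL, NULL, \{0, (\d+)\}\)", log)]

print(select_timeouts_ms(STRACE_SAMPLE))  # → [1, 1, 1, 2, 3, 4, 6, 7, 8, 9]
```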
Merlin Moncure wrote:
> ok, excellent. reviewing the log, this immediately caught my eye:
>
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> [...]
> select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)
>
> this is definitely pointing to a spinlock issue.

I met Rik van Riel (Linux kernel hacker) recently and we chatted about
this briefly. He strongly suggested that we should consider using
futexes on Linux instead of spinlocks; the big advantage being that
futexes sleep instead of spinning when contention is high. That would
reduce the system load in this scenario.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 15, 2012 at 4:29 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I met Rik van Riel (Linux kernel hacker) recently and we chatted about
> this briefly. He strongly suggested that we should consider using
> futexes on Linux instead of spinlocks; the big advantage being that
> futexes sleep instead of spinning when contention is high. That would
> reduce the system load in this scenario.

Well, postgres spinlocks do that too, right? When we overflow
spins_per_delay we go to pg_usleep, which proxies to select() --
postgres spinlocks are a hybrid implementation. Moving to futexes is a
possible improvement (that's another discussion) in degenerate cases,
but I'm not sure that I've exactly zeroed in on the problem. Or am I
missing something?

What I've been scratching my head over is what code exactly would
cause an iterative sleep like the above. The code is here:

    pg_usleep(cur_delay * 1000L);

    /* increase delay by a random fraction between 1X and 2X */
    cur_delay += (int) (cur_delay *
                        ((double) random() / (double) MAX_RANDOM_VALUE) + 0.5);

    /* wrap back to minimum delay when max is exceeded */
    if (cur_delay > MAX_DELAY_MSEC)
        cur_delay = MIN_DELAY_MSEC;

...so cur_delay is supposed to increase in a nonlinear fashion. I've
looked around the sleep, usleep, and latch calls and as of yet haven't
found anything that advances the timeout just like that (yet; need to
do another pass). And we don't know for sure if this is directly
related to the OP's problem.

merlin
Merlin Moncure <mmoncure@gmail.com> writes:
> What I've been scratching my head over is what code exactly would
> cause an iterative sleep like the above. The code is here:
>
>     pg_usleep(cur_delay * 1000L);
>
>     /* increase delay by a random fraction between 1X and 2X */
>     cur_delay += (int) (cur_delay *
>                         ((double) random() / (double) MAX_RANDOM_VALUE) + 0.5);
>
>     /* wrap back to minimum delay when max is exceeded */
>     if (cur_delay > MAX_DELAY_MSEC)
>         cur_delay = MIN_DELAY_MSEC;

IIRC that coding is of relatively recent vintage. The OP is probably
running some older version that increased cur_delay differently.

regards, tom lane
Tom, I just checked the version I'm running (9.1.6), and the code is
quite similar (src/backend/storage/lmgr/s_lock.c):

    pg_usleep(cur_delay * 1000L);

#if defined(S_LOCK_TEST)
    fprintf(stdout, "*");
    fflush(stdout);
#endif

    /* increase delay by a random fraction between 1X and 2X */
    cur_delay += (int) (cur_delay *
                        ((double) random() / (double) MAX_RANDOM_VALUE) + 0.5);

    /* wrap back to minimum delay when max is exceeded */
    if (cur_delay > MAX_DELAY_MSEC)
        cur_delay = MIN_DELAY_MSEC;

-- vlad
On Thu, Nov 15, 2012 at 2:44 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)

This is not entirely inconsistent with the spinlock. Note that 1000
is repeated 3 times, and 5000 is missing. This might just be a
misleading random sample we got here. I've seen similar close spacing
in some simulations I've run.

It is not clear to me why we use a resolution of 1 msec here. If the
OS's implementation of select() eventually rounds to the nearest msec,
that is its business. But why do we have to lose intermediate
precision due to its decision?

Cheers, Jeff
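Jeff's point - that repeats and gaps are exactly what the randomized backoff produces - can be illustrated by simulating the s_lock.c logic quoted earlier. This is a Python sketch of that C code, with Python's RNG standing in for random()/MAX_RANDOM_VALUE:

```python
import random

MIN_DELAY_MSEC = 1
MAX_DELAY_MSEC = 1000

def spin_delays(seed, n=10):
    """Simulate n sleeps of s_lock.c's backoff: each delay grows to between
    1x and 2x its previous value, wrapping back to the minimum past the max."""
    rng = random.Random(seed)
    cur_delay = MIN_DELAY_MSEC
    delays = []
    for _ in range(n):
        delays.append(cur_delay)
        # increase delay by a random fraction between 1X and 2X
        cur_delay += int(cur_delay * rng.random() + 0.5)
        if cur_delay > MAX_DELAY_MSEC:
            cur_delay = MIN_DELAY_MSEC
    return delays

# small delays often repeat (the random increment rounds to zero) and some
# values get skipped, much like the observed 1,1,1,2,3,4,6,7,8,9 sequence
print(spin_delays(0))
```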
On Thu, Nov 15, 2012 at 6:07 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> This is not entirely inconsistent with the spinlock. Note that 1000
> is repeated 3 times, and 5000 is missing.
>
> This might just be a misleading random sample we got here. I've seen
> similar close spacing in some simulations I've run.

Yeah -- you're right, this is definitely a spinlock issue. Next steps:

*) in mostly-read workloads, we have a couple of known frequent
offenders, in particular the 'BufFreelistLock'. One way we can
influence that guy is to try significantly lowering/raising shared
buffers. So this is one thing to try.

*) failing that, the LWLOCK_STATS macro can be compiled in to give us
some information about the particular lock(s) we're binding on.
Hopefully it's a lwlock -- this will make diagnosing the problem
easier.

*) if we're not blocking on a lwlock, it's possibly a buffer pin
related issue? I've seen this before, for example on an index scan
that is dependent on a seq scan. This long thread:
"http://postgresql.1045698.n5.nabble.com/9-2beta1-parallel-queries-ReleasePredicateLocks-CheckForSerializableConflictIn-in-the-oprofile-td5709812i100.html"
has a lot of information about that case and deserves a review.

*) we can consider experimenting with futexes
(http://archives.postgresql.org/pgsql-hackers/2012-06/msg01588.php)
to see if things improve. This is dangerous, and could crash your
server/eat your data, so fair warning.

merlin
Merlin,

> Yeah -- you're right, this is definitely a spinlock issue. Next steps:
>
> *) in mostly-read workloads, we have a couple of known frequent
> offenders, in particular the 'BufFreelistLock'. One way we can
> influence that guy is to try significantly lowering/raising shared
> buffers. So this is one thing to try.

the server has 32gb with 3.2gb dedicated to shared buffers. I've
increased it to 13.2gb, and the stall still happened (I have a way of
controlling the number of queries hitting postgresql by shutting down
our own memcache-based app cache, so to test for a stall I temporarily
shut down a few memcached servers).

> *) failing that, the LWLOCK_STATS macro can be compiled in to give us
> some information about the particular lock(s) we're binding on.
> Hopefully it's a lwlock -- this will make diagnosing the problem
> easier.

I've enabled that macro, and I'm seeing lwlock messages flying by in
the log (see below), even when there is no high-sys-cpu stall observed
at the moment. Should I be looking for something in particular?

PID 17293 lwlock 5906: shacq 1 exacq 0 blk 0
PID 17293 lwlock 5932: shacq 1 exacq 0 blk 0
PID 17293 lwlock 5934: shacq 1 exacq 0 blk 0
PID 17293 lwlock 10854: shacq 1 exacq 0 blk 0
PID 17293 lwlock 10856: shacq 4 exacq 0 blk 0
PID 17293 lwlock 10858: shacq 15 exacq 0 blk 0
PID 17293 lwlock 16442: shacq 4 exacq 0 blk 0
PID 17293 lwlock 16596: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20458: shacq 3 exacq 0 blk 0
PID 17293 lwlock 20460: shacq 10 exacq 0 blk 0
PID 17293 lwlock 20464: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20466: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20480: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20482: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20484: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20508: shacq 12 exacq 0 blk 0
PID 17293 lwlock 20510: shacq 6 exacq 0 blk 0
PID 17293 lwlock 20938: shacq 11 exacq 0 blk 0
PID 17293 lwlock 20940: shacq 22 exacq 0 blk 0
PID 17293 lwlock 20942: shacq 28 exacq 0 blk 0
PID 17293 lwlock 20944: shacq 56 exacq 0 blk 0
PID 17293 lwlock 20946: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20948: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20950: shacq 11 exacq 0 blk 0
PID 17293 lwlock 20952: shacq 3 exacq 0 blk 0
PID 17293 lwlock 20954: shacq 7 exacq 0 blk 0
PID 17293 lwlock 20956: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20958: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20960: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20962: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20964: shacq 2 exacq 0 blk 0
PID 17293 lwlock 46624: shacq 24 exacq 0 blk 0
PID 17293 lwlock 68126: shacq 6 exacq 0 blk 0

> *) if we're not blocking on a lwlock, it's possibly a buffer pin
> related issue? [...]
>
> *) we can consider experimenting with futexes
> (http://archives.postgresql.org/pgsql-hackers/2012-06/msg01588.php)
> to see if things improve. This is dangerous, and could crash your
> server/eat your data, so fair warning.

-- vlad
On Fri, Nov 16, 2012 at 9:52 AM, Vlad <marchenko@gmail.com> wrote:
> the server has 32gb with 3.2gb dedicated to shared buffers. I've
> increased it to 13.2gb, and the stall still happened (I have a way of
> controlling the number of queries hitting postgresql by shutting down
> our own memcache-based app cache, so to test for a stall I temporarily
> shut down a few memcached servers).
>
> I've enabled that macro, and I'm seeing lwlock messages flying by in
> the log (see below), even when there is no high-sys-cpu stall observed
> at the moment. Should I be looking for something in particular?

We're looking for spikes in 'blk', which represents when lwlocks bump.
If you're not seeing any, then this suggests a buffer pin related
issue -- which is also supported by the fact that raising shared
buffers didn't help. If you're not seeing 'blk's, go ahead and
disable the stats macro.

So, what we need to know now is:

*) What happens when you drastically *lower* shared buffers? Say, to
64mb? Note, you may experience higher load for unrelated reasons and
have to scuttle the test. Also, if you have to crank it higher to
handle internal server structures, do that. This is a hail mary, but
maybe something interesting spits out.

*) How many specific query plans are needed to introduce the
condition? Hopefully it's not too many. If so, let's start gathering
the plans. If you have a lot of plans to sift through, one thing we
can attempt to eliminate noise is to tweak log_min_duration_statement
so that during stall times (only) it logs offending queries that are
unexpectedly blocking.

*) Approximately how big is your 'working set' -- the data your
queries are routinely hitting?

*) Is the distribution of the *types* of queries uniform? Or do you
have special processes that occur on intervals?

Thanks for your patience.

merlin
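For reference, the logging knobs mentioned above are plain postgresql.conf settings in 9.1 and only need a reload, not a restart; the values below are examples, not recommendations from the thread:

```
# postgresql.conf -- example values for the investigation
log_min_duration_statement = 500   # log statements slower than 500 ms
log_checkpoints = on               # correlate checkpoints with the stalls
log_line_prefix = '%t [%p] '       # timestamp + pid make correlation easier
```

followed by `pg_ctl reload` (or `SELECT pg_reload_conf();`).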
On Fri, Nov 16, 2012 at 8:21 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Fri, Nov 16, 2012 at 9:52 AM, Vlad <marchenko@gmail.com> wrote:
>> I've enabled that macro, and I'm seeing lwlock messages flying by in
>> the log (see below), even when there is no high-sys-cpu stall
>> observed at the moment. Should I be looking for something in
>> particular?
>
> We're looking for spikes in 'blk', which represents when lwlocks bump.

Unfortunately, LWLock contention is only a weak indication of spinlock
contention. For example, if backends are furiously acquiring and
releasing the same LWLock in shared mode and no one is taking it in
exclusive mode, then there can be high spinlock contention but there
will never be a blk.

In 9.3 there is a new field that tells how many spin delays there were
on the mutex that is behind each lock. That was commit
b79ab00144e64217d41; maybe he can port that back to his version.

But that only tells you about LWLock mutexes, not about all the other
ones in PG.

The attached patch logs every spin delay with where in the source it
comes from. I don't think it will have too much of a performance
impact, as the logs only happen when a sleep of at least 1ms is going
to happen anyway. But it could generate massive amounts of logs.

Cheers, Jeff
OK, so far I settled on excluding connection caching on the app side
(Apache::DBI and prepare_cached) from the equation and adding pgbouncer
as a counter-measure. This seems to stabilize the situation - at least
I'm no longer able to push the server into a high-sys-cpu stall the way
I used to. I'm still interested in getting to the bottom of the issue
vs alleviating it with a connection pooler, so I might set up a server
on CentOS 6.3 (kernel 2.6.32) and give it a try. Alternatively, if
anyone has ideas on playing with scheduler-related sysctl parameters -
please let me know.

-- vlad
On Fri, Nov 16, 2012 at 11:19 AM, Vlad <marchenko@gmail.com> wrote:
>> We're looking for spikes in 'blk', which represents when lwlocks
>> bump. If you're not seeing any, then this suggests a buffer pin
>> related issue [...]
>
> most blk comes with 0, some with 1, a few hitting 100. I can't say
> that during stall times the numbers of blk 0 vs blk non-0 are very
> different.

right. this is feeling more and more like a buffer pin issue. but even
then we can't be certain -- it could be a symptom, not the cause. to
prove it we need to demonstrate that everyone is spinning and waiting,
which we haven't done. classic spinlock contention manifests in high
user cpu. we are binding in the kernel, so I wonder if it's all the
select() calls. we haven't yet ruled out a kernel regression. If I
were you, I'd be investigating pgbouncer to see if your app is
compliant with transaction mode processing, if for no other reason
than it will mitigate high load issues.

>> *) How many specific query plans are needed to introduce the
>> condition? Hopefully it's not too many. If so, let's start gathering
>> the plans. [...]
>
> unfortunately, there are quite a few query plans... also, I don't
> think setting log_min_duration_statement will help us, because when
> the server is hitting a high load average, it reacts slowly even to a
> key press. So even non-offending queries will take long to execute. I
> see all sorts of queries being executed long during a stall, spanning
> from simple
>
> LOG: duration: 1131.041 ms statement: SELECT 'DBD::Pg ping test'
>
> to complex ones joining multiple tables.
>
> We are still looking into all the logged queries in an attempt to
> find the ones that are causing the problem; I'll report if we find
> any clues.

right.

merlin
On Fri, Nov 16, 2012 at 12:26 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Unfortunately, LWLock contention is only a weak indication of spinlock
> contention. For example, if backends are furiously acquiring and
> releasing the same LWLock in shared mode and no one is taking it in
> exclusive mode, then there can be high spinlock contention but there
> will never be a blk.

yup. this problem doesn't smell like a lwlock issue. typically there
the problem manifests as low cpu performance; everybody waits. classic
spinlock contention (at least from what I've seen) is very high
*userspace* cpu utilization and low work output. this time it's
different -- the OP is bogging in the kernel, so it's not impossible
we're troubleshooting the symptom, not the cause.

> In 9.3 there is a new field that tells how many spin delays there were
> on the mutex that is behind each lock. That was commit
> b79ab00144e64217d41; maybe he can port that back to his version.
>
> But that only tells you about LWLock mutexes, not about all the other
> ones in PG.
>
> The attached patch logs every spin delay with where in the source it
> comes from.

yeah, the OP should fire this off. good stuff. I'll bet lunch (if we
ever happen to meet) it's on buffer pin.

merlin
On Fri, Nov 16, 2012 at 2:13 PM, Vlad <marchenko@gmail.com> wrote: > ok, I've applied that patch and ran. The stall started around 13:50:45...50 > and lasted until the end > > https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log > > the actual log has more data (including statement following each 'spin > delay' record), but there is some sensitive info inside which I can't share > with public. first thoughts: no single thing really stands out -- contention is all over the place. lwlock, pinbuffer, dynahash (especially). I am again suspicious of bad scheduler interaction. any chance we can fire up pgbouncer? merlin
On Mon, Nov 19, 2012 at 12:02 PM, Vlad <marchenko@gmail.com> wrote: > > Some additional observation and food for thoughts. Our app uses connection > caching (Apache::DBI). By disabling Apache::DBI and forcing client > re-connection for every (http) request processed I eliminated the stall. The > user cpu usage jumped (mostly cause prepared sql queries are no longer > available, and some additional overhead on re-connection), but no single > case of high-sys-cpu stall. > > I can not completely rule out the possibility of some left-overs (unfinished > transaction?) remain after serving http request, which, in the absence of > connection caching, are discarded for sure.... well, that is good. note for pgbouncer transaction mode you would have had to quit using prepared statements anyways (see list of features here: http://wiki.postgresql.org/wiki/PgBouncer). you can use session mode though, and that would be a good test. faster speed could be because of application issues or maybe just spreading out the queries reduced contention. hard to say. well, what happens next is up to you -- we can spend more time chasing this down if you want, or you can hold back and see how things run for a while. what is your user load up to in the worst case? merlin
On Mon, Nov 19, 2012 at 10:50 AM, Vlad <marchenko@gmail.com> wrote: > I just did a little experiment: extracted top four queries that were > executed the longest during stall times and launched pgbench test with 240 > clients. Yet I wasn't able to put the server into a stall with that. Also > load average was hitting 120+, it was all user cpu, single digit % system. > The system remained pretty responsive (on a keypress), in contrary to when > high-sys-cpu stall happens, showing similar LA numbers. > > This makes me think that there is probably a different (than ones I tried) > query or condition that is responsible for creating high-sys-cpu. Also, from > my limited knowledge of postgresql and kernel internals, this experiment > puts under question idea that linux scheduler might be the corner stone.... > > Here is a longer log, first column is PID. High-sys-cpu stall happened > around 10:42:37 > https://dl.dropbox.com/u/109778/postgresql-2012-11-19_103425-stripped.log > > I'll try to put pgbouncer in place and repeat, also though this will take > longer time to setup. yeah. interesting -- contention was much higher this time and that changes things. strange how it was missed earlier. you're getting bounced around a lot in lwlock especially (unfortunately we don't know which one). I'm going to hazard another guess: maybe the trigger here is when the number of contending backends exceeds some critical number (probably based on the number of cores) you see a quick cpu spike (causing more backends to lock and pile up) as cache line bouncing sets in. That spike doesn't last long, because the spinlocks quickly accumulate delay counts then punt to the scheduler which is unable to cope. The exact reason why this is happening to you in exactly this way (I've never seen it) is unclear. Also the line between symptom and cause is difficult to draw. unfortunately, in your case spinlock re-scheduling isn't helping. 
log entries like this one: 18764 [2012-11-19 10:43:50.124 CST] LOG: JJ spin delay from file sinvaladt.c line 512 delay 212, pointer 0x7f514959a394 at character 29 are suggesting major problems. you're dangerously close to a stuck spinlock which is lights out for the database. merlin
On Sun, Nov 18, 2012 at 4:24 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Fri, Nov 16, 2012 at 12:13 PM, Vlad <marchenko@gmail.com> wrote: >> ok, I've applied that patch and ran. The stall started around 13:50:45...50 >> and lasted until the end >> >> https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log > > That isn't as much log as I expected. But I guess only the tip of the > iceberg gets logged, it could be that most spinlocks are contended but > then get acquired just before the sleep (and log) would have happened. > > I'm not sure how to figure out just how much spinning is going on > below the surface, but my gut feeling is there isn't enough of it to > explain the amount of slowdown you are seeing (and it would probably > be accounted for as user time rather than system time) yes. classic spinlock contention symptoms are high user cpu and almost no work getting done. something else is going on here? > Maybe we could whip up a dummy standalone program that uses PG's > spinlock code in a intentionally contended way and see how your > system reports resource usage for that. Maybe your system reports > fighting for cache lines or whatever is going on behind the scenes as > system time rather than user time, or something. I don't think this is the case. It's looking more and more like scheduler issues, although it's unclear if that is due to some kind of regression or just symptoms of backends piling up. I'm also starting to wonder if the spinlock contention we are seeing (which, according to your patch, really isn't all that much) is just symptomatic of scheduler issues. > It would be a little easier to figure out what was going on there if > the log included the pid (%p) so we can see if it is the same process > doing a bunch of consecutive sleeps, or a bunch of different > processes. Also, logging the millisecond (%m rather than %t) would be > helpful. 
> > But the way that the point of contention jumps around a lot points to > a scheduler issue rather than a postgres-specific issue. A backend > gets pre-empted while holding a spinlock, and then the kernel is > resistant to rescheduling it even though all other processes are piled > up behind it. But again, that theory would lead to more log lines > than we see, I would think. yeah. I asked for some longer strace logs but then everyone went home for the weekend. > Anyway, pgbouncer might help by ensuring that a connection gets > pre-empted by pgbouncer at a point where it has no locks, rather than > by the kernel at a point where it does hold locks. yes -- I think pgbouncer is the key here because it's going to tell us if the scheduler is the culprit here. If you look carefully at the strace logs, you'll see that system call times become unpredictable as they are getting scheduled out...particularly yielding calls like select(). So we need to determine if this is normal high load behavior or if there is an underlying mechanic. > Any chance you could try an older kernel and see if that reduces the problem? This is one thing to try. Personally I think pgbouncer/transaction mode is low hanging fruit if it 'drops in' to OP's app. If problem goes away this will pretty much point the finger at Linux AFAICT (in which case we can start messing around with the kernel). merlin
On Fri, Nov 16, 2012 at 12:13 PM, Vlad <marchenko@gmail.com> wrote: > ok, I've applied that patch and ran. The stall started around 13:50:45...50 > and lasted until the end > > https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log That isn't as much log as I expected. But I guess only the tip of the iceberg gets logged, it could be that most spinlocks are contended but then get acquired just before the sleep (and log) would have happened. I'm not sure how to figure out just how much spinning is going on below the surface, but my gut feeling is there isn't enough of it to explain the amount of slowdown you are seeing (and it would probably be accounted for as user time rather than system time) Maybe we could whip up a dummy standalone program that uses PG's spinlock code in an intentionally contended way and see how your system reports resource usage for that. Maybe your system reports fighting for cache lines or whatever is going on behind the scenes as system time rather than user time, or something. It would be a little easier to figure out what was going on there if the log included the pid (%p) so we can see if it is the same process doing a bunch of consecutive sleeps, or a bunch of different processes. Also, logging the millisecond (%m rather than %t) would be helpful. But the way that the point of contention jumps around a lot points to a scheduler issue rather than a postgres-specific issue. A backend gets pre-empted while holding a spinlock, and then the kernel is resistant to rescheduling it even though all other processes are piled up behind it. But again, that theory would lead to more log lines than we see, I would think. Anyway, pgbouncer might help by ensuring that a connection gets pre-empted by pgbouncer at a point where it has no locks, rather than by the kernel at a point where it does hold locks. Any chance you could try an older kernel and see if that reduces the problem?
One thing that does stand out is that it looks like partitioned hash tables in dynahash.c could benefit by having their freelist partitioned as well, assuming we can give up the hash_get_num_entries function. Cheers, Jeff
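Jeff's partitioned-freelist idea can be sketched as follows (a hypothetical design, not the actual dynahash.c code): each hash partition gets its own freelist and lock, so entry allocation stops serializing on one shared mutex. The trade-off he mentions is visible at the bottom: a total entry count must now be summed across partitions rather than read from a single counter, which is what hash_get_num_entries would have to give up.

```c
/* Illustrative sketch of per-partition freelists; names are invented. */
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define NUM_PARTITIONS 16

typedef struct Entry
{
    struct Entry *next;
    int payload;
} Entry;

typedef struct
{
    pthread_mutex_t lock;   /* per-partition, instead of one global mutex */
    Entry *freelist;
    long nentries;
} Partition;

static Partition parts[NUM_PARTITIONS];

void freelists_init(void)
{
    for (int i = 0; i < NUM_PARTITIONS; i++)
        pthread_mutex_init(&parts[i].lock, NULL);
}

Entry *entry_alloc(unsigned hashcode)
{
    Partition *p = &parts[hashcode % NUM_PARTITIONS];

    pthread_mutex_lock(&p->lock);
    Entry *e = p->freelist;
    if (e != NULL)
        p->freelist = e->next;      /* reuse from this partition's freelist */
    else
        e = malloc(sizeof(Entry));  /* freelist empty: allocate fresh */
    p->nentries++;
    pthread_mutex_unlock(&p->lock);
    return e;
}

void entry_free(unsigned hashcode, Entry *e)
{
    Partition *p = &parts[hashcode % NUM_PARTITIONS];

    pthread_mutex_lock(&p->lock);
    e->next = p->freelist;
    p->freelist = e;
    p->nentries--;
    pthread_mutex_unlock(&p->lock);
}

/* The cost of partitioning: counting entries is no longer one read. */
long entry_count(void)
{
    long total = 0;
    for (int i = 0; i < NUM_PARTITIONS; i++)
        total += parts[i].nentries;
    return total;
}
```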
On Fri, Nov 16, 2012 at 3:21 PM, Vlad <marchenko@gmail.com> wrote: > what would pgbouncer do in my case? Number of connections will decrease, but > number of active clients won't be smaller. As I understand the latter ones > are that important. Well, one thing that struck me was how little spinlock contention there actually was. Yeah, backends are delaying here and there, which isn't great, but a few dozen delays per second across several hundred backends doesn't seem like it should be pegging sys cpu%. This is all pointing to the problem not being in postgres, but in linux. pgbouncer would do two things: 1) perhaps guard you against some o/s issue 2) keep system more responsive during stall (since by controlling the pool you control the number of queries piling up). of course, this comes at the expense of some overhead. But, before going through all that, how about timing strace recorded calls (strace -T) both in stall and non-stall conditions. I was assuming select(), but maybe it's something else... for example the recently fixed lseek. merlin
On 11/16/2012 02:31 PM, Merlin Moncure wrote: > no single thing really stands out -- contention is all over the place. > lwlock, pinbuffer, dynahash (especially). I am again suspicious of > bad scheduler interaction. any chance we can fire up pgbouncer? Just want to throw it out there, but we've been having really bad luck with the scheduler recently. But only when we use 8GB (on our 72GB system) for shared_buffers. Cut that down to 4GB, and everything is fine and dandy. I think the kernel devs have added in some overzealous scheduler code on us. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-444-8534 sthomas@optionshouse.com ______________________________________________ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email
On Tue, Nov 20, 2012 at 9:02 AM, Shaun Thomas <sthomas@optionshouse.com> wrote: > On 11/16/2012 02:31 PM, Merlin Moncure wrote: > >> no single thing really stands out -- contention is all over the place. >> lwlock, pinbuffer, dynahash (especially). I am again suspicious of >> bad scheduler interaction. any chance we can fire up pgbouncer? > > > Just want to throw it out there, but we've been having really bad luck with > the scheduler recently. But only when we use 8GB (on our 72GB system) for > shared_buffers. Cut that down to 4GB, and everything is fine and dandy. > > I think the kernel devs have added in some overzealous scheduler code on us. Shared buffer manipulation changing contention is suggesting you're running into free list lock issues. How many active backends/cores? merlin
On 11/20/2012 10:03 AM, Merlin Moncure wrote: > Shared buffer manipulation changing contention is suggesting you're > running into free list lock issues. How many active backends/cores? Oh, the reason I wanted to point it out was that we see a lot more than CPU contention with higher shared_buffer settings. The scheduler starts aggressively clearing out the file cache *and* freeing memory, driving load way higher than it should be. That doesn't result in a very high iowait, but all that page swapping sure isn't free. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-444-8534 sthomas@optionshouse.com
On Tue, Nov 20, 2012 at 10:12 AM, Shaun Thomas <sthomas@optionshouse.com> wrote: > On 11/20/2012 10:03 AM, Merlin Moncure wrote: > >> Shared buffer manipulation changing contention is suggesting you're >> running into free list lock issues. How many active backends/cores? > > > Oh, the reason I wanted to point it out was that we see a lot more than CPU > contention with higher shared_buffer settings. The scheduler starts > aggressively clearing out the file cache *and* freeing memory, driving load > way higher than it should be. That doesn't result in a very high iowait, but > all that page swapping sure isn't free. have you ruled out numa issues? (http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html) merlin
On 11/20/2012 10:13 AM, Merlin Moncure wrote: > have you ruled out numa issues? > (http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html) Haha. Yeah. Our zone reclaim mode is off, and node distance is 10 or 20. ZCM is only enabled by default if distance is > 20, unless there's some kernel bug triggering it even when zone reclaim reports as off. I'll also note that my tests with interleave made no difference at all. At least not with pgbench. There was a small amount of flux with TPS, but the page swap storms came regardless of NUMA tweaks. The formula worked like this: High connection count + high shared_buffers = page swap storm. I'll note that 8GB -> 4GB immediately stopped the paging everywhere, and the OS went from using 13GB for active file cache, to 45GB. I can't see how PG would cause something like that by itself. I only piped in because it's very adversely affecting our CPU load in a similarly inexplicable-but-seemingly-the-scheduler way. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-444-8534 sthomas@optionshouse.com
On Thu, Nov 15, 2012 at 4:29 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Merlin Moncure escribió: > >> ok, excellent. reviewing the log, this immediately caught my eye: >> >> recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5 >> recvfrom(8, "\327\327\nl\231LD\211\346\243@WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"..., >> 64, 0, NULL, NULL) = 64 >> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) >> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) >> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) >> select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout) >> select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout) >> select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout) >> select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout) >> select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout) >> select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout) >> select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout) >> semop(41713721, {{2, 1, 0}}, 1) = 0 >> lseek(295, 0, SEEK_END) = 0 >> lseek(296, 0, SEEK_END) = 8192 >> >> this is definitely pointing to spinlock issue. > > I met Rik van Riel (Linux kernel hacker) recently and we chatted about > this briefly. He strongly suggested that we should consider using > futexes on Linux instead of spinlocks; the big advantage being that > futexes sleep instead of spinning when contention is high. That would > reduce the system load in this scenario. By the way, I think you (and Rik) are 100% right. We're slamming select() where futex has a much smarter wait mechanism. There are reports from users who have switched to pthreads and report better scalability. (https://groups.google.com/forum/?fromgroups=#!topic/pgsql.hackers/v8V82gznxs8) merlin
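The futex idea can be sketched like this (Linux-only, illustrative, not a proposed PostgreSQL patch; it is roughly the mutex from Ulrich Drepper's "Futexes Are Tricky" paper): a contended waiter announces itself and then sleeps in the kernel until the lock word changes, instead of looping on select()-based delays. The uncontended path is a single atomic operation with no syscall at all:

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(atomic_int *uaddr, int op, int val)
{
    /* glibc exposes no futex() wrapper; call it directly */
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* lock word: 0 = free, 1 = locked, 2 = locked with waiters */
void futex_lock(atomic_int *f)
{
    int c = 0;
    if (atomic_compare_exchange_strong(f, &c, 1))
        return;                       /* uncontended fast path, no syscall */
    if (c != 2)
        c = atomic_exchange(f, 2);    /* mark that someone is waiting */
    while (c != 0)
    {
        futex(f, FUTEX_WAIT, 2);      /* sleep until the word changes */
        c = atomic_exchange(f, 2);
    }
}

void futex_unlock(atomic_int *f)
{
    if (atomic_exchange(f, 0) == 2)   /* syscall only if someone waits */
        futex(f, FUTEX_WAKE, 1);      /* wake exactly one waiter */
}
```

The contrast with the strace above: instead of ten blind select() sleeps of growing length, a waiter makes one FUTEX_WAIT call and the kernel wakes it precisely when the holder releases.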
On Tue, Nov 20, 2012 at 8:03 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Tue, Nov 20, 2012 at 9:02 AM, Shaun Thomas <sthomas@optionshouse.com> wrote: >> On 11/16/2012 02:31 PM, Merlin Moncure wrote: >> >>> no single thing really stands out -- contention is all over the place. >>> lwlock, pinbuffer, dynahash (especially). I am again suspicious of >>> bad scheduler interaction. any chance we can fire up pgbouncer? >> >> >> Just want to throw it out there, but we've been having really bad luck with >> the scheduler recently. But only when we use 8GB (on our 72GB system) for >> shared_buffers. Cut that down to 4GB, and everything is fine and dandy. >> >> I think the kernel devs have added in some overzealous scheduler code on us. > > Shared buffer manipulation changing contention is suggesting you're > running into free list lock issues. I wouldn't expect so. Increasing shared_buffers should either fix free list lock contention, or leave it unchanged, not make it worse. Sounds more like that NUMA/interleave/zone reclaim stuff I could never wrap my head around. Cheers, Jeff
On Tue, Nov 20, 2012 at 10:50 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Tue, Nov 20, 2012 at 8:03 AM, Merlin Moncure <mmoncure@gmail.com> wrote: >> On Tue, Nov 20, 2012 at 9:02 AM, Shaun Thomas <sthomas@optionshouse.com> wrote: >>> On 11/16/2012 02:31 PM, Merlin Moncure wrote: >>> >>>> no single thing really stands out -- contention is all over the place. >>>> lwlock, pinbuffer, dynahash (especially). I am again suspicious of >>>> bad scheduler interaction. any chance we can fire up pgbouncer? >>> >>> >>> Just want to throw it out there, but we've been having really bad luck with >>> the scheduler recently. But only when we use 8GB (on our 72GB system) for >>> shared_buffers. Cut that down to 4GB, and everything is fine and dandy. >>> >>> I think the kernel devs have added in some overzealous scheduler code on us. >> >> Shared buffer manipulation changing contention is suggesting you're >> running into free list lock issues. > > I wouldn't expect so. Increasing shared_buffers should either fix > free list lock contention, or leave it unchanged, not make it worse. AIUI, that is simply not true (unless you raise it to the point you're not churning them). I'm looking at StrategyGetBuffer() for non-scan cases. It locks "BufFreelistLock" then loops the free list, and, if it finds nothing, engages a clock sweep. Both of those operations are dependent on the number of buffers being managed and so it's reasonable to expect some workloads to increase contention with more buffers. merlin
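The clock sweep Merlin describes can be modeled in a few lines (a simplification; the real StrategyGetBuffer in freelist.c also manages a freelist, skips pinned buffers, and holds BufFreelistLock while doing this). The hand advances, decrementing usage counts, until it finds a buffer with count zero, so the work done per allocation, while holding the lock, grows with the number of recently-used buffers:

```c
/* Illustrative model only; constants and names are invented. */
#define NBUFFERS 8

static int usage_count[NBUFFERS];   /* bumped on access, capped in real code */
static int next_victim;             /* the clock hand */

int clock_sweep_victim(void)
{
    for (;;)                        /* the real code bounds this loop */
    {
        int buf = next_victim;
        next_victim = (next_victim + 1) % NBUFFERS;

        if (usage_count[buf] == 0)
            return buf;             /* evictable: not used recently */
        usage_count[buf]--;         /* spare it for one more lap */
    }
}
```

With every buffer hot, one allocation can touch every buffer header before finding a victim, which is the mechanism behind Merlin's point that a larger pool can mean more contention, not less, for some workloads.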
On 11/20/2012 04:08 PM, Jeff Janes wrote: > Shaun Thomas reports one that is (I assume) not read intensive, but > his diagnosis is that this is a kernel bug where a larger > shared_buffers for no good reason causes the kernel to kill off its > page cache. We're actually very read intensive. According to pg_stat_statements, we regularly top out at 42k queries per second, and pg_stat_database says we're pushing 7k TPS. But I'm still sure this is a kernel bug. Moving from 4GB to 6GB or 8GB causes the kernel to cut the active page cache in half, in addition to freeing 1/4 of RAM to just sit around doing nothing. That in turn causes kswapd to work constantly, while our IO drivers work to undo the damage. It's a positive feedback loop: I can reliably drive the load up to 800+ on an 800-client pgbench with two threads, all while having 0% CPU free. Make that 4GB, and not only does the problem completely disappear, but the load settles down to around 9, and the machine becomes about 60% idle. Something in there is fantastically broken, but I can't point a finger at what. I was just piping in because, in absence of an obvious PG-related culprit, the problem could be the OS itself. It certainly was in our case. That, or PG has a memory leak that only appears at > 4GB of shared buffers. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-444-8534 sthomas@optionshouse.com
On Tue, Nov 20, 2012 at 2:26 PM, Shaun Thomas <sthomas@optionshouse.com> wrote: > On 11/20/2012 04:08 PM, Jeff Janes wrote: > >> Shaun Thomas reports one that is (I assume) not read intensive, but >> his diagnosis is that this is a kernel bug where a larger >> shared_buffers for no good reason causes the kernel to kill off its >> page cache. > > > We're actually very read intensive. Sorry, that is what I meant. I wrote "not write intensive", then decided it would be clearer to change "not write" to "read" and then forgot to remove the "not". I hate it when that happens. Atomic update commit failure in the meatware :) Cheers, Jeff
On 11/20/2012 04:35 PM, Jeff Janes wrote: > Atomic update commit failure in the meatware :) Ha. What's actually funny is that one of the affected machines started *swapping* earlier today. With 15GB free, and 12GB of inactive cache, and vm.swappiness set to 0, it somehow decided there was enough memory pressure to push 1.5GB of *something* into swap. Frankly, the kernel we're using is clearly deranged. There's really no other way to explain this. But we were blaming various elements of PG for several weeks following our upgrade. After further examination, our query profile isn't any different, and the load spikes were effectively random. All I can say is... be wary of the 3.2 kernel. It's... unique. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-444-8534 sthomas@optionshouse.com
On 21/11/12 11:41, Shaun Thomas wrote: > On 11/20/2012 04:35 PM, Jeff Janes wrote: > >> Atomic update commit failure in the meatware :) > > Ha. > > What's actually funny is that one of the affected machines started > *swapping* earlier today. With 15GB free, and 12GB of inactive cache, > and vm.swappiness set to 0, it somehow decided there was enough memory > pressure to push 1.5GB of *something* into swap. > > Frankly, the kernel we're using is clearly deranged. There's really no > other way to explain this. But we were blaming various elements of PG > for several weeks following our upgrade. After further examination, > our query profile isn't any different, and the load spikes were > effectively random. > > All I can say is... be wary of the 3.2 kernel. It's... unique. > You'll just have to upgrade to Microsoft! <Ducks, and runs away _VERY_ quickly!> Could you not install a later kernel? Cheers, Gavin
update on my problem: despite pgbouncer, the problem still occurs on my end. Also, interesting observation - I ran several tests with pgbench, using queries that I think are prone to trigger high-sys-cpu stalls. What I noticed is when pgbench is started with prepared mode, the system behaves fine during stress-test (high user cpu - 85-90%, low sys cpu - 5-7%), high TPS. Though when I used non-prepared modes, the sys cpu portion jumps to 40% (and tps drops dramatically as well, but this is understandable). The test queries are pretty long (10kb+), with a couple of outer joins across 1000-record tables with indexes. Maybe we are looking in the wrong place and the issue is somewhere within the planner/parser? Is there some extensive locking used in there? Another observation - it's harder to trigger a high-sys-cpu stall on a freshly restarted postgresql. Though if it was running for a while, then it's much easier to do. -- vlad On 11/19/12 8:33 AM, Merlin Moncure wrote: > On Sun, Nov 18, 2012 at 4:24 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> On Fri, Nov 16, 2012 at 12:13 PM, Vlad <marchenko@gmail.com> wrote: >>> ok, I've applied that patch and ran. The stall started around 13:50:45...50 >>> and lasted until the end >>> >>> https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log >> That isn't as much log as I expected. But I guess only the tip of the >> iceberg gets logged, it could be that most spinlocks are contended but >> then get acquired just before the sleep (and log) would have happened. >> >> I'm not sure how to figure out just how much spinning is going on >> below the surface, but my gut feeling is there isn't enough of it to >> explain the amount of slowdown you are seeing (and it would probably >> be accounted for as user time rather than system time) > > yes. classic spinlock contention symptoms are high user cpu and almost > work getting done. something else is going on here? 
> >> Maybe we could whip up a dummy standalone program that uses PG's >> spinlock code in a intentionally contended way and see how your >> system reports resource usage for that. Maybe your system reports >> fighting for cache lines or whatever is going on behind the scenes as >> system time rather than user time, or something. > I don't think this is the case. It's looking more and more like > scheduler issues, although it's unclear if that is due to some kind of > regression or just symptoms of backends piling up. I'm also starting > to wonder if the spinlock contention we are seeing (which according to > your patch really isn't all that much really) is just symptomatic of > scheduler issues. > >> It would be a little easier to figure out what was going on there if >> the log included the pid (%p) so we can see if it is the same process >> doing a bunch of consecutive sleeps, or a bunch of different >> processes. Also, logging the millisecond (%m rather than %t) would be >> helpful. >> >> But the way that the point of contention jumps around a lot points to >> a scheduler issue rather than a postgres-specific issue. A backend >> gets pre-empted while holding a spinlock, and then the kernel is >> resistant to rescheduling it even though all other processes are piled >> up behind it. But again, that theory would lead to more log lines >> than we see, I would think. > yeah. I asked for some longer strace logs but then everyone went home > for the weekend. > >> Anyway, pgbouncer might help by ensuring that a connection gets >> pre-empted by pgbouncer at a point where it has no locks, rather than >> by the kernel at a point where it does hold locks. > yes -- I think pgbouncer is the key here because it's going to tell us > if the scheduler is the culprit here. If you look carefully at the > strace logs, you'll see that system call times become unpredictable as > they are getting scheduled out...particularly yielding calls like > select(). 
So we need to determine if this is normal high load > behavior or if there is an underlying mechanic. > >> Any chance you could try an older kernel and see if that reduces the problem? > This is one thing to try. Personally I think pgbouncer/transaction > mode is low hanging fruit if it 'drops in' to OP's app. If problem > goes away this will pretty much point the finger at Linux AFAICT (in > which case we can start messing around with the kernel). > > merlin
On Wed, Nov 21, 2012 at 9:29 AM, Vlad Marchenko <marchenko@gmail.com> wrote: > update on my problem: despite pgbouncer, the problem still occures on my > end. > > Also, interesting observation - I ran several tests with pgbench, using > queries that I think are prone to trigger high-sys-cpu-stall. What I noticed > is when pgbench is started with prepared mode, the system behaves fine > during stress-test (high user cpu - 85-90%, low sys cpu - 5-7%), high TPS. > Though when I used non-prepared modes, the sys cpu portion jumps to 40% (and > tps drops dramatically as well, but this is understandable). The test > queries are pretty long (10kb+), with couple of outer joins across > 1000-record tables with indexes. > > Maybe, we are looking in a wrong place and the issue is somewhere within > planer/parser? Is there some extensive locking used in there? > > Another observation - it's harder to trigger high-sys-cpu stall on a freshly > restarted postgresql. Though if it was running for a while, then it's much > easier to do. what pgbouncer mode, and how large is your pool. merlin
On Wed, Nov 21, 2012 at 7:29 AM, Vlad Marchenko <marchenko@gmail.com> wrote: > update on my problem: despite pgbouncer, the problem still occures on my > end. As Merlin asked, how big is the pool? Maybe you are using a large enough pool so as to defeat the purpose of restricting the number of connections. > Also, interesting observation - I ran several tests with pgbench, using > queries that I think are prone to trigger high-sys-cpu-stall. What I noticed > is when pgbench is started with prepared mode, the system behaves fine > during stress-test (high user cpu - 85-90%, low sys cpu - 5-7%), high TPS. > Though when I used non-prepared modes, the sys cpu portion jumps to 40% (and > tps drops dramatically as well, but this is understandable). The test > queries are pretty long (10kb+), with couple of outer joins across > 1000-record tables with indexes. Could you sanitize the queries (and some statements to generate dummy data) enough to share? > > Maybe, we are looking in a wrong place and the issue is somewhere within > planer/parser? Is there some extensive locking used in there? I don't think the locking is particular extensive, but it could be enough extra to drive something over the edge. But it would be the same nature of locking as elsewhere (spinlocks and lwlocks), so it doesn't really change the nature of the problem, which is still "Why do these user-space locks turn into high SYS cpu?" > Another observation - it's harder to trigger high-sys-cpu stall on a freshly > restarted postgresql. Though if it was running for a while, then it's much > easier to do. Maybe the long running time has built up enough resource usage to cause the kernel scheduler to get into a snit, so it decides to preempt the process while it holds a spinlock, and then refuses to run it again for a while. Cheers, Jeff
On Wed, Nov 21, 2012 at 12:17 PM, Vlad <marchenko@gmail.com> wrote: > It turned out we can't use transaction mode, cause there are prepared > statements used a lot within the code, while processing a single http request. prepared statements can be fudged within some constraints. if prepared statements are explicitly named via PREPARE, you can simply prepare them all on server connection via the connect_query setting and disable the manual preparation. you then change the server_reset_query so that they are not discarded. some basic experimentation might confirm if this is a viable strategy. automatic protocol-level statements can be an issue though. > Also, I can't 100% rule out that there won't be any long running > (statistical) queries launched (even though such requests should not come to > this database), which would occupy a connection for a longer time, but do not > create any race condition... So having the pool size at 8 may be too slim. there are a number of simple tricks to deal with this: 1) move long running queries to their own pool (by changing the login user or connection string) 2) bypass pgbouncer in those cases 3) increase the pool size merlin
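[Editor's sketch] Merlin's trick could look roughly like this in pgbouncer.ini. The database name, statement name, and SQL are invented placeholders; `connect_query` and `server_reset_query` are real pgbouncer settings, but whether this covers a driver's protocol-level prepares needs the experimentation he mentions:

```ini
[databases]
; connect_query runs once on each new server connection, so every pooled
; backend starts out with the same named prepared statements:
appdb = host=127.0.0.1 port=5432 dbname=appdb connect_query='PREPARE fetch_user (int) AS SELECT * FROM users WHERE id = $1'

[pgbouncer]
listen_port = 6432
pool_mode = transaction
; the default reset query (DISCARD ALL) would drop the prepared statements
; between clients, so it has to be relaxed - weigh the state-leakage risk:
server_reset_query =
```

The application then issues `EXECUTE fetch_user(42)` instead of preparing the statement itself.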
On Wed, Nov 21, 2012 at 11:56 AM, Vlad <marchenko@gmail.com> wrote: > ok, understood. > I need to give some more thought to whether it's possible for us to switch to > transaction mode from the app standpoint. > > if yes, then does setting the pool size to 20 (for 8 cores) sound OK? If it were me, I would start with exactly 8 and tune from there. Don't forget: each database role gets its own pool. merlin
On Tue, Nov 20, 2012 at 9:05 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Tue, Nov 20, 2012 at 10:50 AM, Jeff Janes <jeff.janes@gmail.com> wrote: >> >> I wouldn't expect so. Increasing shared_buffers should either fix >> free list lock contention, or leave it unchanged, not make it worse. > > AIUI, that is simply not true (unless you raise it to the point you're > not churning them). I'm looking at StrategyGetBuffer() for non-scan > cases. It locks "BufFreelistLock" then loops the free list, and, if > it finds nothing, engages a clock sweep. The freelist should never loop. It is written as a loop, but I think there is currently no code path which ends up with valid buffers being on the freelist, so that loop will never, or at least rarely, execute more than once. > Both of those operations are > dependent on the number of buffers being managed and so it's > reasonable to expect some workloads to increase contention with more > buffers. The clock sweep can depend on the number of buffers being managed in a worst-case sense, but I've never seen any evidence (nor analysis) that this worst case can be obtained in reality on an ongoing basis. By constructing two pathological workloads which get switched between, I can get the worst case to happen, but when it does happen the consequences are mild compared to the amount of time needed to set up the necessary transition. In other words, the worst case can't be triggered often enough to make a meaningful difference. Cheers, Jeff
On Tue, Nov 20, 2012 at 12:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Tue, Nov 20, 2012 at 9:05 AM, Merlin Moncure <mmoncure@gmail.com> wrote: >> On Tue, Nov 20, 2012 at 10:50 AM, Jeff Janes <jeff.janes@gmail.com> wrote: >>> >>> I wouldn't expect so. Increasing shared_buffers should either fix >>> free list lock contention, or leave it unchanged, not make it worse. >> >> AIUI, that is simply not true (unless you raise it to the point you're >> not churning them). I'm looking at StrategyGetBuffer() for non-scan >> cases. It locks "BufFreelistLock" then loops the free list, and, if >> it finds nothing, engages a clock sweep. > > The freelist should never loop. It is written as a loop, but I think > there is currently no code path which ends up with valid buffers being > on the freelist, so that loop will never, or at least rarely, execute > more than once. > >> Both of those operations are >> dependent on the number of buffers being managed and so it's >> reasonable to expect some workloads to increase contention with more >> buffers. > > The clock sweep can depend on the number of buffers being managed in a > worst-case sense, but I've never seen any evidence (nor analysis) that > this worst case can be obtained in reality on an ongoing basis. By > constructing two pathological workloads which get switched between, I > can get the worst case to happen, but when it does happen the > consequences are mild compared to the amount of time needed to set up > the necessary transition. In other words, the worst case can't be > triggered often enough to make a meaningful difference. Yeah, good points; but (getting off topic here): there have been several documented cases of lowering shared_buffers improving performance under contention... the 'worst case' might be happening more than expected. In particular, what happens when a substantial percentage of the buffer pool is set with a non-zero usage count? This seems unlikely, but possible?
Take note:

    if (buf->refcount == 0)
    {
        if (buf->usage_count > 0)
        {
            buf->usage_count--;
            trycounter = NBuffers;  /* emphasis */
        }

ISTM time spent here isn't bounded except that as more time is spent sweeping (more backends are thus waiting and not marking pages) the usage counts decrease faster until you hit steady state. A smaller buffer pool naturally would help in that scenario as your usage counts would drop faster. It strikes me as cavalier to be resetting trycounter while sitting under the #1 known contention point for read-only workloads. Shouldn't SBF() work on an advisory basis and try to force a buffer after N failed usage count attempts? merlin
On Tue, Nov 20, 2012 at 12:00 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Tue, Nov 20, 2012 at 12:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> >> The freelist should never loop. It is written as a loop, but I think >> there is currently no code path which ends up with valid buffers being >> on the freelist, so that loop will never, or at least rarely, execute >> more than once. >> >>> Both of those operations are >>> dependent on the number of buffers being managed and so it's >>> reasonable to expect some workloads to increase contention with more >>> buffers. >> >> The clock sweep can depend on the number of buffers being managed in a >> worst-case sense, but I've never seen any evidence (nor analysis) that >> this worst case can be obtained in reality on an ongoing basis. By >> constructing two pathological workloads which get switched between, I >> can get the worst case to happen, but when it does happen the >> consequences are mild compared to the amount of time needed to set up >> the necessary transition. In other words, the worst case can't be >> triggered often enough to make a meaningful difference. > > Yeah, good points; but (getting off topic here): there have been > several documented cases of lowering shared_buffers improving > performance under contention... the 'worst case' might be happening > more than expected. The ones that I am aware of (mostly Greg Smith's case studies) have been for write-intensive workloads and are related to writes/fsyncs getting gummed up. Shaun Thomas reports one that is (I assume) not read intensive, but his diagnosis is that this is a kernel bug where a larger shared_buffers for no good reason causes the kernel to kill off its page cache. From the kernel's perspective, the freelist lock doesn't look any different from any other lwlock, so I doubt that this issue is related to the freelist lock.
> In particular, what happens when a substantial
> percentage of the buffer pool is set with a non-zero usage count?

The current clock sweep algorithm is an extraordinary usage-count decrementing machine. From what I know, the only way to get much more than half of the buffers to a non-zero usage count is for the clock sweep to rarely run (in which case, it is hard for it to be the bottleneck if it rarely runs), or for most of the buffer cache to be pinned simultaneously.

> This seems unlikely, but possible? Take note:
>
> if (buf->refcount == 0)
> {
>     if (buf->usage_count > 0)
>     {
>         buf->usage_count--;
>         trycounter = NBuffers; /* emphasis */
>     }
>
> ISTM time spent here isn't bounded except that as more time is spent
> sweeping (more backends are thus waiting and not marking pages) the
> usage counts decrease faster until you hit steady state.

But that is a one-time thing. Once you hit the steady state, how do you get away from it again, such that a large amount of work is needed again?

> A smaller
> buffer pool naturally would help in that scenario as your usage counts
> would drop faster.

They would drop at the same rate in absolute numbers, barring the smaller buffer cache fitting entirely in the on-board CPU cache. They would drop faster in percentage terms, but they would also increase faster in percentage terms once a candidate is found and a new page read into it.

> It strikes me as cavalier to be resetting
> trycounter while sitting under the #1 known contention point for read-only
> workloads.

The only use for the trycounter is to know when to ERROR out with "no unpinned buffers available", so not resetting it seems entirely wrong. I would contest "the #1 known contention point" claim. We know that the freelist lock is a point of contention under certain conditions, but we (or at least I) also know that it is the mere acquisition of this lock, and not the work done while it is held, that is important.
If I add a spurious "LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE); LWLockRelease(BufFreelistLock);" to each execution of StrategyGetBuffer, then contention kicks in twice as fast. But if I instead hack the clock sweep to run twice as far (ignore the first eligible buffer it finds, and go find another one) but all under the cover of a single BufFreelistLock acquisition, there was no meaningful increase in contention. This was all on a 4 socket x 2 core/socket opteron machine which I no longer have access to. Using a more modern 8 core on a single socket, I can't get it to collapse on BufFreelistLock at all, presumably because the cache coherence mechanisms are so much faster. > Shouldn't SBF() work on advisory basis and try to > force a buffer after N failed usage count attempts? I believe Simon tried that a couple commit-fests ago, and no one could show that it made a difference. Cheers, Jeff
On Tue, Nov 20, 2012 at 4:08 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> It strikes me as cavalier to be resetting >> trycounter while sitting under the #1 known contention point for read-only >> workloads. > > The only use for the trycounter is to know when to ERROR out with "no > unpinned buffers available", so not resetting it seems entirely > wrong. > > I would contest "the #1 known contention point" claim. We know that > the freelist lock is a point of contention under certain conditions, > but we (or at least I) also know that it is the mere acquisition of > this lock, and not the work done while it is held, that is important. Well, I'm speculating (without evidence) that that might not in fact always be true. It certainly is hard to demonstrate with synthetic testing though. I guess the first step would be to whip up a diagnostic patch to confirm malfeasance in suspicious cases. merlin
On Wed, Nov 21, 2012 at 9:05 AM, Vlad <marchenko@gmail.com> wrote: > it's session mode and the pool size is 1200 (cause I need to guarantee that in > the worst case we have enough slots for all possible clients), Wouldn't the clients prefer to wait 100ms to get a connection if that means their query finishes in 100ms, rather than waiting 0ms to get a connection but have the query take 1000ms to finish? > however even > at the times preceding the high-sys-cpu stall, the number of postmasters is like > 15-20. When a stall happens, it starts to rise, but that's the result of the DB > not answering in a timely manner. That is like the "beatings will continue until morale improves" joke. The theory about why pgbouncer would be beneficial is that it prevents hundreds of connections from clogging up the server. By allowing hundreds of connections to be made when the server gets into distress, you are increasing that distress and making it harder to recover from it. That is the theory, anyway. Cheers, Jeff
On Wed, Nov 21, 2012 at 11:05 AM, Vlad <marchenko@gmail.com> wrote: > it's session mode and the pool size is 1200 (cause I need to guarantee that in > the worst case we have enough slots for all possible clients), however even > at the times preceding the high-sys-cpu stall, the number of postmasters is like > 15-20. When a stall happens, it starts to rise, but that's the result of the DB > not answering in a timely manner. That is completely defeating the point. The mode should be transaction mode, with somewhere between 1-2x the number of cores for the pool size. Note the # of connected clients does not have to equal the size of the pool -- that's the whole point of pgbouncer -- it forces the queries through a fixed number of backend processes. merlin
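[Editor's sketch] For concreteness, the setup Merlin is advocating might look roughly like this in pgbouncer.ini (names are placeholders; the point is that `default_pool_size` stays near 1-2x the core count while `max_client_conn` still admits every client):

```ini
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_port = 6432
pool_mode = transaction
default_pool_size = 16    ; ~2x the 8 cores: only this many backends run
max_client_conn = 1200    ; all clients can stay connected; excess queue
```

Clients that cannot immediately get a server connection simply wait in pgbouncer's queue rather than adding load to a struggling server.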
On Wed, Nov 21, 2012 at 10:43 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Wed, Nov 21, 2012 at 7:29 AM, Vlad Marchenko <marchenko@gmail.com> wrote: > >> update on my problem: despite pgbouncer, the problem still occures on my >> end. > > As Merlin asked, how big is the pool? Maybe you are using a large > enough pool so as to defeat the purpose of restricting the number of > connections. also, what mode. merlin
On Wed, Nov 21, 2012 at 8:14 AM, Vlad <marchenko@gmail.com> wrote: > > > '-M prepared' produces normal results, while '-M simple' results in 40% sys > cpu. '-M extended' is somewhere in between. > I'm running it as 60 clients, 2 threads. 2 threads is pretty low for 60 clients. What happens if you increase -j to either 30 or 60? (What I'm wondering here is if perhaps in this case it is pgbench which is triggering the scheduling issue) Also, what happens if you channel pgbench through pgbouncer? Cheers, Jeff
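[Editor's sketch] The runs being compared would look something like the following (database name and duration are made up; the last line assumes pgbouncer listening on its default port 6432):

```
# one worker thread per client, per Jeff's suggestion
pgbench -c 60 -j 60 -T 120 -M simple   testdb
pgbench -c 60 -j 60 -T 120 -M prepared testdb

# the same load funneled through pgbouncer
pgbench -h 127.0.0.1 -p 6432 -c 60 -j 60 -T 120 -M simple testdb
```

With -j 2, two pgbench threads multiplex all 60 client connections, which can itself become a bottleneck or a scheduling artifact; -j 60 removes that variable.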
Vlad wrote: > it's session mode and the pool size is 1200 (cause I need to > guarantee that in the worst case we have enough slots for all > possible clients), We found that the real-world production performance of a web application servicing millions of web hits per day with thousands of concurrent users improved when we reconfigured our database connection pool to be about 35 instead of 55, on a 16 core box with a 40 drive RAID. http://wiki.postgresql.org/wiki/Number_Of_Database_Connections -Kevin
On 25/11/12 09:30, Kevin Grittner wrote: > Vlad wrote: > >> it's session mode and the pool size is 1200 (cause I need to >> guarantee that in the worst case we have enough slots for all >> possible clients), > We found that the real-world production performance of a web > application servicing millions of web hits per day with thousands of > concurrent users improved when we reconfigured our database > connection pool to be about 35 instead of 55, on a 16 core box with a > 40 drive RAID. > > http://wiki.postgresql.org/wiki/Number_Of_Database_Connections > > -Kevin > > Curious, what is your RAID configuration? Cheers, Gavin
Gavin Flower wrote: >> We found that the real-world production performance of a web >> application servicing millions of web hits per day with thousands >> of concurrent users improved when we reconfigured our database >> connection pool to be about 35 instead of 55, on a 16 core box >> with a 40 drive RAID. > Curious, what is your RAID configuration? We have the OS on RAID 1 on one RAID controller, WAL on another RAID 1 array on its own controller, and the PostgreSQL data directory on RAID 5 on a pair of controllers which each has a path to every drive. (Don't ask me how that works, I just trust the hardware guys to get that part right.) The active portion of the database is cached, the read-to-write ratio is about 10 to 1, and the controllers all have BBU cache with write-back. It is actually one pool of about 30 for the read-only web app with so many hits, and a separate pool of 6 database connections for logical replication from about 80 source databases handling an OLTP load of 3000 directly connected users. We were seeing some performance degradation at peak loads, and improved both user response time and overall throughput by reducing the database connections used by the connection pool. Peak load was handled much better that way. I cringe every time I see someone say they need a large number of database connections in order to support that many users. Having one database connection for each user is, in my experience, guaranteed to make things fall apart under load. -Kevin
Thanks for a very comprehensive answer! (I almost simply asked what 'type' of RAID you had; I am glad I rephrased that!) I learnt far more than I expected, which is good - it more than merely satisfied my curiosity!
I use RAID-6; but in development, reliability (for me, I've had 5 hard disk crashes in 20 years) is orders of magnitude more important than performance, with rare exceptions!
I can understand the mentality of wanting one connection per user, though I fully know why it is the wrong approach - I would also want to use connection pooling in your situation.
Back in the 1980s I was a mainframe COBOL analyst/programmer and just starting to use databases. Initially I thought of a transaction in terms of the user wanting to ensure their display and update were in the same transaction. It took me a while to fully appreciate and accept that transactions should be short to protect the database while not blocking other users too much (multi-user access was also a whole new concept), and that keeping a transaction open for user actions was not the 'done thing'! It requires a whole new change of mindset – something a lot of people find difficult, as it conflicts with part of their implicit belief system (just try discussing even the most obvious flaws in Creationism with a Believer!).
Now I'm quite comfortable with the idea of multi-user and am much more database centric while still appreciating the needs of users. Now my weapons of choice are Java & Linux, with Postgres being my preferred database.
Cheers,
Gavin
On 11/24/2012 3:17 PM, Gavin Flower wrote: > Curious, what is your RAID configuration? RAID10 -- vlad