Re: write scalability
From: Merlin Moncure
Subject: Re: write scalability
Date:
Msg-id: CAHyXU0x6OibNccfxvFEDivh1NXY0XZ3sqLr5Y_ZONfHeRJDD0w@mail.gmail.com
In reply to: write scalability (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: write scalability (Alvaro Herrera <alvherre@commandprompt.com>)
List: pgsql-hackers
On Mon, Jul 25, 2011 at 3:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I've long harbored a suspicion, based on some testing I did on my home
> machine, that WALInsertLock is a big performance bottleneck. But I
> just did some benchmarking that doesn't entirely support that
> contention. This is on Nate Boley's 32-core machine, with the
> following settings:
>
> max_connections = 100
> shared_buffers = 8GB
> synchronous_commit = off
> checkpoint_segments = 100
> checkpoint_timeout = 15min
> checkpoint_completion_target = 0.9
>
> I did 5-minute pgbench runs with unlogged tables and with permanent
> tables, restarting the database server and reinitializing the tables
> between each run. The number at the beginning of each line is the
> number of clients, while the p/u indicates which type of tables were
> used:
>
> 01p tps = 628.626815 (including connections establishing)
> 01p tps = 635.394288 (including connections establishing)
> 01p tps = 634.972789 (including connections establishing)
> 08p tps = 3342.787325 (including connections establishing)
> 08p tps = 3883.876813 (including connections establishing)
> 08p tps = 3941.253567 (including connections establishing)
> 32p tps = 5597.472192 (including connections establishing)
> 32p tps = 5738.139895 (including connections establishing)
> 32p tps = 5794.490934 (including connections establishing)
> 80p tps = 4499.685286 (including connections establishing)
> 80p tps = 4917.060441 (including connections establishing)
> 80p tps = 5050.931933 (including connections establishing)
>
> 01u tps = 672.469142 (including connections establishing)
> 01u tps = 671.256686 (including connections establishing)
> 01u tps = 670.421003 (including connections establishing)
> 08u tps = 4087.749529 (including connections establishing)
> 08u tps = 3797.750851 (including connections establishing)
> 08u tps = 4181.393560 (including connections establishing)
> 32u tps = 8956.346905 (including connections establishing)
> 32u tps = 8898.442517 (including connections establishing)
> 32u tps = 8971.591569 (including connections establishing)
> 80u tps = 7287.550952 (including connections establishing)
> 80u tps = 7266.816989 (including connections establishing)
> 80u tps = 7255.968109 (including connections establishing)
>
> The speed-up from using unlogged tables was not as large as I
> expected. Turning off synchronous_commit here removes commit rate as
> the bottleneck, and I think perhaps the main benefit of unlogged
> tables in that case is the avoidance of I/O, and apparently this
> machine has enough I/O bandwidth, and just enough memory in general,
> that that's not an issue.
>
> With either type of tables, the 8 client results are about 6.1 times
> the single core results - not great, but not terrible, either. With
> 32 clients, there is some improvement: 13.4x vs. 9.1x, but even 13.4x
> is a long way from linear. vmstat reveals that CPU usage is
> substantially less than 100%. After some investigation, I found that
> using unlogged tables wasn't actually getting rid of all the
> write-ahead logging - the commit records were still being issued. So
> I hacked up RecordTransactionCommit() not to emit transaction commit
> records in that case. That doesn't actually completely eliminate the
> WAL activity, because it still emits records for zeroing new SLRU
> pages for CLOG, but it takes a big chunk out of it. The upshot is
> that this improved both raw performance and scalability, but not
> dramatically.
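A quick way to sanity-check the scaling figures quoted above is to take the median tps of each group and divide by the single-client median. The standalone snippet below does just that; the constants are the medians of the permanent- and unlogged-table runs quoted above, and the exact multipliers vary a little depending on which runs you average, but they land close to the 6.1x and 13.4x-vs-9.1x figures.

    /* Scaling ratios computed from the median tps values quoted above. */
    #include <stdio.h>

    int main(void)
    {
        /* median tps per client count: permanent (p) and unlogged (u) tables */
        double p1 = 634.972789, p8 = 3883.876813, p32 = 5738.139895;
        double u1 = 671.256686, u8 = 4087.749529, u32 = 8956.346905;

        printf("permanent: 8 clients = %.1fx, 32 clients = %.1fx\n",
               p8 / p1, p32 / p1);
        printf("unlogged:  8 clients = %.1fx, 32 clients = %.1fx\n",
               u8 / u1, u32 / u1);
        return 0;
    }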
> Unlogged table results, with this change:
>
> 01h tps = 708.189337 (including connections establishing)
> 01h tps = 704.030175 (including connections establishing)
> 01h tps = 701.644199 (including connections establishing)
> 08h tps = 5196.615955 (including connections establishing)
> 08h tps = 5126.162200 (including connections establishing)
> 08h tps = 5067.568727 (including connections establishing)
> 32h tps = 10661.275636 (including connections establishing)
> 32h tps = 10621.085226 (including connections establishing)
> 32h tps = 10575.267197 (including connections establishing)
> 80h tps = 7557.965666 (including connections establishing)
> 80h tps = 7545.697547 (including connections establishing)
> 80h tps = 7556.379921 (including connections establishing)
>
> Now the 32-client numbers have risen to 15.1x the single-client
> numbers, but that's still not great.
>
> What does this mean? Well, XLogInsert does strike me as an awfully
> complex piece of code to be running with a hot LWLock held in
> exclusive mode. But even so, I think this certainly means that
> WALInsertLock, at least on this workload, is not the whole problem...
> in this test, I'm not only eliminating the overhead of inserting the
> WAL, but also the overhead of writing it, flushing it, and generating
> it. So there is something, other than WAL insertion, which is taking
> a big bite out of performance here.
>
> As to what that something might be, I reran this last test with
> LWLOCK_STATS enabled and here are the top things that are blocking:
>
> lwlock 310: shacq 96846 exacq 108433 blk 16275
> lwlock 3: shacq 64 exacq 2628381 blk 36027
> lwlock 7: shacq 0 exacq 2628615 blk 85220
> lwlock 11: shacq 84913908 exacq 4539551 blk 2119423
> lwlock 4: shacq 28667307 exacq 2628524 blk 3356651
>
> During this 5-minute test run, an LWLock acquisition blocked 6180335
> times. As you can see from the above results, ProcArrayLock accounts
> for 54% of that blocking, and CLogControlLock accounts for another
> 34%. lwlock 7 is WALInsertLock, which manages to account for more
> than 1% of the blocking despite the fact that WAL has been largely
> eliminated in this test; lwlock 3 is XidGenLock. I dunno what LWLock
> 310 is but it doesn't matter. After that there is a long tail of
> other stuff with gradually decreasing amounts of blocking. Experience
> with the read scalability stuff has taught me also to look at which
> LWLocks have the most shared acquisitions, as that can cause spinlock
> and cache line contention. The top few culprits are:
>
> lwlock 504: shacq 5315030 exacq 0 blk 0
> lwlock 45: shacq 5967317 exacq 13284 blk 1722
> lwlock 39: shacq 8219988 exacq 13342 blk 2291
> lwlock 5: shacq 26306020 exacq 0 blk 0
> lwlock 4: shacq 28667307 exacq 2628524 blk 3356651
> lwlock 11: shacq 84913908 exacq 4539551 blk 2119423
>
> In all, there were 238777533 shared LWLock acquisitions during this
> test: 35% CLogControlLock, 12% ProcArrayLock, 11% SInvalReadLock (soon
> to be dealt with, as discussed elsewhere on-list), and then it gets
> down into the lock manager locks and a few others.

Hm, all the CLogControlLock acquisitions in clog.c appear to be
exclusive... or did you mean shared in some other sense?
TransactionIdGetStatus is taking an exclusive lock, which is a red flag
and a good optimization target, I think.

merlin
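For anyone decoding the fixed lwlock numbers in the stats above: they are indexes into the LWLockId enum near the top of src/include/storage/lwlock.h. The standalone sketch below assumes the 9.1-era ordering of that enum (reconstructed from memory, so verify against your tree); the IDs called out above (3, 4, 5, 7, and 11) come out as XidGenLock, ProcArrayLock, SInvalReadLock, WALInsertLock, and CLogControlLock, matching the identifications given in the quoted text.

    /* Approximate mirror of the start of the LWLockId enum, assuming the
     * 9.1-era ordering; check src/include/storage/lwlock.h in your tree. */
    #include <stdio.h>

    static const char *lwlock_names[] = {
        "BufFreelistLock",      /* 0 */
        "ShmemIndexLock",       /* 1 */
        "OidGenLock",           /* 2 */
        "XidGenLock",           /* 3 */
        "ProcArrayLock",        /* 4 */
        "SInvalReadLock",       /* 5 */
        "SInvalWriteLock",      /* 6 */
        "WALInsertLock",        /* 7 */
        "WALWriteLock",         /* 8 */
        "ControlFileLock",      /* 9 */
        "CheckpointLock",       /* 10 */
        "CLogControlLock",      /* 11 */
    };

    int main(void)
    {
        int ids[] = {3, 4, 5, 7, 11};   /* the fixed-ID locks discussed above */
        int i;

        for (i = 0; i < 5; i++)
            printf("lwlock %d = %s\n", ids[i], lwlock_names[ids[i]]);
        return 0;
    }

The higher, dynamically assigned numbers (such as 39, 45, 310, and 504) fall past the fixed range and are not covered by this mapping.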