Thread: RFC: bufmgr locking changes
I've attached a (gzip'ed) patch that makes the following changes to the buffer manager:

(1) Overhaul locking: whenever the code needs to modify the state of an individual buffer, do synchronization via a per-buffer "meta data lock" rather than the global BufMgrLock. For more info on the motivation for this, see the several recent -hackers threads on the subject.

(2) Reduce superfluous dirtying of pages: previously, LockBuffer(buf, LW_EXCLUSIVE) would automatically dirty the buffer. This behavior has been removed, to reduce the number of pages that are dirtied. For more info on the motivation for this, see the previous -hackers thread on the topic.

(3) Simplify the code in a bunch of places.

This basically involved rewriting bufmgr.c (which makes the patch a little illegible; it might be easier to just apply it and read through the new code). That also means there are still a fair number of bugs and unresolved issues.

The new locking works as follows:

- modifying the shared buffer table (buf_table.c) or making calls into the freelist code (freelist.c) still requires holding the BufMgrLock. There was some talk of trying to add more granular locking to freelist.c itself; if there is still significant contention for the BufMgrLock after these changes have been applied, I'll take a look at doing that

- to modify a BufferDesc's meta data (shared refcount, flags, tag, etc.), one must hold the buffer's "meta data lock". Also, I removed the existing "io_in_progress" lock; instead, we now hold the buffer's meta data lock while doing buffer I/O.

- if a backend holds the BufMgrLock, it will never try to LWLockAcquire() a per-buffer meta data lock, due to the risk of deadlock (and the loss in concurrency if we got blocked waiting while still holding the BufMgrLock). Instead it does a LWLockConditionalAcquire() and handles an acquisition failure sanely

- if a backend holds the meta data lock for a buffer, it CAN attempt to LWLockAcquire() the BufMgrLock.
This is safe from deadlocks, due to the rule in the previous point.

The code is still a work-in-progress (running `pgbench -s 30 -c 20 -t 1000` bails out pretty quickly, for example), but any comments on whether this scheme is in-theory correct would be very welcome.

For #2, my understanding of the existing XLOG code is incomplete, so perhaps someone can tell me if I'm on the right track. I've modified a few of the XLogInsert() call sites so that they now:

- acquire an exclusive buffer cntx_lock
- modify the content of the page/buffer
- WriteNoReleaseBuffer() to mark the buffer as dirty; since we still hold the buffer's cntx_lock, a checkpoint process can't write this page to disk yet. This replaces the implicit "mark this page as dirty" behavior of LockBuffer()
- do the XLogInsert()
- release the cntx_lock
- ReleaseBuffer() to decrement the buffer's pincount

For example, sequence.c now works like this (I've probably missed a few places) -- is this correct, and/or is there a better way to do it?

Notes and caveats:

- Remove SetBufferCommitInfoNeedsSave(). AFAICS, this is now completely equivalent to WriteNoReleaseBuffer(), so I just removed the former and replaced all the calls to it with calls to the latter.

- Moved the code for flushing a dirty buffer to disk to buf_io.c

- Moved UnpinBuffer() and PinBuffer() to bufmgr.c, from freelist.c

- Removed the following now-unused buffer flag bits: BM_VALID, BM_DELETED, BM_JUST_DIRTIED, BM_IO_IN_PROGRESS, and shrunk the 'flags' field of the BufferDesc down to 8 bits (from 16)

- Removed the cntxDirty field from the BufferDesc: now that we don't need to acquire the BufMgrLock to modify the buffer's flags, there's no reason to keep this around

- Make 'PrivateRefCount' an array of uint16s, rather than longs. This saves 2 bytes * shared_buffers per backend on 32-bit machines and 6 bytes * shared_buffers per backend on some 64-bit machines.
It means a given backend can only pin a single buffer 65,536 times, but that should be more than enough. Similarly, made LocalRefCount an array of uint16s. I was thinking of adding overflow checking to the refcount increment code to make sure we fail safely if a backend *does* try to exceed this number of pins, but I can't imagine a scenario in which it would actually occur, so I haven't bothered.

- Remove the BufferLocks array. AFAICS this wasn't actually necessary.

- A few loose ends in the code still need to be wrapped up (for example, I need to take another glance at the pin-waiter stuff, and the error handling still needs some more work), but I think most of the functionality is there. Areas of concern are denoted by 'XXX' comments.

Any comments?

-Neil
Attachments
On Wed, Jan 07, 2004 at 05:37:09PM -0500, Neil Conway wrote:

> - if a backend holds the BufMgrLock, it will never try to
> LWLockAcquire() a per-buffer meta data lock, due to the risk of
> deadlock (and the loss in concurrency if we got blocked waiting
> while still holding the BufMgrLock). Instead it does a
> LWLockConditionalAcquire() and handles an acquisition failure
> sanely

I can only see this working when you always check that you actually got the right buffer after you locked the meta data. There is nothing preventing another backend from using the buffer for something else between the time you release the BufMgrLock and the time you lock the buffer's meta data.

Note that your PinBuffer() is now always called when you already hold the meta data lock, which is probably a good idea, or you're going to make that function more complex than it should be. Just say that it should hold the meta lock instead of the BufMgrLock when it's called.

Kurt
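The re-check Kurt is describing might look roughly like this at the lookup site (pseudocode only; the lookup and tag-comparison names are assumptions modeled on the existing buffer-table code, not the patch itself):

```
retry:
    /* look up the buffer while holding BufMgrLock */
    LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
    buf = BufTableLookup(&tag);
    LWLockRelease(BufMgrLock);

    /* window: another backend may recycle buf for a different page here */

    LWLockAcquire(buf->meta_data_lock, LW_EXCLUSIVE);
    if (!BUFFERTAGS_EQUAL(&buf->tag, &tag))
    {
        /* buffer was reused for another page; unlock and retry the lookup */
        LWLockRelease(buf->meta_data_lock);
        goto retry;
    }
    /* tag still matches: safe to pin, which (per Kurt) also happens
       while holding the meta data lock */
    PinBuffer(buf);
```

The key point is that the tag comparison happens *after* the meta data lock is held, so the buffer cannot change identity between the check and the pin.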
Neil Conway <neilc@samurai.com> writes:

> - to modify a BufferDesc's meta data (shared refcount, flags, tag,
> etc.), one must hold the buffer's "meta data lock". Also, I
> removed the existing "io_in_progress" lock; instead, we now hold
> the buffer's meta data lock while doing buffer I/O.

The latter is a really bad idea IMHO. The io_in_progress lock can be held for eons (in CPU terms) and should not be blocking people who simply want to bump their refcount up and down. In particular, there is no reason for an in-process write to block people who want read-only access to the buffer. It's possible that you could combine the io_in_progress lock with the cntx_lock, but I doubt locking out access to the metadata during I/O is a good idea.

> - if a backend holds the BufMgrLock, it will never try to
> LWLockAcquire() a per-buffer meta data lock, due to the risk of
> deadlock (and the loss in concurrency if we got blocked waiting
> while still holding the BufMgrLock). Instead it does a
> LWLockConditionalAcquire() and handles an acquisition failure
> sanely

Hmm, what's "sanely" exactly? It seems to me that figuring out how to not need to lock in this direction is a critical part of the redesign, so you can't just handwave here and expect people to understand.

> - Remove SetBufferCommitInfoNeedsSave(). AFAICS, this is now
> completely equivalent to WriteNoReleaseBuffer(), so I just removed
> the former and replaced all the calls to it with calls to the latter.

The reason I've kept the separation was as a form of documentation as to the reason for each write. Although they currently do the same thing, that might not always be true. I'd prefer not to eliminate the distinction from the source code --- though I'd not object if you want to make SetBufferCommitInfoNeedsSave a macro that invokes the other routine.
> - Removed the following now-unused buffer flag bits: BM_VALID,
> BM_DELETED, BM_JUST_DIRTIED, BM_IO_IN_PROGRESS, and shrunk the
> 'flags' field of the BufferDesc down to 8 bits (from 16)

I cannot believe that it's workable to remove BM_JUST_DIRTIED. How are you handling the race conditions that this was for? I'm also unconvinced that removing BM_IO_IN_PROGRESS is a good idea.

> - Make 'PrivateRefCount' an array of uint16s, rather than longs. This
> saves 2 bytes * shared_buffers per backend on 32-bit machines and 6
> bytes * shared_buffers per backend on some 64-bit machines. It means
> a given backend can only pin a single buffer 65,536 times, but that
> should be more than enough. Similarly, made LocalRefCount an array
> of uint16s.

I think both of these are ill-considered micro-optimizations. How do you know that the pin count can't exceed 64K? Consider recursive plpgsql functions for a likely counterexample.

> I was thinking of adding overflow checking to the refcount increment
> code to make sure we fail safely if a backend *does* try to exceed
> this number of pins, but I can't imagine a scenario in which it would
> actually occur, so I haven't bothered.

Leaving the arrays as longs is a much saner approach IMHO.

> - Remove the BufferLocks array. AFAICS this wasn't actually necessary.

Please put that back. It is there to avoid unnecessary acquisitions of buffer locks during UnlockBuffers (which is executed during any transaction abort). Without it, you will need to lock every single buffer during an abort in order to check its flags.

regards, tom lane
(Sorry Tom, I was meaning to reply to you once I'd had a chance to revise the bufmgr patch; since that seems a fair ways off, I figured it would be better to respond now.)

Tom Lane <tgl@sss.pgh.pa.us> writes:
> Neil Conway <neilc@samurai.com> writes:
>> we now hold the buffer's meta data lock while doing buffer I/O.
>
> The latter is a really bad idea IMHO. The io_in_progress lock can be
> held for eons (in CPU terms) and should not be blocking people who
> simply want to bump their refcount up and down.

My reasoning was that the contention for the per-buffer meta data lock should be pretty low. The io_in_progress lock is held when we're either faulting a page in or flushing a page out. In the first case, we can't actually use the buffer no matter how we do the locking (its content is incomplete), so there's no effective loss in concurrency. In the second case, what kinds of concurrent activity can we allow on the buffer? (We can allow reads, of course, but I don't believe we can allow writes.) However, I'll think some more on this; you (and Jan, who raised this point a while ago via IRC) are probably correct.

> It's possible that you could combine the io_in_progress lock with the
> cntx_lock

Yeah, that's a possibility.

>> - Remove SetBufferCommitInfoNeedsSave(). AFAICS, this is now
>> completely equivalent to WriteNoReleaseBuffer(), so I just removed
>> the former and replaced all the calls to it with calls to the latter.
>
> The reason I've kept the separation was as a form of documentation as to
> the reason for each write. Although they currently do the same thing,
> that might not always be true. I'd prefer not to eliminate the
> distinction from the source code --- though I'd not object if you want
> to make SetBufferCommitInfoNeedsSave a macro that invokes the other
> routine.

Ok, fair enough -- I've changed SetBufferCommitInfoNeedsSave() to be a macro for WriteNoReleaseBuffer().

>> - Make 'PrivateRefCount' an array of uint16s, rather than longs. This
>> saves 2 bytes * shared_buffers per backend on 32-bit machines and 6
>> bytes * shared_buffers per backend on some 64-bit machines. It means
>> a given backend can only pin a single buffer 65,536 times, but that
>> should be more than enough. Similarly, made LocalRefCount an array
>> of uint16s.
>
> I think both of these are ill-considered micro-optimizations. How do you
> know that the pin count can't exceed 64K? Consider recursive plpgsql
> functions for a likely counterexample.

Fair enough -- I couldn't conceive of an actual scenario in which a single backend would acquire > 64K pins on a single buffer, but I'll take your word that it's not so far fetched. However, there is still room for improvement, IMHO: on a machine with 64-bit longs, we're plainly allocating 4 bytes more than we need to. Any objection if I change these to arrays of int32?

> Please put that back. It is there to avoid unnecessary acquisitions
> of buffer locks during UnlockBuffers (which is executed during any
> transaction abort). Without it, you will need to lock every
> single buffer during an abort in order to check its flags.

It seems bizarre that we need to iterate through a few thousand array elements just to do some lock cleanup. I'll take a closer look at this...

-Neil
Neil Conway <neilc@samurai.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> The latter is a really bad idea IMHO. The io_in_progress lock can be
>> held for eons (in CPU terms) and should not be blocking people who
>> simply want to bump their refcount up and down.

> My reasoning was that the contention for the per-buffer meta data lock
> should be pretty low. The io_in_progress lock is held when we're
> either faulting a page in or flushing a page out. In the first case,
> we can't actually use the buffer no matter how we do the locking (its
> content is incomplete), so there's no effective loss in
> concurrency. In the second case, what kinds of concurrent activity can
> we allow on the buffer? (We can allow reads, of course, but I don't
> believe we can allow writes.)

True, there's no win in the read-busy case, but I think you underestimate the value of the write-busy case. Multiple concurrent readers are a very important consideration. In Postgres it is possible for a reader to cause a write to occur (because it sets commit hint bits, as per the SetBufferCommitInfoNeedsSave() business), and so you could have a situation like:

    Reader pins page
    Reader examines some tuples
    Reader sets a commit bit and dirties page
    ... Writer starts write ...
    Reader examines some more tuples
    Reader unpins page
    Writer finishes write

If the reader can't unpin until the writer is done, then we will have foreground readers blocked on the background writer process, which is exactly what we do not want.

>> I think both of these are ill-considered micro-optimizations. How do you
>> know that the pin count can't exceed 64K? Consider recursive plpgsql
>> functions for a likely counterexample.

> Fair enough -- I couldn't conceive of an actual scenario in which
> a single backend would acquire > 64K pins on a single buffer, but I'll
> take your word that it's not so far fetched. However, there is still
> room for improvement, IMHO: on a machine with 64-bit longs, we're
> plainly allocating 4 bytes more than we need to. Any objection if I
> change these to arrays of int32?

That seems like a reasonable compromise.

regards, tom lane