Discussion: Lockless StrategyGetBuffer() clock sweep
Hi,

I've previously posted a patch at
http://archives.postgresql.org/message-id/20141010160020.GG6670%40alap3.anarazel.de
that reduces contention in StrategyGetBuffer() by making the clock sweep
lockless. Robert asked me to post it to a new thread; I originally wrote it
while investigating some other contention in more detail, which is why it
ended up in that thread...

The performance numbers I've quoted over there are:

> Test:
> pgbench -M prepared -P 5 -S -c 496 -j 496 -T 5000
> on a scale=1000 database, with 4GB of shared buffers.
>
> Before:
> progress: 40.0 s, 136252.3 tps, lat 3.628 ms stddev 4.547
> progress: 45.0 s, 135049.0 tps, lat 3.660 ms stddev 4.515
> progress: 50.0 s, 135788.9 tps, lat 3.640 ms stddev 4.398
> progress: 55.0 s, 135268.4 tps, lat 3.654 ms stddev 4.469
> progress: 60.0 s, 134991.6 tps, lat 3.661 ms stddev 4.739
>
> After:
> progress: 40.0 s, 207701.1 tps, lat 2.382 ms stddev 3.018
> progress: 45.0 s, 208022.4 tps, lat 2.377 ms stddev 2.902
> progress: 50.0 s, 209187.1 tps, lat 2.364 ms stddev 2.970
> progress: 55.0 s, 206462.7 tps, lat 2.396 ms stddev 2.871
> progress: 60.0 s, 210263.8 tps, lat 2.351 ms stddev 2.914

Imo the patch doesn't complicate the logic noticeably...

I do wonder if we could make the freelist accesses lockless as well - but I
think that's harder, so I don't want to mix that with this.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
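[Editor's note] The core of the approach - replacing the spinlock-protected advance of nextVictimBuffer with an atomic increment - can be sketched in a few lines. This is a hypothetical, simplified C11 illustration, not the patch's actual code; clock_sweep_tick, next_victim, and NBUFFERS are invented names for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical sketch of a lockless clock sweep: instead of advancing
 * nextVictimBuffer while holding the buffer-strategy spinlock, each
 * backend claims the next victim slot with one atomic fetch-and-add.
 * Names are illustrative, not taken from the patch.
 */
#define NBUFFERS 4              /* tiny pool, just for demonstration */

/* monotonically increasing ticket counter; never needs a lock */
static atomic_uint_fast64_t next_victim;

/* Claim the next victim buffer id; wraparound handled by modulo. */
static uint32_t
clock_sweep_tick(void)
{
    uint64_t ticket = atomic_fetch_add(&next_victim, 1);

    return (uint32_t) (ticket % NBUFFERS);
}
```

Every fetch-and-add hands out a distinct ticket, so whichever backend obtains the cacheline makes progress; there is no spin-and-retry loop during which a preempted holder can block everyone else.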
Attachments
On Mon, Oct 27, 2014 at 9:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I've previously posted a patch at
> http://archives.postgresql.org/message-id/20141010160020.GG6670%40alap3.anarazel.de
> that reduces contention in StrategyGetBuffer() by making the clock sweep
> lockless. Robert asked me to post it to a new thread; I originally
> wrote it to see some other contention in more detail, that's why it
> ended up in that thread...

Does this LATCHPTR_ACCESS_ONCE crap really do anything? Why do we need
that? I'm not convinced that it's actually solving any problem we would
otherwise have with this code. But if it is, then how about adding a flag
that is 4 bytes wide or less alongside bgwriterLatch, and just checking
the flag instead of checking bgwriterLatch itself?

Actually, the same approach would work for INT_ACCESS_ONCE. So I propose
this approach instead: Add a new atomic flag to StrategyControl,
useSlowPath. If StrategyGetBuffer() finds useSlowPath set, call
StrategyGetBufferSlow(), which will acquire the spinlock, check whether
bgwriterLatch is set and/or the freelist is non-empty, and return a buffer
ID if able to allocate from the freelist; otherwise -1. If
StrategyGetBufferSlow() returns -1, or we decide not to call it in the
first place because useSlowPath isn't set, then fall through to your
clock-sweep logic. We can set useSlowPath at startup and whenever we put
buffers on the free list.

The lack of memory barriers while testing useSlowPath (or, in your
version, when testing the ACCESS_ONCE conditions) means that we could fail
to set the bgwriter latch or pop from the freelist if another backend has
very recently updated those things. But that's not actually a problem;
it's fine for any individual request to skip those things, as they are
more like hints than correctness requirements.

The interaction between this and the bgreclaimer idea is interesting. We
can't make popping the freelist lockless without somehow dealing with the
resulting A-B-A problem (namely, that between the time we read &head->next
and the time we CAS the list-head to that value, the head might have been
popped, another item pushed, and the original list head pushed again). So
even if bgreclaimer saves some work for individual backends - avoiding the
need for them to clock-sweep across many buffers - it may not be worth it
if it means taking a spinlock to pop the freelist instead of using an
atomics-driven clock sweep. Considering that there may be a million plus
buffers to scan through, that's a surprising conclusion, but it seems to
be where the data is pointing us.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
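[Editor's note] The A-B-A hazard described above can be demonstrated concretely. Below is a hypothetical, minimal C11 sketch - Node, push, and pop_naive are invented names, not PostgreSQL code: a naive lock-free pop reads the head and its next pointer, and if the head is popped and later re-pushed in between, the stale CAS still succeeds and installs a pointer to a node that is no longer on the list.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* A node in a naive Treiber-style lock-free stack. */
typedef struct Node { struct Node *next; } Node;

static _Atomic(Node *) list_head;

static void
push(Node *n)
{
    Node *old;

    do {
        old = atomic_load(&list_head);
        n->next = old;
    } while (!atomic_compare_exchange_weak(&list_head, &old, n));
}

static Node *
pop_naive(void)
{
    Node *old, *next;

    do {
        old = atomic_load(&list_head);
        if (old == NULL)
            return NULL;
        next = old->next;       /* <- another backend may run here */
    } while (!atomic_compare_exchange_weak(&list_head, &old, next));
    return old;
}
```

PostgreSQL's freelist links buffers by ID rather than by pointer, but the hazard is the same for any CAS on a value that can be recycled; typical mitigations are a version-tagged head word or, as discussed in this thread, simply keeping the spinlock.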
On 2014-10-29 14:18:33 -0400, Robert Haas wrote:
> On Mon, Oct 27, 2014 at 9:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I've previously posted a patch at
> > http://archives.postgresql.org/message-id/20141010160020.GG6670%40alap3.anarazel.de
> > that reduces contention in StrategyGetBuffer() by making the clock sweep
> > lockless. Robert asked me to post it to a new thread; I originally
> > wrote it to see some other contention in more detail, that's why it
> > ended up in that thread...
>
> Does this LATCHPTR_ACCESS_ONCE crap really do anything?

Well, it forces the variable to be read from memory, instead of allowing
it to be read from a register... I'd much rather have a generic,
type-independent ACCESS_ONCE macro, but I'm not aware how to do that in a
way that works across compilers. It also prevents the compiler from doing
speculative writes.

> Why do we need that? I'm not convinced that it's actually solving any
> problem we would otherwise have with this code.

I agree that in this case we can get rid of it.

> But if it is, then how about
> adding a flag that is 4 bytes wide or less alongside bgwriterLatch,
> and just checking the flag instead of checking bgwriterLatch itself?

Yea, that'd be nicer. I didn't do it because it made the patch slightly
more invasive... The complexity really is only needed because we're not
guaranteed that 64bit reads are atomic. Although we actually can be sure,
because there's no platform with nonatomic intptr_t reads - so 64bit
platforms actually *do* have atomic 64bit reads/writes.

So if we just have an integer 'setBgwriterLatch' somewhere we're good. We
don't even need to take a lock to set it. Afaics the worst that can happen
is that several processes set the latch...

> Actually, the same approach would work for INT_ACCESS_ONCE. So I
> propose this approach instead: Add a new atomic flag to
> StrategyControl, useSlowPath. If StrategyGetBuffer() finds
> useSlowPath set, call StrategyGetBufferSlow(), which will acquire the
> spinlock, check whether bgwriterLatch is set and/or the freelist is
> non-empty and return a buffer ID if able to allocate from the
> freelist; otherwise -1. If StrategyGetBufferSlow() returns -1, or we
> decide not to call it in the first place because useSlowPath isn't
> set, then fall through to your clock-sweep logic. We can set
> useSlowPath at startup and whenever we put buffers on the free list.

I don't like the idea of conflating the slowpath for the bgwriter latch
with the freelist slowpath. And I don't really see what that'd buy us over
what's in the patch right now?

> The lack of memory barriers while testing useSlowPath (or, in your
> version, when testing the ACCESS_ONCE conditions) means that we could
> fail to set the bgwriter latch or pop from the freelist if another
> backend has very recently updated those things. But that's not
> actually a problem; it's fine for any individual request to skip those
> things, as they are more like hints than correctness requirements.

Right.

> The interaction between this and the bgreclaimer idea is interesting.
> We can't make popping the freelist lockless without somehow dealing
> with the resulting A-B-A problem (namely, that between the time we
> read &head->next and the time we CAS the list-head to that value, the
> head might have been popped, another item pushed, and the original
> list head pushed again).

I think if we really feel the need to, we can circumvent the ABA problem
here. But I'm not yet convinced that there's the need to do so. I'm unsure
that a single process that touches all the buffers at some frequency is
actually a good idea on modern NUMA systems...

I wonder if we could make a trylock for spinlocks work - then we could
look at the freelist if the lock is free and just go to the clock sweep
otherwise. My guess is that that'd be quite performant. IIRC it's just the
spinlock semaphore fallback that doesn't know how to do trylock...

> So even if bgreclaimer saves some work for
> individual backends - avoiding the need for them to clock-sweep across
> many buffers - it may not be worth it if it means taking a spinlock to
> pop the freelist instead of using an atomics-driven clock sweep.
> Considering that there may be a million plus buffers to scan through,
> that's a surprising conclusion, but it seems to be where the data is
> pointing us.

I'm not really convinced of this yet either. It might just be that the
bgreclaimer implementation isn't good enough. But to me that doesn't
really change the fact that there's clear benefit in this patch - even if
we can make bgreclaimer beneficial, the lockless scan will be good.

My feeling is that making bg*writer* more efficient is more likely to be
beneficial overall than introducing bgreclaimer. I've seen the clock
sweeps really hurt in cases where many many buffers are dirty.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Oct 30, 2014 at 12:39 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-10-29 14:18:33 -0400, Robert Haas wrote:
>
> > The interaction between this and the bgreclaimer idea is interesting.
> > We can't making popping the freelist lockless without somehow dealing
> > with the resulting A-B-A problem (namely, that between the time we
> > read &head->next and the time we CAS the list-head to that value, the
> > head might have been popped, another item pushed, and the original
> > list head pushed again).
>
> I think if we really feel the need to, we can circumvent the ABA problem
> here. But I'm not yet convinced that there's the need to do so. I'm
> unsure that a single process that touches all the buffers at some
> frequency is actually a good idea on modern NUMA systems...
>
> I wonder if we could make a trylock for spinlocks work - then we could
> look at the freelist if the lock is free and just go to the clock cycle
> otherwise. My guess is that that'd be quite performant. IIRC it's just
> the spinlock semaphore fallback that doesn't know how to do trylock...
>
>
> > So even if bgreclaimer saves some work for
> > individual backends - avoiding the need for them to clock-sweep across
> > many buffers - it may not be worth it if it means taking a spinlock to
> > pop the freelist instead of using an atomics-driven clock sweep.
> > Considering that there may be a million plus buffers to scan through,
> > that's a surprising conclusion, but it seems to be where the data is
> > pointing us.
>
> I'm not really convinced of this yet either. It might just be that the
> bgreclaimer implementation isn't good enough. But to me that doesn't
> really change the fact that there's clear benefit in this patch -
I have a feeling that this might also show some regression at higher
loads (like scale_factor = 5000, shared_buffers = 8GB,
client_count = 128, 256) for reasons similar to the bgreclaimer patch:
although both reduce contention around the spinlock, they move the
contention somewhere else. I have yet to take data before
concluding anything (I am just waiting for your other patch (wait-free
LW_SHARED) to be committed).
I think once wait-free LW_SHARED is in, we can evaluate this patch
(if required, maybe we can see if there is any interaction between
this and bgreclaimer). However, if you want, I think this can be done
even separately from the wait-free LW_SHARED patch.
> even
> if we can make bgreclaimer beneficial the lockless scan will be good.
>
> My feeling is that making bg*writer* more efficient is more likely to be
> beneficial overall than introducing bgreclaimer.
One idea could be to make bgwriter something similar to the autovacuum
launcher, which means that bgwriter could launch different workers based
on the kind of need.
On 2014-10-30 10:23:56 +0530, Amit Kapila wrote:
> I have a feeling that this might also show some regression at higher
> loads (like scale_factor = 5000, shared_buffers = 8GB,
> client_count = 128, 256) for reasons similar to the bgreclaimer patch:
> although both reduce contention around the spinlock, they move the
> contention somewhere else. I have yet to take data before
> concluding anything (I am just waiting for your other patch (wait-free
> LW_SHARED) to be committed).

I have a hard time seeing how this could be. In the uncontended case the
number of cachelines touched and the number of atomic operations is
exactly the same. In the contended case the new implementation does far
fewer atomic ops - and doesn't do spinning.

What's your theory?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Oct 29, 2014 at 3:09 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> But if it is, then how about
>> adding a flag that is 4 bytes wide or less alongside bgwriterLatch,
>> and just checking the flag instead of checking bgwriterLatch itself?
>
> Yea, that'd be nicer. I didn't do it because it made the patch slightly
> more invasive... The complexity really is only needed because we're not
> guaranteed that 64bit reads are atomic. Although we actually can be
> sure, because there's no platform with nonatomic intptr_t reads - so
> 64bit platforms actually *do* have atomic 64bit reads/writes.
>
> So if we just have an integer 'setBgwriterLatch' somewhere we're good. We
> don't even need to take a lock to set it. Afaics the worst that can
> happen is that several processes set the latch...

OK, that design is fine with me.

>> Actually, the same approach would work for INT_ACCESS_ONCE. So I
>> propose this approach instead: Add a new atomic flag to
>> StrategyControl, useSlowPath. If StrategyGetBuffer() finds
>> useSlowPath set, call StrategyGetBufferSlow(), which will acquire the
>> spinlock, check whether bgwriterLatch is set and/or the freelist is
>> non-empty and return a buffer ID if able to allocate from the
>> freelist; otherwise -1. If StrategyGetBufferSlow() returns -1, or we
>> decide not to call it in the first place because useSlowPath isn't
>> set, then fall through to your clock-sweep logic. We can set
>> useSlowPath at startup and whenever we put buffers on the free list.
>
> I don't like the idea of conflating the slowpath for the bgwriter latch
> with the freelist slowpath. And I don't really see what that'd buy us
> over what's in the patch right now?

Well, I was thinking that with a single flag check we could skip two paths
that, as of today, are both uncommonly taken. Moving all that logic off
into a separate function might help the compiler optimize for the fast
case, too.

> I wonder if we could make a trylock for spinlocks work - then we could
> look at the freelist if the lock is free and just go to the clock sweep
> otherwise. My guess is that that'd be quite performant. IIRC it's just
> the spinlock semaphore fallback that doesn't know how to do trylock...

I could go for that. I don't think the semaphore thing is really a
problem. If the spinlock semaphore implementation indeed can't support
that, then just have it fall back to waiting for the lock and always
returning true. Semaphores are so slow that it's not going to make any
difference to performance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
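[Editor's note] A minimal sketch of such a trylock, assuming C11 atomics. SpinLockTryAcquire does not currently exist in PostgreSQL; the USE_SEMAPHORE_SPINLOCK_FALLBACK macro is invented, and the fallback branch below is a stand-in loop for a semaphore wait, following Robert's suggestion above that the fallback can simply block and report success.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical spinlock type; the real slock_t is platform-specific. */
typedef atomic_flag slock_t;

static slock_t demo_lock = ATOMIC_FLAG_INIT;

/*
 * Try to acquire the spinlock without spinning. On platforms with a real
 * test-and-set, succeed only if the lock was free; on the semaphore
 * fallback, just wait for the lock and always report success - semaphores
 * are slow enough that the difference doesn't matter.
 */
static bool
spin_try_acquire(slock_t *lock)
{
#ifdef USE_SEMAPHORE_SPINLOCK_FALLBACK
    /* stand-in for blocking on the semaphore until the lock is ours */
    while (atomic_flag_test_and_set(lock))
        ;
    return true;
#else
    /* real TAS: one attempt, no retry loop */
    return !atomic_flag_test_and_set(lock);
#endif
}
```

With this, StrategyGetBuffer() could peek at the freelist only when the lock happens to be free and otherwise go straight to the atomics-driven clock sweep, so a contended freelist never serializes victim selection.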
On Thu, Oct 30, 2014 at 5:01 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> On 2014-10-30 10:23:56 +0530, Amit Kapila wrote:
> > I have a feeling that this might also show some regression at higher
> > loads (like scale_factor = 5000, shared_buffers = 8GB,
> > client_count = 128, 256) for reasons similar to the bgreclaimer patch:
> > although both reduce contention around the spinlock, they move the
> > contention somewhere else. I have yet to take data before
> > concluding anything (I am just waiting for your other patch (wait-free
> > LW_SHARED) to be committed).
>
> I have a hard time seeing how this could be. In the uncontended case the
> number of cachelines touched and the number of atomic operations is
> exactly the same. In the contended case the new implementation does far
> fewer atomic ops - and doesn't do spinning.
>
> What's your theory?
>
I have observed that reducing contention in one path doesn't always lead
to a performance/scalability gain; the contention instead shifts to
another lock if one exists. This is the reason we have to work on reducing
contention around both the BufFreeList and buffer mapping table locks
together. I have taken some performance data, and it seems this patch
exhibits behaviour similar to the bgreclaimer patch; I believe resolving
contention around dynahash can improve the situation (Robert's chash patch
can be helpful).
Performance Data
------------------------------
Configuration and Db Details
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB
max_connections = 300
shared_buffers = 8GB
checkpoint_segments=30
checkpoint_timeout =15min
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins
Test mode - pgbench readonly (-M prepared)
Scale_Factor = 5000
Data below is the median of 3 runs; for individual run data, check the
document attached to this mail.
Scale_Factor = 1000
Patch_ver/Client_Count | 128 | 256 |
HEAD | 265502 | 201689 |
Patch | 283448 | 224888 |
Scale_Factor = 5000
Patch_ver/Client_Count | 128 | 256 |
HEAD | 190435 | 177477 |
Patch | 171485 | 167794 |
The above data indicates that there is a performance gain at scale factor
1000; however, there is a regression at scale factor 5000.
Attachments
On 2014-10-30 07:55:08 -0400, Robert Haas wrote:
> On Wed, Oct 29, 2014 at 3:09 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> But if it is, then how about
> >> adding a flag that is 4 bytes wide or less alongside bgwriterLatch,
> >> and just checking the flag instead of checking bgwriterLatch itself?
> >
> > Yea, that'd be nicer. I didn't do it because it made the patch slightly
> > more invasive... The complexity really is only needed because we're not
> > guaranteed that 64bit reads are atomic. Although we actually can be
> > sure, because there's no platform with nonatomic intptr_t reads - so
> > 64bit platforms actually *do* have atomic 64bit reads/writes.
> >
> > So if we just have an integer 'setBgwriterLatch' somewhere we're good. We
> > don't even need to take a lock to set it. Afaics the worst that can
> > happen is that several processes set the latch...
>
> OK, that design is fine with me.

There's a slight problem with this, namely restarts of bgwriter. If it
crashes, the reference to the relevant latch will currently be reset via
StrategyNotifyBgWriter(). In reality that's not a problem because
sizeof(void*) writes are always atomic, but currently we don't assume
that. We'd sometimes write to an old latch, but that's harmless because
they're never deallocated.

So I can see several solutions right now:
1) Redefine our requirements so that aligned sizeof(void*) writes are
   always atomic. That doesn't affect any currently supported platform
   and won't ever affect any future platform either. E.g. linux has had
   that requirement for a long time.
2) Use an explicitly defined latch for the bgwriter instead of using the
   PGPROC->procLatch. That way it never has to be reset and all
   SetLatch()s will eventually go to the right process.
3) Continue requiring the spinlock when setting the latch.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Dec 8, 2014 at 2:51 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-10-30 07:55:08 -0400, Robert Haas wrote:
>> On Wed, Oct 29, 2014 at 3:09 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> >> But if it is, then how about
>> >> adding a flag that is 4 bytes wide or less alongside bgwriterLatch,
>> >> and just checking the flag instead of checking bgwriterLatch itself?
>> >
>> > Yea, that'd be nicer. I didn't do it because it made the patch slightly
>> > more invasive... The complexity really is only needed because we're not
>> > guaranteed that 64bit reads are atomic. Although we actually can be
>> > sure, because there's no platform with nonatomic intptr_t reads - so
>> > 64bit platforms actually *do* have atomic 64bit reads/writes.
>> >
>> > So if we just have an integer 'setBgwriterLatch' somewhere we're good. We
>> > don't even need to take a lock to set it. Afaics the worst that can
>> > happen is that several processes set the latch...
>>
>> OK, that design is fine with me.
>
> There's a slight problem with this, namely restarts of bgwriter. If it
> crashes, the reference to the relevant latch will currently be reset via
> StrategyNotifyBgWriter(). In reality that's not a problem because
> sizeof(void*) writes are always atomic, but currently we don't assume
> that. We'd sometimes write to an old latch, but that's harmless because
> they're never deallocated.
>
> So I can see several solutions right now:
> 1) Redefine our requirements so that aligned sizeof(void*) writes are
>    always atomic. That doesn't affect any currently supported platform
>    and won't ever affect any future platform either. E.g. linux has had
>    that requirement for a long time.
> 2) Use an explicitly defined latch for the bgwriter instead of using the
>    PGPROC->procLatch. That way it never has to be reset and all
>    SetLatch()s will eventually go to the right process.
> 3) Continue requiring the spinlock when setting the latch.

Maybe you could store the pgprocno instead of the PGPROC *.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
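[Editor's note] Robert's suggestion sidesteps the pointer-width atomicity question: a pgprocno is a plain int, and aligned int-sized stores are atomic on every supported platform. A hypothetical sketch - the struct, field, and function names below are invented, loosely modeled on StrategyNotifyBgWriter():

```c
#include <assert.h>

#define INVALID_PGPROCNO (-1)

typedef struct StrategyControlSketch
{
    /*
     * pgprocno of the bgwriter to wake, or INVALID_PGPROCNO if there is
     * none. Being int-sized, a torn read is impossible, so no spinlock is
     * needed to publish or consume it; after a bgwriter restart, the new
     * process just stores its own pgprocno here.
     */
    volatile int bgwprocno;
} StrategyControlSketch;

static StrategyControlSketch ctl = { INVALID_PGPROCNO };

/* Called by bgwriter at startup (and after any restart); no lock needed. */
static void
strategy_notify_bgwriter(int pgprocno)
{
    ctl.bgwprocno = pgprocno;
}

/* Called by backends in the buffer-allocation path: read once, then wake. */
static int
strategy_read_bgwriter(void)
{
    return ctl.bgwprocno;
}
```

The index can then be turned back into a latch via the PGPROC array when setting it, so a stale value merely wakes the wrong (or no) process, which is harmless for a hint.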
On 2014-12-10 21:52:17 -0500, Robert Haas wrote:
> Maybe you could store the pgprocno instead of the PGPROC *.

That's a good idea. Here's a patch implementing that and other things.

Changes:
* The handling of wraparound is slightly changed. There's a separate path
  for the case where nextVictimBuffer is above NBuffers. That allows a)
  avoiding the somewhat expensive modulo operation in the common case and
  b) always returning consistent results from StrategySyncStart().
* StrategySyncStart() no longer has a situation in which it can return
  inaccurate information. That could actually trigger an assertion in
  bgwriter.
* There was a bug because the local victim variable was signed instead of
  unsigned.
* Clock sweep ticks are moved into a separate routine.

Comments?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
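[Editor's note] The wraparound handling in the first bullet might look roughly like this. This is a simplified, hypothetical rendering, not the patch's exact code - the names and the CAS-based pull-back are illustrative: the counter is allowed to run past NBuffers, the modulo is paid only on the rare wrapped tick, and the shared counter is opportunistically pulled back into range so it cannot overflow.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_NBUFFERS 4       /* tiny pool, just for demonstration */

static atomic_uint next_victim_buffer;

static unsigned
clock_sweep_tick_sketch(void)
{
    unsigned victim = atomic_fetch_add(&next_victim_buffer, 1);

    if (victim >= SKETCH_NBUFFERS)
    {
        unsigned wrapped = victim % SKETCH_NBUFFERS;
        unsigned expected = victim + 1;

        /*
         * Opportunistically pull the shared counter back into range. If
         * another backend already did, or raced past us, the CAS fails
         * and that's fine: correctness only needs the modulo above.
         */
        atomic_compare_exchange_strong(&next_victim_buffer,
                                       &expected, wrapped + 1);
        victim = wrapped;
    }
    return victim;
}
```

In the common case (victim < NBuffers) this is a single fetch-and-add with no division at all, and because ticks within one epoch are monotonic, a reader like StrategySyncStart() can get a consistent snapshot of the sweep position.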
Attachments
On Tue, Dec 23, 2014 at 3:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-12-10 21:52:17 -0500, Robert Haas wrote:
>> Maybe you could store the pgprocno instead of the PGPROC *.
>
> That's a good idea. Here's a patch implementing that and other things.

Thanks. Cool.

> Changes:
> * The handling of wraparound is slightly changed. There's a separate path
>   for the case where nextVictimBuffer is above NBuffers. That allows a)
>   avoiding the somewhat expensive modulo operation in the common case and
>   b) always returning consistent results from StrategySyncStart().
> * StrategySyncStart() no longer has a situation in which it can return
>   inaccurate information. That could actually trigger an assertion in
>   bgwriter.
> * There was a bug because the local victim variable was signed instead
>   of unsigned.
> * Clock sweep ticks are moved into a separate routine.
>
> Comments?

You need to spell-check your commit message. My spell-checker complains
about acquiration, aleviated, aleviates, and clocksweep. The first is not
a word at all; perhaps you meant acquisition. The next two, I believe,
need a double-l rather than a single-l. "acquiration" also appears in the
text of the patch, along with "wether" (should be "whether"). And the last
one should be two words.

I don't think I have anything to say about the substance of the patch. If
fetch-and-add is faster than a spinlock cycle, then it is. And it's good
to be fast.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-12-23 16:42:41 -0500, Robert Haas wrote:
> I don't think I have anything to say about the substance of the patch.
> If fetch-and-add is faster than a spinlock cycle, then it is. And
> it's good to be fast.

I don't think the primary advantage is that it's fast (even though it
should be as fast as a single TAS on x86). It's that you can never sleep
while holding the spinlock when there's no such spinlock, and that every
time the cacheline is transferred from another CPU to you, you'll also
make progress...

Will fix the other stuff.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services