Discussion: Inconsistent use of "volatile" when accessing shared memory?
I noticed that we occasionally use "volatile" to access shared memory, but
usually not; and I'm not clear on the rules for doing so. For instance,
AdvanceXLInsertBuffer() writes to XLogCtl->xlblocks[nextidx] through a
volatile pointer; but then immediately writes to XLogCtl->InitializedUpTo
with a non-volatile pointer. There are also places in procarray.c that make
use of volatile through UINT32_ACCESS_ONCE(), and of course there are
atomics (which use volatile as well as guaranteeing atomicity).

In theory, I think we're always supposed to access shared memory through a
volatile pointer, right? Otherwise a sufficiently smart (and cruel) compiler
could theoretically optimize away the load/store in some surprising cases,
or hold a value in a register longer than we expect, and then any memory
barriers would be useless. But in practice we don't do that even for
sensitive structures like the one referenced by XLogCtl.

My intuition up until now was that if we access through a global pointer,
then the compiler wouldn't completely optimize away the store/load. I ran
through some tests and that assumption seems to hold up, at least in a few
simple examples with gcc at -O2, which emit the loads/stores where expected.

What is the guidance here? Is the volatile pointer use in
AdvanceXLInsertBuffer() required, and if so, why not other places?

Regards,
	Jeff Davis
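The mixed access styles described above can be sketched in miniature. This is a hypothetical reduction, not PostgreSQL code: the names mimic AdvanceXLInsertBuffer(), but the struct, sizes, and functions are invented for illustration.

```c
#include <stdint.h>

/* Hypothetical miniature of the two access styles: one store goes through
 * a volatile-qualified pointer, the next is a plain store through the same
 * global pointer. */
typedef struct MiniXLogCtlData
{
	uint64_t	xlblocks[8];
	uint64_t	InitializedUpTo;
} MiniXLogCtlData;

static MiniXLogCtlData MiniXLogCtlStorage;
static MiniXLogCtlData *MiniXLogCtl = &MiniXLogCtlStorage;

void
mini_advance(int nextidx, uint64_t NewPageEndPtr)
{
	/* store through a volatile pointer: the compiler must emit it here */
	*((volatile uint64_t *) &MiniXLogCtl->xlblocks[nextidx]) = NewPageEndPtr;

	/* plain store: the compiler may merge or move it within its rules */
	MiniXLogCtl->InitializedUpTo = NewPageEndPtr;
}
```

Both stores reach memory eventually; the question in this thread is what, exactly, the volatile on the first one buys.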
Hi,

On 2023-11-02 23:19:03 -0700, Jeff Davis wrote:
> I noticed that we occasionally use "volatile" to access shared memory,
> but usually not; and I'm not clear on the rules for doing so. For
> instance, AdvanceXLInsertBuffer() writes to XLogCtl->xlblocks[nextidx]
> through a volatile pointer; but then immediately writes to
> XLogCtl->InitializedUpTo with a non-volatile pointer. There are also
> places in procarray.c that make use of volatile through
> UINT32_ACCESS_ONCE(), and of course there are atomics (which use
> volatile as well as guaranteeing atomicity).
>
> In theory, I think we're always supposed to access shared memory
> through a volatile pointer, right? Otherwise a sufficiently smart (and
> cruel) compiler could theoretically optimize away the load/store in
> some surprising cases, or hold a value in a register longer than we
> expect, and then any memory barriers would be useless.

I don't think so. We used to use volatile for most shared memory accesses,
but volatile doesn't provide particularly useful semantics - and generates
*vastly* slower code in a lot of circumstances. Most of that usage predates
spinlocks being proper compiler barriers, which was rectified in:

commit 0709b7ee72e
Author: Robert Haas <rhaas@postgresql.org>
Date:   2014-09-09 17:45:20 -0400

    Change the spinlock primitives to function as compiler barriers.

or the introduction of compiler/memory barriers in

commit 0c8eda62588
Author: Robert Haas <rhaas@postgresql.org>
Date:   2011-09-23 17:52:43 -0400

    Memory barrier support for PostgreSQL.

Most instances of volatile used for shared memory access should be replaced
with explicit compiler/memory barriers, as appropriate. Note that use of
volatile does *NOT* guarantee anything about memory ordering!

> But in practice we don't do that even for sensitive structures like the
> one referenced by XLogCtl.
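The two barrier flavors contrasted with volatile above can be sketched as follows. These are assumed forms in the style PostgreSQL uses on GCC-like compilers, not the real macros (which live under src/include/port/atomics/ and vary by platform): a compiler barrier emits no instruction but forbids the compiler from moving or merging memory accesses across it, while a memory barrier additionally orders the accesses at the hardware level.

```c
#include <stdint.h>

/* Assumed, GCC-style sketches - not PostgreSQL's actual macros. */
#define compiler_barrier_sketch()	__asm__ __volatile__("" ::: "memory")
#define memory_barrier_sketch()		__sync_synchronize()

static uint64_t shared_counter;	/* stand-in for a shared-memory field */

uint64_t
bump_and_read(void)
{
	shared_counter += 1;
	memory_barrier_sketch();	/* the store is ordered before what follows */
	compiler_barrier_sketch();	/* the load below must really be a load */
	return shared_counter;
}
```

Note that neither barrier makes any individual access atomic; they only constrain ordering and code motion.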
> My intuition up until now was that if we access through a global
> pointer, then the compiler wouldn't completely optimize away the
> store/load.

That's not guaranteed at all - luckily, since guaranteeing it would lead to
bulkier code. It's not that the global variable will be optimized away
entirely in many situations, but repeated accesses can sometimes be merged,
and accesses can be moved around.

> What is the guidance here? Is the volatile pointer use in
> AdvanceXLInsertBuffer() required, and if so, why not other places?

I don't think it's required. The crucial part is to avoid memory reordering
between zeroing the block / initializing fields and changing that field -
the relevant part for that is the pg_write_barrier(), *not* the volatile.
The volatile does prevent the compiler from deferring the update of
xlblocks[idx] to the next loop iteration. Which I guess isn't a bad idea -
but it's not required for correctness.

When an individual value is read/written from memory, and all that's desired
is to prevent the compiler from eliding/moving that operation, using
volatile *on the individual access* can lead to better code than using
pg_compiler_barrier(), because it allows the compiler to keep other
variables in registers.

Greetings,

Andres Freund
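The ordering requirement described above - initialize the page first, then advertise it - can be sketched like this. The names and sizes are hypothetical, and a full fence stands in for pg_write_barrier(); the real macro may be cheaper on a given platform.

```c
#include <stdint.h>
#include <string.h>

/* assumption: a full fence as a stand-in for pg_write_barrier() */
#define write_barrier_sketch()	__sync_synchronize()

#define MINI_BLCKSZ 64

static char		mini_pages[2][MINI_BLCKSZ];
static uint64_t	mini_xlblocks[2];

void
mini_init_buffer(int idx, uint64_t endptr)
{
	memset(mini_pages[idx], 0, MINI_BLCKSZ);	/* initialize the page first */
	write_barrier_sketch();		/* no reader may observe the new xlblocks
								 * value before the initialized page */
	mini_xlblocks[idx] = endptr;	/* now publish the buffer */
}
```

The barrier, not any volatile qualifier, is what prevents the publish store from becoming visible before the initialization stores.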
On Fri, 2023-11-03 at 15:59 -0700, Andres Freund wrote:
> I don't think so. We used to use volatile for most shared memory
> accesses, but volatile doesn't provide particularly useful semantics -
> and generates *vastly* slower code in a lot of circumstances. Most of
> that usage predates spinlocks being proper compiler barriers,

A compiler barrier doesn't always force the compiler to generate loads and
stores, though. For instance (example code I placed at the bottom of
xlog.c):

typedef struct DummyStruct
{
	XLogRecPtr	recptr;
} DummyStruct;

extern void DummyFunction(void);

static DummyStruct Dummy = { 5 };
static DummyStruct *pDummy = &Dummy;

void
DummyFunction(void)
{
	while (true)
	{
		pg_compiler_barrier();
		pg_memory_barrier();
		if (pDummy->recptr == 0)
			break;
		pg_compiler_barrier();
		pg_memory_barrier();
	}
}

This generates the following code (clang -O2):

000000000016ed10 <DummyFunction>:
  16ed10:	f0 83 04 24 00       	lock addl $0x0,(%rsp)
  16ed15:	f0 83 04 24 00       	lock addl $0x0,(%rsp)
  16ed1a:	eb f4                	jmp    16ed10 <DummyFunction>
  16ed1c:	0f 1f 40 00          	nopl   0x0(%rax)

Obviously this is an oversimplified example, and if I complicate it in any
number of ways then it will start generating actual loads and stores, and
then the compiler and memory barriers should do their job.

> Note that use of volatile does *NOT* guarantee anything about memory
> ordering!

Right, but it does force loads/stores to be emitted by the compiler; and
without loads/stores a memory barrier is useless.

I understand that my example is too simple and I'm not claiming that
there's a problem. I'd just like to understand the key difference between
my example and what we do with XLogCtl.

Another way to phrase my question: under what specific circumstances must
we use something like UINT32_ACCESS_ONCE()? That seems to be used for local
pointers, but it's not clear to me exactly why that matters.
Intuitively, access through a local pointer seems much more likely to be
optimized and therefore more dangerous, but that doesn't imply that access
through global variables is not dangerous.

Regards,
	Jeff Davis
Hi,

On 2023-11-03 17:44:44 -0700, Jeff Davis wrote:
> On Fri, 2023-11-03 at 15:59 -0700, Andres Freund wrote:
> > I don't think so. We used to use volatile for most shared memory
> > accesses, but volatile doesn't provide particularly useful semantics -
> > and generates *vastly* slower code in a lot of circumstances. Most of
> > that usage predates spinlocks being proper compiler barriers,
>
> A compiler barrier doesn't always force the compiler to generate loads
> and stores, though.

I don't think volatile does so entirely either, but I don't think it's
relevant anyway.

> For instance (example code I placed at the bottom of xlog.c):
>
> typedef struct DummyStruct
> {
> 	XLogRecPtr	recptr;
> } DummyStruct;
>
> extern void DummyFunction(void);
>
> static DummyStruct Dummy = { 5 };
> static DummyStruct *pDummy = &Dummy;
>
> void
> DummyFunction(void)
> {
> 	while (true)
> 	{
> 		pg_compiler_barrier();
> 		pg_memory_barrier();
> 		if (pDummy->recptr == 0)
> 			break;
> 		pg_compiler_barrier();
> 		pg_memory_barrier();
> 	}
> }
>
> Generates the following code (clang -O2):
>
> 000000000016ed10 <DummyFunction>:
>   16ed10:	f0 83 04 24 00       	lock addl $0x0,(%rsp)
>   16ed15:	f0 83 04 24 00       	lock addl $0x0,(%rsp)
>   16ed1a:	eb f4                	jmp    16ed10 <DummyFunction>
>   16ed1c:	0f 1f 40 00          	nopl   0x0(%rax)
>
> Obviously this is an oversimplified example and if I complicate it in
> any number of ways then it will start generating actual loads and
> stores, and then the compiler and memory barriers should do their job.

All that is happening here is that clang can prove that nothing ever could
change the variable (making this actually undefined behaviour, IIRC). If
you add another function like:

extern void
SetDummy(uint64_t lsn)
{
	pDummy->recptr = lsn;
}

clang does generate the code you'd expect - there's a possibility that the
condition could be true.

See https://godbolt.org/z/EaM77E8jK

We don't gain anything from preventing the compiler from making such
conclusions afaict.
> > Note that use of volatile does *NOT* guarantee anything about memory
> > ordering!
>
> Right, but it does force loads/stores to be emitted by the compiler;
> and without loads/stores a memory barrier is useless.

There would not have been a point in generating that load...

> I understand that my example is too simple and I'm not claiming that
> there's a problem. I'd just like to understand the key difference
> between my example and what we do with XLogCtl.

The key difference is that there's code changing the relevant variables :)

> Another way to phrase my question: under what specific circumstances
> must we use something like UINT32_ACCESS_ONCE()? That seems to be used
> for local pointers

I guess I don't really know what you mean with global or local pointers?
There's no point in ever using something like it if the memory is local to
a function. The procarray uses all are for shared memory. Sure, we save a
pointer to a subset of shared memory in a local variable, but that doesn't
change very much (it does tell the compiler that it doesn't need to reload
ProcGlobal->xids from memory after calling an external function (which
could change it), so it's a minor efficiency improvement).

The reason for

	/* Fetch xid just once - see GetNewTransactionId */
	pxid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);

is explained in a comment in GetNewTransactionId():

 * Note that readers of ProcGlobal->xids/PGPROC->xid should be careful to
 * fetch the value for each proc only once, rather than assume they can
 * read a value multiple times and get the same answer each time. Note we
 * are assuming that TransactionId and int fetch/store are atomic.

Without the UINT32_ACCESS_ONCE(), the compiler could decide not to load the
memory contents into a register once for the loop body, but to just refetch
the value every time.
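The "fetch the value just once" idiom quoted above can be sketched as follows. The names are hypothetical stand-ins, and the macro definition is an assumed volatile-cast form mirroring common practice, not a copy of PostgreSQL's header.

```c
#include <stdint.h>

/* assumed definition, mirroring the common volatile-cast form */
#define UINT32_ACCESS_ONCE_SKETCH(var) \
	((uint32_t) *((volatile uint32_t *) &(var)))

static uint32_t other_xid;	/* stand-in for other_xids[pgxactoff] */

void
set_other_xid(uint32_t xid)
{
	other_xid = xid;
}

int
xid_is(uint32_t target)
{
	/* fetch once: both tests below see the same snapshot, even if
	 * another backend concurrently overwrites other_xid */
	uint32_t	pxid = UINT32_ACCESS_ONCE_SKETCH(other_xid);

	return pxid != 0 && pxid == target;
}
```

Without the fetch-once, the compiler could legally reload other_xid for each comparison and get two different answers.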
We could also implement this with a compiler barrier between fetching pxid
and using it - but it'd potentially lead to worse code, because instead of
just forcing one load to come from memory, it'd also force reloading
*other* variables from memory.

> but it's not clear to me exactly why that matters. Intuitively, access
> through a local pointer seems much more likely to be optimized and
> therefore more dangerous, but that doesn't imply that access through
> global variables is not dangerous.

I really don't think there's a meaningful difference between the two. What
is interesting is where the memory points to and whether the compiler can
reason about other code potentially having had access to the memory.

E.g. with code like this (overly simple example):

	int foo = 0;
	int *foop = &foo;

	pg_memory_barrier();
	if (*foop != 0)
		return false;

the compiler can easily prove that the variable could not have changed -
there is no legal way for foo to have changed, regardless of the memory
barrier. However, if the code looked like this:

	int foo = 0;
	int *foop = &foo;

	SomethingTheCompilerCantSee(foop);
	if (*foop != 0)
		return false;

or

	extern int *extern_foop;

	int foo = 0;
	int *foop = &foo;

	extern_foop = foop;
	if (*foop != 0)
		return false;

the compiler couldn't prove that anymore.

Of course the case of a pointer to a stack variable is on the very trivial
side. But this holds true for other cases, e.g. the one you noticed above:
a static variable (guaranteeing it's not accessed in other translation
units) only accessed in one function - there couldn't have been anything
modifying it. This even holds true for memory allocated by malloc() - the
language guarantees that there shall be no other pointers to that memory,
so the compiler knows that just after malloc() returns, nobody but the
caller can be accessing the memory. Until the pointer escapes in some form.

Greetings,

Andres Freund
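The escape distinction above can be put into a runnable sketch. All names here are hypothetical; in a real experiment escape_sink() would live in another translation unit so the optimizer genuinely cannot see it - it is defined locally below only so the example links and runs.

```c
/* opaque from the optimizer's point of view (in a real test: another TU) */
__attribute__((noinline)) void
escape_sink(int *p)
{
	(void) p;
}

int
provably_unchanged(void)
{
	int		foo = 0;
	int	   *foop = &foo;

	/* nothing can have modified foo: the compiler may fold this to 1 */
	return *foop == 0;
}

int
possibly_changed(void)
{
	int		foo = 0;
	int	   *foop = &foo;

	escape_sink(foop);	/* foop escapes: the load below must be real */
	return *foop == 0;
}
```

Comparing the two functions at -O2 on a compiler explorer should show the first reduced to a constant return while the second performs an actual load after the call.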
On Fri, 2023-11-03 at 18:36 -0700, Andres Freund wrote:
> All that is happening here is that clang can prove that nothing ever
> could change the variable
...
> See https://godbolt.org/z/EaM77E8jK
>
> We don't gain anything from preventing the compiler from making such
> conclusions afaict.

Thank you. I modified this a bit to use a global variable and then perform
a write outside the loop:

https://godbolt.org/z/EfvGx64eE

extern void DummyFunction(void);
extern DummyStruct Dummy;

DummyStruct Dummy = { 5 };
static DummyStruct *pDummy = &Dummy;

void
DummyFunction(void)
{
	pDummy->recptr = 3;
	while (true)
	{
		if (pDummy->recptr == 0)
			break;
		pg_compiler_barrier();
	}
}

And I think that mostly answers my question: if the compiler barrier is
there, it loads each time; and if I take the barrier away, it optimizes
into an infinite loop.

I suppose what's happening without the compiler barrier is that the load is
being moved up right below the store, which allows it to be optimized away
entirely. The compiler barrier prevents moving the load, and therefore
prevents that optimization.

I'm still somewhat hazy on exactly what a compiler barrier prevents -- it
seems to depend on the internal details of how the compiler optimizes in
stages. But I see that it's doing what we need it to do now.

> The key difference is that there's code changing relevant variables :)

Yeah.

> I guess I don't really know what you mean with global or local
> pointers?

I meant "global pointers to shared memory" (like XLogCtl) vs. "local
pointers to shared memory" (like other_xids in TransactionIdIsActive()).

> We could also implement this with a compiler barrier between fetching
> pxid and using it - but it'd potentially lead to worse code, because
> instead of just forcing one load to come from memory, it'd also force
> reloading *other* variables from memory.
I see -- that's a way to be more selective about what gets reloaded since
the last compiler barrier, rather than inserting a new compiler barrier,
which would cause everything to be reloaded.

This was helpful, thank you again. I wanted to be more clear on these
nuances while reviewing Bharath's patch, because I'm suggesting that he can
avoid the WALBufMappingLock to reduce the risk of a regression. In the
process, we'll probably get rid of that unnecessary "volatile" in
AdvanceXLInsertBuffer().

Regards,
	Jeff Davis