Обсуждение: Inconsistent use of "volatile" when accessing shared memory?

Поиск
Список
Период
Сортировка

Inconsistent use of "volatile" when accessing shared memory?

От
Jeff Davis
Дата:
I noticed that we occasionally use "volatile" to access shared memory,
but usually not; and I'm not clear on the rules for doing so. For
instance, AdvanceXLInsertBuffer() writes to XLogCtl->xlblocks[nextidx]
through a volatile pointer; but then immediately writes to XLogCtl-
>InitializedUpTo with a non-volatile pointer. There are also places in
procarray.c that make use of volatile through UINT32_ACCESS_ONCE(), and
of course there are atomics (which use volatile as well as guaranteeing
atomicity).

In theory, I think we're always supposed to access shared memory
through a volatile pointer, right? Otherwise a sufficiently smart (and
cruel) compiler could theoretically optimize away the load/store in
some surprising cases, or hold a value in a register longer than we
expect, and then any memory barriers would be useless.

But in practice we don't do that even for sensitive structures like the
one referenced by XLogCtl. My intuition up until now was that if we
access through a global pointer, then the compiler wouldn't completely
optimize away the store/load. I ran through some tests and that
assumption seems to hold up, at least in a few simple examples with gcc
at -O2, which seem to emit the loads/stores where expected.

What is the guidance here? Is the volatile pointer use in
AdvanceXLInsertBuffer() required, and if so, why not other places?

Regards,
    Jeff Davis




Re: Inconsistent use of "volatile" when accessing shared memory?

От
Andres Freund
Дата:
Hi,

On 2023-11-02 23:19:03 -0700, Jeff Davis wrote:
> I noticed that we occasionally use "volatile" to access shared memory,
> but usually not; and I'm not clear on the rules for doing so. For
> instance, AdvanceXLInsertBuffer() writes to XLogCtl->xlblocks[nextidx]
> through a volatile pointer; but then immediately writes to XLogCtl-
> >InitializedUpTo with a non-volatile pointer. There are also places in
> procarray.c that make use of volatile through UINT32_ACCESS_ONCE(), and
> of course there are atomics (which use volatile as well as guaranteeing
> atomicity).

> In theory, I think we're always supposed to access shared memory
> through a volatile pointer, right? Otherwise a sufficiently smart (and
> cruel) compiler could theoretically optimize away the load/store in
> some surprising cases, or hold a value in a register longer than we
> expect, and then any memory barriers would be useless.

I don't think so. We used to use volatile for most shared memory accesses, but
volatile doesn't provide particularly useful semantics - and generates
*vastly* slower code in a lot of circumstances. Most of that usage predates
spinlocks being proper compiler barriers, which was rectified in:

commit 0709b7ee72e
Author: Robert Haas <rhaas@postgresql.org>
Date:   2014-09-09 17:45:20 -0400

    Change the spinlock primitives to function as compiler barriers.

or the introduction of compiler/memory barriers in

commit 0c8eda62588
Author: Robert Haas <rhaas@postgresql.org>
Date:   2011-09-23 17:52:43 -0400

    Memory barrier support for PostgreSQL.


Most instances of volatile used for shared memory access should be replaced
with explicit compiler/memory barriers, as appropriate.

Note that use of volatile does *NOT* guarantee anything about memory ordering!


> But in practice we don't do that even for sensitive structures like the
> one referenced by XLogCtl. My intuition up until now was that if we
> access through a global pointer, then the compiler wouldn't completely
> optimize away the store/load.

That's not guaranteed at all - luckily, as it'd lead to code being more
bulky. It's not that the global variable will be optimized away entirely in
many situations, but repeated accesses can sometimes be merged and the access
can be moved around.


> What is the guidance here? Is the volatile pointer use in
> AdvanceXLInsertBuffer() required, and if so, why not other places?

I don't think it's required. The crucial part is to avoid memory reordering
between zeroing the block / initializing fields and changing that field - the
relevant part for that is the pg_write_barrier(); *not* the volatile.

The volatile does prevent the compiler from deferring the update of
xlblocks[idx] to the next loop iteration. Which I guess isn't a bad idea - but
it's not required for correctness.


When an individual value is read/written from memory, and all that's desired
is to prevent the compiler for eliding/moving that operation, it can lead to
better code to use volatile *on the individual access* compared to using
pg_compiler_barrier(), because it allows the compiler to keep other variables
in registers.

Greetings,

Andres Freund



Re: Inconsistent use of "volatile" when accessing shared memory?

От
Jeff Davis
Дата:
On Fri, 2023-11-03 at 15:59 -0700, Andres Freund wrote:
> I don't think so. We used to use volatile for most shared memory
> accesses, but
> volatile doesn't provide particularly useful semantics - and
> generates
> *vastly* slower code in a lot of circumstances. Most of that usage
> predates
> spinlocks being proper compiler barriers, 

A compiler barrier doesn't always force the compiler to generate loads
and stores, though.

For instance (example code I placed at the bottom of xlog.c):

  typedef struct DummyStruct {
      XLogRecPtr recptr;
  } DummyStruct;
  extern void DummyFunction(void);
  static DummyStruct Dummy = { 5 };
  static DummyStruct *pDummy = &Dummy;
  void
  DummyFunction(void)
  {
      while(true)
      {
          pg_compiler_barrier();
          pg_memory_barrier();
          if (pDummy->recptr == 0)
              break;
          pg_compiler_barrier();
          pg_memory_barrier();
      }
  }


Generates the following code (clang -O2):

  000000000016ed10 <DummyFunction>:
    16ed10:       f0 83 04 24 00          lock addl $0x0,(%rsp)
    16ed15:       f0 83 04 24 00          lock addl $0x0,(%rsp)
    16ed1a:       eb f4                   jmp    16ed10 <DummyFunction>
    16ed1c:       0f 1f 40 00             nopl   0x0(%rax)

Obviously this is an oversimplified example and if I complicate it in
any number of ways then it will start generating actual loads and
stores, and then the compiler and memory barriers should do their job.


> Note that use of volatile does *NOT* guarantee anything about memory
> ordering!

Right, but it does force loads/stores to be emitted by the compiler;
and without loads/stores a memory barrier is useless.

I understand that my example is too simple and I'm not claiming that
there's a problem. I'd just like to understand the key difference
between my example and what we do with XLogCtl.

Another way to phrase my question: under what specific circumstances
must we use something like UINT32_ACCESS_ONCE()? That seems to be used
for local pointers, but it's not clear to me exactly why that matters.
Intuitively, access through a local pointer seems much more likely to
be optimized and therefore more dangerous, but that doesn't imply that
access through global variables is not dangerous.

Regards,
    Jeff Davis





Re: Inconsistent use of "volatile" when accessing shared memory?

От
Andres Freund
Дата:
Hi,

On 2023-11-03 17:44:44 -0700, Jeff Davis wrote:
> On Fri, 2023-11-03 at 15:59 -0700, Andres Freund wrote:
> > I don't think so. We used to use volatile for most shared memory
> > accesses, but
> > volatile doesn't provide particularly useful semantics - and
> > generates
> > *vastly* slower code in a lot of circumstances. Most of that usage
> > predates
> > spinlocks being proper compiler barriers,
>
> A compiler barrier doesn't always force the compiler to generate loads
> and stores, though.

I don't think volatile does so entirely either, but I don't think it's
relevant anyway.


> For instance (example code I placed at the bottom of xlog.c):
>
>   typedef struct DummyStruct {
>       XLogRecPtr recptr;
>   } DummyStruct;
>   extern void DummyFunction(void);
>   static DummyStruct Dummy = { 5 };
>   static DummyStruct *pDummy = &Dummy;
>   void
>   DummyFunction(void)
>   {
>       while(true)
>       {
>           pg_compiler_barrier();
>           pg_memory_barrier();
>           if (pDummy->recptr == 0)
>               break;
>           pg_compiler_barrier();
>           pg_memory_barrier();
>       }
>   }
>
>
> Generates the following code (clang -O2):
>
>   000000000016ed10 <DummyFunction>:
>     16ed10:       f0 83 04 24 00          lock addl $0x0,(%rsp)
>     16ed15:       f0 83 04 24 00          lock addl $0x0,(%rsp)
>     16ed1a:       eb f4                   jmp    16ed10 <DummyFunction>
>     16ed1c:       0f 1f 40 00             nopl   0x0(%rax)
>
> Obviously this is an oversimplified example and if I complicate it in
> any number of ways then it will start generating actual loads and
> stores, and then the compiler and memory barriers should do their job.

All that is happening here is that clang can prove that nothing ever could
change the variable (making this actually undefined behaviour, IIRC). If you
add another function like:

extern void
SetDummy(uint64_t lsn)
{
    pDummy->recptr = lsn;
}

clang does generate the code you'd expect - there's a possibility that it
could be true.

See https://godbolt.org/z/EaM77E8jK

We don't gain anything from preventing the compiler from making such
conclusions afaict.


> > Note that use of volatile does *NOT* guarantee anything about memory
> > ordering!
>
> Right, but it does force loads/stores to be emitted by the compiler;
> and without loads/stores a memory barrier is useless.

There would not have been the point in generating that load...


> I understand that my example is too simple and I'm not claiming that
> there's a problem. I'd just like to understand the key difference
> between my example and what we do with XLogCtl.

The key difference is that there's code changing relevant variables :)


> Another way to phrase my question: under what specific circumstances
> must we use something like UINT32_ACCESS_ONCE()? That seems to be used
> for local pointers

I guess I don't really know what you mean with global or local pointers?
There's no point in ever using something like it if the memory is local to a
function.

The procarray uses all are for shared memory. Sure, we save a pointer to a
subset of shared memory in a local variable, but that doesn't change very much
(it does tell the compiler that it doesnt' need to reload ProcGlobal->xid from
memory after calling an external function (which could change it), so it's a
minor efficiency improvement).

The reason for
        /* Fetch xid just once - see GetNewTransactionId */
        pxid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
is explained in a comment in GetNewTransactionId():
     *
     * Note that readers of ProcGlobal->xids/PGPROC->xid should be careful to
     * fetch the value for each proc only once, rather than assume they can
     * read a value multiple times and get the same answer each time.  Note we
     * are assuming that TransactionId and int fetch/store are atomic.

without the READ_ONCE(), the compiler could decide to not actually load the
memory contents into a register through the loop body, but to just refetch it
every time.

We could also implement this with a compiler barrier between fetching pxid and
using it - but it'd potentially lead to worse code, because instead of just
forcing one load to come from memory, it'd also force reloading *other*
variables from memory.


> but it's not clear to me exactly why that matters.  Intuitively, access
> through a local pointer seems much more likely to be optimized and therefore
> more dangerous, but that doesn't imply that access through global variables
> is not dangerous.

I really don't think there's a meaningful difference between the two. What is
interesting is where the memory points to and whether the compiler can reason
about other code potentially having had access to the memory.

E.g. with code like this (overly simple example)

    int foo = 0;
    int *foop = &0;

    pg_memory_barrier();

    if (*foop != 0)
       return false;

the compiler can easily prove that the variable could not have changed - there
is no legal way for foo too have changed, regardless of the memory barrier.

However, if the code looked like this:

    int foo = 0;
    int *foop = &0;

    SomethingTheCompilerCantSee(foop);

    if (*foop != 0)
       return false;

or

extern int *extern_foop;

    int foo = 0;
    int *foop = &0;

    extern_foop = foop;

    if (*foop != 0)
       return false;


the compiler couldn't prove that anymore.


Of course the case of a pointer to a stack variable is on the very trivial
side. But this holds true for other cases, e.g. the one you noticed above, a
static variable (guaranteeing it's not accessed in other translation units)
only accessed in one function - there couldn't have been anything modifying
it.

This even holds true for memory allocated by malloc() - the language
guarantees that there shall be no other pointers to that memory, so the
compiler knows that just after returning there can't be nobody but the caller
accessing the memory. Until the pointer escapes in some form.


Greetings,

Andres Freund



Re: Inconsistent use of "volatile" when accessing shared memory?

От
Jeff Davis
Дата:
On Fri, 2023-11-03 at 18:36 -0700, Andres Freund wrote:
> All that is happening here is that clang can prove that nothing ever
> could
> change the variable

...

> See https://godbolt.org/z/EaM77E8jK
>
> We don't gain anything from preventing the compiler from making such
> conclusions afaict.

Thank you. I modified this a bit to use a global variable and then
perform a write outside the loop: https://godbolt.org/z/EfvGx64eE

  extern void DummyFunction(void);
  extern DummyStruct Dummy;
  DummyStruct Dummy = { 5 };
  static DummyStruct *pDummy = &Dummy;

  void
  DummyFunction(void)
  {
      pDummy->recptr = 3;
      while(true)
      {
          if (pDummy->recptr == 0)
              break;
          pg_compiler_barrier();
      }
  }

And I think that mostly answers my question: if the compiler barrier is
there, it loads each time; and if I take the barrier away, it optimizes
into an infinite loop.

I suppose what's happening without the compiler barrier is that the
load is being moved up right below the store, which allows it to be
optimized away entirely. The compiler barrier prevents moving the load,
and therefore prevents that optimization.

I'm still somewhat hazy on exactly what a compiler barrier prevents --
it seems like it depends on the internal details of how the compiler
optimizes in stages. But I see that it's doing what we need it to do
now.

>
> The key difference is that there's code changing relevant variables
> :)

Yeah.

> I guess I don't really know what you mean with global or local
> pointers?

I meant "global pointers to shared memory" (like XLogCtl) vs "local
pointers to shared memory" (like other_xids in
TransactionIdIsActive()).

> We could also implement this with a compiler barrier between fetching
> pxid and
> using it - but it'd potentially lead to worse code, because instead
> of just
> forcing one load to come from memory, it'd also force reloading
> *other*
> variables from memory.

I see -- that's a way to be more selective about what gets reloaded
since the last compiler barrier, rather than inserting a new compiler
barrier which would cause everything to be reloaded.

This was helpful, thank you again. I wanted to be more clear on these
nuances while reviewing Bharath's patch, because I'm suggesting that he
can avoid the WALBufMappingLock to reduce the risk of a regression. In
the process, we'll probably get rid of that unnecessary "volatile" in
AdvanceXLInsertBuffer().

Regards,
    Jeff Davis