Обсуждение: epoll_wait returning EFAULT on Linux 3.2.78

Поиск
Список
Период
Сортировка

epoll_wait returning EFAULT on Linux 3.2.78

От
Greg Stark
Дата:
I was just trying out a new (well, new to me...) machine here. It
happens that the version of Linux installed is 3.2.78. Apparently the
current version on this branch is 3.2.80 released a couple months ago
and I don't see any relevant changes in .79 or .80 so I think this is
actually likely to be the behaviour in a "supported" version of Linux.

What I'm seeing is that every call to epoll_wait() raises EFAULT. I
don't see anything wrong with the arguments to epoll_wait so unless
there was some earlier bogus argument to epoll_ctl or something this
looks like some kind of kernel or glibc problem and it looks like
we'll to check for EFAULT and fall back to poll (a configure test
wouldn't help since this is run-time behaviour).

Breakpoint 1, WaitEventSetWaitBlock (set=0x9c0a88, cur_timeout=-1,   occurred_events=0xff88f4f8, nevents=1) at
latch.c:975
975 int returned_events = 0;

(gdb) n
981 rc = epoll_wait(set->epoll_fd, set->epoll_ret_events,

(gdb) p *set
$9 = {nevents = 3, nevents_space = 3, events = 0x9c0aa4, latch = 0xf75a9808, latch_pos = 1, epoll_fd = 3,
epoll_ret_events= 0x9c0ad4}
 

(gdb) p *set->epoll_ret_events
$10 = {events = 0, data = {ptr = 0x0, fd = 0, u32 = 0, u64 = 0}}

(gdb) p nevents
$11 = 1

(gdb) p cur_timeout
$12 = -1

(gdb) n
Breakpoint 3, WaitEventSetWaitBlock (set=0x9c0a88, cur_timeout=-1,   occurred_events=0xff88f4f8, nevents=1) at
latch.c:985
985 if (rc < 0)

(gdb) p rc
$13 = -1

(gdb) p errno
$14 = 14


-- 
greg



Re: epoll_wait returning EFAULT on Linux 3.2.78

От
Andres Freund
Дата:
Hi,

On 2016-06-02 15:31:15 +0100, Greg Stark wrote:
> What I'm seeing is that every call to epoll_wait() raises EFAULT. I
> don't see anything wrong with the arguments to epoll_wait so unless
> there was some earlier bogus argument to epoll_ctl or something this
> looks like some kind of kernel or glibc problem and it looks like
> we'll to check for EFAULT and fall back to poll (a configure test
> wouldn't help since this is run-time behaviour).

I'd really like knowing what's going on before doing that. epoll was
definitely supported back then, and just falling back to poll seems too
likely to be just papering over bug or portability issue on our side.

Any chance you could provide a strace of a problematic sequence?


- Andres



Re: epoll_wait returning EFAULT on Linux 3.2.78

От
Tom Lane
Дата:
Greg Stark <stark@mit.edu> writes:
> What I'm seeing is that every call to epoll_wait() raises EFAULT.

The man page for epoll_wait suggests that that implies a bad pointer
value for the events array.  You're showingepoll_ret_events = 0x9c0ad4
which is not obviously bad, but it's also only 4-byte aligned.
I notice that CreateWaitEventSet() is being remarkably cavalier
about alignment requirements; maybe it ought to make some effort
to ensure that epoll_ret_events is maxaligned.  Personally I'd
make all the subsidiary arrays maxaligned, just to be sure.
The existing coding is taking it on faith that the alignment
requirements of each later component of the WaitEventSet are no
worse than those of any earlier component; which even if true today
is an assumption that will bite us on the ass someday.
        regards, tom lane



Re: epoll_wait returning EFAULT on Linux 3.2.78

От
Greg Stark
Дата:
On Thu, Jun 2, 2016 at 3:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The man page for epoll_wait suggests that that implies a bad pointer
> value for the events array.  You're showing
>         epoll_ret_events = 0x9c0ad4
> which is not obviously bad, but it's also only 4-byte aligned.
> I notice that CreateWaitEventSet() is being remarkably cavalier
> about alignment requirements; maybe it ought to make some effort
> to ensure that epoll_ret_events is maxaligned.


That makes a certain amount of sense as This is Debian Sparc64 which
as I understand it has the kernel compiled in 64-bit mode but not most
of userland. I bet if I compile Postgres in 64-bit mode it'll turn up
as an alignment issue in userland before it even reaches the kernel.
Or maybe not, maybe it's just that __put_user is stricter about
alignment than really necessary.

-- 
greg



Re: epoll_wait returning EFAULT on Linux 3.2.78

От
Tom Lane
Дата:
Greg Stark <stark@mit.edu> writes:
> On Thu, Jun 2, 2016 at 3:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I notice that CreateWaitEventSet() is being remarkably cavalier
>> about alignment requirements; maybe it ought to make some effort
>> to ensure that epoll_ret_events is maxaligned.

> That makes a certain amount of sense as This is Debian Sparc64 which
> as I understand it has the kernel compiled in 64-bit mode but not most
> of userland. I bet if I compile Postgres in 64-bit mode it'll turn up
> as an alignment issue in userland before it even reaches the kernel.
> Or maybe not, maybe it's just that __put_user is stricter about
> alignment than really necessary.

Hm.  If you're doing a 32-bit build then we'll likely think that
4-byte alignment *is* maxaligned.  But it's not very hard to believe
that the kernel might be insisting on 8-byte alignment anyway.  That
would be a bug, but one that would easily escape detection for such
a seldom-used kernel call.

We might have to do something to force the pointer to be 8-byte-aligned
even in 32-bit builds.  Ick, but better than falling back to poll().

Can you do a test and see if 8-byte aligning the pointer makes the
EFAULT go away?
        regards, tom lane



Re: epoll_wait returning EFAULT on Linux 3.2.78

От
Greg Stark
Дата:
So FYI, it does look like Postgres built in 32-bit mode, at least
pointers are 32 bits. But I think maxalign might still be enough due
to doubles being 64 bits.

checking whether long int is 64 bits... no
checking whether long long int is 64 bits... yes
checking snprintf length modifier for long long int... ll
checking whether snprintf supports the %z modifier... yes
checking size of void *... 4
checking size of size_t... 4
checking size of long... 4
checking whether to build with float4 passed by value... yes
checking whether to build with float8 passed by value... no
checking alignment of short... 2
checking alignment of int... 4
checking alignment of long... 4
checking alignment of long long int... 8
checking alignment of double... 8
checking for int8... no
checking for uint8... no
checking for int64... no
checking for uint64... no
checking for __int128... no

I wrote a test program based on latch.c and tested various alignments.
It does indeed get EFAULT for anything other than 8-byte alignment
(the second argument is the number of bytes to offset the events
structure):

$ echo | ./a.out 1 0
epoll_wait(epfd=5, events=0x22018, maxevents=1, timeout=-1)
success
$ echo | ./a.out 1 1
epoll_wait(epfd=5, events=0x22019, maxevents=1, timeout=-1)
epoll_wait: Bad address
$ echo | ./a.out 1 2
epoll_wait(epfd=5, events=0x2201a, maxevents=1, timeout=-1)
epoll_wait: Bad address
$ echo | ./a.out 1 4
epoll_wait(epfd=5, events=0x2201c, maxevents=1, timeout=-1)
epoll_wait: Bad address

The same program gets success for any all offsets on x86.

Вложения

Re: epoll_wait returning EFAULT on Linux 3.2.78

От
Tom Lane
Дата:
Greg Stark <stark@mit.edu> writes:
> So FYI, it does look like Postgres built in 32-bit mode, at least
> pointers are 32 bits. But I think maxalign might still be enough due
> to doubles being 64 bits.

OK, let's just maxalign all the components of the WaitEventSet struct
and call it good, at least till we find out differently.  Andres,
do you want to do that?
        regards, tom lane



Re: epoll_wait returning EFAULT on Linux 3.2.78

От
Andres Freund
Дата:
Hi,

On 2016-06-02 17:50:20 +0100, Greg Stark wrote:
> So FYI, it does look like Postgres built in 32-bit mode, at least
> pointers are 32 bits. But I think maxalign might still be enough due
> to doubles being 64 bits.

That makes sense independent of 32 vs. 64 system. Part of the relevant
struct probably require 8 byte alignment:
          typedef union epoll_data {              void    *ptr;              int      fd;              uint32_t u32;
         uint64_t u64;          } epoll_data_t;
 
          struct epoll_event {              uint32_t     events;    /* Epoll events */              epoll_data_t data;
   /* User data variable */          };
 

So using long long alignment seems the best bet here.

Greg, are you writing & testing a patch? Or should I write something for
you to test?

Regards,

Andres



Re: epoll_wait returning EFAULT on Linux 3.2.78

От
Greg Stark
Дата:
On Thu, Jun 2, 2016 at 6:04 PM, Andres Freund <andres@anarazel.de> wrote:
> Greg, are you writing & testing a patch? Or should I write something for
> you to test?

I'm running the regression tests now. They look like they're passing.

I just threw maxalign everywhere but I was going to comment that we
might need to put a double in for the subsequent struct elements to
end up aligned as well. But I guess that's not true due to the u64
element.  I'm not sure I've completely grokked all the bits and pieces
here so I may have added a few too many maxaligns. I put them on all
the size and offset calculations before the multiplications.



--
greg

Вложения

Re: epoll_wait returning EFAULT on Linux 3.2.78

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> That makes sense independent of 32 vs. 64 system. Part of the relevant
> struct probably require 8 byte alignment:

>            typedef union epoll_data {
>                void    *ptr;
>                int      fd;
>                uint32_t u32;
>                uint64_t u64;
>            } epoll_data_t;

On a 32-bit machine it's entirely possible that that would only require
32-bit alignment.  But based on what we know so far, using our regular
MAXALIGN macro should be good enough.
        regards, tom lane



Re: epoll_wait returning EFAULT on Linux 3.2.78

От
Andres Freund
Дата:
On 2016-06-02 13:24:38 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > That makes sense independent of 32 vs. 64 system. Part of the relevant
> > struct probably require 8 byte alignment:
> 
> >            typedef union epoll_data {
> >                void    *ptr;
> >                int      fd;
> >                uint32_t u32;
> >                uint64_t u64;
> >            } epoll_data_t;
> 
> On a 32-bit machine it's entirely possible that that would only require
> 32-bit alignment.

Not on 32bit sparc afaics:
> checking alignment of long long int... 8
> checking alignment of double... 8

> But based on what we know so far, using our regular
> MAXALIGN macro should be good enough.

Yea.

- Andres



Re: epoll_wait returning EFAULT on Linux 3.2.78

От
Andres Freund
Дата:
On 2016-06-02 18:15:54 +0100, Greg Stark wrote:
> I just threw maxalign everywhere but I was going to comment that we
> might need to put a double in for the subsequent struct elements to
> end up aligned as well.

Hm. Shouldn't be needed if you MAXALIGN the sz computation, because
that'll mean each struct will start at the proper offset.


> But I guess that's not true due to the u64
> element.  I'm not sure I've completely grokked all the bits and pieces
> here so I may have added a few too many maxaligns. I put them on all
> the size and offset calculations before the multiplications.

They should be *after* the multiplications. If arrays of structs are
internally misaligned there's nothing we can do.


> diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
> index 3fbe0e5..5e69a5a 100644
> --- a/src/backend/storage/ipc/latch.c
> +++ b/src/backend/storage/ipc/latch.c
> @@ -485,35 +485,35 @@ CreateWaitEventSet(MemoryContext context, int nevents)
>      char       *data;
>      Size        sz = 0;
>  
> -    sz += sizeof(WaitEventSet);
> -    sz += sizeof(WaitEvent) * nevents;
> +    sz += MAXALIGN(sizeof(WaitEventSet));
> +    sz += MAXALIGN(sizeof(WaitEvent)) * nevents;
>  
>  #if defined(WAIT_USE_EPOLL)
> -    sz += sizeof(struct epoll_event) * nevents;
> +    sz += MAXALIGN(sizeof(struct epoll_event)) * nevents;

So this needs to include the '* nevents' inside the MAXALIGN.


>  #elif defined(WAIT_USE_POLL)
> -    sz += sizeof(struct pollfd) * nevents;
> +    sz += MAXALIGN(sizeof(struct pollfd)) * nevents;

Same.


>  #elif defined(WAIT_USE_WIN32)
>      /* need space for the pgwin32_signal_event */
> -    sz += sizeof(HANDLE) * (nevents + 1);
> +    sz += MAXALIGN(sizeof(HANDLE)) * (nevents + 1);
>  #endif

same.


>  
>  #if defined(WAIT_USE_EPOLL)
>      set->epoll_ret_events = (struct epoll_event *) data;
> -    data += sizeof(struct epoll_event) * nevents;
> +    data += MAXALIGN(sizeof(struct epoll_event)) * nevents;
>  #elif defined(WAIT_USE_POLL)
>      set->pollfds = (struct pollfd *) data;
> -    data += sizeof(struct pollfd) * nevents;
> +    data += MAXALIGN(sizeof(struct pollfd)) * nevents;
>  #elif defined(WAIT_USE_WIN32)
>      set->handles = (HANDLE) data;
> -    data += sizeof(HANDLE) * nevents;
> +    data += MAXALIGN(sizeof(HANDLE)) * nevents;
>  #endif

same.


Want me to polish that up and push, or do you want to go back and forth
and push yourself?


Greetings,

Andres Freund



Re: epoll_wait returning EFAULT on Linux 3.2.78

От
Greg Stark
Дата:
On Thu, Jun 2, 2016 at 6:35 PM, Andres Freund <andres@anarazel.de> wrote:
> Want me to polish that up and push, or do you want to go back and forth
> and push yourself?

I'm happy to check if my bits still work if it's not too much hassle
to go back and forth.


> They should be *after* the multiplications. If arrays of structs are
> internally misaligned there's nothing we can do.

Well there's not *nothing* we can do. I thought I we were going to
have to go back and do manual offset calculations to get that right.
But with the u64_t element that's probably not necessary. It's a bit
scary having that be the only thing protecting it but then again it's
probably exactly why it's there. Given how much of a pain doing manual
offset calculations I'm happy to go along and say it's not worth
bothering.

Should the maxaligns be there for the non-epoll cases? I tossed them
in without paying much attention. I really have no idea if they're
needed on Windows or not for example.


-- 
greg



Re: epoll_wait returning EFAULT on Linux 3.2.78

От
Andres Freund
Дата:
On 2016-06-02 18:41:00 +0100, Greg Stark wrote:
> On Thu, Jun 2, 2016 at 6:35 PM, Andres Freund <andres@anarazel.de> wrote:
> > Want me to polish that up and push, or do you want to go back and forth
> > and push yourself?
> 
> I'm happy to check if my bits still work if it's not too much hassle
> to go back and forth.

No problem.


> > They should be *after* the multiplications. If arrays of structs are
> > internally misaligned there's nothing we can do.
> 
> Well there's not *nothing* we can do. I thought I we were going to
> have to go back and do manual offset calculations to get that right.

The kernel accesses the elements as an array. If the array stride (by
virtue of sizeof) were wrong, we couldn't fix that.


> But with the u64_t element that's probably not necessary. It's a bit
> scary having that be the only thing protecting it but then again it's
> probably exactly why it's there.

I guess the problem occurs due to the kernel checking that the
parameter is correctly aligned to contain an epoll_event array. Which
needs its alignment due to that uint64_t.  I rather doubt that the
syscall requires 8 byte alignment because of the 32 vs. 64 bit split;
that problem exists in a *lot* of places...

> Should the maxaligns be there for the non-epoll cases? I tossed them
> in without paying much attention. I really have no idea if they're
> needed on Windows or not for example.

The might not absolutely be needed in some cases today, but I think it's
better to be safe. Who knows what we'll add after the current elements
at some later point.

Greetings,

Andres Freund



Re: epoll_wait returning EFAULT on Linux 3.2.78

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> On 2016-06-02 18:41:00 +0100, Greg Stark wrote:
>> Well there's not *nothing* we can do. I thought I we were going to
>> have to go back and do manual offset calculations to get that right.

> The kernel accesses the elements as an array. If the array stride (by
> virtue of sizeof) were wrong, we couldn't fix that.

Right.  The general rule in C is that sizeof(anything) is always a
multiple of the something's alignment requirement, so that if you
have a correctly aligned initial element of an array then later
elements are also correctly aligned.  The problem in our existing
code is that sizeof(WaitEventSet) might not be a multiple of the
alignment requirement of WaitEvent, and either of those two might
not be a multiple of the alignment requirement of struct epoll_event,
etc.  So we should make the code look like
sz += MAXALIGN(sizeof(WaitEventSet));sz += MAXALIGN(sizeof(WaitEvent) * nevents);

#if defined(WAIT_USE_EPOLL)sz += MAXALIGN(sizeof(struct epoll_event) * nevents);

etc, so that each of the subsidiary arrays starts on a MAXALIGN boundary.
Where the later array elements fall is taken care of given that.
        regards, tom lane



Re: epoll_wait returning EFAULT on Linux 3.2.78

От
Greg Stark
Дата:
Ok. I added a comment and also fixed a small typo.

Вложения

Re: epoll_wait returning EFAULT on Linux 3.2.78

От
Andres Freund
Дата:
Hi,

On 2016-06-02 19:15:07 +0100, Greg Stark wrote:
> Ok. I added a comment and also fixed a small typo.

> diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
> index 3fbe0e5..a96fb7b 100644
> --- a/src/backend/storage/ipc/latch.c
> +++ b/src/backend/storage/ipc/latch.c
> @@ -485,35 +485,40 @@ CreateWaitEventSet(MemoryContext context, int nevents)
>      char       *data;
>      Size        sz = 0;
>  
> -    sz += sizeof(WaitEventSet);
> -    sz += sizeof(WaitEvent) * nevents;
> +    /* 
> +     * struct epoll_event contains a uint64_t which on some architectures
> +     * requires 8-byte alignment and just to be safe be prepared for future
> +     * additions of other such elements later in the WaitEventSet structure
> +     */

I'd rather phrase this roughly like:

Use MAXALIGN size/alignment to guarantee that later uses of memory are
aligned correctly. E.g. epoll_event might need 8 byte alignment on some
platforms, but earlier allocations like WaitEventSet and WaitEvent might
not sized to guarantee that when purely using sizeof().

> +    sz += MAXALIGN(sizeof(WaitEventSet));
> +    sz += MAXALIGN(sizeof(WaitEvent) * nevents);
>  
>  #if defined(WAIT_USE_EPOLL)
> -    sz += sizeof(struct epoll_event) * nevents;
> +    sz += MAXALIGN(sizeof(struct epoll_event) * nevents);
>  #elif defined(WAIT_USE_POLL)
> -    sz += sizeof(struct pollfd) * nevents;
> +    sz += MAXALIGN(sizeof(struct pollfd) * nevents);
>  #elif defined(WAIT_USE_WIN32)
>      /* need space for the pgwin32_signal_event */
> -    sz += sizeof(HANDLE) * (nevents + 1);
> +    sz += MAXALIGN(sizeof(HANDLE) * (nevents + 1));
>  #endif
>  
>      data = (char *) MemoryContextAllocZero(context, sz);
>  
>      set = (WaitEventSet *) data;
> -    data += sizeof(WaitEventSet);
> +    data += MAXALIGN(sizeof(WaitEventSet));
>  
>      set->events = (WaitEvent *) data;
> -    data += sizeof(WaitEvent) * nevents;
> +    data += MAXALIGN(sizeof(WaitEvent) * nevents);
>  
>  #if defined(WAIT_USE_EPOLL)
>      set->epoll_ret_events = (struct epoll_event *) data;
> -    data += sizeof(struct epoll_event) * nevents;
> +    data += MAXALIGN(sizeof(struct epoll_event) * nevents);
>  #elif defined(WAIT_USE_POLL)
>      set->pollfds = (struct pollfd *) data;
> -    data += sizeof(struct pollfd) * nevents;
> +    data += MAXALIGN(sizeof(struct pollfd) * nevents);
>  #elif defined(WAIT_USE_WIN32)
>      set->handles = (HANDLE) data;
> -    data += sizeof(HANDLE) * nevents;
> +    data += MAXALIGN(sizeof(HANDLE) * nevents);
>  #endif

You can't easily test WIN32, but I'd suggest running make check with
WAIT_USE_POLL and WAIT_USE_SELECT at the top.

Looks sane otherwise.

Greetings,

Andres Freund



Re: epoll_wait returning EFAULT on Linux 3.2.78

От
Tom Lane
Дата:
Greg Stark <stark@mit.edu> writes:
> Ok. I added a comment and also fixed a small typo.

Looks sane to me, though I might reduce the comment to something like
"MAXALIGN all the subsidiary arrays, to avoid interdependencies of the
alignment requirements of their component types."
        regards, tom lane