Обсуждение: EINTR in ftruncate()

Поиск
Список
Период
Сортировка

EINTR in ftruncate()

От
Alvaro Herrera
Дата:
Nicola Contu reported two years ago to pgsql-general[1] that they were
having sporadic query failures, because EINTR is reported on some system
call.  I have been told that the problem persists, though it is very
infrequent.  I propose the attached patch.  Kyotaro proposed a slightly
different patch which also protects write(), but I think that's not
necessary.

Thomas M. produced some more obscure theories for other things that
could fail, but I think we should patch this problem first, which seems
the most obvious one, and deal with others if and when they are
reported.

[1] https://www.postgresql.org/message-id/CAMTZZh2V%2B0wJVgSqTVvXUAVMduF57Uxubvvw58%3DkbOae%2B53%2BQQ%40mail.gmail.com

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Use it up, wear it out, make it do, or do without"

Вложения

Re: EINTR in ftruncate()

От
Andres Freund
Дата:
Hi,

On 2022-07-01 17:41:05 +0200, Alvaro Herrera wrote:
> Nicola Contu reported two years ago to pgsql-general[1] that they were
> having sporadic query failures, because EINTR is reported on some system
> call.  I have been told that the problem persists, though it is very
> infrequent.  I propose the attached patch.  Kyotaro proposed a slightly
> different patch which also protects write(), but I think that's not
> necessary.

What is the reason for the || ProcDiePending || QueryCancelPending bit? What
if there's dsm operations intentionally done while QueryCancelPending?

Greetings,

Andres Freund



Re: EINTR in ftruncate()

От
Alvaro Herrera
Дата:
On 2022-Jul-01, Andres Freund wrote:

> On 2022-07-01 17:41:05 +0200, Alvaro Herrera wrote:
> > Nicola Contu reported two years ago to pgsql-general[1] that they were
> > having sporadic query failures, because EINTR is reported on some system
> > call.  I have been told that the problem persists, though it is very
> > infrequent.  I propose the attached patch.  Kyotaro proposed a slightly
> > different patch which also protects write(), but I think that's not
> > necessary.
> 
> What is the reason for the || ProcDiePending || QueryCancelPending bit? What
> if there's dsm operations intentionally done while QueryCancelPending?

That mirrors the test for the other block in that function, which was
added by 63efab4ca139, whose commit message explains:

    Allow DSM allocation to be interrupted.
    
    Chris Travers reported that the startup process can repeatedly try to
    cancel a backend that is in a posix_fallocate()/EINTR loop and cause it
    to loop forever.  Teach the retry loop to give up if an interrupt is
    pending.  Don't actually check for interrupts in that loop though,
    because a non-local exit would skip some clean-up code in the caller.

Thanks for looking!

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: EINTR in ftruncate()

От
Andres Freund
Дата:
Hi,

On 2022-07-01 19:55:16 +0200, Alvaro Herrera wrote:
> On 2022-Jul-01, Andres Freund wrote:
> 
> > On 2022-07-01 17:41:05 +0200, Alvaro Herrera wrote:
> > > Nicola Contu reported two years ago to pgsql-general[1] that they were
> > > having sporadic query failures, because EINTR is reported on some system
> > > call.  I have been told that the problem persists, though it is very
> > > infrequent.  I propose the attached patch.  Kyotaro proposed a slightly
> > > different patch which also protects write(), but I think that's not
> > > necessary.
> > 
> > What is the reason for the || ProcDiePending || QueryCancelPending bit? What
> > if there's dsm operations intentionally done while QueryCancelPending?
> 
> That mirrors the test for the other block in that function, which was
> added by 63efab4ca139, whose commit message explains:
> 
>     Allow DSM allocation to be interrupted.
>     
>     Chris Travers reported that the startup process can repeatedly try to
>     cancel a backend that is in a posix_fallocate()/EINTR loop and cause it
>     to loop forever.  Teach the retry loop to give up if an interrupt is
>     pending.  Don't actually check for interrupts in that loop though,
>     because a non-local exit would skip some clean-up code in the caller.

That whole approach seems quite wrong to me. At the absolute very least the
code needs to check if interrupts are being processed in the current context
before just giving up due to ProcDiePending || QueryCancelPending.

I'm very unconvinced this ought to be fixed in dsm_impl_posix_resize(), rather
than the startup process signalling.

There is an argument for allowing more things to be cancelled, but we'd need a
retry loop for the !INTERRUPTS_CAN_BE_PROCESSED() case.

Greetings,

Andres Freund



Re: EINTR in ftruncate()

От
Andres Freund
Дата:
Hi Chris,

On 2022-07-01 13:29:44 -0700, Andres Freund wrote:
> On 2022-07-01 19:55:16 +0200, Alvaro Herrera wrote:
> > On 2022-Jul-01, Andres Freund wrote:
> > 
> > > On 2022-07-01 17:41:05 +0200, Alvaro Herrera wrote:
> > > > Nicola Contu reported two years ago to pgsql-general[1] that they were
> > > > having sporadic query failures, because EINTR is reported on some system
> > > > call.  I have been told that the problem persists, though it is very
> > > > infrequent.  I propose the attached patch.  Kyotaro proposed a slightly
> > > > different patch which also protects write(), but I think that's not
> > > > necessary.
> > > 
> > > What is the reason for the || ProcDiePending || QueryCancelPending bit? What
> > > if there's dsm operations intentionally done while QueryCancelPending?
> > 
> > That mirrors the test for the other block in that function, which was
> > added by 63efab4ca139, whose commit message explains:
> > 
> >     Allow DSM allocation to be interrupted.
> >     
> >     Chris Travers reported that the startup process can repeatedly try to
> >     cancel a backend that is in a posix_fallocate()/EINTR loop and cause it
> >     to loop forever.  Teach the retry loop to give up if an interrupt is
> >     pending.  Don't actually check for interrupts in that loop though,
> >     because a non-local exit would skip some clean-up code in the caller.
> 
> That whole approach seems quite wrong to me. At the absolute very least the
> code needs to check if interrupts are being processed in the current context
> before just giving up due to ProcDiePending || QueryCancelPending.
> 
> I'm very unconvinced this ought to be fixed in dsm_impl_posix_resize(), rather
> than the startup process signalling.
> 
> There is an argument for allowing more things to be cancelled, but we'd need a
> retry loop for the !INTERRUPTS_CAN_BE_PROCESSED() case.

Chris, do you have any additional details about the machine that lead to this
change? OS version, whether it might have been swapping, etc?

I wonder if what happened is that posix_fallocate() used glibc's fallback
implementation because the kernel was old enough to not support fallocate()
for tmpfs.  Looks like support for fallocate() for tmpfs was added in 3.5
([1]). So e.g. a rhel 6 wouldn't have had that.

Greetings,

Andres Freund

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e2d12e22c59ce714008aa5266d769f8568d74eac



Re: EINTR in ftruncate()

От
Thomas Munro
Дата:
On Sat, Jul 2, 2022 at 9:06 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-07-01 13:29:44 -0700, Andres Freund wrote:
> > On 2022-07-01 19:55:16 +0200, Alvaro Herrera wrote:
> > >     Allow DSM allocation to be interrupted.
> > >
> > >     Chris Travers reported that the startup process can repeatedly try to
> > >     cancel a backend that is in a posix_fallocate()/EINTR loop and cause it
> > >     to loop forever.  Teach the retry loop to give up if an interrupt is
> > >     pending.  Don't actually check for interrupts in that loop though,
> > >     because a non-local exit would skip some clean-up code in the caller.
> >
> > That whole approach seems quite wrong to me. At the absolute very least the
> > code needs to check if interrupts are being processed in the current context
> > before just giving up due to ProcDiePending || QueryCancelPending.
> >
> > I'm very unconvinced this ought to be fixed in dsm_impl_posix_resize(), rather
> > than the startup process signalling.

I agree it's not great.  It was a back-patchable bandaid in need of a
better solution.

> Chris, do you have any additional details about the machine that lead to this
> change? OS version, whether it might have been swapping, etc?
>
> I wonder if what happened is that posix_fallocate() used glibc's fallback
> implementation because the kernel was old enough to not support fallocate()
> for tmpfs.  Looks like support for fallocate() for tmpfs was added in 3.5
> ([1]). So e.g. a rhel 6 wouldn't have had that.

With a quick test program on my Linux 5.10 kernel I see that an
SA_RESTART signal handler definitely causes posix_fallocate() to
return EINTR (can post trivial program).

A drive-by look at the current/modern kernel source supports this:
shmem_fallocate returns -EINTR directly (not -ERESTARTSYS, which seems
to be the Linux-y way to say you want EINTR or restart as
appropriate?), and it also undoes all partial progress too (not too
surprising), which would explain why a perfectly timed machine gun
stream of signals from our recovery conflict system can make an
fallocate retry loop never terminate, for large enough sizes.



Re: EINTR in ftruncate()

От
Andres Freund
Дата:
Hi,

On 2022-07-02 09:52:33 +1200, Thomas Munro wrote:
> On Sat, Jul 2, 2022 at 9:06 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2022-07-01 13:29:44 -0700, Andres Freund wrote:
> > Chris, do you have any additional details about the machine that lead to this
> > change? OS version, whether it might have been swapping, etc?
> >
> > I wonder if what happened is that posix_fallocate() used glibc's fallback
> > implementation because the kernel was old enough to not support fallocate()
> > for tmpfs.  Looks like support for fallocate() for tmpfs was added in 3.5
> > ([1]). So e.g. a rhel 6 wouldn't have had that.
> 
> With a quick test program on my Linux 5.10 kernel I see that an
> SA_RESTART signal handler definitely causes posix_fallocate() to
> return EINTR (can post trivial program).
> 
> A drive-by look at the current/modern kernel source supports this:
> shmem_fallocate returns -EINTR directly (not -ERESTARTSYS, which seems
> to be the Linux-y way to say you want EINTR or restart as
> appropriate?), and it also undoes all partial progress too (not too
> surprising), which would explain why a perfectly timed machine gun
> stream of signals from our recovery conflict system can make an
> fallocate retry loop never terminate, for large enough sizes.

Yea :(

And even if we fix recovery to not do douse other processes in signals quite
that badly, there are plenty other sources of signals that can arrive at a
steady clip. So I think we need to do something to defuse this another way.

Ideas:

1) do the fallocate in smaller chunks, thereby making it much more likely to
   complete between two signal deliveries
2) block signals while calling posix_fallocate(). That won't work for
   everything (e.g. rapid SIGSTOP/SIGCONT), but that's not something we'd send
   ourselves, so whatever.
3) 1+2
4) ?

Greetings,

Andres Freund



Re: EINTR in ftruncate()

От
Alvaro Herrera
Дата:
On 2022-Jul-01, Andres Freund wrote:

> On 2022-07-01 19:55:16 +0200, Alvaro Herrera wrote:
> > On 2022-Jul-01, Andres Freund wrote:

> > > What is the reason for the || ProcDiePending || QueryCancelPending bit? What
> > > if there's dsm operations intentionally done while QueryCancelPending?
> > 
> > That mirrors the test for the other block in that function, which was
> > added by 63efab4ca139, whose commit message explains:

> That whole approach seems quite wrong to me. At the absolute very least the
> code needs to check if interrupts are being processed in the current context
> before just giving up due to ProcDiePending || QueryCancelPending.

For the time being, I can just push the addition of the EINTR retry
without testing ProcDiePending || QueryCancelPending.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"El sudor es la mejor cura para un pensamiento enfermo" (Bardia)

Вложения

Re: EINTR in ftruncate()

От
Andres Freund
Дата:
Hi,

On 2022-07-04 13:07:50 +0200, Alvaro Herrera wrote:
> On 2022-Jul-01, Andres Freund wrote:
> 
> > On 2022-07-01 19:55:16 +0200, Alvaro Herrera wrote:
> > > On 2022-Jul-01, Andres Freund wrote:
> 
> > > > What is the reason for the || ProcDiePending || QueryCancelPending bit? What
> > > > if there's dsm operations intentionally done while QueryCancelPending?
> > > 
> > > That mirrors the test for the other block in that function, which was
> > > added by 63efab4ca139, whose commit message explains:
> 
> > That whole approach seems quite wrong to me. At the absolute very least the
> > code needs to check if interrupts are being processed in the current context
> > before just giving up due to ProcDiePending || QueryCancelPending.
> 
> For the time being, I can just push the addition of the EINTR retry
> without testing ProcDiePending || QueryCancelPending.

I think we'd be better off disabling at least some signals during
dsm_impl_posix_resize(). I'm afraid we'll otherwise just find another
variation of these problems. I haven't checked the source of ftruncate, but
what Thomas dug up for fallocate makes it pretty clear that our current
approach of just retrying again and again isn't good enough. It's a bit more
obvious that it's a problem for fallocate, but I don't think it's worth having
different solutions for the two.

Greetings,

Andres Freund



Re: EINTR in ftruncate()

От
Alvaro Herrera
Дата:
On 2022-Jul-05, Andres Freund wrote:

> I think we'd be better off disabling at least some signals during
> dsm_impl_posix_resize(). I'm afraid we'll otherwise just find another
> variation of these problems. I haven't checked the source of ftruncate, but
> what Thomas dug up for fallocate makes it pretty clear that our current
> approach of just retrying again and again isn't good enough. It's a bit more
> obvious that it's a problem for fallocate, but I don't think it's worth having
> different solutions for the two.

So what if we move the retry loop one level up?  As in the attached.
Here, if we get EINTR then we retry both syscalls.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"No hay hombre que no aspire a la plenitud, es decir,
la suma de experiencias de que un hombre es capaz"

Вложения

Re: EINTR in ftruncate()

От
Andres Freund
Дата:
Hi,

On 2022-07-06 21:29:41 +0200, Alvaro Herrera wrote:
> On 2022-Jul-05, Andres Freund wrote:
> 
> > I think we'd be better off disabling at least some signals during
> > dsm_impl_posix_resize(). I'm afraid we'll otherwise just find another
> > variation of these problems. I haven't checked the source of ftruncate, but
> > what Thomas dug up for fallocate makes it pretty clear that our current
> > approach of just retrying again and again isn't good enough. It's a bit more
> > obvious that it's a problem for fallocate, but I don't think it's worth having
> > different solutions for the two.
> 
> So what if we move the retry loop one level up?  As in the attached.
> Here, if we get EINTR then we retry both syscalls.

Doesn't really seem to address the problem to me. posix_fallocate()
takes some time (~1s for 3GB roughly), so if we signal at a higher rate,
we'll just get stuck.

I hacked a bit on a test program from Thomas, and it's pretty clearly
that with a 5ms timer interval you'll pretty much not make
progress. It's much easier to get fallocate() to get interrupted than
ftruncate(), but the latter gets interrupted e.g. when you do a strace
in the "wrong" moment (afaics SIGSTOP/SIGCONT trigger EINTR in
situations that are retried otherwise).

So I think we need: 1) block most signals, 2) a retry loop *without*
interrupt checks.

Greetings,

Andres Freund



Re: EINTR in ftruncate()

От
Thomas Munro
Дата:
On Thu, Jul 7, 2022 at 8:39 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-07-06 21:29:41 +0200, Alvaro Herrera wrote:
> > On 2022-Jul-05, Andres Freund wrote:
> >
> > > I think we'd be better off disabling at least some signals during
> > > dsm_impl_posix_resize(). I'm afraid we'll otherwise just find another
> > > variation of these problems. I haven't checked the source of ftruncate, but
> > > what Thomas dug up for fallocate makes it pretty clear that our current
> > > approach of just retrying again and again isn't good enough. It's a bit more
> > > obvious that it's a problem for fallocate, but I don't think it's worth having
> > > different solutions for the two.
> >
> > So what if we move the retry loop one level up?  As in the attached.
> > Here, if we get EINTR then we retry both syscalls.
>
> Doesn't really seem to address the problem to me. posix_fallocate()
> takes some time (~1s for 3GB roughly), so if we signal at a higher rate,
> we'll just get stuck.
>
> I hacked a bit on a test program from Thomas, and it's pretty clearly
> that with a 5ms timer interval you'll pretty much not make
> progress. It's much easier to get fallocate() to get interrupted than
> ftruncate(), but the latter gets interrupted e.g. when you do a strace
> in the "wrong" moment (afaics SIGSTOP/SIGCONT trigger EINTR in
> situations that are retried otherwise).
>
> So I think we need: 1) block most signals, 2) a retry loop *without*
> interrupt checks.

Yeah.  I was also wondering about wrapping the whole function in
PG_SETMASK(&BlockSig), PG_SETMASK(&UnBlockSig), but also leaving the
while (rc == EINTR) loop there (without the check for *Pending
variables), only because otherwise when you attach a debugger and
continue you'll get a spurious EINTR and it'll interfere with program
execution.  All blockable signals would be blocked *except* SIGQUIT,
which means that fast shutdown/crash will still work.  It seems nice
to leave that way to interrupt it without resorting to SIGKILL.



Re: EINTR in ftruncate()

От
Andres Freund
Дата:
Hi,

On 2022-07-07 08:56:33 +1200, Thomas Munro wrote:
> On Thu, Jul 7, 2022 at 8:39 AM Andres Freund <andres@anarazel.de> wrote:
> > So I think we need: 1) block most signals, 2) a retry loop *without*
> > interrupt checks.
> 
> Yeah.  I was also wondering about wrapping the whole function in
> PG_SETMASK(&BlockSig), PG_SETMASK(&UnBlockSig), but also leaving the
> while (rc == EINTR) loop there (without the check for *Pending
> variables), only because otherwise when you attach a debugger and
> continue you'll get a spurious EINTR and it'll interfere with program
> execution.  All blockable signals would be blocked *except* SIGQUIT,
> which means that fast shutdown/crash will still work.  It seems nice
> to leave that way to interrupt it without resorting to SIGKILL.

Fast shutdown shouldn't use SIGQUIT - did you mean immediate? I think
it's fine to allow immediate shutdowns, but I don't think we should
allow fast shutdowns to interrupt it.

Greetings,

Andres Freund



Re: EINTR in ftruncate()

От
Thomas Munro
Дата:
On Thu, Jul 7, 2022 at 9:03 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-07-07 08:56:33 +1200, Thomas Munro wrote:
> > On Thu, Jul 7, 2022 at 8:39 AM Andres Freund <andres@anarazel.de> wrote:
> > > So I think we need: 1) block most signals, 2) a retry loop *without*
> > > interrupt checks.
> >
> > Yeah.  I was also wondering about wrapping the whole function in
> > PG_SETMASK(&BlockSig), PG_SETMASK(&UnBlockSig), but also leaving the
> > while (rc == EINTR) loop there (without the check for *Pending
> > variables), only because otherwise when you attach a debugger and
> > continue you'll get a spurious EINTR and it'll interfere with program
> > execution.  All blockable signals would be blocked *except* SIGQUIT,
> > which means that fast shutdown/crash will still work.  It seems nice
> > to leave that way to interrupt it without resorting to SIGKILL.
>
> Fast shutdown shouldn't use SIGQUIT - did you mean immediate? I think
> it's fine to allow immediate shutdowns, but I don't think we should
> allow fast shutdowns to interrupt it.

Err, yeah, that one.



Re: EINTR in ftruncate()

От
Thomas Munro
Дата:
On Thu, Jul 7, 2022 at 9:05 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Jul 7, 2022 at 9:03 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2022-07-07 08:56:33 +1200, Thomas Munro wrote:
> > > On Thu, Jul 7, 2022 at 8:39 AM Andres Freund <andres@anarazel.de> wrote:
> > > > So I think we need: 1) block most signals, 2) a retry loop *without*
> > > > interrupt checks.

Here's a draft patch that tries to explain all this in the commit
message and comments.

Even if we go with this approach now, I think it's plausible that we
might want to reconsider this yet again one day, perhaps allocating
memory with some future asynchronous infrastructure while still
processing interrupts.  It's not very nice to hold up recovery or
ProcSignalBarrier for long operations.

I'm a little unclear about ftruncate() here.  I don't expect it to
report EINTR in other places where we use it (ie to make a local file
on a non-"slow device" smaller), because I expect that to be like
read(), write() etc which we don't wrap in EINTR loops.  Here you've
observed EINTR when messing around with a debugger*.  It seems
inconsistent to put posix_fallocate() in an EINTR retry loop for the
benefit of debuggers, but not ftruncate().  But perhaps that's good
enough, on the theory that posix_fallocate(1GB) is a very large target
and you have a decent chance of hitting it.

Another observation while staring at that ftruncate():  It's entirely
redundant on Linux, because we only ever call dsm_impl_posix_resize()
to make the segment bigger.  Before commit 3c60d0fa (v12) it was
possible to resize a segment to be smaller with dsm_resize(), so you
needed one or t'other depending on the requested size and we just
called both, but dsm_resize() wasn't ever used AFAIK and didn't even
work on all DSM implementations, among other problems, so we ripped it
out.  So... on master at least, we could also change the #ifdef to be
either-or.  While refactoring like that, I think we might as well also
rearrange the code so that the wait event is reported also for other
OSes, just in case it takes a long time.  See 0002 patch.

*It's funny that ftruncate() apparently doesn't automatically restart
for ptrace SIGCONT on Linux according to your report, while poll()
does according to my experiments, even though the latter is one of the
never-restart functions (it doesn't on other OSes I hack on, and you
feel the difference when debugging missing wakeup type bugs...).
Random implementation details...

Вложения

Re: EINTR in ftruncate()

От
Alvaro Herrera
Дата:
On 2022-Jul-07, Thomas Munro wrote:

> On Thu, Jul 7, 2022 at 9:05 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Thu, Jul 7, 2022 at 9:03 AM Andres Freund <andres@anarazel.de> wrote:
> > > On 2022-07-07 08:56:33 +1200, Thomas Munro wrote:
> > > > On Thu, Jul 7, 2022 at 8:39 AM Andres Freund <andres@anarazel.de> wrote:
> > > > > So I think we need: 1) block most signals, 2) a retry loop *without*
> > > > > interrupt checks.
> 
> Here's a draft patch that tries to explain all this in the commit
> message and comments.

I gave 0001 a try.  I agree with the approach, and it seems to work as
intended; or at least I couldn't break it under GDB.

I didn't look at 0002, but I wish that the pgstat_report_wait calls were
moved to cover both syscalls in a separate commit, just so we still have
them even if we decide not to do 0002.

Do you intend to get it pushed before the next minors?  I have an
interest in that happening.  Thanks for working on this.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Here's a general engineering tip: if the non-fun part is too complex for you
to figure out, that might indicate the fun part is too ambitious." (John Naylor)
https://postgr.es/m/CAFBsxsG4OWHBbSDM%3DsSeXrQGOtkPiOEOuME4yD7Ce41NtaAD9g%40mail.gmail.com



Re: EINTR in ftruncate()

От
Andres Freund
Дата:
Hi,

On 2022-07-07 17:58:10 +1200, Thomas Munro wrote:
> Even if we go with this approach now, I think it's plausible that we
> might want to reconsider this yet again one day, perhaps allocating
> memory with some future asynchronous infrastructure while still
> processing interrupts.  It's not very nice to hold up recovery or
> ProcSignalBarrier for long operations.

I think the next improvement would be to do the fallocate in smaller chunks,
and accept interrupts inbetween.


> I'm a little unclear about ftruncate() here.  I don't expect it to
> report EINTR in other places where we use it (ie to make a local file
> on a non-"slow device" smaller), because I expect that to be like
> read(), write() etc which we don't wrap in EINTR loops.  Here you've
> observed EINTR when messing around with a debugger*.  It seems
> inconsistent to put posix_fallocate() in an EINTR retry loop for the
> benefit of debuggers, but not ftruncate().  But perhaps that's good
> enough, on the theory that posix_fallocate(1GB) is a very large target
> and you have a decent chance of hitting it.

> *It's funny that ftruncate() apparently doesn't automatically restart
> for ptrace SIGCONT on Linux according to your report, while poll()
> does according to my experiments, even though the latter is one of the
> never-restart functions (it doesn't on other OSes I hack on, and you
> feel the difference when debugging missing wakeup type bugs...).
> Random implementation details...

My test was basically while (true); {if (!ftruncate()) bleat(); if
(!fallocate()) bleat();} with a SIGALRM triggering regularly in the
background. The ftruncate failed, rarely, when attaching / detaching strace
-p. Rarely enough that I had already written you an IM saying that I couldn't
make it fail...  So it's hard to be confident this can't otherwise be
hit. With that caveat: I didn't hit it with a "real" file on a "real"
filesystem in a few minutes of trying. But unsurprisingly it triggers when
putting the file on a tmpfs.


> @@ -362,6 +355,14 @@ dsm_impl_posix_resize(int fd, off_t size)
>  {
>      int            rc;
>  
> +    /*
> +     * Block all blockable signals, except SIGQUIT.  posix_fallocate() can run
> +     * for quite a long time, and is an all-or-nothing operation.  If we
> +     * allowed SIGUSR1 to interrupt us repeatedly (for example, due to recovery
> +     * conflicts), the retry loop might never succeed.
> +     */
> +    PG_SETMASK(&BlockSig);
> +
>      /* Truncate (or extend) the file to the requested size. */
>      rc = ftruncate(fd, size);
>

Hm - given that we've observed ftruncate failing with strace, and that
stracing to find problems isn't insane, shouldn't we retry the ftruncate too?
It's kind of obsoleted by your next patch, but not really, because it's not
unconceivable that other OSs behave similarly? And IIUC you're planning to not
backpatch 0002?


> +    pgstat_report_wait_start(WAIT_EVENT_DSM_FILL_ZERO_WRITE);

(not new in this patch, just moved around) - FILL_ZERO_WRITE is imo an odd
description of what's happening... In my understanding this isn't doing any
writing, just allocating. But whatever...


Greetings,

Andres Freund



Re: EINTR in ftruncate()

От
Thomas Munro
Дата:
On Tue, Jul 12, 2022 at 5:45 AM Andres Freund <andres@anarazel.de> wrote:
> Hm - given that we've observed ftruncate failing with strace, and that
> stracing to find problems isn't insane, shouldn't we retry the ftruncate too?
> It's kind of obsoleted by your next patch, but not really, because it's not
> unconceivable that other OSs behave similarly? And IIUC you're planning to not
> backpatch 0002?

Yeah.  Done, and pushed.  0002 not back-patched.

> > +     pgstat_report_wait_start(WAIT_EVENT_DSM_FILL_ZERO_WRITE);
>
> (not new in this patch, just moved around) - FILL_ZERO_WRITE is imo an odd
> description of what's happening... In my understanding this isn't doing any
> writing, just allocating. But whatever...

True.  Fixed in master.



Re: EINTR in ftruncate()

От
Thomas Munro
Дата:
On Fri, Jul 15, 2022 at 12:15 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Yeah.  Done, and pushed.  0002 not back-patched.

Hmm, there were a couple of hard to understand build farm failures.
My first thought is that the signal mask stuff should only be done if
IsUnderPostmaster, otherwise it clobbers the postmaster's signal mask
when reached from dsm_postmaster_startup().  Looking into that.



Re: EINTR in ftruncate()

От
Thomas Munro
Дата:
On Fri, Jul 15, 2022 at 1:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Jul 15, 2022 at 12:15 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Yeah.  Done, and pushed.  0002 not back-patched.
>
> Hmm, there were a couple of hard to understand build farm failures.
> My first thought is that the signal mask stuff should only be done if
> IsUnderPostmaster, otherwise it clobbers the postmaster's signal mask
> when reached from dsm_postmaster_startup().  Looking into that.

I pushed that change, and I hope that clears up the problems seen on
eg curculio.  It does raise the more general question of why it's safe
to assume the signal mask is UnBlockSig on entry in regular backends.
I expect it to be in released branches, but things get more
complicated as we use DSM in more ways and it's not ideal to bet on
that staying true.  I checked that this throw-away assertion doesn't
fail currently:

        if (IsUnderPostmaster)
+       {
+               sigset_t old;
+               sigprocmask(SIG_SETMASK, NULL, &old);
+               Assert(memcmp(&old, &UnBlockSig, sizeof(UnBlockSig)) == 0);
                PG_SETMASK(&BlockSig);
+       }

... but now I'm wondering if we should be more defensive and possibly
even save/restore the mask.  Originally I discounted that because I
thought I had to go through PG_SETMASK for portability reasons, but on
closer inspection, I don't see any reason not to use sigprocmask
directly in Unix-only code.



Re: EINTR in ftruncate()

От
Alvaro Herrera
Дата:
On 2022-Jul-15, Thomas Munro wrote:

> I checked that this throw-away assertion doesn't fail currently:
> 
>         if (IsUnderPostmaster)
> +       {
> +               sigset_t old;
> +               sigprocmask(SIG_SETMASK, NULL, &old);
> +               Assert(memcmp(&old, &UnBlockSig, sizeof(UnBlockSig)) == 0);
>                 PG_SETMASK(&BlockSig);
> +       }
> 
> ... but now I'm wondering if we should be more defensive and possibly
> even save/restore the mask.

Yeah, that sounds better to me.

> Originally I discounted that because I thought I had to go through
> PG_SETMASK for portability reasons, but on closer inspection, I don't
> see any reason not to use sigprocmask directly in Unix-only code.

ISTM it would be cleaner to patch PG_SETMASK to have a second argument
and to return the original mask if that's not NULL.  This is more
invasive, but there aren't that many callsites of that macro.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



Re: EINTR in ftruncate()

От
Tom Lane
Дата:
Thomas Munro <thomas.munro@gmail.com> writes:
> ... but now I'm wondering if we should be more defensive and possibly
> even save/restore the mask.

+1, sounds like a more future-proof solution.

> Originally I discounted that because I
> thought I had to go through PG_SETMASK for portability reasons, but on
> closer inspection, I don't see any reason not to use sigprocmask
> directly in Unix-only code.

Seems like the thing to do is to add a suitable operation to the
pqsignal.h API.  We could leave it unimplemented for now on Windows,
and then worry what to do if we ever need it in that context.

            regards, tom lane



Re: EINTR in ftruncate()

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> ISTM it would be cleaner to patch PG_SETMASK to have a second argument
> and to return the original mask if that's not NULL.  This is more
> invasive, but there aren't that many callsites of that macro.

[ shoulda read your message before replying ]

Given that this needs back-patched, I think changing PG_SETMASK
is a bad idea: there might be outside callers.  However, we could
add another macro with the additional argument.  PG_GET_AND_SET_MASK?

            regards, tom lane



Re: EINTR in ftruncate()

От
Thomas Munro
Дата:
On Fri, Jul 15, 2022 at 3:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > ISTM it would be cleaner to patch PG_SETMASK to have a second argument
> > and to return the original mask if that's not NULL.  This is more
> > invasive, but there aren't that many callsites of that macro.
>
> [ shoulda read your message before replying ]
>
> Given that this needs back-patched, I think changing PG_SETMASK
> is a bad idea: there might be outside callers.  However, we could
> add another macro with the additional argument.  PG_GET_AND_SET_MASK?

It's funny though, the reason we had PG_SETMASK in the first place is
not for Windows.  Ancient commit 47937403676 added that for long gone
pre-POSIX systems like NeXTSTEP which only had single-argument
sigsetmask(), not sigprocmask().  In general on Windows we're
emulating POSIX signal interfaces with normal names like sigemptyset()
etc, it just so happens that we chose to emulate that pre-standard
sigsetmask() interface (as you complained about in the commit message
for a65e0864).

So why would I add another wrapper like PG_SETMASK and leave it
unimplemented for now on Windows, when I could just use sigprocmask()
directly and leave it unimplemented for now on Windows?

The only reason I can think of for a wrapper is to provide a place to
check the return code and ereport (panic?).  That seems to be of
limited value (how can it fail ... bad "how" value, or a sandbox
denying some system calls, ...?).  I did make sure to preserve the
errno though; even though we're assuming these calls can't fail by
long standing tradition, I didn't feel like additionally assuming that
successful calls don't clobber errno.

I guess, coded like this, it should also be safe to do it in the
postmaster, but I think maybe we should leave it conditional, rather
than relying on BlockSig being initialised and sane during early
postmaster initialisation.

Вложения

Re: EINTR in ftruncate()

От
Tom Lane
Дата:
Thomas Munro <thomas.munro@gmail.com> writes:
> So why would I add another wrapper like PG_SETMASK and leave it
> unimplemented for now on Windows, when I could just use sigprocmask()
> directly and leave it unimplemented for now on Windows?

Fair enough, I guess.  No objection to this patch.

(Someday we oughta go ahead and make our Windows signal API look more
like POSIX, as I suggested back in 2015.  I'm still not taking
point on that, though.)

            regards, tom lane



Re: EINTR in ftruncate()

От
Thomas Munro
Дата:
On Fri, Jul 15, 2022 at 9:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (Someday we oughta go ahead and make our Windows signal API look more
> like POSIX, as I suggested back in 2015.  I'm still not taking
> point on that, though.)

For the sigprocmask() part, here's a patch that passes CI.  Only the
SIG_SETMASK case is actually exercised by our current code, though.

One weird thing about our PG_SETMASK() macro is that you couldn't have
used its return value portably: on Windows we were returning the old
mask (like sigsetmask(), which has no way to report errors), and on
Unix we were returning 0/-1 (from setprocmask(), ie the error we never
checked).

Вложения

Re: EINTR in ftruncate()

От
Tom Lane
Дата:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Fri, Jul 15, 2022 at 9:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> (Someday we oughta go ahead and make our Windows signal API look more
>> like POSIX, as I suggested back in 2015.  I'm still not taking
>> point on that, though.)

> For the sigprocmask() part, here's a patch that passes CI.  Only the
> SIG_SETMASK case is actually exercised by our current code, though.

Passes an eyeball check, but I can't actually test it.

            regards, tom lane



Re: EINTR in ftruncate()

От
Thomas Munro
Дата:
On Sat, Jul 16, 2022 at 1:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Fri, Jul 15, 2022 at 9:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> (Someday we oughta go ahead and make our Windows signal API look more
> >> like POSIX, as I suggested back in 2015.  I'm still not taking
> >> point on that, though.)
>
> > For the sigprocmask() part, here's a patch that passes CI.  Only the
> > SIG_SETMASK case is actually exercised by our current code, though.
>
> Passes an eyeball check, but I can't actually test it.

Thanks.  Pushed.

I'm not brave enough to try to write a replacement sigaction() yet,
but it does appear that we could rip more ugliness and inconsistencies
that way, eg sa_mask.



Re: EINTR in ftruncate()

От
Thomas Munro
Дата:
On Sat, Jul 16, 2022 at 5:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Jul 16, 2022 at 1:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Thomas Munro <thomas.munro@gmail.com> writes:
> > > On Fri, Jul 15, 2022 at 9:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> (Someday we oughta go ahead and make our Windows signal API look more
> > >> like POSIX, as I suggested back in 2015.  I'm still not taking
> > >> point on that, though.)
> >
> > > For the sigprocmask() part, here's a patch that passes CI.  Only the
> > > SIG_SETMASK case is actually exercised by our current code, though.
> >
> > Passes an eyeball check, but I can't actually test it.
>
> Thanks.  Pushed.
>
> I'm not brave enough to try to write a replacement sigaction() yet,
> but it does appear that we could rip more ugliness and inconsistencies
> that way, eg sa_mask.

Here's a draft patch that adds a minimal sigaction() implementation
for Windows.  It doesn't implement stuff we're not using, for sample
sa_sigaction functions, but it has the sa_mask feature so we can
harmonize the stuff that I believe you were talking about.

Вложения

Re: EINTR in ftruncate()

От
Justin Pryzby
Дата:
On Wed, Aug 17, 2022 at 07:51:34AM +1200, Thomas Munro wrote:
> Here's a draft patch that adds a minimal sigaction() implementation
> for Windows.  It doesn't implement stuff we're not using, for sample
> sa_sigaction functions, but it has the sa_mask feature so we can
> harmonize the stuff that I believe you were talking about.

Did you see that this paniced ?

https://cirrus-ci.com/task/4975957546106880

https://api.cirrus-ci.com/v1/artifact/task/4975957546106880/testrun/build/testrun/recovery/027_stream_regress/log/027_stream_regress_standby_1.log

2022-09-30 09:13:03.496 GMT [7312][startup] PANIC:  hash_xlog_split_allocate_page: failed to acquire cleanup lock
2022-09-30 09:13:03.496 GMT [7312][startup] CONTEXT:  WAL redo at 0/7AF6FA8 for Hash/SPLIT_ALLOCATE_PAGE: new_bucket
63,meta_page_masks_updated F, issplitpoint_changed F; blkref #0: rel 1663/16384/23784, blk 45; blkref #1: rel
1663/16384/23784,blk 78; blkref #2: rel 1663/16384/23784, blk 0
 



Re: EINTR in ftruncate()

От
Andres Freund
Дата:
Hi,

On 2022-09-30 13:53:45 -0500, Justin Pryzby wrote:
> On Wed, Aug 17, 2022 at 07:51:34AM +1200, Thomas Munro wrote:
> > Here's a draft patch that adds a minimal sigaction() implementation
> > for Windows.  It doesn't implement stuff we're not using, for sample
> > sa_sigaction functions, but it has the sa_mask feature so we can
> > harmonize the stuff that I believe you were talking about.
> 
> Did you see that this paniced ?
> 
> https://cirrus-ci.com/task/4975957546106880
>
https://api.cirrus-ci.com/v1/artifact/task/4975957546106880/testrun/build/testrun/recovery/027_stream_regress/log/027_stream_regress_standby_1.log
> 
> 2022-09-30 09:13:03.496 GMT [7312][startup] PANIC:  hash_xlog_split_allocate_page: failed to acquire cleanup lock
> 2022-09-30 09:13:03.496 GMT [7312][startup] CONTEXT:  WAL redo at 0/7AF6FA8 for Hash/SPLIT_ALLOCATE_PAGE: new_bucket
63,meta_page_masks_updated F, issplitpoint_changed F; blkref #0: rel 1663/16384/23784, blk 45; blkref #1: rel
1663/16384/23784,blk 78; blkref #2: rel 1663/16384/23784, blk 0
 

This looks like broken code in hash, independent of any recent changes:
https://www.postgresql.org/message-id/20220817193032.z35vdjhpzkgldrd3%40awork3.anarazel.de

Greetings,

Andres Freund



Re: EINTR in ftruncate()

От
Thomas Munro
Дата:
On Wed, Aug 17, 2022 at 7:51 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Jul 16, 2022 at 5:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Sat, Jul 16, 2022 at 1:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > Thomas Munro <thomas.munro@gmail.com> writes:
> > > > On Fri, Jul 15, 2022 at 9:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > >> (Someday we oughta go ahead and make our Windows signal API look more
> > > >> like POSIX, as I suggested back in 2015.  I'm still not taking
> > > >> point on that, though.)
> > >
> > > > For the sigprocmask() part, here's a patch that passes CI.  Only the
> > > > SIG_SETMASK case is actually exercised by our current code, though.
> > >
> > > Passes an eyeball check, but I can't actually test it.
> >
> > Thanks.  Pushed.
> >
> > I'm not brave enough to try to write a replacement sigaction() yet,
> > but it does appear that we could rip more ugliness and inconsistencies
> > that way, eg sa_mask.
>
> Here's a draft patch that adds a minimal sigaction() implementation
> for Windows.  It doesn't implement stuff we're not using, for sample
> sa_sigaction functions, but it has the sa_mask feature so we can
> harmonize the stuff that I believe you were talking about.

Pushed.

As discussed before, a much better idea would probably be to use
latch-based wakeup instead of putting postmaster logic in signal
handlers, but in the meantime this gets rid of the extra
Windows-specific noise.