Обсуждение: postmaster disappears

Поиск
Список
Период
Сортировка

postmaster disappears

От
Tatsuo Ishii
Дата:
Hi,

I got a few reports from users that postmaster disappears for unknown
reason. Inspecting the postmaster log, I found that postmaster exited at:

if (select(nSockets, &rmask, &wmask, (fd_set *) NULL,       (struct timeval *) NULL) < 0)
{if (errno == EINTR)    continue;fprintf(stderr, "%s: ServerLoop: select failed: %s\n",        progname,
strerror(errno));returnSTATUS_ERROR; <-- here
 
}

In this case errno=ECHILD has been returned that makes postmaster
exiting. This could happen if SIGCHLD raised between select() call and
the next if (errno=...) statement. One of the solution would be
ignoring ECHILD as well as EINTR. Included are patches for this. If
there's no objection, I will commit them to both stable and current
tree.

*** postgresql-6.5.1/src/backend/postmaster/postmaster.c~    Thu Jul  8 02:17:48 1999
--- postgresql-6.5.1/src/backend/postmaster/postmaster.c    Thu Sep  9 10:14:30 1999
***************
*** 709,719 ****         if (select(nSockets, &rmask, &wmask, (fd_set *) NULL,                    (struct timeval *)
NULL)< 0)         {
 
!             if (errno == EINTR)                 continue;
!             fprintf(stderr, "%s: ServerLoop: select failed: %s\n",                     progname, strerror(errno));
!             return STATUS_ERROR;         }          /*
--- 709,729 ----         if (select(nSockets, &rmask, &wmask, (fd_set *) NULL,                    (struct timeval *)
NULL)< 0)         {
 
!             switch(errno) {
!             case EINTR:                 continue;
!                 break;
!             case ECHILD:
!                 fprintf(stderr, "%s: ServerLoop: ignoring ECHILD\n",
!                     progname);
!                 continue;
!                 break;
!             default:
!                 fprintf(stderr, "%s: ServerLoop: select failed: %s\n",                     progname,
strerror(errno));
!                 return STATUS_ERROR;
!                 break;
!             }         }          /*


Re: [HACKERS] postmaster disappears

От
Tom Lane
Дата:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> In this case errno=ECHILD has been returned that makes postmaster
> exiting. This could happen if SIGCHLD raised between select() call and
> the next if (errno=...) statement. One of the solution would be
> ignoring ECHILD as well as EINTR. Included are patches for this.

Hmm.  What you are saying, I guess, is that SIGCHLD is raised,
reaper() executes, and then when control continues in the main loop
the errno left over from reaper()'s last kernel call is what's seen,
instead of the one returned by signal().

Seems to me that the correct fix is to have reaper() save and restore
the outer value of errno, not to hack the main line to ignore the
most probable state left over from reaper().  Otherwise you still choke
if some other value gets returned from whatever call reaper() does last.
Moreover, you're not actually checking what the select() did unless
you do it that way.

Curious that this sort of problem is not seen more often --- I wonder
if most Unixes arrange to save/restore errno around a signal handler
for you?
        regards, tom lane


Re: [HACKERS] postmaster disappears

От
Tatsuo Ishii
Дата:
>Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> In this case errno=ECHILD has been returned that makes postmaster
>> exiting. This could happen if SIGCHLD raised between select() call and
>> the next if (errno=...) statement. One of the solution would be
>> ignoring ECHILD as well as EINTR. Included are patches for this.
>
>Hmm.  What you are saying, I guess, is that SIGCHLD is raised,
>reaper() executes, and then when control continues in the main loop
>the errno left over from reaper()'s last kernel call is what's seen,
>instead of the one returned by signal().

Right.

>Seems to me that the correct fix is to have reaper() save and restore
>the outer value of errno, not to hack the main line to ignore the
>most probable state left over from reaper().  Otherwise you still choke
>if some other value gets returned from whatever call reaper() does
>last.

Not sure. reaper() may be called while reaper() is executing if a new
SIGCHLD is raised. How do you handle this case?

>Moreover, you're not actually checking what the select() did unless
>you do it that way.

Sorry, I don't understand this. Can you explain, please?

>Curious that this sort of problem is not seen more often --- I wonder
>if most Unixes arrange to save/restore errno around a signal handler
>for you?

Maybe because the situation I have pointed out is relatively rare.
---
Tatsuo Ishii



Re: [HACKERS] postmaster disappears

От
Tom Lane
Дата:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> Seems to me that the correct fix is to have reaper() save and restore
>> the outer value of errno, not to hack the main line to ignore the
>> most probable state left over from reaper().  Otherwise you still choke
>> if some other value gets returned from whatever call reaper() does
>> last.

> Not sure. reaper() may be called while reaper() is executing if a new
> SIGCHLD is raised. How do you handle this case?

No, because the signal is disabled when the trap is taken, and then not
re-enabled until reaper() does pqsignal() just before exiting.  We don't
really care if a new signal recursively interrupts reaper() at that
point, but bad things would happen if there were a recursive interrupt
earlier while we were diddling the list of children.  (Cf. comments in
the existing code where SIGCHLD is disabled while we add a child...)

In any case, it's not a problem: if each level of reaper does
reaper(){    int save_errno = errno;
    ...
    errno = save_errno;  /* restore errno of interrupted code */}

then a recursive interrupt might be saving/restoring the errno value
that existed in the next outer interrupt, rather than the value that
is truly at the outer level, but that's what stacks are for ;-)

>> Moreover, you're not actually checking what the select() did unless
>> you do it that way.

> Sorry, I don't understand this. Can you explain, please?

If you don't have the signal routine save/restore errno, then (when this
problem occurs) you are not seeing the errno returned by the select(),
but one left over from reaper()'s activity.  If the select() failed, you
won't know it.

>> Curious that this sort of problem is not seen more often --- I wonder
>> if most Unixes arrange to save/restore errno around a signal handler
>> for you?

> Maybe because the situation I have pointed out is relatively rare.

Well, the window for trouble is awfully tiny in this particular code of
ours, but it might be larger in other programs.  Yet I don't think I've
ever heard a programming recommendation to save/restore errno in signal
handlers...
        regards, tom lane


Re: [HACKERS] postmaster disappears

От
Tatsuo Ishii
Дата:
>> Not sure. reaper() may be called while reaper() is executing if a new
>> SIGCHLD is raised. How do you handle this case?
>
>No, because the signal is disabled when the trap is taken, and then not
>re-enabled until reaper() does pqsignal() just before exiting.  We don't

You are correct. I had wrong impression about signal handling.

>>> Moreover, you're not actually checking what the select() did unless
>>> you do it that way.
>
>> Sorry, I don't understand this. Can you explain, please?
>
>If you don't have the signal routine save/restore errno, then (when this
>problem occurs) you are not seeing the errno returned by the select(),
>but one left over from reaper()'s activity.  If the select() failed, you
>won't know it.

Oh, I see your point.

>>> Curious that this sort of problem is not seen more often --- I wonder
>>> if most Unixes arrange to save/restore errno around a signal handler
>>> for you?
>
>> Maybe because the situation I have pointed out is relatively rare.
>
>Well, the window for trouble is awfully tiny in this particular code of
>ours, but it might be larger in other programs.

Though it seems rare, we certainly have had this kind of reports from
users for a while. Since disappearing postmaster is a really bad
thing, I love to see solutions for this.

>Yet I don't think I've
>ever heard a programming recommendation to save/restore errno in signal
>handlers...

Agreed. I don't like this way.

I asked a Unix guru, and got a suggestion that we do not need to call
wait() (and CleanupProc()) inside the signal handler. Instead we could
have a null signal hander (it just calls pqsignal()) for SIGCHLD.  If
select() returns EINTR then we just call wait() and
CleanupProc(). Moreover this would eliminate sigprocmask() or
sigblock() calls currently done to avoid race conditions before going
into the critical region. Of course we have to call wait() and
CleanupProc() before select() to make sure that we have no waiting
children.

Another way would be blocking SIGCHILD before calling select(). In
this case appropriate time out setting for select() is necessary,
though.
--
Tatsuo Ishii


Re: [HACKERS] postmaster disappears

От
Tom Lane
Дата:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> Yet I don't think I've ever heard a programming recommendation to
>> save/restore errno in signal handlers...

> Agreed. I don't like this way.

Hmm, I don't like your patch and you don't like mine.  Time to redesign
rather than patch ;-)

> I asked a Unix guru, and got a suggestion that we do not need to call
> wait() (and CleanupProc()) inside the signal handler. Instead we could
> have a null signal hander (it just calls pqsignal()) for SIGCHLD.  If
> select() returns EINTR then we just call wait() and
> CleanupProc(). Moreover this would eliminate sigprocmask() or
> sigblock() calls currently done to avoid race conditions before going
> into the critical region. Of course we have to call wait() and
> CleanupProc() before select() to make sure that we have no waiting
> children.

This looks like it could be a really clean solution.  In fact, there'd
be no need to check for EINTR from select(); we could just fall through,
knowing that the reaping will be done as soon as we loop around to the
top of the loop.  The code becomes just
for (;;) {    reap;    select;    handle any input found by select;}

Do we even need a signal handler at all for ECHILD?  I suppose the
select might not get interrupted (at least on some platforms) if there
isn't one.

Actually I guess there still is a race condition: there is a window
between the last wait() of the reap loop and the select() wherein an
ECHILD won't be serviced right away, because we hit the select() before
noticing it.  We could maybe use a timeout on the select to fix that.
Don't really like it though, since the timeout couldn't be very long,
but we don't want the postmaster wasting cycles when there's nothing
to do.  Is there another way around this?
        regards, tom lane


Re: [HACKERS] postmaster disappears

От
Bruce Momjian
Дата:
> Do we even need a signal handler at all for ECHILD?  I suppose the
> select might not get interrupted (at least on some platforms) if there
> isn't one.
> 
> Actually I guess there still is a race condition: there is a window
> between the last wait() of the reap loop and the select() wherein an
> ECHILD won't be serviced right away, because we hit the select() before
> noticing it.  We could maybe use a timeout on the select to fix that.
> Don't really like it though, since the timeout couldn't be very long,
> but we don't want the postmaster wasting cycles when there's nothing
> to do.  Is there another way around this?


Here is code I use for reaping dead child processes.  Under SysV, if you
say you want to ignore child processes, they just disappear, but on BSD,
the children stay as zombies.  This fixes that.

Seems you need to define a singnal handler, and just put select() in a
loop:
while (1)    if (select(...) != -1 || errno != EINTR)        break;

I see you are are loosing your error inside the singnal handler.  Seems
you may have to save/restore errno.

---------------------------------------------------------------------------

/**    From: rstevens@noao.edu (W. Richard Stevens) *    Newsgroups: comp.unix.bsd.misc,comp.unix.bsd.bsdi.misc*
Subject:Re: BSD 4.4:  Preventing zombies SIGCHLD*    Date: 19 Dec 1995 13:24:39 GMT*/
 

void
reapchild(int signo)
{   pid_t   pid;   int     stat;   while ( (pid = waitpid(-1, &stat, WNOHANG)) > 0) {       /* handle "pid" and "stat"
*/  }   if (pid < 0)       ;/* error */
 
/* we are done playing the current sound */cur_sound_id = -1;   /* return value is 0 if no more children */   return;
}

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026