Обсуждение: BUG #7643: Issuing a shutdown request while server startup leads to server hang

Поиск
Список
Период
Сортировка

BUG #7643: Issuing a shutdown request while server startup leads to server hang

От
haribabu.kommi@huawei.com
Дата:
The following bug has been logged on the website:

Bug reference:      7643
Logged by:          Haribabu
Email address:      haribabu.kommi@huawei.com
PostgreSQL version: 9.2.1
Operating system:   Suse Linux 10.2
Description:        =


Problem Reproduction:
1. Add recovery.conf to the database directory.
2. Start the server
3. Issue the shutdown request =

and the shutdown request timing should be such that below server logs should
print.

Log:

./postgres -D data -p 3335
LOG:  database system was shut down in recovery at 2012-11-08 19:42:42 IST
LOG:  entering standby mode
LOG:  received fast shutdown request
LOG:  consistent recovery state reached at 0/17D0700
LOG:  record with zero length at 0/17D0700

Problem reproduced in 9.3 head.

I think the following things are leading to the problem:

1. The postmaster is queued up with the following signals in the order of
SIGINT (shutdown) and SIGUSR1 (Here we create bgwriter and checkpoint) and
the current state of postmaster is PM_STARTUP.

we are creating the bgwriter and checkpoint after processing the shutdown
request, because of this reason no one is there is to send the SIGTERM to
bgwriter and checkpoint.

Re: BUG #7643: Issuing a shutdown request while server startup leads to server hang

От
Tom Lane
Дата:
haribabu.kommi@huawei.com writes:
> Problem Reproduction:
> 1. Add recovery.conf to the database directory.
> 2. Start the server
> 3. Issue the shutdown request
> and the shutdown request timing should be such that below server logs should
> print.

> Log:

> ./postgres -D data -p 3335
> LOG:  database system was shut down in recovery at 2012-11-08 19:42:42 IST
> LOG:  entering standby mode
> LOG:  received fast shutdown request
> LOG:  consistent recovery state reached at 0/17D0700
> LOG:  record with zero length at 0/17D0700

> Problem reproduced in 9.3 head.

After further investigation, I can't reproduce this and I don't believe
your patch fixes it.  What happens when I try this is

* postmaster gets SIGINT, sends SIGTERM to startup process

* startup process exits with exit(1)

* postmaster sees that as a startup crash and exits, per the first
test in reaper()

So the log trace I'm getting looks like

LOG:  received fast shutdown request
LOG:  startup process (PID 9772) exited with exit code 1
LOG:  aborting startup due to startup process failure

Now, transitioning to PM_WAIT_BACKENDS state earlier, as your patch
proposes, might make the log look a bit nicer because the logic in
reaper() wouldn't think the exit was a "crash".  But it's not going to
have anything to do with whether the startup process exits on the signal
or not.  What seems to have happened for you is that the startup process
ignored the SIGTERM signal, but it's not at all obvious why.

We're going to need more details about how to reproduce this.
I speculate it might have something to do with the specific
restore_command you're using.

            regards, tom lane

Re: BUG #7643: Issuing a shutdown request while server startup leads to server hang

От
Hari Babu
Дата:
>haribabu.kommi@huawei.com writes:
> Problem Reproduction:
> 1. Add recovery.conf to the database directory.
> 2. Start the server
> 3. Issue the shutdown request
> and the shutdown request timing should be such that below server logs
should
> print.

> Log:

> ./postgres -D data -p 3335
> LOG:  database system was shut down in recovery at 2012-11-08 19:42:42 IST
> LOG:  entering standby mode
> LOG:  received fast shutdown request
> LOG:  consistent recovery state reached at 0/17D0700
> LOG:  record with zero length at 0/17D0700

> Problem reproduced in 9.3 head.

>After further investigation, I can't reproduce this and I don't believe
>your patch fixes it.  What happens when I try this is

>* postmaster gets SIGINT, sends SIGTERM to startup process

>* startup process exits with exit(1)

>* postmaster sees that as a startup crash and exits, per the first
>test in reaper()

>So the log trace I'm getting looks like

>LOG:  received fast shutdown request
>LOG:  startup process (PID 9772) exited with exit code 1
>LOG:  aborting startup due to startup process failure

>Now, transitioning to PM_WAIT_BACKENDS state earlier, as your patch
>proposes, might make the log look a bit nicer because the logic in
>reaper() wouldn't think the exit was a "crash".  But it's not going to
>have anything to do with whether the startup process exits on the signal
>or not.  What seems to have happened for you is that the startup process
>ignored the SIGTERM signal, but it's not at all obvious why.

>We're going to need more details about how to reproduce this.
>I speculate it might have something to do with the specific
>restore_command you're using.

The problem occurs only when active server is restarting by just adding a
recovery.conf file to the data directory.
No need of specifying any restore command. or the standby server restart
also can lead to this problem.

The startup process sends "PMSIGNAL_RECOVERY_STARTED" to postmaster only
incase of "InArchiveRecovery" flag is enabled.
The SIGINT signal should reach postmaster before the
"PMSIGNAL_RECOVERY_STARTED" sent by the startup process.

with the following code change in the startupXlog function, the issue can
reproduce very easily.

        if (InArchiveRecovery && IsUnderPostmaster)
        {
                PublishStartupProcessInformation();
                SetForwardFsyncRequests();
                kill (PostmasterPid, SIGINT);
                SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
                bgwriterLaunched = true;
        }

Please let me know if I miss anything.

Regards,
Hari babu.

Re: BUG #7643: Issuing a shutdown request while server startup leads to server hang

От
Tom Lane
Дата:
Hari Babu <haribabu.kommi@huawei.com> writes:
>> We're going to need more details about how to reproduce this.

> The problem occurs only when active server is restarting by just adding a
> recovery.conf file to the data directory.

Well, you can't just put an empty file there, but I eventually managed
to reproduce this with the suggested hack in xlog.c.

I think the key problem is that postmaster.c's sigusr1_handler() is
willing to start new children even after shutdown has been initiated.
I don't see any good reason for it to do that, so I think the
appropriate patch is as attached.

Changing that still leaves us with the postmaster thinking that the
eventual exit(1) of the startup process is a "crash".  This is mostly
cosmetic since it still shuts down okay, but we can fix it by reversing
the order of the first two checks in reaper() --- that is, if Shutdown
is set, we should prefer that code path even if we're in PM_STARTUP
state.

I concluded that it probably wasn't a good idea to have the additional
state transition in SIGINT handling.  Generally PM_STARTUP means "we're
running the startup process and nothing else", and that's useful state
info that we shouldn't throw away lightly.


            regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b223feefbab0645667449f643c6c8adee3747ef0..6f93d93fa3f7577fb9157f0bea805c427e3605dd 100644
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
*************** pmdie(SIGNAL_ARGS)
*** 2261,2269 ****
              if (pmState == PM_RECOVERY)
              {
                  /*
!                  * Only startup, bgwriter, and checkpointer should be active
!                  * in this state; we just signaled the first two, and we don't
!                  * want to kill checkpointer yet.
                   */
                  pmState = PM_WAIT_BACKENDS;
              }
--- 2261,2269 ----
              if (pmState == PM_RECOVERY)
              {
                  /*
!                  * Only startup, bgwriter, walreceiver, and/or checkpointer
!                  * should be active in this state; we just signaled the first
!                  * three, and we don't want to kill checkpointer yet.
                   */
                  pmState = PM_WAIT_BACKENDS;
              }
*************** reaper(SIGNAL_ARGS)
*** 2355,2360 ****
--- 2355,2372 ----
              StartupPID = 0;

              /*
+              * Startup process exited in response to a shutdown request (or it
+              * completed normally regardless of the shutdown request).
+              */
+             if (Shutdown > NoShutdown &&
+                 (EXIT_STATUS_0(exitstatus) || EXIT_STATUS_1(exitstatus)))
+             {
+                 pmState = PM_WAIT_BACKENDS;
+                 /* PostmasterStateMachine logic does the rest */
+                 continue;
+             }
+
+             /*
               * Unexpected exit of startup process (including FATAL exit)
               * during PM_STARTUP is treated as catastrophic. There are no
               * other processes running yet, so we can just exit.
*************** reaper(SIGNAL_ARGS)
*** 2369,2386 ****
              }

              /*
-              * Startup process exited in response to a shutdown request (or it
-              * completed normally regardless of the shutdown request).
-              */
-             if (Shutdown > NoShutdown &&
-                 (EXIT_STATUS_0(exitstatus) || EXIT_STATUS_1(exitstatus)))
-             {
-                 pmState = PM_WAIT_BACKENDS;
-                 /* PostmasterStateMachine logic does the rest */
-                 continue;
-             }
-
-             /*
               * After PM_STARTUP, any unexpected exit (including FATAL exit) of
               * the startup process is catastrophic, so kill other children,
               * and set RecoveryError so we don't try to reinitialize after
--- 2381,2386 ----
*************** sigusr1_handler(SIGNAL_ARGS)
*** 4283,4289 ****
       * first. We don't want to go back to recovery in that case.
       */
      if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_STARTED) &&
!         pmState == PM_STARTUP)
      {
          /* WAL redo has started. We're out of reinitialization. */
          FatalError = false;
--- 4283,4289 ----
       * first. We don't want to go back to recovery in that case.
       */
      if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_STARTED) &&
!         pmState == PM_STARTUP && Shutdown == NoShutdown)
      {
          /* WAL redo has started. We're out of reinitialization. */
          FatalError = false;
*************** sigusr1_handler(SIGNAL_ARGS)
*** 4300,4306 ****
          pmState = PM_RECOVERY;
      }
      if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
!         pmState == PM_RECOVERY)
      {
          /*
           * Likewise, start other special children as needed.
--- 4300,4306 ----
          pmState = PM_RECOVERY;
      }
      if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
!         pmState == PM_RECOVERY && Shutdown == NoShutdown)
      {
          /*
           * Likewise, start other special children as needed.
*************** sigusr1_handler(SIGNAL_ARGS)
*** 4331,4337 ****
          signal_child(SysLoggerPID, SIGUSR1);
      }

!     if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER))
      {
          /*
           * Start one iteration of the autovacuum daemon, even if autovacuuming
--- 4331,4338 ----
          signal_child(SysLoggerPID, SIGUSR1);
      }

!     if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER) &&
!         Shutdown == NoShutdown)
      {
          /*
           * Start one iteration of the autovacuum daemon, even if autovacuuming
*************** sigusr1_handler(SIGNAL_ARGS)
*** 4345,4351 ****
          start_autovac_launcher = true;
      }

!     if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER))
      {
          /* The autovacuum launcher wants us to start a worker process. */
          StartAutovacuumWorker();
--- 4346,4353 ----
          start_autovac_launcher = true;
      }

!     if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER) &&
!         Shutdown == NoShutdown)
      {
          /* The autovacuum launcher wants us to start a worker process. */
          StartAutovacuumWorker();
*************** sigusr1_handler(SIGNAL_ARGS)
*** 4354,4360 ****
      if (CheckPostmasterSignal(PMSIGNAL_START_WALRECEIVER) &&
          WalReceiverPID == 0 &&
          (pmState == PM_STARTUP || pmState == PM_RECOVERY ||
!          pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY))
      {
          /* Startup Process wants us to start the walreceiver process. */
          WalReceiverPID = StartWalReceiver();
--- 4356,4363 ----
      if (CheckPostmasterSignal(PMSIGNAL_START_WALRECEIVER) &&
          WalReceiverPID == 0 &&
          (pmState == PM_STARTUP || pmState == PM_RECOVERY ||
!          pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY) &&
!         Shutdown == NoShutdown)
      {
          /* Startup Process wants us to start the walreceiver process. */
          WalReceiverPID = StartWalReceiver();

Re: BUG #7643: Issuing a shutdown request while server startup leads to server hang

От
Hari Babu
Дата:
Tom Lane <tgl@sss.pgh.pa.us > writes:
>I concluded that it probably wasn't a good idea to have the additional
state transition in SIGINT handling.  Generally >PM_STARTUP means "we're
running the startup process and nothing else", and that's useful state info
that we shouldn't >throw away lightly.

I tested the patch, it is working fine.

I have a query regarding the changing of pmState only incase of SIGTERM but
not incase of SIGINT.
Is there any reason to handle like this? or Is it ok to modify the handling
of pmState in SIGTERM same as SIGINT?


Regards,
Hari babu.

Re: BUG #7643: Issuing a shutdown request while server startup leads to server hang

От
Tom Lane
Дата:
Hari Babu <haribabu.kommi@huawei.com> writes:
> I have a query regarding the changing of pmState only incase of SIGTERM but
> not incase of SIGINT.
> Is there any reason to handle like this? or Is it ok to modify the handling
> of pmState in SIGTERM same as SIGINT?

AFAICT the "smart shutdown" code path is okay: it transitions to
PM_WAIT_READONLY state, and then PostmasterStateMachine issues SIGTERM
to the startup process and goes to PM_WAIT_BACKENDS state.  Perhaps this
could be done more clearly but I'm disinclined to mess with it.

            regards, tom lane