Обсуждение: BUG #7643: Issuing a shutdown request while server startup leads to server hang
BUG #7643: Issuing a shutdown request while server startup leads to server hang
От
haribabu.kommi@huawei.com
Дата:
The following bug has been logged on the website: Bug reference: 7643 Logged by: Haribabu Email address: haribabu.kommi@huawei.com PostgreSQL version: 9.2.1 Operating system: Suse Linux 10.2 Description: = Problem Reproduction: 1. Add recovery.conf to the database directory. 2. Start the server 3. Issue the shutdown request = and the shutdown request timing should be such that below server logs should print. Log: ./postgres -D data -p 3335 LOG: database system was shut down in recovery at 2012-11-08 19:42:42 IST LOG: entering standby mode LOG: received fast shutdown request LOG: consistent recovery state reached at 0/17D0700 LOG: record with zero length at 0/17D0700 Problem reproduced in 9.3 head. I think the following things are leading to the problem: 1. The postmaster is queued up with the following signals in the order of SIGINT (shutdown) and SIGUSR1 (Here we create bgwriter and checkpoint) and the current state of postmaster is PM_STARTUP. we are creating the bgwriter and checkpoint after processing the shutdown request, because of this reason no one is there is to send the SIGTERM to bgwriter and checkpoint.
haribabu.kommi@huawei.com writes: > Problem Reproduction: > 1. Add recovery.conf to the database directory. > 2. Start the server > 3. Issue the shutdown request > and the shutdown request timing should be such that below server logs should > print. > Log: > ./postgres -D data -p 3335 > LOG: database system was shut down in recovery at 2012-11-08 19:42:42 IST > LOG: entering standby mode > LOG: received fast shutdown request > LOG: consistent recovery state reached at 0/17D0700 > LOG: record with zero length at 0/17D0700 > Problem reproduced in 9.3 head. After further investigation, I can't reproduce this and I don't believe your patch fixes it. What happens when I try this is * postmaster gets SIGINT, sends SIGTERM to startup process * startup process exits with exit(1) * postmaster sees that as a startup crash and exits, per the first test in reaper() So the log trace I'm getting looks like LOG: received fast shutdown request LOG: startup process (PID 9772) exited with exit code 1 LOG: aborting startup due to startup process failure Now, transitioning to PM_WAIT_BACKENDS state earlier, as your patch proposes, might make the log look a bit nicer because the logic in reaper() wouldn't think the exit was a "crash". But it's not going to have anything to do with whether the startup process exits on the signal or not. What seems to have happened for you is that the startup process ignored the SIGTERM signal, but it's not at all obvious why. We're going to need more details about how to reproduce this. I speculate it might have something to do with the specific restore_command you're using. regards, tom lane
>haribabu.kommi@huawei.com writes: > Problem Reproduction: > 1. Add recovery.conf to the database directory. > 2. Start the server > 3. Issue the shutdown request > and the shutdown request timing should be such that below server logs should > print. > Log: > ./postgres -D data -p 3335 > LOG: database system was shut down in recovery at 2012-11-08 19:42:42 IST > LOG: entering standby mode > LOG: received fast shutdown request > LOG: consistent recovery state reached at 0/17D0700 > LOG: record with zero length at 0/17D0700 > Problem reproduced in 9.3 head. >After further investigation, I can't reproduce this and I don't believe >your patch fixes it. What happens when I try this is >* postmaster gets SIGINT, sends SIGTERM to startup process >* startup process exits with exit(1) >* postmaster sees that as a startup crash and exits, per the first >test in reaper() >So the log trace I'm getting looks like >LOG: received fast shutdown request >LOG: startup process (PID 9772) exited with exit code 1 >LOG: aborting startup due to startup process failure >Now, transitioning to PM_WAIT_BACKENDS state earlier, as your patch >proposes, might make the log look a bit nicer because the logic in >reaper() wouldn't think the exit was a "crash". But it's not going to >have anything to do with whether the startup process exits on the signal >or not. What seems to have happened for you is that the startup process >ignored the SIGTERM signal, but it's not at all obvious why. >We're going to need more details about how to reproduce this. >I speculate it might have something to do with the specific >restore_command you're using. The problem occurs only when active server is restarting by just adding a recovery.conf file to the data directory. No need of specifying any restore command. or the standby server restart also can lead to this problem. The startup process sends "PMSIGNAL_RECOVERY_STARTED" to postmaster only incase of "InArchiveRecovery" flag is enabled. The SIGINT signal should reach postmaster before the "PMSIGNAL_RECOVERY_STARTED" sent by the startup process. with the following code change in the startupXlog function, the issue can reproduce very easily. if (InArchiveRecovery && IsUnderPostmaster) { PublishStartupProcessInformation(); SetForwardFsyncRequests(); kill (PostmasterPid, SIGINT); SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED); bgwriterLaunched = true; } Please let me know if I miss anything. Regards, Hari babu.
Hari Babu <haribabu.kommi@huawei.com> writes: >> We're going to need more details about how to reproduce this. > The problem occurs only when active server is restarting by just adding a > recovery.conf file to the data directory. Well, you can't just put an empty file there, but I eventually managed to reproduce this with the suggested hack in xlog.c. I think the key problem is that postmaster.c's sigusr1_handler() is willing to start new children even after shutdown has been initiated. I don't see any good reason for it to do that, so I think the appropriate patch is as attached. Changing that still leaves us with the postmaster thinking that the eventual exit(1) of the startup process is a "crash". This is mostly cosmetic since it still shuts down okay, but we can fix it by reversing the order of the first two checks in reaper() --- that is, if Shutdown is set, we should prefer that code path even if we're in PM_STARTUP state. I concluded that it probably wasn't a good idea to have the additional state transition in SIGINT handling. Generally PM_STARTUP means "we're running the startup process and nothing else", and that's useful state info that we shouldn't throw away lightly. regards, tom lane diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index b223feefbab0645667449f643c6c8adee3747ef0..6f93d93fa3f7577fb9157f0bea805c427e3605dd 100644 *** a/src/backend/postmaster/postmaster.c --- b/src/backend/postmaster/postmaster.c *************** pmdie(SIGNAL_ARGS) *** 2261,2269 **** if (pmState == PM_RECOVERY) { /* ! * Only startup, bgwriter, and checkpointer should be active ! * in this state; we just signaled the first two, and we don't ! * want to kill checkpointer yet. */ pmState = PM_WAIT_BACKENDS; } --- 2261,2269 ---- if (pmState == PM_RECOVERY) { /* ! * Only startup, bgwriter, walreceiver, and/or checkpointer ! * should be active in this state; we just signaled the first ! * three, and we don't want to kill checkpointer yet. */ pmState = PM_WAIT_BACKENDS; } *************** reaper(SIGNAL_ARGS) *** 2355,2360 **** --- 2355,2372 ---- StartupPID = 0; /* + * Startup process exited in response to a shutdown request (or it + * completed normally regardless of the shutdown request). + */ + if (Shutdown > NoShutdown && + (EXIT_STATUS_0(exitstatus) || EXIT_STATUS_1(exitstatus))) + { + pmState = PM_WAIT_BACKENDS; + /* PostmasterStateMachine logic does the rest */ + continue; + } + + /* * Unexpected exit of startup process (including FATAL exit) * during PM_STARTUP is treated as catastrophic. There are no * other processes running yet, so we can just exit. *************** reaper(SIGNAL_ARGS) *** 2369,2386 **** } /* - * Startup process exited in response to a shutdown request (or it - * completed normally regardless of the shutdown request). - */ - if (Shutdown > NoShutdown && - (EXIT_STATUS_0(exitstatus) || EXIT_STATUS_1(exitstatus))) - { - pmState = PM_WAIT_BACKENDS; - /* PostmasterStateMachine logic does the rest */ - continue; - } - - /* * After PM_STARTUP, any unexpected exit (including FATAL exit) of * the startup process is catastrophic, so kill other children, * and set RecoveryError so we don't try to reinitialize after --- 2381,2386 ---- *************** sigusr1_handler(SIGNAL_ARGS) *** 4283,4289 **** * first. We don't want to go back to recovery in that case. */ if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_STARTED) && ! pmState == PM_STARTUP) { /* WAL redo has started. We're out of reinitialization. */ FatalError = false; --- 4283,4289 ---- * first. We don't want to go back to recovery in that case. */ if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_STARTED) && ! pmState == PM_STARTUP && Shutdown == NoShutdown) { /* WAL redo has started. We're out of reinitialization. */ FatalError = false; *************** sigusr1_handler(SIGNAL_ARGS) *** 4300,4306 **** pmState = PM_RECOVERY; } if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) && ! pmState == PM_RECOVERY) { /* * Likewise, start other special children as needed. --- 4300,4306 ---- pmState = PM_RECOVERY; } if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) && ! pmState == PM_RECOVERY && Shutdown == NoShutdown) { /* * Likewise, start other special children as needed. *************** sigusr1_handler(SIGNAL_ARGS) *** 4331,4337 **** signal_child(SysLoggerPID, SIGUSR1); } ! if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER)) { /* * Start one iteration of the autovacuum daemon, even if autovacuuming --- 4331,4338 ---- signal_child(SysLoggerPID, SIGUSR1); } ! if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER) && ! Shutdown == NoShutdown) { /* * Start one iteration of the autovacuum daemon, even if autovacuuming *************** sigusr1_handler(SIGNAL_ARGS) *** 4345,4351 **** start_autovac_launcher = true; } ! if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER)) { /* The autovacuum launcher wants us to start a worker process. */ StartAutovacuumWorker(); --- 4346,4353 ---- start_autovac_launcher = true; } ! if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER) && ! Shutdown == NoShutdown) { /* The autovacuum launcher wants us to start a worker process. */ StartAutovacuumWorker(); *************** sigusr1_handler(SIGNAL_ARGS) *** 4354,4360 **** if (CheckPostmasterSignal(PMSIGNAL_START_WALRECEIVER) && WalReceiverPID == 0 && (pmState == PM_STARTUP || pmState == PM_RECOVERY || ! pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY)) { /* Startup Process wants us to start the walreceiver process. */ WalReceiverPID = StartWalReceiver(); --- 4356,4363 ---- if (CheckPostmasterSignal(PMSIGNAL_START_WALRECEIVER) && WalReceiverPID == 0 && (pmState == PM_STARTUP || pmState == PM_RECOVERY || ! pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY) && ! Shutdown == NoShutdown) { /* Startup Process wants us to start the walreceiver process. */ WalReceiverPID = StartWalReceiver();
Tom Lane <tgl@sss.pgh.pa.us > writes: >I concluded that it probably wasn't a good idea to have the additional state transition in SIGINT handling. Generally >PM_STARTUP means "we're running the startup process and nothing else", and that's useful state info that we shouldn't >throw away lightly. I tested the patch, it is working fine. I have a query regarding the changing of pmState only incase of SIGTERM but not incase of SIGINT. Is there any reason to handle like this? or Is it ok to modify the handling of pmState in SIGTERM same as SIGINT? Regards, Hari babu.
Hari Babu <haribabu.kommi@huawei.com> writes: > I have a query regarding the changing of pmState only incase of SIGTERM but > not incase of SIGINT. > Is there any reason to handle like this? or Is it ok to modify the handling > of pmState in SIGTERM same as SIGINT? AFAICT the "smart shutdown" code path is okay: it transitions to PM_WAIT_READONLY state, and then PostmasterStateMachine issues SIGTERM to the startup process and goes to PM_WAIT_BACKENDS state. Perhaps this could be done more clearly but I'm disinclined to mess with it. regards, tom lane