Обсуждение: Re: [HACKERS] strange parallel query behavior after OOM crashes
[ Adding Julien, whose patch this was. ] On Thu, Apr 6, 2017 at 5:34 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > While performing StartupDatabase, both master and standby server > behave in similar way till postmaster spawns startup process. > In master, startup process completes its job and dies. As a result, > reaper is called which in turn calls maybe_start_bgworker(). > In standby, after getting a valid snapshot, startup process sends > postmaster a signal to enable connections. Signal handler in > postmaster calls maybe_start_bgworker(). > In maybe_start_bgworker(), if we find a crashed bgworker(crashed_at != > 0) with a NEVER RESTART flag, we call ForgetBackgroundWorker().to > forget the bgworker process. > > I've attached the patch for adding an argument in > ForgetBackgroundWorker() to indicate a crashed situation. Based on > that, we can take the necessary actions. I've not included the Assert > statement in this patch. This doesn't seem right, because this will potentially pass wasCrashed = true even if there has been no crash-and-restart cycle. The place where you're passing "true" only knows that rw->rw_crashed_at != 0 && rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART, but that doesn't prove much of anything. Some *other* background worker could have crashed, rather than the one at issue. Even there's only one worker involved, the test for whether to call HandleChildCrash() involves other factors: if ((rw->rw_worker.bgw_flags & BGWORKER_SHMEM_ACCESS) != 0) { if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus)) { HandleChildCrash(pid, exitstatus, namebuf); return true; } } if (!ReleasePostmasterChildSlot(rw->rw_child_slot) && (rw->rw_worker.bgw_flags & BGWORKER_SHMEM_ACCESS) != 0) { HandleChildCrash(pid, exitstatus, namebuf); return true; } After some study, I think the solution here is as follows: 1. Forget BGW_NEVER_RESTART workers in ResetBackgroundWorkerCrashTimes() rather than leaving them around to be cleaned up after the conclusion of the restart, so that they go away before rather than after shared memory is reset. 2. Don't allow BGWORKER_CLASS_PARALLEL workers to be anything other than BGW_NEVER_RESTART, so that (1) suffices to guarantee that, after recovering from a crash, 0 is the correct starting value for parallel_register_count. (Alternatively, we could iterate through the worker list and compute the correct starting value for parallel_register_count, but that seems silly since we only ever expect BGW_NEVER_RESTART here anyway.) The attached patch seems to fix the problem for me. I'll commit it tomorrow unless somebody sees a problem with it; if that happens, I'll provide some other kind of update tomorrow. [For clarity, for official status update purposes, I am setting a next-update date of tomorrow, 2017-04-11.] -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
On Tue, Apr 11, 2017 at 1:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
1. Forget BGW_NEVER_RESTART workers in
ResetBackgroundWorkerCrashTimes() rather than leaving them around to
be cleaned up after the conclusion of the restart, so that they go
away before rather than after shared memory is reset.
Now with this, would it still be required to forget BGW_NEVER_RESTART workers in maybe_start_bgworker():
if (rw->rw_crashed_at != 0)
{
if (rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART)
{
ForgetBackgroundWorker(&iter);
continue;
}
......
}
Regards,
Neha
Neha
On Mon, Apr 10, 2017 at 2:32 PM, Neha Khatri <nehakhatri5@gmail.com> wrote: > On Tue, Apr 11, 2017 at 1:16 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> 1. Forget BGW_NEVER_RESTART workers in >> ResetBackgroundWorkerCrashTimes() rather than leaving them around to >> be cleaned up after the conclusion of the restart, so that they go >> away before rather than after shared memory is reset. > > Now with this, would it still be required to forget BGW_NEVER_RESTART > workers in maybe_start_bgworker(): > > if (rw->rw_crashed_at != 0) > { > if (rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART) > { > ForgetBackgroundWorker(&iter); > continue; > } > ...... > } Well, as noted before, not every background worker failure leads to a crash-and-restart; for example, a non-shmem-connected worker or one that exits with status 1 will set rw_crashed_at but will not cause a crash-and-restart cycle. It's not 100% clear to me whether the code you're talking about can be reached in those cases, but I wouldn't bet against it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Apr 11, 2017 at 2:36 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Apr 10, 2017 at 2:32 PM, Neha Khatri <nehakhatri5@gmail.com> wrote: >> On Tue, Apr 11, 2017 at 1:16 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> 1. Forget BGW_NEVER_RESTART workers in >>> ResetBackgroundWorkerCrashTimes() rather than leaving them around to >>> be cleaned up after the conclusion of the restart, so that they go >>> away before rather than after shared memory is reset. >> +1 for the solution. >> Now with this, would it still be required to forget BGW_NEVER_RESTART >> workers in maybe_start_bgworker(): >> >> if (rw->rw_crashed_at != 0) >> { >> if (rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART) >> { >> ForgetBackgroundWorker(&iter); >> continue; >> } >> ...... >> } > > Well, as noted before, not every background worker failure leads to a > crash-and-restart; for example, a non-shmem-connected worker or one > that exits with status 1 will set rw_crashed_at but will not cause a > crash-and-restart cycle. It's not 100% clear to me whether the code > you're talking about can be reached in those cases, but I wouldn't bet > against it. I think that for above-mentioned background workers, we follow a different path to call ForgetBackgroundWorker(). CleanupBackgroundWorker() - ReportBackgroundWorkerExit()- ForgetBackgroundWorker() But, I'm not sure for any other cases. Should we include the assert statement as well? -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com