Discussion: Dubious assertion in RegisterDynamicBackgroundWorker
I noticed this recent crash on lorikeet:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2021-05-05%2009%3A19%3A29

The relevant bits of the log seem to be

2021-05-05 05:36:22.011 EDT [60926716.bb24:1] ERROR:  could not map dynamic shared memory segment
...
2021-05-05 05:36:22.013 EDT [609266c5.b793:4] LOG:  background worker "parallel worker" (PID 47908) exited with exit code 1
...
TRAP: FailedAssertion("BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count <= MAX_PARALLEL_WORKER_LIMIT", File: "/home/andrew/bf/root/REL_13_STABLE/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c", Line: 1016)
*** starting debugger for pid 47743, tid 1264
2021-05-05 05:36:26.629 EDT [609266c5.b793:5] LOG:  server process (PID 47743) exited with exit code 127

So we had a parallel worker fail to start, whereupon its leader went
down with an assertion failure.  I know that the parallel-worker code
is held together with chewing gum and baling wire, but that's a bit
much.

Looking at the indicated code, we find

    /*
     * If this is a parallel worker, check whether there are already too many
     * parallel workers; if so, don't register another one.  Our view of
     * parallel_terminate_count may be slightly stale, but that doesn't really
     * matter: we would have gotten the same result if we'd arrived here
     * slightly earlier anyway.  There's no help for it, either, since the
     * postmaster must not take locks; a memory barrier wouldn't guarantee
     * anything useful.
     */
    if (parallel && (BackgroundWorkerData->parallel_register_count -
                     BackgroundWorkerData->parallel_terminate_count) >=
        max_parallel_workers)
    {
        Assert(BackgroundWorkerData->parallel_register_count -
               BackgroundWorkerData->parallel_terminate_count <=
               MAX_PARALLEL_WORKER_LIMIT);
        LWLockRelease(BackgroundWorkerLock);
        return false;
    }

On what grounds is that Assert not insanity?
We just finished pointing out that we might see an old
parallel_terminate_count, which ISTM says specifically that
parallel_register_count minus parallel_terminate_count might be larger
than expected.

Admittedly, it seems unlikely that the difference could exceed
MAX_PARALLEL_WORKER_LIMIT = 1024 in a regression test run where the
limit on number of parallel workers is only 8.  What I think is more
likely, given that these counters are unsigned, is that the difference
was actually negative.  Which could be a bug, or it could be an
expectable race condition, or it could just be some flakiness on
lorikeet's part (that machine has had a lot of issues lately).

I trawled the buildfarm logs going back 180 days, and found no other
instances of this assertion, which seems to be evidence in favor of
the "lorikeet got flaky" theory.  But it's not proof.

In any case, I see zero value in this assertion, so I propose we
remove it.  If we don't remove it, it needs serious revision, because
it seems absolutely obvious to me that it could trigger when there is
nothing wrong.  A system pushing the limit on number of parallel
workers would be at considerable risk.

			regards, tom lane
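[Editor's note: the unsigned-wraparound scenario described above can be sketched in a few lines of C.  The helper below is purely illustrative (the names and values are invented, not PostgreSQL code); it just shows that if a stale snapshot makes the terminate counter appear one ahead of the register counter, the "negative" difference wraps to a value far above MAX_PARALLEL_WORKER_LIMIT, tripping the Assert.]

```c
#include <stdint.h>

#define MAX_PARALLEL_WORKER_LIMIT 1024	/* value cited in the thread */

/*
 * Model of the Assert's arithmetic: both counters are unsigned, so a
 * difference that is conceptually -1 wraps around to UINT32_MAX.
 */
static uint32_t
worker_count_diff(uint32_t register_count, uint32_t terminate_count)
{
	/* wraps if terminate_count > register_count */
	return register_count - terminate_count;
}
```

For example, worker_count_diff(5, 6) yields 4294967295, which sails past MAX_PARALLEL_WORKER_LIMIT even though the "real" discrepancy is only one worker.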
On Wed, May 5, 2021 at 3:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Admittedly, it seems unlikely that the difference could exceed
> MAX_PARALLEL_WORKER_LIMIT = 1024 in a regression test run where
> the limit on number of parallel workers is only 8.  What I think is
> more likely, given that these counters are unsigned, is that the
> difference was actually negative.  Which could be a bug, or it could
> be an expectable race condition, or it could just be some flakiness
> on lorikeet's part (that machine has had a lot of issues lately).

I think that assertion was added by me, and I think the thought
process was that the value shouldn't go negative, and that if it does
it's probably a bug which we might want to fix.  But since the values
are unsigned I could hardly check for < 0, so I did it this way
instead.

But since there's no memory barrier between the two loads, I guess
there's no guarantee that they have the expected relationship, even if
there is a memory barrier on the store side.  I wonder if it's worth
trying to tighten that up so that the assertion is more meaningful, or
just give up and rip it out.  I'm afraid that if we do have (or
develop) bugs in this area, someone will discover that the effective
max_parallel_workers value on their system slowly drifts up or down
from the configured value, and we'll have no clue where things are
going wrong.  The assertion was intended to give us a chance of
noticing that sort of problem in the buildfarm or on a developer's
machine before the code gets out into the real world.

--
Robert Haas
EDB: http://www.enterprisedb.com
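[Editor's note: the load-side ordering Robert is describing can be illustrated with C11 atomics.  This is a sketch under stated assumptions: the names are invented stand-ins, and PostgreSQL uses its own pg_memory_barrier() primitives rather than <stdatomic.h>.  The point is that a release on the terminate increment only helps if the reader takes its snapshot with a paired acquire, reading terminate_count first.]

```c
#include <stdatomic.h>

/* Toy stand-ins for the shared counters (not the real declarations). */
static _Atomic unsigned int parallel_register_count;
static _Atomic unsigned int parallel_terminate_count;

static void
worker_registered(void)
{
	atomic_fetch_add_explicit(&parallel_register_count, 1,
							  memory_order_relaxed);
}

/*
 * Store side: publish the terminate increment with release ordering, so
 * any reader who observes it also observes the registration that
 * necessarily preceded it.
 */
static void
worker_terminated(void)
{
	atomic_fetch_add_explicit(&parallel_terminate_count, 1,
							  memory_order_release);
}

/*
 * Load side: read terminate_count FIRST, with acquire ordering.  Paired
 * with the release above, the snapshot can then never show
 * terminate_count ahead of register_count.  Reading in the other order,
 * or without barriers (as in the code under discussion), offers no such
 * guarantee.
 */
static unsigned int
active_parallel_workers(void)
{
	unsigned int t = atomic_load_explicit(&parallel_terminate_count,
										  memory_order_acquire);
	unsigned int r = atomic_load_explicit(&parallel_register_count,
										  memory_order_relaxed);

	return r - t;
}
```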
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, May 5, 2021 at 3:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Admittedly, it seems unlikely that the difference could exceed
>> MAX_PARALLEL_WORKER_LIMIT = 1024 in a regression test run where
>> the limit on number of parallel workers is only 8.  What I think is
>> more likely, given that these counters are unsigned, is that the
>> difference was actually negative.  Which could be a bug, or it could
>> be an expectable race condition, or it could just be some flakiness
>> on lorikeet's part (that machine has had a lot of issues lately).

> I think that assertion was added by me, and I think the thought
> process was that the value shouldn't go negative and that if it does
> it's probably a bug which we might want to fix.  But since the values
> are unsigned I could hardly check for < 0, so I did it this way
> instead.

> But since there's no memory barrier between the two loads, I guess
> there's no guarantee that they have the expected relationship, even if
> there is a memory barrier on the store side.  I wonder if it's worth
> trying to tighten that up so that the assertion is more meaningful, or
> just give up and rip it out.  I'm afraid that if we do have (or
> develop) bugs in this area, someone will discover that the effective
> max_parallel_workers value on their system slowly drifts up or down
> from the configured value, and we'll have no clue where things are
> going wrong.  The assertion was intended to give us a chance of
> noticing that sort of problem in the buildfarm or on a developer's
> machine before the code gets out into the real world.
I follow your concern, but I'm not convinced that this assertion is a
useful aid: first because the asynchrony involved makes the edge cases
rather squishy, second because allowing 1024 bogus increments before
complaining will likely mean that developer test runs will not last
long enough to trigger the assertion, and third because if it does
fire, it's too far removed from the perpetrator to be much help in
figuring out what went wrong, or even whether anything *is* wrong.

I've not tried to trace the code, but I'm now a bit suspicious that
there is indeed a design bug here.  I gather from the comments that
parallel_register_count is incremented by the worker processes, which
of course implies that a worker that fails to reattach to shared
memory won't do that.  But parallel_terminate_count is incremented by
the postmaster.  If the postmaster will do that even in the case of a
worker that failed at startup, then lorikeet's symptoms are neatly
explained.

I'd be more comfortable with this code if the increments and
decrements were handled by the same process.

			regards, tom lane
On Wed, May 5, 2021 at 5:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I follow your concern, but I'm not convinced that this assertion is
> a useful aid; first because the asynchrony involved makes the edge
> cases rather squishy, and second because allowing 1024 bogus
> increments before complaining will likely mean that developer test
> runs will not last long enough to trigger the assertion, and third
> because if it does fire it's too far removed from the perpetrator
> to be much help in figuring out what went wrong, or even if
> anything *is* wrong.

Well, it allows 1024 bogus increments in one direction, but a negative
value will trip it PDQ.

> I've not tried to trace the code, but I'm now a bit suspicious
> that there is indeed a design bug here.  I gather from the
> comments that parallel_register_count is incremented by the
> worker processes, which of course implies that a worker that
> fails to reattach to shared memory won't do that.  But
> parallel_terminate_count is incremented by the postmaster.
> If the postmaster will do that even in the case of a worker that
> failed at startup, then lorikeet's symptoms are neatly explained.

parallel_register_count is incremented by
RegisterDynamicBackgroundWorker, i.e. when the worker process is
requested, not after it starts.  To try to do it from the worker after
it's launched would, as you suppose, be completely busted.

The more general point here is that we should be adjusting these at
the same point we allocate and free the worker slots.  Wherever that
happens is the right place to do this, because the goal of the
mechanism is precisely to limit the number of such slots that can be
used by parallel query.

--
Robert Haas
EDB: http://www.enterprisedb.com
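[Editor's note: Robert's design point — bump the counters exactly where slots are reserved and released — can be sketched in a toy, single-threaded model (all names invented for illustration; the real code additionally needs locking/barriers).  By construction, register_count can never fall behind terminate_count here, because a terminate increment can only follow the registration of the same slot.]

```c
#include <assert.h>

#define NUM_SLOTS 8

static int slot_in_use[NUM_SLOTS];
static unsigned int register_count;		/* bumped where a slot is taken */
static unsigned int terminate_count;	/* bumped where a slot is freed */

/* Reserve a free slot; returns its index, or -1 if all are busy. */
static int
reserve_slot(void)
{
	for (int i = 0; i < NUM_SLOTS; i++)
	{
		if (!slot_in_use[i])
		{
			slot_in_use[i] = 1;
			register_count++;	/* same place as the allocation */
			return i;
		}
	}
	return -1;
}

/* Free a previously reserved slot. */
static void
release_slot(int i)
{
	assert(slot_in_use[i]);
	slot_in_use[i] = 0;
	terminate_count++;			/* same place as the deallocation */
}
```

With the increments tied to slot lifecycle like this, register_count - terminate_count is always exactly the number of in-use slots, which is the quantity the max_parallel_workers check actually wants.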
I wrote:
> I've not tried to trace the code, but I'm now a bit suspicious
> that there is indeed a design bug here.  I gather from the
> comments that parallel_register_count is incremented by the
> worker processes, which of course implies that a worker that
> fails to reattach to shared memory won't do that.  But
> parallel_terminate_count is incremented by the postmaster.
> If the postmaster will do that even in the case of a worker that
> failed at startup, then lorikeet's symptoms are neatly explained.

That theory seems to be nonsense.  After a bit more study of the code,
I see that parallel_register_count is incremented by the *leader*
process, when it reserves a BackgroundWorkerSlot for the worker.  And
parallel_terminate_count is incremented by the postmaster when it
releases the slot; so it's darn hard to see how
parallel_terminate_count could get ahead of parallel_register_count.

I noticed that lorikeet's worker didn't fail at shared-memory
reattach, as I first thought, anyway.  It failed at

ERROR:  could not map dynamic shared memory segment

which means we ought to be able to reproduce the symptoms by faking
failure of dsm_attach(), as I did in the quick hack attached.  What I
get is a lot of "parallel worker failed to initialize" and "lost
connection to parallel worker" errors, but no assertion.  (I also
tried this with an EXEC_BACKEND build, just in case that'd change the
behavior, but it didn't.)  So it seems like the "lorikeet is flaky"
theory is looking pretty plausible.

I do see what seems to be a bug-let in ForgetBackgroundWorker.
BackgroundWorkerStateChange is careful to do this when freeing a slot:

        /*
         * We need a memory barrier here to make sure that the load of
         * bgw_notify_pid and the update of parallel_terminate_count
         * complete before the store to in_use.
         */
        notify_pid = slot->worker.bgw_notify_pid;
        if ((slot->worker.bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
            BackgroundWorkerData->parallel_terminate_count++;
        pg_memory_barrier();
        slot->pid = 0;
        slot->in_use = false;

but the mainline case in ForgetBackgroundWorker is a lot less
paranoid:

    Assert(rw->rw_shmem_slot < max_worker_processes);
    slot = &BackgroundWorkerData->slot[rw->rw_shmem_slot];
    if ((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
        BackgroundWorkerData->parallel_terminate_count++;
    slot->in_use = false;

One of these functions is mistaken.  However, I can't construct a
theory whereby that explains lorikeet's symptoms, mainly because Intel
chips don't do out-of-order stores, so the messing with
parallel_terminate_count should be done before in_use is cleared, even
without an explicit memory barrier.

			regards, tom lane

diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 3550ef13ba..d3b00c2f9e 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -1278,6 +1278,11 @@ ParallelWorkerMain(Datum main_arg)
 						 "Parallel worker",
 						 ALLOCSET_DEFAULT_SIZES);
 
+	if (random() < INT_MAX / 64)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("fake failure to map dynamic shared memory segment")));
+
 	/*
 	 * Attach to the dynamic shared memory segment for the parallel query, and
 	 * find its table of contents.
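[Editor's note: the ordering that the "paranoid" BackgroundWorkerStateChange excerpt enforces can be modeled in standalone C11.  This is an illustration, not a patch: field and type names only loosely follow the excerpts, and a release store stands in for pg_memory_barrier() followed by a plain store.  The property being protected is that once another process sees in_use == false and reuses the slot, the parallel_terminate_count bump is already visible to it — which the quoted ForgetBackgroundWorker path does not guarantee on weakly ordered hardware.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of a background-worker slot. */
typedef struct WorkerSlot
{
	int			pid;
	bool		is_parallel;
	_Atomic bool in_use;
} WorkerSlot;

static _Atomic unsigned int parallel_terminate_count;

/*
 * Free a slot with the same ordering discipline as the
 * BackgroundWorkerStateChange excerpt: update the counter and clear pid
 * first, then publish in_use = false with release semantics, so no
 * observer can see the slot free while the counter still lacks the
 * increment.
 */
static void
free_slot(WorkerSlot *slot)
{
	if (slot->is_parallel)
		atomic_fetch_add_explicit(&parallel_terminate_count, 1,
								  memory_order_relaxed);
	slot->pid = 0;

	/* release store = pg_memory_barrier() + plain store, roughly */
	atomic_store_explicit(&slot->in_use, false, memory_order_release);
}
```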
On Wed, May 05, 2021 at 03:46:43PM -0400, Tom Lane wrote:
> I know that the parallel-worker code is held
> together with chewing gum and baling wire

Having spent time in that code, I didn't notice such weakness.  I'm
sure it has bugs, but I doubt its bug density is unusual for
PostgreSQL.