Discussion: background workers, round three
Last week, I attempted to write some code to perform a trivial operation in parallel by launching background workers. Despite my earlier convictions that I'd built enough infrastructure to make this possible, the experiment turned out badly - so, patches!

It's been pretty obvious to me from the beginning that any sort of parallel computation would need a way to make sure that if any worker dies, everybody dies. Conversely, if the parent aborts, all the workers should die. My thought was that this could be implemented using PG_TRY()/PG_CATCH() blocks over the existing infrastructure, but this turned out to be naive. If the original backend encounters an error before the child manages to start up, there's no good recovery strategy. The parent can't kill the child because it doesn't exist yet, and the parent can't stop the postmaster from starting the child later. The parent could try waiting until the child starts up and THEN killing it, but that's probably unreliable and surely mind-numbingly lame.

The first attached patch, terminate-worker-v1.patch, rectifies this problem by providing a TerminateBackgroundWorker() function. When this function is invoked, the postmaster will become unwilling to restart the worker and will send it a SIGTERM if it's already running. (It's important that the SIGTERM be sent by the postmaster, because if the original backend tries to send it, there's a race condition: the process might not be started at the time the original backend tries to send the signal, but the postmaster might start it before it sees the terminate request.)

By itself, this is useful, but painful. The pain comes from the fact that all of the house-keeping is left to the programmer. It is possible but not elegant to use something like PG_ENSURE_ERROR_CLEANUP() to ensure that all background workers are terminated even on an error exit from the affected code.
The other half of the problem is harder: how do we ensure not only that the untimely demise of a worker process aborts the original backend's transaction, but that it does so in a relatively timely fashion? If the original backend is executing only a very limited amount of code while the parallel workers remain in existence, it would be possible to add explicit checks for the demise of a worker at periodic intervals through all of that code. But this seems a very limiting approach. In the hope of making things better, the second patch attached herewith, ephemeral-precious-v1.patch, adds two new flags, BGWORKER_EPHEMERAL and BGWORKER_PRECIOUS.

Setting the BGWORKER_EPHEMERAL flag causes the background worker to be killed when the registrant's (sub)transaction ends. This eliminates the need to catch errors and explicitly invoke TerminateBackgroundWorker() in the error path. You can simply register an ephemeral worker, write code to do stuff with it, and then terminate it. If an error occurs part-way through, the worker will be terminated as part of the abort path.

Setting the BGWORKER_PRECIOUS flag causes the unexpected death of the worker to abort the registrant's current (sub)transaction. This eliminates the need to sprinkle the code with checks for a deceased worker. Instead, you can simply register a precious worker, and then just remember to CHECK_FOR_INTERRUPTS().

There were a couple of awkward cases here. First, all the existing stuff that hooks into ProcessInterrupts() makes provision for handling the ImmediateInterruptOK case. I felt that was superfluous here, so instead simply prohibited leaving a precious background worker running beyond the end of the statement. The point is to enable parallel computation, which will, I think, begin and end within the lifespan of one query. Second, odd things happen if the original backend launches precious workers and then begins a subtransaction.
Throwing an error in the subtransaction will not do the right thing; the subtransaction may easily be something like a PL/pgSQL exception block, and the error would be caught and result in unexpected behavior. So we don't. I just added a comment saying that if you do decide to start a subtransaction while you've got precious workers outstanding, you'd better insert an explicit check for whether they're still alive after unwinding the subtransaction (there's a function to make that easy). We could probably build a mechanism to allow an error to be thrown against a (sub)xact other than the innermost one, implicitly aborting everything below that with extreme prejudice, but it seems like overkill. I can't help but imagine that early versions of parallel-anything will include "starting subtransactions" on the list of activities which are prohibited in parallel mode.

Using the infrastructure provided by those patches, I was able to write some test code, attached as pingpong-v1.patch. You can make a backend fire up a background worker, and the two will take turns setting each other's latches for a number of iterations you specify. This could possibly be adapted into a regression test, if people think it's valuable, but for the moment I'm just including it as a demonstration of the functionality, not intended for commit. Rather gratifyingly, you can set a large iteration count and then interrupt or terminate the foreground process, or terminate the background process, and the other one goes away as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
Hi,

Finally I got the chance to put my hands on this code. Really sorry for the late reply.

On Fri, Sep 13, 2013 at 10:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Last week, I attempted to write some code to perform a trivial
> operation in parallel by launching background workers. Despite my
> earlier convictions that I'd built enough infrastructure to make this
> possible, the experiment turned out badly - so, patches!
>
> It's been pretty obvious to me from the beginning that any sort of
> parallel computation would need a way to make sure that if any worker
> dies, everybody dies. Conversely, if the parent aborts, all the
> workers should die. My thought was that this could be implemented
> using PG_TRY()/PG_CATCH() blocks over the existing infrastructure, but
> this turned out to be naive. If the original backend encounters an
> error before the child manages to start up, there's no good recovery
> strategy. The parent can't kill the child because it doesn't exist
> yet, and the parent can't stop the postmaster from starting the child
> later. The parent could try waiting until the child starts up and
> THEN killing it, but that's probably unreliable and surely
> mind-numbingly lame. The first attached patch,
> terminate-worker-v1.patch, rectifies this problem by providing a
> TerminateBackgroundWorker() function. When this function is invoked,
> the postmaster will become unwilling to restart the worker and will
> send it a SIGTERM if it's already running.

And here are some comments after reading the first patch... The patch looks good, creates no warnings and is not that invasive in the existing code.
Few comments about the code:

1) In postmaster.c, what about adding a comment here telling that we can forget about this bgworker, as it has already been requested for termination:

+        if (rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART
+            || rw->rw_terminate)

2) Trying to think about this patch as an independent feature, I think that it would have been nice to demonstrate the capacity of TerminateBackgroundWorker directly with an example in worker_spi to show a direct application of it, with for example the addition of a function like worker_spi_terminate. However this needs:
- either an additional API taking a PID as input, the pid being used to fetch a bgworker handle with which the worker is terminated afterwards. This is basically what I did in the patch attached when playing with your patch. You might find it useful... Or not! It introduces as well worker_spi_terminate, a small API scanning the array of bgworker slots. Not sure that this is much compatible with the concept of generation though...
- or to return not only the PID of the worker started dynamically in worker_spi_launch, but also the generation number of the worker, to be able to build a bgworker handle that could afterwards be used with TerminateBackgroundWorker.

Note that this comment is based on my personal experience with bgworkers, and I think that it is important to provide examples of what is implemented so that users are not put off the track, even if playing with bgworkers is high-level hacking. TerminateBackgroundWorker can offer users a more native way to kill a bgworker than pg_terminate_backend, for example, especially if they implement a bgworker structure of the launcher/workers type like what autovacuum does.

3) Documentation is needed, following comment 2).
> (It's important that the
> SIGTERM be sent by the postmaster, because if the original backend
> tries to send it, there's a race condition: the process might not be
> started at the time the original backend tries to send the signal, but
> the postmaster might start it before it sees the terminate request.)

That's interesting. Nothing more to say about that (perhaps because of my lack of knowledge of the code in this area).

> By itself, this is useful, but painful. The pain comes from the fact
> that all of the house-keeping is left to the programmer.

If this is necessary, so be it. But I think that newcomers to background workers would appreciate at least an example in worker_spi (using as a base what I provided in the patch attached) that they could refer to when beginning to implement a new feature. This is a thing that people could, and for sure would, refer to when implementing a complex thing using this infrastructure.

This is all I have about the 1st patch... It is already late here, so I'll have a look at the 2nd/3rd patches hopefully tomorrow or the day after.

Regards,
--
Michael
Attachments
On Fri, Oct 11, 2013 at 9:27 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Finally I got the chance to put my hands on this code. Really sorry
> for the late reply.

Thanks for the review. I'll respond to this in more detail later, but to make a long story short, I'm looking to apply terminate-worker-v1.patch (possibly with modifications after going over your review comments), but I'm not feeling too good any more about ephemeral-precious-v1.patch, because my experience with those facilities has so far proved unsatisfying. I think I'd like to withdraw the latter patch pending further study.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I briefly checked these patches. Let me add some comments.

* terminate-worker-v1.patch
TerminateBackgroundWorker() turns on the slot->terminate flag under an LW_SHARED lock. Is that reasonable, given who the possible callers are?

* ephemeral-precious-v1.patch
AtEOXact_BackgroundWorker() is located around the other AtEOXact_* routines. Doesn't that make resource management complicated? In the case where the main process goes into its error handler but the worker process is still running healthily, the worker may continue to calculate something and put its results on the shared memory segment, even though the main process has asked the postmaster to kill it.

All the ResourceOwnerRelease() callbacks are invoked prior to AtEOXact_BackgroundWorker(), so it is hard to release resources being used by a background worker: the workers keep running healthily until they receive the termination signal, which is sent later. In addition, it makes the implementation complicated if we need to design background workers to release resources if and when they are terminated. I don't think it is good coding style if we need to release resources in different locations depending on context. So, I'd like to propose adding a new invocation point for ResourceOwnerRelease() after all the AtEOXact_* jobs, with a new label, something like RESOURCE_RELEASE_FINAL.

In addition, AtEOXact_BackgroundWorker() does not synchronize with the termination of the background worker processes being killed. Of course it depends on the situation, but I think it is a good idea to wait for the worker processes to finish terminating, to ensure that the resources to be released have been handed back to the main process by the time the above ResourceOwnerRelease() does its job.

Thanks,

2013/10/11 Robert Haas <robertmhaas@gmail.com>:
> On Fri, Oct 11, 2013 at 9:27 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Finally I got the chance to put my hands on this code. Really sorry
>> for the late reply.
>
> Thanks for the review.
> I'll respond to this in more detail later, but
> to make a long story short, I'm looking to apply
> terminate-worker-v1.patch (possibly with modifications after going
> over your review comments), but I'm not feeling too good any more
> about ephemeral-precious-v1.patch, because my experience with those
> facilities has so far proved unsatisfying. I think I'd like to
> withdraw the latter patch pending further study.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

--
KaiGai Kohei <kaigai@kaigai.gr.jp>
On Fri, Oct 11, 2013 at 9:27 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Few comments about the code:
> 1) In postmaster.c, what about adding a comment here telling that we
> can forget about this bgworker as it has already been requested for a
> termination:
> +        if (rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART
> +            || rw->rw_terminate)

I think that'd be just repeating what the code already says.

> 2) Trying to think about this patch as an independent feature, I think
> that it would have been nice to demonstrate the capacity of
> TerminateBackgroundWorker directly with an example in worker_spi to
> show a direct application of it, with for example the addition of a
> function like worker_spi_terminate. However this needs:
> - either an additional API using as input the PID, pid used to fetch a
> bgworker handle terminating the worker afterwards. This is basically
> what I did in the patch attached when playing with your patch. You
> might find it useful... Or not! It introduces as well

I think this is kind of missing the point of this interface. If you've already got the PID, you can just kill the PID yourself. But suppose you've just registered a background worker in the parent process, and then there's an ERROR. You want the background worker to die, since you don't need it any more, but the postmaster might not have even started it yet, so there's no PID. That's the case in which this is really useful.

It's not a good match for what worker_spi needs, because in worker_spi, you're intending to start a process that you *expect* to outlive the user backend's transaction, and even the user backend's lifetime.

I agree we need better testing for this code, but the problem is that this is all pretty low-level plumbing. This patch set was prompted by problems that came up when I tried to build an application using this facility, and I don't know that this is the last patch that will be needed to solve all of those problems.
I'm working on another patch set that adds a shared-memory message queue through which messages can be pushed, and I think that will lend itself well to testing of both background workers and dynamic shared memory. I hope to have that ready in time for the next CommitFest.

> 3) Documentation is needed, following comment 2).

Good point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Oct 12, 2013 at 4:53 PM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
> I briefly checked these patches. Let me add some comments.

Thanks!

> * terminate-worker-v1.patch
> TerminateBackgroundWorker() turns on slot->terminate flag under
> LW_SHARED lock. Is it reasonable?

Hmm. It's probably harmless the way it is, just because if two processes do that at the same time, it's not going to change the outcome. But it might be better to use LW_EXCLUSIVE just to make it easy to reason about the logic. I think I cut-and-pasted without thinking carefully about that.

> * ephemeral-precious-v1.patch
> AtEOXact_BackgroundWorker() is located around other AtEOXact_*
> routines. Doesn't it make resource management complicated?
> In case when main process goes into error handler but worker
> process is still running in health, it may continue to calculate
> something and put its results on shared memory segment, even
> though main process suggest postmaster to kill it.

Since I wrote this patch set, I've been thinking a lot more about error recovery. Obviously, one of the big problems as we think about parallel query is that you've now got multiple backends floating around, and if the transaction aborts (in any backend), the other backends don't automatically know that; they need some way to know that they, too, should abort processing. There are a lot of details to get right here, and the time I've spent on it so far convinces me that the problem is anything but easy.

Having said that, I'm not too concerned about the particular issue that you raise here. The resources that need to be cleaned up during transaction abort are backend-private resources. If, for example, the user backend detaches a dynamic shared memory segment that is being used for a parallel computation, they're not actually *destroying* the segment; they are just detaching it *from their address space*.
The last process to detach it will also destroy it. So the ordering in which the various processes detach it doesn't matter much.

One of the things I think is necessary is a set of on_dsm_detach callbacks that work pretty much the way that on_shmem_exit callbacks work today. Just as we can't detach from the main shared memory segment without releasing locks and buffer pins and lwlocks and our PGXACT, we can't detach from a dynamic shared memory segment without performing any similar cleanup that is needed. I'm currently working on a patch for that.

> All the ResourceOwnerRelease() callbacks are located prior to
> AtEOXact_BackgroundWorker(), it is hard to release resources
> being in use by background worker, because they are healthy
> running until it receives termination signal, but sent later.
> In addition, it makes implementation complicated if we need to
> design background workers to release resources if and when it
> is terminated. I don't think it is a good coding style, if we need
> to release resources in different location depending on context.

Which specific resources are you concerned about?

> So, I'd like to propose to add a new invocation point of
> ResourceOwnerRelease() after all AtEOXact_* jobs, with
> new label something like RESOURCE_RELEASE_FINAL.
>
> In addition, AtEOXact_BackgroundWorker() does not synchronize
> termination of background worker processes being killed.
> Of course it depends on situation, I think it is good idea to wait
> for completion of worker processes to be terminated, to ensure
> resource to be released is backed to the main process if above
> ResourceOwnerRelease() do the job.

Again, which resources are we talking about here? I tend to think it's an essential property of the system that we *shouldn't* have to care about the order in which processes are terminated.
First, that will be difficult to control; if an ERROR or FATAL condition has occurred and we need to terminate, then there are real limits to what guarantees we can provide after that point. Second, it's also *expensive*. The point of parallelism is to make things faster; any steps we add that involve waiting for other processes to do things will eat away at the available gains. For a query that'll run for an hour that hardly matters, but for short queries it's important to avoid unnecessary overhead.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
2013/10/14 Robert Haas <robertmhaas@gmail.com>:
>> * ephemeral-precious-v1.patch
>> AtEOXact_BackgroundWorker() is located around other AtEOXact_*
>> routines. Doesn't it make resource management complicated?
>> In case when main process goes into error handler but worker
>> process is still running in health, it may continue to calculate
>> something and put its results on shared memory segment, even
>> though main process suggest postmaster to kill it.
>
> Since I wrote this patch set, I've been thinking a lot more about
> error recovery. Obviously, one of the big problems as we think about
> parallel query is that you've now got multiple backends floating
> around, and if the transaction aborts (in any backend), the other
> backends don't automatically know that; they need some way to know
> that they, too, should abort processing. There are a lot of details to
> get right here, and the time I've spent on it so far convinces me that
> the problem is anything but easy.
>
> Having said that, I'm not too concerned about the particular issue
> that you raise here. The resources that need to be cleaned up during
> transaction abort are backend-private resources. If, for example, the
> user backend detaches a dynamic shared memory segment that is being
> used for a parallel computation, they're not actually *destroying* the
> segment; they are just detaching it *from their address space*. The
> last process to detach it will also destroy it. So the ordering in
> which the various processes detach it doesn't matter much.
>
> One of the things I think is necessary is a set of on_dsm_detach
> callbacks that work pretty much the way that on_shmem_exit callbacks
> work today. Just as we can't detach from the main shared memory
> segment without releasing locks and buffer pins and lwlocks and our
> PGXACT, we can't detach from a dynamic shared memory segment without
> performing any similar cleanup that is needed. I'm currently working
> on a patch for that.

Hmm.
That would probably allow cleaning up a smaller fraction of the data structures constructed on a dynamic shared memory segment, if we map/unmap it for each transaction.

>> All the ResourceOwnerRelease() callbacks are located prior to
>> AtEOXact_BackgroundWorker(), it is hard to release resources
>> being in use by background worker, because they are healthy
>> running until it receives termination signal, but sent later.
>> In addition, it makes implementation complicated if we need to
>> design background workers to release resources if and when it
>> is terminated. I don't think it is a good coding style, if we need
>> to release resources in different location depending on context.
>
> Which specific resources are you concerned about?

Because of my own use case, I was assuming smaller chunks allocated on a static or dynamic shared memory segment, used to communicate between the main process and worker processes. When we move a chunk of data to a co-processor using an asynchronous DMA transfer, the API requires that the source buffer be mlock()'ed to avoid unintentional swap-out during the DMA transfer. On the other hand, the cost of the mlock() operation is not negligible, so it may be a reasonable design to lock a shared memory segment at start-up time and then continue to use it, without unmapping. So, I wondered how to handle the situation where an extension tries to manage a resource with smaller granularity than the one managed by the PostgreSQL core.

>> So, I'd like to propose to add a new invocation point of
>> ResourceOwnerRelease() after all AtEOXact_* jobs, with
>> new label something like RESOURCE_RELEASE_FINAL.
>>
>> In addition, AtEOXact_BackgroundWorker() does not synchronize
>> termination of background worker processes being killed.
>> Of course it depends on situation, I think it is good idea to wait
>> for completion of worker processes to be terminated, to ensure
>> resource to be released is backed to the main process if above
>> ResourceOwnerRelease() do the job.
> Again, which resources are we talking about here? I tend to think
> it's an essential property of the system that we *shouldn't* have to
> care about the order in which processes are terminated. First, that
> will be difficult to control; if an ERROR or FATAL condition has
> occurred and we need to terminate, then there are real limits to what
> guarantees we can provide after that point. Second, it's also
> *expensive*. The point of parallelism is to make things faster; any
> steps we add that involve waiting for other processes to do things
> will eat away at the available gains. For a query that'll run for an
> hour that hardly matters, but for short queries it's important to
> avoid unnecessary overhead.

Indeed, you are right. The error path has to terminate quickly. If I implement this, the ResourceOwnerRelease() callback will probably need some way to inform a worker process that is still running normally that the transaction was aborted, and thus that there is no need to return its calculation results.

Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>
On Tue, Oct 15, 2013 at 4:02 PM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
> Hmm. It probably allows to clean-up smaller fraction of data structure
> constructed on dynamic shared memory segment, if we map / unmap
> for each transactions.

I think the primary use of dynamic shared memory will be for segments that get created for a single transaction, used, and then destroyed. Perhaps people will find other uses for them that involve keeping them mapped for longer periods of time, and that is fine. But whether long or short, it seems clear to me that shared memory data structures will need cleanup on detach, just as backends must clean up the main shared memory segment before they detach.

> I assumed smaller chunks allocated on static or dynamic shared
> memory segment to be used for communicate between main
> process and worker processes because of my motivation.
> When we move a chunk of data to co-processor using asynchronous
> DMA transfer, API requires the source buffer is mlock()'ed to avoid
> unintentional swap out during DMA transfer. On the other hand,
> cost of mlock() operation is not ignorable, so it may be a reasonable
> design to lock a shared memory segment on start-up time then
> continue to use it, without unmapping.
> So, I wondered how to handle the situation when extension tries
> to manage a resource with smaller granularity than the one
> managed by PostgreSQL core.

I'm still not sure exactly what your concern is here, but I agree there are some thorny issues around resource management here. I've spent a lot of the last couple months trying to sort through them. As it seems to me, the problem is basically that we're introducing major new types of resources (background workers, dynamic shared memory segments, and perhaps other things) and they all require management - just as our existing types of resources (locks, buffer pins, etc.) require management.
But the code to manage existing resources has been around for so long that we just rely on it without thinking about it, so when we suddenly DO need to think about it, it's an unpleasant surprise.

As far as shared memory resources specifically, one consideration is that some systems (e.g. some versions of HP-UX) have fairly small limits (< 15) on the number of shared memory segments that can be mapped, and all 32-bit systems are prone to running out of usable address space. So portable code is going to have to use these resources judiciously. On the other hand, I think that people wanting to write code that only needs to run on 64-bit Linux will be able to pretty much go nuts.

Unless there are further objections to the terminate patch as written, I'm going to go ahead and commit that, with the addition of documentation (as pointed out by Michael) and a change to the lock mode (as pointed out by KaiGai). The other patches, at least for the time being, are withdrawn.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company