Discussion: background workers, round three
Last week, I attempted to write some code to perform a trivial operation in parallel by launching background workers. Despite my earlier convictions that I'd built enough infrastructure to make this possible, the experiment turned out badly - so, patches!

It's been pretty obvious to me from the beginning that any sort of parallel computation would need a way to make sure that if any worker dies, everybody dies. Conversely, if the parent aborts, all the workers should die. My thought was that this could be implemented using PG_TRY()/PG_CATCH() blocks over the existing infrastructure, but this turned out to be naive. If the original backend encounters an error before the child manages to start up, there's no good recovery strategy. The parent can't kill the child because it doesn't exist yet, and the parent can't stop the postmaster from starting the child later. The parent could try waiting until the child starts up and THEN killing it, but that's probably unreliable and surely mind-numbingly lame.

The first attached patch, terminate-worker-v1.patch, rectifies this problem by providing a TerminateBackgroundWorker() function. When this function is invoked, the postmaster will become unwilling to restart the worker and will send it a SIGTERM if it's already running. (It's important that the SIGTERM be sent by the postmaster, because if the original backend tries to send it, there's a race condition: the process might not be started at the time the original backend tries to send the signal, but the postmaster might start it before it sees the terminate request.)

By itself, this is useful, but painful. The pain comes from the fact that all of the house-keeping is left to the programmer. It is possible but not elegant to use something like PG_ENSURE_ERROR_CLEANUP() to ensure that all background workers are terminated even on an error exit from the affected code.
The other half of the problem is harder: how do we ensure not only that the untimely demise of a worker process aborts the original backend's transaction, but that it does so in a relatively timely fashion? If the original backend is executing only a very limited amount of code while the parallel workers remain in existence, it would be possible to add explicit checks for the demise of a worker at periodic intervals through all of that code. But this seems a very limiting approach. In the hope of making things better, the second patch attached herewith, ephemeral-precious-v1.patch, adds two new flags, BGWORKER_EPHEMERAL and BGWORKER_PRECIOUS.

Setting the BGWORKER_EPHEMERAL flag causes the background worker to be killed when the registrant's (sub)transaction ends. This eliminates the need to catch errors and explicitly invoke TerminateBackgroundWorker() in the error path. You can simply register an ephemeral worker, write code to do stuff with it, and then terminate it. If an error occurs part-way through, the worker will be terminated as part of the abort path.

Setting the BGWORKER_PRECIOUS flag causes the unexpected death of the worker to abort the registrant's current (sub)transaction. This eliminates the need to sprinkle the code with checks for a deceased worker. Instead, you can simply register a precious worker, and then just remember to CHECK_FOR_INTERRUPTS().

There were a couple of awkward cases here. First, all the existing stuff that hooks into ProcessInterrupts() makes provision for handling the ImmediateInterruptOK case. I felt that was superfluous here, so instead simply prohibited leaving a precious background worker running beyond the end of the statement. The point is to enable parallel computation, which will, I think, begin and end within the lifespan of one query. Second, odd things happen if the original backend launches precious workers and then begins a subtransaction.
Throwing an error in the subtransaction will not do the right thing; the subtransaction may easily be something like a PL/pgSQL exception block, and the error would be caught and result in unexpected behavior. So we don't. I just added a comment saying that if you do decide to start a subtransaction while you've got precious workers outstanding, you'd better insert an explicit check for whether they're still alive after unwinding the subtransaction (there's a function to make that easy). We could probably build a mechanism to allow an error to be thrown against a (sub)xact other than the innermost one, implicitly aborting everything below that with extreme prejudice, but it seems like overkill. I can't help but imagine that early versions of parallel-anything will include "starting subtransactions" on the list of activities which are prohibited in parallel mode.

Using the infrastructure provided by those patches, I was able to write some test code, attached as pingpong-v1.patch. You can make a backend fire up a background worker, and the two will take turns setting each other's latches for a number of iterations you specify. This could possibly be adapted into a regression test, if people think it's valuable, but for the moment I'm just including it as a demonstration of the functionality, not intended for commit. Rather gratifyingly, you can set a large iteration count and then interrupt or terminate the foreground process, or terminate the background process, and the other one goes away as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
Hi,

Finally I got the chance to put my hands on this code. Really sorry for the late reply.

On Fri, Sep 13, 2013 at 10:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Last week, I attempted to write some code to perform a trivial
> operation in parallel by launching background workers. Despite my
> earlier convictions that I'd built enough infrastructure to make this
> possible, the experiment turned out badly - so, patches!
>
> It's been pretty obvious to me from the beginning that any sort of
> parallel computation would need a way to make sure that if any worker
> dies, everybody dies. Conversely, if the parent aborts, all the
> workers should die. My thought was that this could be implemented
> using PG_TRY()/PG_CATCH() blocks over the existing infrastructure, but
> this turned out to be naive. If the original backend encounters an
> error before the child manages to start up, there's no good recovery
> strategy. The parent can't kill the child because it doesn't exist
> yet, and the parent can't stop the postmaster from starting the child
> later. The parent could try waiting until the child starts up and
> THEN killing it, but that's probably unreliable and surely
> mind-numbingly lame. The first attached patch,
> terminate-worker-v1.patch, rectifies this problem by providing a
> TerminateBackgroundWorker() function. When this function is invoked,
> the postmaster will become unwilling to restart the worker and will
> send it a SIGTERM if it's already running.

And here are some comments after reading the first patch... The patch looks good, creates no warnings and is not that invasive in the existing code.
Few comments about the code:

1) In postmaster.c, what about adding a comment here telling that we can forget about this bgworker, as it has already been requested for termination:

+        if (rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART
+            || rw->rw_terminate)

2) Trying to think about this patch as an independent feature, I think that it would have been nice to demonstrate the capacity of TerminateBackgroundWorker directly with an example in worker_spi to show a direct application of it, with for example the addition of a function like worker_spi_terminate. However this needs:
- either an additional API taking a PID as input, the pid being used to fetch a bgworker handle with which the worker is terminated afterwards. This is basically what I did in the patch attached when playing with your patch. You might find it useful... Or not! It introduces as well worker_spi_terminate, a small API scanning the array of bgworker slots. Not sure that this is much compatible with the concept of generation though...
- or to return not only the PID of the worker started dynamically in worker_spi_launch, but also the generation number of the worker, to be able to build a bgworker handle that could afterwards be used with TerminateBackgroundWorker.

Note that this comment is based on my personal experience with bgworkers, and I think that it is important to provide examples of what is implemented so that users are not put off the track, even if playing with bgworkers is high-level hacking. TerminateBackgroundWorker can offer users a more native way to kill a bgworker than pg_terminate_backend, for example, especially if they implement a bgworker structure of the launcher/workers type like what autovacuum does.

3) Documentation is needed, following comment 2).
> (It's important that the
> SIGTERM be sent by the postmaster, because if the original backend
> tries to send it, there's a race condition: the process might not be
> started at the time the original backend tries to send the signal, but
> the postmaster might start it before it sees the terminate request.)

That's interesting. Nothing more to say about that (perhaps because of my lack of knowledge of the code in this area).

> By itself, this is useful, but painful. The pain comes from the fact
> that all of the house-keeping is left to the programmer.

If this is necessary, so be it. But I think that newcomers to background workers would appreciate at least an example in worker_spi (using as a base what I provided in the patch attached) that they could refer to when beginning to implement a new feature. This is a thing that people could, and for sure would, refer to when implementing a complex thing using this infrastructure.

This is all I have about the 1st patch... It is already late here, so I'll have a look at the 2nd/3rd patches hopefully tomorrow or the day after.

Regards,
--
Michael
Attachments
On Fri, Oct 11, 2013 at 9:27 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Finally I got the chance to put my hands on this code. Really sorry
> for the late reply.

Thanks for the review. I'll respond to this in more detail later, but to make a long story short, I'm looking to apply terminate-worker-v1.patch (possibly with modifications after going over your review comments), but I'm not feeling too good any more about ephemeral-precious-v1.patch, because my experience with those facilities has so far proved unsatisfying. I think I'd like to withdraw the latter patch pending further study.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I briefly checked these patches. Let me add some comments.

* terminate-worker-v1.patch
TerminateBackgroundWorker() turns on the slot->terminate flag under an LW_SHARED lock. Is that reasonable, given who the possible callers are?

* ephemeral-precious-v1.patch
AtEOXact_BackgroundWorker() is located around the other AtEOXact_* routines. Doesn't that make resource management complicated? In the case where the main process goes into its error handler but the worker process is still running healthily, the worker may continue to calculate something and put its results on the shared memory segment, even though the main process has asked the postmaster to kill it.

All the ResourceOwnerRelease() callbacks are invoked prior to AtEOXact_BackgroundWorker(), so it is hard to release resources being used by a background worker: the workers keep running healthily until they receive the termination signal, which is sent later. In addition, it makes the implementation complicated if we need to design background workers to release resources if and when they are terminated. I don't think it is good coding style if we need to release resources in different locations depending on context. So, I'd like to propose adding a new invocation point for ResourceOwnerRelease() after all the AtEOXact_* jobs, with a new label, something like RESOURCE_RELEASE_FINAL.

In addition, AtEOXact_BackgroundWorker() does not synchronize with the termination of the background worker processes being killed. Of course it depends on the situation, but I think it is a good idea to wait for the worker processes to finish terminating, to ensure that the resources to be released have been handed back to the main process by the time the above ResourceOwnerRelease() does its job.

Thanks,

2013/10/11 Robert Haas <robertmhaas@gmail.com>:
> On Fri, Oct 11, 2013 at 9:27 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Finally I got the chance to put my hands on this code. Really sorry
>> for the late reply.
>
> Thanks for the review.
> I'll respond to this in more detail later, but
> to make a long story short, I'm looking to apply
> terminate-worker-v1.patch (possibly with modifications after going
> over your review comments), but I'm not feeling too good any more
> about ephemeral-precious-v1.patch, because my experience with those
> facilities has so far proved unsatisfying. I think I'd like to
> withdraw the latter patch pending further study.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

--
KaiGai Kohei <kaigai@kaigai.gr.jp>
On Fri, Oct 11, 2013 at 9:27 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Few comments about the code:
> 1) In postmaster.c, what about adding a comment here telling that we
> can forget about this bgworker as it has already been requested for a
> termination:
> +        if (rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART
> +            || rw->rw_terminate)

I think that'd be just repeating what the code already says.

> 2) Trying to think about this patch as an independent feature, I think
> that it would have been nice to demonstrate the capacity of
> TerminateBackgroundWorker directly with an example in worker_spi to
> show a direct application of it, with for example the addition of a
> function like worker_spi_terminate. However this needs:
> - either an additional API using as input the PID, pid used to fetch a
> bgworker handle terminating the worker afterwards. This is basically
> what I did in the patch attached when playing with your patch. You
> might find it useful... Or not! It introduces as well

I think this is kind of missing the point of this interface. If you've already got the PID, you can just kill the PID yourself. But suppose you've just registered a background worker in the parent process, and then there's an ERROR. You want the background worker to die, since you don't need it any more, but the postmaster might not have even started it yet, so there's no PID. That's the case in which this is really useful.

It's not a good match for what worker_spi needs, because in worker_spi, you're intending to start a process that you *expect* to outlive the user backend's transaction, and even the user backend's lifetime.

I agree we need better testing for this code, but the problem is that this is all pretty low-level plumbing. This patch set was prompted by problems that came up when I tried to build an application using this facility, and I don't know that this is the last patch that will be needed to solve all of those problems.
I'm working on another patch set that adds a shared-memory message queue through which messages can be pushed, and I think that will lend itself well to testing of both background workers and dynamic shared memory. I hope to have that ready in time for the next CommitFest.

> 3) Documentation is needed, following comment 2).

Good point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Oct 12, 2013 at 4:53 PM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
> I briefly checked these patches. Let me add some comments.

Thanks!

> * terminate-worker-v1.patch
> TerminateBackgroundWorker() turns on slot->terminate flag under
> LW_SHARED lock. Is it reasonable?

Hmm. It's probably harmless the way it is, just because if two processes do that at the same time, it's not going to change the outcome. But it might be better to use LW_EXCLUSIVE just to make it easy to reason about the logic. I think I cut-and-pasted without thinking carefully about that.

> * ephemeral-precious-v1.patch
> AtEOXact_BackgroundWorker() is located around other AtEOXact_*
> routines. Doesn't it make resource management complicated?
> In case when main process goes into error handler but worker
> process is still running in health, it may continue to calculate
> something and put its results on shared memory segment, even
> though main process suggest postmaster to kill it.

Since I wrote this patch set, I've been thinking a lot more about error recovery. Obviously, one of the big problems as we think about parallel query is that you've now got multiple backends floating around, and if the transaction aborts (in any backend), the other backends don't automatically know that; they need some way to know that they, too, should abort processing. There are a lot of details to get right here, and the time I've spent on it so far convinces me that the problem is anything but easy.

Having said that, I'm not too concerned about the particular issue that you raise here. The resources that need to be cleaned up during transaction abort are backend-private resources. If, for example, the user backend detaches a dynamic shared memory segment that is being used for a parallel computation, they're not actually *destroying* the segment; they are just detaching it *from their address space*.
The last process to detach it will also destroy it. So the ordering in which the various processes detach it doesn't matter much.

One of the things I think is necessary is a set of on_dsm_detach callbacks that work pretty much the way that on_shmem_exit callbacks work today. Just as we can't detach from the main shared memory segment without releasing locks and buffer pins and lwlocks and our PGXACT, we can't detach from a dynamic shared memory segment without performing any similar cleanup that is needed. I'm currently working on a patch for that.

> All the ResourceOwnerRelease() callbacks are located prior to
> AtEOXact_BackgroundWorker(), it is hard to release resources
> being in use by background worker, because they are healthy
> running until it receives termination signal, but sent later.
> In addition, it makes implementation complicated if we need to
> design background workers to release resources if and when it
> is terminated. I don't think it is a good coding style, if we need
> to release resources in different location depending on context.

Which specific resources are you concerned about?

> So, I'd like to propose to add a new invocation point of
> ResourceOwnerRelease() after all AtEOXact_* jobs, with
> new label something like RESOURCE_RELEASE_FINAL.
>
> In addition, AtEOXact_BackgroundWorker() does not synchronize
> termination of background worker processes being killed.
> Of course it depends on situation, I think it is good idea to wait
> for completion of worker processes to be terminated, to ensure
> resource to be released is backed to the main process if above
> ResourceOwnerRelease() do the job.

Again, which resources are we talking about here? I tend to think it's an essential property of the system that we *shouldn't* have to care about the order in which processes are terminated.
First, that will be difficult to control; if an ERROR or FATAL condition has occurred and we need to terminate, then there are real limits to what guarantees we can provide after that point. Second, it's also *expensive*. The point of parallelism is to make things faster; any steps we add that involve waiting for other processes to do things will eat away at the available gains. For a query that'll run for an hour that hardly matters, but for short queries it's important to avoid unnecessary overhead.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
2013/10/14 Robert Haas <robertmhaas@gmail.com>:
>> * ephemeral-precious-v1.patch
>> AtEOXact_BackgroundWorker() is located around other AtEOXact_*
>> routines. Doesn't it make resource management complicated?
>> In case when main process goes into error handler but worker
>> process is still running in health, it may continue to calculate
>> something and put its results on shared memory segment, even
>> though main process suggest postmaster to kill it.
>
> Since I wrote this patch set, I've been thinking a lot more about
> error recovery. Obviously, one of the big problems as we think about
> parallel query is that you've now got multiple backends floating
> around, and if the transaction aborts (in any backend), the other
> backends don't automatically know that; they need some way to know
> that they, too, should abort processing. There are a lot of details to
> get right here, and the time I've spent on it so far convinces me that
> the problem is anything but easy.
>
> Having said that, I'm not too concerned about the particular issue
> that you raise here. The resources that need to be cleaned up during
> transaction abort are backend-private resources. If, for example, the
> user backend detaches a dynamic shared memory segment that is being
> used for a parallel computation, they're not actually *destroying* the
> segment; they are just detaching it *from their address space*. The
> last process to detach it will also destroy it. So the ordering in
> which the various processes detach it doesn't matter much.
>
> One of the things I think is necessary is a set of on_dsm_detach
> callbacks that work pretty much the way that on_shmem_exit callbacks
> work today. Just as we can't detach from the main shared memory
> segment without releasing locks and buffer pins and lwlocks and our
> PGXACT, we can't detach from a dynamic shared memory segment without
> performing any similar cleanup that is needed. I'm currently working
> on a patch for that.

Hmm.
That would probably allow cleaning up a smaller fraction of the data structures constructed on a dynamic shared memory segment, if we map/unmap it for each transaction.

>> All the ResourceOwnerRelease() callbacks are located prior to
>> AtEOXact_BackgroundWorker(), it is hard to release resources
>> being in use by background worker, because they are healthy
>> running until it receives termination signal, but sent later.
>> In addition, it makes implementation complicated if we need to
>> design background workers to release resources if and when it
>> is terminated. I don't think it is a good coding style, if we need
>> to release resources in different location depending on context.
>
> Which specific resources are you concerned about?

Because of my own use case, I was assuming smaller chunks allocated on a static or dynamic shared memory segment, used to communicate between the main process and worker processes. When we move a chunk of data to a co-processor using an asynchronous DMA transfer, the API requires that the source buffer be mlock()'ed to avoid unintentional swap-out during the DMA transfer. On the other hand, the cost of the mlock() operation is not negligible, so it may be a reasonable design to lock a shared memory segment at start-up time and then continue to use it, without unmapping. So, I wondered how to handle the situation where an extension tries to manage a resource with smaller granularity than the one managed by the PostgreSQL core.

>> So, I'd like to propose to add a new invocation point of
>> ResourceOwnerRelease() after all AtEOXact_* jobs, with
>> new label something like RESOURCE_RELEASE_FINAL.
>>
>> In addition, AtEOXact_BackgroundWorker() does not synchronize
>> termination of background worker processes being killed.
>> Of course it depends on situation, I think it is good idea to wait
>> for completion of worker processes to be terminated, to ensure
>> resource to be released is backed to the main process if above
>> ResourceOwnerRelease() do the job.
> Again, which resources are we talking about here? I tend to think
> it's an essential property of the system that we *shouldn't* have to
> care about the order in which processes are terminated. First, that
> will be difficult to control; if an ERROR or FATAL condition has
> occurred and we need to terminate, then there are real limits to what
> guarantees we can provide after that point. Second, it's also
> *expensive*. The point of parallelism is to make things faster; any
> steps we add that involve waiting for other processes to do things
> will eat away at the available gains. For a query that'll run for an
> hour that hardly matters, but for short queries it's important to
> avoid unnecessary overhead.

Indeed, you are right. The error path has to terminate quickly. If I implement this, the ResourceOwnerRelease() callback will probably need some way to inform a worker process that is still running normally that the transaction was aborted, and thus that there is no need to return its calculation results.

Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>
On Tue, Oct 15, 2013 at 4:02 PM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
> Hmm. It probably allows to clean-up smaller fraction of data structure
> constructed on dynamic shared memory segment, if we map / unmap
> for each transactions.

I think the primary use of dynamic shared memory will be for segments that get created for a single transaction, used, and then destroyed. Perhaps people will find other uses for them that involve keeping them mapped for longer periods of time, and that is fine. But whether long or short, it seems clear to me that shared memory data structures will need cleanup on detach, just as backends must clean up the main shared memory segment before they detach.

> I assumed smaller chunks allocated on static or dynamic shared
> memory segment to be used for communicate between main
> process and worker processes because of my motivation.
> When we move a chunk of data to co-processor using asynchronous
> DMA transfer, API requires the source buffer is mlock()'ed to avoid
> unintentional swap out during DMA transfer. On the other hand,
> cost of mlock() operation is not ignorable, so it may be a reasonable
> design to lock a shared memory segment on start-up time then
> continue to use it, without unmapping.
> So, I wondered how to handle the situation when extension tries
> to manage a resource with smaller granularity than the one
> managed by PostgreSQL core.

I'm still not sure exactly what your concern is here, but I agree there are some thorny issues around resource management here. I've spent a lot of the last couple months trying to sort through them. As it seems to me, the problem is basically that we're introducing major new types of resources (background workers, dynamic shared memory segments, and perhaps other things) and they all require management - just as our existing types of resources (locks, buffer pins, etc.) require management.
But the code to manage existing resources has been around for so long that we just rely on it without thinking about it, so when we suddenly DO need to think about it, it's an unpleasant surprise.

As far as shared memory resources specifically, one consideration is that some systems (e.g. some versions of HP-UX) have fairly small limits (< 15) on the number of shared memory segments that can be mapped, and all 32-bit systems are prone to running out of usable address space. So portable code is going to have to use these resources judiciously. On the other hand, I think that people wanting to write code that only needs to run on 64-bit Linux will be able to pretty much go nuts.

Unless there are further objections to the terminate patch as written, I'm going to go ahead and commit that, with the addition of documentation (as pointed out by Michael) and a change to the lock mode (as pointed out by KaiGai). The other patches, at least for the time being, are withdrawn.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company