Обсуждение: Race condition in WaitForBackgroundWorkerStartup

Поиск
Список
Период
Сортировка

Race condition in WaitForBackgroundWorkerStartup

От
Jeremy Finzel
Дата:
I believe I found a race condition in WaitForBackgroundWorkerStartup in the case where it encounters an ERROR during startup.  I found that depending on the speed of the system, it will unreliably return either status BGWH_STOPPED or BGWH_STARTED.  But I can reliably reproduce getting BGWH_STOPPED by tweaking the worker_spi.c test module.

On my own system running 11.1 (or any other version of pg actually), it returns BGWH_STOPPED and thus a hard error message (ERROR: could not start background process).  But for other colleagues, it returns BGWH_STARTED and thus the client sees the pid that was launched.  One then will see an error in the server logs only as the process exits.

Here is the relevant section of worker_spi.c:395-398:
if (!RegisterDynamicBackgroundWorker(&worker, &handle))
PG_RETURN_NULL();

status = WaitForBackgroundWorkerStartup(handle, &pid);

First, I hacked the SQL in the worker_spi_main module to be invalid.  Then I see one or the other behavior (pid result or ERROR) depending on user.

Then I added an arbitrary sleep before the WaitForBackgroundWorkerStartup call, and reliably, it will always shows an ERROR message.

I'm not sure if this is substantial or not, but it's causing me a problem where I am regression testing an invalid background worker launch and can't trust a reliable output.

This was my original post: https://www.postgresql.org/message-id/CAMa1XUhFap+AibpAHSkjRwN4cd9o8KYghWtG99JNofrEDzsAGw@mail.gmail.com

Now that I figured out the issue, and that it's unrelated to my extension, I thought it warranted to start a separate thread.  I am not sure how to solve this issue best.

Thanks!
Jeremy

Re: Race condition in WaitForBackgroundWorkerStartup

От
Amit Kapila
Дата:
On Mon, Nov 12, 2018 at 11:55 PM Jeremy Finzel <finzelj@gmail.com> wrote:
>
> I believe I found a race condition in WaitForBackgroundWorkerStartup in the case where it encounters an ERROR during
startup. I found that depending on the speed of the system, it will unreliably return either status BGWH_STOPPED or
BGWH_STARTED. But I can reliably reproduce getting BGWH_STOPPED by tweaking the worker_spi.c test module. 
>

Yeah, I think it is possible that you get different values in such
cases because we consider worker status as started after we have
launched the worker.  Now, if we get the error in the worker
initialization, then the user can get any of those values.  I think
this is what is happening in your case where you are saying "ERROR
during startup".
Am I missing something?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Race condition in WaitForBackgroundWorkerStartup

От
Jeremy Finzel
Дата:


On Tue, Nov 13, 2018 at 6:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Nov 12, 2018 at 11:55 PM Jeremy Finzel <finzelj@gmail.com> wrote:
>
> I believe I found a race condition in WaitForBackgroundWorkerStartup in the case where it encounters an ERROR during startup.  I found that depending on the speed of the system, it will unreliably return either status BGWH_STOPPED or BGWH_STARTED.  But I can reliably reproduce getting BGWH_STOPPED by tweaking the worker_spi.c test module.
>

Yeah, I think it is possible that you get different values in such
cases because we consider worker status as started after we have
launched the worker.  Now, if we get the error in the worker
initialization, then the user can get any of those values.  I think
this is what is happening in your case where you are saying "ERROR
during startup".
Am I missing something?

Perhaps.  What I am saying is that some machines show ERROR during startup, and some machines don't get an error at all, return successfully, then immediately error and die in the background, but the client is not shown this.  The behavior isn't predictable.  However, I can get a predictable ERROR to happen always if I put a short pause before WaitForBackgroundWorkerStartup.

I'm unclear on what counts as "worker initialization".  The error is happening in the worker_spi_main function, not in the worker_spi_launch function.  So does an immediate error in worker_spi_main count as part of the worker initialization?

Thanks!
Jeremy

Re: Race condition in WaitForBackgroundWorkerStartup

От
Tomas Vondra
Дата:
On Tue, 2018-11-13 at 07:57 -0600, Jeremy Finzel wrote:
> ...
>
> I'm unclear on what counts as "worker initialization".  The error is
> happening in the worker_spi_main function, not in the
> worker_spi_launch function.  So does an immediate error in
> worker_spi_main count as part of the worker initialization?
> 

I don't think it does (or should). worker_spi_main is pretty much the
worker body, and worker initialization pretty much means just "we've
initialized the worker process (including tracking it in shmem etc.)
and invoked it's main function".

I'd also argue the behavior is expected to depend on the machine to
some extent - depending on speed and load the machine may hit the error
before/after the GetBackgroundWorkerPid call, producing different
results.

So I'd say the behavior seems correct, at least from this POV.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Race condition in WaitForBackgroundWorkerStartup

От
Amit Kapila
Дата:
On Tue, Nov 13, 2018 at 11:48 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, 2018-11-13 at 07:57 -0600, Jeremy Finzel wrote:
> > ...
> >
> > I'm unclear on what counts as "worker initialization".  The error is
> > happening in the worker_spi_main function, not in the
> > worker_spi_launch function.  So does an immediate error in
> > worker_spi_main count as part of the worker initialization?
> >

Yeah, sort of.   Any error you get there can lead to the results you
are seeing.  If you want a more robust mechanism, you need to write
something like what we have done for parallel workers (See
WaitForParallelWorkersToAttach.).  Basically, you need your workers to
reach a specific state before which if there is any error, the main
backend should error out.

>
> I don't think it does (or should). worker_spi_main is pretty much the
> worker body, and worker initialization pretty much means just "we've
> initialized the worker process (including tracking it in shmem etc.)
> and invoked it's main function".
>
> I'd also argue the behavior is expected to depend on the machine to
> some extent - depending on speed and load the machine may hit the error
> before/after the GetBackgroundWorkerPid call, producing different
> results.
>
> So I'd say the behavior seems correct, at least from this POV.
>

+1.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com