Thread: ExecGather() + nworkers


ExecGather() + nworkers

From:
Peter Geoghegan
Date:
The Gather node executor function ExecGather() does this:
    /*
     * Register backend workers. We might not get as many as we
     * requested, or indeed any at all.
     */
    pcxt = node->pei->pcxt;
    LaunchParallelWorkers(pcxt);

    /* Set up tuple queue readers to read the results. */
    if (pcxt->nworkers > 0)
    {
        ...
    }

    /* No workers?  Then never mind. */
    if (!got_any_worker)
        ExecShutdownGatherWorkers(node);

I'm not sure why the test for nworkers following the
LaunchParallelWorkers() call doesn't look like this, though:
    /* Set up tuple queue readers to read the results. */
    if (pcxt->nworkers_launched > 0)
    {
        ...
    }

ISTM, having followed the chain of calls, that the "if" statement I
highlight here is currently tautological (excluding one or two special
cases in the CreateParallelContext() call performed by
ExecInitParallelPlan(); e.g., the IsolationIsSerializable() case *can*
result in the caller's nworkers being overridden to 0).

I guess it isn't surprising that the code evolved to look like this,
since the commit message of b0b0d84b ponders: "I suspect we'll want to
revise the Gather node to make use of this new capability [relaunching
workers], but even if not it may be useful elsewhere and requires very
little additional code". The nworkers_launched tracking came only in
this later commit.

From my point of view, as a student of this code working on parallel
index builds (i.e., new code which will also be a client of parallel.c,
a module which right now only has nodeGather.c as a client to learn
from), this is confusing. It's confusing just because the simpler
approach isn't taken when it could have been. It isn't actually wrong
that we figure out if any workers were launched in this relatively
laborious way:
    /* Set up tuple queue readers to read the results. */
    if (pcxt->nworkers > 0)
    {
        ...
        for (i = 0; i < pcxt->nworkers; ++i)
        {
            if (pcxt->worker[i].bgwhandle == NULL)
                continue;
            ...
            nworkers_launched = true;
        ...

But going to this additional trouble (detecting no workers launched on
the basis of !nworkers_launched) suggests that simply testing
nworkers_launched would be wrong, which AFAICT it isn't. Can't we just
do that, and in so doing also totally remove the "for" loop shown
here?
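To make the equivalence concrete, here is a self-contained toy model (not PostgreSQL code; the struct and field names merely mimic ParallelContext) showing that the array scan and a simple test of the launch counter answer the same question:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins mimicking ParallelContext; illustrative only. */
typedef struct ToyWorker { void *bgwhandle; } ToyWorker;
typedef struct ToyParallelContext
{
    int       nworkers;           /* workers requested */
    int       nworkers_launched;  /* workers actually launched */
    ToyWorker worker[8];
} ToyParallelContext;

/* The "laborious" test: scan for any live background worker handle. */
static int
got_any_worker(const ToyParallelContext *pcxt)
{
    for (int i = 0; i < pcxt->nworkers; ++i)
        if (pcxt->worker[i].bgwhandle != NULL)
            return 1;
    return 0;
}

/* The proposed simpler test: trust the counter parallel.c maintains. */
static int
any_worker_launched(const ToyParallelContext *pcxt)
{
    return pcxt->nworkers_launched > 0;
}
```

Provided the counter is kept in sync with the handle array (which LaunchParallelWorkers() does), both tests must agree.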

In the case of parallel sequential scan, it looks like one worker can
be helpful, because then the gather node (leader process) can run the
plan itself to some degree, and so there are effectively 2 processes
scanning at a minimum (unless 0 workers could be allocated to begin
with). How useful is it to have a parallel scan when this happens,
though?

I guess it isn't obvious to me how to reliably back out of not being
able to launch at least 2 workers in the case of my parallel index
build patch, because I suspect 2 workers (plus the leader process) are
the minimum number that will make index builds faster. Right now, it
looks like I'll have to check nworkers_launched in the leader (which
will be the only process to access the ParallelContext, since it's in
its local memory). Then, having established that there are at least
the minimum useful number of worker processes launched for sorting,
the leader can "permit" worker processes to "really" start based on
changing some state in the TOC/segment in common use. Otherwise, the
leader must call the whole thing off and do a conventional, serial
index build, even though technically the main worker process function
has started execution in worker processes.

I think what might be better is a general solution to my problem,
which I imagine will crop up again and again as new clients are added.
I would like an API that lets callers of LaunchParallelWorkers() only
actually launch *any* worker on the basis of having been able to
launch some minimum sensible number (typically 2). Otherwise, indicate
failure, allowing callers to call the whole thing off in a general
way, without the postmaster having actually launched anything, and
without custom "call it all off" code for parallel index builds. This
would probably involve introducing a distinction between a
BackgroundWorkerSlot being "reserved" rather than "in_use", lest the
postmaster accidentally launch 1 worker process before we established
definitively that launching any is really a good idea.

Does that make sense?

Thanks
-- 
Peter Geoghegan



Re: ExecGather() + nworkers

From:
Robert Haas
Date:
On Sun, Jan 10, 2016 at 12:29 AM, Peter Geoghegan <pg@heroku.com> wrote:
> The Gather node executor function ExecGather() does this:
> [ code ]
> I'm not sure why the test for nworkers following the
> LaunchParallelWorkers() call doesn't look like this, though:
>
>     /* Set up tuple queue readers to read the results. */
>     if (pcxt->nworkers_launched > 0)
>     {
>         ...
>     }

Hmm, yeah, I guess it could do that.

> But going to this additional trouble (detecting no workers launched on
> the basis of !nworkers_launched) suggests that simply testing
> nworkers_launched would be wrong, which AFAICT it isn't. Can't we just
> do that, and in so doing also totally remove the "for" loop shown
> here?

I don't see how the for loop goes away.

> In the case of parallel sequential scan, it looks like one worker can
> be helpful, because then the gather node (leader process) can run the
> plan itself to some degree, and so there are effectively 2 processes
> scanning at a minimum (unless 0 workers could be allocated to begin
> with). How useful is it to have a parallel scan when this happens,
> though?

Empirically, that's really quite useful.  When you have 3 or 4
workers, the leader really doesn't make a significant contribution to
the work, but what I've seen in my testing is that 1 worker often runs
almost twice as fast as 0 workers.

> I guess it isn't obvious to me how to reliably back out of not being
> able to launch at least 2 workers in the case of my parallel index
> build patch, because I suspect 2 workers (plus the leader process) are
> the minimum number that will make index builds faster. Right now, it
> looks like I'll have to check nworkers_launched in the leader (which
> will be the only process to access the ParallelContext, since it's in
> its local memory). Then, having established that there are at least
> the minimum useful number of worker processes launched for sorting,
> the leader can "permit" worker processes to "really" start based on
> changing some state in the TOC/segment in common use. Otherwise, the
> leader must call the whole thing off and do a conventional, serial
> index build, even though technically the main worker process function
> has started execution in worker processes.

I don't really understand why this should be so.  I thought the idea
of parallel sort is (roughly) that each worker should read data until
it fills work_mem, sort that data, and write a tape.  Repeat until no
data remains.  Then, merge the tapes.  I don't see any reason at all
why this shouldn't work just fine with a leader and 1 worker.
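A minimal, in-memory model of that scheme (sorted runs standing in for tapes, an array standing in for the input, a linear scan over run heads standing in for the final merge; none of this is PostgreSQL's actual tuplesort code):

```c
#include <assert.h>
#include <stdlib.h>

static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;

    return (x > y) - (x < y);
}

/*
 * Phase 1: read input in work_mem-sized chunks, sort each chunk, and
 * "write a tape" (here, just remember where each sorted run lives).
 * Phase 2: merge the runs by repeatedly taking the smallest run head.
 */
static void
external_merge_sort(int *in, int n, int work_mem, int *out)
{
    int nruns = (n + work_mem - 1) / work_mem;
    int *head = malloc(nruns * sizeof(int)); /* next unread index per run */
    int *tail = malloc(nruns * sizeof(int)); /* end index per run */

    for (int r = 0; r < nruns; r++)
    {
        head[r] = r * work_mem;
        tail[r] = head[r] + work_mem < n ? head[r] + work_mem : n;
        qsort(in + head[r], tail[r] - head[r], sizeof(int), cmp_int);
    }

    for (int k = 0; k < n; k++)
    {
        int best = -1;

        for (int r = 0; r < nruns; r++)
            if (head[r] < tail[r] &&
                (best < 0 || in[head[r]] < in[head[best]]))
                best = r;
        out[k] = in[head[best]++];
    }
    free(head);
    free(tail);
}
```

The point of the structure is that phase 1 parallelizes trivially (each process owns some runs) while phase 2 is indifferent to how many processes produced the runs, including just a leader plus one worker.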

> I think what might be better is a general solution to my problem,
> which I imagine will crop up again and again as new clients are added.
> I would like an API that lets callers of LaunchParallelWorkers() only
> actually launch *any* worker on the basis of having been able to
> launch some minimum sensible number (typically 2). Otherwise, indicate
> failure, allowing callers to call the whole thing off in a general
> way, without the postmaster having actually launched anything, and
> without custom "call it all off" code for parallel index builds. This
> would probably involve introducing a distinction between a
> BackgroundWorkerSlot being "reserved" rather than "in_use", lest the
> postmaster accidentally launch 1 worker process before we established
> definitively that launching any is really a good idea.

I think that's probably over-engineered.  I mean, it wouldn't be that
hard to have the workers just exit if you decide you don't want them,
and I don't really want to make the signaling here more complicated
than it really needs to be.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ExecGather() + nworkers

From:
Peter Geoghegan
Date:
On Sun, Jan 10, 2016 at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm not sure why the test for nworkers following the
>> LaunchParallelWorkers() call doesn't look like this, though:
>>
>>     /* Set up tuple queue readers to read the results. */
>>     if (pcxt->nworkers_launched > 0)
>>     {
>>         ...
>>     }
>
> Hmm, yeah, I guess it could do that.

That would make it clearer as an example.

>> But going to this additional trouble (detecting no workers launched on
>> the basis of !nworkers_launched) suggests that simply testing
>> nworkers_launched would be wrong, which AFAICT it isn't. Can't we just
>> do that, and in so doing also totally remove the "for" loop shown
>> here?
>
> I don't see how the for loop goes away.

I meant that some of the code in the "for" loop goes away -- not all
of it, just the more obscure parts. As I said, I'm mostly pointing
this out in the interest of making it clearer as example code.

>> In the case of parallel sequential scan, it looks like one worker can
>> be helpful, because then the gather node (leader process) can run the
>> plan itself to some degree, and so there are effectively 2 processes
>> scanning at a minimum (unless 0 workers could be allocated to begin
>> with). How useful is it to have a parallel scan when this happens,
>> though?
>
> Empirically, that's really quite useful.  When you have 3 or 4
> workers, the leader really doesn't make a significant contribution to
> the work, but what I've seen in my testing is that 1 worker often runs
> almost twice as fast as 0 workers.

I suppose that makes sense, given that parallel sequential scan works
best when most tuples are eliminated in workers; there ought not to be
many tuples filling the single worker's queue anyway.

> I don't really understand why this should be so.  I thought the idea
> of parallel sort is (roughly) that each worker should read data until
> it fills work_mem, sort that data, and write a tape.  Repeat until no
> data remains.  Then, merge the tapes.  I don't see any reason at all
> why this shouldn't work just fine with a leader and 1 worker.

It will work fine with a leader and 1 worker -- the code will be
correct, and without any special cases. But it will be a suboptimal
use of resources. From the caller's point of view, there is no reason
to think it will be faster, and some reason to think it will be
slower. A particular concern for parallel sort is that the sort might
not use enough memory to need to be an external sort, but you
effectively force it to be one by making it a parallel sort (that is
not ideal in the long term, but it's a good compromise for 9.6's
parallel index build stuff). You're also consuming a
BackgroundWorkerSlot for the duration of the sort, in an environment
where evidently those are in short supply.

Now, you might wonder why it is that the leader cannot also sort runs,
just as a worker would. It's possible, but it isn't exactly
straightforward. You have to have special cases in several places,
even though it probably is going to be uncommon to only have one
BackgroundWorkerSlot available in practice. It's simpler to just
opt-out, and seems better given that max_parallel_degree is a way of
resource limiting based on available cores (it's certainly not about
the availability of shared memory for the BackgroundWorkerSlot array).

More importantly, I have other, entirely general concerns. Other major
RDBMSs have settings that are very similar to max_parallel_degree,
with a setting of 1 effectively disabling all parallelism. Both Oracle
and SQL Server have settings that they both call the "maximum degree
of parallelism". I think it's a bit odd that with Postgres,
max_parallel_degree = 1 can still use parallelism at all. I have to
wonder: are we conflating controlling the resources used by parallel
operations with how shared memory is doled out?

I could actually "give back" my parallel worker slots early if I
really wanted to (that would be messy, but the then-quiesced workers
do nothing for the duration of the merge beyond conceptually owning
the shared tape temp files). I don't think releasing the slots early
makes sense, because I tend to think that hanging on to the workers
helps the DBA in managing the server's resources. The still-serial
merge phase is likely to become a big bottleneck with parallel sort.

With parallel sequential scan, a max_parallel_degree of 8 could result
in 16 processes scanning in parallel. That's a concern, and not least
because it happens only sometimes, when things are timed just right.
The fact that only half of those processes are "genuine" workers seems
to me like a distinction without a difference.

> I think that's probably over-engineered.  I mean, it wouldn't be that
> hard to have the workers just exit if you decide you don't want them,
> and I don't really want to make the signaling here more complicated
> than it really needs to be.

I worry about the additional overhead of constantly starting and
stopping a single worker in some cases (not so much with parallel
index build, but other uses of parallel sort beyond 9.6). Furthermore,
the coordination between worker and leader processes to make this
happen seems messy -- you actually have the postmaster launch
processes, but they must immediately get permission to do anything.

It wouldn't be that hard to offer a general way of doing this, so why not?
-- 
Peter Geoghegan



Re: ExecGather() + nworkers

From:
Peter Geoghegan
Date:
On Sun, Jan 10, 2016 at 1:44 PM, Peter Geoghegan <pg@heroku.com> wrote:
> With parallel sequential scan, a max_parallel_degree of 8 could result
> in 16 processes scanning in parallel.

I meant a max_worker_processes setting, which of course is different.
Nevertheless, I find it surprising that max_parallel_degree = 1 does
not prevent parallel operations, and that max_parallel_degree is
defined in terms of the availability of worker processes (in the
strict sense of worker processes that are launched by
LaunchParallelWorkers(), and not a broader and more practical
definition).

-- 
Peter Geoghegan



Re: ExecGather() + nworkers

From:
Robert Haas
Date:
On Sun, Jan 10, 2016 at 4:44 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I don't really understand why this should be so.  I thought the idea
>> of parallel sort is (roughly) that each worker should read data until
>> it fills work_mem, sort that data, and write a tape.  Repeat until no
>> data remains.  Then, merge the tapes.  I don't see any reason at all
>> why this shouldn't work just fine with a leader and 1 worker.
>
> It will work fine with a leader and 1 worker -- the code will be
> correct, and without any special cases. But it will be a suboptimal
> use of resources. From the caller's point of view, there is no reason
> to think it will be faster, and some reason to think it will be
> slower. A particular concern for parallel sort is that the sort might
> not use enough memory to need to be an external sort, but you
> effectively force it to be one by making it a parallel sort (that is
> not ideal in the long term, but it's a good compromise for 9.6's
> parallel index build stuff). You're also consuming a
> BackgroundWorkerSlot for the duration of the sort, in an environment
> where evidently those are in short supply.

Well, in general, the parallel sort code doesn't really get to pick
whether or not a BackgroundWorkerSlot gets used or not.  Whoever
created the parallel context decides how many workers to request, and
then the context gets as many of those as it can.  It then does
arbitrary computation, which at some point in the middle involves one
or more parallel sorts.  You can't just have one of those workers up
and exit in the middle.  Now, in the specific case of parallel index
build, you probably can do that, if you want to.  But to be honest,
I'd be inclined not to include that in the first version.  If you get
fewer workers than you asked for, just use the number you got.  Let's
see how that actually works out before we decide that we need a lot
more mechanism here.  You may find that it's surprisingly effective to
do it this way.

> Now, you might wonder why it is that the leader cannot also sort runs,
> just as a worker would. It's possible, but it isn't exactly
> straightforward. You have to have special cases in several places,
> even though it probably is going to be uncommon to only have one
> BackgroundWorkerSlot available in practice. It's simpler to just
> opt-out, and seems better given that max_parallel_degree is a way of
> resource limiting based on available cores (it's certainly not about
> the availability of shared memory for the BackgroundWorkerSlot array).

I am surprised that this is not straightforward.  I don't see why it
shouldn't be, and it worries me that you think it isn't.

> More importantly, I have other, entirely general concerns. Other major
> RDBMSs have settings that are very similar to max_parallel_degree,
> with a setting of 1 effectively disabling all parallelism. Both Oracle
> and SQL Server have settings that they both call the "maximum degree
> of parallelism". I think it's a bit odd that with Postgres,
> max_parallel_degree = 1 can still use parallelism at all. I have to
> wonder: are we conflating controlling the resources used by parallel
> operations with how shared memory is doled out?

We could redefine things so that max_parallel_degree = N means use N
- 1 workers, with a minimum value of 1 rather than 0, if there's a
consensus that that's better.  Personally, I prefer it the way we've
got it: it's real darned clear in my mind that max_parallel_degree=0
means "not parallel".  But I won't cry into my beer if a consensus
emerges that the other way would be better.

> I could actually "give back" my parallel worker slots early if I
> really wanted to (that would be messy, but the then-quiesced workers
> do nothing for the duration of the merge beyond conceptually owning
> the shared tape temp files). I don't think releasing the slots early
> makes sense, because I tend to think that hanging on to the workers
> helps the DBA in managing the server's resources. The still-serial
> merge phase is likely to become a big bottleneck with parallel sort.

Like I say, the sort code better not know anything about this
directly, or it's going to break when embedded in a query.

> With parallel sequential scan, a max_parallel_degree of 8 could result
> in 16 processes scanning in parallel. That's a concern, and not least
> because it happens only sometimes, when things are timed just right.
> The fact that only half of those processes are "genuine" workers seems
> to me like a distinction without a difference.

This seems dead wrong.  A max_parallel_degree of 8 means you have a
leader and 8 workers.  Where are the other 7 processes coming from?
What you should have is 8 processes each of which is participating in
both the parallel seq scan and the parallel sort, not 8 processes
scanning and 8 separate processes sorting.

>> I think that's probably over-engineered.  I mean, it wouldn't be that
>> hard to have the workers just exit if you decide you don't want them,
>> and I don't really want to make the signaling here more complicated
>> than it really needs to be.
>
> I worry about the additional overhead of constantly starting and
> stopping a single worker in some cases (not so much with parallel
> index build, but other uses of parallel sort beyond 9.6). Furthermore,
> the coordination between worker and leader processes to make this
> happen seems messy -- you actually have the postmaster launch
> processes, but they must immediately get permission to do anything.
>
> It wouldn't be that hard to offer a general way of doing this, so why not?

Well, if these things become actual problems, fine, we can fix them.
But let's not decide to add the API before we're agreed that we need
it to solve an actual problem that we both agree we have.  We are not
there yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ExecGather() + nworkers

From:
Amit Kapila
Date:
On Mon, Jan 11, 2016 at 3:14 AM, Peter Geoghegan <pg@heroku.com> wrote:
>
> On Sun, Jan 10, 2016 at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> I'm not sure why the test for nworkers following the
> >> LaunchParallelWorkers() call doesn't look like this, though:
> >>
> >>     /* Set up tuple queue readers to read the results. */
> >>     if (pcxt->nworkers_launched > 0)
> >>     {
> >>         ...
> >>     }
> >
> > Hmm, yeah, I guess it could do that.
>
> That would make it clearer as an example.
>
> >> But going to this additional trouble (detecting no workers launched on
> >> the basis of !nworkers_launched) suggests that simply testing
> >> nworkers_launched would be wrong, which AFAICT it isn't. Can't we just
> >> do that, and in so doing also totally remove the "for" loop shown
> >> here?
> >
> > I don't see how the for loop goes away.
>
> I meant that some code in the "for" loop goes away. Not all of it.
> Just the more obscure code. As I said, I'm mostly pointing this out
> out of concern for making it clearer as example code.
>

Right, I can write a patch to do it the way you are suggesting, if
you are not planning to do it yourself.
 
>
> >> In the case of parallel sequential scan, it looks like one worker can
> >> be helpful, because then the gather node (leader process) can run the
> >> plan itself to some degree, and so there are effectively 2 processes
> >> scanning at a minimum (unless 0 workers could be allocated to begin
> >> with). How useful is it to have a parallel scan when this happens,
> >> though?
> >
> > Empirically, that's really quite useful.  When you have 3 or 4
> > workers, the leader really doesn't make a significant contribution to
> > the work, but what I've seen in my testing is that 1 worker often runs
> > almost twice as fast as 0 workers.
>
> I suppose that makes sense, given that parallel sequential scan works
> best when most tuples are eliminated in workers; there ought not to be
> many tuples filling the single worker's queue anyway.
>
> > I don't really understand why this should be so.  I thought the idea
> > of parallel sort is (roughly) that each worker should read data until
> > it fills work_mem, sort that data, and write a tape.  Repeat until no
> > data remains.  Then, merge the tapes.  I don't see any reason at all
> > why this shouldn't work just fine with a leader and 1 worker.
>
> It will work fine with a leader and 1 worker -- the code will be
> correct, and without any special cases. But it will be a suboptimal
> use of resources. From the caller's point of view, there is no reason
> to think it will be faster, and some reason to think it will be
> slower. A particular concern for parallel sort is that the sort might
> not use enough memory to need to be an external sort, but you
> effectively force it to be one by making it a parallel sort (that is
> not ideal in the long term, but it's a good compromise for 9.6's
> parallel index build stuff). You're also consuming a
> BackgroundWorkerSlot for the duration of the sort, in an environment
> where evidently those are in short supply.
>
> Now, you might wonder why it is that the leader cannot also sort runs,
> just as a worker would. It's possible, but it isn't exactly
> straightforward. You have to have special cases in several places,
> even though it probably is going to be uncommon to only have one
> BackgroundWorkerSlot available in practice. It's simpler to just
> opt-out, and seems better given that max_parallel_degree is a way of
> resource limiting based on available cores (it's certainly not about
> the availability of shared memory for the BackgroundWorkerSlot array).
>

If I understand correctly, you are worried about the case where, if
the leader is not able to launch the minimum required number of
workers, a parallel index build will be slower than a serial index
build. I think it is legitimate to worry about such cases, but it
won't be difficult to simply make parallel execution behave as serial
execution (basically, you have the leader do all the work).  Now, one
could worry that there will be some overhead from setting up and
tearing down workers in this case, but I think that could be treated
as a limitation of the initial version of the implementation; if such
cases turn out to be common in general usage, we could add some
mechanism to reserve workers and start parallelism only when the
leader is able to reserve the required number of workers.
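In miniature, that fallback is just a leader-side decision after the launch attempt (function names and the threshold here are hypothetical stand-ins, not an actual PostgreSQL API):

```c
#include <assert.h>

/* Hypothetical minimum below which parallelism isn't worthwhile. */
#define MIN_USEFUL_WORKERS 2

/* Stubs standing in for the two build paths; each returns the number
 * of processes that ended up doing sort work, for illustration. */
static int
serial_index_build(void)
{
    return 1;                   /* leader does all the work itself */
}

static int
parallel_index_build(int nworkers_launched)
{
    return nworkers_launched + 1;   /* workers plus the leader */
}

/* After LaunchParallelWorkers(), fall back to a serial build if too
 * few workers actually launched. */
static int
build_index(int nworkers_launched)
{
    if (nworkers_launched < MIN_USEFUL_WORKERS)
        return serial_index_build();
    return parallel_index_build(nworkers_launched);
}
```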


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: ExecGather() + nworkers

From:
Pavel Stehule
Date:


>> More importantly, I have other, entirely general concerns. Other major
>> RDBMSs have settings that are very similar to max_parallel_degree,
>> with a setting of 1 effectively disabling all parallelism. Both Oracle
>> and SQL Server have settings that they both call the "maximum degree
>> of parallelism". I think it's a bit odd that with Postgres,
>> max_parallel_degree = 1 can still use parallelism at all. I have to
>> wonder: are we conflating controlling the resources used by parallel
>> operations with how shared memory is doled out?

> We could redefine things so that max_parallel_degree = N means use N
> - 1 workers, with a minimum value of 1 rather than 0, if there's a
> consensus that that's better.  Personally, I prefer it the way we've
> got it: it's real darned clear in my mind that max_parallel_degree=0
> means "not parallel".  But I won't cry into my beer if a consensus
> emerges that the other way would be better.


If max_parallel_degree were renamed to max_query_workers or something similar, then the new metric would make sense, and could be more intuitive.

Regards

Pavel


Re: ExecGather() + nworkers

From:
Peter Geoghegan
Date:
On Sun, Jan 10, 2016 at 5:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Well, in general, the parallel sort code doesn't really get to pick
> whether or not a BackgroundWorkerSlot gets used or not.  Whoever
> created the parallel context decides how many workers to request, and
> then the context got as many of those as it could.  It then did
> arbitrary computation, which at some point in the middle involves one
> or more parallel sorts.  You can't just have one of those workers up
> and exit in the middle.  Now, in the specific case of parallel index
> build, you probably can do that, if you want to.  But to be honest,
> I'd be inclined not to include that in the first version.  If you get
> fewer workers than you asked for, just use the number you got.  Let's
> see how that actually works out before we decide that we need a lot
> more mechanism here.  You may find that it's surprisingly effective to
> do it this way.

I am inclined to just accept for the time being (until I post the
patch) that one worker and one leader may sometimes be all that we
get. I will put off dealing with the problem until I show code, in
other words.

>> Now, you might wonder why it is that the leader cannot also sort runs,
>> just as a worker would. It's possible, but it isn't exactly
>> straightforward.

> I am surprised that this is not straightforward.  I don't see why it
> shouldn't be, and it worries me that you think it isn't.

It isn't entirely straightforward because it requires special-case
handling. For example, I must teach the leader not to try to wait on
itself to finish sorting runs. It might otherwise attempt that ahead
of its final on-the-fly merge.

>> More importantly, I have other, entirely general concerns. Other major
>> RDBMSs have settings that are very similar to max_parallel_degree,
>> with a setting of 1 effectively disabling all parallelism. Both Oracle
>> and SQL Server have settings that they both call the "maximum degree
>> of parallelism". I think it's a bit odd that with Postgres,
>> max_parallel_degree = 1 can still use parallelism at all. I have to
>> wonder: are we conflating controlling the resources used by parallel
>> operations with how shared memory is doled out?
>
> We could redefine things so that max_parallel_degree = N means use N
> - 1 workers, with a minimum value of 1 rather than 0, if there's a
> consensus that that's better.  Personally, I prefer it the way we've
> got it: it's real darned clear in my mind that max_parallel_degree=0
> means "not parallel".  But I won't cry into my beer if a consensus
> emerges that the other way would be better.

The fact that we don't do that isn't quite the issue, though. It may
or may not make sense to count the leader as an additional worker when
the leader has very little work to do. In good cases for parallel
sequential scan, the leader has very little leader-specific work to
do, because most of the time is spent on workers (including the
leader) determining that tuples must not be returned to the leader.
When that is less true, maybe the leader could reasonably count as a
fixed cost for a parallel operation. Hard to say.

I'm sorry that that's not really actionable, but I'm still working
this stuff out.

>> I could actually "give back" my parallel worker slots early if I
>> really wanted to (that would be messy, but the then-quiescent workers
>> do nothing for the duration of the merge beyond conceptually owning
>> the shared tape temp files). I don't think releasing the slots early
>> makes sense, because I tend to think that hanging on to the workers
>> helps the DBA in managing the server's resources. The still-serial
>> merge phase is likely to become a big bottleneck with parallel sort.
>
> Like I say, the sort code better not know anything about this
> directly, or it's going to break when embedded in a query.

tuplesort.c knows very little. nbtsort.c manages workers, and their
worker tuplesort states, as well as the leader and its tuplesort
state. So tuplesort.c knows a little bit about how the leader
tuplesort state may need to reconstruct worker state in order to do
its final on-the-fly merge. It knows nothing else, though, and
provides generic hooks for assigning worker numbers to worker
processes, or logical run numbers (this keeps trace_sort output
straight, plus a few other things).

Parallel workers are all managed in nbtsort.c, which seems
appropriate. Note that I have introduced a way in which a single
tuplesort state doesn't perfectly encapsulate a single sort operation,
though.

> This seems dead wrong.  A max_parallel_degree of 8 means you have a
> leader and 8 workers.  Where are the other 7 processes coming from?
> What you should have is 8 processes each of which is participating in
> both the parallel seq scan and the parallel sort, not 8 processes
> scanning and 8 separate processes sorting.

I simply conflated max_parallel_degree and max_worker_processes for a
moment. The point is that max_worker_processes is defined in terms of
one definition of a worker process that could in theory be quite
narrow.

A related issue is that I have no way to force Postgres to use however
many worker processes I feel like. Currently, I don't have a cost
model for parallel index builds; I just use max_worker_processes. This
will probably change soon, because I think having something better
than just using max_worker_processes is *essential* for parallel index
builds. There is already such a model for parallel seq scan in the
optimizer, of course.

The inability to force a certain number of workers is less of a
concern with parallel sequential scan, but it would be nice to be able
to test it that way. For parallel index builds, it seems reasonable to
suppose that advanced DBAs will want to avoid using the cost model
sometimes, for example because there are no statistics available
following a bulk load. The possible range of sensible nworkers could
be drastically different from max_parallel_degree.

Other systems directly support a way of doing this. Have you thought
about adding, say, a "parallel_degree" GUC, defaulting to 0 or -1
("use cost model"), but otherwise used directly as an nworkers input
for CreateParallelContext()?

>>> I think that's probably over-engineered.  I mean, it wouldn't be that
>>> hard to have the workers just exit if you decide you don't want them,
>>> and I don't really want to make the signaling here more complicated
>>> than it really needs to be.
>>
>> I worry about the additional overhead of constantly starting and
>> stopping a single worker in some cases (not so much with parallel
>> index build, but other uses of parallel sort beyond 9.6). Furthermore,
>> the coordination between worker and leader processes to make this
>> happen seems messy -- you actually have the postmaster launch
>> processes, but they must immediately get permission to do anything.
>>
>> It wouldn't be that hard to offer a general way of doing this, so why not?
>
> Well, if these things become actual problems, fine, we can fix them.
> But let's not decide to add the API before we're agreed that we need
> it to solve an actual problem that we both agree we have.  We are not
> there yet.

That's fair. Before too long, I'll post a patch that doesn't attempt
to deal with there only being one worker process, where that isn't
expected to be any faster than a serial sort. We can iterate on that.

-- 
Peter Geoghegan



Re: ExecGather() + nworkers

From
Amit Kapila
Date:
On Mon, Jan 11, 2016 at 9:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jan 11, 2016 at 3:14 AM, Peter Geoghegan <pg@heroku.com> wrote:
> >
> > On Sun, Jan 10, 2016 at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > >> I'm not sure why the test for nworkers following the
> > >> LaunchParallelWorkers() call doesn't look like this, though:
> > >>
> > >>     /* Set up tuple queue readers to read the results. */
> > >>     if (pcxt->nworkers_launched > 0)
> > >>     {
> > >>         ...
> > >>     }
> > >
> > > Hmm, yeah, I guess it could do that.
> >
> > That would make it clearer as an example.
> >
> > >> But going to this additional trouble (detecting no workers launched on
> > >> the basis of !nworkers_launched) suggests that simply testing
> > >> nworkers_launched would be wrong, which AFAICT it isn't. Can't we just
> > >> do that, and in so doing also totally remove the "for" loop shown
> > >> here?
> > >
> > > I don't see how the for loop goes away.
> >
> > I meant that some code in the "for" loop goes away. Not all of it.
> > Just the more obscure code. As I said, I'm mostly pointing this out
> > out of concern for making it clearer as example code.
> >
>
> Right, I can write a patch to do it in a way you are suggesting if you
> are not planning to do it.
>

Changed the code such that nworkers_launched gets used wherever
appropriate instead of nworkers.  This includes places other than
pointed out above.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments

Re: ExecGather() + nworkers

From
Haribabu Kommi
Date:
On Wed, Jan 13, 2016 at 7:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jan 11, 2016 at 9:16 AM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>> On Mon, Jan 11, 2016 at 3:14 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> >
>> > On Sun, Jan 10, 2016 at 9:13 AM, Robert Haas <robertmhaas@gmail.com>
>> > wrote:
>> > >> I'm not sure why the test for nworkers following the
>> > >> LaunchParallelWorkers() call doesn't look like this, though:
>> > >>
>> > >>     /* Set up tuple queue readers to read the results. */
>> > >>     if (pcxt->nworkers_launched > 0)
>> > >>     {
>> > >>         ...
>> > >>     }
>> > >
>> > > Hmm, yeah, I guess it could do that.
>> >
>> > That would make it clearer as an example.
>> >
>> > >> But going to this additional trouble (detecting no workers launched
>> > >> on
>> > >> the basis of !nworkers_launched) suggests that simply testing
>> > >> nworkers_launched would be wrong, which AFAICT it isn't. Can't we
>> > >> just
>> > >> do that, and in so doing also totally remove the "for" loop shown
>> > >> here?
>> > >
>> > > I don't see how the for loop goes away.
>> >
>> > I meant that some code in the "for" loop goes away. Not all of it.
>> > Just the more obscure code. As I said, I'm mostly pointing this out
>> > out of concern for making it clearer as example code.
>> >
>>
>> Right, I can write a patch to do it in a way you are suggesting if you
>> are not planning to do it.
>>
>
> Changed the code such that nworkers_launched gets used wherever
> appropriate instead of nworkers.  This includes places other than
> pointed out above.

The changes in the patch are simple, trivial optimizations.
I didn't find any problems with the changes. I think the same
optimization is also required in the "ExecParallelFinish" function.

Regards,
Hari Babu
Fujitsu Australia



Re: ExecGather() + nworkers

From
Amit Kapila
Date:
On Fri, Mar 4, 2016 at 11:57 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> On Wed, Jan 13, 2016 at 7:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Changed the code such that nworkers_launched gets used wherever
>> appropriate instead of nworkers.  This includes places other than
>> pointed out above.
>
> The changes in the patch are simple, trivial optimizations.
> I didn't find any problems with the changes. I think the same
> optimization is also required in the "ExecParallelFinish" function.


There is already one change as below for ExecParallelFinish() in patch.

@@ -492,7 +492,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)

  WaitForParallelWorkersToFinish(pei->pcxt);

  /* Next, accumulate buffer usage. */
- for (i = 0; i < pei->pcxt->nworkers; ++i)
+ for (i = 0; i < pei->pcxt->nworkers_launched; ++i)
  InstrAccumParallelQuery(&pei->buffer_usage[i]);


Can you be slightly more specific about where exactly you are expecting more changes?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: ExecGather() + nworkers

From
Haribabu Kommi
Date:
On Fri, Mar 4, 2016 at 10:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Mar 4, 2016 at 11:57 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
> wrote:
>>
>> On Wed, Jan 13, 2016 at 7:19 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> >>
>> >
>> > Changed the code such that nworkers_launched gets used wherever
>> > appropriate instead of nworkers.  This includes places other than
>> > pointed out above.
>>
>> The changes in the patch are simple, trivial optimizations.
>> I didn't find any problems with the changes. I think the same
>> optimization is also required in the "ExecParallelFinish" function.
>>
>
> There is already one change as below for ExecParallelFinish() in patch.
>
> @@ -492,7 +492,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
>
>   WaitForParallelWorkersToFinish(pei->pcxt);
>
>   /* Next, accumulate buffer usage. */
> - for (i = 0; i < pei->pcxt->nworkers; ++i)
> + for (i = 0; i < pei->pcxt->nworkers_launched; ++i)
>   InstrAccumParallelQuery(&pei->buffer_usage[i]);
>
>
> Can you be slightly more specific about where exactly you are expecting
> more changes?

I missed it when comparing the existing code with the patch.
Everything is fine with the patch. I marked the patch as ready for committer.

Regards,
Hari Babu
Fujitsu Australia



Re: ExecGather() + nworkers

From
Amit Kapila
Date:
On Fri, Mar 4, 2016 at 5:21 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> On Fri, Mar 4, 2016 at 10:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, Mar 4, 2016 at 11:57 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
>> wrote:
>> > On Wed, Jan 13, 2016 at 7:19 PM, Amit Kapila <amit.kapila16@gmail.com>
>> > wrote:
>> >> Changed the code such that nworkers_launched gets used wherever
>> >> appropriate instead of nworkers.  This includes places other than
>> >> pointed out above.
>> >
>> > The changes in the patch are simple, trivial optimizations.
>> > I didn't find any problems with the changes. I think the same
>> > optimization is also required in the "ExecParallelFinish" function.
>>
>> There is already one change as below for ExecParallelFinish() in patch.
>>
>> @@ -492,7 +492,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
>>
>>   WaitForParallelWorkersToFinish(pei->pcxt);
>>
>>   /* Next, accumulate buffer usage. */
>> - for (i = 0; i < pei->pcxt->nworkers; ++i)
>> + for (i = 0; i < pei->pcxt->nworkers_launched; ++i)
>>   InstrAccumParallelQuery(&pei->buffer_usage[i]);
>>
>> Can you be slightly more specific about where exactly you are expecting
>> more changes?
>
> I missed it when comparing the existing code with the patch.
> Everything is fine with the patch. I marked the patch as ready for committer.


Thanks!


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: ExecGather() + nworkers

From
Robert Haas
Date:
On Fri, Mar 4, 2016 at 6:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Mar 4, 2016 at 5:21 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
> wrote:
>>
>> On Fri, Mar 4, 2016 at 10:33 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> > On Fri, Mar 4, 2016 at 11:57 AM, Haribabu Kommi
>> > <kommi.haribabu@gmail.com>
>> > wrote:
>> >>
>> >> On Wed, Jan 13, 2016 at 7:19 PM, Amit Kapila <amit.kapila16@gmail.com>
>> >> wrote:
>> >> >>
>> >> >
>> >> > Changed the code such that nworkers_launched gets used wherever
>> >> > appropriate instead of nworkers.  This includes places other than
>> >> > pointed out above.
>> >>
>> >> The changes in the patch are simple, trivial optimizations.
>> >> I didn't find any problems with the changes. I think the same
>> >> optimization is also required in the "ExecParallelFinish" function.
>> >>
>> >
>> > There is already one change as below for ExecParallelFinish() in patch.
>> >
>> > @@ -492,7 +492,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
>> >
>> >   WaitForParallelWorkersToFinish(pei->pcxt);
>> >
>> >   /* Next, accumulate buffer usage. */
>> > - for (i = 0; i < pei->pcxt->nworkers; ++i)
>> > + for (i = 0; i < pei->pcxt->nworkers_launched; ++i)
>> >   InstrAccumParallelQuery(&pei->buffer_usage[i]);
>> >
>> >
>> > Can you be slightly more specific about where exactly you are expecting
>> > more changes?
>>
>> I missed it when comparing the existing code with the patch.
>> Everything is fine with the patch. I marked the patch as ready for
>> committer.
>>
>
> Thanks!

OK, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ExecGather() + nworkers

From
Amit Kapila
Date:
On Fri, Mar 4, 2016 at 11:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 4, 2016 at 6:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, Mar 4, 2016 at 5:21 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
>> wrote:
>> > On Fri, Mar 4, 2016 at 10:33 PM, Amit Kapila <amit.kapila16@gmail.com>
>> > wrote:
>> >> On Fri, Mar 4, 2016 at 11:57 AM, Haribabu Kommi
>> >> <kommi.haribabu@gmail.com>
>> >> wrote:
>> >> > On Wed, Jan 13, 2016 at 7:19 PM, Amit Kapila <amit.kapila16@gmail.com>
>> >> > wrote:
>> >> >> Changed the code such that nworkers_launched gets used wherever
>> >> >> appropriate instead of nworkers.  This includes places other than
>> >> >> pointed out above.
>> >> >
>> >> > The changes in the patch are simple, trivial optimizations.
>> >> > I didn't find any problems with the changes. I think the same
>> >> > optimization is also required in the "ExecParallelFinish" function.
>> >>
>> >> There is already one change as below for ExecParallelFinish() in patch.
>> >>
>> >> @@ -492,7 +492,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
>> >>
>> >>   WaitForParallelWorkersToFinish(pei->pcxt);
>> >>
>> >>   /* Next, accumulate buffer usage. */
>> >> - for (i = 0; i < pei->pcxt->nworkers; ++i)
>> >> + for (i = 0; i < pei->pcxt->nworkers_launched; ++i)
>> >>   InstrAccumParallelQuery(&pei->buffer_usage[i]);
>> >>
>> >> Can you be slightly more specific about where exactly you are expecting
>> >> more changes?
>> >
>> > I missed it when comparing the existing code with the patch.
>> > Everything is fine with the patch. I marked the patch as ready for
>> > committer.
>>
>> Thanks!
>
> OK, committed.


Thanks.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: ExecGather() + nworkers

From
Amit Kapila
Date:
On Mon, Jan 11, 2016 at 3:14 AM, Peter Geoghegan <pg@heroku.com> wrote:
>
> On Sun, Jan 10, 2016 at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Now, you might wonder why it is that the leader cannot also sort runs,
> just as a worker would. It's possible, but it isn't exactly
> straightforward. You have to have special cases in several places,
> even though it probably is going to be uncommon to only have one
> BackgroundWorkerSlot available in practice. It's simpler to just
> opt-out, and seems better given that max_parallel_degree is a way of
> resource limiting based on available cores (it's certainly not about
> the availability of shared memory for the BackgroundWorkerSlot array).
>
> More importantly, I have other, entirely general concerns. Other major
> RDBMSs have settings that are very similar to max_parallel_degree,
> with a setting of 1 effectively disabling all parallelism. Both Oracle
> and SQL Server have settings that they call the "maximum degree
> of parallelism". I think it's a bit odd that with Postgres,
> max_parallel_degree = 1 can still use parallelism at all. I have to
> wonder: are we conflating controlling the resources used by parallel
> operations with how shared memory is doled out?
>

Your point is genuine, but OTOH, if max_parallel_degree = 1 meant that parallelism is disabled, then when somebody sets max_parallel_degree = 2, it would look somewhat odd to me that this means one worker process can be used for a parallel query.  Also, I think a parallel query would then be able to get workers up to max_worker_processes + 1, which seems less intuitive than the current behaviour.

On your point about other databases, I have also checked, and it seems that some other databases, like Sybase [1], also provide a similar parameter
where a value of 1 means serial execution, so we might want to consider doing it similarly. To me, though, the current setting sounds quite intuitive; however, if more people feel the same as you, then we should change it.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: ExecGather() + nworkers

From
Peter Geoghegan
Date:
On Mon, Mar 7, 2016 at 4:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Your point is genuine, but OTOH let us say if max_parallel_degree = 1 means
> parallelism is disabled then when somebody sets max_parallel_degree = 2,
> then it looks somewhat odd to me that, it will mean that 1 worker process
> can be used for parallel query.

I'm not sure that that has to be true.

What is the argument for only using one worker process, say in the
case of parallel seq scan? I understand that parallel seq scan can
consume tuples itself, which seems like a good principle, but how far
does it go, and how useful is it in the general case? I'm not
suggesting that it isn't, but I'm not sure.

How common is it for the leader process to do anything other than
coordinate and consume from worker processes?

-- 
Peter Geoghegan



Re: ExecGather() + nworkers

From
Robert Haas
Date:
On Mon, Mar 7, 2016 at 2:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Mar 7, 2016 at 4:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Your point is genuine, but OTOH let us say if max_parallel_degree = 1 means
>> parallelism is disabled then when somebody sets max_parallel_degree = 2,
>> then it looks somewhat odd to me that, it will mean that 1 worker process
>> can be used for parallel query.
>
> I'm not sure that that has to be true.
>
> What is the argument for only using one worker process, say in the
> case of parallel seq scan? I understand that parallel seq scan can
> consume tuples itself, which seems like a good principle, but how far
> does it go, and how useful is it in the general case? I'm not
> suggesting that it isn't, but I'm not sure.
>
> How common is it for the leader process to do anything other than
> coordinate and consume from worker processes?

1 worker is often a very big speedup vs. 0 workers, and the work can
easily be evenly distributed between the worker and the leader.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company