Обсуждение: [HACKERS] Small improvement to parallel query docs

Поиск
Список
Период
Сортировка

[HACKERS] Small improvement to parallel query docs

От
David Rowley
Дата:
Tomas Vondra pointed out to me that there's an error in parallel.sgml
which confuses the inner and outer sides of the join.

I've attached a patch which fixes this, although I think I'm still
missing the point to text's explanation of why Merge Join is not
included due to it having to sort the inner side in each worker. Hash
join must build a hash table for each worker, so why is that OK by
sorting is not?

Anyway, I've attached a patch which fixes the outer/inner confusion
and cleans up a couple of other grammar mistakes and an omissions
regarding DISTINCT and ORDER BY not being supported in parallel
aggregate. I ended up rewriting a section too which was explaining
parallel aggregate, which I personally believe is a bit more clear to
read, but someone else may think otherwise.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] Small improvement to parallel query docs

От
Robert Haas
Дата:
On Sun, Feb 12, 2017 at 7:16 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> Tomas Vondra pointed out to me that there's an error in parallel.sgml
> which confuses the inner and outer sides of the join.
>
> I've attached a patch which fixes this, although I think I'm still
> missing the point to text's explanation of why Merge Join is not
> included due to it having to sort the inner side in each worker. Hash
> join must build a hash table for each worker, so why is that OK by
> sorting is not?
>
> Anyway, I've attached a patch which fixes the outer/inner confusion
> and cleans up a couple of other grammar mistakes and an omissions
> regarding DISTINCT and ORDER BY not being supported in parallel
> aggregate. I ended up rewriting a section too which was explaining
> parallel aggregate, which I personally believe is a bit more clear to
> read, but someone else may think otherwise.

-    loops or hash joins.  The outer side of the join may be any kind of
+    loops or hash joins.  The inner side of the join may be any kind of

Oops.  That's clearly a mistake.

-    table. Each worker will execute the outer side of the plan in full, which
-    is why merge joins are not supported here. The outer side of a merge join
-    will often involve sorting the entire inner table; even if it involves an
-    index, it is unlikely to be productive to have multiple processes each
-    conduct a full index scan of the inner table.
+    relation. Each worker will execute the inner side of the join in full,
+    which is why merge joins are not supported here. The inner side of a merge
+    join will often involve sorting the entire inner relation; even if it
+    involves an index, it is unlikely to be productive to have multiple
+    processes each conduct a full index scan of the inner side of the join.

Why s/table/relation/?  I don't think that's useful, especially
because the first part of that very same paragraph would still say
"table".

Anyway, events have shown the bit about merge joins was wrong thinking
on my part.  See Dilip's emails about parallel merge join.  Maybe we
should just change to say that it isn't supported without giving a
reason.  I hope to commit Dilip's patch to add support for exactly
that thing soon.  Then we can remove the language altogether and say
that it is supported.

-    <literal>COUNT(*)</>, each worker could compute a total, but those totals
-    would need to combined in order to produce a final answer.  If the query
-    involved a <literal>GROUP BY</> clause, a separate total would need to
-    be computed for each group.  Even though aggregation can't be done entirely
-    in parallel, queries involving aggregation are often excellent candidates
-    for parallel query, because they typically read many rows but return only
-    a few rows to the client.  Queries that return many rows to the client
-    are often limited by the speed at which the client can read the data,
-    in which case parallel query cannot help very much.
+    <literal>COUNT(*)</>, each worker must compute subtotals which later must
+    be combined to produce an overall total in order to produce the final
+    answer.  If the query involves a <literal>GROUP BY</> clause,
+    separate subtotals must be computed for each group seen by each parallel
+    worker. Each of these subtotals must then be combined into an overall
+    total for each group once the parallel aggregate portion of the plan is
+    complete.  This means that queries which produce a low number of groups
+    relative to the number of input rows are often far more attractive to the
+    query planner, whereas queries which don't collect many rows into each
+    group are less attractive, due to the overhead of having to combine the
+    subtotals into totals, of which cannot run in parallel.

I don't think "of which cannot run in parallel" is good grammar.  I'm
somewhat unsure whether the rest is an improvement or not.  Other
opinions?

-    a <literal>PartialAggregate</> node.  Second, the partial results are
+    a <literal>Partial Aggregate</> node.  Second, the partial results are

Oops, that's clearly a mistake.

-    <literal>FinalizeAggregate</> node.
+    <literal>Finalize Aggregate</> node.

That one, too.

-    Parallel aggregation is not supported for ordered set aggregates or when
-    the query involves <literal>GROUPING SETS</>.  It can only be used when
-    all joins involved in the query are also part of the parallel portion
-    of the plan.
+    Parallel aggregation is not supported if any aggregate function call
+    contains <literal>DISTINCT</> or <literal>ORDER BY</> clause and is also
+    not supported for ordered set aggregates or when  the query involves
+    <literal>GROUPING SETS</>.  It can only be used when all joins involved in
+    the query are also part of the parallel portion of the plan.

That chunk seems like an improvement to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Small improvement to parallel query docs

От
David Rowley
Дата:
On 14 February 2017 at 09:21, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Feb 12, 2017 at 7:16 PM, David Rowley
> -    table. Each worker will execute the outer side of the plan in full, which
> -    is why merge joins are not supported here. The outer side of a merge join
> -    will often involve sorting the entire inner table; even if it involves an
> -    index, it is unlikely to be productive to have multiple processes each
> -    conduct a full index scan of the inner table.
> +    relation. Each worker will execute the inner side of the join in full,
> +    which is why merge joins are not supported here. The inner side of a merge
> +    join will often involve sorting the entire inner relation; even if it
> +    involves an index, it is unlikely to be productive to have multiple
> +    processes each conduct a full index scan of the inner side of the join.
>
> Why s/table/relation/?  I don't think that's useful, especially
> because the first part of that very same paragraph would still say
> "table".

Perhaps it's not correct to do that. I did mean relation is the true
relational theory sense, rather than where relkind = 'r'. I didn't
really like the way it assumed the inner side was a table. Perhaps its
better to just say "involve sorting the entire inner side of the join"



-- David Rowley                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Small improvement to parallel query docs

От
Robert Haas
Дата:
On Mon, Feb 13, 2017 at 3:29 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 14 February 2017 at 09:21, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Feb 12, 2017 at 7:16 PM, David Rowley
>> -    table. Each worker will execute the outer side of the plan in full, which
>> -    is why merge joins are not supported here. The outer side of a merge join
>> -    will often involve sorting the entire inner table; even if it involves an
>> -    index, it is unlikely to be productive to have multiple processes each
>> -    conduct a full index scan of the inner table.
>> +    relation. Each worker will execute the inner side of the join in full,
>> +    which is why merge joins are not supported here. The inner side of a merge
>> +    join will often involve sorting the entire inner relation; even if it
>> +    involves an index, it is unlikely to be productive to have multiple
>> +    processes each conduct a full index scan of the inner side of the join.
>>
>> Why s/table/relation/?  I don't think that's useful, especially
>> because the first part of that very same paragraph would still say
>> "table".
>
> Perhaps it's not correct to do that. I did mean relation is the true
> relational theory sense, rather than where relkind = 'r'. I didn't
> really like the way it assumed the inner side was a table. Perhaps its
> better to just say "involve sorting the entire inner side of the join"

Yeah, that seems fine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Small improvement to parallel query docs

От
Brad DeJong
Дата:
Robert Haas wrote:

> +    <literal>COUNT(*)</>, each worker must compute subtotals which later must
> +    be combined to produce an overall total in order to produce the final
> +    answer.  If the query involves a <literal>GROUP BY</> clause,
> +    separate subtotals must be computed for each group seen by each parallel
> +    worker. Each of these subtotals must then be combined into an overall
> +    total for each group once the parallel aggregate portion of the plan is
> +    complete.  This means that queries which produce a low number of groups
> +    relative to the number of input rows are often far more attractive to the
> +    query planner, whereas queries which don't collect many rows into each
> +    group are less attractive, due to the overhead of having to combine the
> +    subtotals into totals, of which cannot run in parallel.

> I don't think "of which cannot run in parallel" is good grammar.  I'm somewhat unsure whether the rest is an
improvementor not.  Other opinions?
 

Does this read any more clearly?

+    <literal>COUNT(*)</>, each worker must compute subtotals which are later
+    combined in order to produce an overall total for the final answer.  If
+    the query involves a <literal>GROUP BY</> clause, separate subtotals
+    must be computed for each group seen by each parallel worker.  After the
+    parallel aggregate portion of the plan is complete, there is a serial step
+    where the group subtotals from all of the parallel workers are combined
+    into an overall total for each group.  Because of the overhead of combining
+    the subtotals into totals, plans which produce few groups relative to the
+    number of input rows are often more attractive to the query planner
+    than plans which produce many groups relative to the number of input rows.


I got rid of the ", of which cannot run in parallel" entirely. It was
already stated that the subtotals->totals step runs "once the parallel
aggregate portion of the plan is complete." which implies that it is serial.
I made that explicit with "there is a serial step". Also, the purpose of the 
", of which cannot run in parallel" sentence is to communicate why the
planner prefers one plan over another and, if I'm reading this correctly,
the subtotals->totals step is serial for both plans so that is not a reason
to prefer one over the other.

I think that the planner prefers plans rather than queries, so I changed that as well.

Re: [HACKERS] Small improvement to parallel query docs

От
David Rowley
Дата:
On 14 February 2017 at 10:10, Brad DeJong <Brad.Dejong@infor.com> wrote:
> Robert Haas wrote:
>
>> +    <literal>COUNT(*)</>, each worker must compute subtotals which later must
>> +    be combined to produce an overall total in order to produce the final
>> +    answer.  If the query involves a <literal>GROUP BY</> clause,
>> +    separate subtotals must be computed for each group seen by each parallel
>> +    worker. Each of these subtotals must then be combined into an overall
>> +    total for each group once the parallel aggregate portion of the plan is
>> +    complete.  This means that queries which produce a low number of groups
>> +    relative to the number of input rows are often far more attractive to the
>> +    query planner, whereas queries which don't collect many rows into each
>> +    group are less attractive, due to the overhead of having to combine the
>> +    subtotals into totals, of which cannot run in parallel.
>
>> I don't think "of which cannot run in parallel" is good grammar.  I'm somewhat unsure whether the rest is an
improvementor not.  Other opinions?
 
>
> Does this read any more clearly?
>
> +    <literal>COUNT(*)</>, each worker must compute subtotals which are later
> +    combined in order to produce an overall total for the final answer.  If
> +    the query involves a <literal>GROUP BY</> clause, separate subtotals
> +    must be computed for each group seen by each parallel worker.  After the
> +    parallel aggregate portion of the plan is complete, there is a serial step
> +    where the group subtotals from all of the parallel workers are combined
> +    into an overall total for each group.  Because of the overhead of combining
> +    the subtotals into totals, plans which produce few groups relative to the
> +    number of input rows are often more attractive to the query planner
> +    than plans which produce many groups relative to the number of input rows.

Actually looking over this again I think it's getting into too much
detail which is already described in the next paragraph (of which I
think is very clear). I propose we just remove the whole paragraph,
and mention about the planning and estimated number of groups stuff in
another new paragraph.

I've attached a patch to this effect, which also just removes the text
about why we don't support Merge Join. I felt something needed written
in its place, so I mentioned that identical hash tables are created in
each worker. This is perhaps not required, but the paragraph seemed a
bit empty without it.  I also noticed a mistake "based on a column
taken from the inner table", this "inner" I assume should be "outer"
since it surely must be talking of a parameterised index scan?, in
which case the parameter is from the outer side, not the inner.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] Small improvement to parallel query docs

От
Brad DeJong
Дата:
David Rowley wrote:
> I propose we just remove the whole paragraph, and mention about
> the planning and estimated number of groups stuff in another new paragraph.
> 
> I've attached a patch to this effect ...

s/In a worse case scenario/In the worst case scenario,/

Other than that, the phrasing in the new patch reads very smoothly.


Re: [HACKERS] Small improvement to parallel query docs

От
David Rowley
Дата:
On 14 February 2017 at 10:56, Brad DeJong <Brad.Dejong@infor.com> wrote:
> David Rowley wrote:
>> I propose we just remove the whole paragraph, and mention about
>> the planning and estimated number of groups stuff in another new paragraph.
>>
>> I've attached a patch to this effect ...
>
> s/In a worse case scenario/In the worst case scenario,/
>
> Other than that, the phrasing in the new patch reads very smoothly.

Thanks. Updated patch attached.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] Small improvement to parallel query docs

От
Amit Kapila
Дата:
On Tue, Feb 14, 2017 at 3:33 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 14 February 2017 at 10:56, Brad DeJong <Brad.Dejong@infor.com> wrote:
>> David Rowley wrote:
>>> I propose we just remove the whole paragraph, and mention about
>>> the planning and estimated number of groups stuff in another new paragraph.
>>>
>>> I've attached a patch to this effect ...
>>
>> s/In a worse case scenario/In the worst case scenario,/
>>
>> Other than that, the phrasing in the new patch reads very smoothly.
>
> Thanks. Updated patch attached.
>
>

+    Aggregate</> stage. For such cases there is clearly going to be no
+    performance benefit to using parallel aggregation.

A comma is required after "For such cases"

> The query planner takes
> +    this into account during the planning process and will choose how to
> +    perform the aggregation accordingly.

This part of the sentence sounds unclear.   It doesn't clearly
indicate that planner won't choose a parallel plan in such cases.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Small improvement to parallel query docs

От
David Rowley
Дата:
On 14 February 2017 at 21:25, Amit Kapila <amit.kapila16@gmail.com> wrote:
> +    Aggregate</> stage. For such cases there is clearly going to be no
> +    performance benefit to using parallel aggregation.
>
> A comma is required after "For such cases"

Added

>> The query planner takes
>> +    this into account during the planning process and will choose how to
>> +    perform the aggregation accordingly.
>
> This part of the sentence sounds unclear.   It doesn't clearly
> indicate that planner won't choose a parallel plan in such cases.

I thought that was obvious enough giving that I'd just mentioned that
there's clearly no benefit, however I've changed things to make that a
bit more explicit.

Thanks for reviewing this.

Updated patch attached.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] Small improvement to parallel query docs

От
Robert Haas
Дата:
On Tue, Feb 14, 2017 at 5:17 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 14 February 2017 at 21:25, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> +    Aggregate</> stage. For such cases there is clearly going to be no
>> +    performance benefit to using parallel aggregation.
>>
>> A comma is required after "For such cases"
>
> Added
>
>>> The query planner takes
>>> +    this into account during the planning process and will choose how to
>>> +    perform the aggregation accordingly.
>>
>> This part of the sentence sounds unclear.   It doesn't clearly
>> indicate that planner won't choose a parallel plan in such cases.
>
> I thought that was obvious enough giving that I'd just mentioned that
> there's clearly no benefit, however I've changed things to make that a
> bit more explicit.
>
> Thanks for reviewing this.
>
> Updated patch attached.

Committed and back-patched to 9.6.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Small improvement to parallel query docs

От
David Rowley
Дата:
On 15 February 2017 at 03:41, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 14, 2017 at 5:17 AM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> Updated patch attached.
>
> Committed and back-patched to 9.6.

Great. Thanks Robert.


-- David Rowley                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services