Discussion: [HACKERS] [PATCH] Push limit to sort through a subquery


[HACKERS] [PATCH] Push limit to sort through a subquery

From:
Douglas Doole
Date:
We've hit a case where pass_down_bound() isn't pushing the row count limit from limit into sort. The issue is that we're getting a subquery scan node between the limit and the sort. The subquery is only doing column projection and has no quals or SRFs so it should be safe to push the limit into the sort.

The query that hit the problem can be simplified to:

   SELECT * FROM (SELECT A,B FROM T ORDER BY C) LIMIT 5

(Yeah, the query's a little screwy in that the ORDER BY should really be outside the subselect, but it came from a query generator, so that's a different conversation.)
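To make the shape concrete, the plan we get looks roughly like this (an illustrative sketch, not actual EXPLAIN output from the failing case):

   Limit
     ->  Subquery Scan        (projection only: no quals, no SRFs)
           ->  Sort
                 Sort Key: c
                 ->  Seq Scan on t

Because pass_down_bound() doesn't know how to look through the Subquery Scan, the Sort below it never hears about the LIMIT 5 and can't switch to a bounded (top-N) sort.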

Proposed patch is attached.

- Doug
Salesforce
Attachments

Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Ashutosh Bapat
Date:
The function pass_down_bound() is a recursive function. For
SubqueryScanState we have to do something similar to ResultState i.e.
call pass_down_bound() recursively on subqueryScanState->subplan.
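A minimal sketch of that idea, modeled on the existing ResultState case in pass_down_bound() (the variable names and the no-qual guard here are assumptions for illustration, not committed code):

    else if (IsA(child_node, SubqueryScanState))
    {
        SubqueryScanState *subqueryState = (SubqueryScanState *) child_node;

        /* A qual could discard rows, which would make the bound wrong. */
        if (subqueryState->ss.ps.qual == NULL)
            pass_down_bound(node, subqueryState->subplan);
    }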

Please add this to the next commitfest.

On Wed, Apr 19, 2017 at 6:09 AM, Douglas Doole <dougdoole@gmail.com> wrote:
> We've hit a case where pass_down_bound() isn't pushing the row count limit
> from limit into sort. The issue is that we're getting a subquery scan node
> between the limit and the sort. The subquery is only doing column projection
> and has no quals or SRFs so it should be safe to push the limit into the
> sort.
>
> The query that hit the problem can be simplified to:
>
>    SELECT * FROM (SELECT A,B FROM T ORDER BY C) LIMIT 5
>
> (Yeah, the query's a little screwy in that the ORDER BY should really be
> outside the subselect, but it came from a query generator, so that's a
> different conversation.)
>
> Proposed patch is attached.
>
> - Doug
> Salesforce
>
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Douglas Doole
Date:
Thanks for the feedback, Ashutosh.

I disagree that we need to call pass_down_bound() recursively. The whole point of the recursive call would be to tunnel through a plan node. Since SubqueryScanState only has one child (unlike MergeAppendState) this can be more efficiently done inline rather than via a recursive call.

In the case of ResultState, this is classic tail-end recursion and the C compiler should optimize it out. As we add more recursive calls though, it becomes harder for the compiler to do the right thing. (I'll admit that I'm not up on the state of the art in recursion removal, though, so maybe my concern is moot. That said, I prefer not to depend on compiler optimizations if there's a clean alternative way to express the code.)

I do agree that it would make the code cleaner to handle ResultState and SubqueryScanState in a similar fashion. My preference, however, would be to remove the recursive call from ResultState handling. That would give us code like:

    if child == SubqueryScanState
        child = SubqueryScanState->child
    else if child == ResultState
        child = ResultState->child

    if child == SortState
        limit pushdown
    else if child == MergeAppendState
        recursive call on all children

If it is possible to get more than one SubqueryScanState and/or ResultState between the limit and sort, then the first block of code could be placed in a while loop.
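In C, the sketch above might look something like this (purely illustrative, written against the PG10-era pass_down_bound(); not the actual patch):

    /* Tunnel through single-child pass-through nodes without recursing. */
    for (;;)
    {
        if (IsA(child_node, SubqueryScanState))
        {
            SubqueryScanState *subqueryState = (SubqueryScanState *) child_node;

            /* A qual could discard rows, so pushing the bound is unsafe. */
            if (subqueryState->ss.ps.qual != NULL)
                return;
            child_node = subqueryState->subplan;
        }
        else if (IsA(child_node, ResultState) && outerPlanState(child_node))
            child_node = outerPlanState(child_node);
        else
            break;
    }

    /* child_node is now the candidate Sort / MergeAppend, handled as today. */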

If you'd prefer that I rework the patch as I discussed, just let me know and I'll resubmit it.

- Doug
Salesforce

On Wed, Apr 19, 2017 at 1:57 AM Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
The function pass_down_bound() is a recursive function. For
SubqueryScanState we have to do something similar to ResultState i.e.
call pass_down_bound() recursively on subqueryScanState->subplan.

Please add this to the next commitfest.

On Wed, Apr 19, 2017 at 6:09 AM, Douglas Doole <dougdoole@gmail.com> wrote:
> We've hit a case where pass_down_bound() isn't pushing the row count limit
> from limit into sort. The issue is that we're getting a subquery scan node
> between the limit and the sort. The subquery is only doing column projection
> and has no quals or SRFs so it should be safe to push the limit into the
> sort.
>
> The query that hit the problem can be simplified to:
>
>    SELECT * FROM (SELECT A,B FROM T ORDER BY C) LIMIT 5
>
> (Yeah, the query's a little screwy in that the ORDER BY should really be
> outside the subselect, but it came from a query generator, so that's a
> different conversation.)
>
> Proposed patch is attached.
>
> - Doug
> Salesforce
>
>



--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Ashutosh Bapat
Date:
On Wed, Apr 19, 2017 at 10:51 PM, Douglas Doole <dougdoole@gmail.com> wrote:
> Thanks for the feedback Ashutosh.
>
> I disagree that we need to call pass_down_bound() recursively. The whole
> point of the recursive call would be to tunnel through a plan node. Since
> SubqueryScanState only has one child (unlike MergeAppendState) this can be
> more efficiently done inline rather than via a recursive call.
>
> In the case of ResultState, this is classic tail-end recursion and the C
> compiler should optimize it out. As we add more recursive calls though, it
> becomes harder for the compiler to do the right thing. (I'll admit that I'm
> not up on what state-of-the-art in recursion removal though, so maybe my
> concern is moot. That said, I prefer not to depend on compiler optimizations
> if there's a clean alternative way to express the code.)
>
> I do agree that it would make the code cleaner to handle ResultState and
> SubqueryScanState in a similar fashion. My preference, however, would be to
> remove the recursive call from ResultState handling. That would give us code
> like:
>
>     if child == SubqueryScanState
>         child = SubqueryScanState->child
>     else if child == ResultState
>         child = ResultState->child
>
>     if child == SortState
>         limit pushdown
>     else if child == MergeAppendState
>         recursive call on all children
>
> If it is possible to get more than one SubqueryScanState and/or ResultState
> between the limit and sort, then the first block of code could be placed in
> a while loop.
>
> If you'd prefer that I rework the patch as I discussed, just let me know and
> I'll resubmit it.

Thinking more about it, the pass_down_bound() optimization ultimately
targets sort nodes. It doesn't, for example, target ForeignScan nodes,
which could also benefit from such an optimization. But I think any
generic LIMIT optimization will need to be handled in the planner
rather than in the executor.

That said, I don't think the community at large would accept
serializing pass_down_bound() as you are proposing. You may want to
wait for others' opinions before working on it. If you add this to the
commitfest app, more people might look at it when the next commitfest
opens. Also, it might help if you can provide a query (or queries) with
numbers where this optimization shows an improvement.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Douglas Doole
Date:
If you add this to the commitfest app, more people might look at it when the next commitfest opens.


Also, it might help if you can provide a query/ies with numbers where this optimization shows improvement.

I can't provide the real queries where we encountered the problem because they are internal. However, I showed a simplified version of the queries in my first post.

On our queries, the change made quite a difference - execution time dropped from 31.4 seconds to 7.2 seconds. EXPLAIN ANALYZE also shows that memory use dropped significantly and that we didn't have to spill the sort to disk.

From:

-> Sort (cost=989.95..1013.27 rows=9326 width=30)
        (node_startup_time/loop=31328.891, node_total_time/loop: 31329.756 rows=2001 loops=1)
     Buffers: temp read=772 written=11201 lsm_bufmgr hits=3392
     Sort Key: ***
     Sort Method: external merge  Sort Space Used: 89592  Sort Space Type: Disk

To:

-> Sort (cost=989.95..1013.27 rows=9326 width=30)
        (node_startup_time/loop=7123.275, node_total_time/loop: 7123.504 rows=2001 loops=1)
     Buffers: lsm_bufmgr hits=3387
     Sort Key: ***
     Sort Method: top-N heapsort  Sort Space Used: 3256  Sort Space Type: Memory

Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Konstantin Knizhnik
Date:


On 29.04.2017 00:13, Douglas Doole wrote:
If you add this to the commitfest app, more people might look at it when the next commitfest opens.


Also, it might help if you can provide a query/ies with numbers where this optimization shows improvement.

I can't provide the real queries where we encountered the problem because they are internal. However, I showed a simplified version of the queries in my first post.

On our queries, the change made quite a difference - execution time dropped from 31.4 seconds to 7.2 seconds. EXPLAIN ANALYZE also shows that memory use dropped significantly and that we didn't have to spill the sort to disk.

From:

-> Sort (cost=989.95..1013.27 rows=9326 width=30)
        (node_startup_time/loop=31328.891, node_total_time/loop: 31329.756 rows=2001 loops=1)
     Buffers: temp read=772 written=11201 lsm_bufmgr hits=3392
     Sort Key: ***
     Sort Method: external merge  Sort Space Used: 89592  Sort Space Type: Disk

To:

-> Sort (cost=989.95..1013.27 rows=9326 width=30)
        (node_startup_time/loop=7123.275, node_total_time/loop: 7123.504 rows=2001 loops=1)
     Buffers: lsm_bufmgr hits=3387
     Sort Key: ***
     Sort Method: top-N heapsort  Sort Space Used: 3256  Sort Space Type: Memory

Attached please find yet another small patch, which pushes the LIMIT down to ForeignScan.
I should note that the Postgres optimizer currently uses "Merge Append" and fetches only the required number of tuples from the remote nodes.
So even without LIMIT pushdown, postgres_fdw will not pull the whole table from the remote host.
postgres_fdw uses a cursor for fetching data from the remote server. The default fetch size is 100, so even without a limit the remote query will fetch no more than 100 rows at the remote site.

Consider the following example:

postgres=# create extension postgres_fdw;
postgres=# create server shard1  FOREIGN DATA WRAPPER postgres_fdw options(dbname 'postgres', host 'localhost', port '5432');
postgres=# create server shard2  FOREIGN DATA WRAPPER postgres_fdw options(dbname 'postgres', host 'localhost', port '5432');
postgres=# CREATE USER MAPPING for $user SERVER shard1 options (user '$user');
postgres=# CREATE USER MAPPING for $user SERVER shard2 options (user '$user');
postgres=# CREATE TABLE t(u integer primary key, v integer);
postgres=# CREATE TABLE t1(u integer primary key, v integer);
postgres=# CREATE TABLE t2(u integer primary key, v integer);
postgres=# insert into t1 values (generate_series(1,100000), random()*100000);
postgres=# insert into t2 values (generate_series(1,100000), random()*100000);
postgres=# CREATE FOREIGN TABLE t_fdw1() inherits (t) server shard1 options(table_name 't1');
postgres=# CREATE FOREIGN TABLE t_fdw2() inherits (t) server shard2 options(table_name 't2');


postgres=# explain analyze select * from t order by u limit 1;
                                                      QUERY PLAN                                                      
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=200.15..200.20 rows=1 width=8) (actual time=2.010..2.010 rows=1 loops=1)
   ->  Merge Append  (cost=200.15..449.39 rows=5121 width=8) (actual time=2.009..2.009 rows=1 loops=1)
         Sort Key: t.u
         ->  Index Scan using t_pkey on t  (cost=0.12..8.14 rows=1 width=8) (actual time=0.005..0.005 rows=0 loops=1)
         ->  Foreign Scan on t_fdw2  (cost=100.00..193.92 rows=2560 width=8) (actual time=1.074..1.074 rows=1 loops=1)
         ->  Foreign Scan on t_fdw1  (cost=100.00..193.92 rows=2560 width=8) (actual time=0.928..0.928 rows=1 loops=1)
 Planning time: 0.769 ms
 Execution time: 6.837 ms
(8 rows)

As you can see, the foreign scan fetches only one row from each remote node.

But pushing down the limit can still have a positive effect on performance, especially if the sort can be replaced with a top-N sort.
I got the following results (time in seconds):

Query                                 original   limit push down
select * from t order by u limit 1    2.276      1.777
select * from t order by v limit 1    10042

There is an index on "u", so fetching the records with the smallest "u" values can be done without sorting, and so the times are similar.
But when sorting by "v", pushing down the limit allows TOP-1 to be used instead of a global sort, and it reduces query execution time by more than a factor of two.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachments

Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Douglas Doole
Date:
I completely agree. The further a limit can be pushed down, the better.

The patch looks good to me.

On Thu, Aug 17, 2017 at 8:27 AM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:


On 29.04.2017 00:13, Douglas Doole wrote:
If you add this to the commitfest app, more people might look at it when the next commitfest opens.


Also, it might help if you can provide a query/ies with numbers where this optimization shows improvement.

I can't provide the real queries where we encountered the problem because they are internal. However, I showed a simplified version of the queries in my first post.

On our queries, the change made quite a difference - execution time dropped from 31.4 seconds to 7.2 seconds. EXPLAIN ANALYZE also shows that memory use dropped significantly and that we didn't have to spill the sort to disk.

From:

-> Sort (cost=989.95..1013.27 rows=9326 width=30)
        (node_startup_time/loop=31328.891, node_total_time/loop: 31329.756 rows=2001 loops=1)
     Buffers: temp read=772 written=11201 lsm_bufmgr hits=3392
     Sort Key: ***
     Sort Method: external merge  Sort Space Used: 89592  Sort Space Type: Disk

To:

-> Sort (cost=989.95..1013.27 rows=9326 width=30)
        (node_startup_time/loop=7123.275, node_total_time/loop: 7123.504 rows=2001 loops=1)
     Buffers: lsm_bufmgr hits=3387
     Sort Key: ***
     Sort Method: top-N heapsort  Sort Space Used: 3256  Sort Space Type: Memory

Attached please find yet another small patch, which pushes the LIMIT down to ForeignScan.
I should note that the Postgres optimizer currently uses "Merge Append" and fetches only the required number of tuples from the remote nodes.
So even without LIMIT pushdown, postgres_fdw will not pull the whole table from the remote host.
postgres_fdw uses a cursor for fetching data from the remote server. The default fetch size is 100, so even without a limit the remote query will fetch no more than 100 rows at the remote site.

Consider the following example:

postgres=# create extension postgres_fdw;
postgres=# create server shard1  FOREIGN DATA WRAPPER postgres_fdw options(dbname 'postgres', host 'localhost', port '5432');
postgres=# create server shard2  FOREIGN DATA WRAPPER postgres_fdw options(dbname 'postgres', host 'localhost', port '5432');
postgres=# CREATE USER MAPPING for $user SERVER shard1 options (user '$user');
postgres=# CREATE USER MAPPING for $user SERVER shard2 options (user '$user');
postgres=# CREATE TABLE t(u integer primary key, v integer);
postgres=# CREATE TABLE t1(u integer primary key, v integer);
postgres=# CREATE TABLE t2(u integer primary key, v integer);
postgres=# insert into t1 values (generate_series(1,100000), random()*100000);
postgres=# insert into t2 values (generate_series(1,100000), random()*100000);
postgres=# CREATE FOREIGN TABLE t_fdw1() inherits (t) server shard1 options(table_name 't1');
postgres=# CREATE FOREIGN TABLE t_fdw2() inherits (t) server shard2 options(table_name 't2');


postgres=# explain analyze select * from t order by u limit 1;
                                                      QUERY PLAN                                                      
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=200.15..200.20 rows=1 width=8) (actual time=2.010..2.010 rows=1 loops=1)
   ->  Merge Append  (cost=200.15..449.39 rows=5121 width=8) (actual time=2.009..2.009 rows=1 loops=1)
         Sort Key: t.u
         ->  Index Scan using t_pkey on t  (cost=0.12..8.14 rows=1 width=8) (actual time=0.005..0.005 rows=0 loops=1)
         ->  Foreign Scan on t_fdw2  (cost=100.00..193.92 rows=2560 width=8) (actual time=1.074..1.074 rows=1 loops=1)
         ->  Foreign Scan on t_fdw1  (cost=100.00..193.92 rows=2560 width=8) (actual time=0.928..0.928 rows=1 loops=1)
 Planning time: 0.769 ms
 Execution time: 6.837 ms
(8 rows)

As you can see, the foreign scan fetches only one row from each remote node.

But pushing down the limit can still have a positive effect on performance, especially if the sort can be replaced with a top-N sort.
I got the following results (time in seconds):

Query                                 original   limit push down
select * from t order by u limit 1    2.276      1.777
select * from t order by v limit 1    10042

There is an index on "u", so fetching the records with the smallest "u" values can be done without sorting, and so the times are similar.
But when sorting by "v", pushing down the limit allows TOP-1 to be used instead of a global sort, and it reduces query execution time by more than a factor of two.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 


Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Robert Haas
Date:
On Thu, Aug 17, 2017 at 11:36 AM, Douglas Doole <dougdoole@gmail.com> wrote:
I completely agree. The further a limit can be pushed down, the better.

The patch looks good to me.

It seems like a somewhat ad-hoc approach; it supposes that we can take any query produced by deparseSelectStmtForRel() and stick a LIMIT clause onto the very end and all will be well.  Maybe that's not a problematic assumption, not sure.  The grammar happens to allow both FOR UPDATE LIMIT n and LIMIT n FOR UPDATE even though only the latter syntax is documented.

Regarding the other patch on this thread, you mentioned upthread that "If it is possible to get more than one SubqueryScanState and/or ResultState between the limit and sort, then the first block of code could be placed in a while loop."  I think that's not possible for a ResultState, but I think it *is* possible for a SubqueryScanState.
 
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Douglas Doole
Date:
Thanks for the feedback on my original patch Robert. Here's an updated patch that will tunnel through multiple SubqueryScanStates.

- Doug
Salesforce

On Thu, Aug 17, 2017 at 6:33 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Aug 17, 2017 at 11:36 AM, Douglas Doole <dougdoole@gmail.com> wrote:
I completely agree. The further a limit can be pushed down, the better.

The patch looks good to me.

It seems like a somewhat ad-hoc approach; it supposes that we can take any query produced by deparseSelectStmtForRel() and stick a LIMIT clause onto the very end and all will be well.  Maybe that's not a problematic assumption, not sure.  The grammar happens to allow both FOR UPDATE LIMIT n and LIMIT n FOR UPDATE even though only the latter syntax is documented.

Regarding the other patch on this thread, you mentioned upthread that "If it is possible to get more than one SubqueryScanState and/or ResultState between the limit and sort, then the first block of code could be placed in a while loop."  I think that's not possible for a ResultState, but I think it *is* possible for a SubqueryScanState.
 
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Robert Haas
Date:
On Fri, Aug 18, 2017 at 11:42 AM, Douglas Doole <dougdoole@gmail.com> wrote:
> Thanks for the feedback on my original patch Robert. Here's an updated patch
> that will tunnel through multiple SubqueryScanStates.

Seems reasonable.  I have some assorted nitpicks.

1. The header comment for pass_down_bound() could mention "one or more
levels of subqueries" rather than "a subquery".

2. The first of the comments in the function body appears to have a
whitespace issue that needs to be fixed manually or, better yet,
addressed by pgindent.

3. The formatting of the comment in the regression tests appears to be
unlike any other comment in that same file.

4. I am pretty doubtful that "Memory: 25kB" is going to be stable
enough for us to want that output memorialized in the regression
tests. It seems like it might vary on different platforms - e.g.
32-bit vs. 64-bit - and it also seems like minor changes to how we do
sorting could perturb it and, perhaps, make it unstable even if today
it isn't.  So I think it would be good to give this a bit more thought
and see if you can come up with a way to test this without running
afoul of that problem.  Maybe adapt from this:

do $$
declare
    x text;
begin
    execute 'explain select 1' into x;
    if x !~ '^Result' then
        raise notice '%', x;
    else
        raise notice 'looks ok';
    end if;
end;
$$;

BTW, regarding the other patch on this thread, it struck me that maybe
it would be better to just reduce/limit the fetch count for the cursor
instead of trying to inject LIMIT n into the query itself.  That's not
as good for query optimization purposes but it's a lot more
future-proof.
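To sketch what that might look like (hypothetical; it assumes the bound - called tuples_needed here - could be made visible to postgres_fdw's cursor fetch loop, e.g. fetch_more_data(); only the FETCH command format is taken from postgres_fdw as it stands):

    /* Clamp the cursor's fetch size to the bound pushed down from above. */
    if (tuples_needed >= 0 && tuples_needed < fetch_size)
        fetch_size = (int) tuples_needed;

    snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
             fetch_size, fsstate->cursor_number);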

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Douglas Doole
Date:
1. The header comment for pass_down_bound() could mention "one or more
levels of subqueries" rather than "a subquery".

Fixed 

2. The first of the comments in the function body appears to have a
whitespace issue that needs to be fixed manually or, better yet,
addressed by pgindent.

Fixed
 
3. The formatting of the comment in the regression tests appears to be
unlike any other comment in that same file.

A side effect of inheriting it from our branches ;-) Reworked.
 
4. I am pretty doubtful that "Memory: 25kB" is going to be stable
enough for us to want that output memorialized in the regression ...

Fair enough. I wanted to be a bit more sophisticated in my check than looking for a single value, so I worked out something that distills the EXPLAIN output down to the key elements.
Attachments

Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Robert Haas
Date:
On Fri, Aug 18, 2017 at 4:08 PM, Douglas Doole <dougdoole@gmail.com> wrote:
>> 4. I am pretty doubtful that "Memory: 25kB" is going to be stable
>> enough for us to want that output memorialized in the regression ...
>
> Fair enough. I wanted to be a bit more sophisticated in my check than
> looking for a single value so I worked out something that distills the
> explain down to the key elements.

Works for me.  While I'm sure this won't eclipse previous achievements
in this area, it still seems worthwhile.  So committed with one minor
change to shorten a long-ish line.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Michael Paquier
Date:
On Tue, Aug 22, 2017 at 3:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Works for me.  While I'm sure this won't eclipse previous achievements
> in this area, it still seems worthwhile.

This one is intentional per what happens in the US today? ;)
-- 
Michael



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Konstantin Knizhnik
Date:


On 18.08.2017 04:33, Robert Haas wrote:

It seems like a somewhat ad-hoc approach; it supposes that we can take any query produced by deparseSelectStmtForRel() and stick a LIMIT clause onto the very end and all will be well.  Maybe that's not a problematic assumption, not sure.  The grammar happens to allow both FOR UPDATE LIMIT n and LIMIT n FOR UPDATE even though only the latter syntax is documented.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


I am not absolutely sure that it is possible to append a "LIMIT n" clause to any query that postgres_fdw can construct for a foreign scan.
But I also do not know of an example where it is not possible. As you have mentioned, "FOR UPDATE LIMIT n" is currently accepted by Postgres.

Can you suggest a better way to implement limit pushdown for FDWs?
Move deparseSelectStmtForRel() from postgresGetForeignPlan to postgresIterateForeignScan?
That seems problematic, because much of the information required by deparseSelectStmtForRel() is not available in postgresIterateForeignScan().
In principle it could be propagated there somehow, but from my point of view that is not the right approach...

IMHO there is some contradiction in the Postgres optimizer: static information about the limit is not taken into account at the planning stage, and is actually used only during query execution,
when the pass_down_bound() function is called to propagate knowledge about the limit down through the plan nodes. Certainly I understand that this gives more flexibility: we can use information from
previous steps of query execution that was not available at the planning stage.

But pushing down the limit at the planning stage would require too many changes, while the proposed patch is very small and non-invasive. And in principle it can be used not only by postgres_fdw, but also by other FDW implementations to push down information about the LIMIT.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Konstantin Knizhnik
Date:


On 22.08.2017 17:27, Konstantin Knizhnik wrote:


On 18.08.2017 04:33, Robert Haas wrote:

It seems like a somewhat ad-hoc approach; it supposes that we can take any query produced by deparseSelectStmtForRel() and stick a LIMIT clause onto the very end and all will be well.  Maybe that's not a problematic assumption, not sure.  The grammar happens to allow both FOR UPDATE LIMIT n and LIMIT n FOR UPDATE even though only the latter syntax is documented.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


I am not absolutely sure that it is possible to append a "LIMIT n" clause to any query that postgres_fdw can construct for a foreign scan.
But I also do not know of an example where it is not possible. As you have mentioned, "FOR UPDATE LIMIT n" is currently accepted by Postgres.


I have inspected the deparseSelectStmtForRel() function and now I am sure that appending LIMIT to the SQL statement generated by this function will not cause any problems.
It can produce only the following subset of SELECT:

select <target-list> FROM <table-list> [ GROUP BY ... [ HAVING ... ] ] [ ORDER BY ... ] [ FOR UPDATE ... ];


The only suspicious clause is FOR UPDATE, but I have checked that "FOR UPDATE ... LIMIT n" is really accepted by the Postgres parser.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Ashutosh Bapat
Date:
On Wed, Aug 23, 2017 at 12:56 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
>
>
> On 22.08.2017 17:27, Konstantin Knizhnik wrote:
>
>
>
> On 18.08.2017 04:33, Robert Haas wrote:
>
>
> It seems like a somewhat ad-hoc approach; it supposes that we can take any
> query produced by deparseSelectStmtForRel() and stick a LIMIT clause onto
> the very end and all will be well.  Maybe that's not a problematic
> assumption, not sure.  The grammar happens to allow both FOR UPDATE LIMIT n
> and LIMIT n FOR UPDATE even though only the latter syntax is documented.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>
> I am not absolutely sure that it is possible to append any query which can
> be constructed by postgres_fdw for foreign scan with "LIMIT n" clause.
> But I also do not know example when it is not possible. As you have
> mentioned, "FOR UPDATE LIMIT n" is currently recognized by Postgres.
>
>
> I have inspected deparseSelectStmtForRel function and now I am sure that
> appending LIMIT to the SQL statement generated by this function will not
> cause any problems.
> It can produce only the following subset of SELECT:
>
> select <target-list> FROM <table-list> [GROUP BY ... [ HAVING ... ]] [
> ORDER BY ... ] [ FOR UPDATE ... ];
>
>
> The only suspicious clause is FOR UPDATE, but I have checked that "FOR
> UPDATE ... LIMIT n" is  really accepted by Postgres parser.
>

There are two ways we can do this.
1. Implement limit handling in postgresGetForeignUpperPaths() when it
gets UPPERREL_FINAL (or we might want to break this rel into final and
limit rels). We then add paths for limit processing to the final rel
and add corresponding handling in deparseSelectStmtForRel(). If the
foreign path happens to be cheaper, the LIMIT gets pushed down to the
foreign server. This approach does not take care of LIMITs passed
down by pass_down_bound(), but it will take care of deparsing the
query correctly and will handle the OFFSET clause as well.

2. Konstantin's approach takes care of LIMITs passed down by
pass_down_bound(), but "always" pushes down the LIMIT and assumes that
a LIMIT clause can be appended to the already-generated query. Both of
these seem like sane choices, but this approach doesn't push down the
OFFSET clause even when it would be possible to do so.

If we could defer deparsing the query until execution time, we might be
able to achieve both targets. That would have one more advantage: we
could embed parameter values in the query rather than binding them.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Robert Haas
Date:
On Mon, Aug 21, 2017 at 2:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Works for me.  While I'm sure this won't eclipse previous achievements
> in this area, it still seems worthwhile.  So committed with one minor
> change to shorten a long-ish line.

Buildfarm members with force_parallel_mode=regress are failing now.  I
haven't had a chance to investigate exactly what's going on here, but
I think there are probably several issues:

1. It's definitely the case that the details about a sort operation
aren't propagated from a worker back to the leader.  This has elicited
complaint previously.

2. pass_down_bound() probably gets the Gather node at the top of the
tree and stops, since it doesn't know how to pass down a bound through
a Gather.  We could probably teach it to pass down the bound through
Gather or Gather Merge on the same theory as Append/MergeAppend.

3. However, even if it did that, it would only affect the leader, not
the workers, because each worker will of course have a separate copy
of each executor state node.  We could fix that by having the Gather
or Gather Merge node also store the bound and propagate that to each
worker via DSM, and then have each worker call pass_down_bound itself.
(This would require a bit of refactoring of the API for
pass_down_bound, but it looks doable.)

In the short run, I'm not sure we have a better alternative than
removing this test - unless somebody has a better idea? - but it would
be good to work on all of the above problems.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Douglas Doole
Date:
Buildfarm members with force_parallel_mode=regress are failing now.  I
haven't had a chance to investigate exactly what's going on here, but
I think there are probably several issues:

Not an auspicious beginning for my first patch :-(
 
3. However, even if it did that, it would only affect the leader, not
the workers, because each worker will of course have a separate copy
of each executor state node.  We could fix that by having the Gather
or Gather Merge node also store the bound and propagate that to each
worker via DSM, and then have each worker call pass_down_bound itself.
(This would require a bit of refactoring of the API for
pass_down_bound, but it looks doable.)

From previous experience, pushing the limit to the workers has the potential to be a big win.

In the short run, I'm not sure we have a better alternative than
removing this test - unless somebody has a better idea? - but it would
be good to work on all of the above problems.

I haven't played with parallel mode at all yet. Is it possible to disable force_parallel_mode for the new test? If not, then I'd agree that removing the test is probably the only reasonable short-term fix.

- Doug
Salesforce

Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Robert Haas
Date:
On Wed, Aug 23, 2017 at 6:55 PM, Douglas Doole <dougdoole@gmail.com> wrote:
> From previous experience, pushing the limit to the workers has the potential
> to be a big win .

Here's a not-really-tested draft patch for that.

> I haven't played with parallel mode at all yet. Is it possible to disable
> force_parallel_mode for the new test?

Yeah, I suppose that's another alternative.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Attachments

Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Douglas Doole
Date:
Here's a not-really-tested draft patch for that.

I don't know the parallel code so I'm not going to comment on the overall patch, but a thought on ExecSetTupleBound().

That function is accumulating a number of cases where we recurse for a single child (Result, Gather, Gather Merge). Recursion for a single child isn't particularly efficient. I know that compilers will frequently turn a single recursive call like this into a loop, but I don't know if they can optimize a branched case like this.

Would we be better off moving those cases into the while loop I added to avoid the recursion? So we'd end up with something like:

while ()
{
    if subquery
    else if result
    else if gather
    else if gather merge
}

if sort
else if merge append

And a nit from my original fix now that I've looked at the pg10 code more. The two casts I added (to SubqueryScanState and Node) should probably be changed to castNode() calls.
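For instance, the first one would presumably become something like this (sketch only):

    SubqueryScanState *subqueryState = castNode(SubqueryScanState, child_node);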

- Doug
Salesforce

Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Aug 23, 2017 at 6:55 PM, Douglas Doole <dougdoole@gmail.com> wrote:
>> From previous experience, pushing the limit to the workers has the potential
>> to be a big win .

> Here's a not-really-tested draft patch for that.

Both this patch and the already-committed one contain useless (and not
very cheap) expression_returns_set tests.  Since commit 69f4b9c85,
SRFs can only appear in the tlist of ProjectSet nodes.

No opinion on whether this patch is right otherwise.
        regards, tom lane



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Tom Lane
Date:
Douglas Doole <dougdoole@gmail.com> writes:
> Would we be better off moving those cases into the while loop I added to
> avoid the recursion?

No.  You can't easily avoid recursion for the merge-append case, since
that has to descend to multiple children.  TBH I dislike the fact that
you did the subquery case randomly differently from the existing cases;
it should just have been added as an additional recursive case.  Since
this is done only once at query startup, worrying about hypothetical
micro-performance issues seems rather misguided.

Perhaps, since this function is recursive, it ought to have a
check_stack_depth call, just to be safe.  I'm dubious though that it could
ever eat more stack than the preceding node-initialization calls, so maybe
we needn't bother.

In the long run I wonder whether we should convert "push down bound" into
a generic node operation like the others managed by execAmi.c.  It seems
like you could push a limit through any node type that's guaranteed not to
reduce the number of rows returned, and so we're missing cases.  Maybe
they're not ones that are important in practice, but I'm unsure.

> The two casts I added (to SubqueryScanState and Node) should probably be
> changed to castNode() calls.

The SubqueryScanState cast seems fine given that it immediately follows an
IsA() check.  You can't use castNode() for a cast to Node, because that's
a generic not a specific node type.
        regards, tom lane



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Buildfarm members with force_parallel_mode=regress are failing now.  I
> haven't had a chance to investigate exactly what's going on here, but
> I think there are probably several issues:

> 1. It's definitely the case that the details about a sort operation
> aren't propagated from a worker back to the leader.  This has elicited
> complaint previously.

Check; this must be the explanation for the buildfarm failures.  That'd
be worth fixing but it seems like material for an independent patch.

> 2. pass_down_bound() probably gets the Gather node at the top of the
> tree and stops, since it doesn't know how to pass down a bound through
> a Gather.  We could probably teach it to pass down the bound through
> Gather or Gather Merge on the same theory as Append/MergeAppend.

While that might be worth doing, it's not the issue here, because
the forced Gather is on top of the Limit node.

> In the short run, I'm not sure we have a better alternative than
> removing this test - unless somebody has a better idea? - but it would
> be good to work on all of the above problems.

To get the buildfarm back to green, I agree with the suggestion of
setting force_parallel_mode=off for this test, at least for now.

I don't greatly like the way that the regression test case filters
the output; it hides too much IMO.  I'd be inclined to try to return
the EXPLAIN output with as much detail preserved as possible.  Maybe
we could use regex substitution on lines of the output to get rid of
stuff that won't be invariant.

Unless you're already on it, I can go do that.
        regards, tom lane



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Douglas Doole
Date:
I don't greatly like the way that the regression test case filters
the output; it hides too much IMO.  I'd be inclined to try to return
the EXPLAIN output with as much detail preserved as possible.  Maybe
we could use regex substitution on lines of the output to get rid of
stuff that won't be invariant.

My first thought was to do a regex over the explain output to mask the troublesome value, but I couldn't figure out how to make it work and didn't find an example (admittedly, I didn't spend a huge amount of time hunting, though). I'd love to see how to put it together.

Doug
- Salesforce

Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Tom Lane
Date:
I wrote:
> To get the buildfarm back to green, I agree with the suggestion of
> setting force_parallel_mode=off for this test, at least for now.

Actually, there's an easier way: we can just make the test table be
TEMP.  That is a good idea anyway to prevent possible interference
from auto-analyze, which conceivably could come along at just the
wrong time and cause a plan change.
        regards, tom lane



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Douglas Doole
Date:
No.  You can't easily avoid recursion for the merge-append case, since
that has to descend to multiple children.

Agreed. That's why it's not in the while loop in my sketch of the suggested rework.
 
  TBH I dislike the fact that
you did the subquery case randomly differently from the existing cases;
it should just have been added as an additional recursive case.  Since
this is done only once at query startup, worrying about hypothetical
micro-performance issues seems rather misguided.

Habit mostly - why write code with potential performance problems when the alternative is just as easy to read?

I disagree with saying "it's just done at query startup so don't worry about it too much". Back in my days with DB2, we were working on the TPCC benchmark (when it was still relevant) and we found that startup cost was a huge factor when doing rapid-fire, cached, simple queries. In many cases the startup cost was >50% of the overall query execution time. Our testing here at Salesforce is showing a similar pattern in some of our PostgreSQL workloads.

I'm not trying to argue that this particular issue of recurse/don't recurse is going to make or break anything. It's just where my head's at when I look at the code.

- Doug
Salesforce

Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Tom Lane
Date:
Douglas Doole <dougdoole@gmail.com> writes:
> My first thought was to do a regex over the explain output to mask the
> troublesome value, but I couldn't figure out how to make it work and didn't
> find an example (admittedly didn't spent a huge amount of time hunting
> though). I'd love to see how to put it together.

I did it like this:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=1177ab1dabf72bafee8f19d904cee3a299f25892
        regards, tom lane



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Tom Lane
Date:
Douglas Doole <dougdoole@gmail.com> writes:
>> TBH I dislike the fact that
>> you did the subquery case randomly differently from the existing cases;
>> it should just have been added as an additional recursive case.  Since
>> this is done only once at query startup, worrying about hypothetical
>> micro-performance issues seems rather misguided.

> Habit mostly - why write code with potential performance problems when the
> alternative is just as easy to read?

I disagree that having adjacent code doing the same thing in two different
ways is "easy to read"; what it is is confusing.  More globally, since
we're dealing with community code that will be read and hacked on by many
people, I think we need to prioritize simplicity, readability, and
extensibility over micro-efficiency.  I'm willing to compromise those
goals for efficiency where a clear need has been demonstrated, but no such
need has been shown here, nor do I think it's likely that one can be
shown.

I'd have been okay with changing the entire function if it could still be
doing all cases the same way.  But the exception for merge-append (and
probably soon other cases) means you still end up with a readability
problem.
        regards, tom lane



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Robert Haas
Date:
On Thu, Aug 24, 2017 at 2:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'd have been okay with changing the entire function if it could still be
> doing all cases the same way.  But the exception for merge-append (and
> probably soon other cases) means you still end up with a readability
> problem.

I don't really agree with that.  I think it's reasonable to handle the
cases where we are just looking through nodes differently than the
others.  Just because we can't conveniently avoid recursing in every
case doesn't mean, to me, that we should recurse even when it's easily
avoidable.  I think what we might think about doing is have a node
that loops over Subquery Scan and Result and drills down, and then
handle the rest of the cases either as at present or, perhaps, with a
switch.

On another note, here's a second patch applying on top of my earlier
patch to push down Limit through Gather and Gather Merge.  This one
propagates information about sorts performed in parallel workers back
to the leader and adjusts the test case to use a non-temp table again.
This second patch has the nice property of being a fairly good test
that the first patch is actually doing something.

I'm inclined to commit both of these after a little more testing and
self-review, but let me know if anyone else wants to review first.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Attachments

Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Aug 24, 2017 at 2:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'd have been okay with changing the entire function if it could still be
>> doing all cases the same way.  But the exception for merge-append (and
>> probably soon other cases) means you still end up with a readability
>> problem.

> I don't really agree with that.  I think it's reasonable to handle the
> cases where we are just looking through nodes differently than the
> others.  Just because we can't conveniently avoid recursing in every
> case doesn't mean, to me, that we should recurse even when it's easily
> avoidable.

Basically, this argument is that we should contort the code in order
to avoid tail recursion, rather than assuming the compiler will turn
that recursion into iteration, even in cases where there is really
zero evidence that we should be worrying about performance at all
rather than correctness and readability.  I do not agree (and just
finished pushing a patch to change it).

> On another note, here's a second patch applying on top of my earlier
> patch to push down Limit through Gather and Gather Merge.

Can you demonstrate any case where the planner would put Gather between
Sort and Limit?  The GatherMerge case is probably interesting, but I'm
having doubts about Gather.
        regards, tom lane



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I'm inclined to commit both of these after a little more testing and
> self-review, but let me know if anyone else wants to review first.

Looking through the first one, I wondered whether we needed to
check for a qual expression on Gather or GatherMerge.  It seems like
it would be stupid to put a filter on that node rather than its
children, but I see this in both nodeGather.c and nodeGatherMerge.c:
    /*
     * initialize child expressions
     */
    gatherstate->ps.qual =
        ExecInitQual(node->plan.qual, (PlanState *) gatherstate);

It doesn't look like the qual is actually used anywhere in either node
type.  Am I right in thinking this is dead code?
        regards, tom lane



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Amit Kapila
Date:
On Fri, Aug 25, 2017 at 7:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I'm inclined to commit both of these after a little more testing and
>> self-review, but let me know if anyone else wants to review first.
>
> Looking through the first one, I wondered whether we needed to
> check for a qual expression on Gather or GatherMerge.  It seems like
> it would be stupid to put a filter on that node rather than its
> children, but I see this in both nodeGather.c and nodeGatherMerge.c:
>
>         /*
>          * initialize child expressions
>          */
>         gatherstate->ps.qual =
>                 ExecInitQual(node->plan.qual, (PlanState *) gatherstate);
>
> It doesn't look like the qual is actually used anywhere in either node
> type.  Am I right in thinking this is dead code?
>

I also think so.  I think this was required in some initial versions
of the Gather node patch, where we were thinking of having a single
node perform parallel scans (instead of what we have now: a Gather
node with a partial scan node beneath it).

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From:
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I'm inclined to commit both of these after a little more testing and
> self-review, but let me know if anyone else wants to review first.

Attached is an updated version of the first patch, which I rebased
over 3f4c7917b, improved a bunch of the comments for, and fixed a
couple of obvious typos in.  However, when I went to test it,
it blew up real good:

regression=# set parallel_setup_cost=0;
SET
regression=# set parallel_tuple_cost=0;
SET
regression=# set min_parallel_table_scan_size=0;
SET
regression=# set max_parallel_workers_per_gather=4;
SET
regression=# explain analyze select * from tenk1 order by fivethous limit 4;
ERROR:  retrieved too many tuples in a bounded sort
CONTEXT:  parallel worker

The cause is obvious: GatherMerge doesn't know about the contract that
it's not supposed to pull more than tuples_needed rows from an input after
promising not to do so.  I am not convinced that it's worth adding the
logic that would be needed to make that happen, so my inclination is to
abandon this patch.  But here it is if you want to push further.

            regards, tom lane

diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index ce47f1d..1287025 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -47,17 +47,26 @@
  * greater than any 32-bit integer here so that values < 2^32 can be used
  * by individual parallel nodes to store their own state.
  */
-#define PARALLEL_KEY_PLANNEDSTMT        UINT64CONST(0xE000000000000001)
-#define PARALLEL_KEY_PARAMS                UINT64CONST(0xE000000000000002)
-#define PARALLEL_KEY_BUFFER_USAGE        UINT64CONST(0xE000000000000003)
-#define PARALLEL_KEY_TUPLE_QUEUE        UINT64CONST(0xE000000000000004)
-#define PARALLEL_KEY_INSTRUMENTATION    UINT64CONST(0xE000000000000005)
-#define PARALLEL_KEY_DSA                UINT64CONST(0xE000000000000006)
-#define PARALLEL_KEY_QUERY_TEXT        UINT64CONST(0xE000000000000007)
+#define PARALLEL_KEY_EXECUTOR_FIXED        UINT64CONST(0xE000000000000001)
+#define PARALLEL_KEY_PLANNEDSTMT        UINT64CONST(0xE000000000000002)
+#define PARALLEL_KEY_PARAMS                UINT64CONST(0xE000000000000003)
+#define PARALLEL_KEY_BUFFER_USAGE        UINT64CONST(0xE000000000000004)
+#define PARALLEL_KEY_TUPLE_QUEUE        UINT64CONST(0xE000000000000005)
+#define PARALLEL_KEY_INSTRUMENTATION    UINT64CONST(0xE000000000000006)
+#define PARALLEL_KEY_DSA                UINT64CONST(0xE000000000000007)
+#define PARALLEL_KEY_QUERY_TEXT        UINT64CONST(0xE000000000000008)

 #define PARALLEL_TUPLE_QUEUE_SIZE        65536

 /*
+ * Fixed-size random stuff that we need to pass to parallel workers.
+ */
+typedef struct FixedParallelExecutorState
+{
+    int64        tuples_needed;    /* tuple bound, see ExecSetTupleBound */
+} FixedParallelExecutorState;
+
+/*
  * DSM structure for accumulating per-PlanState instrumentation.
  *
  * instrument_options: Same meaning here as in instrument.c.
@@ -381,12 +390,14 @@ ExecParallelReinitialize(ParallelExecutorInfo *pei)
  * execution and return results to the main backend.
  */
 ParallelExecutorInfo *
-ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
+ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers,
+                     int64 tuples_needed)
 {
     ParallelExecutorInfo *pei;
     ParallelContext *pcxt;
     ExecParallelEstimateContext e;
     ExecParallelInitializeDSMContext d;
+    FixedParallelExecutorState *fpes;
     char       *pstmt_data;
     char       *pstmt_space;
     char       *param_space;
@@ -418,6 +429,11 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
      * for the various things we need to store.
      */

+    /* Estimate space for fixed-size state. */
+    shm_toc_estimate_chunk(&pcxt->estimator,
+                           sizeof(FixedParallelExecutorState));
+    shm_toc_estimate_keys(&pcxt->estimator, 1);
+
     /* Estimate space for query text. */
     query_len = strlen(estate->es_sourceText);
     shm_toc_estimate_chunk(&pcxt->estimator, query_len);
@@ -487,6 +503,11 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
      * asked for has been allocated or initialized yet, though, so do that.
      */

+    /* Store fixed-size state. */
+    fpes = shm_toc_allocate(pcxt->toc, sizeof(FixedParallelExecutorState));
+    fpes->tuples_needed = tuples_needed;
+    shm_toc_insert(pcxt->toc, PARALLEL_KEY_EXECUTOR_FIXED, fpes);
+
     /* Store query string */
     query_string = shm_toc_allocate(pcxt->toc, query_len);
     memcpy(query_string, estate->es_sourceText, query_len);
@@ -833,6 +854,7 @@ ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
 void
 ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 {
+    FixedParallelExecutorState *fpes;
     BufferUsage *buffer_usage;
     DestReceiver *receiver;
     QueryDesc  *queryDesc;
@@ -841,6 +863,9 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
     void       *area_space;
     dsa_area   *area;

+    /* Get fixed-size state. */
+    fpes = shm_toc_lookup(toc, PARALLEL_KEY_EXECUTOR_FIXED, false);
+
     /* Set up DestReceiver, SharedExecutorInstrumentation, and QueryDesc. */
     receiver = ExecParallelGetReceiver(seg, toc);
     instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, true);
@@ -868,6 +893,9 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
     queryDesc->planstate->state->es_query_dsa = area;
     ExecParallelInitializeWorker(queryDesc->planstate, toc);

+    /* Pass down any tuple bound */
+    ExecSetTupleBound(fpes->tuples_needed, queryDesc->planstate);
+
     /* Run the plan */
     ExecutorRun(queryDesc, ForwardScanDirection, 0L, true);

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 36d2914..06a89ad 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -757,3 +757,119 @@ ExecShutdownNode(PlanState *node)

     return false;
 }
+
+/*
+ * ExecSetTupleBound
+ *
+ * Set a tuple bound for a planstate node.  This lets child plan nodes
+ * optimize based on the knowledge that the maximum number of tuples
+ * that their parent will demand is limited.
+ *
+ * Passing a negative tuples_needed value indicates "unknown limit", which
+ * should be the default assumption when this is not called at all for a
+ * particular node.  Also note the requirement that if this is called
+ * repeatedly on a plan tree, the exact same set of nodes must be updated
+ * with the new limit each time.
+ */
+void
+ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
+{
+    /*
+     * Since this function recurses, in principle we should check stack depth
+     * here.  In practice, it's probably pointless since the earlier node
+     * initialization tree traversal would surely have consumed more stack.
+     */
+
+    if (IsA(child_node, SortState))
+    {
+        /*
+         * If it is a Sort node, notify it that it can use bounded sort.
+         *
+         * Note: it is the responsibility of nodeSort.c to react properly to
+         * changes of these parameters.  If we ever redesign this, it'd be a
+         * good idea to integrate this signaling with the parameter-change
+         * mechanism.
+         */
+        SortState  *sortState = (SortState *) child_node;
+
+        if (tuples_needed < 0)
+        {
+            /* make sure flag gets reset if needed upon rescan */
+            sortState->bounded = false;
+        }
+        else
+        {
+            sortState->bounded = true;
+            sortState->bound = tuples_needed;
+        }
+    }
+    else if (IsA(child_node, MergeAppendState))
+    {
+        /*
+         * If it is a MergeAppend, we can apply the bound to any nodes that
+         * are children of the MergeAppend, since the MergeAppend surely need
+         * read no more than that many tuples from any one input.
+         */
+        MergeAppendState *maState = (MergeAppendState *) child_node;
+        int            i;
+
+        for (i = 0; i < maState->ms_nplans; i++)
+            ExecSetTupleBound(tuples_needed, maState->mergeplans[i]);
+    }
+    else if (IsA(child_node, ResultState))
+    {
+        /*
+         * We should also look through a Result, since the planner might stick
+         * one atop MergeAppend for projection purposes.
+         *
+         * If Result supported qual checking, we'd have to punt on seeing a
+         * qual.  Note that having a resconstantqual is not a showstopper: if
+         * that fails we're not getting any rows at all.
+         */
+        if (outerPlanState(child_node))
+            ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
+    }
+    else if (IsA(child_node, SubqueryScanState))
+    {
+        /*
+         * We can also look through SubqueryScan, but only if it has no qual
+         * (otherwise it might discard rows).
+         */
+        SubqueryScanState *subqueryState = (SubqueryScanState *) child_node;
+
+        if (subqueryState->ss.ps.qual == NULL)
+            ExecSetTupleBound(tuples_needed, subqueryState->subplan);
+    }
+    else if (IsA(child_node, GatherState))
+    {
+        GatherState *gstate = (GatherState *) child_node;
+
+        /*
+         * We might have a Gather node, which can propagate the bound to its
+         * workers.  As with MergeAppend, no one worker could possibly need to
+         * return more tuples than the Gather itself needs to.
+         *
+         * Note: As with Sort, the Gather node is responsible for reacting
+         * properly to changes to this parameter.
+         */
+        gstate->tuples_needed = tuples_needed;
+
+        /* We must also pass down bound to our own copy of the child plan */
+        ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
+    }
+    else if (IsA(child_node, GatherMergeState))
+    {
+        GatherMergeState *gstate = (GatherMergeState *) child_node;
+
+        /* Same idea here as for Gather */
+        gstate->tuples_needed = tuples_needed;
+        ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
+    }
+
+    /*
+     * In principle we could look through any plan node type that is certain
+     * not to discard or combine input rows.  In practice, there are not many
+     * node types that the planner might put between Sort and Limit, so trying
+     * to be very general is not (currently) worth the trouble.
+     */
+}
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index e8d94ee..a0f5a60 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -72,6 +72,7 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
     gatherstate->ps.state = estate;
     gatherstate->ps.ExecProcNode = ExecGather;
     gatherstate->need_to_scan_locally = !node->single_copy;
+    gatherstate->tuples_needed = -1;

     /*
      * Miscellaneous initialization
@@ -156,7 +157,8 @@ ExecGather(PlanState *pstate)
             if (!node->pei)
                 node->pei = ExecInitParallelPlan(node->ps.lefttree,
                                                  estate,
-                                                 gather->num_workers);
+                                                 gather->num_workers,
+                                                 node->tuples_needed);

             /*
              * Register backend workers. We might not get as many as we
diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 64c6239..2526c58 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -77,6 +77,7 @@ ExecInitGatherMerge(GatherMerge *node, EState *estate, int eflags)
     gm_state->ps.plan = (Plan *) node;
     gm_state->ps.state = estate;
     gm_state->ps.ExecProcNode = ExecGatherMerge;
+    gm_state->tuples_needed = -1;

     /*
      * Miscellaneous initialization
@@ -190,7 +191,8 @@ ExecGatherMerge(PlanState *pstate)
             if (!node->pei)
                 node->pei = ExecInitParallelPlan(node->ps.lefttree,
                                                  estate,
-                                                 gm->num_workers);
+                                                 gm->num_workers,
+                                                 node->tuples_needed);

             /* Try to launch workers. */
             pcxt = node->pei->pcxt;
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index ceb6854..d407bd8 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -27,7 +27,7 @@
 #include "nodes/nodeFuncs.h"

 static void recompute_limits(LimitState *node);
-static void pass_down_bound(LimitState *node, PlanState *child_node);
+static int64 compute_tuples_needed(LimitState *node);


 /* ----------------------------------------------------------------
@@ -297,92 +297,27 @@ recompute_limits(LimitState *node)
     /* Set state-machine state */
     node->lstate = LIMIT_RESCAN;

-    /* Notify child node about limit, if useful */
-    pass_down_bound(node, outerPlanState(node));
+    /*
+     * Notify child node about limit, if useful.  Note: think not to
+     * "optimize" by skipping the ExecSetTupleBound call if
+     * compute_tuples_needed returns <0.  We must update child nodes anyway,
+     * in case this is a rescan and the previous time we got a different
+     * result.
+     */
+    ExecSetTupleBound(compute_tuples_needed(node), outerPlanState(node));
 }

 /*
- * If we have a COUNT, and our input is a Sort node, notify it that it can
- * use bounded sort.  We can also pass down the bound through plan nodes
- * that cannot remove or combine input rows; for example, if our input is a
- * MergeAppend, we can apply the same bound to any Sorts that are direct
- * children of the MergeAppend, since the MergeAppend surely need not read
- * more than that many tuples from any one input.
- *
- * This is a bit of a kluge, but we don't have any more-abstract way of
- * communicating between the two nodes; and it doesn't seem worth trying
- * to invent one without some more examples of special communication needs.
- *
- * Note: it is the responsibility of nodeSort.c to react properly to
- * changes of these parameters.  If we ever do redesign this, it'd be a
- * good idea to integrate this signaling with the parameter-change mechanism.
+ * Compute the number of tuples needed to satisfy a Limit node.
+ * Return a negative value if there is not a determinable limit.
  */
-static void
-pass_down_bound(LimitState *node, PlanState *child_node)
+static int64
+compute_tuples_needed(LimitState *node)
 {
-    /*
-     * Since this function recurses, in principle we should check stack depth
-     * here.  In practice, it's probably pointless since the earlier node
-     * initialization tree traversal would surely have consumed more stack.
-     */
-
-    if (IsA(child_node, SortState))
-    {
-        SortState  *sortState = (SortState *) child_node;
-        int64        tuples_needed = node->count + node->offset;
-
-        /* negative test checks for overflow in sum */
-        if (node->noCount || tuples_needed < 0)
-        {
-            /* make sure flag gets reset if needed upon rescan */
-            sortState->bounded = false;
-        }
-        else
-        {
-            sortState->bounded = true;
-            sortState->bound = tuples_needed;
-        }
-    }
-    else if (IsA(child_node, MergeAppendState))
-    {
-        /* Pass down the bound through MergeAppend */
-        MergeAppendState *maState = (MergeAppendState *) child_node;
-        int            i;
-
-        for (i = 0; i < maState->ms_nplans; i++)
-            pass_down_bound(node, maState->mergeplans[i]);
-    }
-    else if (IsA(child_node, ResultState))
-    {
-        /*
-         * We also have to be prepared to look through a Result, since the
-         * planner might stick one atop MergeAppend for projection purposes.
-         *
-         * If Result supported qual checking, we'd have to punt on seeing a
-         * qual.  Note that having a resconstantqual is not a showstopper: if
-         * that fails we're not getting any rows at all.
-         */
-        if (outerPlanState(child_node))
-            pass_down_bound(node, outerPlanState(child_node));
-    }
-    else if (IsA(child_node, SubqueryScanState))
-    {
-        /*
-         * We can also look through SubqueryScan, but only if it has no qual
-         * (otherwise it might discard rows).
-         */
-        SubqueryScanState *subqueryState = (SubqueryScanState *) child_node;
-
-        if (subqueryState->ss.ps.qual == NULL)
-            pass_down_bound(node, subqueryState->subplan);
-    }
-
-    /*
-     * In principle we could look through any plan node type that is certain
-     * not to discard or combine input rows.  In practice, there are not many
-     * node types that the planner might put between Sort and Limit, so trying
-     * to be very general is not worth the trouble.
-     */
+    if (node->noCount)
+        return -1;
+    /* Note: if this overflows, we'll return a negative value, which is OK */
+    return node->count + node->offset;
 }

 /* ----------------------------------------------------------------
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index bd0a87f..79b8867 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -33,7 +33,7 @@ typedef struct ParallelExecutorInfo
 } ParallelExecutorInfo;

 extern ParallelExecutorInfo *ExecInitParallelPlan(PlanState *planstate,
-                     EState *estate, int nworkers);
+                     EState *estate, int nworkers, int64 tuples_needed);
 extern void ExecParallelFinish(ParallelExecutorInfo *pei);
 extern void ExecParallelCleanup(ParallelExecutorInfo *pei);
 extern void ExecParallelReinitialize(ParallelExecutorInfo *pei);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index eacbea3..f48a603 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -232,6 +232,7 @@ extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
 extern bool ExecShutdownNode(PlanState *node);
+extern void ExecSetTupleBound(int64 tuples_needed, PlanState *child_node);


 /* ----------------------------------------------------------------
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3272c4b..15a8426 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1919,6 +1919,7 @@ typedef struct GatherState
     struct TupleQueueReader **reader;
     TupleTableSlot *funnel_slot;
     bool        need_to_scan_locally;
+    int64        tuples_needed;    /* tuple bound, see ExecSetTupleBound */
 } GatherState;

 /* ----------------
@@ -1944,6 +1945,7 @@ typedef struct GatherMergeState
     struct binaryheap *gm_heap; /* binary heap of slot indices */
     bool        gm_initialized; /* gather merge initialized? */
     bool        need_to_scan_locally;
+    int64        tuples_needed;    /* tuple bound, see ExecSetTupleBound */
     int            gm_nkeys;
     SortSupport gm_sortkeys;    /* array of length ms_nkeys */
     struct GMReaderTupleBuffer *gm_tuple_buffers;    /* tuple buffer per reader */


Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Fri, Aug 25, 2017 at 7:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> it would be stupid to put a filter on that node rather than its
>> children, but I see this in both nodeGather.c and nodeGatherMerge.c:
>>
>> gatherstate->ps.qual =
>> ExecInitQual(node->plan.qual, (PlanState *) gatherstate);
>> 
>> It doesn't look like the qual is actually used anywhere in either node
>> type.  Am I right in thinking this is dead code?

> I also think so.  I think this was required in some initial versions
> of gather node patch where we were thinking of having a single node
> (instead of what we have now that Gather node and beneath there will
> be partial scan node) to perform parallel scans.

Thanks for confirming.  I'll replace that with something like
Assert(!node->plan.qual).  I have some other code-review-ish
fixes to do in nodeGatherMerge, too.
        regards, tom lane



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From
Robert Haas
Date:
On Fri, Aug 25, 2017 at 9:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Basically, this argument is that we should contort the code in order
> to avoid tail recursion, rather than assuming the compiler will turn
> that recursion into iteration, even in cases where there is really
> zero evidence that we should be worrying about performance at all
> rather than correctness and readability.  I do not agree (and just
> finished pushing a patch to change it).

I'm not prepared to argue further about it.  This strikes me as one of
those things that is really a matter of judgement; I don't really care
much about how the code ends up but I think your portrayal of what was
there is unduly negative.

>> On another note, here's a second patch applying on top of my earlier
>> patch to push down Limit through Gather and Gather Merge.
>
> Can you demonstrate any case where the planner would put Gather between
> Sort and Limit?  The GatherMerge case is probably interesting, but I'm
> having doubts about Gather.

That's an interesting point.  I was thinking that we needed both of
these patches in order to make the proposed regression test pass,
because in the relevant case we'll have Gather -> Limit -> Sort, but
actually we only need the second one, because as you point out here
what matters is what's between the Limit and the Sort.  I was thinking
that we'd need to be able to see through the Gather at the top of the
plan, but we don't.

However, there's another opportunity for optimization here, which is
the Limit -> Gather -> Anything case.  A worker which has itself
generated enough tuples to fill the limit can certainly stop working,
but right now it doesn't know that.  The patch I posted before didn't
think about that case, but it seems to require only trivial
modifications.  I was going to post an updated patch trying to
address that, but I see that you've already posted something, so I'll
go have a look at that instead.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> However, there's another opportunity for optimization here, which is
> the Limit -> Gather -> Anything case.  A worker which has itself
> generated enough tuples to fill the limit can certainly stop working,
> but right now it doesn't know that.  The patch I posted before didn't
> think about that case, but it seems to require only trivial
> modifications.

Ah, good point.

> I was going to post an updated patch trying to
> address that, but I see that you've already posted something, so I'll
> go have a look at that instead.

The problem I exhibited with the updated patch could probably be resolved
if the top level logic in a parallel worker knows about tuples_needed
and doesn't try to pull more than that from its subplan.  So what you're
describing isn't just an additional optimization, it's necessary to make
this work at all.
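
Concretely (as the patches later in this thread end up doing it), the
worker's entry point can both push the bound into its copy of the plan
tree and cap its own ExecutorRun call, where a count of 0 means "no
limit":

    /* Pass down any tuple bound, then be careful not to overshoot it */
    ExecSetTupleBound(fpes->tuples_needed, queryDesc->planstate);
    ExecutorRun(queryDesc,
                ForwardScanDirection,
                fpes->tuples_needed < 0 ? (int64) 0 : fpes->tuples_needed,
                true);
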
        regards, tom lane



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From
Robert Haas
Date:
On Fri, Aug 25, 2017 at 11:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The problem I exhibited with the updated patch could probably be resolved
> if the top level logic in a parallel worker knows about tuples_needed
> and doesn't try to pull more than that from its subplan.  So what you're
> describing isn't just an additional optimization, it's necessary to make
> this work at all.

I had the same thought.  Done in the attached revision of your
version.  I couldn't consistently reproduce the failure you saw -- it
failed for me only about 1 time in 10 -- but I don't think it's
failing any more with this version.

I think that the comment at the bottom of ExecSetTupleBound should
probably be rethought a bit in light of these changes.  Teaching the
parallel worker about tuples_needed may not be JUST an additional
optimization, but it IS an additional optimization; IOW, what the
planner can put between Sort and Limit will no longer be the only
relevant consideration.

The details aside, Konstantin Knizhnik's proposal to teach this code
about ForeignScans is yet another independent way for this to win, and
I think that will also be quite a large win.  In the future, I believe
that we might want to use a mechanism like this in even more cases.
For example, a Semi Join or Anti Join only needs 1 row from the inner
side per execution.  There's probably no reason to put a Sort under
the inner side of a Semi Join or Anti Join, but it's certainly not
impossible to have a Foreign Scan there.  It's likely to be an
expensive plan, but it's a heck of a lot more expensive if we fetch
100 rows when we only need 1.
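
For illustration only -- none of this is in the patch, and the
SetTupleBound callback below is an invented FdwRoutine member -- a
branch in ExecSetTupleBound's if/else chain for the ForeignScan case
might look roughly like:

    else if (IsA(child_node, ForeignScanState))
    {
        ForeignScanState *fsstate = (ForeignScanState *) child_node;

        /* as with SubqueryScan, only safe if no qual can discard rows */
        if (fsstate->ss.ps.qual == NULL &&
            fsstate->fdwroutine->SetTupleBound != NULL) /* hypothetical */
            fsstate->fdwroutine->SetTupleBound(fsstate, tuples_needed);
    }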

I'm not sure the existing mechanism will stretch to handle such cases.
It's possible that we should work out some of these things at plan
time rather than execution time, and it's also possible that a
separate call to ExecSetTupleBound is too much work.  I've mulled over
the idea of whether we should get rid of this mechanism altogether in
favor of passing the tuple bound as an additional argument to
ExecProcNode, because it seems silly to call node-specific code once
to set a (normally small, possibly 1) bound and then call it again to
get a (or the) tuple.  But I'm not sure passing it to ExecProcNode is
really the right idea; it's adding some overhead for something that is
a bit of a special case, and I'm not sure it will work out cleanly
otherwise, either.  I'd be interested in your thoughts if you have
any.  My intuition is that there's substantially more to be gained in
this area, but exactly how best to gain it is not very clear to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Attachments

Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I had the same thought.  Done in the attached revision of your
> version.  I couldn't consistently reproduce the failure you saw -- it
> failed for me only about 1 time in 10 -- but I don't think it's
> failing any more with this version.

Hm.  It seemed pretty reliable for me, but we already know that the
parallel machinery is quite timing-sensitive.  I think we'd better
incorporate a regression test case into the committed patch so we
can see what the buildfarm thinks.  I'll see about adding one.

> I think that the comment at the bottom of ExecSetTupleBound should
> probably be rethought a bit in light of these changes.

Agreed; I'll revisit all the comments, actually.

> The details aside, Konstantin Knizhnik's proposal to teach this code
> about ForeignScans is yet another independent way for this to win, and
> I think that will also be quite a large win.

+1

> It's possible that we should work out some of these things at plan
> time rather than execution time, and it's also possible that a
> separate call to ExecSetTupleBound is too much work.  I've mulled over
> the idea of whether we should get rid of this mechanism altogether in
> favor of passing the tuple bound as an additional argument to
> ExecProcNode, because it seems silly to call node-specific code once
> to set a (normally small, possibly 1) bound and then call it again to
> get a (or the) tuple.

No, I think that's fundamentally the wrong idea.  The tuple bound
inherently is a piece of information that has to apply across a whole
scan.  If you pass it to ExecProcNode then you'll have to get into
API contracts about whether it's allowed to change across calls,
which is a mess, even aside from efficiency concerns (and for
ExecProcNode, I *am* concerned about efficiency).

You could imagine adding it as a parameter to ExecReScan, instead, but
then there are a bunch of issues with how that would interact with the
existing optimizations for skipped/delayed/merged rescan calls (cf.
the existing open issue with parallel rescans).  That is a can of worms
I'd just as soon not open.

I'm content to leave it as a separate call.  I still think that worrying
about extra C function calls for this purpose is, at best, premature
optimization.

I'll do another pass over the patch and get back to you.  I've not
looked at the instrumentation patch yet --- is there a reason to
worry about it before we finish this one?
        regards, tom lane



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From
Robert Haas
Date:
On Fri, Aug 25, 2017 at 12:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hm.  It seemed pretty reliable for me, but we already know that the
> parallel machinery is quite timing-sensitive.  I think we'd better
> incorporate a regression test case into the committed patch so we
> can see what the buildfarm thinks.  I'll see about adding one.

That would be great.  I didn't have a great idea how to write a test
for this -- maybe one of my systematic failures as a developer.  :-(

>> I think that the comment at the bottom of ExecSetTupleBound should
>> probably be rethought a bit in light of these changes.
>
> Agreed; I'll revisit all the comments, actually.

That also sounds great.

> No, I think that's fundamentally the wrong idea.  The tuple bound
> inherently is a piece of information that has to apply across a whole
> scan.  If you pass it to ExecProcNode then you'll have to get into
> API contracts about whether it's allowed to change across calls,
> which is a mess, even aside from efficiency concerns (and for
> ExecProcNode, I *am* concerned about efficiency).

Fair.

> You could imagine adding it as a parameter to ExecReScan, instead, but
> then there are a bunch of issues with how that would interact with the
> existing optimizations for skipped/delayed/merged rescan calls (cf.
> the existing open issue with parallel rescans).  That is a can of worms
> I'd just as soon not open.
>
> I'm content to leave it as a separate call.  I still think that worrying
> about extra C function calls for this purpose is, at best, premature
> optimization.

OK, you may well be right.  Outside of Limit, I suppose we're not
likely to set any limit other than 1, and if we set a limit of 1 it'll
probably be a limit that survives rescans, because it'll be based on
the parent node not caring about anything more than the first tuple.
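
To make that concrete, a hypothetical fragment (not part of any patch
in this thread): a parent node that can never want more than the first
tuple from its inner side could announce that once, right after
initializing the child, and the bound would survive rescans:

    /* parent only ever wants one tuple per (re)scan of the inner side */
    ExecSetTupleBound(1, innerPlanState(node));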

BTW, I think another reason why we're likely to eventually want to get
very aggressive about passing down bounds is batch execution.  That's
somewhere on Andres's list of things that could speed up the executor,
although I believe at present there are quite a few other bottlenecks
that are more urgent.  But certainly if we start doing that, then even
things like SeqScan are going to want to know about applicable limit
clauses.

> I'll do another pass over the patch and get back to you.  I've not
> looked at the instrumentation patch yet --- is there a reason to
> worry about it before we finish this one?

I think they're independent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From
Tom Lane
Date:
v4, with a test case and some more comment-polishing.  I think this
is committable.

            regards, tom lane

diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index ce47f1d..ad9eba6 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -47,17 +47,26 @@
  * greater than any 32-bit integer here so that values < 2^32 can be used
  * by individual parallel nodes to store their own state.
  */
-#define PARALLEL_KEY_PLANNEDSTMT        UINT64CONST(0xE000000000000001)
-#define PARALLEL_KEY_PARAMS                UINT64CONST(0xE000000000000002)
-#define PARALLEL_KEY_BUFFER_USAGE        UINT64CONST(0xE000000000000003)
-#define PARALLEL_KEY_TUPLE_QUEUE        UINT64CONST(0xE000000000000004)
-#define PARALLEL_KEY_INSTRUMENTATION    UINT64CONST(0xE000000000000005)
-#define PARALLEL_KEY_DSA                UINT64CONST(0xE000000000000006)
-#define PARALLEL_KEY_QUERY_TEXT        UINT64CONST(0xE000000000000007)
+#define PARALLEL_KEY_EXECUTOR_FIXED        UINT64CONST(0xE000000000000001)
+#define PARALLEL_KEY_PLANNEDSTMT        UINT64CONST(0xE000000000000002)
+#define PARALLEL_KEY_PARAMS                UINT64CONST(0xE000000000000003)
+#define PARALLEL_KEY_BUFFER_USAGE        UINT64CONST(0xE000000000000004)
+#define PARALLEL_KEY_TUPLE_QUEUE        UINT64CONST(0xE000000000000005)
+#define PARALLEL_KEY_INSTRUMENTATION    UINT64CONST(0xE000000000000006)
+#define PARALLEL_KEY_DSA                UINT64CONST(0xE000000000000007)
+#define PARALLEL_KEY_QUERY_TEXT        UINT64CONST(0xE000000000000008)

 #define PARALLEL_TUPLE_QUEUE_SIZE        65536

 /*
+ * Fixed-size random stuff that we need to pass to parallel workers.
+ */
+typedef struct FixedParallelExecutorState
+{
+    int64        tuples_needed;    /* tuple bound, see ExecSetTupleBound */
+} FixedParallelExecutorState;
+
+/*
  * DSM structure for accumulating per-PlanState instrumentation.
  *
  * instrument_options: Same meaning here as in instrument.c.
@@ -381,12 +390,14 @@ ExecParallelReinitialize(ParallelExecutorInfo *pei)
  * execution and return results to the main backend.
  */
 ParallelExecutorInfo *
-ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
+ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers,
+                     int64 tuples_needed)
 {
     ParallelExecutorInfo *pei;
     ParallelContext *pcxt;
     ExecParallelEstimateContext e;
     ExecParallelInitializeDSMContext d;
+    FixedParallelExecutorState *fpes;
     char       *pstmt_data;
     char       *pstmt_space;
     char       *param_space;
@@ -418,6 +429,11 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
      * for the various things we need to store.
      */

+    /* Estimate space for fixed-size state. */
+    shm_toc_estimate_chunk(&pcxt->estimator,
+                           sizeof(FixedParallelExecutorState));
+    shm_toc_estimate_keys(&pcxt->estimator, 1);
+
     /* Estimate space for query text. */
     query_len = strlen(estate->es_sourceText);
     shm_toc_estimate_chunk(&pcxt->estimator, query_len);
@@ -487,6 +503,11 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
      * asked for has been allocated or initialized yet, though, so do that.
      */

+    /* Store fixed-size state. */
+    fpes = shm_toc_allocate(pcxt->toc, sizeof(FixedParallelExecutorState));
+    fpes->tuples_needed = tuples_needed;
+    shm_toc_insert(pcxt->toc, PARALLEL_KEY_EXECUTOR_FIXED, fpes);
+
     /* Store query string */
     query_string = shm_toc_allocate(pcxt->toc, query_len);
     memcpy(query_string, estate->es_sourceText, query_len);
@@ -833,6 +854,7 @@ ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
 void
 ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 {
+    FixedParallelExecutorState *fpes;
     BufferUsage *buffer_usage;
     DestReceiver *receiver;
     QueryDesc  *queryDesc;
@@ -841,6 +863,9 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
     void       *area_space;
     dsa_area   *area;

+    /* Get fixed-size state. */
+    fpes = shm_toc_lookup(toc, PARALLEL_KEY_EXECUTOR_FIXED, false);
+
     /* Set up DestReceiver, SharedExecutorInstrumentation, and QueryDesc. */
     receiver = ExecParallelGetReceiver(seg, toc);
     instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, true);
@@ -868,8 +893,17 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
     queryDesc->planstate->state->es_query_dsa = area;
     ExecParallelInitializeWorker(queryDesc->planstate, toc);

-    /* Run the plan */
-    ExecutorRun(queryDesc, ForwardScanDirection, 0L, true);
+    /* Pass down any tuple bound */
+    ExecSetTupleBound(fpes->tuples_needed, queryDesc->planstate);
+
+    /*
+     * Run the plan.  If we specified a tuple bound, be careful not to demand
+     * more tuples than that.
+     */
+    ExecutorRun(queryDesc,
+                ForwardScanDirection,
+                fpes->tuples_needed < 0 ? (int64) 0 : fpes->tuples_needed,
+                true);

     /* Shut down the executor */
     ExecutorFinish(queryDesc);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 36d2914..c1aa506 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -757,3 +757,124 @@ ExecShutdownNode(PlanState *node)

     return false;
 }
+
+/*
+ * ExecSetTupleBound
+ *
+ * Set a tuple bound for a planstate node.  This lets child plan nodes
+ * optimize based on the knowledge that the maximum number of tuples that
+ * their parent will demand is limited.  The tuple bound for a node may
+ * only be changed between scans (i.e., after node initialization or just
+ * before an ExecReScan call).
+ *
+ * Any negative tuples_needed value means "no limit", which should be the
+ * default assumption when this is not called at all for a particular node.
+ *
+ * Note: if this is called repeatedly on a plan tree, the exact same set
+ * of nodes must be updated with the new limit each time; be careful that
+ * only unchanging conditions are tested here.
+ */
+void
+ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
+{
+    /*
+     * Since this function recurses, in principle we should check stack depth
+     * here.  In practice, it's probably pointless since the earlier node
+     * initialization tree traversal would surely have consumed more stack.
+     */
+
+    if (IsA(child_node, SortState))
+    {
+        /*
+         * If it is a Sort node, notify it that it can use bounded sort.
+         *
+         * Note: it is the responsibility of nodeSort.c to react properly to
+         * changes of these parameters.  If we ever redesign this, it'd be a
+         * good idea to integrate this signaling with the parameter-change
+         * mechanism.
+         */
+        SortState  *sortState = (SortState *) child_node;
+
+        if (tuples_needed < 0)
+        {
+            /* make sure flag gets reset if needed upon rescan */
+            sortState->bounded = false;
+        }
+        else
+        {
+            sortState->bounded = true;
+            sortState->bound = tuples_needed;
+        }
+    }
+    else if (IsA(child_node, MergeAppendState))
+    {
+        /*
+         * If it is a MergeAppend, we can apply the bound to any nodes that
+         * are children of the MergeAppend, since the MergeAppend surely need
+         * read no more than that many tuples from any one input.
+         */
+        MergeAppendState *maState = (MergeAppendState *) child_node;
+        int            i;
+
+        for (i = 0; i < maState->ms_nplans; i++)
+            ExecSetTupleBound(tuples_needed, maState->mergeplans[i]);
+    }
+    else if (IsA(child_node, ResultState))
+    {
+        /*
+         * Similarly, for a projecting Result, we can apply the bound to its
+         * child node.
+         *
+         * If Result supported qual checking, we'd have to punt on seeing a
+         * qual.  Note that having a resconstantqual is not a showstopper: if
+         * that condition succeeds it affects nothing, while if it fails, no
+         * rows will be demanded from the Result child anyway.
+         */
+        if (outerPlanState(child_node))
+            ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
+    }
+    else if (IsA(child_node, SubqueryScanState))
+    {
+        /*
+         * We can also descend through SubqueryScan, but only if it has no
+         * qual (otherwise it might discard rows).
+         */
+        SubqueryScanState *subqueryState = (SubqueryScanState *) child_node;
+
+        if (subqueryState->ss.ps.qual == NULL)
+            ExecSetTupleBound(tuples_needed, subqueryState->subplan);
+    }
+    else if (IsA(child_node, GatherState))
+    {
+        /*
+         * A Gather node can propagate the bound to its workers.  As with
+         * MergeAppend, no one worker could possibly need to return more
+         * tuples than the Gather itself needs to.
+         *
+         * Note: As with Sort, the Gather node is responsible for reacting
+         * properly to changes to this parameter.
+         */
+        GatherState *gstate = (GatherState *) child_node;
+
+        gstate->tuples_needed = tuples_needed;
+
+        /* Also pass down the bound to our own copy of the child plan */
+        ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
+    }
+    else if (IsA(child_node, GatherMergeState))
+    {
+        /* Same comments as for Gather */
+        GatherMergeState *gstate = (GatherMergeState *) child_node;
+
+        gstate->tuples_needed = tuples_needed;
+
+        ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
+    }
+
+    /*
+     * In principle we could descend through any plan node type that is
+     * certain not to discard or combine input rows; but on seeing a node that
+     * can do that, we can't propagate the bound any further.  For the moment
+     * it's unclear that any other cases are worth checking here.
+     */
+}
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index e8d94ee..a0f5a60 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -72,6 +72,7 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
     gatherstate->ps.state = estate;
     gatherstate->ps.ExecProcNode = ExecGather;
     gatherstate->need_to_scan_locally = !node->single_copy;
+    gatherstate->tuples_needed = -1;

     /*
      * Miscellaneous initialization
@@ -156,7 +157,8 @@ ExecGather(PlanState *pstate)
             if (!node->pei)
                 node->pei = ExecInitParallelPlan(node->ps.lefttree,
                                                  estate,
-                                                 gather->num_workers);
+                                                 gather->num_workers,
+                                                 node->tuples_needed);

             /*
              * Register backend workers. We might not get as many as we
diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 64c6239..2526c58 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -77,6 +77,7 @@ ExecInitGatherMerge(GatherMerge *node, EState *estate, int eflags)
     gm_state->ps.plan = (Plan *) node;
     gm_state->ps.state = estate;
     gm_state->ps.ExecProcNode = ExecGatherMerge;
+    gm_state->tuples_needed = -1;

     /*
      * Miscellaneous initialization
@@ -190,7 +191,8 @@ ExecGatherMerge(PlanState *pstate)
             if (!node->pei)
                 node->pei = ExecInitParallelPlan(node->ps.lefttree,
                                                  estate,
-                                                 gm->num_workers);
+                                                 gm->num_workers,
+                                                 node->tuples_needed);

             /* Try to launch workers. */
             pcxt = node->pei->pcxt;
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index ceb6854..883f46c 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -27,7 +27,7 @@
 #include "nodes/nodeFuncs.h"

 static void recompute_limits(LimitState *node);
-static void pass_down_bound(LimitState *node, PlanState *child_node);
+static int64 compute_tuples_needed(LimitState *node);


 /* ----------------------------------------------------------------
@@ -297,92 +297,26 @@ recompute_limits(LimitState *node)
     /* Set state-machine state */
     node->lstate = LIMIT_RESCAN;

-    /* Notify child node about limit, if useful */
-    pass_down_bound(node, outerPlanState(node));
+    /*
+     * Notify child node about limit.  Note: think not to "optimize" by
+     * skipping ExecSetTupleBound if compute_tuples_needed returns < 0.  We
+     * must update the child node anyway, in case this is a rescan and the
+     * previous time we got a different result.
+     */
+    ExecSetTupleBound(compute_tuples_needed(node), outerPlanState(node));
 }

 /*
- * If we have a COUNT, and our input is a Sort node, notify it that it can
- * use bounded sort.  We can also pass down the bound through plan nodes
- * that cannot remove or combine input rows; for example, if our input is a
- * MergeAppend, we can apply the same bound to any Sorts that are direct
- * children of the MergeAppend, since the MergeAppend surely need not read
- * more than that many tuples from any one input.
- *
- * This is a bit of a kluge, but we don't have any more-abstract way of
- * communicating between the two nodes; and it doesn't seem worth trying
- * to invent one without some more examples of special communication needs.
- *
- * Note: it is the responsibility of nodeSort.c to react properly to
- * changes of these parameters.  If we ever do redesign this, it'd be a
- * good idea to integrate this signaling with the parameter-change mechanism.
+ * Compute the maximum number of tuples needed to satisfy this Limit node.
+ * Return a negative value if there is not a determinable limit.
  */
-static void
-pass_down_bound(LimitState *node, PlanState *child_node)
+static int64
+compute_tuples_needed(LimitState *node)
 {
-    /*
-     * Since this function recurses, in principle we should check stack depth
-     * here.  In practice, it's probably pointless since the earlier node
-     * initialization tree traversal would surely have consumed more stack.
-     */
-
-    if (IsA(child_node, SortState))
-    {
-        SortState  *sortState = (SortState *) child_node;
-        int64        tuples_needed = node->count + node->offset;
-
-        /* negative test checks for overflow in sum */
-        if (node->noCount || tuples_needed < 0)
-        {
-            /* make sure flag gets reset if needed upon rescan */
-            sortState->bounded = false;
-        }
-        else
-        {
-            sortState->bounded = true;
-            sortState->bound = tuples_needed;
-        }
-    }
-    else if (IsA(child_node, MergeAppendState))
-    {
-        /* Pass down the bound through MergeAppend */
-        MergeAppendState *maState = (MergeAppendState *) child_node;
-        int            i;
-
-        for (i = 0; i < maState->ms_nplans; i++)
-            pass_down_bound(node, maState->mergeplans[i]);
-    }
-    else if (IsA(child_node, ResultState))
-    {
-        /*
-         * We also have to be prepared to look through a Result, since the
-         * planner might stick one atop MergeAppend for projection purposes.
-         *
-         * If Result supported qual checking, we'd have to punt on seeing a
-         * qual.  Note that having a resconstantqual is not a showstopper: if
-         * that fails we're not getting any rows at all.
-         */
-        if (outerPlanState(child_node))
-            pass_down_bound(node, outerPlanState(child_node));
-    }
-    else if (IsA(child_node, SubqueryScanState))
-    {
-        /*
-         * We can also look through SubqueryScan, but only if it has no qual
-         * (otherwise it might discard rows).
-         */
-        SubqueryScanState *subqueryState = (SubqueryScanState *) child_node;
-
-        if (subqueryState->ss.ps.qual == NULL)
-            pass_down_bound(node, subqueryState->subplan);
-    }
-
-    /*
-     * In principle we could look through any plan node type that is certain
-     * not to discard or combine input rows.  In practice, there are not many
-     * node types that the planner might put between Sort and Limit, so trying
-     * to be very general is not worth the trouble.
-     */
+    if (node->noCount)
+        return -1;
+    /* Note: if this overflows, we'll return a negative value, which is OK */
+    return node->count + node->offset;
 }

 /* ----------------------------------------------------------------
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index bd0a87f..79b8867 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -33,7 +33,7 @@ typedef struct ParallelExecutorInfo
 } ParallelExecutorInfo;

 extern ParallelExecutorInfo *ExecInitParallelPlan(PlanState *planstate,
-                     EState *estate, int nworkers);
+                     EState *estate, int nworkers, int64 tuples_needed);
 extern void ExecParallelFinish(ParallelExecutorInfo *pei);
 extern void ExecParallelCleanup(ParallelExecutorInfo *pei);
 extern void ExecParallelReinitialize(ParallelExecutorInfo *pei);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index eacbea3..f48a603 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -232,6 +232,7 @@ extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
 extern bool ExecShutdownNode(PlanState *node);
+extern void ExecSetTupleBound(int64 tuples_needed, PlanState *child_node);


 /* ----------------------------------------------------------------
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3272c4b..15a8426 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1919,6 +1919,7 @@ typedef struct GatherState
     struct TupleQueueReader **reader;
     TupleTableSlot *funnel_slot;
     bool        need_to_scan_locally;
+    int64        tuples_needed;    /* tuple bound, see ExecSetTupleBound */
 } GatherState;

 /* ----------------
@@ -1944,6 +1945,7 @@ typedef struct GatherMergeState
     struct binaryheap *gm_heap; /* binary heap of slot indices */
     bool        gm_initialized; /* gather merge initialized? */
     bool        need_to_scan_locally;
+    int64        tuples_needed;    /* tuple bound, see ExecSetTupleBound */
     int            gm_nkeys;
     SortSupport gm_sortkeys;    /* array of length ms_nkeys */
     struct GMReaderTupleBuffer *gm_tuple_buffers;    /* tuple buffer per reader */
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 084f0f0..ccad18e 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -300,6 +300,29 @@ select count(*) from tenk1 group by twenty;
    500
 (20 rows)

+reset enable_hashagg;
+-- gather merge test with a LIMIT
+explain (costs off)
+  select fivethous from tenk1 order by fivethous limit 4;
+                  QUERY PLAN
+----------------------------------------------
+ Limit
+   ->  Gather Merge
+         Workers Planned: 4
+         ->  Sort
+               Sort Key: fivethous
+               ->  Parallel Seq Scan on tenk1
+(6 rows)
+
+select fivethous from tenk1 order by fivethous limit 4;
+ fivethous
+-----------
+         0
+         0
+         1
+         1
+(4 rows)
+
 -- gather merge test with 0 worker
 set max_parallel_workers = 0;
 explain (costs off)
@@ -325,7 +348,6 @@ select string4 from tenk1 order by string4 limit 5;
 (5 rows)

 reset max_parallel_workers;
-reset enable_hashagg;
 SAVEPOINT settings;
 SET LOCAL force_parallel_mode = 1;
 explain (costs off)
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index 58c3f59..c0debdd 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -118,13 +118,20 @@ explain (costs off)

 select count(*) from tenk1 group by twenty;

+reset enable_hashagg;
+
+-- gather merge test with a LIMIT
+explain (costs off)
+  select fivethous from tenk1 order by fivethous limit 4;
+
+select fivethous from tenk1 order by fivethous limit 4;
+
 -- gather merge test with 0 worker
 set max_parallel_workers = 0;
 explain (costs off)
    select string4 from tenk1 order by string4 limit 5;
 select string4 from tenk1 order by string4 limit 5;
 reset max_parallel_workers;
-reset enable_hashagg;

 SAVEPOINT settings;
 SET LOCAL force_parallel_mode = 1;


Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From
Robert Haas
Date:
On Fri, Aug 25, 2017 at 1:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> v4, with a test case and some more comment-polishing.  I think this
> is committable.

Awesome.  What's the business with enable_hashagg in the regression test stuff?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Awesome.  What's the business with enable_hashagg in the regression test stuff?

Just relocating that "reset" so that it only affects the test case it's
needed for.  (I think the 0-worker test was inserted in the wrong place.)
        regards, tom lane



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From
Robert Haas
Date:
On Fri, Aug 25, 2017 at 1:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Awesome.  What's the business with enable_hashagg in the regression test stuff?
>
> Just relocating that "reset" so that it only affects the test case it's
> needed for.  (I think the 0-worker test was inserted in the wrong place.)

Oh, hmm.  Well, no objection to that, I guess.  Thanks for working on this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On another note, here's a second patch applying on top of my earlier
> patch to push down Limit through Gather and Gather Merge.

I looked through this a little, and feel uncomfortable with the division
of typedefs between execnodes.h and tuplesort.h.  I'm inclined to push
struct SortInstrumentation, and maybe also SharedSortInfo, into
tuplesort.h.  Then tuplesort_get_stats could be given the signature

        void tuplesort_get_stats(Tuplesortstate *state,
                                 SortInstrumentation *stats);

which seems less messy.  Thoughts?

I'm also pretty suspicious of the filter code you added to subselect.sql.
What happens when there's more than one worker?

(BTW, would it make sense to number the workers from 1 not 0 in the
EXPLAIN printout?)
        regards, tom lane



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From
Robert Haas
Date:
On Fri, Aug 25, 2017 at 2:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On another note, here's a second patch applying on top of my earlier
>> patch to push down Limit through Gather and Gather Merge.
>
> I looked through this a little, and feel uncomfortable with the division
> of typedefs between execnodes.h and tuplesort.h.  I'm inclined to push
> struct SortInstrumentation, and maybe also SharedSortInfo, into
> tuplesort.h.  Then tuplesort_get_stats could be given the signature
>
>         void tuplesort_get_stats(Tuplesortstate *state,
>                                  SortInstrumentation *stats);
>
> which seems less messy.  Thoughts?

I think moving SharedSortInfo into tuplesort.h would be a gross
abstraction violation, but moving SortInstrumentation into tuplesort.h
seems like a modest improvement.

> I'm also pretty suspicious of the filter code you added to subselect.sql.
> What happens when there's more than one worker?

Uh, then somebody's changed the definition of force_parallel_mode in a
really strange way.

> (BTW, would it make sense to number the workers from 1 not 0 in the
> EXPLAIN printout?)

The main problem I see with that is that it might confuse somebody who
is debugging.  If you've got a backend and print out
ParallelWorkerNumber, you might expect that to match what you see in
the EXPLAIN output.  It also seems likely to me that other parts of
EXPLAIN or other parts of the system generally will eventually expose
parallel worker numbers in mildly user-visible ways (e.g. maybe it'll
show up in a DEBUG message someplace eventually), and if we insist on
sticking a + 1 into EXPLAIN's output in the existing places we'll have
to do it everywhere else, and somebody's eventually going to forget.
So I'm in favor of leaving it alone; I don't think that 0-based
indexing is such an obscure convention that it will flummox users.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Aug 25, 2017 at 2:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I looked through this a little, and feel uncomfortable with the division
>> of typedefs between execnodes.h and tuplesort.h.  I'm inclined to push
>> struct SortInstrumentation, and maybe also SharedSortInfo, into
>> tuplesort.h.

> I think moving SharedSortInfo into tuplesort.h would be a gross
> abstraction violation, but moving SortInstrumentation into tuplesort.h
> seems like a modest improvement.

Hmm, I'm not sure why SortInstrumentation belongs naturally to
tuplesort.h but putting an array of them there would be a "gross
abstraction violation".  Perhaps it would help to rename
struct SharedSortInfo to SortInstrumentationArray, and change its
field names to be less specific to the parallel-worker use case?

>> (BTW, would it make sense to number the workers from 1 not 0 in the
>> EXPLAIN printout?)

> ... So I'm in favor of leaving it alone; I don't think that 0-based
> indexing is such an obscure convention that it will flummox users.

OK, I'm not particularly set on that.
        regards, tom lane



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From
Robert Haas
Date:
On Fri, Aug 25, 2017 at 2:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think moving SharedSortInfo into tuplesort.h would be a gross
>> abstraction violation, but moving SortInstrumentation into tuplesort.h
>> seems like a modest improvement.
>
> Hmm, I'm not sure why SortInstrumentation belongs naturally to
> tuplesort.h but putting an array of them there would be a "gross
> abstraction violation".  Perhaps it would help to rename
> struct SharedSortInfo to SortInstrumentationArray, and change its
> field names to be less specific to the parallel-worker use case?

What other use case could there be?  I think an array of
SortInstrumentation objects intended to be stored in DSM is fairly
clearly a bit of executor-specific machinery and thus properly
declared along with the node that contains it.  Only nodeSort.c and
explain.c need to know anything about it; tuplesort.c is totally
ignorant of it and I see no reason why we'd ever want to change that.
Indeed, I don't quite see how we could do so without some kind of
abstraction violation.

SortInstrumentation is a different kettle of fish.  I chose to make
that an executor-specific bit as well, but an advantage of your
proposed approach is that future additions to (or changes in) what
gets returned by tuplesort_get_stats() wouldn't require as much
tweaking elsewhere.  With your proposed change, nodeSort.c wouldn't
really care about the contents of that structure (as long as there are
no embedded pointers), so we could add new members or rename things
and only tuplesort.c and explain.c would need updating.
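
For reference, a sketch of the struct in question as it might land in
tuplesort.h, inferred from how the patch below consumes it (the field
types are assumptions based on what tuplesort_get_stats() reports):

    typedef struct TupleSortInstrumentation
    {
        TuplesortMethod sortMethod;   /* sort algorithm used */
        TuplesortSpaceType spaceType; /* type of space spaceUsed measures */
        long        spaceUsed;        /* space consumption, in kB */
    } TupleSortInstrumentation;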

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Aug 25, 2017 at 2:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hmm, I'm not sure why SortInstrumentation belongs naturally to
>> tuplesort.h but putting an array of them there would be a "gross
>> abstraction violation".  Perhaps it would help to rename
>> struct SharedSortInfo to SortInstrumentationArray, and change its
>> field names to be less specific to the parallel-worker use case?

> What other use case could there be?  I think an array of
> SortInstrumentation objects intended to be stored in DSM is fairly
> clearly a bit of executor-specific machinery and thus properly
> declared along with the node that contains it.

I'm not really convinced, but it's not worth arguing further.

Here's a reviewed version of the second patch.  I fixed one bug
(wrong explain group nesting) and made some additional cosmetic
improvements beyond the struct relocation.

            regards, tom lane

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 953e74d..385b325 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2292,15 +2292,21 @@ show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 static void
 show_sort_info(SortState *sortstate, ExplainState *es)
 {
-    if (es->analyze && sortstate->sort_Done &&
-        sortstate->tuplesortstate != NULL)
+    if (!es->analyze)
+        return;
+
+    if (sortstate->sort_Done && sortstate->tuplesortstate != NULL)
     {
         Tuplesortstate *state = (Tuplesortstate *) sortstate->tuplesortstate;
+        TupleSortInstrumentation stats;
         const char *sortMethod;
         const char *spaceType;
         long        spaceUsed;

-        tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
+        tuplesort_get_stats(state, &stats);
+        sortMethod = tuplesort_method_name(stats.sortMethod);
+        spaceType = tuplesort_space_type_name(stats.spaceType);
+        spaceUsed = stats.spaceUsed;

         if (es->format == EXPLAIN_FORMAT_TEXT)
         {
@@ -2315,6 +2321,51 @@ show_sort_info(SortState *sortstate, ExplainState *es)
             ExplainPropertyText("Sort Space Type", spaceType, es);
         }
     }
+
+    if (sortstate->shared_info != NULL)
+    {
+        int            n;
+        bool        opened_group = false;
+
+        for (n = 0; n < sortstate->shared_info->num_workers; n++)
+        {
+            TupleSortInstrumentation *sinstrument;
+            const char *sortMethod;
+            const char *spaceType;
+            long        spaceUsed;
+
+            sinstrument = &sortstate->shared_info->sinstrument[n];
+            if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+                continue;        /* ignore any unfilled slots */
+            sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+            spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+            spaceUsed = sinstrument->spaceUsed;
+
+            if (es->format == EXPLAIN_FORMAT_TEXT)
+            {
+                appendStringInfoSpaces(es->str, es->indent * 2);
+                appendStringInfo(es->str,
+                                 "Worker %d:  Sort Method: %s  %s: %ldkB\n",
+                                 n, sortMethod, spaceType, spaceUsed);
+            }
+            else
+            {
+                if (!opened_group)
+                {
+                    ExplainOpenGroup("Workers", "Workers", false, es);
+                    opened_group = true;
+                }
+                ExplainOpenGroup("Worker", NULL, true, es);
+                ExplainPropertyInteger("Worker Number", n, es);
+                ExplainPropertyText("Sort Method", sortMethod, es);
+                ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+                ExplainPropertyText("Sort Space Type", spaceType, es);
+                ExplainCloseGroup("Worker", NULL, true, es);
+            }
+        }
+        if (opened_group)
+            ExplainCloseGroup("Workers", "Workers", false, es);
+    }
 }

 /*
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index ad9eba6..01316ff 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -28,9 +28,10 @@
 #include "executor/nodeBitmapHeapscan.h"
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
-#include "executor/nodeSeqscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
+#include "executor/nodeSeqscan.h"
+#include "executor/nodeSort.h"
 #include "executor/tqueue.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
@@ -202,10 +203,10 @@ ExecSerializePlan(Plan *plan, EState *estate)
 }

 /*
- * Ordinary plan nodes won't do anything here, but parallel-aware plan nodes
- * may need some state which is shared across all parallel workers.  Before
- * we size the DSM, give them a chance to call shm_toc_estimate_chunk or
- * shm_toc_estimate_keys on &pcxt->estimator.
+ * Parallel-aware plan nodes (and occasionally others) may need some state
+ * which is shared across all parallel workers.  Before we size the DSM, give
+ * them a chance to call shm_toc_estimate_chunk or shm_toc_estimate_keys on
+ * &pcxt->estimator.
  *
  * While we're at it, count the number of PlanState nodes in the tree, so
  * we know how many SharedPlanStateInstrumentation structures we need.
@@ -219,38 +220,44 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
     /* Count this node. */
     e->nnodes++;

-    /* Call estimators for parallel-aware nodes. */
-    if (planstate->plan->parallel_aware)
+    switch (nodeTag(planstate))
     {
-        switch (nodeTag(planstate))
-        {
-            case T_SeqScanState:
+        case T_SeqScanState:
+            if (planstate->plan->parallel_aware)
                 ExecSeqScanEstimate((SeqScanState *) planstate,
                                     e->pcxt);
-                break;
-            case T_IndexScanState:
+            break;
+        case T_IndexScanState:
+            if (planstate->plan->parallel_aware)
                 ExecIndexScanEstimate((IndexScanState *) planstate,
                                       e->pcxt);
-                break;
-            case T_IndexOnlyScanState:
+            break;
+        case T_IndexOnlyScanState:
+            if (planstate->plan->parallel_aware)
                 ExecIndexOnlyScanEstimate((IndexOnlyScanState *) planstate,
                                           e->pcxt);
-                break;
-            case T_ForeignScanState:
+            break;
+        case T_ForeignScanState:
+            if (planstate->plan->parallel_aware)
                 ExecForeignScanEstimate((ForeignScanState *) planstate,
                                         e->pcxt);
-                break;
-            case T_CustomScanState:
+            break;
+        case T_CustomScanState:
+            if (planstate->plan->parallel_aware)
                 ExecCustomScanEstimate((CustomScanState *) planstate,
                                        e->pcxt);
-                break;
-            case T_BitmapHeapScanState:
+            break;
+        case T_BitmapHeapScanState:
+            if (planstate->plan->parallel_aware)
                 ExecBitmapHeapEstimate((BitmapHeapScanState *) planstate,
                                        e->pcxt);
-                break;
-            default:
-                break;
-        }
+            break;
+        case T_SortState:
+            /* even when not parallel-aware */
+            ExecSortEstimate((SortState *) planstate, e->pcxt);
+            break;
+        default:
+            break;
     }

     return planstate_tree_walker(planstate, ExecParallelEstimate, e);
@@ -276,46 +282,52 @@ ExecParallelInitializeDSM(PlanState *planstate,
     d->nnodes++;

     /*
-     * Call initializers for parallel-aware plan nodes.
+     * Call initializers for DSM-using plan nodes.
      *
-     * Ordinary plan nodes won't do anything here, but parallel-aware plan
-     * nodes may need to initialize shared state in the DSM before parallel
-     * workers are available.  They can allocate the space they previously
+     * Most plan nodes won't do anything here, but plan nodes that allocated
+     * DSM may need to initialize shared state in the DSM before parallel
+     * workers are launched.  They can allocate the space they previously
      * estimated using shm_toc_allocate, and add the keys they previously
      * estimated using shm_toc_insert, in each case targeting pcxt->toc.
      */
-    if (planstate->plan->parallel_aware)
+    switch (nodeTag(planstate))
     {
-        switch (nodeTag(planstate))
-        {
-            case T_SeqScanState:
+        case T_SeqScanState:
+            if (planstate->plan->parallel_aware)
                 ExecSeqScanInitializeDSM((SeqScanState *) planstate,
                                          d->pcxt);
-                break;
-            case T_IndexScanState:
+            break;
+        case T_IndexScanState:
+            if (planstate->plan->parallel_aware)
                 ExecIndexScanInitializeDSM((IndexScanState *) planstate,
                                            d->pcxt);
-                break;
-            case T_IndexOnlyScanState:
+            break;
+        case T_IndexOnlyScanState:
+            if (planstate->plan->parallel_aware)
                 ExecIndexOnlyScanInitializeDSM((IndexOnlyScanState *) planstate,
                                                d->pcxt);
-                break;
-            case T_ForeignScanState:
+            break;
+        case T_ForeignScanState:
+            if (planstate->plan->parallel_aware)
                 ExecForeignScanInitializeDSM((ForeignScanState *) planstate,
                                              d->pcxt);
-                break;
-            case T_CustomScanState:
+            break;
+        case T_CustomScanState:
+            if (planstate->plan->parallel_aware)
                 ExecCustomScanInitializeDSM((CustomScanState *) planstate,
                                             d->pcxt);
-                break;
-            case T_BitmapHeapScanState:
+            break;
+        case T_BitmapHeapScanState:
+            if (planstate->plan->parallel_aware)
                 ExecBitmapHeapInitializeDSM((BitmapHeapScanState *) planstate,
                                             d->pcxt);
-                break;
-
-            default:
-                break;
-        }
+            break;
+        case T_SortState:
+            /* even when not parallel-aware */
+            ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
+            break;
+        default:
+            break;
     }

     return planstate_tree_walker(planstate, ExecParallelInitializeDSM, d);
@@ -642,6 +653,13 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
     planstate->worker_instrument->num_workers = instrumentation->num_workers;
     memcpy(&planstate->worker_instrument->instrument, instrument, ibytes);

+    /*
+     * Perform any node-type-specific work that needs to be done.  Currently,
+     * only Sort nodes need to do anything here.
+     */
+    if (IsA(planstate, SortState))
+        ExecSortRetrieveInstrumentation((SortState *) planstate);
+
     return planstate_tree_walker(planstate, ExecParallelRetrieveInstrumentation,
                                  instrumentation);
 }
@@ -801,35 +819,41 @@ ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
     if (planstate == NULL)
         return false;

-    /* Call initializers for parallel-aware plan nodes. */
-    if (planstate->plan->parallel_aware)
+    switch (nodeTag(planstate))
     {
-        switch (nodeTag(planstate))
-        {
-            case T_SeqScanState:
+        case T_SeqScanState:
+            if (planstate->plan->parallel_aware)
                 ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
-                break;
-            case T_IndexScanState:
+            break;
+        case T_IndexScanState:
+            if (planstate->plan->parallel_aware)
                 ExecIndexScanInitializeWorker((IndexScanState *) planstate, toc);
-                break;
-            case T_IndexOnlyScanState:
+            break;
+        case T_IndexOnlyScanState:
+            if (planstate->plan->parallel_aware)
                 ExecIndexOnlyScanInitializeWorker((IndexOnlyScanState *) planstate, toc);
-                break;
-            case T_ForeignScanState:
+            break;
+        case T_ForeignScanState:
+            if (planstate->plan->parallel_aware)
                 ExecForeignScanInitializeWorker((ForeignScanState *) planstate,
                                                 toc);
-                break;
-            case T_CustomScanState:
+            break;
+        case T_CustomScanState:
+            if (planstate->plan->parallel_aware)
                 ExecCustomScanInitializeWorker((CustomScanState *) planstate,
                                                toc);
-                break;
-            case T_BitmapHeapScanState:
+            break;
+        case T_BitmapHeapScanState:
+            if (planstate->plan->parallel_aware)
                 ExecBitmapHeapInitializeWorker(
                                                (BitmapHeapScanState *) planstate, toc);
-                break;
-            default:
-                break;
-        }
+            break;
+        case T_SortState:
+            /* even when not parallel-aware */
+            ExecSortInitializeWorker((SortState *) planstate, toc);
+            break;
+        default:
+            break;
     }

     return planstate_tree_walker(planstate, ExecParallelInitializeWorker, toc);
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index aae4150..060752c 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -15,6 +15,7 @@

 #include "postgres.h"

+#include "access/parallel.h"
 #include "executor/execdebug.h"
 #include "executor/nodeSort.h"
 #include "miscadmin.h"
@@ -127,6 +128,15 @@ ExecSort(PlanState *pstate)
         node->sort_Done = true;
         node->bounded_Done = node->bounded;
         node->bound_Done = node->bound;
+        if (node->shared_info && node->am_worker)
+        {
+            TupleSortInstrumentation *si;
+
+            Assert(IsParallelWorker());
+            Assert(ParallelWorkerNumber < node->shared_info->num_workers);
+            si = &node->shared_info->sinstrument[ParallelWorkerNumber];
+            tuplesort_get_stats(tuplesortstate, si);
+        }
         SO1_printf("ExecSort: %s\n", "sorting done");
     }

@@ -334,3 +344,90 @@ ExecReScanSort(SortState *node)
     else
         tuplesort_rescan((Tuplesortstate *) node->tuplesortstate);
 }
+
+/* ----------------------------------------------------------------
+ *                        Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *        ExecSortEstimate
+ *
+ *        Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecSortEstimate(SortState *node, ParallelContext *pcxt)
+{
+    Size        size;
+
+    /* don't need this if not instrumenting or no workers */
+    if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+        return;
+
+    size = mul_size(pcxt->nworkers, sizeof(TupleSortInstrumentation));
+    size = add_size(size, offsetof(SharedSortInfo, sinstrument));
+    shm_toc_estimate_chunk(&pcxt->estimator, size);
+    shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *        ExecSortInitializeDSM
+ *
+ *        Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecSortInitializeDSM(SortState *node, ParallelContext *pcxt)
+{
+    Size        size;
+
+    /* don't need this if not instrumenting or no workers */
+    if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+        return;
+
+    size = offsetof(SharedSortInfo, sinstrument)
+        + pcxt->nworkers * sizeof(TupleSortInstrumentation);
+    node->shared_info = shm_toc_allocate(pcxt->toc, size);
+    /* ensure any unfilled slots will contain zeroes */
+    memset(node->shared_info, 0, size);
+    node->shared_info->num_workers = pcxt->nworkers;
+    shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+                   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *        ExecSortInitializeWorker
+ *
+ *        Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecSortInitializeWorker(SortState *node, shm_toc *toc)
+{
+    node->shared_info =
+        shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id, true);
+    node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *        ExecSortRetrieveInstrumentation
+ *
+ *        Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecSortRetrieveInstrumentation(SortState *node)
+{
+    Size        size;
+    SharedSortInfo *si;
+
+    if (node->shared_info == NULL)
+        return;
+
+    size = offsetof(SharedSortInfo, sinstrument)
+        + node->shared_info->num_workers * sizeof(TupleSortInstrumentation);
+    si = palloc(size);
+    memcpy(si, node->shared_info, size);
+    node->shared_info = si;
+}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 59cd28e..ff0857e 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -3227,13 +3227,10 @@ tuplesort_restorepos(Tuplesortstate *state)
  *
  * This can be called after tuplesort_performsort() finishes to obtain
  * printable summary information about how the sort was performed.
- * spaceUsed is measured in kilobytes.
  */
 void
 tuplesort_get_stats(Tuplesortstate *state,
-                    const char **sortMethod,
-                    const char **spaceType,
-                    long *spaceUsed)
+                    TupleSortInstrumentation *stats)
 {
     /*
      * Note: it might seem we should provide both memory and disk usage for a
@@ -3246,35 +3243,68 @@ tuplesort_get_stats(Tuplesortstate *state,
      */
     if (state->tapeset)
     {
-        *spaceType = "Disk";
-        *spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
+        stats->spaceType = SORT_SPACE_TYPE_DISK;
+        stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
     }
     else
     {
-        *spaceType = "Memory";
-        *spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
+        stats->spaceType = SORT_SPACE_TYPE_MEMORY;
+        stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
     }

     switch (state->status)
     {
         case TSS_SORTEDINMEM:
             if (state->boundUsed)
-                *sortMethod = "top-N heapsort";
+                stats->sortMethod = SORT_TYPE_TOP_N_HEAPSORT;
             else
-                *sortMethod = "quicksort";
+                stats->sortMethod = SORT_TYPE_QUICKSORT;
             break;
         case TSS_SORTEDONTAPE:
-            *sortMethod = "external sort";
+            stats->sortMethod = SORT_TYPE_EXTERNAL_SORT;
             break;
         case TSS_FINALMERGE:
-            *sortMethod = "external merge";
+            stats->sortMethod = SORT_TYPE_EXTERNAL_MERGE;
             break;
         default:
-            *sortMethod = "still in progress";
+            stats->sortMethod = SORT_TYPE_STILL_IN_PROGRESS;
             break;
     }
 }

+/*
+ * Convert TuplesortMethod to a string.
+ */
+const char *
+tuplesort_method_name(TuplesortMethod m)
+{
+    switch (m)
+    {
+        case SORT_TYPE_STILL_IN_PROGRESS:
+            return "still in progress";
+        case SORT_TYPE_TOP_N_HEAPSORT:
+            return "top-N heapsort";
+        case SORT_TYPE_QUICKSORT:
+            return "quicksort";
+        case SORT_TYPE_EXTERNAL_SORT:
+            return "external sort";
+        case SORT_TYPE_EXTERNAL_MERGE:
+            return "external merge";
+    }
+
+    return "unknown";
+}
+
+/*
+ * Convert TuplesortSpaceType to a string.
+ */
+const char *
+tuplesort_space_type_name(TuplesortSpaceType t)
+{
+    Assert(t == SORT_SPACE_TYPE_DISK || t == SORT_SPACE_TYPE_MEMORY);
+    return t == SORT_SPACE_TYPE_DISK ? "Disk" : "Memory";
+}
+

 /*
  * Heap manipulation routines, per Knuth's Algorithm 5.2.3H.
diff --git a/src/include/executor/nodeSort.h b/src/include/executor/nodeSort.h
index ed0e9db..77ac065 100644
--- a/src/include/executor/nodeSort.h
+++ b/src/include/executor/nodeSort.h
@@ -14,6 +14,7 @@
 #ifndef NODESORT_H
 #define NODESORT_H

+#include "access/parallel.h"
 #include "nodes/execnodes.h"

 extern SortState *ExecInitSort(Sort *node, EState *estate, int eflags);
@@ -22,4 +23,10 @@ extern void ExecSortMarkPos(SortState *node);
 extern void ExecSortRestrPos(SortState *node);
 extern void ExecReScanSort(SortState *node);

+/* parallel instrumentation support */
+extern void ExecSortEstimate(SortState *node, ParallelContext *pcxt);
+extern void ExecSortInitializeDSM(SortState *node, ParallelContext *pcxt);
+extern void ExecSortInitializeWorker(SortState *node, shm_toc *toc);
+extern void ExecSortRetrieveInstrumentation(SortState *node);
+
 #endif                            /* NODESORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 15a8426..b66110e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1731,6 +1731,16 @@ typedef struct MaterialState
 } MaterialState;

 /* ----------------
+ *     Shared memory container for per-worker sort information
+ * ----------------
+ */
+typedef struct SharedSortInfo
+{
+    int            num_workers;
+    TupleSortInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
+} SharedSortInfo;
+
+/* ----------------
  *     SortState information
  * ----------------
  */
@@ -1744,6 +1754,8 @@ typedef struct SortState
     bool        bounded_Done;    /* value of bounded we did the sort with */
     int64        bound_Done;        /* value of bound we did the sort with */
     void       *tuplesortstate; /* private state of tuplesort.c */
+    bool        am_worker;        /* are we a worker? */
+    SharedSortInfo *shared_info;    /* one entry per worker */
 } SortState;

 /* ---------------------
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 28c168a..484fdac 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -32,6 +32,34 @@
 typedef struct Tuplesortstate Tuplesortstate;

 /*
+ * Data structures for reporting sort statistics.  Note that
+ * TupleSortInstrumentation can't contain any pointers because we
+ * sometimes put it in shared memory.
+ */
+typedef enum
+{
+    SORT_TYPE_STILL_IN_PROGRESS = 0,
+    SORT_TYPE_TOP_N_HEAPSORT,
+    SORT_TYPE_QUICKSORT,
+    SORT_TYPE_EXTERNAL_SORT,
+    SORT_TYPE_EXTERNAL_MERGE
+} TuplesortMethod;
+
+typedef enum
+{
+    SORT_SPACE_TYPE_DISK,
+    SORT_SPACE_TYPE_MEMORY
+} TuplesortSpaceType;
+
+typedef struct TupleSortInstrumentation
+{
+    TuplesortMethod sortMethod; /* sort algorithm used */
+    TuplesortSpaceType spaceType;    /* type of space spaceUsed represents */
+    long        spaceUsed;        /* space consumption, in kB */
+} TupleSortInstrumentation;
+
+
+/*
  * We provide multiple interfaces to what is essentially the same code,
  * since different callers have different data to be sorted and want to
  * specify the sort key information differently.  There are two APIs for
@@ -107,9 +135,9 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 extern void tuplesort_end(Tuplesortstate *state);

 extern void tuplesort_get_stats(Tuplesortstate *state,
-                    const char **sortMethod,
-                    const char **spaceType,
-                    long *spaceUsed);
+                    TupleSortInstrumentation *stats);
+extern const char *tuplesort_method_name(TuplesortMethod m);
+extern const char *tuplesort_space_type_name(TuplesortSpaceType t);

 extern int    tuplesort_merge_order(int64 allowedMem);

diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index f009c25..f3ebad5 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -1047,7 +1047,7 @@ drop function tattle(x int, y int);
 -- ANALYZE shows that a top-N sort was used.  We must suppress or filter away
 -- all the non-invariant parts of the EXPLAIN ANALYZE output.
 --
-create temp table sq_limit (pk int primary key, c1 int, c2 int);
+create table sq_limit (pk int primary key, c1 int, c2 int);
 insert into sq_limit values
     (1, 1, 1),
     (2, 2, 2),
@@ -1066,6 +1066,8 @@ begin
         select * from (select pk,c2 from sq_limit order by c1,pk) as x limit 3
     loop
         ln := regexp_replace(ln, 'Memory: \S*',  'Memory: xxx');
+        -- this case might occur if force_parallel_mode is on:
+        ln := regexp_replace(ln, 'Worker 0:  Sort Method',  'Sort Method');
         return next ln;
     end loop;
 end;
@@ -1090,3 +1092,4 @@ select * from (select pk,c2 from sq_limit order by c1,pk) as x limit 3;
 (3 rows)

 drop function explain_sq_limit();
+drop table sq_limit;
diff --git a/src/test/regress/sql/subselect.sql b/src/test/regress/sql/subselect.sql
index 9a14832..5ac8bad 100644
--- a/src/test/regress/sql/subselect.sql
+++ b/src/test/regress/sql/subselect.sql
@@ -547,7 +547,7 @@ drop function tattle(x int, y int);
 -- ANALYZE shows that a top-N sort was used.  We must suppress or filter away
 -- all the non-invariant parts of the EXPLAIN ANALYZE output.
 --
-create temp table sq_limit (pk int primary key, c1 int, c2 int);
+create table sq_limit (pk int primary key, c1 int, c2 int);
 insert into sq_limit values
     (1, 1, 1),
     (2, 2, 2),
@@ -567,6 +567,8 @@ begin
         select * from (select pk,c2 from sq_limit order by c1,pk) as x limit 3
     loop
         ln := regexp_replace(ln, 'Memory: \S*',  'Memory: xxx');
+        -- this case might occur if force_parallel_mode is on:
+        ln := regexp_replace(ln, 'Worker 0:  Sort Method',  'Sort Method');
         return next ln;
     end loop;
 end;
@@ -577,3 +579,5 @@ select * from explain_sq_limit();
 select * from (select pk,c2 from sq_limit order by c1,pk) as x limit 3;

 drop function explain_sq_limit();
+
+drop table sq_limit;
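
To illustrate, text-format EXPLAIN ANALYZE output for a parallel sort then looks roughly like this (an invented example: the table name, row counts, and kB figures are all made up, and the worker lines appear only when per-worker instrumentation was collected):

    =# EXPLAIN (ANALYZE, COSTS OFF) SELECT * FROM foo ORDER BY a;
     Gather Merge (actual time=... rows=1000000 loops=1)
       Workers Planned: 2
       Workers Launched: 2
       ->  Sort (actual time=... rows=333334 loops=3)
             Sort Key: a
             Sort Method: external merge  Disk: 8304kB
             Worker 0:  Sort Method: external merge  Disk: 8192kB
             Worker 1:  Sort Method: external merge  Disk: 8288kB
             ->  Parallel Seq Scan on foo (actual rows=333333 loops=3)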


Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From
Robert Haas
Date:
On Fri, Aug 25, 2017 at 5:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Aug 25, 2017 at 2:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Hmm, I'm not sure why SortInstrumentation belongs naturally to
>>> tuplesort.h but putting an array of them there would be a "gross
>>> abstraction violation".  Perhaps it would help to rename
>>> struct SharedSortInfo to SortInstrumentationArray, and change its
>>> field names to be less specific to the parallel-worker use case?
>
>> What other use case could there be?  I think an array of
>> SortInstrumentation objects intended to be stored in DSM is fairly
>> clearly a bit of executor-specific machinery and thus properly
>> declared along with the node that contains it.
>
> I'm not really convinced, but it's not worth arguing further.
>
> Here's a reviewed version of the second patch.  I fixed one bug
> (wrong explain group nesting) and made some additional cosmetic
> improvements beyond the struct relocation.

The capitalization of TupleSortInstrumentation is inconsistent with
TuplesortSpaceType, TuplesortMethod, and Tuplesortstate; I recommend
we lower-case the S.
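
If we do that, the struct ends up spelled consistently with its neighbors, along these lines (a sketch of the rename only, not taken from a posted patch):

    typedef struct TuplesortInstrumentation
    {
        TuplesortMethod sortMethod;     /* sort algorithm used */
        TuplesortSpaceType spaceType;   /* type of space spaceUsed represents */
        long        spaceUsed;          /* space consumption, in kB */
    } TuplesortInstrumentation;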

-     * Ordinary plan nodes won't do anything here, but parallel-aware plan
-     * nodes may need to initialize shared state in the DSM before parallel
-     * workers are available.  They can allocate the space they previously
+     * Most plan nodes won't do anything here, but plan nodes that allocated
+     * DSM may need to initialize shared state in the DSM before parallel
+     * workers are launched.  They can allocate the space they previously

On reading this again, it seems to me that the first instance of
"allocated" in this comment block needs to be changed to "estimated",
because otherwise saying they can allocate it later on is
unintelligible.
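
Concretely, the opening of that comment would then read something like this (my wording of the substitution):

    /*
     * Most plan nodes won't do anything here, but plan nodes that estimated
     * DSM space may need to initialize shared state in the DSM before
     * parallel workers are launched.  They can allocate the space they
     * previously estimated using shm_toc_allocate, ...
     */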

Other than that, LGTM.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Aug 25, 2017 at 5:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Here's a reviewed version of the second patch.  I fixed one bug
>> (wrong explain group nesting) and made some additional cosmetic
>> improvements beyond the struct relocation.

> The capitalization of TupleSortInstrumentation is inconsistent with
> TuplesortSpaceType, TuplesortMethod, and Tuplesortstate; I recommend
> we lower-case the S.

Sure, do it that way.

            regards, tom lane