Обсуждение: Planner producing 100% duplicate subplans when unneeded
This is all on Postgres 9.0.0: I'm working on the definition for a view that, as part of its output, includes three columns that each contain a sum of values in a table that are in one of a few different states -- essentially: for each parent, give me the sum of children in the foo state, a second sum of the children in the bar state, and a third sum of the children in the baz state. Thinking I'd be clever and avoid multiple almost-identical subqueries to for each SUM, I decided to do it in a single subquery that returns an array of all 3 sums and then split it out into its component parts higher up (see the example below if you don't understand what I mean). Unfortunately, the query planner doesn't quite handle this case: For each reference to the subquery value, it duplicates the subquery and thus its plan: CREATE TABLE parent ( id INT PRIMARY KEY ); INSERT INTO parent(id) SELECT GENERATE_SERIES(1, 1000); CREATE TABLE child ( parent_id INT, v1 INT, v2 INT, PRIMARY KEY(parent_id, v1) ); INSERT INTO child(parent_id, v1, v2) SELECT p.id, v1.id, RANDOM() * 500 FROM GENERATE_SERIES(1, 1000) AS p(id) CROSS JOIN GENERATE_SERIES(1, 20) AS v1(id) ; EXPLAIN SELECT *, child_data[1] AS foo, child_data[2] AS bar, child_data[3] AS baz FROM ( SELECT p.id, ( SELECT ARRAY[ SUM(c.v2), SUM(CASE WHEN c.v1 > 15 THEN c.v2 ELSE 0 END), SUM(CASE WHEN c.v1 > 5 THEN c.v2 ELSE 0 END) ] FROM child AS c WHERE c.parent_id=p.id ) AS child_data FROM parent AS p ) AS t; produces: Seq Scan on wings_sky.parent p (cost=0.00..161113.12 rows=1000 width=4) Output: p.id, (SubPlan 1), ((SubPlan 2))[1], ((SubPlan 3))[2], ((SubPlan 4))[3] SubPlan 1 -> Aggregate (cost=40.26..40.27 rows=1 width=8) Output: ARRAY[sum(c.v2), sum(CASE WHEN (c.v1 > 15) THEN c.v2 ELSE 0 END), sum(CASE WHEN (c.v1 > 5) THEN c.v2 ELSE 0 END)] -> Index Scan using child_pkey on wings_sky.child c (cost=0.00..40.10 rows=20 width=8) Output: c.parent_id, c.v1, c.v2 Index Cond: (c.parent_id = $0) SubPlan 2 -> Aggregate (cost=40.26..40.27 rows=1 width=8) Output: ARRAY[sum(c.v2), sum(CASE WHEN (c.v1 > 15) THEN c.v2 ELSE 0 END), sum(CASE WHEN (c.v1 > 5) THEN c.v2 ELSE 0 END)] -> Index Scan using child_pkey on wings_sky.child c (cost=0.00..40.10 rows=20 width=8) Output: c.parent_id, c.v1, c.v2 Index Cond: (c.parent_id = $0) SubPlan 3 -> Aggregate (cost=40.26..40.27 rows=1 width=8) Output: ARRAY[sum(c.v2), sum(CASE WHEN (c.v1 > 15) THEN c.v2 ELSE 0 END), sum(CASE WHEN (c.v1 > 5) THEN c.v2 ELSE 0 END)] -> Index Scan using child_pkey on wings_sky.child c (cost=0.00..40.10 rows=20 width=8) Output: c.parent_id, c.v1, c.v2 Index Cond: (c.parent_id = $0) SubPlan 4 -> Aggregate (cost=40.26..40.27 rows=1 width=8) Output: ARRAY[sum(c.v2), sum(CASE WHEN (c.v1 > 15) THEN c.v2 ELSE 0 END), sum(CASE WHEN (c.v1 > 5) THEN c.v2 ELSE 0 END)] -> Index Scan using child_pkey on wings_sky.child c (cost=0.00..40.10 rows=20 width=8) Output: c.parent_id, c.v1, c.v2 Index Cond: (c.parent_id = $0) There's another method of writing this that is more efficient in this PARTICULAR case: EXPLAIN VERBOSE SELECT parent.id, t.foo, t.bar, t.baz FROM parent INNER JOIN ( SELECT child.parent_id, SUM(child.v2) AS foo, SUM(CASE WHEN child.v1 > 15 THEN child.v2 ELSE 0 END) AS bar, SUM(CASE WHEN child.v1 > 5 THEN child.v2 ELSE 0 END) AS baz FROM child GROUP BY child.parent_id ) AS t ON t.parent_id=parent.id Hash Join (cost=547.12..556.62 rows=200 width=28) Output: parent.id, (sum(child.v2)), (sum(CASE WHEN (child.v1 > 15) THEN child.v2 ELSE 0 END)), (sum(CASE WHEN (child.v1 > 5) THEN child.v2 ELSE 0 END)) Hash Cond: (child.parent_id = parent.id) -> HashAggregate (cost=483.12..487.62 rows=200 width=12) Output: child.parent_id, sum(child.v2), sum(CASE WHEN (child.v1 > 15) THEN child.v2 ELSE 0 END), sum(CASE WHEN (child.v1 > 5) THEN child.v2 ELSE 0 END) -> Seq Scan on wings_sky.child (cost=0.00..291.06 rows=19206 width=12) Output: child.parent_id, child.v1, child.v2 -> Hash (cost=34.00..34.00 rows=2400 width=4) Output: parent.id -> Seq Scan on wings_sky.parent (cost=0.00..34.00 rows=2400 width=4) Output: parent.id However, the second plan performs poorly in cases in the case where parent_id can be more than one possible value (i.e. there is no WHERE parent_id=1 clause) -- it takes nearly as long in that instance as it does with no WHERE clause altogether. Both plans run the same speed with one parent_id. The first plan starts losing speed gradually as the number of parents increase; the second plan is either all-or-nothing. In the first case, it seems inefficient to duplicate the subplan for each reference -- I'd think the (corrected) plan should look something like this: Seq Scan on wings_sky.parent p (cost=0.00..161113.12 rows=1000 width=4) Output: p.id, (SubPlan 1), ((SubPlan 1))[1], ((SubPlan 1))[2], ((SubPlan 1))[3] SubPlan 1 -> Aggregate (cost=40.26..40.27 rows=1 width=8) Output: ARRAY[sum(c.v2), sum(CASE WHEN (c.v1 > 15) THEN c.v2 ELSE 0 END), sum(CASE WHEN (c.v1 > 5) THEN c.v2 ELSE 0 END)] -> Index Scan using child_pkey on wings_sky.child c (cost=0.00..40.10 rows=20 width=8) Output: c.parent_id, c.v1, c.v2 Index Cond: (c.parent_id = $0) Is there any chance this might be looked at in a future release? -- Daniel Grace AGE, LLC System Administrator and Software Developer dgrace@wingsnw.com // www.wingsnw.com
On Mon, Sep 27, 2010 at 5:09 PM, Daniel Grace <dgrace@wingsnw.com> wrote: > Is there any chance this might be looked at in a future release? This is another interesting example of a case where an inlining-type optimization (which is effectively what's happening here, I think) turns out to be a negative. We had one a while back that involved actual function inlining, which is not quite what's happening here, but it's close. It doesn't seem too hard to figure out whether or not inlining is a win (non-trivial subexpressions should probably never get duplicated), but nobody's gotten around to writing the logic to make it work yet. One useful technique is to stick "OFFSET 0" into the subquery; that prevents it from being inlined and gives you something more like the plan you were hoping for. Whether or not this will get looked at in a future release is a tough question to answer. It's possible that someone (most likely, Tom) will get motivated to fix this out of sheer annoyance with the current behavior, or will notice a way to improve it incidentally while making some other change. But of course the only way to make sure it gets fixed is to do it yourself (or pay someone to do it). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
As a theoretical question (I'm not savvy on Postgres's code but might be intrigued enough to beat on it anyways), is it feasible to do an additional pass on the query plan that essentially goes: - Are these two subplans identical? - Are they at the same part of the tree? and if both of these conditions are true, discard one subplan and rewrite all references to point to the other one? Assuming it IS possible, are there any particular situations where it wouldn't work? On Mon, Oct 4, 2010 at 11:47 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Sep 27, 2010 at 5:09 PM, Daniel Grace <dgrace@wingsnw.com> wrote: >> Is there any chance this might be looked at in a future release? > > This is another interesting example of a case where an inlining-type > optimization (which is effectively what's happening here, I think) > turns out to be a negative. =A0We had one a while back that involved > actual function inlining, which is not quite what's happening here, > but it's close. =A0It doesn't seem too hard to figure out whether or not > inlining is a win (non-trivial subexpressions should probably never > get duplicated), but nobody's gotten around to writing the logic to > make it work yet. =A0One useful technique is to stick "OFFSET 0" into > the subquery; that prevents it from being inlined and gives you > something more like the plan you were hoping for. > > Whether or not this will get looked at in a future release is a tough > question to answer. =A0It's possible that someone (most likely, Tom) > will get motivated to fix this out of sheer annoyance with the current > behavior, or will notice a way to improve it incidentally while making > some other change. =A0But of course the only way to make sure it gets > fixed is to do it yourself (or pay someone to do it). > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise Postgres Company > --=20 Daniel Grace AGE, LLC System Administrator and Software Developer dgrace@wingsnw.com // (425)327-0079 // www.wingsnw.com
On Mon, Oct 4, 2010 at 4:15 PM, Daniel Grace <dgrace@wingsnw.com> wrote: > As a theoretical question (I'm not savvy on Postgres's code but might > be intrigued enough to beat on it anyways), is it feasible to do an > additional pass on the query plan that essentially goes: > > - Are these two subplans identical? > - Are they at the same part of the tree? > > and if both of these conditions are true, discard one subplan and > rewrite all references to point to the other one? > > Assuming it IS possible, are there any particular situations where it > wouldn't work? Well, volatile functions would be a problem, but I don't think that's really the way to go anyway. I think that deciding whether or not to collapse subqueries into the main query (which is what's happening here) seems like it could be done by counting the number of times any given subexpression is referenced, which seems like it would be a lot cheaper than checking things against each other pairwise to see if they match up. Slowing down the planner to detect cases like this wouldn't be very appealing: SELECT (SELECT SUM(a) FROM bob, SELECT SUM(b) FROM bob, SELECT SUM(c) FROM bob); ...because very few people are going to write that query that way anyway, and unless it's almost free to notice the duplication, it doesn't make sense to spend planning time on it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Sep 27, 2010 at 5:09 PM, Daniel Grace <dgrace@wingsnw.com> wrote: >> Is there any chance this might be looked at in a future release? > This is another interesting example of a case where an inlining-type > optimization (which is effectively what's happening here, I think) > turns out to be a negative. We had one a while back that involved > actual function inlining, which is not quite what's happening here, > but it's close. It doesn't seem too hard to figure out whether or not > inlining is a win (non-trivial subexpressions should probably never > get duplicated), but nobody's gotten around to writing the logic to > make it work yet. Actually this case has some history behind it. The reason that the sub-sub-query gets duplicated is that we flatten the sub-query, resulting in duplicating its output expressions wherever they're referenced. Now before you say that that's stupid, please notice that this case got disabled in 7.3 as a bug workaround, and that almost immediately produced howls of pain: http://archives.postgresql.org/pgsql-general/2002-12/msg00410.php http://archives.postgresql.org/pgsql-performance/2002-12/msg00214.php so I undid it as soon as I could: http://git.postgresql.org/gitweb?p=postgresql.git;a=commitdiff;h=de97072e3c88e104a55b0d5c67477f1b0097c003 While just shutting off the pull-up unconditionally would merely require reintroducing a contains_subplans() restriction in is_simple_subquery(), I'm unwilling to do that because of the earlier complaints. And "figuring out whether it's a win" is a lot harder than you opine above: at the point where this decision has to be taken, we have no cost information whatsoever, and not even any clear idea whether the subquery outputs are referenced at all let alone multiply referenced. My thought about the current shortest path to a solution would be to wrap subquery-containing output expressions in PlaceHolderVars. Per this comment elsewhere in is_simple_subquery(), /* * Don't pull up a subquery that has any volatile functions in its * targetlist. Otherwise we might introduce multiple evaluations of these * functions, if they get copied to multiple places in the upper query, * leading to surprising results. (Note: the PlaceHolderVar mechanism * doesn't quite guarantee single evaluation; else we could pull up anyway * and just wrap such items in PlaceHolderVars ...) */ that doesn't entirely work at the moment, but it might work if we were to rejigger the PHV mechanism a bit. This would be attractive because the added overhead of a PHV would be nearly nil, so we could afford to do this without making any predictions about relative costs. As a workaround until somebody gets around to looking at that, I'd suggest that Daniel plaster his sub-query with the good old all-purpose optimization fence, "OFFSET 0". That will prevent flattening and thus prevent the sub-sub-query from getting duplicated. regards, tom lane