Discussion: plan_rows confusion with parallel queries
Hi,

while eye-balling some explain plans for parallel queries, I got a bit
confused by the row count estimates. I wonder whether I'm alone.

Consider for example a simple seq scan query, which in a non-parallel
explain looks like this:

                              QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on tables t  (cost=0.00..16347.60 rows=317160 width=356)
                       (actual rows=317160 loops=1)
 Planning time: 0.173 ms
 Execution time: 47.707 ms
(3 rows)

but a parallel plan looks like this:

                              QUERY PLAN
---------------------------------------------------------------------
 Gather  (cost=0.00..14199.10 rows=317160 width=356)
         (actual rows=317160 loops=1)
   Workers Planned: 3
   Workers Launched: 3
   ->  Parallel Seq Scan on tables t  (cost=... rows=102310 width=356)
                                      (actual rows=79290 loops=4)
 Planning time: 0.209 ms
 Execution time: 150.812 ms
(6 rows)

Now, for actual rows we can simply do 79290 * 4 = 317160, and we get the
correct number of rows produced by the plan (i.e. matching the
non-parallel query). But for the estimate it doesn't work like that:

    102310 * 4 = 409240

which is ~30% above the actual estimate. It's clear why this is
happening - when computing plan_rows, we don't count the leader as a
full worker, but use this:

    leader_contribution = 1.0 - (0.3 * path->parallel_workers);

so with 3 workers, the leader is only worth ~0.1 of a worker:

    102310 * 3.1 = 317161

It's fairly easy to spot this here, because the Gather node is right
above the Parallel Seq Scan, and the values in the Gather node are
accurate. But in many plans the Gather will not be immediately above
the node (e.g. there may be a parallel aggregate in between).

Of course, the fact that we use the planned number of workers when
computing plan_rows, but the actual number of workers for the actually
produced rows, makes this even more confusing.

BTW is it really a good idea to use nloops to track the number of
workers executing a given node? How will that work once we get parallel
nested loops and index scans?
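For illustration, the scaling can be sketched in Python. This is a hypothetical rendering of the logic around the C fragment quoted above, not the actual PostgreSQL source; the function name parallel_divisor is my own label:

```python
# Hypothetical sketch of how the planner scales a parallel path's row
# estimate: the leader counts as a partial worker, contributing less
# as more workers are planned (leader_contribution = 1.0 - 0.3 * n).

def parallel_divisor(parallel_workers):
    """Effective number of processes: workers plus a partial leader."""
    leader_contribution = 1.0 - 0.3 * parallel_workers
    if leader_contribution > 0:
        return parallel_workers + leader_contribution
    return float(parallel_workers)

# With 3 workers the leader is only worth ~0.1 of a worker:
assert abs(parallel_divisor(3) - 3.1) < 1e-9

# So a total estimate of 317160 rows becomes a per-worker estimate
# of ~102310, matching the Parallel Seq Scan node above:
assert round(317160 / parallel_divisor(3)) == 102310
```

Multiplying the per-worker estimate by the planned worker count (4 processes) therefore overshoots; only dividing/multiplying by this fractional divisor round-trips.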
regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> while eye-balling some explain plans for parallel queries, I got a bit
> confused by the row count estimates. I wonder whether I'm alone.

I got confused by that a minute ago, so no you're not alone.  The problem
is even worse in join cases.  For example:

 Gather  (cost=34332.00..53265.35 rows=100 width=8)
   Workers Planned: 2
   ->  Hash Join  (cost=33332.00..52255.35 rows=100 width=8)
         Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
         ->  Append  (cost=0.00..8614.96 rows=417996 width=8)
               ->  Parallel Seq Scan on pp  (cost=0.00..8591.67 rows=416667 width=8)
               ->  Parallel Seq Scan on pp1  (cost=0.00..23.29 rows=1329 width=8)
         ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8)
               ->  Seq Scan on cc  (cost=0.00..14425.00 rows=1000000 width=8)

There are actually 1000000 rows in pp, and none in pp1.  I'm not bothered
particularly by the nonzero estimate for pp1, because I know where that
came from, but I'm not very happy that nowhere here does it look like
it's estimating a million-plus rows going into the join.

			regards, tom lane
On 11/02/2016 09:00 PM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> while eye-balling some explain plans for parallel queries, I got a bit
>> confused by the row count estimates. I wonder whether I'm alone.
>
> I got confused by that a minute ago, so no you're not alone.  The problem
> is even worse in join cases.  For example:
>
> Gather  (cost=34332.00..53265.35 rows=100 width=8)
>   Workers Planned: 2
>   ->  Hash Join  (cost=33332.00..52255.35 rows=100 width=8)
>         Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
>         ->  Append  (cost=0.00..8614.96 rows=417996 width=8)
>               ->  Parallel Seq Scan on pp  (cost=0.00..8591.67 rows=416667 width=8)
>               ->  Parallel Seq Scan on pp1  (cost=0.00..23.29 rows=1329 width=8)
>         ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8)
>               ->  Seq Scan on cc  (cost=0.00..14425.00 rows=1000000 width=8)
>
> There are actually 1000000 rows in pp, and none in pp1.  I'm not bothered
> particularly by the nonzero estimate for pp1, because I know where that
> came from, but I'm not very happy that nowhere here does it look like
> it's estimating a million-plus rows going into the join.
>

Yeah. I wonder how tools visualizing explain plans are going to compute
the time spent in a given node (i.e. excluding the time spent in child
nodes), or the expected cost of that node.
So far we could do something like

    self_time = total_time - child_node_time * nloops

and for example in this plan it's pretty clear we spend ~130ms in the
Aggregate:

                                 QUERY PLAN
----------------------------------------------------------------------------
 Aggregate  (cost=17140.50..17140.51 rows=1 width=8)
            (actual time=306.675..306.675 rows=1 loops=1)
   ->  Seq Scan on tables  (cost=0.00..16347.60 rows=317160 width=0)
                           (actual time=0.188..170.993 rows=317160 loops=1)
 Planning time: 0.201 ms
 Execution time: 306.860 ms
(4 rows)

But in parallel plans it can easily happen that

    child_node_time * nloops > total_time

Consider for example this parallel plan:

                                 QUERY PLAN
----------------------------------------------------------------------------
 Finalize Aggregate  (cost=15455.19..15455.20 rows=1 width=8)
                     (actual time=107.636..107.636 rows=1 loops=1)
   ->  Gather  (cost=15454.87..15455.18 rows=3 width=8)
               (actual time=107.579..107.629 rows=4 loops=1)
         Workers Planned: 3
         Workers Launched: 3
         ->  Partial Aggregate  (cost=14454.87..14454.88 rows=1 ...)
                                (actual time=103.895..103.895 rows=1 loops=4)
               ->  Parallel Seq Scan on tables
                       (cost=0.00..14199.10 rows=102310 width=0)
                       (actual time=0.059..59.217 rows=79290 loops=4)
 Planning time: 0.052 ms
 Execution time: 109.250 ms
(8 rows)

Reading explains for parallel plans will always be complicated, but
perhaps overloading nloops like this makes it more complicated than
necessary?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
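The breakdown problem can be sketched in Python (a hypothetical illustration of the naive formula visualizers use, not PostgreSQL code; the numbers are taken from the two plans above):

```python
# Naive per-node "self time" as explain-plan visualizers compute it:
# total time minus child time, with the child time scaled by its loops.

def self_time(total_time_ms, child_total_ms, child_nloops):
    """Time attributed to a node itself, excluding its children."""
    return total_time_ms - child_total_ms * child_nloops

# Non-parallel plan: Aggregate total 306.675 ms, Seq Scan 170.993 ms,
# loops=1 - the formula gives a sensible ~130 ms for the Aggregate.
agg = self_time(306.675, 170.993, 1)
assert abs(agg - 135.682) < 1e-6

# Parallel plan: Gather total 107.629 ms, but the Partial Aggregate
# reports 103.895 ms with loops=4 (leader + 3 workers).  The children
# ran concurrently, not sequentially, so the formula goes negative.
gather = self_time(107.629, 103.895, 4)
assert gather < 0
```

The negative result shows why per-loop times cannot simply be multiplied by nloops when the "loops" are concurrent workers rather than sequential executions.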
On 11/02/2016 11:56 PM, Tomas Vondra wrote:
> On 11/02/2016 09:00 PM, Tom Lane wrote:
>> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>> while eye-balling some explain plans for parallel queries, I got a bit
>>> confused by the row count estimates. I wonder whether I'm alone.
>>
>> I got confused by that a minute ago, so no you're not alone.  The problem
>> is even worse in join cases.  For example:
>>
>> Gather  (cost=34332.00..53265.35 rows=100 width=8)
>>   Workers Planned: 2
>>   ->  Hash Join  (cost=33332.00..52255.35 rows=100 width=8)
>>         Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
>>         ->  Append  (cost=0.00..8614.96 rows=417996 width=8)
>>               ->  Parallel Seq Scan on pp  (cost=0.00..8591.67 rows=416667 width=8)
>>               ->  Parallel Seq Scan on pp1  (cost=0.00..23.29 rows=1329 width=8)
>>         ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8)
>>               ->  Seq Scan on cc  (cost=0.00..14425.00 rows=1000000 width=8)
>>
>> There are actually 1000000 rows in pp, and none in pp1.  I'm not bothered
>> particularly by the nonzero estimate for pp1, because I know where that
>> came from, but I'm not very happy that nowhere here does it look like
>> it's estimating a million-plus rows going into the join.
>>

Although - it is estimating 1M rows, but only "per worker" estimates are
shown, and because there are 2 workers planned it says 1M/2.4, which is
the 416k. I agree it's a bit unclear, but at least it's consistent with
how we treat loops (i.e. that the numbers are per loop).
But there's more fun with joins - consider for example this simple join:

                                  QUERY PLAN
------------------------------------------------------------------------------
 Gather  (cost=19515.96..43404.82 rows=96957 width=12)
         (actual time=295.167..746.312 rows=99999 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Hash Join  (cost=18515.96..32709.12 rows=96957 width=12)
                  (actual time=249.281..670.309 rows=33333 loops=3)
         Hash Cond: (t2.a = t1.a)
         ->  Parallel Seq Scan on t2  (cost=0.00..8591.67 rows=416667 width=8)
                                      (actual time=0.100..184.315 rows=333333 loops=3)
         ->  Hash  (cost=16925.00..16925.00 rows=96957 width=8)
                   (actual time=246.760..246.760 rows=99999 loops=3)
               Buckets: 131072  Batches: 2  Memory Usage: 2976kB
               ->  Seq Scan on t1  (cost=0.00..16925.00 rows=96957 width=8)
                                   (actual time=0.065..178.385 rows=99999 loops=3)
                     Filter: (b < 100000)
                     Rows Removed by Filter: 900001
 Planning time: 0.763 ms
 Execution time: 793.653 ms
(13 rows)

Suddenly we don't show per-worker estimates for the hash join - both the
Hash Join and the Gather have exactly the same cardinality estimate.

Now, let's try forcing Nested Loops and see what happens:

                                  QUERY PLAN
-----------------------------------------------------------------------------
 Gather  (cost=1000.42..50559.65 rows=96957 width=12)
         (actual time=0.610..203.694 rows=99999 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Nested Loop  (cost=0.42..39863.95 rows=96957 width=12)
                    (actual time=0.222..182.755 rows=33333 loops=3)
         ->  Parallel Seq Scan on t1  (cost=0.00..9633.33 rows=40399 width=8)
                                      (actual time=0.030..40.358 rows=33333 loops=3)
               Filter: (b < 100000)
               Rows Removed by Filter: 300000
         ->  Index Scan using t2_a_idx on t2  (cost=0.42..0.74 rows=1 width=8)
                                              (actual time=0.002..0.002 rows=1 loops=99999)
               Index Cond: (a = t1.a)
 Planning time: 0.732 ms
 Execution time: 250.707 ms
(11 rows)

So, different join method but same result - 2 workers, loops=3.
But let's try with small tables (100k rows instead of 1M rows):

                                 QUERY PLAN
----------------------------------------------------------------------------
 Gather  (cost=0.29..36357.94 rows=100118 width=12)
         (actual time=13.219..589.723 rows=100000 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   Single Copy: true
   ->  Nested Loop  (cost=0.29..36357.94 rows=100118 width=12)
                    (actual time=0.288..442.821 rows=100000 loops=1)
         ->  Seq Scan on t1  (cost=0.00..1444.18 rows=100118 width=8)
                             (actual time=0.148..49.308 rows=100000 loops=1)
         ->  Index Scan using t2_a_idx on t2  (cost=0.29..0.34 rows=1 width=8)
                                              (actual time=0.002..0.002 rows=1 loops=100000)
               Index Cond: (a = t1.a)
 Planning time: 0.483 ms
 Execution time: 648.941 ms
(10 rows)

Suddenly, we get nworkers=1 with loops=1 (and not nworkers+1 as before).
FWIW I've only seen this with force_parallel_mode=on, and the row counts
are correct, so perhaps that's OK. single_copy seems a bit
underdocumented, though.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On 11/02/2016 11:56 PM, Tomas Vondra wrote:
>> On 11/02/2016 09:00 PM, Tom Lane wrote:
>>> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>>> while eye-balling some explain plans for parallel queries, I got a bit
>>>> confused by the row count estimates. I wonder whether I'm alone.

>>> I got confused by that a minute ago, so no you're not alone.  The problem
>>> is even worse in join cases.  For example:

>>> Gather  (cost=34332.00..53265.35 rows=100 width=8)
>>>   Workers Planned: 2
>>>   ->  Hash Join  (cost=33332.00..52255.35 rows=100 width=8)
>>>         Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
>>>         ->  Append  (cost=0.00..8614.96 rows=417996 width=8)
>>>               ->  Parallel Seq Scan on pp  (cost=0.00..8591.67 rows=416667 width=8)
>>>               ->  Parallel Seq Scan on pp1  (cost=0.00..23.29 rows=1329 width=8)
>>>         ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8)
>>>               ->  Seq Scan on cc  (cost=0.00..14425.00 rows=1000000 width=8)

>>> There are actually 1000000 rows in pp, and none in pp1.  I'm not bothered
>>> particularly by the nonzero estimate for pp1, because I know where that
>>> came from, but I'm not very happy that nowhere here does it look like
>>> it's estimating a million-plus rows going into the join.

> Although - it is estimating 1M rows, but only "per worker" estimates are
> shown, and because there are 2 workers planned it says 1M/2.4 which is
> the 416k. I agree it's a bit unclear, but at least it's consistent with
> how we treat loops (i.e. that the numbers are per loop).

Well, it's not *that* consistent.  If we were estimating all the numbers
underneath the Gather as being per-worker numbers, that would make some
amount of sense.  But neither the other seqscan, nor the hash on it, nor
the hashjoin's output count are scaled that way.  It's very hard to call
the above display anything but flat-out broken.

> But there's more fun with joins - consider for example this simple join:
> ...
> Suddenly we don't show per-worker estimates for the hash join - both the
> Hash Join and the Gather have exactly the same cardinality estimate.

Yeah.  That doesn't seem to be quite the same problem as in my example,
but it's about as confused.

Maybe we need to bite the bullet and add a "number of workers" field to
the estimated and actual counts.  Not sure how much that helps for the
partial-count-for-the-leader issue, though.

			regards, tom lane
On Wed, Nov 2, 2016 at 2:42 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> BTW is it really a good idea to use nloops to track the number of workers
> executing a given node? How will that work once we get parallel nested
> loops and index scans?

We already have parallel nested loops with inner index scans.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 2, 2016 at 4:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> while eye-balling some explain plans for parallel queries, I got a bit
>> confused by the row count estimates. I wonder whether I'm alone.
>
> I got confused by that a minute ago, so no you're not alone.  The problem
> is even worse in join cases.  For example:
>
> Gather  (cost=34332.00..53265.35 rows=100 width=8)
>   Workers Planned: 2
>   ->  Hash Join  (cost=33332.00..52255.35 rows=100 width=8)
>         Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
>         ->  Append  (cost=0.00..8614.96 rows=417996 width=8)
>               ->  Parallel Seq Scan on pp  (cost=0.00..8591.67 rows=416667 width=8)
>               ->  Parallel Seq Scan on pp1  (cost=0.00..23.29 rows=1329 width=8)
>         ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8)
>               ->  Seq Scan on cc  (cost=0.00..14425.00 rows=1000000 width=8)
>
> There are actually 1000000 rows in pp, and none in pp1.  I'm not bothered
> particularly by the nonzero estimate for pp1, because I know where that
> came from, but I'm not very happy that nowhere here does it look like
> it's estimating a million-plus rows going into the join.

I welcome suggestions for improvement, but you will note that if the
row count didn't reflect some kind of guess about the number of rows
that each individual worker will see, the costing would be hopelessly
broken.  The cost needs to reflect a guess about the time the query
will finish, not the total amount of effort expended.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 2, 2016 at 10:44 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Although - it is estimating 1M rows, but only "per worker" estimates are
> shown, and because there are 2 workers planned it says 1M/2.4 which is the
> 416k. I agree it's a bit unclear, but at least it's consistent with how we
> treat loops (i.e. that the numbers are per loop).

Right.  Which I think was a horrible decision.  I think that it would
be best to change EXPLAIN so that the row counts and costs are never
divided by nloops.  That would be a backward-incompatible change, but
I think it would be worth it.  What you typically want to understand
is the total effort expended in a particular plan node, and the
current system makes that incredibly difficult to understand,
especially because we then round off the row count estimates to the
nearest integer, so that you can't even reverse the division if you
want to (which you always do).

> But there's more fun with joins - consider for example this simple join:
>
>                                   QUERY PLAN
> ------------------------------------------------------------------------------
>  Gather  (cost=19515.96..43404.82 rows=96957 width=12)
>          (actual time=295.167..746.312 rows=99999 loops=1)
>    Workers Planned: 2
>    Workers Launched: 2
>    ->  Hash Join  (cost=18515.96..32709.12 rows=96957 width=12)
>                   (actual time=249.281..670.309 rows=33333 loops=3)
>          Hash Cond: (t2.a = t1.a)
>          ->  Parallel Seq Scan on t2  (cost=0.00..8591.67 rows=416667 width=8)
>                                       (actual time=0.100..184.315 rows=333333 loops=3)
>          ->  Hash  (cost=16925.00..16925.00 rows=96957 width=8)
>                    (actual time=246.760..246.760 rows=99999 loops=3)
>                Buckets: 131072  Batches: 2  Memory Usage: 2976kB
>                ->  Seq Scan on t1  (cost=0.00..16925.00 rows=96957 width=8)
>                                    (actual time=0.065..178.385 rows=99999 loops=3)
>                      Filter: (b < 100000)
>                      Rows Removed by Filter: 900001
>  Planning time: 0.763 ms
>  Execution time: 793.653 ms
> (13 rows)
>
> Suddenly we don't show per-worker estimates for the hash join - both the
> Hash Join and the Gather have exactly the same cardinality estimate.

I'm not sure why that's happening, but I haven't made any changes to
the costing for a node like hash join.  It doesn't treat the parallel
sequential scan that is coming as its first input any differently than
it would if that were a non-parallel plan.  It's just costing the join
normally, based on an input row count that is lower than what it would
be if it were going to see every row from t2 rather than only some of
them.

> So, different join method but same result - 2 workers, loops=3. But let's
> try with small tables (100k rows instead of 1M rows):
>
>                                  QUERY PLAN
> ----------------------------------------------------------------------------
>  Gather  (cost=0.29..36357.94 rows=100118 width=12)
>          (actual time=13.219..589.723 rows=100000 loops=1)
>    Workers Planned: 1
>    Workers Launched: 1
>    Single Copy: true
>    ->  Nested Loop  (cost=0.29..36357.94 rows=100118 width=12)
>                     (actual time=0.288..442.821 rows=100000 loops=1)
>          ->  Seq Scan on t1  (cost=0.00..1444.18 rows=100118 width=8)
>                              (actual time=0.148..49.308 rows=100000 loops=1)
>          ->  Index Scan using t2_a_idx on t2  (cost=0.29..0.34 rows=1 width=8)
>                                               (actual time=0.002..0.002 rows=1 loops=100000)
>                Index Cond: (a = t1.a)
>  Planning time: 0.483 ms
>  Execution time: 648.941 ms
> (10 rows)
>
> Suddenly, we get nworkers=1 with loops=1 (and not nworkers+1 as before).
> FWIW I've only seen this with force_parallel_mode=on, and the row counts are
> correct, so perhaps that's OK. single_copy seems a bit underdocumented,
> though.

This is certainly entirely as expected.  Single-copy means that there's
one process running the non-parallel plan beneath it, and that's it.
So the Gather is just a pass-through node here, like a Materialize or
Sort: the number of input rows and the number of output rows have to
be the same.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
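The rounding problem Robert describes - dividing displayed row counts by nloops and rounding to an integer - can be sketched like this (a hypothetical illustration; displayed_rows is my own helper, not EXPLAIN's actual code):

```python
# Hypothetical sketch of why per-loop display loses information:
# EXPLAIN ANALYZE shows row counts averaged over nloops, rounded to
# an integer, so multiplying back does not recover the true total.

def displayed_rows(total_rows, nloops):
    """Per-loop average row count, rounded, as EXPLAIN displays it."""
    return round(total_rows / nloops)

# A true total of 100000 rows over 3 loops displays as 33333...
shown = displayed_rows(100000, 3)
assert shown == 33333

# ...but reversing the division gives 99999, not the original 100000:
assert shown * 3 == 99999
```

With sequential loops the error is usually small, but the information is gone either way, which is exactly the complaint about not being able to reverse the division.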
On Wed, Nov 2, 2016 at 10:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> I got confused by that a minute ago, so no you're not alone.  The problem
>>>> is even worse in join cases.  For example:
>>>> Gather  (cost=34332.00..53265.35 rows=100 width=8)
>>>>   Workers Planned: 2
>>>>   ->  Hash Join  (cost=33332.00..52255.35 rows=100 width=8)
>>>>         Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
>>>>         ->  Append  (cost=0.00..8614.96 rows=417996 width=8)
>>>>               ->  Parallel Seq Scan on pp  (cost=0.00..8591.67 rows=416667 width=8)
>>>>               ->  Parallel Seq Scan on pp1  (cost=0.00..23.29 rows=1329 width=8)
>>>>         ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8)
>>>>               ->  Seq Scan on cc  (cost=0.00..14425.00 rows=1000000 width=8)

>> Although - it is estimating 1M rows, but only "per worker" estimates are
>> shown, and because there are 2 workers planned it says 1M/2.4 which is
>> the 416k. I agree it's a bit unclear, but at least it's consistent with
>> how we treat loops (i.e. that the numbers are per loop).
>
> Well, it's not *that* consistent.  If we were estimating all the numbers
> underneath the Gather as being per-worker numbers, that would make some
> amount of sense.  But neither the other seqscan, nor the hash on it, nor
> the hashjoin's output count are scaled that way.  It's very hard to call
> the above display anything but flat-out broken.

While investigating why Rushabh Lathia's Gather Merge patch sometimes
fails to pick a Gather Merge plan even when it really ought to do so,
I ran smack into this problem.  I discovered that this is more than a
cosmetic issue.  The costing itself is actually badly broken.

In the single-table case, when you have just

...
Gather
  ->  Parallel Seq Scan
...

the Parallel Seq Scan node reflects a per-worker row estimate, and the
Gather node reflects a total row estimate.
But in the join case, as shown above, the Gather thinks that the total
number of rows which it will produce is equal to the number that will
be produced by one single worker, which is crap, and the cost of doing
the join in parallel is based on the per-worker rather than the total
number, which is crappier.  The difference in cost between the Gather
and the underlying join in the above example is exactly 1010, namely
1000 for parallel_setup_cost and 100 tuples at 0.1 per tuple, even
though 100 is the number of tuples per-worker, not the total number.
That's really not good.

I probably should have realized this when I looked at this thread the
first time, but I somehow got it into my head that this was just a
complaint about the imperfections of the display (which is indeed
imperfect) and failed to realize that the same report was also
pointing to an actual costing bug.  I apologize for that.

The reason why this is happening is that final_cost_nestloop(),
final_cost_hashjoin(), and final_cost_mergejoin() don't care a whit
about whether the path they are generating is partial.  They apply the
row estimate for the joinrel itself to every such path generated for
the join, except for parameterized paths which are a special case.  I
think this generally has the effect of discouraging parallel joins,
because the inflated row count also inflates the join cost.  I think
the right thing to do is probably to scale the row count estimate for
the joinrel's partial paths by the leader_contribution value computed
in cost_seqscan.

Despite my general hatred of back-patching things that cause plan
changes, I'm inclined to think the fix for this should be back-patched
to 9.6, because this is really a brown-paper-bag bug.  If the
consensus is otherwise I will of course defer to that consensus.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
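The 1010 figure can be checked by hand; here is a hypothetical Python sketch using the costs from the plan upthread and the default values of the two GUCs (the variable names are my own, not planner symbols):

```python
# Hypothetical check of the cost arithmetic described above: the
# Gather's total cost exceeds the join's by parallel_setup_cost plus
# parallel_tuple_cost times the row estimate - and that row estimate
# is the per-worker count (100), not the total.

parallel_setup_cost = 1000.0   # PostgreSQL default
parallel_tuple_cost = 0.1      # PostgreSQL default

join_total_cost = 52255.35     # Hash Join, from the plan upthread
gather_total_cost = 53265.35   # Gather, from the plan upthread
rows = 100                     # per-worker estimate used as the total

gather_overhead = parallel_setup_cost + parallel_tuple_cost * rows
assert abs(gather_overhead - 1010.0) < 1e-9
assert abs((gather_total_cost - join_total_cost) - gather_overhead) < 1e-6
```

The point of the check is that the 0.1-per-tuple charge was applied to only 100 tuples, i.e. the Gather was costed as if it forwarded one worker's share of the rows rather than all of them.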
On Wed, Jan 11, 2017 at 1:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Well, it's not *that* consistent.  If we were estimating all the numbers
>> underneath the Gather as being per-worker numbers, that would make some
>> amount of sense.  But neither the other seqscan, nor the hash on it, nor
>> the hashjoin's output count are scaled that way.  It's very hard to call
>> the above display anything but flat-out broken.
>
> While investigating why Rushabh Lathia's Gather Merge patch sometimes
> fails to pick a Gather Merge plan even when it really ought to do so,
> I ran smack into this problem.  I discovered that this is more than a
> cosmetic issue.  The costing itself is actually badly broken.
>
> The reason why this is happening is that final_cost_nestloop(),
> final_cost_hashjoin(), and final_cost_mergejoin() don't care a whit
> about whether the path they are generating is partial.  They apply the
> row estimate for the joinrel itself to every such path generated for
> the join, except for parameterized paths which are a special case.  I
> think this generally has the effect of discouraging parallel joins,
> because the inflated row count also inflates the join cost.  I think
> the right thing to do is probably to scale the row count estimate for
> the joinrel's partial paths by the leader_contribution value computed
> in cost_seqscan.
>
> Despite my general hatred of back-patching things that cause plan
> changes, I'm inclined to think the fix for this should be back-patched
> to 9.6, because this is really a brown-paper-bag bug.  If the
> consensus is otherwise I will of course defer to that consensus.

And here is a patch which seems to fix the problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Attachments
On Wed, Jan 11, 2017 at 4:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> While investigating why Rushabh Lathia's Gather Merge patch sometimes
>> fails to pick a Gather Merge plan even when it really ought to do so,
>> I ran smack into this problem.  I discovered that this is more than a
>> cosmetic issue.  The costing itself is actually badly broken.
>>
>> The reason why this is happening is that final_cost_nestloop(),
>> final_cost_hashjoin(), and final_cost_mergejoin() don't care a whit
>> about whether the path they are generating is partial.  They apply the
>> row estimate for the joinrel itself to every such path generated for
>> the join, except for parameterized paths which are a special case.  I
>> think this generally has the effect of discouraging parallel joins,
>> because the inflated row count also inflates the join cost.  I think
>> the right thing to do is probably to scale the row count estimate for
>> the joinrel's partial paths by the leader_contribution value computed
>> in cost_seqscan.
>>
>> Despite my general hatred of back-patching things that cause plan
>> changes, I'm inclined to think the fix for this should be back-patched
>> to 9.6, because this is really a brown-paper-bag bug.  If the
>> consensus is otherwise I will of course defer to that consensus.
>
> And here is a patch which seems to fix the problem.

Since nobody seems to have any comment here, I've committed and
back-patched this to 9.6.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company