Обсуждение: Getting different number of results when using hashjoin on/off
I've a problem that might be a bug in the core system (hashjoins) or with ltree using gist-index, but I fail miserable toproduce a useful testcase (using 8.1, worked in 8.0): A query produces wrong (=0) results, when a different plan is enforced, I get a merge-join plan that looks similar, but producesthe correct result (=16 rows). I can post a queryplan, but cannot post the data itself since it's confidental (though I might be able to randomize somedata and construct a self contained case, but this would take quite some time). The working case is: set enable_hashjoin to off;Seq Scan on foo1 cost=0.00..423583.57 rows=10810 width=4) (actual time=675.422..706.815 rows=16loops=1) Filter: (subplan) SubPlan -> Merge Join (cost=19.49..19.55 rows=1 width=0) (actual time=0.028..0.028rows=0 loops=21619) Merge Cond: ("outer".str_id = "inner".id) -> Sort (cost=6.49..6.50rows=5 width=4) (actual time=0.023..0.023 rows=0 loops=21619) Sort Key: bz.str_id -> Bitmap Heap Scan on foo2 bz (cost=2.02..6.43 rows=5 width=4) (actual time=0.012..0.012 rows=0 loops=21619) Recheck Cond: (bid = $0) -> Bitmap Index Scan on foo2_bid_key1 (cost=0.00..2.02rows=5 width=0) (actual time=0.009..0.009 rows=0 loops=21619) Index Cond: (bid= $0) -> Sort (cost=13.00..13.01 rows=6 width=4) (actual time=0.002..0.003 rows=1 loops=136) Sort Key: str.id -> Bitmap Heap Scan on structure str (cost=2.02..12.92 rows=6 width=4) (actual time=0.095..0.097rows=1 loops=1) Recheck Cond: (path ~ '142.2330445.2330598.2330676.*'::lquery) -> Bitmap Index Scan on str_uk4 (cost=0.00..2.02 rows=6 width=0) (actual time=0.086..0.086 rows=1 loops=1) Index Cond: (path ~ '142.2330445.2330598.2330676.*'::lquery)Total runtime: 707.019 ms 16 rows... The failing case is: set enable_hashjoin to on;Seq Scan on foo1 cost=0.00..421679.00 rows=10810 width=4) (actual time=154.663..154.663 rows=0loops=1) Filter: (subplan) SubPlan -> Hash Join (cost=8.47..19.46 rows=1 width=0) (actual time=0.004..0.004rows=0 loops=21619) Hash Cond: ("outer".id = "inner".str_id) -> Bitmap Heap Scan on structurestr (cost=2.02..12.92 rows=6 width=4) (actual time=0.100..30.095 rows=1 loops=1) Recheck Cond: (path~ '142.2330445.2330598.2330676.*'::lquery) -> Bitmap Index Scan on str_uk4 (cost=0.00..2.02 rows=6width=0) (actual time=0.090..0.090 rows=1 loops=1) Index Cond: (path ~ '142.2330445.2330598.2330676.*'::lquery) -> Hash (cost=6.43..6.43 rows=5 width=4) (actual time=0.032..0.032 rows=0loops=1) -> Bitmap Heap Scan on foo2 bz (cost=2.02..6.43 rows=5 width=4) (actual time=0.025..0.025rows=0 loops=1) Recheck Cond: (bid = $0) -> Bitmap Index Scanon foo2_bid_key1 (cost=0.00..2.02 rows=5 width=0) (actual time=0.021..0.021 rows=0 loops=1) Index Cond: (bid = $0)Total runtime: 154.862 ms No rows.... The query itself is quite simple: select foo1.id from foo1 where foo1.datloesch is null and exists (select 1 from foo2 bz, structure str where bz.bid=foo1.id and str.id = bz.str_id and str.path ~ '*.2330676.*' ); The path field is an "ltree" column, with an GIST index on it. Any ideas what I could try to track this down? Best regards,Mario Weilguni
> The path field is an "ltree" column, with an GIST index on it. Something to do with bitmap indexscans on lossy indexes? Chris
Am Montag, 28. November 2005 14:12 schrieb Christopher Kings-Lynne: > > The path field is an "ltree" column, with an GIST index on it. > > Something to do with bitmap indexscans on lossy indexes? > > Chris I doubt that, "set enable_bitmapscan to off" produces the wrong result as well. Best regards Mario
Mario Weilguni <mweilguni@sime.com> writes: > The failing case is: > ... > SubPlan > -> Hash Join (cost=8.47..19.46 rows=1 width=0) (actual time=0.004..0.004 rows=0 loops=21619) > Hash Cond: ("outer".id = "inner".str_id) > -> Bitmap Heap Scan on structure str (cost=2.02..12.92 rows=6 width=4) (actual time=0.100..30.095 rows=1 loops=1) > Recheck Cond: (path ~ '142.2330445.2330598.2330676.*'::lquery) > -> Bitmap Index Scan on str_uk4 (cost=0.00..2.02 rows=6 width=0) (actual time=0.090..0.090 rows=1 loops=1) > Index Cond: (path ~ '142.2330445.2330598.2330676.*'::lquery) > -> Hash (cost=6.43..6.43 rows=5 width=4) (actual time=0.032..0.032 rows=0 loops=1) > -> Bitmap Heap Scan on foo2 bz (cost=2.02..6.43 rows=5 width=4) (actual time=0.025..0.025 rows=0 loops=1) > Recheck Cond: (bid = $0) > -> Bitmap Index Scan on foo2_bid_key1 (cost=0.00..2.02 rows=5 width=0) (actual time=0.021..0.021rows=0 loops=1) > Index Cond: (bid = $0) Hmm, I wonder why the hash join's input nodes are showing "loops=1" ... the hash depends on the subplan parameter $0 so it needs to be re-evaluated each time through. It looks like that's not happening. Do you have the corresponding results from 8.0 --- if so, what do the loop counts look like? regards, tom lane
"Mario Weilguni" <mario.weilguni@icomedias.com> writes: > Yes. This is from a 8.0.3 (with slightly older and different data, > resulting in only 9 rows, but the rest is the same): Yeah, that looks more reasonable. I tried to reproduce this, without any luck: regression=# explain analyze select count(*) from tenk1 a where exists (select 1 from tenk1 b, tenk1 c where b.unique1=c.unique2and b.hundred in (4,5) and c.hundred=a.hundred); QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------Aggregate (cost=3879742.37..3879742.38 rows=1 width=0) (actual time=46579.077..46579.082 rows=1 loops=1) -> Seq Scan on tenk1 a (cost=0.00..3879729.87 rows=5000 width=0) (actual time=5.129..46528.208 rows=8500 loops=1) Filter: (subplan) SubPlan -> Hash Join (cost=229.20..546.66 rows=2 width=0) (actual time=4.569..4.569 rows=1 loops=10000) Hash Cond: ("outer".unique1 = "inner".unique2) -> Bitmap Heap Scan on tenk1 b (cost=4.69..321.15rows=196 width=4) (actual time=0.947..1.698 rows=90 loops=10000) Recheck Cond: ((hundred= 4) OR (hundred = 5)) -> BitmapOr (cost=4.69..4.69 rows=197 width=0) (actual time=0.544..0.544rows=0 loops=10000) -> Bitmap Index Scan on tenk1_hundred (cost=0.00..2.34rows=98 width=0) (actual time=0.271..0.271 rows=100 loops=10000) Index Cond:(hundred = 4) -> Bitmap Index Scan on tenk1_hundred (cost=0.00..2.34 rows=98 width=0) (actualtime=0.262..0.262 rows=100 loops=10000) Index Cond: (hundred = 5) -> Hash (cost=224.26..224.26 rows=100 width=4) (actual time=2.370..2.370 rows=100 loops=10000) -> Bitmap Heap Scan on tenk1 c (cost=2.35..224.26 rows=100 width=4) (actual time=0.492..1.616 rows=100 loops=10000) Recheck Cond: (hundred = $0) -> Bitmap Index Scan on tenk1_hundred (cost=0.00..2.35rows=100 width=0) (actual time=0.278..0.278 rows=100 loops=10000) IndexCond: (hundred = $0)Total runtime: 46584.654 ms (19 rows) (I'm not bothering with setting up an ltree index, since the question of what index is being used shouldn't affect hashjoin's decision to rescan or not.) That's using 8.1 branch CVS tip, but there aren't any related bug fixes since 8.1 release. We did have several bug fixes in the hash join code during the 8.1 beta cycle though ... is it possible you are really running an 8.1 beta and not 8.1.0? regards, tom lane
Thanks Tom for you quick answer! No, I'm using 8.1.0, and tried it on different machines, always the same results. SELECT version(); PostgreSQL 8.1.0 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.4 20040623 (Gentoo Hardened Linux 3.3.4-r1, ssp-3.3.2-2,pie-8.7.6) Best regards,Mario Weilguni icomedias - Digitale Kommunikation ------------------------------------------------------------------------ Mario Weilguni, Forschung und Entwicklung mario.weilguni@icomedias.com, http://www.icomedias.com/ icomedias Österreich Systemhaus GmbH: 8020 Graz, Entenplatz 1 Tel: +43 (316) 721.671-272, Fax: -103 icomedias Deutschland Systemhaus GmbH: 10969 Berlin, Alexandrinenstraße 2-3 Tel: +49 (30) 695.399-272, Fax: -103 -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Monday, November 28, 2005 5:20 PM To: Mario Weilguni Cc: Mario Weilguni; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Getting different number of results when using hashjoin on/off "Mario Weilguni" <mario.weilguni@icomedias.com> writes: > Yes. This is from a 8.0.3 (with slightly older and different data, > resulting in only 9 rows, but the rest is the same): Yeah, that looks more reasonable. I tried to reproduce this, without any luck: regression=# explain analyze select count(*) from tenk1 a where exists (select 1 from tenk1 b, tenk1 c where b.unique1=c.unique2and b.hundred in (4,5) and c.hundred=a.hundred); QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------Aggregate (cost=3879742.37..3879742.38 rows=1 width=0) (actual time=46579.077..46579.082 rows=1 loops=1) -> Seq Scan on tenk1 a (cost=0.00..3879729.87 rows=5000 width=0) (actual time=5.129..46528.208 rows=8500 loops=1) Filter: (subplan) SubPlan -> Hash Join (cost=229.20..546.66 rows=2 width=0) (actual time=4.569..4.569 rows=1 loops=10000) Hash Cond: ("outer".unique1 = "inner".unique2) -> Bitmap Heap Scan on tenk1 b (cost=4.69..321.15rows=196 width=4) (actual time=0.947..1.698 rows=90 loops=10000) Recheck Cond: ((hundred= 4) OR (hundred = 5)) -> BitmapOr (cost=4.69..4.69 rows=197 width=0) (actual time=0.544..0.544rows=0 loops=10000) -> Bitmap Index Scan on tenk1_hundred (cost=0.00..2.34rows=98 width=0) (actual time=0.271..0.271 rows=100 loops=10000) Index Cond:(hundred = 4) -> Bitmap Index Scan on tenk1_hundred (cost=0.00..2.34 rows=98 width=0) (actualtime=0.262..0.262 rows=100 loops=10000) Index Cond: (hundred = 5) -> Hash (cost=224.26..224.26 rows=100 width=4) (actual time=2.370..2.370 rows=100 loops=10000) -> Bitmap Heap Scan on tenk1 c (cost=2.35..224.26 rows=100 width=4) (actual time=0.492..1.616 rows=100 loops=10000) Recheck Cond: (hundred = $0) -> Bitmap Index Scan on tenk1_hundred (cost=0.00..2.35rows=100 width=0) (actual time=0.278..0.278 rows=100 loops=10000) IndexCond: (hundred = $0)Total runtime: 46584.654 ms (19 rows) (I'm not bothering with setting up an ltree index, since the question of what index is being used shouldn't affect hashjoin's decision to rescan or not.) That's using 8.1 branch CVS tip, but there aren't any related bug fixes since 8.1 release. We did have several bug fixes in the hash join code during the 8.1 beta cycle though ... is it possible you are really running an 8.1 beta and not 8.1.0? regards, tom lane
"Mario Weilguni" <mario.weilguni@icomedias.com> writes: > No, I'm using 8.1.0, and tried it on different machines, always the same results. I see it, I think: the recent changes to avoid work when one or the other side of the hash join is empty would exit the hash join leaving a state that confused ExecReScanHashJoin() into thinking it didn't have to do anything. Try the attached patch. regards, tom lane Index: src/backend/executor/nodeHashjoin.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/executor/nodeHashjoin.c,v retrieving revision 1.75.2.1 diff -c -r1.75.2.1 nodeHashjoin.c *** src/backend/executor/nodeHashjoin.c 22 Nov 2005 18:23:09 -0000 1.75.2.1 --- src/backend/executor/nodeHashjoin.c 28 Nov 2005 17:04:43 -0000 *************** *** 152,163 **** * outer join, we can quit without scanning the outer relation. */ if (hashtable->totalTuples== 0 && node->js.jointype != JOIN_LEFT) - { - ExecHashTableDestroy(hashtable); - node->hj_HashTable = NULL; - node->hj_FirstOuterTupleSlot = NULL; return NULL; - } /* * need to remember whether nbatch has increased since we began --- 152,158 ---- *************** *** 487,493 **** { ExecHashTableDestroy(node->hj_HashTable); node->hj_HashTable = NULL; - node->hj_FirstOuterTupleSlot = NULL; } /* --- 482,487 ---- *************** *** 805,841 **** ExecReScanHashJoin(HashJoinState *node, ExprContext *exprCtxt) { /* - * If we haven't yet built the hash table then we can just return; nothing - * done yet, so nothing to undo. - */ - if (node->hj_HashTable == NULL) - return; - - /* * In a multi-batch join, we currently have to do rescans the hard way, * primarily because batch tempfiles may have already been released. But * if it's a single-batch join, and there is no parameter change for the * inner subnode, then we can just re-use the existing hash table without * rebuilding it. */ ! if (node->hj_HashTable->nbatch == 1 && ! ((PlanState *) node)->righttree->chgParam == NULL) ! { ! /* okay to reuse the hash table; needn't rescan inner, either */ ! } ! else { ! /* must destroy and rebuild hash table */ ! ExecHashTableDestroy(node->hj_HashTable); ! node->hj_HashTable = NULL; ! node->hj_FirstOuterTupleSlot = NULL; ! /* ! * if chgParam of subnode is not null then plan will be re-scanned by ! * first ExecProcNode. ! */ ! if (((PlanState *) node)->righttree->chgParam == NULL) ! ExecReScan(((PlanState *) node)->righttree, exprCtxt); } /* Always reset intra-tuple state */ --- 799,830 ---- ExecReScanHashJoin(HashJoinState *node, ExprContext *exprCtxt) { /* * In a multi-batch join, wecurrently have to do rescans the hard way, * primarily because batch temp files may have already been released. But * if it's a single-batch join, and there is no parameter change for the * inner subnode, then we can just re-usethe existing hash table without * rebuilding it. */ ! if (node->hj_HashTable != NULL) { ! if (node->hj_HashTable->nbatch == 1 && ! ((PlanState *) node)->righttree->chgParam == NULL) ! { ! /* okay to reuse the hash table; needn't rescan inner, either */ ! } ! else ! { ! /* must destroy and rebuild hash table */ ! ExecHashTableDestroy(node->hj_HashTable); ! node->hj_HashTable = NULL; ! /* ! * if chgParam of subnode is not null then plan will be re-scanned ! * by first ExecProcNode. ! */ ! if (((PlanState *) node)->righttree->chgParam == NULL) ! ExecReScan(((PlanState *) node)->righttree, exprCtxt); ! } } /* Always reset intra-tuple state */ *************** *** 847,852 **** --- 836,842 ---- node->js.ps.ps_TupFromTlist = false; node->hj_NeedNewOuter = true; node->hj_MatchedOuter = false; + node->hj_FirstOuterTupleSlot = NULL; /* * if chgParam of subnode is not null then plan will be re-scannedby
Hello Tom, Thanks for the quick response, I've tried the patch, but it did not work as expected. When I set enable_hashjoin to off, everything works as expected, but with hashjoin on I do not even get results anymore, CPU is going up to 100% and after 3 minutes I cancelled the query (it normale would take ~100-500 milliseconds). I will check the patch on a different machine again and inform you of the results. Best regards,Mario Weilguni -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Monday, November 28, 2005 6:09 PM To: Mario Weilguni Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Getting different number of results when using hashjoin on/off "Mario Weilguni" <mario.weilguni@icomedias.com> writes: > No, I'm using 8.1.0, and tried it on different machines, always the same results. I see it, I think: the recent changes to avoid work when one or the other side of the hash join is empty would exit the hash join leaving a state that confused ExecReScanHashJoin() into thinking it didn't have to do anything. Try the attached patch. regards, tom lane Index: src/backend/executor/nodeHashjoin.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/executor/nodeHashjoin.c,v retrieving revision 1.75.2.1 diff -c -r1.75.2.1 nodeHashjoin.c *** src/backend/executor/nodeHashjoin.c 22 Nov 2005 18:23:09 -0000 1.75.2.1 --- src/backend/executor/nodeHashjoin.c 28 Nov 2005 17:04:43 -0000 *************** *** 152,163 **** * outer join, we can quit without scanning the outer relation. */ if (hashtable->totalTuples == 0 && node->js.jointype != JOIN_LEFT) - { - ExecHashTableDestroy(hashtable); - node->hj_HashTable = NULL; - node->hj_FirstOuterTupleSlot = NULL; return NULL; - } /* * need to remember whether nbatch has increased since we began --- 152,158 ---- *************** *** 487,493 **** { ExecHashTableDestroy(node->hj_HashTable); node->hj_HashTable = NULL; - node->hj_FirstOuterTupleSlot = NULL; } /* --- 482,487 ---- *************** *** 805,841 **** ExecReScanHashJoin(HashJoinState *node, ExprContext *exprCtxt) { /* - * If we haven't yet built the hash table then we can just return; nothing - * done yet, so nothing to undo. - */ - if (node->hj_HashTable == NULL) - return; - - /* * In a multi-batch join, we currently have to do rescans the hard way, * primarily because batch temp files may have already been released. But * if it's a single-batch join, and there is no parameter change for the * inner subnode, then we can just re-use the existing hash table without * rebuilding it. */ ! if (node->hj_HashTable->nbatch == 1 && ! ((PlanState *) node)->righttree->chgParam == NULL) ! { ! /* okay to reuse the hash table; needn't rescan inner, either */ ! } ! else { ! /* must destroy and rebuild hash table */ ! ExecHashTableDestroy(node->hj_HashTable); ! node->hj_HashTable = NULL; ! node->hj_FirstOuterTupleSlot = NULL; ! /* ! * if chgParam of subnode is not null then plan will be re-scanned by ! * first ExecProcNode. ! */ ! if (((PlanState *) node)->righttree->chgParam == NULL) ! ExecReScan(((PlanState *) node)->righttree, exprCtxt); } /* Always reset intra-tuple state */ --- 799,830 ---- ExecReScanHashJoin(HashJoinState *node, ExprContext *exprCtxt) { /* * In a multi-batch join, wecurrently have to do rescans the hard way, * primarily because batch temp files may have already been released. But * if it's a single-batch join, and there is no parameter change for the * inner subnode, then we can just re-use the existing hash table without * rebuilding it. */ ! if (node->hj_HashTable != NULL) { ! if (node->hj_HashTable->nbatch == 1 && ! ((PlanState *) node)->righttree->chgParam == NULL) ! { ! /* okay to reuse the hash table; needn't rescan inner, either */ ! } ! else ! { ! /* must destroy and rebuild hash table */ ! ExecHashTableDestroy(node->hj_HashTable); ! node->hj_HashTable = NULL; ! /* ! * if chgParam of subnode is not null then plan will be re-scanned ! * by first ExecProcNode. ! */ ! if (((PlanState *) node)->righttree->chgParam == NULL) ! ExecReScan(((PlanState *) node)->righttree, exprCtxt); ! } } /* Always reset intra-tuple state */ *************** *** 847,852 **** --- 836,842 ---- node->js.ps.ps_TupFromTlist = false; node->hj_NeedNewOuter = true; node->hj_MatchedOuter = false; + node->hj_FirstOuterTupleSlot = NULL; /* * if chgParam of subnode is not null then plan will be re-scanned by
Yes. This is from a 8.0.3 (with slightly older and different data, resulting in only 9 rows, but the rest is the same): Index Scan using ben_uk3 on foo1 ben (cost=0.00..73867.23 rows=863 width=27) (actual time=38.591..501.839 rows=9 loops=1) Filter: (subplan) SubPlan -> Hash Join (cost=14.25..42.53 rows=1width=0) (actual time=0.284..0.284 rows=0 loops=1725) Hash Cond: ("outer".id = "inner".str_id) -> Index Scan using str_uk4on structure str (cost=0.00..27.91 rows=13 width=4) (actual time=0.765..4.043 rows=1 loops=112) Index Cond: (path ~ '*.2330676.*'::lquery) -> Hash (cost=14.23..14.23 rows=10 width=4)(actual time=0.012..0.012 rows=0 loops=1725) -> Index Scan using foo2_ben_id_key1 on foo2 bz (cost=0.00..14.23 rows=10 width=4) (actual time=0.008..0.009 rows=1 loops=1725) Index Cond: (ben_id = $0)Total runtime: 501.980 ms Best regards P.s. sorry for the stupid quoting, I've to use Outlook.... Mario Weilguni <mweilguni@sime.com> writes: > The failing case is: > ... > SubPlan > -> Hash Join (cost=8.47..19.46 rows=1 width=0) (actual time=0.004..0.004 rows=0 loops=21619) > Hash Cond: ("outer".id = "inner".str_id) > -> Bitmap Heap Scan on structure str (cost=2.02..12.92 rows=6 width=4) (actual time=0.100..30.095 rows=1 loops=1) > Recheck Cond: (path ~ '142.2330445.2330598.2330676.*'::lquery) > -> Bitmap Index Scan on str_uk4 (cost=0.00..2.02 rows=6 width=0) (actual time=0.090..0.090 rows=1 loops=1) > Index Cond: (path ~ '142.2330445.2330598.2330676.*'::lquery) > -> Hash (cost=6.43..6.43 rows=5 width=4) (actual time=0.032..0.032 rows=0 loops=1) > -> Bitmap Heap Scan on foo2 bz (cost=2.02..6.43 rows=5 width=4) (actual time=0.025..0.025 rows=0 loops=1) > Recheck Cond: (bid = $0) > -> Bitmap Index Scan on foo2_bid_key1 (cost=0.00..2.02 rows=5 width=0) (actual time=0.021..0.021 rows=0 loops=1) > Index Cond: (bid = $0) Hmm, I wonder why the hash join's input nodes are showing "loops=1" ... the hash depends on the subplan parameter $0 so it needs to be re-evaluated each time through. It looks like that's not happening. >Do you have the corresponding results from 8.0 --- if so, what do >the loop counts look like? ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.orgso that your message can get through to the mailing list cleanly
"Mario Weilguni" <mario.weilguni@icomedias.com> writes: > Thanks for the quick response, I've tried the patch, but it did not work > as expected. When I set enable_hashjoin to off, everything works as > expected, but with hashjoin on I do not even get results anymore, CPU is > going up to 100% and after 3 minutes I cancelled the query (it normale > would take ~100-500 milliseconds). Try letting it run longer. I think your expectation is tuned for the broken implementation (which runs the subqueries only once instead of 26k times...) The test case I developed for this failure in the regression database is select count(*) from tenk1 a where exists (select 1 from tenk1 b, tenk1 c where b.unique1=c.unique2 and b.hundred in (4,5) andc.hundred=a.hundred+99); 8.0 prefers a nestloop for the subquery, and that plan runs in about 600 ms on my machine. If forced to a hash join, it takes about 2450 ms. 8.1 prefers the hash join to start with, but takes 11300 ms to run it :-( (after the patch that is). The reason for the differential is that 8.1 guesses wrong about which subplan to cycle first: most of the time, the inner plan is empty and so there's no need to pull any rows from the outer plan, but 8.1 pulls the first row from the outer plan anyway, and doing that 10000 times is what's eating the extra runtime. It looks from your previous message that similar things are happening with your data distribution, allowing 8.0 to run faster for you than 8.1 does. Not sure if there's much we can do about this. The presence of the upper-query parameter in the subplan makes it difficult to derive any stats at all, let alone guess how often the subplan will be completely empty, so I'm not sure the planner can help. For a query like this, where the hash join is being done repeatedly, it might be useful for the executor itself to track how often each subplan has been seen to be empty. In particular, the executor knows that the outer subplan is parameterless and therefore should deliver the same results each time (modulo volatile functions of course), so after the first cycle it could know that there's no point in trying the early fetch on that side. Dunno if this will be of wide enough use to be worth implementing though --- in simple cases the join won't be rescanned and so the executor can't help. Anyone have any other ideas? regards, tom lane
I wrote: > For a query like this, where the hash join is being done repeatedly, > it might be useful for the executor itself to track how often each > subplan has been seen to be empty. I implemented a simple form of this, and it made 8.1 faster than 8.0 on the test case I was using. Give it a try ... regards, tom lane Index: src/backend/executor/nodeHashjoin.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/executor/nodeHashjoin.c,v retrieving revision 1.75.2.2 diff -c -r1.75.2.2 nodeHashjoin.c *** src/backend/executor/nodeHashjoin.c 28 Nov 2005 17:14:47 -0000 1.75.2.2 --- src/backend/executor/nodeHashjoin.c 28 Nov 2005 23:41:28 -0000 *************** *** 120,135 **** * since we aren't going to be able to skip the join on the strength * of an empty innerrelation anyway.) * * The only way to make the check is to try to fetch a tuple from the * outer plan node. If we succeed, we have to stash it away for later * consumption by ExecHashJoinOuterGetTuple. */ ! if (outerNode->plan->startup_cost < hashNode->ps.plan->total_cost || ! node->js.jointype == JOIN_LEFT) { node->hj_FirstOuterTupleSlot = ExecProcNode(outerNode); if (TupIsNull(node->hj_FirstOuterTupleSlot)) return NULL; } else node->hj_FirstOuterTupleSlot = NULL; --- 120,147 ---- * since we aren't going to be able to skip the join on the strength * of an empty innerrelation anyway.) * + * If we are rescanning the join, we make use of information gained + * on the previous scan: don't bother to try the prefetch if the + * previous scan found the outer relation nonempty. This is not + * 100% reliable since with new parameters the outer relation might + * yield different results, but it's a good heuristic. + * * The only way to make the check is to try to fetch a tuple from the * outer plan node. Ifwe succeed, we have to stash it away for later * consumption by ExecHashJoinOuterGetTuple. */ ! if (node->js.jointype == JOIN_LEFT || ! (outerNode->plan->startup_cost < hashNode->ps.plan->total_cost && ! !node->hj_OuterNotEmpty)) { node->hj_FirstOuterTupleSlot = ExecProcNode(outerNode); if (TupIsNull(node->hj_FirstOuterTupleSlot)) + { + node->hj_OuterNotEmpty = false; return NULL; + } + else + node->hj_OuterNotEmpty = true; } else node->hj_FirstOuterTupleSlot = NULL; *************** *** 159,164 **** --- 171,183 ---- * scanning the outer relation */ hashtable->nbatch_outstart = hashtable->nbatch; + + /* + * Reset OuterNotEmpty for scan. (It's OK if we fetched a tuple + * above, because ExecHashJoinOuterGetTuple will immediately + * set it again.) + */ + node->hj_OuterNotEmpty = false; } /* *************** *** 454,459 **** --- 473,479 ---- hjstate->js.ps.ps_TupFromTlist = false; hjstate->hj_NeedNewOuter = true; hjstate->hj_MatchedOuter= false; + hjstate->hj_OuterNotEmpty = false; return hjstate; } *************** *** 546,551 **** --- 566,574 ---- *hashvalue = ExecHashGetHashValue(hashtable, econtext, hjstate->hj_OuterHashKeys); + /* remember outer relation is not empty for possible rescan */ + hjstate->hj_OuterNotEmpty = true; + return slot; } *************** *** 810,816 **** if (node->hj_HashTable->nbatch == 1 && ((PlanState *) node)->righttree->chgParam ==NULL) { ! /* okay to reuse the hash table; needn't rescan inner, either */ } else { --- 833,851 ---- if (node->hj_HashTable->nbatch == 1 && ((PlanState *) node)->righttree->chgParam ==NULL) { ! /* ! * okay to reuse the hash table; needn't rescan inner, either. ! * ! * What we do need to do is reset our state about the emptiness ! * of the outer relation, so that the new scan of the outer will ! * update it correctly if it turns out to be empty this time. ! * (There's no harm in clearing it now because ExecHashJoin won't ! * need the info. In the other cases, where the hash table ! * doesn't exist or we are destroying it, we leave this state ! * alone because ExecHashJoin will need it the first time ! * through.) ! */ ! node->hj_OuterNotEmpty = false; } else { Index: src/include/nodes/execnodes.h =================================================================== RCS file: /cvsroot/pgsql/src/include/nodes/execnodes.h,v retrieving revision 1.139.2.2 diff -c -r1.139.2.2 execnodes.h *** src/include/nodes/execnodes.h 22 Nov 2005 18:23:28 -0000 1.139.2.2 --- src/include/nodes/execnodes.h 28 Nov 2005 23:41:28 -0000 *************** *** 1101,1106 **** --- 1101,1107 ---- * hj_FirstOuterTupleSlot first tuple retrieved from outer plan * hj_NeedNewOuter true if need new outer tuple on next call * hj_MatchedOuter true if found a join match for currentouter + * hj_OuterNotEmpty true if outer relation known not empty * ---------------- */ *************** *** 1125,1130 **** --- 1126,1132 ---- TupleTableSlot *hj_FirstOuterTupleSlot; bool hj_NeedNewOuter; bool hj_MatchedOuter; + bool hj_OuterNotEmpty; } HashJoinState;
Greg Stark <gsstark@mit.edu> writes: > I suspect this is obvious but since you asked, there isn't any way to keep > around the hash table and just reuse it repeatedly instead of having to rescan > the data over and over is there? We already do that when possible --- which it's not in the particular case at hand, because there's an outer-query parameter used in the hashed subplan. It occurs to me that the planner ought to favor putting parameterized subplans on the outside of a hash join instead of the inside, so as to make reuse more likely. Not sure how to factor that into the cost model though. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > In particular, the executor knows that the outer subplan is parameterless > and therefore should deliver the same results each time (modulo volatile > functions of course), so after the first cycle it could know that there's no > point in trying the early fetch on that side. > Anyone have any other ideas? I suspect this is obvious but since you asked, there isn't any way to keep around the hash table and just reuse it repeatedly instead of having to rescan the data over and over is there? -- greg