Discussion: Re: [PERFORM] Hash Anti Join performance degradation
2011/5/31 Robert Haas <robertmhaas@gmail.com>:
> On Thu, May 26, 2011 at 8:33 AM, panam <panam@gmx.net> wrote:
>> Any third party confirmation?
>
> Yeah, it definitely looks like there is some kind of bug here. Or if
> not a bug, then a very surprising feature. EXPLAIN ANALYZE outputs
> from your proposed test attached. Here's a unified diff of the two
> outputs:
>
>                                 QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
> - Seq Scan on box b  (cost=0.00..3669095.76 rows=128 width=8) (actual time=0.147..431517.693 rows=128 loops=1)
> + Seq Scan on box b  (cost=0.00..3669095.76 rows=128 width=8) (actual time=0.047..6938.165 rows=128 loops=1)
>     SubPlan 1
> -     ->  Hash Anti Join  (cost=14742.77..28664.79 rows=19239 width=8) (actual time=2960.176..3370.425 rows=1 loops=128)
> +     ->  Hash Anti Join  (cost=14742.77..28664.79 rows=19239 width=8) (actual time=48.385..53.361 rows=1 loops=128)
>             Hash Cond: (m1.box_id = m2.box_id)
>             Join Filter: (m1.id < m2.id)
> -           ->  Bitmap Heap Scan on message m1  (cost=544.16..13696.88 rows=28858 width=16) (actual time=2.320..6.204 rows=18487 loops=128)
> +           ->  Bitmap Heap Scan on message m1  (cost=544.16..13696.88 rows=28858 width=16) (actual time=1.928..5.502 rows=17875 loops=128)
>                   Recheck Cond: (box_id = b.id)
> -                 ->  Bitmap Index Scan on "message_box_Idx"  (cost=0.00..536.94 rows=28858 width=0) (actual time=2.251..2.251 rows=18487 loops=128)
> +                 ->  Bitmap Index Scan on "message_box_Idx"  (cost=0.00..536.94 rows=28858 width=0) (actual time=1.797..1.797 rows=18487 loops=128)
>                   Index Cond: (box_id = b.id)
> -           ->  Hash  (cost=13696.88..13696.88 rows=28858 width=16) (actual time=12.632..12.632 rows=19720 loops=120)
> -                 Buckets: 4096  Batches: 4 (originally 2)  Memory Usage: 1787kB
> -                 ->  Bitmap Heap Scan on message m2  (cost=544.16..13696.88 rows=28858 width=16) (actual time=1.668..6.619 rows=19720 loops=120)
> +           ->  Hash  (cost=13696.88..13696.88 rows=28858 width=16) (actual time=11.603..11.603 rows=20248 loops=113)
> +                 Buckets: 4096  Batches: 4 (originally 2)  Memory Usage: 1423kB
> +                 ->  Bitmap Heap Scan on message m2  (cost=544.16..13696.88 rows=28858 width=16) (actual time=1.838..6.886 rows=20248 loops=113)
>                       Recheck Cond: (box_id = b.id)
> -                     ->  Bitmap Index Scan on "message_box_Idx"  (cost=0.00..536.94 rows=28858 width=0) (actual time=1.602..1.602 rows=19720 loops=120)
> +                     ->  Bitmap Index Scan on "message_box_Idx"  (cost=0.00..536.94 rows=28858 width=0) (actual time=1.743..1.743 rows=20903 loops=113)
>                       Index Cond: (box_id = b.id)
> - Total runtime: 431520.186 ms
> + Total runtime: 6940.369 ms
>
> That's pretty odd.

Yes, and while at it I noticed that the query took a long time to be killed. I added a CHECK_FOR_INTERRUPTS() in the for(;;) loop in nodeHashjoin.c. It fixes the delay when trying to kill the query, but I don't know what performance impact this can have in this place of the code. -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
Attachments
On Tue, May 31, 2011 at 8:43 PM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote: > Yes, while here I noticed that the query was long to be killed. > I added a CHECK_FOR_INTERRUPT() in the for(;;) loop in nodeHashjoin.c. > It fixes the delay when trying to kill but I don't know about > performance impact this can have in this place of the code. Well, seems easy enough to find out: just test the query with and without your patch (and without casserts). If there's no measurable difference on this query, there probably won't be one anywhere. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2011/6/1 Robert Haas <robertmhaas@gmail.com>: > On Tue, May 31, 2011 at 8:43 PM, Cédric Villemain > <cedric.villemain.debian@gmail.com> wrote: >> Yes, while at it I noticed that the query took a long time to be killed. >> I added a CHECK_FOR_INTERRUPTS() in the for(;;) loop in nodeHashjoin.c. >> It fixes the delay when trying to kill but I don't know about the >> performance impact this can have in this place of the code. > > Well, seems easy enough to find out: just test the query with and > without your patch (and without casserts). If there's no measurable > difference on this query, there probably won't be one anywhere. Oh damned, I am currently on an eeepc, I'll need 2 days to bench that :-D I'll see tomorrow. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
On Tue, May 31, 2011 at 9:11 PM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote: > Oh damned, I am currently with an eeepc, I'll need 2 days to bench that :-D > I'll see tomorow. LOL. With respect to the root of the issue (why does the anti-join take so long?), my first thought was that perhaps the OP was very unlucky and had a lot of values that hashed to the same bucket. But that doesn't appear to be the case. There are 120 different box_id values in the message table, and running hashint8(box_id) % 16384 (which I think is the right calculation: 4096 buckets * 4 batches) yields 120 different values. I think that part of the problem here is that the planner has no particularly efficient way of executing a non-equijoin. Each probe finds the appropriate hash bucket and must then probe the entire chain, all of which pass the hash condition and some of which fail the join qual. So if there are n_1 instances of value v_1, n_2 instances of value v_2, etc. then the total effort is proportional to n_1^2 + n_2^2 + ... But that doesn't really explain the problem because removing the last few rows only changes that sum by a small amount. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
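Robert's two checks (the bucket-collision test and the quadratic cost model) can be sketched in a few lines. This is only an illustrative model: Python's hash() stands in for PostgreSQL's hashint8(), and the group sizes are hypothetical numbers, not the OP's actual data.

```python
# Stand-in for hashint8(box_id) % (4096 buckets * 4 batches).  Python's
# hash() is NOT PostgreSQL's hashint8(), so the bucket numbers differ,
# but the collision check works the same way.
def bucket(box_id, nbuckets=4096, nbatches=4):
    return hash(box_id) % (nbuckets * nbatches)

box_ids = range(1, 121)  # 120 distinct box_id values, as in the test data
assert len({bucket(b) for b in box_ids}) == 120  # no two share a bucket

# Cost model for the non-equijoin probe: every probe may walk its whole
# chain, so total effort is proportional to n_1^2 + n_2^2 + ...
sizes = [211_211, 19_720, 18_487]  # hypothetical per-box_id group sizes
full = sum(n * n for n in sizes)
reduced = sum((n - 650) ** 2 for n in sizes)  # lop a few hundred rows off each
assert full / reduced < 1.02  # the sum barely moves, as Robert observes
```

This mirrors the puzzle in the thread: removing the last few thousand rows changes the n_i^2 sum by only a percent or two, so the cost model alone cannot explain a 60x runtime difference.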
Robert Haas <robertmhaas@gmail.com> writes: > With respect to the root of the issue (why does the anti-join take so > long?), my first thought was that perhaps the OP was very unlucky and > had a lot of values that hashed to the same bucket. But that doesn't > appear to be the case. Well, yes it is. Notice what the subquery is doing: for each row in "box", it's pulling all matching "box_id"s from message and running a self-join across those rows. The hash join condition is a complete no-op. And some of the box_ids have hundreds of thousands of rows. I'd just write it off as being a particularly stupid way to find the max(), except I'm not sure why deleting just a few thousand rows improves things so much. It looks like it ought to be an O(N^2) situation, so the improvement should be noticeable but not amazing. And if you force it to not use a hashjoin, suddenly things are better. Nestloop should also be O(N^2) in this situation, but seemingly it avoids whatever weird corner case we are hitting here. As Cedric says, the lack of any CHECK_FOR_INTERRUPTS in this loop is also problematic. I'm not sure that right there is an ideal place to put it, but we need one somewhere in the loop. regards, tom lane
On Tue, May 31, 2011 at 11:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> With respect to the root of the issue (why does the anti-join take so >> long?), my first thought was that perhaps the OP was very unlucky and >> had a lot of values that hashed to the same bucket. But that doesn't >> appear to be the case. > > Well, yes it is. Notice what the subquery is doing: for each row in > "box", it's pulling all matching "box_id"s from message and running a > self-join across those rows. The hash join condition is a complete > no-op. And some of the box_ids have hundreds of thousands of rows. > > I'd just write it off as being a particularly stupid way to find the > max(), except I'm not sure why deleting just a few thousand rows > improves things so much. It looks like it ought to be an O(N^2) > situation, so the improvement should be noticeable but not amazing. Yeah, this is what I was getting at, though perhaps I didn't say it well. If the last 78K rows were particularly pathological in some way, that might explain something, but as far as one can see they are not a whole heck of a lot different from the rest of the data. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Tom Lane-2 wrote: > > It looks like it ought to be an O(N^2) > situation, so the improvement should be noticeable but not amazing. > Hm, the performance was reasonable again when doing a cluster... So I believe this should be more a technical than an algorithmic/complexity issue. Maybe it is the way the hashtable is built, and that order makes a difference in that case? In short: why is clustered data not affected? Regards, panam -- View this message in context: http://postgresql.1045698.n5.nabble.com/Re-PERFORM-Hash-Anti-Join-performance-degradation-tp4443803p4445123.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, May 31, 2011 at 11:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'd just write it off as being a particularly stupid way to find the
>> max(), except I'm not sure why deleting just a few thousand rows
>> improves things so much. It looks like it ought to be an O(N^2)
>> situation, so the improvement should be noticeable but not amazing.
> Yeah, this is what I was getting at, though perhaps I didn't say it
> well. If the last 78K rows were particularly pathological in some
> way, that might explain something, but as far as one can see they are
> not a whole heck of a lot different from the rest of the data.

Well, I wasted a lot more time on this than I should have, but the answer is: it's a pathological data ordering. The given data file loads the message rows approximately in "id" order, and in fact once you lop off the ones with id > 2550000, it's sufficiently in order that the highest id is also physically last in the table, at least for all of the box_ids that have large numbers of entries.

Now, the tested query plan loads the hash table like this:

    ->  Hash  (cost=13685.86..13685.86 rows=28511 width=16) (actual time=176.286..176.286 rows=211210 loops=1)
          Buckets: 4096  Batches: 1  Memory Usage: 9901kB
          ->  Bitmap Heap Scan on message m2  (cost=537.47..13685.86 rows=28511 width=16) (actual time=23.204..124.624 rows=211210 loops=1)
                Recheck Cond: (box_id = $1)
                ->  Bitmap Index Scan on "message_box_Idx"  (cost=0.00..530.34 rows=28511 width=0) (actual time=21.974..21.974 rows=211210 loops=1)
                      Index Cond: (box_id = $1)

Because of the way that a bitmap heap scan works, the rows are guaranteed to be loaded into the hash table in physical order, which means (in the fast case) that the row with the largest "id" value gets loaded last. And because ExecHashTableInsert pushes each row onto the front of its hash chain, that row ends up at the front of the hash chain. Net result: for all the outer rows that aren't the one with maximum id, we get a joinqual match at the very first entry in the hash chain. Since it's an antijoin, we then reject that outer row and go on to the next. The join thus ends up costing us only O(N) time.

However, with the additional rows in place, there are a significant number of outer rows that don't get a match at the first hash chain entry, and we're spending more like O(N^2) time. I instrumented things for the specific case of box_id = 69440, which is the most common box_id, and got these results:

      2389 got match of join quals at probe 208077
         1 got match of join quals at probe 1
       175 got match of join quals at probe 208077
       273 got match of join quals at probe 1
        21 got match of join quals at probe 208077
         1 got match of join quals at probe 1
        24 got match of join quals at probe 208077
         6 got match of join quals at probe 1
       157 got match of join quals at probe 208077
         1 got match of join quals at probe 1
        67 got match of join quals at probe 208077
        18 got match of join quals at probe 1
         1 generate null-extended tuple after 211211 probes
    208075 got match of join quals at probe 1
         1 got match of join quals at probe 208077

(This is a "uniq -c" summary of a lot of printfs, so the first column is the number of consecutive occurrences of the same printout.) Even though a large majority of the outer rows still manage to find a match at the first probe, there remain about 2800 that don't match there, because they've got pretty big ids, and so they traipse through the hash chain until they find the genuinely largest id, which is unfortunately way down there --- the 208077'th chain entry, in fact. That results in about half a billion more ExecScanHashBucket and ExecQual calls than occur in the easy case (and that's just for this one box_id). So it's not that the full data set is pathologically bad, it's that the reduced set is pathologically good. O(N^2) performance is what you should expect for this query, and that's what you're actually getting with the full data set.

Also, I noted earlier that performance seemed a good deal better with a NestLoop plan. The reason for that is that NestLoop doesn't have the reversal of inner row ordering that's caused by prepending entries to the hash chain, so the very largest row id isn't 208077 entries into the list for it, but only 211211 - 208077 = 3134 entries in. But it still manages to eliminate most outer rows at the first probe, because there's a fairly large value at that end of the dataset too.

I don't see anything much that we could or should do about this. It just depends on the order in which things appear in the hash chain, and even if we fooled around with that ordering, we'd only be moving the improbably-lucky behavior from one case to some other case.

We do need to look into putting a CHECK_FOR_INTERRUPTS call in here somewhere, though. I'm inclined to think that right before the ExecScanHashBucket is the best place. The reason that nest and merge joins don't show a comparable non-responsiveness to cancels is that they always call a child plan node at the equivalent place, and ExecProcNode has got a CHECK_FOR_INTERRUPTS. So we ought to check for interrupts at the point of "fetching a tuple from the inner child plan", and ExecScanHashBucket is the equivalent thing in this logic. Cedric's suggestion of putting it before the switch would get the job done, but it would result in wasting cycles during unimportant transitions from one state machine state to another.

regards, tom lane
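Tom's chain-ordering explanation lends itself to a small simulation. The sketch below is an illustrative model, not the executor's actual code: it builds a hash chain by prepending rows (as ExecHashTableInsert does) and counts probes for the anti-join with qual m1.id < m2.id, comparing the lucky ordering (max id loaded last, so it sits at the chain head) with the pathological one (max id deep in the chain).

```python
def build_chain(load_order):
    # ExecHashTableInsert pushes each row onto the front of its hash
    # chain, so the chain holds rows in reverse load order.
    chain = []
    for row_id in load_order:
        chain.insert(0, row_id)
    return chain

def anti_join_probes(outer_ids, chain):
    # For each outer row, walk the chain until some inner row satisfies
    # the join qual (outer id < inner id); the anti-join then rejects
    # the outer row.  Count every chain entry examined.
    probes = 0
    for o in outer_ids:
        for inner in chain:
            probes += 1
            if o < inner:
                break
    return probes

N = 2000
ids = list(range(N))

# Lucky case: rows loaded in id order, so the max id is loaded last and
# lands at the *front* of the chain; almost every outer row is rejected
# on its first probe, giving O(N) behaviour overall.
fast = anti_join_probes(ids, build_chain(ids))
assert fast == (N - 1) + N  # one probe each, plus a full scan for the max

# Pathological case: the max id sits at the *end* of the chain (loaded
# first), so rows with large ids traipse through most of the chain.
slow = anti_join_probes(ids, build_chain(list(reversed(ids))))
assert slow > 100 * fast    # O(N^2) probes
```

With N = 2000 the lucky ordering needs 3,999 probes while the reversed one needs over two million, which is the same O(N) versus O(N^2) cliff the thread observed between the reduced and full data sets.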
On Wed, Jun 1, 2011 at 4:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Because of the way that a bitmap heap scan works, the rows are > guaranteed to be loaded into the hash table in physical order, which > means (in the fast case) that the row with the largest "id" value gets > loaded last. And because ExecHashTableInsert pushes each row onto the > front of its hash chain, that row ends up at the front of the hash > chain. Net result: for all the outer rows that aren't the one with > maximum id, we get a joinqual match at the very first entry in the hash > chain. Since it's an antijoin, we then reject that outer row and go > on to the next. The join thus ends up costing us only O(N) time. Ah! Makes sense. If I'm reading your explanation right, this means that we could have hit a similar pathological case on a nestloop as well, just with a data ordering that is the reverse of what we have here? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2011/6/1 Tom Lane <tgl@sss.pgh.pa.us>: > We do need to look into putting a CHECK_FOR_INTERRUPTS call in here > somewhere, though. I'm inclined to think that right before the > ExecScanHashBucket is the best place. The reason that nest and merge > joins don't show a comparable non-responsiveness to cancels is that they > always call a child plan node at the equivalent place, and ExecProcNode > has got a CHECK_FOR_INTERRUPTS. So we ought to check for interrupts > at the point of "fetching a tuple from the inner child plan", and > ExecScanHashBucket is the equivalent thing in this logic. Cedric's > suggestion of putting it before the switch would get the job done, but > it would result in wasting cycles during unimportant transitions from > one state machine state to another. Exact; thanks to your last email I read more of the code, reached the same conclusion, and put the check in a more appropriate place: before ExecScanHashBucket. I was about to send it, so it is attached. > > regards, tom lane > -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
Attachments
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Jun 1, 2011 at 4:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Because of the way that a bitmap heap scan works, the rows are >> guaranteed to be loaded into the hash table in physical order, which >> means (in the fast case) that the row with the largest "id" value gets >> loaded last. And because ExecHashTableInsert pushes each row onto the >> front of its hash chain, that row ends up at the front of the hash >> chain. Net result: for all the outer rows that aren't the one with >> maximum id, we get a joinqual match at the very first entry in the hash >> chain. Since it's an antijoin, we then reject that outer row and go >> on to the next. The join thus ends up costing us only O(N) time. > Ah! Makes sense. If I'm reading your explanation right, this means > that we could have hit a similar pathological case on a nestloop as > well, just with a data ordering that is the reverse of what we have > here? Yeah. It's just chance that this particular data set, with this particular ordering, happens to work well with a nestloop version of the query. On average I'd expect nestloop to suck even more, because of more per-inner-tuple overhead. regards, tom lane
On Wed, Jun 1, 2011 at 4:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Jun 1, 2011 at 4:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Because of the way that a bitmap heap scan works, the rows are >>> guaranteed to be loaded into the hash table in physical order, which >>> means (in the fast case) that the row with the largest "id" value gets >>> loaded last. And because ExecHashTableInsert pushes each row onto the >>> front of its hash chain, that row ends up at the front of the hash >>> chain. Net result: for all the outer rows that aren't the one with >>> maximum id, we get a joinqual match at the very first entry in the hash >>> chain. Since it's an antijoin, we then reject that outer row and go >>> on to the next. The join thus ends up costing us only O(N) time. > >> Ah! Make sense. If I'm reading your explanation right, this means >> that we could have hit a similar pathological case on a nestloop as >> well, just with a data ordering that is the reverse of what we have >> here? > > Yeah. It's just chance that this particular data set, with this > particular ordering, happens to work well with a nestloop version > of the query. On average I'd expect nestloop to suck even more, > because of more per-inner-tuple overhead. I guess the real issue here is that m1.id < m2.id has to be evaluated as a filter condition rather than a join qual. That tends to perform poorly in general, which is why rewriting this using min() or max() or ORDER BY .. LIMIT 1 was elsewhere recommended. I've occasionally had cause to join on something other than equality in cases not susceptible to such rewriting, so it would be neat to improve this case, but it's not likely to make it to the top of my list any time soon. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
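The rewrite Robert alludes to works because the anti-join's "no row in the same box with a larger id" condition is just a per-group maximum. A small model with made-up data (the real fix is a SQL rewrite such as max() with GROUP BY, or ORDER BY id DESC LIMIT 1 per box) shows the two semantics agree while costing O(N^2) versus O(N):

```python
# Hypothetical (box_id, id) pairs standing in for the message table.
messages = [(b, i) for b in range(5) for i in range(b * 100, b * 100 + 50)]

# Anti-join semantics with the inequality as a filter condition:
# keep (b, i) iff no row (b, j) exists with i < j.  O(N^2) comparisons.
kept = {(b, i) for (b, i) in messages
        if not any(b2 == b and i < j for (b2, j) in messages)}

# What max() / ORDER BY .. LIMIT 1 lets the planner do instead:
# a single linear pass tracking the per-box maximum.
latest = {}
for b, i in messages:
    if i > latest.get(b, -1):
        latest[b] = i

assert kept == set(latest.items())  # same rows, linear instead of quadratic
```

The equivalence is why the earlier advice in the thread to rewrite the query around max() sidesteps the pathological hash-chain behavior entirely.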
Robert Haas <robertmhaas@gmail.com> writes: > I guess the real issue here is that m1.id < m2.id has to be evaluated > as a filter condition rather than a join qual. Well, if you can invent an optimized join technique that works for inequalities, go for it ... but I think you should get at least a PhD thesis out of that. regards, tom lane
On Wed, Jun 1, 2011 at 4:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I guess the real issue here is that m1.id < m2.id has to be evaluated >> as a filter condition rather than a join qual. > > Well, if you can invent an optimized join technique that works for > inequalities, go for it ... but I think you should get at least a > PhD thesis out of that. Sounds good, except that so far NOT getting a PhD seems like a much better financial prospect. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Cédric Villemain <cedric.villemain.debian@gmail.com> writes: > exact, thanks to your last email I read more the code and get the same > conclusion and put it in a more appropriate place : before > ExecScanHashBucket. > I was about sending it, so it is attached. Applied with cosmetic adjustments. regards, tom lane
On Wed, Jun 01, 2011 at 04:58:36PM -0400, Robert Haas wrote: > On Wed, Jun 1, 2011 at 4:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > >> I guess the real issue here is that m1.id < m2.id has to be evaluated > >> as a filter condition rather than a join qual. > > > > Well, if you can invent an optimized join technique that works for > > inequalities, go for it ... but I think you should get at least a > > PhD thesis out of that. > > Sounds good, except that so far NOT getting a PhD seems like a much > better financial prospect. :-) Yeah, last time I heard of some Uni being so impressed by independent work that they just sort of handed out a Ph.D. it involved a Swiss patent clerk ... Ross -- Ross Reedstrom, Ph.D. reedstrm@rice.edu Systems Engineer & Admin, Research Scientist phone: 713-348-6166 Connexions http://cnx.org fax: 713-348-3665 Rice University MS-375, Houston, TX 77005 GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E F888 D3AE 810E 88F0 BEDE
I'd like to thank you all for getting this analyzed, especially Tom! Your rigor is pretty impressive; seems like otherwise it'd be impossible to maintain a DBMS. In the end, I know a lot more about Postgres internals, and that this idiosyncrasy (from a user perspective) could happen again. I guess it is the first time I have actually encountered an unexpected worst-case scenario like this... Seems it is up to me now to be a bit more creative with query optimization. And in the end, it'll turn out to require an architectural change... As the only thing to achieve is in fact to obtain the last id (currently still with the constraint that it has to happen in an isolated subquery), I wonder whether this requirement (obtaining the last id) is worth a special technique/instrumentation/strategy (lacking a good word here), given the fact that this data has a full logical ordering (in this case even total) and the use case is quite common, I guess. Some ideas from an earlier post: panam wrote: > > ... > This also made me wonder how the internal plan is carried out. Is the > engine able to leverage the fact that a part/range of the rows ["/index > entries"] is totally or partially ordered on disk, e.g. using some kind of > binary search or even "nearest neighbor"-search in that section (i.e. a > special "micro-plan" or algorithm)? Or is the speed-up "just" because > related data is usually "nearby" and most of the standard algorithms work > best with clustered data? > If the first is not the case, would that be a potential point for > improvement? Maybe it would even be more efficient if there were some > sort of constraints that guarantee "ordered row" sections on the disk, > i.e. preventing the addition of a row that had an index value in between > two row values of an already ordered/clustered section. In the simplest > case, it would start with the "first" row and end with the "last" row (at > the time of doing the equivalent of "cluster"). So there would be a small > list saying rows with id x - rows with id y are guaranteed to be ordered > on disk (by id for example) now and for all times. Maybe I am completely off the mark, but what's your conclusion? Too much effort for small scenarios? Nothing that should be handled at the DB level? An attempt to battle the laws of thermodynamics with a small technical dodge? Thanks again panam -- View this message in context: http://postgresql.1045698.n5.nabble.com/Re-PERFORM-Hash-Anti-Join-performance-degradation-tp4443803p4446629.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.