Discussion: Minor performance improvement in transition to external sort
The attached patch replaces the existing siftup method for heapify with a siftdown method. Tested with random integers, it does 18% fewer compares and takes 10% less time for the heapify, over the work_mem range 1024 to 1048576. Both algorithms appear to be O(n) (contradicting Wikipedia's claim that a siftup heapify is O(n log n)). -- Cheers, Jeremy
Attachments
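As background for the comparison being made in this thread, the two heap-construction strategies can be sketched in Python (an illustration only; the actual patch is C code in tuplesort.c, which is not shown here):

```python
import random

def heapify_siftup(items, less):
    """Build a min-heap by inserting elements one at a time and sifting
    each new element up toward the root (the pre-patch approach).
    Returns the heap and the number of comparisons made."""
    compares = 0
    heap = []
    for x in items:
        heap.append(x)
        i = len(heap) - 1
        while i > 0:
            parent = (i - 1) // 2
            compares += 1
            if less(heap[i], heap[parent]):
                heap[i], heap[parent] = heap[parent], heap[i]
                i = parent
            else:
                break
    return heap, compares

def heapify_siftdown(items, less):
    """Floyd's bottom-up construction: treat the array as a complete
    binary tree and sift each parent down, last parent first."""
    heap = list(items)
    compares = 0
    n = len(heap)
    for start in range(n // 2 - 1, -1, -1):
        i = start
        while True:
            child = 2 * i + 1
            if child >= n:
                break
            if child + 1 < n:            # pick the smaller child
                compares += 1
                if less(heap[child + 1], heap[child]):
                    child += 1
            compares += 1
            if less(heap[child], heap[i]):
                heap[i], heap[child] = heap[child], heap[i]
                i = child
            else:
                break
    return heap, compares

random.seed(1)
data = [random.randrange(100000) for _ in range(50000)]
up_heap, up_cmp = heapify_siftup(data, lambda a, b: a < b)
down_heap, down_cmp = heapify_siftdown(data, lambda a, b: a < b)
```

The exact comparison counts depend heavily on the input distribution, which is why the thread goes on to test sorted, reverse-sorted, constant, and pipe-organ inputs as well as random ones.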
On Wed, Feb 5, 2014 at 7:22 AM, Jeremy Harris <jgh@wizmail.org> wrote:
> The attached patch replaces the existing siftup method for heapify with
> a siftdown method. Tested with random integers it does 18% fewer
> compares and takes 10% less time for the heapify, over the work_mem
> range 1024 to 1048576.
>
> Both algorithms appear to be O(n) (contradicting Wikipedia's claim
> that a siftup heapify is O(n log n)).

It looks interesting, but it is too late to have that in 9.4... You should definitely add this patch to the next commitfest so that it is not lost in the void: https://commitfest.postgresql.org/action/commitfest_view?id=22 Regards, -- Michael
On Tue, Feb 4, 2014 at 5:22 PM, Jeremy Harris <jgh@wizmail.org> wrote:
> The attached patch replaces the existing siftup method for heapify with
> a siftdown method. Tested with random integers it does 18% fewer
> compares and takes 10% less time for the heapify, over the work_mem
> range 1024 to 1048576.
>
> Both algorithms appear to be O(n) (contradicting Wikipedia's claim
> that a siftup heapify is O(n log n)).

I think Wikipedia is right. Inserting a tuple into a heap is O(lg n); doing that n times is therefore O(n lg n). Your patch likewise implements an outer loop which runs O(n) times and an inner loop which runs at most O(lg n) times, so a naive analysis of that algorithm also comes out to O(n lg n). Wikipedia's article on http://en.wikipedia.org/wiki/Heapsort explains why a tighter bound is possible for the siftdown case. I think I tested something like this at some point and concluded that it didn't really help much, because building the initial heap was a pretty small part of the runtime of a large sort. It may still be worth doing, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
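The tighter bound Robert refers to comes from summing the work per level: a heap of n elements has at most ceil(n / 2^(h+1)) nodes of height h, and a sift-down from height h costs O(h), so the total is n * sum over h of h / 2^(h+1), and that series converges to exactly 1. A quick numerical check of the series (illustrative only, not from the thread):

```python
# sum_{h>=0} h / 2^(h+1) = 1, which is why bottom-up (sift-down)
# heapify does O(n) total work even though any single sift-down from
# the root can cost O(log n).
partial = sum(h / 2 ** (h + 1) for h in range(64))
```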
On 05/02/14 21:56, Robert Haas wrote:
> I think Wikipedia is right. Inserting a tuple into a heap is O(lg n);
> doing that n times is therefore O(n lg n). Your patch likewise
> implements an outer loop which runs O(n) times and an inner loop which
> runs at most O(lg n) times, so a naive analysis of that algorithm also
> comes out to O(n lg n).

The patch implements a siftdown. Measurements attached. -- Cheers, Jeremy
Attachments
On Thu, Feb 6, 2014 at 4:22 AM, Jeremy Harris <jgh@wizmail.org> wrote:
> The patch implements a siftdown. Measurements attached.

Uh, this spreadsheet is useless. You haven't explained how these numbers were generated, or what the columns in the spreadsheet actually mean. Ideally, you should include enough information that someone else can try to reproduce your results. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Feb 4, 2014 at 2:22 PM, Jeremy Harris <jgh@wizmail.org> wrote:
> The attached patch replaces the existing siftup method for heapify with
> a siftdown method. Tested with random integers it does 18% fewer
> compares and takes 10% less time for the heapify, over the work_mem
> range 1024 to 1048576.
Thanks for working on this. Several times I've wondered why we didn't use siftdown for heapifying, as it is a well-known optimization. How big were the sets you were sorting in each case? Was it always just slightly bigger than would fit in work_mem? Did you try sorting already-sorted, reverse-sorted, or pipe-organ shaped data sets? We will also need to test it on strings. I usually use md5(random()::text) to generate strings for such purposes, at least for a first pass.
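For reference, a "pipe-organ" data set usually means values that rise and then fall again (one common reading; other shapes go by the same name, as Jeremy notes below). A hypothetical Python generator for that shape, plus a rough analogue of the md5(random()::text) trick for string test data, might look like:

```python
import hashlib
import random

def pipe_organ(n):
    """One common 'pipe-organ' shape: ascend through the first half,
    then mirror back down through the second half."""
    up = list(range((n + 1) // 2))
    return up + up[: n // 2][::-1]

def random_md5_strings(n, seed=0):
    """Rough analogue of SQL's md5(random()::text) for string-sort tests."""
    rng = random.Random(seed)
    return [hashlib.md5(str(rng.random()).encode()).hexdigest()
            for _ in range(n)]
```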
The name of the attached patch is a bit unfortunate. And I'm not sure what you are doing with tupindex; the change there doesn't seem necessary to the rest of your work, so some explanation on that would be helpful.
The project coding style usually has comments in blocks before loops on which they comment, rather than interspersed throughout the clauses of the "for" statement.
> Both algorithms appear to be O(n) (contradicting Wikipedia's claim
> that a siftup heapify is O(n log n)).
Siftdown is linear in all cases. Siftup may be linear in the typical case, but is n log n in the worst case.
Another optimization that is possible is to do only one comparison at each level (between the two children) all the way down the heap, and then compare the parent to the chain of smallest children in reverse order. This can substantially reduce the number of comparisons on average, because most tuples sift most of the way down the heap (because most of the tuples are at the bottom of the heap). Robert submitted a patch to do this in the main siftdown routine (which for some reason is named tuplesort_heap_siftup, rather than siftdown), and I found it a substantial improvement but it never got pushed forward. I think Robert was concerned it might make some obscure cases worse rather than better, and no one put it through enough serious testing to overcome that concern.
(Also, I think you should make your siftdown code look more like the existing siftdown code.)
Cheers,
Jeff
On 06/02/14 18:21, Jeff Janes wrote:
> How big were the sets you were sorting in each case?

Big enough to go external. The timings and compare-counts given are purely for the heapify stage, not the total for the sort, so they are constrained by work_mem rather than by the sort size per se. I'm limited to working on a small machine, so the work_mem value of 1e+6 is approaching my top limit of sort-time pain unless I rip the heapify out into a test harness. What range of work_mem is it useful to test over?

> Was it always just slightly bigger than would fit in work_mem?

I didn't check.

> Did you try sorting already-sorted, reverse sorted, or pipe-organ shaped data sets?

Not yet, but I can. What variety of pipe-organ shape is of interest (I can think of several that might be called that)?

> We will also need to test it on strings. I usually use md5(random()::text)
> to generate strings for such purposes, at least for a first pass.

OK.

> The name of the attached patch is a bit unfortunate.

Is there a project standard for this?

> And I'm not sure what you are doing with tupindex, the change there doesn't
> seem to be necessary to the rest of your work, so some explanation on that
> would be helpful.

I'll add commentary.

> The project coding style usually has comments in blocks before loops on
> which they comment, rather than interspersed throughout the clauses of the
> "for" statement.

I'll adjust that.

> Another optimization that is possible is to do only one comparison at each
> level (between the two children) all the way down the heap, and then
> compare the parent to the chain of smallest children in reverse order.
> This can substantially reduce the number of comparisons on average,
> because most tuples sift most of the way down the heap (because most of
> the tuples are at the bottom of the heap).

Sounds interesting; I'll see if I can get that going.

> (Also, I think you should make your siftdown code look more like the
> existing siftdown code.)

A function called by the outer loop? Can do; the existing code does that because the function is called from multiple places, but this one will only be used once. Thanks for the review. -- Jeremy
On 06/02/14 14:22, Robert Haas wrote:
> Uh, this spreadsheet is useless. You haven't explained how these
> numbers were generated, or what the columns in the spreadsheet
> actually mean. Ideally, you should include enough information that
> someone else can try to reproduce your results.

I'm sorry I wasn't clear. Test code was groups of the form:

set work_mem to 1024;
CREATE TABLE jgh (i integer);
INSERT INTO jgh (i) SELECT 20000 * random() FROM generate_series(1, 20000);
VACUUM ANALYZE jgh;
explain analyze SELECT * FROM jgh ORDER BY i;
set enable_siftdown_heapify to off;
explain analyze SELECT * FROM jgh ORDER BY i;
set enable_siftdown_heapify to on;
DROP TABLE jgh;

... for each tabulated work_mem value, with scaled values for the INSERT. Compares were counted for the heapify stage (only), and timings taken using pg_rusage_init() before and pg_rusage_show() after it. Times are in seconds. Baseline values for compares and time used commit 858ec11858. Sift-down compares and time are for the patch.
"Baseline K cmps" is compares divided by work_mem (which should be a scaled version of compares-per-tuple being heapified) pre-patch. "Siftdown K cmps" is likewise with the patch. "K ratio cmps" is the ratio (patched compares)/(unpatched compares). Repeat for time measurements. -- Cheers, Jeremy
On 06/02/14 22:12, Jeremy Harris wrote:
>> Did you try sorting already-sorted, reverse sorted, or pipe-organ
>> shaped data sets?

Summary (low numbers better):

Random ints:         83% compares, level on time.
Sorted ints:         level compares, 70% time.
Reverse-sorted ints: 10% compares, 15% time (!)
Constant ints:       200% compares, 360% time (ouch, and not O(n))
Pipe-organ ints:     80% compares, 107% time
Random text:         83% compares, 106% time

-- Cheers, Jeremy
Attachments
On Fri, Feb 7, 2014 at 4:28 PM, Jeremy Harris <jgh@wizmail.org> wrote:
> Summary (low numbers better):
>
> Random ints:         83% compares, level on time.
> Sorted ints:         level compares, 70% time.
> Reverse-sorted ints: 10% compares, 15% time (!)
> Constant ints:       200% compares, 360% time (ouch, and not O(n))
> Pipe-organ ints:     80% compares, 107% time
> Random text:         83% compares, 106% time

This is kind of what I expected to happen. The problem is that it's hard to make some cases better without making other cases worse. I suspect (hope?) there's some simple fix for the constant-int case. But the last two cases are trickier. It seems intuitive that reducing comparisons ought to reduce runtime, but if I'm reading the results correctly, the runtime actually went up even though the number of comparisons went down. This is hard to account for, but we probably need to at least understand why that's happening. I feel like this algorithm ought to be a win, but these data don't provide a compelling case for change. I wonder if it would be worth trying this test with text data as well. Text comparisons are much more expensive than integer comparisons, so the effect of saving comparisons ought to be more visible there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 06/02/14 18:21, Jeff Janes wrote:
> Did you try sorting already-sorted, reverse sorted, or pipe-organ shaped
> data sets? We will also need to test it on strings. I usually use
> md5(random()::text) to generate strings for such purposes, at least for
> a first pass.

Attached is version 2 of the patch, which fixes the performance on constant input. Summary (low numbers better, vs 858ec11):

Random ints:         83% compares, level on time.
Sorted ints:         level compares, 70% time.
Reverse-sorted ints: 10% compares, 15% time (!)
Constant ints:       level compares, 75% time
Pipe-organ ints:     80% compares, 107% time
Random text:         83% compares, 106% time

> Another optimization that is possible is to do only one comparison at each
> level (between the two children) all the way down the heap, and then
> compare the parent to the chain of smallest children in reverse order.
> This can substantially reduce the number of comparisons on average,
> because most tuples sift most of the way down the heap (because most of
> the tuples are at the bottom of the heap). Robert submitted a patch to do
> this in the main siftdown routine (which for some reason is named
> tuplesort_heap_siftup, rather than siftdown), and I found it a substantial
> improvement but it never got pushed forward. I think Robert was concerned
> it might make some obscure cases worse rather than better, and no one put
> it through enough serious testing to overcome that concern.

I've done an implementation of this (also attached: siftdown_bounce). Summary:

Random ints:         72% compares, 104% time.
Sorted ints:         200% compares, 270% time. (ouch)
Reverse-sorted ints: 7% compares, 12% time
Constant ints:       150% compares, 275% time
Pipe-organ ints:     93% compares, 115% time
Random text:         72% compares, 91% time

I don't like the performance on already-sorted input. Thinking it through, a sorted array is already a well-formed heap, so the optimal behaviour is a single compare per item (lacking global knowledge). This "bounce" method will always do a full walk down the tree and back, hence the poor behaviour. Unless the planner can tell the sort when sorted input is possible, I don't think we dare use this despite the benefit on random input.

The tests were run with an instrumented variant of each patch, counting compares and time for just the heapify portion of the inittapes() routine. Test script (jgh.sql) and results spreadsheet also attached. -- Cheers, Jeremy
Attachments
On 09/02/14 17:11, Jeremy Harris wrote:
> Attached is version 2 of the patch, which fixes the performance on
> constant-input.

Having beaten on this some more, I'm prepared to abandon it.

The wallclock time, for random input, drifts up at larger N (compared to the existing code) despite the number of comparisons being consistently less.

Run under cachegrind, it takes about N/10 last-level cache misses, all for the new item being introduced to the heap. The existing code takes none at all.

It might be worthwhile for a seriously expensive comparison function; say, more than 50 clocks. For integers and md5-strings it isn't. -- Cheers, Jeremy
On Thu, Feb 20, 2014 at 7:27 PM, Jeremy Harris <jgh@wizmail.org> wrote:
> Having beaten on this some more I'm prepared to abandon it.
>
> The wallclock time, for random input, drifts up at larger N
> (compared to the existing code) despite the number of comparisons
> being consistently less.
>
> Run under cachegrind, it takes about N/10 last-level cache misses,
> all for the new item being introduced to the heap. The existing
> code takes none at all.

Can you explain this further? This seems like an important clue that could be useful when trying to optimize this code, but I'm a little unclear which part of the operation has more cache misses with your changes and why. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 24/02/14 17:38, Robert Haas wrote:
> Can you explain this further? This seems like an important clue that
> could be useful when trying to optimize this code, but I'm a little
> unclear which part of the operation has more cache misses with your
> changes and why.

In the patched version, for the heapify operation the outer loop starts at the last heap-parent tuple and works backward to the start of the tuples array. A copy is taken of the SortTuple being operated on for the inner loop to use. This copy suffers cache misses.

(The inner loop operates on elements between the copy source and the end of the array.)

In the original, the outer loop runs the array in increasing index order. Again a copy is taken of the SortTuple for the inner loop to use. This copy does not appear to take significant cache misses.

(The inner loop operates on elements between the copy source and the start of the array.)

-- Cheers, Jeremy
On Tue, Feb 25, 2014 at 2:55 PM, Jeremy Harris <jgh@wizmail.org> wrote:
> In the patched version, for the heapify operation the outer loop
> starts at the last heap-parent tuple and works backward to the
> start of the tuples array. A copy is taken of the SortTuple being
> operated on for the inner loop to use. This copy suffers cache misses.
>
> (The inner loop operates on elements between the copy source and the
> end of the array).
>
> In the original, the outer loop runs the array in increasing index
> order. Again a copy is taken of the SortTuple for the inner loop
> to use. This copy does not appear to take significant cache misses.
>
> (The inner loop operates on elements between the copy source and
> the start of the array).

Do you have any theory as to why this happens in one case but not the other? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 26/02/14 03:08, Robert Haas wrote:
> Do you have any theory as to why this happens in one case but not the other?

Nope. Only really wild stuff, such as the CPU's prefetch hardware recognising the forward scan (despite the inner loop) but not the backward one - and that would be ignoring explicit prefetch instructions, which I experimented with. -- Cheers, Jeremy
On 6 February 2014 18:21, Jeff Janes <jeff.janes@gmail.com> wrote: > On Tue, Feb 4, 2014 at 2:22 PM, Jeremy Harris <jgh@wizmail.org> wrote: >> >> The attached patch replaces the existing siftup method for heapify with >> a siftdown method. Tested with random integers it does 18% fewer >> compares and takes 10% less time for the heapify, over the work_mem >> range 1024 to 1048576. > > > Thanks for working on this. +1 Your patch isn't linked properly from the CF manager though. If you like patches like this then there's a long(er) list of optimizations already proposed previously around sorting. It would be good to have someone work through them for external sorts. I believe Noah is working on parallel internal sort (as an aside). There's also an optimization possible for merge joins where we use the output of the first sort as an additional filter on the second sort. That can help when we're going to join two disjoint tables. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Apr 10, 2014 at 06:03:15PM +0100, Simon Riggs wrote: > On 6 February 2014 18:21, Jeff Janes <jeff.janes@gmail.com> wrote: > > On Tue, Feb 4, 2014 at 2:22 PM, Jeremy Harris <jgh@wizmail.org> wrote: > >> > >> The attached patch replaces the existing siftup method for heapify with > >> a siftdown method. Tested with random integers it does 18% fewer > >> compares and takes 10% less time for the heapify, over the work_mem > >> range 1024 to 1048576. > > > > > > Thanks for working on this. > > +1 > > Your patch isn't linked properly from the CF manager though. > > If you like patches like this then there's a long(er) list of > optimizations already proposed previously around sorting. It would be > good to have someone work through them for external sorts. I believe > Noah is working on parallel internal sort (as an aside). > > There's also an optimization possible for merge joins where we use the > output of the first sort as an additional filter on the second sort. > That can help when we're going to join two disjoint tables. Where should this be recorded? TODO? Commitfest manager? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Wed, Apr 16, 2014 at 7:38 PM, Bruce Momjian <bruce@momjian.us> wrote: > On Thu, Apr 10, 2014 at 06:03:15PM +0100, Simon Riggs wrote: >> On 6 February 2014 18:21, Jeff Janes <jeff.janes@gmail.com> wrote: >> > On Tue, Feb 4, 2014 at 2:22 PM, Jeremy Harris <jgh@wizmail.org> wrote: >> >> >> >> The attached patch replaces the existing siftup method for heapify with >> >> a siftdown method. Tested with random integers it does 18% fewer >> >> compares and takes 10% less time for the heapify, over the work_mem >> >> range 1024 to 1048576. >> > >> > >> > Thanks for working on this. >> >> +1 >> >> Your patch isn't linked properly from the CF manager though. >> >> If you like patches like this then there's a long(er) list of >> optimizations already proposed previously around sorting. It would be >> good to have someone work through them for external sorts. I believe >> Noah is working on parallel internal sort (as an aside). >> >> There's also an optimization possible for merge joins where we use the >> output of the first sort as an additional filter on the second sort. >> That can help when we're going to join two disjoint tables. > > Where should this be recorded? TODO? Commitfest manager? IIUC, the original patch was withdrawn; any remaining action items should probably go to TODO. I'm not sure which specific idea you're referring to, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company