Re: Inlining comparators as a performance optimisation
From: Peter Geoghegan
Subject: Re: Inlining comparators as a performance optimisation
Date:
Msg-id: CAEYLb_WdiO+k0DAg9QTMXp_37ibYZLHyzOY0xA+Z73BN4X_spQ@mail.gmail.com
In reply to: Re: Inlining comparators as a performance optimisation (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Inlining comparators as a performance optimisation (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 18 November 2011 05:20, Robert Haas <robertmhaas@gmail.com> wrote:
> I think that we should really consider doing with this patch what Tom
> suggested upthread; namely, looking for a mechanism to allow
> individual datatypes to offer up a comparator function that doesn't
> require bouncing through FunctionCall2Coll(). It seems to me that
> duplicating the entire qsort() algorithm is a little iffy.

I understand that we highly value extensibility and genericity (yes, that's a real word). We may not always be well served by that tendency.

> Sure, in this case it works out to a win. But it's only a small win -
> three-quarters of it is in the uncontroversial activity of reducing
> the impedance mismatch - and how do we know it will always be a win?

Firstly, 1/4 of a quite large gain is still a pretty good gain. Secondly, I probably didn't actually isolate the effects of inlining, nor the overall benefit of the compiler knowing the comparator at compile time: I just removed the inline keyword, and those are two different things. The inline keyword is merely a request to the compiler to inline, and as most people know, the compiler can and often will ignore that request. What isn't so widely known is that modern compilers may equally well inline even when they haven't been asked to (but only when they can). Since inlining decisions are made at individual call sites, a certain amount of inlining and/or other optimisations such as procedural integration may well still be going on at some call sites even without the encouragement of the inline keyword; none of that would have been possible without the benefit of compile-time comparator knowledge. In this particular case, adding the inline keyword may just have the compiler inline even more call sites.
Posting the ~8ms difference was motivated by a desire to prove that inlining had *some* role to play, without actually going to the trouble of implementing Tom's idea as a basis for comparison, because Tom was very sceptical of inlining. The long and the short of it is that I'm going to have to get my hands dirty with a disassembler before we really know exactly what's happening. That, or I could use an optimisation fence of some kind.

> Adding more copies of the same code can be an anti-optimization if it
> means that a smaller percentage of it fits in the instruction cache,
> and sometimes small changes in runtime are caused by random shifts in
> the layout of memory that align things more or less favorably across
> cache lines rather than by real effects. Now it may well be that this
> is a real effect, but will it still look as good when we do this for
> 10 data types? For 100 data types?

I'd favour limiting it to just the common integer and float types.

> In contrast, it seems to me that reducing the impedance mismatch is
> something that we could go and do across our entire code base, and
> every single data type would benefit from it. It would also be
> potentially usable by other sorting algorithms, not just quick sort.

Suppose that we went ahead and added that infrastructure. What you must acknowledge is that one reason this speed-up is so dramatic is that the essential expense of a comparison is already so low - a single instruction - so the overall per-comparison cost goes way down, particularly if the qsort inner loop can be stored across fewer cache lines. For that reason, any absolute improvement you'll see for complex datatypes will be smaller, maybe much smaller, because each comparison will execute many more instructions that are essential to the comparison. In my estimation, none of this work points to there being an undue overhead in the function calling convention, as you suggested.
Still, I'm not opposed to investigating generalising this in some way, reservations notwithstanding, as long as we don't have to block on it. I don't want to chase diminishing returns too far.

> Well, that's kind of my point. I think this needs more work before we
> decide what the best approach is.

Agreed.

> So far, the ONLY test result we
> have that compares inlining to not inlining shows a speedup from 60 ms
> to 52 ms. I think that an 8 ms speedup on one test with one datatype
> on one platform/compiler combination isn't sufficient evidence to
> conclude that this is the best possible approach.

Fair enough, but it's not the only test I did: I posted other numbers for the same query when the table was 48MB, and we saw a proportional improvement, consistent with a per-comparison win. I'm supposed to be on leave for a few days at the moment, so I won't be very active this weekend, but I'm rather curious as to where you or others would like to see me go with benchmarks. I should point out that we currently don't have much idea how big a win applying these principles could be for index creation times; it could be very significant. My profiling of index creation makes this look promising.

On 18 November 2011 13:39, Robert Haas <robertmhaas@gmail.com> wrote:
> It seems entirely possible that inlining any non-trivial comparator function could work
> out to a loss,

Compilers weigh the size of the function to be inlined heavily when deciding, on a call-site-by-call-site basis, whether or not to inline, and much effort has been expended on making those heuristics work well. Besides, you're the one advocating pursuing this for types with non-trivial comparators. This will certainly be less effective for such types, perhaps quite a lot less effective, as already noted. It's worth acknowledging that sorting of integers and floats is in general very important, probably more important than any other sorting case.
> or that the exact choice of compiler flags could affect
> whether inlining works out to a plus or a minus (what happens with -O3
> vs. -O2? what about -O1? what about -O2 -fno-omit-frame-pointer?
> what if they're using HP's aCC or Intel's icc rather than gcc?).

If you'd like to see numbers for another platform, I can get those. How about Windows? I only have access to x86_64 boxes. I don't see that the exact compiler flags used will influence whether or not inlining is a win, so much as whether or not it occurs at all. As for -fno-omit-frame-pointer or something similar nullifying the benefits, that strikes me as a bit fantastical. We're already at the mercy of the compiler to do the right thing at this level of granularity; I really just want to help it along.

> There's no point in installing an optimization that could easily be a
> pessimization on some other workload, and that hasn't been tested, or
> at least no results have been posted here.

Hmm. Well, I accept that the burden of proof lies with me here, but I have a hard time believing that the inlining/compile-time optimisation benefits (benefits which, to repeat for emphasis, have not been isolated and measured yet) could be entirely nullified by CPU caching effects on any plausible workload. What does a benchmark that alleviates your concerns look like? The way I understand inlining to interact with the CPU cache is that it will increase the number of cache misses if it causes a cache line to be crossed, but will decrease the number of cache misses when locality of reference improves such that fewer cache lines are crossed within an inner loop. The overall cost of a sort is dominated by the number of comparisons performed, so locality of reference is particularly important here: in general, the comparator is called a great many times.
What I'm having a hard time with is general scepticism about the value of inlining, when the self-contained test case I posted showed an inlining qsort() radically outperforming my system's qsort(), without any impedance mismatch to muddy the waters. Sure, you can hold me to account and have me prove my claim, but don't you think it's curious that that effect was observed there? Incidentally, I'd like to call this fast path sorting, if it ends up being committed in a form that is not significantly different from what we have now.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services