Discussion: use CLZ instruction in AllocSetFreeIndex()
Hi all,

In commit ab5b4e2f9ed, we optimized AllocSetFreeIndex() using a lookup table. At the time, using CLZ was rejected because compiler/platform support was not widespread enough to justify it. For other reasons, we recently added bitutils.h which uses __builtin_clz() where available, so it makes sense to revisit this. I modified the test in [1] (C files attached), using two separate functions to test CLZ versus the open-coded algorithm of pg_leftmost_one_pos32().

These are typical results on a recent Intel platform:

HEAD 5.55s
clz 4.51s
open-coded 9.67s

CLZ gives a nearly 20% speed boost on this microbenchmark. I suspect that this microbenchmark is actually biased towards the lookup table more than real-world workloads would be, because the table can monopolize the L1 cache. Sparing cache is possibly the more interesting reason to use CLZ.

The open-coded version is nearly twice as slow, so it makes sense to keep the current implementation as the default one, and not use pg_leftmost_one_pos32() directly. However, with a small tweak, we can reuse the lookup table in bitutils.c instead of the bespoke one used solely by AllocSetFreeIndex(), saving a couple of cache lines here as well. This is done in the attached patch.

[1] https://www.postgresql.org/message-id/407d949e0907201811i13c73e18x58295566d27aadcc%40mail.gmail.com

-- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments
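[Editor's note: for readers skimming the thread, the shape of the proposed change can be sketched as standalone C. This is my approximation for illustration, not the attached patch; ALLOC_MINBITS is reproduced from aset.c, where the smallest freelist chunk is 1 << 3 = 8 bytes, and the function name is mine.]

```c
#include <stddef.h>
#include <stdint.h>

#define ALLOC_MINBITS 3			/* smallest chunk size is 8 bytes */

/*
 * Freelist index for a request of the given size: 0 for sizes up to
 * 8 bytes, then one step per doubling.  CLZ-based variant, in the
 * spirit of pg_leftmost_one_pos32(); __builtin_clz(0) is undefined,
 * but the guard keeps the argument nonzero.
 */
static inline int
alloc_set_free_index_clz(size_t size)
{
	int			idx = 0;

	if (size > (1 << ALLOC_MINBITS))
	{
		uint32_t	t = (uint32_t) ((size - 1) >> ALLOC_MINBITS);

		idx = 32 - __builtin_clz(t);
	}
	return idx;
}
```

The lookup-table path in HEAD computes the same index from the shifted value byte by byte; the point of the thread is that a single CLZ instruction replaces that loop.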
On 2019-Dec-26, John Naylor wrote:

> In commit ab5b4e2f9ed, we optimized AllocSetFreeIndex() using a lookup
> table. At the time, using CLZ was rejected because compiler/platform
> support was not widespread enough to justify it. For other reasons, we
> recently added bitutils.h which uses __builtin_clz() where available,
> so it makes sense to revisit this. I modified the test in [1] (C files
> attached), using two separate functions to test CLZ versus the
> open-coded algorithm of pg_leftmost_one_pos32().
>
> These are typical results on a recent Intel platform:
>
> HEAD 5.55s
> clz 4.51s
> open-coded 9.67s

I can confirm these results on my Intel laptop. I ran it with a repetition of 20, averages of 4 runs:

clz 1,614
bitutils 3,714
current 2,088

(stddevs are under 0,031).

-- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2019-Dec-26, John Naylor wrote:
>> In commit ab5b4e2f9ed, we optimized AllocSetFreeIndex() using a lookup
>> table. At the time, using CLZ was rejected because compiler/platform
>> support was not widespread enough to justify it. For other reasons, we
>> recently added bitutils.h which uses __builtin_clz() where available,
>> so it makes sense to revisit this. I modified the test in [1] (C files
>> attached), using two separate functions to test CLZ versus the
>> open-coded algorithm of pg_leftmost_one_pos32().

> I can confirm these results on my Intel laptop. I ran it with a
> repetition of 20, averages of 4 runs:

I tried this on a few other architectures --- ppc32, aarch64, and x86 (not 64). The general contours of the result are the same on all, eg here's the results on aarch64 (Fedora 28):

$ ./a.out 100
...
clz 22.713s
bitutils func 59.462s
current 30.630s

This kind of leads me to wonder if we don't need to expend more effort on the non-CLZ version of pg_leftmost_one_pos32; it seems like it shouldn't be losing this badly to the only-slightly-improved logic that's currently in AllocSetFreeIndex. On the other hand, the buildfarm thinks that __builtin_clz is essentially universal these days --- the only active non-MSVC critter that reports not having it is anole. So maybe it's not worth sweating over that. Perhaps what we really ought to be working on is finding MSVC equivalents for __builtin_clz and friends.

Anyway, getting back to the presented patch, I find myself a bit dissatisfied with it because it seems like it's leaving something on the table. Specifically, looking at the generated assembly code on a couple of architectures, the setup logic generated by

tsize = (size - 1) >> ALLOC_MINBITS;

looks like it costs as much as or more than the clz proper. I'm not sure we can get rid of the subtract-one step, but couldn't the right shift be elided in favor of changing the constant we subtract clz's result from? Shifting off those bits separately made sense in the old implementation, but assuming that CLZ is more or less constant-time, it doesn't with CLZ.

regards, tom lane
On Fri, Dec 27, 2019 at 9:54 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Anyway, getting back to the presented patch, I find myself a bit > dissatisfied with it because it seems like it's leaving something > on the table. Specifically, looking at the generated assembly > code on a couple of architectures, the setup logic generated by > > tsize = (size - 1) >> ALLOC_MINBITS; > > looks like it costs as much or more as the clz proper. I'm not > sure we can get rid of the subtract-one step, As I understand it, the desired outcome is ceil(log2(size)), which can be computed by clz(size - 1) + 1. > but couldn't the > right shift be elided in favor of changing the constant we > subtract clz's result from? Shifting off those bits separately > made sense in the old implementation, but assuming that CLZ is > more or less constant-time, it doesn't with CLZ. That makes sense -- I'll look into doing that. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
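[Editor's note: the identity John is relying on can be checked in isolation. A minimal sketch, function name mine: for 32-bit values, ceil(log2(x)) equals 32 - clz(x - 1) once x exceeds 1.]

```c
#include <stdint.h>

/*
 * ceil(log2(x)) for 32-bit x >= 1, via count-leading-zeros.
 * The x <= 1 guard avoids the undefined __builtin_clz(0).
 */
static inline int
ceil_log2_32(uint32_t x)
{
	return (x <= 1) ? 0 : 32 - __builtin_clz(x - 1);
}
```

Subtracting one first makes exact powers of two land on the lower bucket (e.g. 16 maps to 4, while 17 maps to 5), which is exactly the rounding AllocSetFreeIndex() needs.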
John Naylor <john.naylor@2ndquadrant.com> writes: > On Fri, Dec 27, 2019 at 9:54 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> ... but couldn't the >> right shift be elided in favor of changing the constant we >> subtract clz's result from? Shifting off those bits separately >> made sense in the old implementation, but assuming that CLZ is >> more or less constant-time, it doesn't with CLZ. > That makes sense -- I'll look into doing that. Actually, we could apply that insight to both code paths. In the existing path, that requires assuming ALLOCSET_NUM_FREELISTS+ALLOC_MINBITS <= 17, but that's OK. (Nowadays I'd probably add a StaticAssert about that.) regards, tom lane
On 2019-Dec-27, Tom Lane wrote: > This kind of leads me to wonder if we don't need to expend more > effort on the non-CLZ version of pg_leftmost_one_pos32; it seems > like it shouldn't be losing this badly to the only-slightly- > improved logic that's currently in AllocSetFreeIndex. On the > other hand, the buildfarm thinks that __builtin_clz is essentially > universal these days --- the only active non-MSVC critter that > reports not having it is anole. So maybe it's not worth sweating > over that. Perhaps what we really ought to be working on is > finding MSVC equivalents for __builtin_clz and friends. Apparently clz() can be written using _BitScanReverse(), per https://stackoverflow.com/a/20468180 https://docs.microsoft.com/en-us/cpp/intrinsics/bitscanreverse-bitscanreverse64?view=vs-2015 -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Dec 27, 2019 at 01:29:47PM -0300, Alvaro Herrera wrote: > On 2019-Dec-27, Tom Lane wrote: > > > This kind of leads me to wonder if we don't need to expend more > > effort on the non-CLZ version of pg_leftmost_one_pos32; it seems > > like it shouldn't be losing this badly to the only-slightly- > > improved logic that's currently in AllocSetFreeIndex. On the > > other hand, the buildfarm thinks that __builtin_clz is essentially > > universal these days --- the only active non-MSVC critter that > > reports not having it is anole. So maybe it's not worth sweating > > over that. Perhaps what we really ought to be working on is > > finding MSVC equivalents for __builtin_clz and friends. > > Apparently clz() can be written using _BitScanReverse(), per > https://stackoverflow.com/a/20468180 > https://docs.microsoft.com/en-us/cpp/intrinsics/bitscanreverse-bitscanreverse64?view=vs-2015 There are also various flavors of lzcntN for N in (16, 32, 64) on (relatively) modern architectures. https://docs.microsoft.com/en-us/cpp/intrinsics/lzcnt16-lzcnt-lzcnt64?redirectedfrom=MSDN&view=vs-2019 Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
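[Editor's note on semantics, my sketch rather than anything from the thread's attachments: unlike BSR-derived CLZ, the LZCNT instruction is defined for a zero input, where it returns the operand width. A portable emulation on top of __builtin_clz() therefore has to special-case zero; on MSVC, __lzcnt()/__lzcnt64() would be the direct equivalents, with the caveat that they require hardware LZCNT support to give correct results.]

```c
#include <stdint.h>

/*
 * Emulation of 32-bit LZCNT semantics on top of __builtin_clz().
 * LZCNT(0) is defined as the operand width (32 here), whereas
 * __builtin_clz(0) is undefined, hence the explicit zero check.
 */
static inline unsigned
lzcnt32_emul(uint32_t x)
{
	return x ? (unsigned) __builtin_clz(x) : 32;
}
```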
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > On 2019-Dec-27, Tom Lane wrote: >> ... Perhaps what we really ought to be working on is >> finding MSVC equivalents for __builtin_clz and friends. > Apparently clz() can be written using _BitScanReverse(), per > https://stackoverflow.com/a/20468180 > https://docs.microsoft.com/en-us/cpp/intrinsics/bitscanreverse-bitscanreverse64?view=vs-2015 Yeah, I found that too. It looks promising, but we need to look into * portability to different MSVC versions? (I guess the buildfarm would tell us) * performance, does it actually generate comparable code? * intrinsics for the other bit instructions we use? regards, tom lane
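[Editor's note: to make Tom's question concrete, a _BitScanReverse-based shim might look like the following. This is a sketch under the assumption that <intrin.h> is available on MSVC; the non-MSVC branch is included only so the snippet compiles anywhere, and clz32(0) is left undefined in both branches, matching __builtin_clz().]

```c
#include <stdint.h>

#ifdef _MSC_VER
#include <intrin.h>

static inline int
clz32(uint32_t word)
{
	unsigned long pos;

	/* Sets pos to the index of the highest set bit; word must be nonzero. */
	_BitScanReverse(&pos, word);
	return 31 - (int) pos;
}
#else
static inline int
clz32(uint32_t word)
{
	return __builtin_clz(word);
}
#endif
```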
On Fri, Dec 27, 2019 at 11:05 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> John Naylor <john.naylor@2ndquadrant.com> writes:
> > On Fri, Dec 27, 2019 at 9:54 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> ... but couldn't the
> >> right shift be elided in favor of changing the constant we
> >> subtract clz's result from? Shifting off those bits separately
> >> made sense in the old implementation, but assuming that CLZ is
> >> more or less constant-time, it doesn't with CLZ.
>
> > That makes sense -- I'll look into doing that.
>
> Actually, we could apply that insight to both code paths.
> In the existing path, that requires assuming
> ALLOCSET_NUM_FREELISTS+ALLOC_MINBITS <= 17, but that's OK.
> (Nowadays I'd probably add a StaticAssert about that.)

I tried that in the attached files and got these results:

current 6.14s
clz 4.52s
clz no right shift 3.15s
lookup table 5.56s
lookup table no right shift 7.34s

Here, "lookup table" refers to using the pg_leftmost_one_pos[] array and incrementing the result. Removing the shift operation from the CLZ case is clearly an improvement, and the main body goes from

movabsq $34359738367, %rax
addq %rax, %rdi
shrq $3, %rdi
bsrl %edi, %eax
xorl $-32, %eax
addl $33, %eax

to

decl %edi
bsrl %edi, %eax
xorl $-32, %eax
addl $30, %eax

The lookup table case is less clear. Removing the shift results in assembly that looks more like the C code and is slower for me. The standard lookup table code uses some magic constants and does its own constant folding by shifting 11 (8 + 3). In the absence of testing on platforms that will actually exercise this path, it seems the open-coded path should keep the shift for now. Thoughts?

-- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments
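[Editor's note: in C terms, the two CLZ variants John benchmarked differ only in where the shift by ALLOC_MINBITS is accounted for. Roughly, as a standalone sketch with my function names and the constant reproduced from aset.c:]

```c
#include <stddef.h>
#include <stdint.h>

#define ALLOC_MINBITS 3			/* smallest chunk size is 8 bytes */

/* "clz": shift off the low bits first, then count. */
static inline int
free_index_shift(size_t size)
{
	if (size <= (1 << ALLOC_MINBITS))
		return 0;
	return 32 - __builtin_clz((uint32_t) ((size - 1) >> ALLOC_MINBITS));
}

/*
 * "clz no right shift": fold the shift into the constant instead,
 * giving idx = (32 - ALLOC_MINBITS) - clz(size - 1).  The two agree
 * for every size, but this form saves the shift instruction.
 */
static inline int
free_index_no_shift(size_t size)
{
	if (size <= (1 << ALLOC_MINBITS))
		return 0;
	return (32 - ALLOC_MINBITS) - __builtin_clz((uint32_t) (size - 1));
}
```

The equivalence holds because shifting a value right by 3 reduces the position of its highest set bit by exactly 3 (once that position is at least 3, which the guard ensures).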
On Fri, Dec 27, 2019 at 07:02:02PM -0500, John Naylor wrote:
> On Fri, Dec 27, 2019 at 11:05 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > John Naylor <john.naylor@2ndquadrant.com> writes:
> > > On Fri, Dec 27, 2019 at 9:54 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> ... but couldn't the
> > >> right shift be elided in favor of changing the constant we
> > >> subtract clz's result from? Shifting off those bits separately
> > >> made sense in the old implementation, but assuming that CLZ is
> > >> more or less constant-time, it doesn't with CLZ.
> >
> > > That makes sense -- I'll look into doing that.
> >
> > Actually, we could apply that insight to both code paths.
> > In the existing path, that requires assuming
> > ALLOCSET_NUM_FREELISTS+ALLOC_MINBITS <= 17, but that's OK.
> > (Nowadays I'd probably add a StaticAssert about that.)
>
> I tried that in the attached files and got these results:
>
> current 6.14s
> clz 4.52s
> clz no right shift 3.15s
> lookup table 5.56s
> lookup table no right shift 7.34s
>
> Here, "lookup table" refers to using the pg_leftmost_one_pos[] array
> and incrementing the result. Removing the shift operation from the CLZ
> case is clearly an improvement, and the main body goes from
>
> movabsq $34359738367, %rax
> addq %rax, %rdi
> shrq $3, %rdi
> bsrl %edi, %eax
> xorl $-32, %eax
> addl $33, %eax
>
> to
>
> decl %edi
> bsrl %edi, %eax
> xorl $-32, %eax
> addl $30, %eax
>
> The lookup table case is less clear. Removing the shift results in
> assembly that looks more like the C code and is slower for me. The
> standard lookup table code uses some magic constants and does its own
> constant folding by shifting 11 (8 + 3). In the absence of testing on
> platforms that will actually exercise this path, it seems the
> open-coded path should keep the shift for now. Thoughts?

It's probably worth doing the things you've found unambiguous gains for as a patch, putting it on the next commitfest, and seeing what the commitfest.cputube.org machinery has to say about it. Maybe it'd be worth trying out a patch that enables CLZ for Windows, but that seems like a separate issue.

Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Fri, Dec 27, 2019 at 9:16 PM David Fetter <david@fetter.org> wrote: > On Fri, Dec 27, 2019 at 07:02:02PM -0500, John Naylor wrote: > > The lookup table case is less clear. Removing the shift results in > > assembly that looks more like the C code and is slower for me. The > > standard lookup table code uses some magic constants and does its own > > constant folding by shifting 11 (8 + 3). In the absence of testing on > > platforms that will actually exercise this path, it seems the > > open-coded path should keep the shift for now. Thoughts? > > It's probably worth doing the things you've found unambiguous gains > for as a patch, putting it on the next commitfest, and seeing what the > commitfest.cputube.org machinery has to say about it. Done in the attached. > Maybe it'd be worth trying out a patch that enables CLZ for Windows, > but that seems like a separate issue. Agreed. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments
v2 had an Assert that was only correct while experimenting with eliding right shift. Fixed in v3. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments
John Naylor <john.naylor@2ndquadrant.com> writes:
> v2 had an Assert that was only correct while experimenting with
> eliding right shift. Fixed in v3.

I think there must have been something wrong with your test that said that eliminating the right shift from the non-CLZ code made it slower. It should be an unconditional win, just as it is for the CLZ code path. (Maybe some odd cache-line-boundary effect?) Also, I think it's just weird to account for ALLOC_MINBITS one way in the CLZ path and the other way in the other path.

I decided that it might be a good idea to do performance testing in-place rather than in a standalone test program. I whipped up the attached that just does a bunch of palloc/pfree cycles. I got the following results on a non-cassert build (medians of a number of tests; the times are repeatable to ~ 0.1% for me):

HEAD: 2429.431 ms
v3 CLZ: 2131.735 ms
v3 non-CLZ: 2477.835 ms
remove shift: 2266.755 ms

I didn't bother to try this on non-x86_64 architectures, as previous testing convinces me the outcome should be about the same. Hence, pushed that way, with a bit of additional cosmetic foolery: the static assertion made more sense to me in relation to the documented assumption that size <= ALLOC_CHUNK_LIMIT, and I thought the comment could use some work.

regards, tom lane

/*
create function drive_palloc(count int) returns void
strict volatile language c as '.../drive_palloc.so';

\timing
select drive_palloc(10000000);
*/
#include "postgres.h"

#include "fmgr.h"
#include "miscadmin.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
#include "utils/memutils.h"

PG_MODULE_MAGIC;

/*
 * drive_palloc(count int) returns void
 */
PG_FUNCTION_INFO_V1(drive_palloc);

Datum
drive_palloc(PG_FUNCTION_ARGS)
{
	int32		count = PG_GETARG_INT32(0);

	while (count-- > 0)
	{
		for (size_t sz = 1; sz <= 8192; sz <<= 1)
		{
			void	   *p = palloc(sz);

			pfree(p);
		}
		CHECK_FOR_INTERRUPTS();
	}

	PG_RETURN_VOID();
}