Обсуждение: Using POPCNT and other advanced bit manipulation instructions

Поиск
Список
Период
Сортировка

Using POPCNT and other advanced bit manipulation instructions

От
David Rowley
Дата:
Back in 2016 [1] there was some discussion about using the POPCNT
instruction to improve the performance of counting the number of bits
set in a word.  Improving this helps various cases, such as
bms_num_members and also things like counting the allvisible and
frozen pages in the visibility map.

Thomas Munro did some work to make this happen but didn't go as far as
adding the required run-time test to allow builds which were built on
a machine with these instructions to work on a machine without them.
We've now got other places in the code which have similar run-time
tests (for example CRC calculation), so I think we should be able to
do the same for the ABM instructions.

Thomas mentions in [1], to get the GCC to use the POPCNT instruction,
we must pass -mpopcnt in the build flags. After doing a bit of
research, I found [2] which mentions that some compilers have some
pattern matching code to allow the popcnt instruction to be used even
without a macro such as __builtin_popcount().  I believe I've
correctly written the run-time test to skip using the new popcnt
function, but if there's any code around that might cause the compiler
to use the popcnt instruction from pattern matching, then that might
cause problems.  Remember, that's not limited to core code since
pg_config will cause extensions to be compiled with -mpopcnt too.

I've put together a very rough patch to implement using POPCNT and the
leading and trailing 0-bit instructions to improve the performance of
bms_next_member() and bms_prev_member().  The correct function should
be determined on the first call to each function by way of setting a
function pointer variable to the most suitable supported
implementation.   I've not yet gone through and removed all the
number_of_ones[] arrays to replace with a pg_popcount*() call. That
seems to have mostly been done in Thomas' patch [3], part of which
I've used for the visibilitymap.c code changes.  If this patch proves
to be possible, then I'll look at including the other changes Thomas
made in his patch too.

What I'm really looking for by posting now are reasons why we can't do
this. I'm also interested in getting some testing done on older
machines, particularly machines with processors that are from before
2007, both AMD and Intel. 2007-2008 seems to be around the time both
AMD and Intel added support for POPCNT and LZCNT, going by [4].

I'm also a little uncertain of my cpuid bit tests. POPCNT appears to
have use bit 5 in EAX=80000001h, but also bit 23 in EAX=1 [5]. This
appears to be a variation between Intel and AMD. AMD always implement
either both POPCNT and LZCNT or neither. Where Intel use AMDs cpuid
bit flag just for LZCNT and have reserved their own flag for POPCNT
(they didn't implement both at once, as AMD did). I'm a bit uncertain
if AMD will set the Intel POPCNT flag or not, and if they do now, then
I'm not sure if they always did. Intel were 2nd in that race, so I
assume at least the earliest AMD processors would just set only the
AMD flag.  Testing might help reveal if I have this right.

I am able to measure performance gains from the patch.  In a 3.4GB
table containing a single column with just 10 statistics targets, I
got the following times after running ANALYZE on the table.

Patched:

postgres=# analyze t1;
Time: 680.833 ms
Time: 699.976 ms
Time: 695.608 ms
Time: 676.007 ms
Time: 693.487 ms
Time: 726.982 ms
Time: 677.835 ms
Time: 688.426 ms

Master:

postgres=# analyze t1;
Time: 721.837 ms
Time: 756.035 ms
Time: 734.545 ms
Time: 751.969 ms
Time: 730.140 ms
Time: 724.266 ms
Time: 713.625 ms

(+3.66% avg)

This should be down to the improved performance of
visibilitymap_count(), but it may not be entirely just from faster bit
counter as I also couldn't resist tightening up the inner-most loop.

[1]
https://www.postgresql.org/message-id/flat/CAEepm%3D3k%2B%2BYtf2LNQCvpP6m1%3DgY9zZHP_cfnn47%3DWTsoCrLCvA%40mail.gmail.com
[2] https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/
[3] https://www.postgresql.org/message-id/CAEepm%3D3g1_fjJGp38QGv%2B38BC2HHVkzUq6s69nk3mWLgPHqC3A%40mail.gmail.com
[4] https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT
[5] https://en.wikipedia.org/wiki/CPUID#CPUID_usage_from_high-level_languages

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

Re: Using POPCNT and other advanced bit manipulation instructions

От
Gavin Flower
Дата:
On 20/12/2018 18:53, David Rowley wrote
[...]
> Patched:
>
> postgres=# analyze t1;
> Time: 680.833 ms
> Time: 699.976 ms
> Time: 695.608 ms
> Time: 676.007 ms
> Time: 693.487 ms
> Time: 726.982 ms
> Time: 677.835 ms
> Time: 688.426 ms
>
> Master:
>
> postgres=# analyze t1;
> Time: 721.837 ms
> Time: 756.035 ms
> Time: 734.545 ms
> Time: 751.969 ms
> Time: 730.140 ms
> Time: 724.266 ms
> Time: 713.625 ms
>
> (+3.66% avg)

[...]

Looking at the normalized standard deviations, the patched results have 
a higher than 5% chance of being better simply by chance.  I suspect 
that you have made an improvement, but the statistics are not convincing.

I can supply detailed working if you want.


Cheers,
Gavin




Re: Using POPCNT and other advanced bit manipulation instructions

От
David Rowley
Дата:
On Thu, 20 Dec 2018 at 20:17, Gavin Flower
<GavinFlower@archidevsys.co.nz> wrote:
> Looking at the normalized standard deviations, the patched results have
> a higher than 5% chance of being better simply by chance.  I suspect
> that you have made an improvement, but the statistics are not convincing.

Yeah, I'd hoped that I could have gotten a better signal to noise
ratio by running the test many times, but you're right.  That was on
my laptop.  I've run the test again on an AWS instance and the results
seem to be a bit more stable. Same table with 1 int column and 100m
rows. statistics set to 10.

Unpatched

postgres=# analyze a;

Time: 38.248 ms
Time: 35.185 ms
Time: 35.067 ms
Time: 34.879 ms
Time: 34.816 ms
Time: 34.558 ms
Time: 34.722 ms
Time: 34.427 ms
Time: 34.214 ms
Time: 34.301 ms
Time: 35.751 ms
Time: 33.993 ms
Time: 33.880 ms
Time: 33.617 ms
Time: 33.381 ms
Time: 33.326 ms

Patched:

postgres=# analyze a;

Time: 34.421 ms
Time: 33.523 ms
Time: 33.230 ms
Time: 33.678 ms
Time: 32.987 ms
Time: 32.914 ms
Time: 33.165 ms
Time: 32.707 ms
Time: 32.645 ms
Time: 32.814 ms
Time: 32.082 ms
Time: 32.143 ms
Time: 32.310 ms
Time: 31.966 ms
Time: 31.702 ms
Time: 32.089 ms

Avg +5.72%, Median +5.29%

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Using POPCNT and other advanced bit manipulation instructions

От
Dmitry Dolgov
Дата:
> On Thu, Dec 20, 2018 at 6:53 AM David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> Thomas mentions in [1], to get the GCC to use the POPCNT instruction,
> we must pass -mpopcnt in the build flags. After doing a bit of
> research, I found [2] which mentions that some compilers have some
> pattern matching code to allow the popcnt instruction to be used even
> without a macro such as __builtin_popcount().  I believe I've
> correctly written the run-time test to skip using the new popcnt
> function, but if there's any code around that might cause the compiler
> to use the popcnt instruction from pattern matching, then that might
> cause problems.

I've checked for Clang 6, it turns out that indeed it generates popcnt without
any macro, but only in one place for bloom_prop_bits_set. After looking at this
function it seems that it would be benefitial to actually use popcnt there too.

> I am able to measure performance gains from the patch.  In a 3.4GB
> table containing a single column with just 10 statistics targets, I
> got the following times after running ANALYZE on the table.

I've tested it too a bit, and got similar results when the patched version is
slightly faster. But then I wonder if popcnt is the best solution here, since
after some short research I found a paper [1], where authors claim that:

    Maybe surprisingly, we show that a vectorized approach using SIMD
    instructions can be twice as fast as using the dedicated instructions on
    recent Intel processors.


[1]: https://arxiv.org/pdf/1611.07612.pdf


Re: Using POPCNT and other advanced bit manipulation instructions

От
Jose Luis Tallon
Дата:
On 20/12/18 6:53, David Rowley wrote:
> Back in 2016 [1] there was some discussion about using the POPCNT
> instruction to improve the performance of counting the number of bits
> set in a word.  Improving this helps various cases, such as
> bms_num_members and also things like counting the allvisible and
> frozen pages in the visibility map.
>
> [snip]
>
> I've put together a very rough patch to implement using POPCNT and the
> leading and trailing 0-bit instructions to improve the performance of
> bms_next_member() and bms_prev_member().  The correct function should
> be determined on the first call to each function by way of setting a
> function pointer variable to the most suitable supported
> implementation.   I've not yet gone through and removed all the
> number_of_ones[] arrays to replace with a pg_popcount*() call.

IMVHO: Please do not disregard potential optimization by the compiler 
around those calls.. o_0  That might explain the reduced performance 
improvement observed.

Not that I can see any obvious alternative to your implementation right 
away ...

> That
> seems to have mostly been done in Thomas' patch [3], part of which
> I've used for the visibilitymap.c code changes.  If this patch proves
> to be possible, then I'll look at including the other changes Thomas
> made in his patch too.
>
> What I'm really looking for by posting now are reasons why we can't do
> this. I'm also interested in getting some testing done on older
> machines, particularly machines with processors that are from before
> 2007, both AMD and Intel.

I can offer a 2005-vintage Opteron 2216 rev3 (bought late 2007) to test 
on. Feel free to toss me some test code.

cpuinfo flags:    fpu de tsc msr pae mce cx8 apic mca cmov pat clflush 
mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow 
rep_good nopl extd_apicid eagerfpu pni cx16 hypervisor lahf_lm 
cmp_legacy 3dnowprefetch vmmcall

>   2007-2008 seems to be around the time both
> AMD and Intel added support for POPCNT and LZCNT, going by [4].

Thanks




Re: Using POPCNT and other advanced bit manipulation instructions

От
David Rowley
Дата:
Thanks for looking at this.

On Thu, 20 Dec 2018 at 23:56, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> I've checked for Clang 6, it turns out that indeed it generates popcnt without
> any macro, but only in one place for bloom_prop_bits_set. After looking at this
> function it seems that it would be benefitial to actually use popcnt there too.

Yeah, that's the pattern that's mentioned in
https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/
It would need to be changed to call the popcount function.  This
existing makes me a bit more worried that some extension could be
using a similar pattern and end up being compiled with -mpopcnt due to
pg_config having that CFLAG. That's all fine until the binary makes
it's way over to a machine without that instruction.

> > I am able to measure performance gains from the patch.  In a 3.4GB
> > table containing a single column with just 10 statistics targets, I
> > got the following times after running ANALYZE on the table.
>
> I've tested it too a bit, and got similar results when the patched version is
> slightly faster. But then I wonder if popcnt is the best solution here, since
> after some short research I found a paper [1], where authors claim that:
>
>     Maybe surprisingly, we show that a vectorized approach using SIMD
>     instructions can be twice as fast as using the dedicated instructions on
>     recent Intel processors.
>
>
> [1]: https://arxiv.org/pdf/1611.07612.pdf

I can't imagine that using the number_of_ones[] array processing
8-bits at a time would be slower than POPCNT though.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Using POPCNT and other advanced bit manipulation instructions

От
David Rowley
Дата:
On Thu, 20 Dec 2018 at 23:59, Jose Luis Tallon
<jltallon@adv-solutions.net> wrote:
> IMVHO: Please do not disregard potential optimization by the compiler
> around those calls.. o_0  That might explain the reduced performance
> improvement observed.

It was a speedup that I measured. Did you see something else?

> > What I'm really looking for by posting now are reasons why we can't do
> > this. I'm also interested in getting some testing done on older
> > machines, particularly machines with processors that are from before
> > 2007, both AMD and Intel.
>
> I can offer a 2005-vintage Opteron 2216 rev3 (bought late 2007) to test
> on. Feel free to toss me some test code.
>
> cpuinfo flags:    fpu de tsc msr pae mce cx8 apic mca cmov pat clflush
> mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow
> rep_good nopl extd_apicid eagerfpu pni cx16 hypervisor lahf_lm
> cmp_legacy 3dnowprefetch vmmcall
>
> >   2007-2008 seems to be around the time both
> > AMD and Intel added support for POPCNT and LZCNT, going by [4].

It would be really good if you could git clone a copy of master and
patch it with the patch from earlier in the thread and see if you
encounter any issues running make check-world.

I'm a bit uncertain if passing -mpopcnt to a recent gcc would result
in the popcnt instruction being compiled in if the machine doing the
compiling had no support for that.

Likely it would be simple to test that with:

echo "int main(char **argv, int argc) { return
__builtin_popcount(argc); }" > popcnt.c && gcc popcnt.c -S -mpopcnt &&
cat popcnt.s | grep pop

I see a "popcntl" in there on my machine.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Using POPCNT and other advanced bit manipulation instructions

От
Dmitry Dolgov
Дата:
> On Fri, Jan 4, 2019 at 1:38 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> On Thu, 20 Dec 2018 at 23:56, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > I've checked for Clang 6, it turns out that indeed it generates popcnt without
> > any macro, but only in one place for bloom_prop_bits_set. After looking at this
> > function it seems that it would be benefitial to actually use popcnt there too.
>
> Yeah, that's the pattern that's mentioned in
> https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/
> It would need to be changed to call the popcount function.  This
> existing makes me a bit more worried that some extension could be
> using a similar pattern and end up being compiled with -mpopcnt due to
> pg_config having that CFLAG. That's all fine until the binary makes
> it's way over to a machine without that instruction.

It surprises me, that it's not that obvious how to disable this feature for
clang. I guess one should be able to turn it off by invoking opt manually:

    clang -S -mpopcnt -emit-llvm *.c
    opt -S -mattr=+popcnt <all the options without -loop-idiom> *.ll
    llc -mattr=+popcnt *.optimized.ll
    clang -mpopcnt *optimized.s

But for some reason this doesn't work for me (popcnt is not appearing in
the first place).

> > > I am able to measure performance gains from the patch.  In a 3.4GB
> > > table containing a single column with just 10 statistics targets, I
> > > got the following times after running ANALYZE on the table.
> >
> > I've tested it too a bit, and got similar results when the patched version is
> > slightly faster. But then I wonder if popcnt is the best solution here, since
> > after some short research I found a paper [1], where authors claim that:
> >
> >     Maybe surprisingly, we show that a vectorized approach using SIMD
> >     instructions can be twice as fast as using the dedicated instructions on
> >     recent Intel processors.
> >
> >
> > [1]: https://arxiv.org/pdf/1611.07612.pdf
>
> I can't imagine that using the number_of_ones[] array processing
> 8-bits at a time would be slower than POPCNT though.

Yeah, probably you're right. If I understand correctly even with the lookup
table in the cache the access would be a bit slower than a POPCNT instruction.


Re: Using POPCNT and other advanced bit manipulation instructions

От
Alvaro Herrera
Дата:
I only have cosmetic suggestions for this patch.  For one thing, I think
the .c file should be in src/port and its header should be in
src/include/port/, right beside the likes of pg_bswap.h and pg_crc32c.h.
For another, I think the arrangement of all those "ifdef
HAVE_THIS_OR_THAT" in the bitutils.c file is a bit hard to read.  I'd
lay them out like this:

  #ifdef HAVE__BUILTIN_CTZ
  int            (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_choose;
  #else
  int            (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_slow;
  #endif

  #ifdef HAVE__BUILTIN_CTZ
  /*
   * This gets called on the first call. It replaces the function pointer
   * so that subsequent calls are routed directly to the chosen implementation.
   */
  static int
  pg_rightmost_one32_choose(uint32 word)
  {
  ...

(You need declarations for the "choose" variants at the top of the file,
but that seems okay.)

Finally, the part in bitmapset.c is repetitive on the #ifdefs; I'd just
put at the top of the file something like 

  #if bms are 32 bits
  #define pg_rightmost_one(x) pg_rightmost_one32(x)
  #define pg_popcount(x) pg_popcount32(x)
  #elif they are 64 bits
  #define ...
  #else
  #error ...
  #endif

This way, each place that uses the functions does not need the ifdefs.

Other than those minor changes, I think we should just get this pushed
and see what the buildfarm thinks.  In the words of a famous PG hacker:
if a platform ain't in the buildfarm, we don't support it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Using POPCNT and other advanced bit manipulation instructions

От
Michael Paquier
Дата:
On Thu, Jan 31, 2019 at 07:45:02PM -0300, Alvaro Herrera wrote:
> Other than those minor changes, I think we should just get this pushed
> and see what the buildfarm thinks.  In the words of a famous PG hacker:
> if a platform ain't in the buildfarm, we don't support it.

Moved to next CF, waiting on author.  I think that this needs more
reviews.
--
Michael

Вложения

Re: Using POPCNT and other advanced bit manipulation instructions

От
David Rowley
Дата:
Thanks for looking at this.

On Fri, 1 Feb 2019 at 11:45, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> I only have cosmetic suggestions for this patch.  For one thing, I think
> the .c file should be in src/port and its header should be in
> src/include/port/, right beside the likes of pg_bswap.h and pg_crc32c.h.

I've moved the code into src/port and renamed the file to pg_bitutils.c

> For another, I think the arrangement of all those "ifdef
> HAVE_THIS_OR_THAT" in the bitutils.c file is a bit hard to read.  I'd
> lay them out like this:

I've made this change too, although when doing it I realised that I
had forgotten to include the check for CPUID. It's possible that does
not exist but POPCNT does, I guess.  This has made the #ifs a bit more
complex.

> Finally, the part in bitmapset.c is repetitive on the #ifdefs; I'd just
> put at the top of the file something like

Yeah, agreed. Much neater that way.

> Other than those minor changes, I think we should just get this pushed
> and see what the buildfarm thinks.  In the words of a famous PG hacker:
> if a platform ain't in the buildfarm, we don't support it.

I also made a number of other changes to the patch.

1. The patch now only uses the -mpopcnt CFLAG for pg_bitutils.c.  I
thought this was important so we don't expose that flag in pg_config
and possibly end up building extension with popcnt instructions, which
might not be portable to other older hardware.
2. Wrote a new pg_popcnt function that accepts an array of bytes and a
size variable.  This seems useful for the bloomfilter use.

There are still various number_of_ones[] arrays around the codebase.
These exist in tsgistidx.c, _intbig_gist.c and _ltree_gist.c.  It
would be nice to get rid of those too, but one of the usages in each
of those 3 files requires XORing with another bit array before
counting the bits.  I thought about maybe writing a pop_count_xor()
function that accepts 2 byte arrays and a length parameter, but it
seems a bit special case, so I didn't.

Another thing I wasn't sure of was if I should just have
bms_num_members() just call pg_popcount(). It might be worth
benchmarking to see what's faster. My thinking is that pg_popcount
will inline the pg_popcount64() call so it would mean a single
function call rather than one for each bitmapword in the set.

I've compiled and run make check-world on Linux with GCC7.3 and
clang6.0. I've also tested on MSVC to ensure I didn't break windows.
It would be good to get a few more people to compile it and run the
tests.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

Re: Using POPCNT and other advanced bit manipulation instructions

От
Alvaro Herrera
Дата:
On 2019-Feb-04, David Rowley wrote:

> On Fri, 1 Feb 2019 at 11:45, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> >
> > I only have cosmetic suggestions for this patch.  For one thing, I think
> > the .c file should be in src/port and its header should be in
> > src/include/port/, right beside the likes of pg_bswap.h and pg_crc32c.h.
> 
> I've moved the code into src/port and renamed the file to pg_bitutils.c

I've pushed this now.  Let's see what the buildfarm has to say about it.

> I've compiled and run make check-world on Linux with GCC7.3 and
> clang6.0. I've also tested on MSVC to ensure I didn't break windows.
> It would be good to get a few more people to compile it and run the
> tests.

That's what the buildfarm is for ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Using POPCNT and other advanced bit manipulation instructions

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I've pushed this now.  Let's see what the buildfarm has to say about it.

It's likely to be hard to tell, given the amount of pink from the Ryu
patch.  If Andrew is not planning to clean that up PDQ, I'd suggest
reverting that patch pending having some repairs for it.

            regards, tom lane


Re: Using POPCNT and other advanced bit manipulation instructions

От
Andres Freund
Дата:
Hi,

On February 13, 2019 8:40:14 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> I've pushed this now.  Let's see what the buildfarm has to say about
>it.
>
>It's likely to be hard to tell, given the amount of pink from the Ryu
>patch.  If Andrew is not planning to clean that up PDQ, I'd suggest
>reverting that patch pending having some repairs for it.

I'd assume that breaking bit counting would cause distinct enough damage (compile time or crashes).  That's not to say
thatreverting ryu shouldn't be considered (although I'm not that bothered by cross version, ia64 and cygwin failures,
especiallybecause the latter two might be hard to come by outside the bf). 

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: Using POPCNT and other advanced bit manipulation instructions

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> I'd assume that breaking bit counting would cause distinct enough damage (compile time or crashes).  That's not to
saythat reverting ryu shouldn't be considered (although I'm not that bothered by cross version, ia64 and cygwin
failures,especially because the latter two might be hard to come by outside the bf). 

The pink doesn't appear to be limited to non-mainstream platforms,
see eg lapwing, fulmar.  However, I see Andrew just pushed some fixes,
so this argument is moot pending how much that helps.

            regards, tom lane


Re: Using POPCNT and other advanced bit manipulation instructions

От
Andrew Gierth
Дата:
>>>>> "Andres" == Andres Freund <andres@anarazel.de> writes:

 >> It's likely to be hard to tell, given the amount of pink from the
 >> Ryu patch. If Andrew is not planning to clean that up PDQ,

Besides crake (x-version), fulmar (icc) and lorikeet (cygwin), I hope
the rest of the known failures should pass on the next cycle; the
mac/ppc failures are because we redefine "bool" in a way that broke the
upstream code's c99 assumptions, and the rest are numerical instability
in ts_rank.

 >> I'd suggest reverting that patch pending having some repairs for it.

 Andres> I'd assume that breaking bit counting would cause distinct
 Andres> enough damage (compile time or crashes). That's not to say that
 Andres> reverting ryu shouldn't be considered (although I'm not that
 Andres> bothered by cross version, ia64 and cygwin failures, especially
 Andres> because the latter two might be hard to come by outside the
 Andres> bf).

IA64 is working fine as far as I can see (specifically, anole is
passing); it's ICC on x86_64 that broke (fulmar).

-- 
Andrew (irc:RhodiumToad)


Re: Using POPCNT and other advanced bit manipulation instructions

От
Alvaro Herrera
Дата:
On 2019-Feb-13, Andres Freund wrote:

> Hi,
> 
> On February 13, 2019 8:40:14 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> >> I've pushed this now.  Let's see what the buildfarm has to say about
> >it.
> >
> >It's likely to be hard to tell, given the amount of pink from the Ryu
> >patch.  If Andrew is not planning to clean that up PDQ, I'd suggest
> >reverting that patch pending having some repairs for it.
> 
> I'd assume that breaking bit counting would cause distinct enough
> damage (compile time or crashes).

I was a bit surprised to find out that the assembly generated by
compiling the code in test for __builtin_foo() does not actually include
the calls being tested ... (they're only used to generate the value for
a static variable, and that gets optimized away); but then the comment
for the test does say that we're only testing that the compiler
understands the construct, so I suppose that's fine.  Also, we already
do that for bswap.

This "compiler explorer" tool is nice:


https://gcc.godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(j:1,source:'int+main(int+argc+,+char**+argv)%0A%7B%0A%0A++return++__builtin_popcountll((unsigned)argc)%3B%0A%0A%7D'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:45.01119572686435,l:'4',m:100,n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:g447,filters:(b:'0',commentOnly:'0',directives:'0',intel:'0'),libs:!(),options:'-Wall+-O3+-msse4.2',source:1),l:'5',n:'0',o:'x86-64+gcc+4.4.7+(Editor+%231,+Compiler+%231)',t:'0')),k:54.98880427313565,l:'4',m:100,n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Using POPCNT and other advanced bit manipulation instructions

От
Andrew Gierth
Дата:
>>>>> "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

 Andrew> IA64 is working fine as far as I can see (specifically, anole
 Andrew> is passing); it's ICC on x86_64 that broke (fulmar).

And I know what's wrong on fulmar now, so that'll be fixed shortly.

-- 
Andrew (irc:RhodiumToad)


Re: Using POPCNT and other advanced bit manipulation instructions

От
Andrew Gierth
Дата:
>>>>> "Alvaro" == Alvaro Herrera <alvherre@2ndquadrant.com> writes:

 Alvaro> I've pushed this now. Let's see what the buildfarm has to say
 Alvaro> about it.

Lapwing's latest failure looks like yours rather than mine now? (the
previous two were mine)

-- 
Andrew (irc:RhodiumToad)


Re: Using POPCNT and other advanced bit manipulation instructions

От
Alvaro Herrera
Дата:
On 2019-Feb-13, Andrew Gierth wrote:

> >>>>> "Alvaro" == Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> 
>  Alvaro> I've pushed this now. Let's see what the buildfarm has to say
>  Alvaro> about it.
> 
> Lapwing's latest failure looks like yours rather than mine now? (the
> previous two were mine)

It definitely is ... plans have changed from using IndexOnly scans to
Seqscans, which is likely fallout from the visibilitymap_count() change.
Looking.

(I patched the Makefile to add -mpopcnt to all the compile lines rather
than just the frontend one, but I can't see that explaining the
failure.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Using POPCNT and other advanced bit manipulation instructions

От
Andrew Gierth
Дата:
>>>>> "Alvaro" == Alvaro Herrera <alvherre@2ndquadrant.com> writes:

 >> Lapwing's latest failure looks like yours rather than mine now? (the
 >> previous two were mine)

 Alvaro> It definitely is ... plans have changed from using IndexOnly
 Alvaro> scans to Seqscans, which is likely fallout from the
 Alvaro> visibilitymap_count() change. Looking.

As for the rest, crake's "configure" failure was due to Andrew aborting
a run presumably to tweak the config, and fulmar finished a run just
before I committed the fix that should turn it green again. I'm
obviously going to keep watching, but to my knowledge only crake
(x-version test) and lorikeet (cygwin) should still be broken from my
stuff.

-- 
Andrew (irc:RhodiumToad)


Re: Using POPCNT and other advanced bit manipulation instructions

От
Alvaro Herrera
Дата:
On 2019-Feb-13, Alvaro Herrera wrote:

> It definitely is ... plans have changed from using IndexOnly scans to
> Seqscans, which is likely fallout from the visibilitymap_count() change.

I think the problem here is that "unsigned long" is 32 bits in this
machine:
  checking whether long int is 64 bits... no

and we have defined pg_popcount64() like this:

static int
pg_popcount64_sse42(uint64 word)
{
    return __builtin_popcountl(word);
}

so it's counting bits in the lower half of the uint64.

If that's correct, then I think we need something like this patch.  But
it makes me wonder whether we need a configure test for
__builtin_popcountll() and friends.  I wonder if there's any compiler
that implements __builtin_popcountl() but not __builtin_popcountll() ...
and if not, then the test for __builtin_popcountl() should be removed,
and have everything rely on the one for __builtin_popcount().

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: Using POPCNT and other advanced bit manipulation instructions

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> and we have defined pg_popcount64() like this:

> static int
> pg_popcount64_sse42(uint64 word)
> {
>     return __builtin_popcountl(word);
> }

That is clearly completely broken.


> If that's correct, then I think we need something like this patch.  But
> it makes me wonder whether we need a configure test for
> __builtin_popcountll() and friends.  I wonder if there's any compiler
> that implements __builtin_popcountl() but not __builtin_popcountll() ...
> and if not, then the test for __builtin_popcountl() should be removed,
> and have everything rely on the one for __builtin_popcount().

AFAICS, this is a gcc-ism, and it looks like they've probably had
all width variants for the same amount of time.  I'd take out the
test for __builtin_popcountl(), and assume that testing for
__builtin_popcount() is sufficient until proven differently.

            regards, tom lane


Re: Using POPCNT and other advanced bit manipulation instructions

От
Tom Lane
Дата:
... btw, why is pg_popcount casting away the const from its pointer
argument?

            regards, tom lane


Re: Using POPCNT and other advanced bit manipulation instructions

От
Alvaro Herrera
Дата:
On 2019-Feb-13, Tom Lane wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > and we have defined pg_popcount64() like this:
> 
> > static int
> > pg_popcount64_sse42(uint64 word)
> > {
> >     return __builtin_popcountl(word);
> > }
> 
> That is clearly completely broken.

Pushed my proposed fix, which includes removing the configure tests for
builtins of varying widths.  I couldn't resist sorting entries
alphabetically in configure.in.  (I also used autoheader to produce the
new pg_config.h, which showed me that David had not used it to generate
his diffs there.)

For pg_config.h.win32 I used the compiler explorer tool I just learned
about, and came to the conclusion that MSVC's compiler does not
implement these builtins.

I didn't do anything about the const-cast-away in pg_popcount() yet.
I think that should use PointerIsAligned() instead of what it's doing
now.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Using POPCNT and other advanced bit manipulation instructions

От
Tom Lane
Дата:
And, while I'm complaining: why the devil is use of the compiler builtins
gated by HAVE__GET_CPUID?  This is unbelievably Intel-centric, because
it prevents use of the builtins on other architectures.  If the builtin
exists, we should use it, full stop.  There's no reason to expect that it
would be slower than hand-rolled code, regardless of the architecture.

I'd be inclined to rip out all of the run-time-detection logic here;
I doubt any of it is buying anything that's worth the price of an
indirect call.

            regards, tom lane


Re: Using POPCNT and other advanced bit manipulation instructions

От
Thomas Munro
Дата:
On Thu, Feb 14, 2019 at 4:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> And, while I'm complaining: why the devil is use of the compiler builtins
> gated by HAVE__GET_CPUID?  This is unbelievably Intel-centric, because
> it prevents use of the builtins on other architectures.  If the builtin
> exists, we should use it, full stop.  There's no reason to expect that it
> would be slower than hand-rolled code, regardless of the architecture.

FWIW a quick test of __builtin_popcount(n) compiles as CNT on a Debian
ARM system, without any special compiler flags.

> I'd be inclined to rip out all of the run-time-detection logic here;
> I doubt any of it is buying anything that's worth the price of an
> indirect call.

No view on that but apparently there were Intel Atom and AMD C chips
sold in the early part of this decade that lack POPCNT so I suspect
the distros can't ship software that requires it with no fallback.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Using POPCNT and other advanced bit manipulation instructions

От
Tom Lane
Дата:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Thu, Feb 14, 2019 at 4:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'd be inclined to rip out all of the run-time-detection logic here;
>> I doubt any of it is buying anything that's worth the price of an
>> indirect call.

> No view on that but apparently there were Intel Atom and AMD C chips
> sold in the early part of this decade that lack POPCNT so I suspect
> the distros can't ship software that requires it with no fallback.

Ah, I was not looking at the business with the optional -mpopcnt
compiler flag.  I agree that we probably should not assume that
code compiled with that will run anywhere.  But it's silly to build
all this infrastructure and then throw away the opportunity to
optimize for anything but late-model Intel.

A survey of the buildfarm results so far says that __builtin_clz
and __builtin_ctz exist just about everywhere, and even
__builtin_popcount is available on some non-Intel architectures.
It is reasonable to assume that those builtins are faster than
the C equivalents if they exist.  It's reasonable to assume that
even on old-school Intel hardware.

The way this should have been done is to have a separate file
that's compiled with -mpopcnt if the compiler has that (and
has the builtins), and for the mainline file to have "slow"
versions that use the less-optimized builtins if available,
and only fall back to raw C code if not HAVE__BUILTIN_WHATEVER.

Also, in

#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNT)

static bool
pg_popcount_available(void)
{
    unsigned int exx[4] = { 0, 0, 0, 0 };

#if defined(HAVE__GET_CPUID)
    __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
#elif defined(HAVE__CPUID)
    __cpuid(exx, 1);
#else
#error cpuid instruction not available
#endif

    return (exx[2] & (1 << 23)) != 0;    /* POPCNT */
}
#endif

it's obvious to the naked eye that the __cpuid() and #error
branches are unreachable because of the outer #if.  I don't
think that was the design intention.

            regards, tom lane


Re: Using POPCNT and other advanced bit manipulation instructions

От
Tom Lane
Дата:
Some further thoughts here ...

Does the "lzcnt" runtime probe actually do anything useful?
On the x86_64 compilers I tried (gcc 8.2.1 and 4.4.7), __builtin_clz
and __builtin_ctz compile to sequences involving bsrq and bsfq
regardless of -mpopcnt.  It's fairly hard to see how lzcnt would
buy anything over those sequences even if there were zero overhead
involved in using it.

Alvaro noted that the test programs used by c-compiler.m4 fail
to produce any actual code involving the builtin, because of
compile-time constant folding.  This seems pretty unwise.
I see that on my x86_64 compilers, without -mpopcnt,
__builtin_popcnt compiles to a call of some libgcc function
or other.  It's conceivable that on an (arguably misconfigured)
platform, these c-compiler.m4 tests would pass yet the build fails
at link because libgcc lacks the needed infrastructure.  These tests
should be coded in a way that doesn't allow the call to be optimized
away -- cf comments for PGAC_C_BUILTIN_OP_OVERFLOW.

Also, it's starting to seem like we have enough probes for compiler
builtins that we should fold them to use one set of infrastructure.
There are some like __builtin_constant_p that probably do need their
own custom tests, but these ones that just verify that a call
compiles seem pretty duplicative ...

            regards, tom lane


Re: Using POPCNT and other advanced bit manipulation instructions

От
Gavin Flower
Дата:
On 14/02/2019 11:17, Alvaro Herrera wrote:
> On 2019-Feb-13, Alvaro Herrera wrote:
>
>> It definitely is ... plans have changed from using IndexOnly scans to
>> Seqscans, which is likely fallout from the visibilitymap_count() change.
> I think the problem here is that "unsigned long" is 32 bits in this
> machine:

[...]

 From my memory of reading of K&R many moons ago, it said that C only 
guarantees that the lengths are such that

byte <= half word <= word <= long

But I don't recall ever seeing a long less than 32 bits!


Cheers,
Gavin



Re: Using POPCNT and other advanced bit manipulation instructions

От
Tom Lane
Дата:
Gavin Flower <GavinFlower@archidevsys.co.nz> writes:
> From my memory of reading of K&R many moons ago, it said that C only 
> guarantees that the lengths are such that
> byte <= half word <= word <= long

Indeed.

> But I don't recall ever seeing a long less than 32 bits!

I'm not sure offhand what C89 said, but C99 requires "short" to be
at least 16 bits, "long" to be at least 32 bits, and "long long"
to be at least 64; see the minimum allowed values for SHRT_MAX etc.

C99 does permit "int" to be only 16 bits, but Postgres doesn't
pretend to work on such an architecture, and nobody's made one
since the (early?) 90s.

            regards, tom lane


Re: Using POPCNT and other advanced bit manipulation instructions

От
Alvaro Herrera
Дата:
On 2019-Feb-14, Tom Lane wrote:

> Some further thoughts here ...
> 
> Does the "lzcnt" runtime probe actually do anything useful?
> On the x86_64 compilers I tried (gcc 8.2.1 and 4.4.7), __builtin_clz
> and __builtin_ctz compile to sequences involving bsrq and bsfq
> regardless of -mpopcnt.  It's fairly hard to see how lzcnt would
> buy anything over those sequences even if there were zero overhead
> involved in using it.

Hah, I just realized you have to add -mlzcnt in order for these builtins
to use the lzcnt instructions.  It goes from something like

    bsrq    %rax, %rax
    xorq    $63, %rax

to
    lzcntq    %rax, %rax

Significant?

I have this patch now, written before I realized the above; I'll augment
it to cater to this (adding -mlzcnt and a new set of functions --
perhaps a new file "lzcnt.c" or maybe put the lot in pg_popcount.c and
rename it?) and resubmit after an errand I have to run now.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: Using POPCNT and other advanced bit manipulation instructions

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Hah, I just realized you have to add -mlzcnt in order for these builtins
> to use the lzcnt instructions.  It goes from something like

>     bsrq    %rax, %rax
>     xorq    $63, %rax

> to
>     lzcntq    %rax, %rax

> Significant?

I'd bet a fair amount of money that we'd be better off *not* using
lzcnt, even if available, because then we could just expose things
along this line:

static inline int
pg_clz(...)
{
#ifdef HAVE__BUILTIN_CLZ
    return __builtin_clz(x);
#else
    handwritten implementation;
#endif
}

Avoiding a function call (that has to indirect through a pointer) probably
saves much more than the difference between lzcnt and the other way.

The tradeoff might be different for popcount, though, especially since
it looks like __builtin_popcount() is not nearly as widely available
as the clz/ctz builtins.

            regards, tom lane


Re: Using POPCNT and other advanced bit manipulation instructions

От
Andres Freund
Дата:
Hi,

On 2019-02-14 15:47:13 -0300, Alvaro Herrera wrote:
> On 2019-Feb-14, Tom Lane wrote:
> 
> > Some further thoughts here ...
> > 
> > Does the "lzcnt" runtime probe actually do anything useful?
> > On the x86_64 compilers I tried (gcc 8.2.1 and 4.4.7), __builtin_clz
> > and __builtin_ctz compile to sequences involving bsrq and bsfq
> > regardless of -mpopcnt.  It's fairly hard to see how lzcnt would
> > buy anything over those sequences even if there were zero overhead
> > involved in using it.
> 
> Hah, I just realized you have to add -mlzcnt in order for these builtins
> to use the lzcnt instructions.  It goes from something like
> 
>     bsrq    %rax, %rax
>     xorq    $63, %rax

I'm confused how this is a general count leading zero operation? Did you
use constants or something that allowed ot infer a range in the test? If
so the compiler probably did some optimizations allowing it to do the
above.


> to
>     lzcntq    %rax, %rax
> 
> Significant?

If I understand Agner's tables correctly, then no, this isn't faster
than the two instructions above.


Greetings,

Andres Freund


Re: Using POPCNT and other advanced bit manipulation instructions

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> On 2019-02-14 15:47:13 -0300, Alvaro Herrera wrote:
>> Hah, I just realized you have to add -mlzcnt in order for these builtins
>> to use the lzcnt instructions.  It goes from something like
>> 
>> bsrq    %rax, %rax
>> xorq    $63, %rax

> I'm confused how this is a general count leading zero operation? Did you
> use constants or something that allowed ot infer a range in the test? If
> so the compiler probably did some optimizations allowing it to do the
> above.

No.  If you compile

int myclz(unsigned long long x)
{
  return __builtin_clzll(x);
}

at -O2, on just about any x86_64 gcc, you will get

myclz:
.LFB1:
        .cfi_startproc
        bsrq    %rdi, %rax
        xorq    $63, %rax
        ret
        .cfi_endproc

            regards, tom lane


Re: Using POPCNT and other advanced bit manipulation instructions

От
Alvaro Herrera
Дата:
On 2019-Feb-14, Tom Lane wrote:

> I'd bet a fair amount of money that we'd be better off *not* using
> lzcnt, even if available, because then we could just expose things
> along this line:
> 
> static inline int
> pg_clz(...)
> {
> #ifdef HAVE__BUILTIN_CLZ
>     return __builtin_clz(x);
> #else
>     handwritten implementation;
> #endif
> }
> 
> Avoiding a function call (that has to indirect through a pointer) probably
> saves much more than the difference between lzcnt and the other way.

I see ... makes sense.

That leads me to the attached patch.  It creates a new file
pg_popcount.c which is the only one compiled with -mpopcnt (if
available); if there's no compiler switch to enable POPCNT, we just
don't compile the file.  I'm not sure that's kosher -- in particular I'm
not sure if it can fail when POPCNT is enabled by other flags and
-mpopcnt is not needed at all.  I think our c-compiler.m4 stuff is a bit
too simplistic there: it just assumes that -mpopcnt is always required.
But what if the user passes it in CFLAGS?

I left CPUID alone for the CLZ/CTZ builtins.  So we either use the
table, or the builtins.  We never try the instructions.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: Using POPCNT and other advanced bit manipulation instructions

От
Alvaro Herrera
Дата:
On 2019-Feb-14, Tom Lane wrote:

> static inline int
> pg_clz(...)

Hmm, I missed this bit.  So we put all these functions in the header, as
in the attached.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: Using POPCNT and other advanced bit manipulation instructions

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> That leads me to the attached patch.  It creates a new file
> pg_popcount.c which is the only one compiled with -mpopcnt (if
> available); if there's no compiler switch to enable POPCNT, we just
> don't compile the file.  I'm not sure that's kosher -- in particular I'm
> not sure if it can fail when POPCNT is enabled by other flags and
> -mpopcnt is not needed at all.  I think our c-compiler.m4 stuff is a bit
> too simplistic there: it just assumes that -mpopcnt is always required.

Yes, the configure test for this stuff is really pretty broken.
It's conflating two nearly independent questions: (1) does the compiler
have __builtin_popcount(), and (2) does the compiler accept -mpopcnt.
It is certainly the case that (1) may hold without (2); in fact, every
recent non-x86_64 gcc is a counterexample to how that's done in HEAD.

I think we need a clean test for __builtin_popcount(), and to be willing
to use it if available, independently of -mpopcnt.  Then separately we
should test to see if -mpopcnt works, probably with the same
infrastructure we use for checking for other compiler flags, viz

   # Optimization flags for specific files that benefit from vectorization
   PGAC_PROG_CC_VAR_OPT(CFLAGS_VECTOR, [-funroll-loops])
   PGAC_PROG_CC_VAR_OPT(CFLAGS_VECTOR, [-ftree-vectorize])
+  # Optimization flags for bit-twiddling
+  PGAC_PROG_CC_VAR_OPT(CFLAGS_POPCNT, [-mpopcnt])
   # We want to suppress clang's unhelpful unused-command-line-argument warnings

Then the correct test to see if we want to build pg_popcount.c (BTW,
please pick a less generic name for that) and the choose function
is whether we have *both* HAVE__BUILTIN_POPCOUNT and nonempty
CFLAGS_POPCNT.

I don't think this'd be fooled by user-specified CFLAGS.  The worst
possible outcome is that it builds a function that we intended would
use POPCNT but it's falling back to some other implementation, in
case the compiler has a switch named -mpopcnt but it doesn't do what
we think it does, or the user overrode things by adding -mno-popcnt.
That would really be nearly cost-free, other than the overhead of
the choose function the first time through: both of the execution
functions would be reducing to __builtin_popcount(), for whatever
version of that the compiler is giving us, so the choice wouldn't
matter.

            regards, tom lane


Re: Using POPCNT and other advanced bit manipulation instructions

От
Alvaro Herrera
Дата:
On 2019-Feb-14, Tom Lane wrote:

> I think we need a clean test for __builtin_popcount(), and to be willing
> to use it if available, independently of -mpopcnt.  Then separately we
> should test to see if -mpopcnt works, probably with the same
> infrastructure we use for checking for other compiler flags, viz
> 
>    # Optimization flags for specific files that benefit from vectorization
>    PGAC_PROG_CC_VAR_OPT(CFLAGS_VECTOR, [-funroll-loops])
>    PGAC_PROG_CC_VAR_OPT(CFLAGS_VECTOR, [-ftree-vectorize])
> +  # Optimization flags for bit-twiddling
> +  PGAC_PROG_CC_VAR_OPT(CFLAGS_POPCNT, [-mpopcnt])
>    # We want to suppress clang's unhelpful unused-command-line-argument warnings
> 
> Then the correct test to see if we want to build pg_popcount.c (BTW,
> please pick a less generic name for that) and the choose function
> is whether we have *both* HAVE__BUILTIN_POPCOUNT and nonempty
> CFLAGS_POPCNT.

Yeah, this works.  I'll post the patch tomorrow.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Using POPCNT and other advanced bit manipulation instructions

От
Kyotaro HORIGUCHI
Дата:
At Thu, 14 Feb 2019 16:45:38 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <822.1550180738@sss.pgh.pa.us>
> Andres Freund <andres@anarazel.de> writes:
> > On 2019-02-14 15:47:13 -0300, Alvaro Herrera wrote:
> >> Hah, I just realized you have to add -mlzcnt in order for these builtins
> >> to use the lzcnt instructions.  It goes from something like
> >> 
> >> bsrq    %rax, %rax
> >> xorq    $63, %rax
> 
> > I'm confused how this is a general count leading zero operation? Did you
> > use constants or something that allowed ot infer a range in the test? If
> > so the compiler probably did some optimizations allowing it to do the
> > above.
> 
> No.  If you compile
> 
> int myclz(unsigned long long x)
> {
>   return __builtin_clzll(x);
> }
> 
> at -O2, on just about any x86_64 gcc, you will get
> 
> myclz:
> .LFB1:
>         .cfi_startproc
>         bsrq    %rdi, %rax
>         xorq    $63, %rax
>         ret
>         .cfi_endproc
> 

I understand that the behavior of __builtin_c[tl]z(0) is
undefined from the reason, they convert to bs[rf]. So if we use
these builtins, additional check is required.

https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

> Built-in Function: int __builtin_clz (unsigned int x)
>   Returns the number of leading 0-bits in x, starting at the most
>   significant bit position. If x is 0, the result is undefined.
> 
> Built-in Function: int __builtin_ctz (unsigned int x)
>   Returns the number of trailing 0-bits in x, starting at the
>   least significant bit position. If x is 0, the result is
>   undefined.


And even worse lzcntx is accidentially "fallback"s to bsrx on
unsupported CPUs, which leads to bogus results.
__builtin_clzll(8) = 3, which should be 60.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Using POPCNT and other advanced bit manipulation instructions

От
Alvaro Herrera
Дата:
On 2019-Feb-15, Kyotaro HORIGUCHI wrote:

> I understand that the behavior of __builtin_c[tl]z(0) is
> undefined from the reason, they convert to bs[rf]. So if we use
> these builtins, additional check is required.
> 
> https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

Okay -- the functions check for a 0 argument:

+static inline int
+pg_rightmost_one32(uint32 word)
+{
+   int         result = 0;
+
+   Assert(word != 0);
+
+#ifdef HAVE__BUILTIN_CTZ
+   result = __builtin_ctz(word);
+#else
+   while ((word & 255) == 0)
+   {
+       word >>= 8;
+       result += 8;
+   }
+   result += rightmost_one_pos[word & 255];
+#endif                         /* HAVE__BUILTIN_CTZ */
+
+   return result;
+}

so we're fine.

> And even worse lzcntx is accidentially "fallback"s to bsrx on
> unsupported CPUs, which leads to bogus results.
> __builtin_clzll(8) = 3, which should be 60.

I'm not sure I understand this point.  Are you saying that if we use
-mlzcnt to compile, then the compiler builtin is broken in platforms
that don't support the lzcnt instruction?  That's horrible.  Let's stay
away from -mlzcnt then.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Using POPCNT and other advanced bit manipulation instructions

От
Alvaro Herrera
Дата:
On 2019-Feb-15, Alvaro Herrera wrote:

> On 2019-Feb-15, Kyotaro HORIGUCHI wrote:

> > And even worse lzcntx is accidentially "fallback"s to bsrx on
> > unsupported CPUs, which leads to bogus results.
> > __builtin_clzll(8) = 3, which should be 60.
> 
> I'm not sure I understand this point.  Are you saying that if we use
> -mlzcnt to compile, then the compiler builtin is broken in platforms
> that don't support the lzcnt instruction?  That's horrible.  Let's stay
> away from -mlzcnt then.

Ah, I understand it now:
https://stackoverflow.com/questions/25683690/confusion-about-bsr-and-lzcnt/43443701#43443701
if you call LZCNT/TZCNT on a CPU that doesn't support it, it won't raise
SIGILL or anything ... it'll just silently compute the wrong result.
That's certainly not what I call a fallback!

I think David's code was correct because it was testing CPUID for
instruction support before using the specialized code (well, except that
he forgot to add the right compiler option to *enable* the LZCNT/TZCNT
instructions in the first place); but given subsequent discussion that
the instruction is not worth it anyway, we might as well ignore it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Using POPCNT and other advanced bit manipulation instructions

От
Alvaro Herrera
Дата:
On 2019-Feb-14, Tom Lane wrote:

> Then the correct test to see if we want to build pg_popcount.c (BTW,
> please pick a less generic name for that) and the choose function
> is whether we have *both* HAVE__BUILTIN_POPCOUNT and nonempty
> CFLAGS_POPCNT.

I used pg_bitutils_sse42.c to host the specially-compiled functions.
The name is not entirely correct, but seems clear enough.

I noticed in Compiler Explorer that some (ancient?) Power cpus
implement instruction "popcntb", and GCC support for those uses
-mpopcntb switch enabling __builtin_popcount() to use it.  I added the
switch to configure.in but I'm not sure how well that will work ...  I
don't know if this is represented in buildfarm.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: Using POPCNT and other advanced bit manipulation instructions

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Ah, I understand it now:
> https://stackoverflow.com/questions/25683690/confusion-about-bsr-and-lzcnt/43443701#43443701
> if you call LZCNT/TZCNT on a CPU that doesn't support it, it won't raise
> SIGILL or anything ... it'll just silently compute the wrong result.
> That's certainly not what I call a fallback!

Yeah, that's pretty nasty; it means there's no backstop for whether
your choose function gets it right :-(

Is POPCNT any better in this respect?

            regards, tom lane


Re: Using POPCNT and other advanced bit manipulation instructions

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I noticed in Compiler Explorer that some (ancient?) Power cpus
> implement instruction "popcntb", and GCC support for those uses
> -mpopcntb switch enabling __builtin_popcount() to use it.  I added the
> switch to configure.in but I'm not sure how well that will work ...  I
> don't know if this is represented in buildfarm.

I experimented a bit with this on an old Apple laptop.  Apple's
compiler rejects -mpopcntb altogether.  FreeBSD's compiler
(gcc 4.2.1) recognizes the switch, but I could not get it to
emit the instruction, even when specifying -mcpu=power5,
which ought to enable it according to the gcc docs:

     ... The `-mpopcntb' option allows GCC to generate the
     popcount and double precision FP reciprocal estimate instruction
     implemented on the POWER5 processor and other processors that
     support the PowerPC V2.02 architecture.

A more recent gcc info file also mentions

     The `-mpopcntd' option
     allows GCC to generate the popcount instruction implemented on the
     POWER7 processor and other processors that support the PowerPC
     V2.06 architecture.

but the gcc version I have on this laptop doesn't know that switch.
In any case, I'm pretty sure Apple never shipped a CPU that could
run either instruction.

I suspect that probing for either option may not be worth the
configure cycles it'd consume :-( ... there are just way too
few of those specific POWER variants out there anymore, even
granting that you have a compiler that will play along.

Moreover, you can't turn on -mpopcntb without having some POWER
equivalent to the CPUID test.

However, if you want to leave the option for this open in
future, it really makes the file name pg_bitutils_sse42.c
quite inappropriate.  How about pg_bitutils_hwpopcnt.c
or something like that?

            regards, tom lane


Re: Using POPCNT and other advanced bit manipulation instructions

От
Alvaro Herrera
Дата:
On 2019-Feb-15, Tom Lane wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Ah, I understand it now:
> > https://stackoverflow.com/questions/25683690/confusion-about-bsr-and-lzcnt/43443701#43443701
> > if you call LZCNT/TZCNT on a CPU that doesn't support it, it won't raise
> > SIGILL or anything ... it'll just silently compute the wrong result.
> > That's certainly not what I call a fallback!
> 
> Yeah, that's pretty nasty; it means there's no backstop for whether
> your choose function gets it right :-(

Hopefully other tests will fail in some visible way, though.  My fear is
whether we have such systems in buildfarm.

> Is POPCNT any better in this respect?

I couldn't find how is POPCNT encoded.  https://stackoverflow.com/a/28803917/242383

I did find these articles:
http://danluu.com/assembly-intrinsics/

https://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-counter-with-64-bit-introduces-crazy-performance-deviati

This suggests that this all a largely pointless exercise at least on
Intel and GCC/Clang.  It may be better on AMD ... but to get really
better performance we'd need to be coding the popcnt calls in assembly
rather than using the compiler intrinsics, even with -mpopcnt, because
the intrinsics suck.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Using POPCNT and other advanced bit manipulation instructions

От
Alvaro Herrera
Дата:
Here's a final version that I intend to push shortly, to have time
before EOB today to handle any fallout.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: Using POPCNT and other advanced bit manipulation instructions

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Here's a final version that I intend to push shortly, to have time
> before EOB today to handle any fallout.

I think this is likely to result in a lot of complaints about
rightmost_one_pos[] being unreferenced, in non-HAVE__BUILTIN_CTZ
builds.  Probably that has to be an extern rather than static
in the header.  leftmost_one_pos[] likewise.

I might have a go at improving the configure tests later ---
I still don't like that they're compile-time-optimizable.
But that can wait.

            regards, tom lane


Re: Using POPCNT and other advanced bit manipulation instructions

От
Andres Freund
Дата:
Hi,

On 2019-02-14 16:45:38 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2019-02-14 15:47:13 -0300, Alvaro Herrera wrote:
> >> Hah, I just realized you have to add -mlzcnt in order for these builtins
> >> to use the lzcnt instructions.  It goes from something like
> >> 
> >> bsrq    %rax, %rax
> >> xorq    $63, %rax
> 
> > I'm confused how this is a general count leading zero operation? Did you
> > use constants or something that allowed ot infer a range in the test? If
> > so the compiler probably did some optimizations allowing it to do the
> > above.
> 
> No.  If you compile
> 
> int myclz(unsigned long long x)
> {
>   return __builtin_clzll(x);
> }
> 
> at -O2, on just about any x86_64 gcc, you will get
> 
> myclz:
> .LFB1:
>         .cfi_startproc
>         bsrq    %rdi, %rax
>         xorq    $63, %rax
>         ret
>         .cfi_endproc

Yea, sorry for the noise. I misremembered the bsrq mnemonic.

bsr has a latency of three cycles, xor of one. lzcnt a latency of
three. So it's mildly faster to use lzcnt (it uses fewer ports, and has
a shorter latency). But I doubt we have code where that's noticable.

Greetings,

Andres Freund