Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vpermb belongs to AVX512BW? #5

Open
fengyuleidian0615 opened this issue Aug 2, 2017 · 7 comments
Open

vpermb belongs to AVX512BW? #5

fengyuleidian0615 opened this issue Aug 2, 2017 · 7 comments

Comments

@fengyuleidian0615
Copy link

Hi

I'm running avx512bw test on my SKL which has avx512bw supported,
while I got illegal instruction traps, and after some investigation, it seems
vpermb/vpermi2b belongs to avx512vbmi instead, the CPU supported for
avx512vbmi seems not officially released yet.

So does the code need a littler tweak to use avx512bw instruction for test?

]# gdb /tmp/check_avx512bw ./core.103927
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /tmp/check_avx512bw...done.
[New LWP 103927]
Core was generated by `/tmp/check_avx512bw'.
Program terminated with signal 4, Illegal instruction.
#0 0x00000000004082a5 in _mm512_permutex2var_epi8 (__B=..., __I=..., __A=...) at /usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vbmiintrin.h:107
107 /usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vbmiintrin.h: No such file or directory.

[1] https://software.intel.com/en-us/node/534480
1

[2]
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
2

@WojciechMula
Copy link
Owner

Hi, thank you for the report. I obviously made a mistake in naming things, it's not an AVX512BW code. For now, the only thing you can do is simply comment out that procedure.

On the other hand, it would be interesting to see how these 16-bit shuffles from ABV512BW can help in base64 algorithms.

@pcordes
Copy link

pcordes commented Dec 14, 2017

@WojciechMula: Your http://0x80.pl/notesen/2016-04-03-avx512-base64.html write-up still says AVX512BW, not AVX512VBMI.

(Nice write up, BTW. I had the same idea for vpermb / vpmultishiftqb / vpermb when discussing Base64 encoding in asm on a recent Stack Overflow question. I googled for vpmultishiftqb base64 and found your writeup which made it easy to follow your implementation and see that someone had already written up the code for this implementation.)

VPMULTISHIFTQB also requires AVX512VBMI. The xmm/ymm versions also require AVX512VL (as usual), while the ZMM version only requires AVX512VBMI. Your writeup says it only requires AVX512VL.

I'm really curious how vpermb and vpermi2b will perform on Cannonlake (which will introduce AVX512VBMI). I expect it will be at least as slow as vpermw or vpermi/t2w are on Skylake-AVX512, where they decode to 2 or 3 shuffle uops respectively. But if they're only 2 or 3 uops, that's still fantastic. (I wouldn't be surprised if even vpermb is 3 uops in the first-gen CPU to have it, though, before AVX512-accelerated software is widespread, but probably not so slow that it's not worth using for a lot of cases. Building very wide many-lane MUXers is expensive)

But if it's only 2 uops, then assume encode bottlenecks on shuffle throughput, we can probably produce 64 bytes of results per 4 clocks. Or per 6 clocks if it's 3 uops. That's pretty fantastic, and is approaching L2 bandwidth. I wonder if Cannonlake (or some future generation) will speed up word-element lane-crossing shuffles vs. Skylake-X.


I'm not sure how slow vpermi2b would have to be before we'd want to avoid it for decode, though. A 7-bit table is very nice.

You might be able to use merge-masking into an existing mask for something, though. e.g. _mm512_movepi8_mask(input), and then some other mask-generating instruction can write that with merge-masking? Or hopefully a compiler could use kortest with two separate operands... 2x VPMOVB2M, one of them with merge-masking, isn't obviously better than VPORD + VPMOVB2M, though, so I don't think there's anything to gain over the current vpermi2b version if you're going to keep using vpermi2b for decode.

@WojciechMula
Copy link
Owner

@pcordes Hi, thank you for such a great comment. Right, I didn't update the www.

It's difficult to speculate about performance, especially when you remember what happened to AVX2 - due to overheating, CPU decreases the clock. You still get the result after X cycles, but the wall clock would say it's was slower. If Intel keep using high frequency rates, heating problem remain.

I would love to check the implementation against any real hardware, but it's quite difficult. :)

@WojciechMula
Copy link
Owner

@pcordes you perhaps know the numbers, but it's worth to cite anyway https://twitter.com/InstLatX64/status/1054655575680827392:

The real #CannonLake implementation is 3|1 for VPERMB; 5|2 for VPERMI2B and VPERMT2B1

So, it's really, really fast. There's no info on uops count.

@pcordes
Copy link

pcordes commented Jan 20, 2019

3 cycle latency and 1c throughput implies that it's a single uop. If there were any more uops it would be at least 4 cycle latency. Yes, I had seen that and it's surprisingly great, better than I thought we could hope for. But it's probably something that's worth throwing transistors at, because efficient shuffling makes it possible to do so much stuff that's otherwise not efficiently possible.

5|2 might be 3 uops, 2 of them for the shuffle port, with no ILP between them.

@TheIronBorn
Copy link

Note that's it's not only naming that's incorrect.

encode.avx512vl.cpp uses AVX512VBMI (vpermb/vpmultishiftqb) yet not all CPUs with AVX512VL have AVX512VBMI (chart). It actually uses no AVX512VL instructions at all (according to https://software.intel.com/sites/landingpage/IntrinsicsGuide).

Also encode.avx512vbmi.cpp doesn't use vpmultishiftqb to rearrange 6-bit indices, an AVX512VBMI instruction.

@WojciechMula
Copy link
Owner

Note that's it's not only naming that's incorrect.

encode.avx512vl.cpp uses AVX512VBMI (vpermb/vpmultishiftqb) yet not all CPUs with AVX512VL have AVX512VBMI (chart). It actually uses no AVX512VL instructions at all (according to https://software.intel.com/sites/landingpage/IntrinsicsGuide).

Also encode.avx512vbmi.cpp doesn't use vpmultishiftqb to rearrange 6-bit indices, an AVX512VBMI instruction.

Thank you, will fix it. I AM confused with all these AVX512 extensions. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants