vpermb belongs to AVX512BW? #5
Comments
Hi, thank you for the report. I obviously made a mistake in naming things: it's not AVX512BW code. For now, the only thing you can do is comment out that procedure. On the other hand, it would be interesting to see how the 16-bit shuffles from AVX512BW could help in base64 algorithms. |
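As an alternative to commenting the procedure out by hand, the byte-permute code could be guarded at compile time. This is only a sketch, not code from the repository: `permute_bytes64` is a hypothetical helper name, and it assumes GCC or Clang, which define `__AVX512VBMI__` when the build enables that extension (for example with `-mavx512vbmi`). Without the flag, a scalar loop with the same semantics as `vpermb` (only the low 6 bits of each index are used) is compiled instead.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX512VBMI__)
#include <immintrin.h>
#endif

/* Hypothetical helper (not from the repository):
 * dst[k] = src[idx[k] & 63] for k in 0..63, i.e. what vpermb computes. */
static void permute_bytes64(const uint8_t idx[64],
                            const uint8_t src[64],
                            uint8_t dst[64])
{
#if defined(__AVX512VBMI__)
    /* Only compiled when the build targets AVX512VBMI. */
    const __m512i i = _mm512_loadu_si512((const void *)idx);
    const __m512i s = _mm512_loadu_si512((const void *)src);
    _mm512_storeu_si512((void *)dst, _mm512_permutexvar_epi8(i, s));
#else
    /* Portable fallback: same result, one byte at a time. */
    for (size_t k = 0; k < 64; k++)
        dst[k] = src[idx[k] & 63];
#endif
}
```

A compile-time guard only reflects the build flags. A binary built with `-mavx512vbmi` will still trap on a CPU that lacks the extension, so a runtime check is needed if the same binary must run on both kinds of machine.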
@WojciechMula: Your http://0x80.pl/notesen/2016-04-03-avx512-base64.html write-up still says AVX512BW, not AVX512VBMI. (Nice write up, BTW. I had the same idea for
I'm really curious how But if it's only 2 uops, then assuming encode bottlenecks on shuffle throughput, we can probably produce 64 bytes of results per 4 clocks. Or per 6 clocks if it's 3 uops. That's pretty fantastic, and is approaching L2 bandwidth. I wonder if Cannonlake (or some future generation) will speed up word-element lane-crossing shuffles vs. Skylake-X. I'm not sure how slow You might be able to use merge-masking into an existing mask for something, though. e.g. |
@pcordes Hi, thank you for such a great comment. Right, I didn't update the website. It's difficult to speculate about performance, especially when you remember what happened with AVX2: due to overheating, the CPU decreases its clock. You still get the result after X cycles, but the wall clock would say it was slower. If Intel keeps using high clock frequencies, the heating problem will remain. I would love to check the implementation against real hardware, but that's quite difficult. :) |
@pcordes you perhaps know the numbers already, but it's worth citing anyway https://twitter.com/InstLatX64/status/1054655575680827392:
So, it's really, really fast. There's no info on uops count. |
3 cycle latency and 1c throughput implies that it's a single uop. If there were any more uops it would be at least 4 cycle latency. Yes, I had seen that and it's surprisingly great, better than I thought we could hope for. But it's probably something that's worth throwing transistors at, because efficient shuffling makes it possible to do so much stuff that's otherwise not efficiently possible.
|
Note that it's not only the naming that's incorrect.
Also |
Thank you, will fix it. I AM confused with all these AVX512 extensions. :) |
Hi,
I'm running the avx512bw test on my SKL, which supports AVX512BW,
but I get illegal instruction traps. After some investigation, it seems
vpermb/vpermi2b belong to AVX512VBMI instead, and CPUs supporting
AVX512VBMI do not seem to be officially released yet.
So does the code need a little tweak to use AVX512BW instructions for the test?
]# gdb /tmp/check_avx512bw ./core.103927
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /tmp/check_avx512bw...done.
[New LWP 103927]
Core was generated by `/tmp/check_avx512bw'.
Program terminated with signal 4, Illegal instruction.
#0 0x00000000004082a5 in _mm512_permutex2var_epi8 (__B=..., __I=..., __A=...) at /usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vbmiintrin.h:107
107 /usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vbmiintrin.h: No such file or directory.
[1] https://software.intel.com/en-us/node/534480
[2] https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
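The trap above is what happens when code compiled for AVX512VBMI runs on a CPU that reports only AVX512F/AVX512BW. A runtime check could avoid it. The following is a minimal sketch, assuming GCC or Clang on x86 (for `<cpuid.h>` and `__get_cpuid_count`); the names `avx512_features`, `decode_leaf7` and `query_cpu` are hypothetical and not part of the project. It reads CPUID leaf 7, sub-leaf 0, where AVX512F is EBX bit 16, AVX512BW is EBX bit 30 and AVX512VBMI is ECX bit 1. It does not check that the OS saves ZMM state (the XGETBV check), which a complete detection routine would also need.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical result type: one flag per extension of interest. */
struct avx512_features {
    bool f;     /* AVX512F    */
    bool bw;    /* AVX512BW   */
    bool vbmi;  /* AVX512VBMI */
};

/* Decode the EBX/ECX values returned by CPUID.(EAX=7, ECX=0). */
static struct avx512_features decode_leaf7(uint32_t ebx, uint32_t ecx)
{
    struct avx512_features r;
    r.f    = ((ebx >> 16) & 1u) != 0;  /* EBX bit 16 */
    r.bw   = ((ebx >> 30) & 1u) != 0;  /* EBX bit 30 */
    r.vbmi = ((ecx >> 1) & 1u) != 0;   /* ECX bit 1  */
    return r;
}

/* Query the running CPU; reports no features on non-x86 builds
 * or when leaf 7 is unavailable. OS support is NOT checked here. */
static struct avx512_features query_cpu(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
#if defined(__x86_64__) || defined(__i386__)
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        ebx = 0;
        ecx = 0;
    }
#endif
    (void)eax;
    (void)edx;
    return decode_leaf7(ebx, ecx);
}
```

On a Skylake-X class machine like the one in this report, `query_cpu()` would be expected to return `bw` set and `vbmi` clear, so a test harness could skip the `vpermb`/`vpermi2b` paths instead of crashing.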