Implement a few missing things using SSE or NEON intrinsics. #1338
Comments
Here is how the vector version of the MulInt operation above can be implemented in Highway for an int32_t vector:
|
Wow, thanks a lot! I am simply not fluent enough in Highway to write such a thing myself. The documentation is good, but I was still missing a few key pieces. |
I see the following compilation issue:
Do you know what might be the problem? |
Hi @ibogosavljevic, it's always interesting to hear how the documentation works for new users and what can be improved. I'm curious what the missing pieces were? @johnplatts nice implementation! For the compile issue, it's likely that you are using an older release of Highway: we added that intrinsic 2 weeks after the last release. You could use Compiler Explorer to test this by copying your code into this environment. We'll do another release soon, perhaps in a week or two. Until then, you could consider using the latest Git version? |
Here is a link to a Compiler Explorer snippet that includes the above MulInt implementation, and the snippet does compile successfully for the HWY_SSE4/HWY_NEON_WITHOUT_AES/HWY_RVV/HWY_SVE/HWY_PPC8/HWY_SCALAR targets: |
Hi John!
Yes, it works for me as well with the newest version of Highway. Thank you a lot.
I have a question: I am developing a 1-day vectorization workshop, which covers Neon, AVX and possibly Highway. In the upcoming weeks, I am going to experiment a lot with all three frameworks, and also build repositories of code.
I can share my insights about Highway with you in order to improve it. This means filing issues when the documentation is not clear, or reporting bugs if native implementations are significantly faster than Highway. Let me know if this is what you want.
Ivica
|
Glad to hear it works.
This sounds wonderful. Please don't hesitate to raise issues. It would be surprising if a native implementation were much faster, but if that happens we are open to adding new ops to bridge any gaps. |
I want to add more information about performance. I compared the performance of Highway against the compiler's automatically vectorized code:
VS
The compiler's version is faster than Highway: 0.56 s vs 0.85 s. Everything was compiled with -mavx2 -O3 on the newest clang. The compiler version's assembly: |
Thanks for sharing! Sometimes the compiler has interesting tricks, so let's have a look. First, I notice that the codegen is very similar: https://gcc.godbolt.org/z/8cvrcqbbW Was -maes indeed the difference? Seems that should bring us up to parity. And we can probably do better yet: Here's a quick sketch of Highway code without the cast:
If you didn't care about order that's likely going to be quicker (just store evens then odds), but if you do: we could |
A more optimal implementation of the set-before-first operation for masks is possible on SSE4/AVX2/AVX3/RVV/PPC10. Here is how the SetBeforeFirst operation could be implemented for masks for 128-bit or smaller vectors on SSE4/AVX2:
Here is how the SetBeforeFirst operation could be implemented for masks on AVX3:
Here is how the SetBeforeFirst operation could be implemented for masks on PPC10:
Here is how the SetBeforeFirst operation could be implemented for vectors on RVV (which is simply a wrapper for the __riscv_vmsbf_m_b1, __riscv_vmsbf_m_b2, __riscv_vmsbf_m_b4, __riscv_vmsbf_m_b8, __riscv_vmsbf_m_b16, __riscv_vmsbf_m_b32, and __riscv_vmsbf_m_b64 intrinsics):
|
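For readers following along without the snippets, here is a minimal scalar sketch of what a set-before-first operation computes, expressed on a plain bitmask rather than on a Highway mask type. The function name SetBeforeFirstBits and the bit-per-lane representation are assumptions made for illustration; they are not Highway APIs. The empty-mask behavior (all lanes set) follows the RVV vmsbf.m semantics and is assumed here for the other targets.

```cpp
#include <cstdint>

// Hypothetical scalar model, not a Highway API.
// Bit i of mask_bits represents lane i of a mask with num_lanes lanes (1..64).
// Returns a bitmask in which every lane before the first set lane is set.
// If no lane is set, all num_lanes lanes are set (matches RVV vmsbf.m).
inline uint64_t SetBeforeFirstBits(uint64_t mask_bits, unsigned num_lanes) {
  const uint64_t lane_mask =
      num_lanes >= 64 ? ~uint64_t{0} : ((uint64_t{1} << num_lanes) - 1);
  mask_bits &= lane_mask;
  // Isolate the lowest set bit (two's complement trick); 0 if none is set.
  const uint64_t lowest = mask_bits & (~mask_bits + 1);
  // lowest - 1 sets all bits below the lowest set bit; 0 - 1 wraps to all ones.
  return (lowest - 1) & lane_mask;
}
```

For example, with 8 lanes and only lane 3 set, the result has lanes 0, 1, and 2 set. The native versions discussed above would produce the same pattern directly in a mask register.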
@johnplatts sounds like you are proposing a new SetBeforeFirst op, unrelated to this particular MulInt code? |
There are some string operations that can be implemented more efficiently on SSE4/AVX2/AVX3 (using the _mm_cmpistri, _mm_cmpistrm, _mm_cmpestri, and _mm_cmpestrm intrinsics), RVV (using the __riscv_vmsbf_m_b intrinsics), SVE (using svbrkb_b_z), and PPC10 (using the vec_stril intrinsic). Here is how the ZeroPastNullTerminator operation (which zeroes out all lanes past the null terminator) can be implemented for U8 vectors using native SIMD intrinsics on SSE4.2/AVX3/RVV/SVE/PPC10:
|
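As a point of reference for the semantics described above, here is a minimal scalar sketch of ZeroPastNullTerminator on a fixed-size array of bytes. The name ZeroPastNullTerminatorScalar and the std::array representation are assumptions for illustration only; the native versions would operate on U8 vectors instead.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference, not a Highway API.
// Copies lanes up to (not including) the first zero byte; that lane and every
// lane after it are zero in the result.
template <size_t N>
std::array<uint8_t, N> ZeroPastNullTerminatorScalar(
    const std::array<uint8_t, N>& lanes) {
  std::array<uint8_t, N> out{};  // value-initialized: all lanes start at zero
  for (size_t i = 0; i < N && lanes[i] != 0; ++i) {
    out[i] = lanes[i];
  }
  return out;
}
```

A reference like this can serve as the expected result when testing a vectorized version lane by lane.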
The SSE4.2 PCMPISTRI, PCMPISTRM, PCMPESTRI, and PCMPESTRM instructions can each perform the following operations in a single instruction:
|
I like your ZeroPastNullTerminator idea. It's not clear how this could look in AVX2, though? 32 lanes is too much for one lookup table. And NEON would also struggle, its BitsFromMask is quite expensive. That's a wider concern about the SSE4 string instructions - they are not very performance-portable. Even Intel hasn't carried them forward to the >128 bit instruction sets. But your emulations look reasonable. Seems it would be better to use those than not vectorize. Would you like to create a pull request with those operations? Perhaps we can put them in hwy/contrib/algo/string-inl.h or similar? |
I have the following code:
I tried to implement it through Highway alone, but it doesn't work. Highway currently doesn't support multiplication of int32_t to produce int64_t efficiently. Many architectures will multiply two int32_t to produce int64_t without explicit casting, and I want to take advantage of that.
I already have implementations for SSE and NEON, but I want to make them available to Highway. I want to have a Highway function
V MulInt(V v1, V v2) for Vec128<int32_t> on SSE and NEON. Is there any official guide on how to supplement Highway with custom-built functions?
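To make the requested operation concrete, here is a minimal scalar sketch of a widening 32x32 to 64-bit multiply, together with the even/odd lane split that SIMD instruction sets commonly use for it (for example, _mm_mul_epi32 on SSE4.1 multiplies only the even 32-bit lanes). The names WidenMul, WidenMulEven, and WidenMulOdd are hypothetical and are not the MulInt code from this issue; Highway's MulEven op is the closest existing counterpart as far as I know.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference, not the MulInt code from this issue.
// Widening one operand before the multiply yields the full 64-bit product
// without int32_t overflow.
inline int64_t WidenMul(int32_t a, int32_t b) {
  return static_cast<int64_t>(a) * b;
}

// Multiplies lanes 0, 2, 4, ... and returns one 64-bit product per lane pair.
inline std::vector<int64_t> WidenMulEven(const std::vector<int32_t>& a,
                                         const std::vector<int32_t>& b) {
  std::vector<int64_t> out;
  for (size_t i = 0; i < a.size() && i < b.size(); i += 2) {
    out.push_back(WidenMul(a[i], b[i]));
  }
  return out;
}

// Multiplies lanes 1, 3, 5, ... and returns one 64-bit product per lane pair.
inline std::vector<int64_t> WidenMulOdd(const std::vector<int32_t>& a,
                                        const std::vector<int32_t>& b) {
  std::vector<int64_t> out;
  for (size_t i = 1; i < a.size() && i < b.size(); i += 2) {
    out.push_back(WidenMul(a[i], b[i]));
  }
  return out;
}
```

Splitting into even and odd products is why the results come out de-interleaved: restoring the original lane order takes an extra interleave step, which is only needed when the order matters.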