[FEATURE] deinterleave for arbitrary n #1190

Open
Tracked by #1554
DenisYaroshevskiy opened this issue Jan 22, 2022 · 5 comments
Labels
feature New feature or request

Comments

@DenisYaroshevskiy
Collaborator

DenisYaroshevskiy commented Jan 22, 2022

I implemented deinterleave for n = 2 and n = 4, but all other values of n require more work.

Relevant stack overflows:
https://stackoverflow.com/a/55932030/5021064
https://stackoverflow.com/a/69083795/5021064
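For reference, the operation being requested can be pinned down with a scalar version. This is only a minimal sketch of the semantics for arbitrary n, not EVE code; the name `deinterleave_ref` is a hypothetical helper invented for illustration, and a real implementation would use shuffles rather than a scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference for an n-way deinterleave (hypothetical helper, not EVE API).
// Input:  a0 b0 c0 a1 b1 c1 ...  (n fields per element, interleaved)
// Output: n vectors {a0 a1 ...}, {b0 b1 ...}, {c0 c1 ...}
template <typename T>
std::vector<std::vector<T>> deinterleave_ref(const std::vector<T>& in, std::size_t n)
{
  // The input must hold a whole number of n-field elements.
  assert(n > 0 && in.size() % n == 0);

  const std::size_t count = in.size() / n;
  std::vector<std::vector<T>> out(n, std::vector<T>(count));

  for (std::size_t i = 0; i < count; ++i)     // element index
    for (std::size_t k = 0; k < n; ++k)       // field index within the element
      out[k][i] = in[i * n + k];

  return out;
}
```

A SIMD version for a given n has to produce the same result as this loop, which makes it a convenient oracle when testing hand-written shuffles.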

@DenisYaroshevskiy DenisYaroshevskiy added the feature New feature or request label Jan 22, 2022
@andipeer

In our physics computation software, we often need to work with 3D vectors stored as AoS (array of structures). Currently, we use the SSE intrinsics proposed in an article from Intel (I couldn't find it on their website anymore, but fortunately the Wayback Machine has it). We would like to get rid of our intrinsics code and use the EVE library instead but, as you stated, the interleaving functionality is currently not optimized for this task. Here is a godbolt comparison of the code generated by the intrinsics and by deinterleave_groups(...): https://godbolt.org/z/v56c5vj51 With the compiler we are using (GCC 11), deinterleave_groups(...) currently produces much longer and slower code.

Do you have any suggestions on how we could improve this using functionality that is already available in the EVE library? I browsed through the code, but from what I understand, there is currently no high-level mapping for the _mm_shuffle_ps intrinsic that combines two registers using an arbitrary pattern. Would the way to go be to provide an SSE-specific implementation of deinterleave_groups_(...)? If so, I could try to create a pull request for it.
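To make the `_mm_shuffle_ps` pattern concrete, here is a minimal sketch of one well-known way to split four packed xyz triples into separate x, y and z registers with five shuffles. It is an assumption about the general shape of such code, not a copy of the article or of the godbolt link above: it uses three contiguous unaligned loads, and the name `deinterleave_xyz4` is hypothetical. A scalar fallback is compiled when SSE is not available.

```cpp
#include <cassert>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define XYZ4_HAS_SSE 1
#endif

// Hypothetical helper: p points to 12 floats laid out as
// x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3; x, y, z each receive 4 floats.
inline void deinterleave_xyz4(const float* p, float* x, float* y, float* z)
{
  assert(p && x && y && z);
#if defined(XYZ4_HAS_SSE)
  const __m128 x0y0z0x1 = _mm_loadu_ps(p + 0);
  const __m128 y1z1x2y2 = _mm_loadu_ps(p + 4);
  const __m128 z2x3y3z3 = _mm_loadu_ps(p + 8);

  // _mm_shuffle_ps(a, b, _MM_SHUFFLE(d, c, b_, a_)) yields {a[a_], a[b_], b[c], b[d]}.
  const __m128 x2y2x3y3 = _mm_shuffle_ps(y1z1x2y2, z2x3y3z3, _MM_SHUFFLE(2, 1, 3, 2));
  const __m128 y0z0y1z1 = _mm_shuffle_ps(x0y0z0x1, y1z1x2y2, _MM_SHUFFLE(1, 0, 2, 1));

  const __m128 vx = _mm_shuffle_ps(x0y0z0x1, x2y2x3y3, _MM_SHUFFLE(2, 0, 3, 0)); // x0 x1 x2 x3
  const __m128 vy = _mm_shuffle_ps(y0z0y1z1, x2y2x3y3, _MM_SHUFFLE(3, 1, 2, 0)); // y0 y1 y2 y3
  const __m128 vz = _mm_shuffle_ps(y0z0y1z1, z2x3y3z3, _MM_SHUFFLE(3, 0, 3, 1)); // z0 z1 z2 z3

  _mm_storeu_ps(x, vx);
  _mm_storeu_ps(y, vy);
  _mm_storeu_ps(z, vz);
#else
  // Scalar fallback for targets without SSE.
  for (int i = 0; i < 4; ++i) {
    x[i] = p[3 * i + 0];
    y[i] = p[3 * i + 1];
    z[i] = p[3 * i + 2];
  }
#endif
}
```

Each output register needs values from all three inputs, while a single `_mm_shuffle_ps` can only draw from two, which is why two intermediate shuffles come first.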

@DenisYaroshevskiy
Collaborator Author

Hi!

  1. Thanks for sharing, it's a pretty cool trick.
  2. As a general problem, this is very hard.
  3. It's especially hard the way this code does it, using overlapping loads to achieve this effect.
  4. The deinterleave_iterator that could do it this well is on the roadmap, but I can't even begin to think when we might get to it.

What you want is to handwrite this function.
eve interoperates with intrinsics, no problem.

Here is the same code https://godbolt.org/z/WoqGWsKK3

I presume, if you are using eve, that you want an AVX2 version of this code too. I don't know how to write that one.
If you want ARM portability, my first guess is the vld3q_f32 intrinsic.
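As a rough illustration of that guess: `vld3q_f32` loads 12 consecutive floats and returns them already split into three registers, so on NEON the whole deinterleave is a single load. The wrapper name `deinterleave_xyz4_arm` is hypothetical, and the scalar branch is only there so the sketch also compiles on non-ARM targets.

```cpp
#include <cassert>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: p points to 12 floats laid out as x0 y0 z0 x1 y1 z1 ...;
// x, y, z each receive 4 floats.
inline void deinterleave_xyz4_arm(const float* p, float* x, float* y, float* z)
{
  assert(p && x && y && z);
#if defined(__ARM_NEON)
  // Structure load with stride 3: val[0] = x0..x3, val[1] = y0..y3, val[2] = z0..z3.
  const float32x4x3_t v = vld3q_f32(p);
  vst1q_f32(x, v.val[0]);
  vst1q_f32(y, v.val[1]);
  vst1q_f32(z, v.val[2]);
#else
  // Scalar fallback for targets without NEON.
  for (int i = 0; i < 4; ++i) {
    x[i] = p[3 * i + 0];
    y[i] = p[3 * i + 1];
    z[i] = p[3 * i + 2];
  }
#endif
}
```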

@jfalcou
Owner

jfalcou commented Mar 20, 2024

Depending on how your original code deals with its data, using a wide of struct in an SoA vector may be another way to do the migration.

@DenisYaroshevskiy
Collaborator Author

+1 to Joel: can you maybe change the storage format? You'd get very good perf.
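To illustrate what changing the storage format means, here is a minimal sketch in plain C++ (it deliberately does not use EVE's own SoA container, and `Vec3`, `Vec3SoA`, `to_soa` and `norms` are hypothetical names). Once each field lives in its own contiguous array, a kernel reads x, y and z with ordinary contiguous loads, and no deinterleave is needed at all.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };   // AoS element: fields interleaved in memory

struct Vec3SoA {                  // SoA: one contiguous array per field
  std::vector<float> x, y, z;
};

// One-time conversion from the AoS layout to the SoA layout.
inline Vec3SoA to_soa(const std::vector<Vec3>& aos)
{
  Vec3SoA soa;
  soa.x.reserve(aos.size());
  soa.y.reserve(aos.size());
  soa.z.reserve(aos.size());
  for (const Vec3& v : aos) {
    soa.x.push_back(v.x);
    soa.y.push_back(v.y);
    soa.z.push_back(v.z);
  }
  return soa;
}

// Example kernel: Euclidean norm of every vector. Each input stream is
// contiguous, so the loop maps directly onto plain SIMD loads.
inline std::vector<float> norms(const Vec3SoA& v)
{
  assert(v.x.size() == v.y.size() && v.y.size() == v.z.size());
  std::vector<float> out(v.x.size());
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = std::sqrt(v.x[i] * v.x[i] + v.y[i] * v.y[i] + v.z[i] * v.z[i]);
  return out;
}
```

The trade-off is that every piece of code touching the data has to move to the new layout, so the conversion cost is paid in the code base rather than at run time.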

@andipeer

Thanks for your quick and very detailed answers! I agree that storing the data as SoA would be the best solution, but of course it is also the one that would require the most changes to our code base. So for now, we will probably stick with AoS and continue to use the intrinsics for SSE, with deinterleave_groups as a fallback for architectures where we currently don't have a hand-optimized version. The plan is to add AVX2 support in the near future, and maybe ARM in the long term.

Just want to mention that I really enjoy working with EVE! I've tried several other libraries, but none of those provided such a smooth experience when porting the code. Keep up the great work!

3 participants