
SIMD optimization #171

Open
1 of 2 tasks
arkpar opened this issue Jan 23, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

arkpar commented Jan 23, 2023

IndexTable::find_entry could benefit from SIMD optimizations for parallel entry search.

@arkpar arkpar added the enhancement New feature or request label Jan 23, 2023
Tpt commented Jan 30, 2023

I just started to investigate it quickly. The code generated by rustc -O is already quite good. The inner loop of the function mostly does a bounds check on the chunk to extract the entry, a shift on the entry value to get the key, and a basic comparison to check that the entry has not already been found. I have pushed #175 to move the bounds check out of the loop.
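For illustration, a minimal scalar sketch of the kind of loop described above. The function name, chunk size, and entry layout (partial key in the upper bits of each 64-bit entry) are assumptions for the example, not parity-db's actual code:

```rust
// Hypothetical sketch: scan 64-bit entries in a chunk for one whose
// partial-key bits (assumed to live in the upper bits) match the target.
const CHUNK_ENTRIES: usize = 64;

fn find_entry(chunk: &[u64; CHUNK_ENTRIES], partial_key: u64, key_bits: u32) -> Option<usize> {
    for (i, &entry) in chunk.iter().enumerate() {
        if entry == 0 {
            continue; // empty slot
        }
        // Shift out the non-key bits, then compare the partial key.
        if entry >> (64 - key_bits) == partial_key {
            return Some(i);
        }
    }
    None
}

fn main() {
    let mut chunk = [0u64; CHUNK_ENTRIES];
    // Store an entry whose top 34 bits are the partial key 0x2_2222_2222.
    chunk[5] = (0x2_2222_2222u64 << 30) | 0x1234;
    assert_eq!(find_entry(&chunk, 0x2_2222_2222, 34), Some(5));
    assert_eq!(find_entry(&chunk, 0x3_3333_3333, 34), None);
    println!("ok");
}
```

The per-iteration work is just a shift and a compare, which is why a bounds check inside the loop is a noticeable fraction of the cost.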

About SIMD, the AVX instructions for unsigned 64-bit integers (the _mm_XXX_epu64 intrinsics) are still "experimental" in Rust, so there seems to be no easy way to implement this with unsigned instructions on stable Rust at the moment. Something may be doable with assembly, or by using masks with the signed instructions (I am not sure; I am not very familiar with SIMD).

arkpar commented Jan 30, 2023

I think we can get away with 32-bit instructions. We only really need to check the first 32-bit word of each 64-bit entry for the partial key. 16-bit index entries contain 34 bits of key and 17-bit entries have 33, but I think we can simply ignore the last two bits and do a full key comparison in case the first 32 bits match. I don't have that much experience with SIMD myself, but I suppose it is possible to load and pack a register with every second word. Or run the search on all 32-bit words and then filter out every odd result.

hashbrown uses _mm_cmpeq_epi8, so I guess we could also use the signed _mm_cmpeq_epi32. Signedness should not matter for equality comparison.
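The signedness point can be checked directly: equality on bit patterns is independent of how the bits are interpreted, so two 32-bit values compare equal as i32 exactly when they compare equal as u32. A quick demonstration:

```rust
// Equality is signedness-agnostic: reinterpreting the same bit patterns
// as signed integers never changes the result of an == comparison,
// which is why a signed _mm_cmpeq_epi32 works for unsigned keys.
fn main() {
    let pairs: [(u32, u32); 3] = [
        (0xFFFF_FFFF, 0xFFFF_FFFF), // both reinterpret as -1i32
        (0x8000_0000, 0x8000_0000), // both reinterpret as i32::MIN
        (0x8000_0000, 0x7FFF_FFFF), // unequal under either view
    ];
    for (a, b) in pairs {
        assert_eq!(a == b, (a as i32) == (b as i32));
    }
    println!("ok");
}
```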

So the whole thing should be:

  1. Build target with _mm_set1_epi32

Then, for each 256 bits:

  1. Load every second 32-bit word into a register
  2. Zero out the n lower bits that should not be compared (n = index_bits - 18, if I'm not mistaken). Alternatively, right-shift by n
  3. Compare with target using _mm_cmpeq_epi32 to build a result bitmap
  4. Return the leftmost bit in the result, if any.
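The steps above can be sketched with stable SSE2 intrinsics (a 128-bit stand-in for the 256-bit AVX version, processing four entries per iteration). The function name is hypothetical, and the layout assumptions are: little-endian, partial key in the upper 32 bits of each 64-bit entry, and `n` low bits of that word to discard before comparing. A real implementation would also need to handle empty (zero) entries:

```rust
// Hypothetical sketch of the SIMD inner step: compare four entries'
// upper 32-bit words against a broadcast target at once.
#[cfg(target_arch = "x86_64")]
fn find_in_four(entries: &[u64; 4], target: u32, n: i32) -> Option<usize> {
    use std::arch::x86_64::*;
    unsafe {
        // Load four 64-bit entries as two 128-bit vectors.
        let lo = _mm_loadu_si128(entries.as_ptr() as *const __m128i);
        let hi = _mm_loadu_si128(entries.as_ptr().add(2) as *const __m128i);
        // Keep every second 32-bit word (the upper half of each entry).
        let lo_hi = _mm_shuffle_epi32(lo, 0b11_01_11_01);
        let hi_hi = _mm_shuffle_epi32(hi, 0b11_01_11_01);
        let packed = _mm_unpacklo_epi64(lo_hi, hi_hi);
        // Right-shift all lanes to drop the bits that should not be compared.
        let packed = _mm_srl_epi32(packed, _mm_cvtsi32_si128(n));
        // Broadcast the target and compare all four lanes at once.
        let t = _mm_set1_epi32(target as i32);
        let cmp = _mm_cmpeq_epi32(packed, t);
        // One bit per 32-bit lane; lowest set bit = leftmost matching entry.
        let mask = _mm_movemask_ps(_mm_castsi128_ps(cmp));
        if mask != 0 { Some(mask.trailing_zeros() as usize) } else { None }
    }
}

// Scalar fallback for non-x86_64 targets, same semantics.
#[cfg(not(target_arch = "x86_64"))]
fn find_in_four(entries: &[u64; 4], target: u32, n: i32) -> Option<usize> {
    entries.iter().position(|&e| ((e >> 32) as u32) >> n == target)
}

fn main() {
    let entries = [0u64, 0xDEAD_BEEF_0000_0000, 0, 0];
    // Upper word of entries[1] is 0xDEAD_BEEF; ignore its low 2 bits.
    assert_eq!(find_in_four(&entries, 0xDEAD_BEEF >> 2, 2), Some(1));
    assert_eq!(find_in_four(&entries, 0x1234_5678 >> 2, 2), None);
    println!("ok");
}
```

Here the `target` is passed pre-shifted, so step 2 only has to shift the packed lanes; `_mm_movemask_ps` on the comparison result gives the bitmap of step 3, and `trailing_zeros` picks the leftmost match for step 4.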
