Simplify SSSE3 nibble2base function #1802
I like it: it seems like an easy win for complexity, and it helps on speed as well. I did consider whether the memory loads were a bottleneck, so I tried loading two lanes and then processing two lanes, so that the second load is essentially pipelined. It appeared to work, but not for the reason I thought:
My new code was this (with
Ie load
It's clearly quicker on this system: 8-15% faster. However, it has nothing to do with waiting on memory!
PR (clang);
encoded[2] array (clang)
So the reduction is in instructions and, correspondingly, cycles. It is the same with gcc and clang. It turns out to be simply due to unrolling the loop, so that the main while-loop condition is checked less often. Surprising that it's that significant! However, it's icing on the cake, and frankly all of these are fast enough now. It's using 4% of the total CPU for a test_view from uncompressed BAM to SAM, which is hardly a problem.
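To make the unrolling effect concrete, here is a minimal sketch of a loop that decodes two 16-byte vectors per trip. The original snippet did not survive on this page, so this is a reconstruction under assumptions, not the commenter's code: the names `nt16_table`, `decode32` and `nibble2base_unrolled` are hypothetical, and `__attribute__((target("ssse3")))` assumes GCC or Clang on x86.

```c
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (also pulls in SSE2) */

/* 4-bit code -> base character (hypothetical local copy of the table). */
static const char nt16_table[] = "=ACMGRSVTWYHKDBN";

/* Decode one vector of 16 packed bytes into 32 bases at dst. */
__attribute__((target("ssse3")))
static inline void decode32(__m128i packed, char *dst, __m128i lookup)
{
    const __m128i mask = _mm_set1_epi8(0x0f);
    __m128i hi = _mm_and_si128(_mm_srli_epi16(packed, 4), mask);
    __m128i lo = _mm_and_si128(packed, mask);
    /* Interleave so that each high nibble precedes its low nibble. */
    _mm_storeu_si128((__m128i *)dst,
                     _mm_shuffle_epi8(lookup, _mm_unpacklo_epi8(hi, lo)));
    _mm_storeu_si128((__m128i *)(dst + 16),
                     _mm_shuffle_epi8(lookup, _mm_unpackhi_epi8(hi, lo)));
}

__attribute__((target("ssse3")))
static void nibble2base_unrolled(const uint8_t *nib, char *seq, size_t len)
{
    const __m128i lookup = _mm_loadu_si128((const __m128i *)nt16_table);
    size_t i = 0;

    /* Two loads per trip: the loop condition is tested once per
     * 64 bases instead of once per 32. */
    for (; len - i >= 64; i += 64) {
        __m128i encoded[2];
        encoded[0] = _mm_loadu_si128((const __m128i *)(nib + i / 2));
        encoded[1] = _mm_loadu_si128((const __m128i *)(nib + i / 2 + 16));
        decode32(encoded[0], seq + i, lookup);
        decode32(encoded[1], seq + i + 32, lookup);
    }

    /* Scalar tail for the remaining (fewer than 64) bases. */
    for (; i < len; i++) {
        uint8_t byte = nib[i / 2];
        seq[i] = nt16_table[(i & 1) ? (byte & 0x0f) : (byte >> 4)];
    }
}
```

The per-vector work is identical to a one-vector loop; only the number of compare-and-branch steps per decoded base changes, which matches the explanation above that the saving is in instruction count rather than in memory latency.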
That's kinda surprising indeed. If you look at the assembly it maps to very few instructions, so it is not that odd that the iterator check is significant. Kinda mind-boggling though that this decoding algorithm is so blazing-fast that iteration is a major component of the compute time.
The help on speed is because this compiles to very simple assembly (I checked on godbolt). The previous algorithm was more complex; as a result, the compiler ran out of registers and issued a lot more load instructions to reload spilled registers from the stack.
Thanks for the PR. Yes, I checked the assembly from gcc and also found it surprisingly succinct. It's not always that clear with some SIMD intrinsics, as they can map to a variable number of instructions, especially things like the set intrinsics.
Also, why isn't there an
There are probably electronic constraints. They have integrated 16-bit, 32-bit and 64-bit shifts, and integrating 8-bit shifts would require extra wiring and logic gates. There are physical constraints in microchip design, so that is probably why. They could add It is not that big of a deal though: the workaround is one more bitwise AND instruction, with a throughput of 1/3 of a cycle. I suspect the gains of an epi8 shift would be minimal.
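The workaround mentioned here can be sketched as follows. SSE2 offers 16-, 32- and 64-bit lane shifts but no per-byte shift, so the usual trick is to shift the wider lanes and then mask off the bits that leaked in from the neighbouring byte. The helper name `srli4_epi8_sketch` is hypothetical; it is not a real intrinsic.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: shift every byte of v right by 4 bits.
 * The 16-bit shift moves the low nibble of each odd byte into the
 * top four bits of the even byte below it; the AND clears those
 * stray bits, leaving each byte equal to (original byte >> 4). */
static __m128i srli4_epi8_sketch(__m128i v)
{
    __m128i shifted = _mm_srli_epi16(v, 4);
    return _mm_and_si128(shifted, _mm_set1_epi8(0x0f));
}
```

The cost over a native byte shift would be the single `_mm_and_si128`, which is the extra AND the comment above refers to.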
As promised in #1795, here is the simplified version of the SSSE3 routine for nibble2base conversion. @jmarshall suggested that the unpack instructions could be used for a simpler routine and that is what is done here.
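For readers without the diff at hand, the unpack-based idea can be sketched roughly as below. This is an illustration under assumptions, not the code merged in this PR: the function names and the local `nt16_table` are hypothetical, and `__attribute__((target("ssse3")))` assumes GCC or Clang on x86.

```c
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (also pulls in SSE2) */

/* 4-bit code -> base character (hypothetical local copy of the table). */
static const char nt16_table[] = "=ACMGRSVTWYHKDBN";

/* Scalar reference; also used for the tail of the SIMD version. */
static void nibble2base_scalar(const uint8_t *nib, char *seq, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t byte = nib[i / 2];
        seq[i] = nt16_table[(i & 1) ? (byte & 0x0f) : (byte >> 4)];
    }
}

__attribute__((target("ssse3")))
static void nibble2base_unpack(const uint8_t *nib, char *seq, size_t len)
{
    /* The 16-entry table fits in one register, so pshufb acts as the lookup. */
    const __m128i lookup = _mm_loadu_si128((const __m128i *)nt16_table);
    const __m128i low_mask = _mm_set1_epi8(0x0f);
    size_t done = 0;

    while (len - done >= 32) {
        __m128i packed = _mm_loadu_si128((const __m128i *)(nib + done / 2));
        /* Split each byte into its high and low nibble. */
        __m128i hi = _mm_and_si128(_mm_srli_epi16(packed, 4), low_mask);
        __m128i lo = _mm_and_si128(packed, low_mask);
        /* Interleave: hi0, lo0, hi1, lo1, ... restores base order. */
        __m128i first = _mm_unpacklo_epi8(hi, lo);
        __m128i second = _mm_unpackhi_epi8(hi, lo);
        _mm_storeu_si128((__m128i *)(seq + done),
                         _mm_shuffle_epi8(lookup, first));
        _mm_storeu_si128((__m128i *)(seq + done + 16),
                         _mm_shuffle_epi8(lookup, second));
        done += 32;
    }

    /* done is a multiple of 32, so the tail starts on a byte boundary. */
    nibble2base_scalar(nib + done / 2, seq + done, len - done);
}
```

The two unpack instructions do all of the reordering, so no extra shuffle masks or temporaries are needed beyond the nibble mask and the lookup table.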