Refactor SIMD code and add ARM Neon nibble2base() implementation #1795
Conversation
This function is always called via a function pointer, hence it is never inlined and so need not be in a header. [Commit marked as authored by RV so git blame attributes the moved code to Ruben, as it is unchanged from the original code. Temporarily include simd.c from the header so this commit builds and runs. -JM]
Also add copyright boilerplate to the new file. [Commit marked as authored by RMD so git blame attributes the boilerplate and moved code to Rob, as the latter is unchanged from the original code. -JM]
Build simd.c as a source file, so there is one copy of nibble2base_ssse3() rather than potentially one per translation unit that uses it. When everything was static, the names needed no prefix. Now that there is a non-static function pointer, however, it needs an htslib-specific prefix to avoid polluting the namespace. (In libhts.so it will usually be a non-visible symbol, but in libhts.a it needs a prefix.)
Configure checks for __builtin_cpu_supports("ssse3"), so the check for __attribute__((target)) might as well be there too. This enables this internal-only infrastructure to be removed from htslib/hts_defs.h, which is primarily for defines needed in the public headers. Add a configure check for __attribute__((constructor)) too, as we will likely be using it on other SIMD-using platforms as well.
Notably the adjustment to seq_vec_end_ptr stops it from reverting to the scalar code prematurely: at present it drops back to scalar when <=32 entries remain, rather than <=31.
As ARM CPU capability instructions are privileged, we also need to add cpu_supports_neon() and implement it to query for Neon/AdvSIMD in various platform-dependent ways. (On 32-bit ARM, glibc helpfully renamed the HWCAP_* macros to HWCAP_ARM_*, so accept both on Linux.) 32-bit ARM GCC erroneously does not define vst1q_u8_x2() (bug 71233, fixed in v14.1 in this case), so avoid it on 32-bit ARM by writing interleaved via vst2q_u8() instead.
That's impressive. On my Intel machine SSSE3 appears to run about 4x faster than the default version. You're getting an 8-9x speed-up, which is great. PS. The mythical htslib v2.0 will not use nibble encoding in memory. It's a royal pain in the derriere!
I updated the timings to values from the final code and test code in this PR (previously the timings were from earlier development code), included the 1-nibble-at-a-time naive scalar code timings, and added some x86_64 timings on a different iteration count that makes the times on one row comparable. (Not totally sure I believe that 916 seconds! And I only ran it once. But it's indicative that the difference between the two scalar versions is much bigger on x86_64.) It's not really meaningful to compare between the columns, but what it does tell you is this: the 2-nibbles-at-a-time scalar code is a much smaller win over 1-nibble-at-a-time on ARM than on x86_64. So the apparent 8-9x speed-up might not be that astonishing, because the 2-at-a-time scalar code, at least as written here, leaves a lot on the table on ARM. Or something; who knows, it's a nice speed up anyway 😄
Yep, very nice and thank you. I hunted for the origin of nibble2base and backtracked it to my initial BAM implementation in io_lib in Jan 2013 :-) https://github.com/jkbonfield/io_lib/blob/master/io_lib/bam.c#L3650-L3657
I think it would be possible, and possibly more optimal. Interesting find!
Did a quick test in Sequali, the following code is simpler and all tests pass:

```c
static void
decode_bam_sequence_ssse3(uint8_t *dest, const uint8_t *encoded_sequence, size_t length)
{
    static const uint8_t *nuc_lookup = (uint8_t *)"=ACMGRSVTWYHKDBN";
    const uint8_t *dest_end_ptr = dest + length;
    uint8_t *dest_cursor = dest;
    const uint8_t *encoded_cursor = encoded_sequence;
    const uint8_t *dest_vec_end_ptr = dest_end_ptr - (2 * sizeof(__m128i) - 1);
    __m128i nuc_lookup_vec = _mm_lddqu_si128((__m128i *)nuc_lookup);
    /* Nucleotides are encoded 4-bits per nucleotide and stored in 8-bit bytes
       as follows: |AB|CD|EF|GH|. The 4-bit codes (going from 0-15) can be used
       together with the pshufb instruction as a lookup table. The most efficient
       way is to use bitwise AND and shift to create two vectors. One with all
       the upper codes (|A|C|E|G|) and one with the lower codes (|B|D|F|H|).
       The lookup can then be performed and the resulting vectors can be
       interleaved again using the unpack instructions. */
    while (dest_cursor < dest_vec_end_ptr) {
        __m128i encoded = _mm_lddqu_si128((__m128i *)encoded_cursor);
        __m128i encoded_upper = _mm_srli_epi64(encoded, 4);
        encoded_upper = _mm_and_si128(encoded_upper, _mm_set1_epi8(15));
        __m128i encoded_lower = _mm_and_si128(encoded, _mm_set1_epi8(15));
        __m128i nucs_upper = _mm_shuffle_epi8(nuc_lookup_vec, encoded_upper);
        __m128i nucs_lower = _mm_shuffle_epi8(nuc_lookup_vec, encoded_lower);
        __m128i first_nucleotides = _mm_unpacklo_epi8(nucs_upper, nucs_lower);
        __m128i second_nucleotides = _mm_unpackhi_epi8(nucs_upper, nucs_lower);
        _mm_storeu_si128((__m128i *)dest_cursor, first_nucleotides);
        _mm_storeu_si128((__m128i *)(dest_cursor + 16), second_nucleotides);
        encoded_cursor += sizeof(__m128i);
        dest_cursor += 2 * sizeof(__m128i);
    }
    decode_bam_sequence_default(dest_cursor, encoded_cursor, dest_end_ptr - dest_cursor);
}
```

Thanks @jmarshall! This is a great improvement codewise!

EDIT: (Of course I will contribute this after this is merged, or feel free to incorporate it right away.)
When building on my Mac without using
This is because use of the function is gated by
Oops. I think it was more because on ARM non-configure builds, it wasn't building. Probably in theory the definition of
Cheers, that's fixed the problem.
This PR moves nibble2base_ssse3() from an internal header file to a new source file, simd.c, and adds an ARM implementation nibble2base_neon() alongside. Because these functions are always called via a pointer, they are never inlined, so there is nothing gained (other than build complexity!) by having them in a header.

I initially translated the SSSE3 algorithm directly to Neon, during which I noticed some tweaks that could be usefully applied to the SSSE3 version (see the "Minor improvements" commit below). I then realised that the Neon implementation could be simplified by using Neon's rather excellent interleaved load/store instructions, which I then replaced by an in-register zip as it is (sadly) faster, on aarch64 at least. (It may be possible to apply a similar approach for Intel via the unpack instructions.)
Indicative timings for various versions of the Neon implementation, on ARM and, for interest, an iteration count on x86_64 that makes the nibble2base_default time comparable. Numbers are comparable vertically, but not directly horizontally:
I moved the existing code to the new source file in a pair of commits marked as authored by @rhpvorderman and @daviesrob, so that git blame preserves the authorship of their untouched lines. Hence ideally this PR would be merged via a merge commit rather than squashed.