
Simplify SSSE3 nibble2base function #1802

Merged
merged 1 commit into samtools:develop from simplernibblessse3 on Jul 8, 2024

Conversation

rhpvorderman
Contributor

As promised in #1795, here is the simplified version of the SSSE3 routine for nibble2base conversion. @jmarshall suggested that the unpack instructions could be used for a simpler routine, and that is what is done here.
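
For readers skimming the thread, here is a minimal sketch of the per-vector step being described. It is reconstructed from the snippet quoted further down, not copied from the merged htslib code; the function name nibble2base_vec is invented for illustration, and nuc_lookup_vec is assumed to hold htslib's 16-entry seq_nt16_str table.

    #include <stdint.h>
    #include <immintrin.h>

    /* Sketch: decode one 16-byte vector of 4-bit encoded bases
       (32 bases) into ASCII. */
    static inline void nibble2base_vec(__m128i nuc_lookup_vec,
                                       const uint8_t *nibble_cursor,
                                       uint8_t *seq_cursor)
    {
        __m128i encoded = _mm_lddqu_si128((const __m128i *)nibble_cursor);
        /* Isolate the high and low nibble of every byte. */
        __m128i enc_upper = _mm_and_si128(_mm_srli_epi64(encoded, 4),
                                          _mm_set1_epi8(15));
        __m128i enc_lower = _mm_and_si128(encoded, _mm_set1_epi8(15));
        /* _mm_shuffle_epi8 acts as a 16-entry table lookup per nibble. */
        __m128i nucs_upper = _mm_shuffle_epi8(nuc_lookup_vec, enc_upper);
        __m128i nucs_lower = _mm_shuffle_epi8(nuc_lookup_vec, enc_lower);
        /* The unpack pair interleaves the two translated vectors,
           restoring sequence order (the high nibble holds the first
           base of each byte). */
        _mm_storeu_si128((__m128i *)seq_cursor,
                         _mm_unpacklo_epi8(nucs_upper, nucs_lower));
        _mm_storeu_si128((__m128i *)(seq_cursor + sizeof(__m128i)),
                         _mm_unpackhi_epi8(nucs_upper, nucs_lower));
    }

The key point is that _mm_shuffle_epi8 doubles as a table lookup and the unpacks interleave the results, so each 16-byte load produces 32 output bases with only a handful of instructions.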

@rhpvorderman force-pushed the simplernibblessse3 branch 3 times, most recently from e6e42b4 to 2a33e6a on July 8, 2024 06:13
@jkbonfield
Contributor

I like it. It seems like an easy win on complexity, and it helps on speed as well.

I did consider whether the memory loads were a bottleneck, so I tried loading two lanes and then processing two lanes, so that the second load is essentially pipelined. It looked to work, but not for the reason I thought:

# ./test/test_nibbles -c 10000000 -r 5000 -n timings
dev  gcc13 3.99, clang16 3.85                                                   
pr   gcc13 3.07, clang16 3.10                                                   
new  gcc13 2.61, clang16 2.85                                                   

My new code was this (with seq_vec_end_ptr adjusted accordingly):

    while (seq_cursor < seq_vec_end_ptr) {
        __m128i encoded[2];
        int i;
        /* Load both input lanes up front so the second load can
           overlap with processing of the first. */
        for (i = 0; i < 2; i++) {
            encoded[i] = _mm_lddqu_si128((__m128i *)nibble_cursor);
            nibble_cursor += sizeof(__m128i);
        }

        /* Decode each lane as in the PR: isolate the nibbles,
           translate them via the lookup table, then interleave the
           high- and low-nibble bases back into sequence order. */
        for (i = 0; i < 2; i++) {
            __m128i enc_upper = _mm_srli_epi64(encoded[i], 4);
            enc_upper = _mm_and_si128(enc_upper, _mm_set1_epi8(15));
            __m128i enc_lower = _mm_and_si128(encoded[i], _mm_set1_epi8(15));
            __m128i nucs_upper = _mm_shuffle_epi8(nuc_lookup_vec, enc_upper);
            __m128i nucs_lower = _mm_shuffle_epi8(nuc_lookup_vec, enc_lower);
            __m128i first_nucs = _mm_unpacklo_epi8(nucs_upper, nucs_lower);
            __m128i second_nucs = _mm_unpackhi_epi8(nucs_upper, nucs_lower);
            _mm_storeu_si128((__m128i *)seq_cursor, first_nucs);
            _mm_storeu_si128((__m128i *)(seq_cursor + sizeof(__m128i)),
                             second_nucs);
            seq_cursor += 2 * sizeof(__m128i);
        }
    }

I.e., load encoded[2] worth of data and then process it.

It's clearly quicker on this system (8-15% faster). However, it has nothing to do with waiting on memory!

PR (clang):

 Performance counter stats for './test/test_nibbles -c 10000000 -r 5000 -n':

       2994.299226      task-clock (msec)         #    1.000 CPUs utilized          
                 5      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               473      page-faults               #    0.158 K/sec                  
        8901778582      cycles                    #    2.973 GHz                      (83.30%)
         695691208      stalled-cycles-frontend   #    7.82% frontend cycles idle     (83.31%)
         175195991      stalled-cycles-backend    #    1.97% backend cycles idle      (66.60%)
       28838017000      instructions              #    3.24  insn per cycle         
                                                  #    0.02  stalled cycles per insn  (83.30%)
        1686141707      branches                  #  563.117 M/sec                    (83.41%)
          10109044      branch-misses             #    0.60% of all branches          (83.38%)

       2.995203340 seconds time elapsed

encoded[2] array (clang):

 Performance counter stats for './test/test_nibbles -c 10000000 -r 5000 -n':

       2799.450579      task-clock (msec)         #    1.000 CPUs utilized          
                 5      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               472      page-faults               #    0.169 K/sec                  
        8339368809      cycles                    #    2.979 GHz                      (83.29%)
        1408302220      stalled-cycles-frontend   #   16.89% frontend cycles idle     (83.28%)
         179340812      stalled-cycles-backend    #    2.15% backend cycles idle      (66.64%)
       25724737241      instructions              #    3.08  insn per cycle         
                                                  #    0.05  stalled cycles per insn  (83.35%)
         906285051      branches                  #  323.737 M/sec                    (83.43%)
          11540581      branch-misses             #    1.27% of all branches          (83.36%)

       2.800534160 seconds time elapsed

So the reduction is in instructions and, correspondingly, cycles. Same with gcc and clang. It turns out to be simply due to unrolling the loop and then having the main while-loop condition checked less often. Surprising that it's that significant!

However, it's icing on the cake, and frankly all of these are fast enough now. The routine uses 4% of the total CPU for a test_view from uncompressed BAM to SAM, which is hardly a problem.

@rhpvorderman
Contributor Author

> So the reduction is in instructions and, correspondingly, cycles. Same with gcc and clang. It turns out to be simply due to unrolling the loop and then having the main while-loop condition checked less often. Surprising that it's that significant!

That's kinda surprising indeed. If you look at the assembly, it maps to very few instructions, so it is not that odd that the iterator check is significant. Kinda mind-boggling, though, that this decoding algorithm is so blazing fast that iteration is a major component of the compute time.

> I like it. It seems like an easy win on complexity, and it helps on speed as well.

The help on speed is because this compiles to very simple assembly (I checked on godbolt). The previous algorithm was more complex; as a result, the compiler ran out of registers and issued a lot more load instructions to reload spilled registers from the stack.

@jkbonfield merged commit 6012472 into samtools:develop on Jul 8, 2024
10 checks passed
@jkbonfield
Contributor

Thanks for the PR.

Yes, I checked the assembly from gcc and also found it surprisingly succinct. It's not always that clear with some SIMD intrinsics, as they can map to a variable number of instructions, especially things like the set intrinsics.

@rhpvorderman deleted the simplernibblessse3 branch on July 8, 2024 13:11
@jkbonfield
Contributor

Also, why isn't there an _mm_srli_epi8? As always seems to be the case with Intel intrinsics, there are baffling holes in the instruction set. I was thinking the AND instruction wouldn't be needed, but never mind.

@rhpvorderman
Contributor Author

There are probably hardware constraints. The chips integrate 16-bit, 32-bit and 64-bit shifts, and adding 8-bit shifts would require extra wiring and logic gates; physical constraints in microchip design are probably why it was left out.

They could implement _mm_srli_epi8 as a sequence of instructions, but then its performance would noticeably differ from the other _mm_srli_epixx instructions.

It is not that big of a deal though: the extra bitwise AND instruction has a throughput of 1/3 of a cycle, so I suspect the gains of an epi8 shift would be minimal.
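
For the record, the usual workaround is a wider shift followed by a mask to clear the bits that leak in from the neighbouring byte. This hypothetical helper (the name mm_srli_epi8 is invented here; it is not a real intrinsic) shows the idiom, which for a count of 4 is exactly the srli + AND pair used in the routine above:

    #include <immintrin.h>

    /* Hypothetical emulation of the missing 8-bit shift: shift 16-bit
       lanes, then mask off the bits shifted in from the high byte of
       each pair. */
    static inline __m128i mm_srli_epi8(__m128i v, int n) {
        return _mm_and_si128(_mm_srli_epi16(v, n),
                             _mm_set1_epi8((char)(0xFFu >> n)));
    }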
