
Simplify SSSE3 nibble2base function #1802

Merged
merged 1 commit into samtools:develop from simplernibblessse3 on Jul 8, 2024

Conversation

rhpvorderman
Contributor

As promised in #1795, here is the simplified version of the SSSE3 routine for nibble2base conversion. @jmarshall suggested that the unpack instructions could be used for a simpler routine, and that is what is done here.
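
For readers skimming the thread, here is a minimal sketch of the per-vector step being described. It is reconstructed from the snippet quoted further down, not copied from the merged htslib code; the function name nibble2base_vec is invented for illustration, and nuc_lookup_vec is assumed to hold htslib's 16-entry seq_nt16_str table.

    #include <stdint.h>
    #include <immintrin.h>

    /* Sketch: decode one 16-byte vector of 4-bit encoded bases
       (32 bases) into ASCII. */
    static inline void nibble2base_vec(__m128i nuc_lookup_vec,
                                       const uint8_t *nibble_cursor,
                                       uint8_t *seq_cursor)
    {
        __m128i encoded = _mm_lddqu_si128((const __m128i *)nibble_cursor);
        /* Isolate the high and low nibble of every byte. */
        __m128i enc_upper = _mm_and_si128(_mm_srli_epi64(encoded, 4),
                                          _mm_set1_epi8(15));
        __m128i enc_lower = _mm_and_si128(encoded, _mm_set1_epi8(15));
        /* _mm_shuffle_epi8 acts as a 16-entry table lookup per nibble. */
        __m128i nucs_upper = _mm_shuffle_epi8(nuc_lookup_vec, enc_upper);
        __m128i nucs_lower = _mm_shuffle_epi8(nuc_lookup_vec, enc_lower);
        /* The unpack pair interleaves the two translated vectors,
           restoring sequence order (the high nibble holds the first
           base of each byte). */
        _mm_storeu_si128((__m128i *)seq_cursor,
                         _mm_unpacklo_epi8(nucs_upper, nucs_lower));
        _mm_storeu_si128((__m128i *)(seq_cursor + sizeof(__m128i)),
                         _mm_unpackhi_epi8(nucs_upper, nucs_lower));
    }

The key point is that _mm_shuffle_epi8 doubles as a table lookup and the unpacks interleave the results, so each 16-byte load produces 32 output bases with only a handful of instructions.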

@rhpvorderman force-pushed the simplernibblessse3 branch 3 times, most recently from e6e42b4 to 2a33e6a on July 8, 2024 06:13
@jkbonfield
Contributor

I like it. It seems like an easy win on complexity, and it helps on speed as well.

I did consider whether the memory loads were a bottleneck, so I tried loading two lanes and then processing two lanes, so that the second load is essentially pipelined. It looked to work, but not for the reason I thought:

# ./test/test_nibbles -c 10000000 -r 5000 -n timings
dev  gcc13 3.99, clang16 3.85                                                   
pr   gcc13 3.07, clang16 3.10                                                   
new  gcc13 2.61, clang16 2.85                                                   

My new code was this (with seq_vec_end_ptr adjusted accordingly):

    while (seq_cursor < seq_vec_end_ptr) {
        __m128i encoded[2];
        int i;
        /* Load both input lanes up front so the second load can
           overlap with processing of the first. */
        for (i = 0; i < 2; i++) {
            encoded[i] = _mm_lddqu_si128((__m128i *)nibble_cursor);
            nibble_cursor += sizeof(__m128i);
        }

        /* Decode each lane as in the PR: isolate the nibbles,
           translate them via the lookup table, then interleave the
           high- and low-nibble bases back into sequence order. */
        for (i = 0; i < 2; i++) {
            __m128i enc_upper = _mm_srli_epi64(encoded[i], 4);
            enc_upper = _mm_and_si128(enc_upper, _mm_set1_epi8(15));
            __m128i enc_lower = _mm_and_si128(encoded[i], _mm_set1_epi8(15));
            __m128i nucs_upper = _mm_shuffle_epi8(nuc_lookup_vec, enc_upper);
            __m128i nucs_lower = _mm_shuffle_epi8(nuc_lookup_vec, enc_lower);
            __m128i first_nucs = _mm_unpacklo_epi8(nucs_upper, nucs_lower);
            __m128i second_nucs = _mm_unpackhi_epi8(nucs_upper, nucs_lower);
            _mm_storeu_si128((__m128i *)seq_cursor, first_nucs);
            _mm_storeu_si128((__m128i *)(seq_cursor + sizeof(__m128i)),
                             second_nucs);
            seq_cursor += 2 * sizeof(__m128i);
        }
    }

I.e., load encoded[2] worth of data and then process it.

It's clearly quicker on this system (8-15% faster). However, it has nothing to do with waiting on memory!

PR (clang):

 Performance counter stats for './test/test_nibbles -c 10000000 -r 5000 -n':

       2994.299226      task-clock (msec)         #    1.000 CPUs utilized          
                 5      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               473      page-faults               #    0.158 K/sec                  
        8901778582      cycles                    #    2.973 GHz                      (83.30%)
         695691208      stalled-cycles-frontend   #    7.82% frontend cycles idle     (83.31%)
         175195991      stalled-cycles-backend    #    1.97% backend cycles idle      (66.60%)
       28838017000      instructions              #    3.24  insn per cycle         
                                                  #    0.02  stalled cycles per insn  (83.30%)
        1686141707      branches                  #  563.117 M/sec                    (83.41%)
          10109044      branch-misses             #    0.60% of all branches          (83.38%)

       2.995203340 seconds time elapsed

encoded[2] array (clang):

 Performance counter stats for './test/test_nibbles -c 10000000 -r 5000 -n':

       2799.450579      task-clock (msec)         #    1.000 CPUs utilized          
                 5      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               472      page-faults               #    0.169 K/sec                  
        8339368809      cycles                    #    2.979 GHz                      (83.29%)
        1408302220      stalled-cycles-frontend   #   16.89% frontend cycles idle     (83.28%)
         179340812      stalled-cycles-backend    #    2.15% backend cycles idle      (66.64%)
       25724737241      instructions              #    3.08  insn per cycle         
                                                  #    0.05  stalled cycles per insn  (83.35%)
         906285051      branches                  #  323.737 M/sec                    (83.43%)
          11540581      branch-misses             #    1.27% of all branches          (83.36%)

       2.800534160 seconds time elapsed

So the reduction is in instructions and, correspondingly, cycles. Same with gcc and clang. It turns out to be simply due to unrolling the loop and then having the main while-loop condition checked less often. Surprising that it's that significant!

However, it's icing on the cake, and frankly all of these are fast enough now. The routine uses 4% of the total CPU for a test_view from uncompressed BAM to SAM, which is hardly a problem.

@rhpvorderman
Contributor Author

> So the reduction is in instructions and, correspondingly, cycles. Same with gcc and clang. It turns out to be simply due to unrolling the loop and then having the main while-loop condition checked less often. Surprising that it's that significant!

That's kinda surprising indeed. If you look at the assembly, it maps to very few instructions, so it is not that odd that the iterator check is significant. Kinda mind-boggling, though, that this decoding algorithm is so blazing fast that iteration is a major component of the compute time.

> I like it. It seems like an easy win on complexity, and it helps on speed as well.

The help on speed is because this compiles to very simple assembly (I checked on godbolt). The previous algorithm was more complex; as a result, the compiler ran out of registers and issued a lot more load instructions to reload spilled registers from the stack.

@jkbonfield merged commit 6012472 into samtools:develop on Jul 8, 2024
10 checks passed
@jkbonfield
Contributor

Thanks for the PR.

Yes, I checked the assembly from gcc and also found it surprisingly succinct. It's not always that clear with some SIMD intrinsics, as they can map to a variable number of instructions, especially things like the set intrinsics.

@rhpvorderman deleted the simplernibblessse3 branch on July 8, 2024 13:11
@jkbonfield
Contributor

Also, why isn't there an _mm_srli_epi8? As always seems to be the case with Intel intrinsics, there are baffling holes in the instruction set. I was thinking the AND instruction wouldn't be needed, but never mind.

@rhpvorderman
Contributor Author

There are probably hardware constraints. The chips integrate 16-bit, 32-bit and 64-bit shifts, and adding 8-bit shifts would require extra wiring and logic gates; physical constraints in microchip design are probably why it was left out.

They could implement _mm_srli_epi8 as a sequence of instructions, but then its performance would noticeably differ from the other _mm_srli_epixx instructions.

It is not that big of a deal though: the extra bitwise AND instruction has a throughput of 1/3 of a cycle, so I suspect the gains of an epi8 shift would be minimal.
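
For the record, the usual workaround is a wider shift followed by a mask to clear the bits that leak in from the neighbouring byte. This hypothetical helper (the name mm_srli_epi8 is invented here; it is not a real intrinsic) shows the idiom, which for a count of 4 is exactly the srli + AND pair used in the routine above:

    #include <immintrin.h>

    /* Hypothetical emulation of the missing 8-bit shift: shift 16-bit
       lanes, then mask off the bits shifted in from the high byte of
       each pair. */
    static inline __m128i mm_srli_epi8(__m128i v, int n) {
        return _mm_and_si128(_mm_srli_epi16(v, n),
                             _mm_set1_epi8((char)(0xFFu >> n)));
    }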
