Enable optimisation level -O3 for SAM QUAL+33 formatting. #1679

jkbonfield · 2023-09-27T15:12:43Z

On long read data, the time to format SAM files is dominated by sequence and quality.

The qual[i]+33 loop to turn binary quals into printable ASCII is not vectorised by GCC without using -O3. I would consider this a weakness of the compiler, but nothing I've done has persuaded gcc (before v12) to generate vector instructions. Not even the "restrict" keyword.

Hence using attribute((optimize("O3"))).

The time the new add33 function is approx 15x quicker with gcc -O3 than gcc -O2. Clang's and icc's default optimisation level gives speeds comparable to the gcc -O3.

With a compressed Illumina BAM this was just 3% overall speed gain to decode to SAM. The extreme opposite is uncompressed ONT BAM which shows a 23% speed gain.

On long read data, the time to format SAM files is dominated by sequence and quality. The qual[i]+33 loop to turn binary quals into printable ASCII is not vectorised by GCC without using -O3. I would consider this a weakness of the compiler, but nothing I've done has persuaded gcc (before v12) to generate vector instructions. Not even the "restrict" keyword. Hence using __attribute__((optimize("O3"))). The time the new add33 function is approx 15x quicker with gcc -O3 than gcc -O2. Clang's and icc's default optimisation level gives speeds comparable to the gcc -O3. With a compressed Illumina BAM this was just 3% overall speed gain to decode to SAM. The extreme opposite is uncompressed ONT BAM which shows a 23% speed gain.

jkbonfield · 2023-09-27T15:15:49Z

Note my original implementation of this had a prototype of

static inline void HTS_OPT3 add33(uint8_t *restrict a, const uint8_t *restrict b, int32_t len)

However our Rocky Linux CI test explicitly turns off C99 support via --std=gnu90, hence restrict is not permitted. As far as I'm aware we target C99 (it is 20+ year old after all). What's the reason behind explicitly forbidding C99 extensions on Rocky Linux? I know it's an old compiler, but sure not over 20 year old!

That said, it makes no difference here. The use of restrict was a failed attempt to get gcc to behave, but it was resolute in wanting to avoid all forms of vectorisation without explicitly enabling them via -ftree-loop-vectorize (or -O3 which adds that).

rhpvorderman · 2023-09-28T05:08:07Z

Wow, this is very instructive. Goodbye #ifdef __SSE2__! Thanks for showing me this!

daviesrob · 2023-09-28T09:35:52Z

The -std=gnu90 test was added in #1524 to catch cases of for (int n = 0, ...), which is not supported by gnu90. This (or gnu89) was the default on the GCC-4 series, which was shipped in Red Hat Enterprise Linux versions 5 to 7. RHEL7 is still (just) in production, and will not be completely retired until June 2026. It's also possible that some BSDs ship ancient GCCs, although they seem to be migrating to clang.

It's possible we could allow more c99-only features with suitable configure tests to set the -std=c99 option, but we'd have to check that the code still worked on RedHat, at least.

We could also add a configure test to set the optimisation to -O3, instead of just doing it for one very small piece of code...

jkbonfield · 2023-09-28T12:47:22Z

I don't understand though why we're explicitly attempting to forbid the very useful "for (int i = ..." notation. It's standard in C99.

I get that the earlier RedHat's don't default to supporting this, but they do still support it: https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Standards.html#Standards

All it requires is that people build with CC="gcc --std=gnu99" and it'd be accepted. I don't think this is an onerous problem, and it could be part of autoconf anyway. I'm happy with -O3 being the default and I'm sure I've suggested this in the past (but at the time wasn't able to be sufficiently convincing apparently).

I will say however that I think gcc are in error here. O3 is explicitly saying we favour speed over everything else, and aggressive unrolling of loops etc that significantly increases code size is worth it even if it's only a small speed gain. Basically it's the "turn it up to 11" level of optimisation. The vectorisation of this trivial loop is in a totally different class. It's an order of magnitude faster! It's not some minor speed gain vs big code size tradeoff unless you make the assumption that it's only executed with a very low number of cycles (and gcc has no way to hint at that, unlike several other compilers). Everyone else seems in agreement that vectorisation is a good thing to do even at earlier optimisation levels (such as the O2 offered by default from autoconf).

jkbonfield mentioned this pull request Sep 27, 2023

Faster nibble parsing and a SSSE3 routine for even faster nibble parsing #1677

Closed

daviesrob self-assigned this Oct 3, 2023

daviesrob merged commit de6a982 into samtools:develop Oct 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable optimisation level -O3 for SAM QUAL+33 formatting. #1679

Enable optimisation level -O3 for SAM QUAL+33 formatting. #1679

jkbonfield commented Sep 27, 2023

jkbonfield commented Sep 27, 2023

rhpvorderman commented Sep 28, 2023

daviesrob commented Sep 28, 2023

jkbonfield commented Sep 28, 2023

Enable optimisation level -O3 for SAM QUAL+33 formatting. #1679

Enable optimisation level -O3 for SAM QUAL+33 formatting. #1679

Conversation

jkbonfield commented Sep 27, 2023

jkbonfield commented Sep 27, 2023

rhpvorderman commented Sep 28, 2023

daviesrob commented Sep 28, 2023

jkbonfield commented Sep 28, 2023