Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable optimisation level -O3 for SAM QUAL+33 formatting. #1679

Merged
merged 1 commit into from
Oct 12, 2023

Conversation

jkbonfield
Copy link
Contributor

On long read data, the time to format SAM files is dominated by sequence and quality.

The qual[i]+33 loop to turn binary quals into printable ASCII is not vectorised by GCC without using -O3. I would consider this a weakness of the compiler, but nothing I've done has persuaded gcc (before v12) to generate vector instructions. Not even the "restrict" keyword.

Hence using attribute((optimize("O3"))).

The time the new add33 function is approx 15x quicker with gcc -O3 than gcc -O2. Clang's and icc's default optimisation level gives speeds comparable to the gcc -O3.

With a compressed Illumina BAM this was just 3% overall speed gain to decode to SAM. The extreme opposite is uncompressed ONT BAM which shows a 23% speed gain.

On long read data, the time to format SAM files is dominated by
sequence and quality.

The qual[i]+33 loop to turn binary quals into printable ASCII is not
vectorised by GCC without using -O3.  I would consider this a weakness
of the compiler, but nothing I've done has persuaded gcc (before v12)
to generate vector instructions.  Not even the "restrict" keyword.

Hence using __attribute__((optimize("O3"))).

The time the new add33 function is approx 15x quicker with gcc -O3
than gcc -O2.  Clang's and icc's default optimisation level gives
speeds comparable to the gcc -O3.

With a compressed Illumina BAM this was just 3% overall speed gain to
decode to SAM.  The extreme opposite is uncompressed ONT BAM which
shows a 23% speed gain.
@jkbonfield
Copy link
Contributor Author

Note my original implementation of this had a prototype of

static inline void HTS_OPT3 add33(uint8_t *restrict a, const uint8_t *restrict b, int32_t len)        

However our Rocky Linux CI test explicitly turns off C99 support via --std=gnu90, hence restrict is not permitted. As far as I'm aware we target C99 (it is 20+ year old after all). What's the reason behind explicitly forbidding C99 extensions on Rocky Linux? I know it's an old compiler, but sure not over 20 year old!

That said, it makes no difference here. The use of restrict was a failed attempt to get gcc to behave, but it was resolute in wanting to avoid all forms of vectorisation without explicitly enabling them via -ftree-loop-vectorize (or -O3 which adds that).

@rhpvorderman
Copy link
Contributor

Wow, this is very instructive. Goodbye #ifdef __SSE2__! Thanks for showing me this!

@daviesrob
Copy link
Member

The -std=gnu90 test was added in #1524 to catch cases of for (int n = 0, ...), which is not supported by gnu90. This (or gnu89) was the default on the GCC-4 series, which was shipped in Red Hat Enterprise Linux versions 5 to 7. RHEL7 is still (just) in production, and will not be completely retired until June 2026. It's also possible that some BSDs ship ancient GCCs, although they seem to be migrating to clang.

It's possible we could allow more c99-only features with suitable configure tests to set the -std=c99 option, but we'd have to check that the code still worked on RedHat, at least.

We could also add a configure test to set the optimisation to -O3, instead of just doing it for one very small piece of code...

@jkbonfield
Copy link
Contributor Author

I don't understand though why we're explicitly attempting to forbid the very useful "for (int i = ..." notation. It's standard in C99.

I get that the earlier RedHat's don't default to supporting this, but they do still support it: https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Standards.html#Standards

All it requires is that people build with CC="gcc --std=gnu99" and it'd be accepted. I don't think this is an onerous problem, and it could be part of autoconf anyway. I'm happy with -O3 being the default and I'm sure I've suggested this in the past (but at the time wasn't able to be sufficiently convincing apparently).

I will say however that I think gcc are in error here. O3 is explicitly saying we favour speed over everything else, and aggressive unrolling of loops etc that significantly increases code size is worth it even if it's only a small speed gain. Basically it's the "turn it up to 11" level of optimisation. The vectorisation of this trivial loop is in a totally different class. It's an order of magnitude faster! It's not some minor speed gain vs big code size tradeoff unless you make the assumption that it's only executed with a very low number of cycles (and gcc has no way to hint at that, unlike several other compilers). Everyone else seems in agreement that vectorisation is a good thing to do even at earlier optimisation levels (such as the O2 offered by default from autoconf).

@daviesrob daviesrob self-assigned this Oct 3, 2023
@daviesrob daviesrob merged commit de6a982 into samtools:develop Oct 12, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants