Enable optimisation level -O3 for SAM QUAL+33 formatting. #1679
On long read data, the time to format SAM files is dominated by sequence and quality.
The qual[i]+33 loop to turn binary quals into printable ASCII is not vectorised by GCC without using -O3. I would consider this a weakness of the compiler, but nothing I've done has persuaded gcc (before v12) to generate vector instructions, not even the "restrict" keyword.
Hence the use of __attribute__((optimize("O3"))).
The new add33 function is approx 15x quicker with gcc -O3 than with gcc -O2. Clang's and icc's default optimisation levels give speeds comparable to gcc -O3.
With a compressed Illumina BAM this gave just a 3% overall speed gain when decoding to SAM. At the opposite extreme, an uncompressed ONT BAM shows a 23% speed gain.
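For illustration, the approach described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the names `SKETCH_OPT_O3`, `add33_sketch` and `add33_selfcheck` are hypothetical, and the guard macro assumes that only GCC itself should receive the `optimize` attribute, since the description says Clang and icc already reach comparable speed at their default level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical macro: ask GCC to compile one function at -O3.
 * Other compilers get an empty definition (assumption: they do not
 * need, or do not accept, the optimize attribute). */
#if defined(__GNUC__) && !defined(__clang__) && !defined(__INTEL_COMPILER)
#  define SKETCH_OPT_O3 __attribute__((optimize("O3")))
#else
#  define SKETCH_OPT_O3
#endif

/* Convert binary quality values to printable ASCII (QUAL+33).
 * A plain byte loop like this is what the vectoriser is expected
 * to turn into SIMD adds once loop vectorisation is enabled. */
SKETCH_OPT_O3
static void add33_sketch(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i; /* declared outside the loop so it also builds as C89 */
    for (i = 0; i < len; i++)
        dst[i] = (uint8_t)(src[i] + 33);
}

/* Small self-check: qualities 0, 7, 20, 30, 40 map to "!(5?I". */
static int add33_selfcheck(void)
{
    const uint8_t qual[5] = {0, 7, 20, 30, 40};
    uint8_t out[6];

    add33_sketch(out, qual, 5);
    out[5] = '\0';
    assert(strcmp((const char *)out, "!(5?I") == 0);
    return 1;
}
```

Whether the loop is actually vectorised depends on the compiler version and target, so it is worth inspecting the generated assembly or timing the function rather than assuming the attribute had an effect.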
Note my original implementation of this had a prototype of
However our Rocky Linux CI test explicitly turns off C99 support via That said, it makes no difference here. The use of restrict was a failed attempt to get gcc to behave, but it was resolute in wanting to avoid all forms of vectorisation without explicitly enabling them via -ftree-loop-vectorize (or -O3 which adds that).
Wow, this is very instructive. Goodbye |
The It's possible we could allow more We could also add a |
I don't understand, though, why we're explicitly attempting to forbid the very useful "for (int i = ..." notation. It's standard in C99. I get that earlier Red Hat releases don't default to supporting this, but they do still support it: https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Standards.html#Standards All it requires is that people build with
I will say, however, that I think gcc are in error here. O3 explicitly says we favour speed over everything else, and that aggressive loop unrolling and similar transformations that significantly increase code size are worth it even for only a small speed gain. Basically it's the "turn it up to 11" level of optimisation. The vectorisation of this trivial loop is in a totally different class. It's an order of magnitude faster! It's not some minor speed gain vs big code size tradeoff unless you make the assumption that it's only executed with a very low number of cycles (and gcc has no way to hint at that, unlike several other compilers). Everyone else seems to agree that vectorisation is a good thing to do even at lower optimisation levels (such as the O2 offered by default from autoconf).