[x86-64] Bad codegen for certain SIMD intrinsics #159670

@dcommander

I am in the process of converting the x86-64 SIMD modules in libjpeg-turbo to compiler intrinsics (libjpeg-turbo/libjpeg-turbo#732). However, I am encountering a problem whereby Clang/LLVM tries to outsmart me when it translates certain intrinsics into assembly code, and the resulting assembly code is often not smart at all. Here is a good example:

https://godbolt.org/z/7YGrvGbnc

Clang translates the two intrinsics into seven assembly instructions, whereas GCC correctly translates them into the two assembly instructions that correspond to the intrinsics. The result is that, when compiled with GCC, the intrinsics version of our color conversion algorithm performs as well as the NASM version, but when compiled with Clang, the intrinsics version regresses by 20-30%.

With AVX2, Clang translates the equivalent two intrinsics into two assembly instructions, but they are slower instructions than the instructions that correspond to the intrinsics:

https://godbolt.org/z/nzx7f16e7

If someone goes to the trouble of writing intrinsics that have a documented 1:1 correspondence with assembly instructions, it's because they are trying to talk to the hardware more directly. The compiler really shouldn't second-guess them in that case.

Is there a way to disable this behavior?

I have tried all of the -O options, to no avail. I did observe that changing -msse2 to -mssse3 (or targeting any later SIMD instruction set, such as AVX2) causes Clang to compile _mm_slli_si128() and _mm_unpackhi_epi8() into vpshufd and vpunpcklbw rather than pslldq and punpckhbw. That behavior is inscrutable, though, since vpshufd and vpunpcklbw are just the VEX-encoded forms of pshufd and punpcklbw, both of which are SSE2 instructions.
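For readers who cannot open the Compiler Explorer links, here is a minimal sketch of the kind of two-intrinsic sequence being discussed. It is not the code from the links: the function names, the shift count of 1, and the operand order are all assumptions chosen for illustration. With a 1:1 translation, one would expect the first intrinsic to become pslldq and the second punpckhbw.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical reproducer. The shift count (1) and the operand order are
 * assumptions, not taken from the linked examples. */
static __m128i shift_and_unpack(__m128i a, __m128i b)
{
  /* Byte-wise left shift of the whole 128-bit register (pslldq). */
  __m128i shifted = _mm_slli_si128(a, 1);
  /* Interleave the upper 8 bytes of each operand (punpckhbw). */
  return _mm_unpackhi_epi8(shifted, b);
}

/* Hypothetical helper that runs the sequence on plain byte arrays so the
 * result can be inspected without SIMD types. */
static void shift_and_unpack_bytes(const uint8_t a[16], const uint8_t b[16],
                                   uint8_t out[16])
{
  __m128i va = _mm_loadu_si128((const __m128i *)a);
  __m128i vb = _mm_loadu_si128((const __m128i *)b);
  _mm_storeu_si128((__m128i *)out, shift_and_unpack(va, vb));
}
```

Compiling a function like shift_and_unpack() with -O2 -msse2 under both compilers and comparing the disassembly should show whether the intrinsics were kept as written or re-lowered into a different shuffle sequence.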
