freeze heavily pessimizes SIMD code. #42316
Comments
Is this a Julia bug or an LLVM bug?
I'm not sure, but would guess LLVM, as Julia is producing scalar code, which LLVM then vectorizes -- except for the "noop" But it could also be a pass ordering/missing pass type problem. I'm too ignorant of LLVM to really say much here. If folks want, I could file an issue on the LLVM bugzilla.
We insert the freeze instructions. But yes, maybe we need to file an LLVM bug about the impact of freeze on the autovectorizer.
Does Clang simply not insert freeze, or not care about the possibility of poison?
@chriselrod the reason to insert
@oscardssmith On the one hand, I'll point out that C gets away with it. On the other hand, there are plenty of complaints about undefined behavior in C.
The problem seems to have been fixed with LLVM 13.
%.fr = freeze <4 x i32> %18
We can confirm that the above LLVM produces good code by using
This is all vectorized (note that the
So running just the default
I got the original "unoptimized" IR by starting Julia with
Switching to LLVM 12 shows the same codegen we currently get in Julia.
I think
Ah, that makes sense. The former allows optimizing away the program, the latter does not.
This is autovectorized code.
The pessimization is because the SIMD vectors are scalarized (via a series of extractelements), the scalars frozen, and then reassembled (via insertelement). LLVM is not able to clean up/remove this round trip, resulting in a heavy performance penalty vs earlier versions that did not require freeze.
With LLVM12:
Julia 1.5:
With LLVM12:
Benchmarks:
The assembly matches the LLVM: the vectors are decomposed and reassembled, without actually doing anything to the scalars.
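To make the round trip concrete, here is a hand-written sketch of the pattern described above. It is not the IR from this issue: the function names `@roundtrip` and `@direct` and the `<4 x i32>` width are assumptions chosen for illustration. The first function shows the scalarize/freeze/reassemble shape; the second shows the single vector `freeze` that the LLVM 13 output mentioned in the comments appears to use instead.

```llvm
; Illustrative only: hypothetical functions, not compiler output from the issue.

; The pessimized shape: each lane is extracted, frozen as a scalar,
; and inserted back into a fresh vector.
define <4 x i32> @roundtrip(<4 x i32> %v) {
  %e0 = extractelement <4 x i32> %v, i64 0
  %e1 = extractelement <4 x i32> %v, i64 1
  %e2 = extractelement <4 x i32> %v, i64 2
  %e3 = extractelement <4 x i32> %v, i64 3
  %f0 = freeze i32 %e0
  %f1 = freeze i32 %e1
  %f2 = freeze i32 %e2
  %f3 = freeze i32 %e3
  %r0 = insertelement <4 x i32> poison, i32 %f0, i64 0
  %r1 = insertelement <4 x i32> %r0, i32 %f1, i64 1
  %r2 = insertelement <4 x i32> %r1, i32 %f2, i64 2
  %r3 = insertelement <4 x i32> %r2, i32 %f3, i64 3
  ret <4 x i32> %r3
}

; The compact shape: freeze applied to the whole vector at once.
; freeze on a vector acts on each element independently, so this
; yields the same result without leaving the vector domain.
define <4 x i32> @direct(<4 x i32> %v) {
  %v.fr = freeze <4 x i32> %v
  ret <4 x i32> %v.fr
}
```

Since neither version changes any lane that is already well defined, the first form costs extra shuffles and moves for no semantic gain whenever the optimizer fails to fold it into the second.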