After exhausting vector registers inside a loop, clang stores the result of a broadcast on the stack and reloads it on every iteration. This is inefficient, since broadcasting from memory is as fast as loading, so the spill slot buys nothing over redoing the broadcast.
Consider the following pseudocode:
float *restrict arr = ...; // prevent aliasing
loop {
    exhaust vector registers
    __m256 x = _mm256_set1_ps(arr[0]);
    use x
}
When clang compiles this, arr[0] is broadcast once outside the loop, and x is then spilled to the stack:
vbroadcastss ymm0, dword ptr [rdx]
vmovups ymmword ptr [rsp - 72], ymm0
loop:
...
load x from stack
use x
jmp loop
The expected behavior is:
loop:
...
vbroadcastss x, dword ptr [rdx]
use x
jmp loop
Obligatory Godbolt Sample: https://godbolt.org/z/v7MYcefxY (Sorry if my method of stressing register allocation results in too much asm/bytecode.)
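For readers who want something compilable without opening the Godbolt link, here is a minimal C sketch of the loop shape from the pseudocode. The function name scale_by_first and its parameters are hypothetical and not taken from the linked sample. This version deliberately omits the "exhaust vector registers" step, so on its own it is not expected to trigger the spill; it only shows the loop-invariant broadcast that gets hoisted. The target attribute is an assumption made so the snippet builds without -mavx.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper (not from the issue): multiplies n floats by arr[0].
 * `restrict` promises that stores to `out` never alias `arr`, which makes
 * arr[0] loop-invariant and lets the compiler hoist the broadcast. */
__attribute__((target("avx")))
void scale_by_first(const float *restrict arr, const float *restrict in,
                    float *restrict out, size_t n)
{
    assert(n % 8 == 0); /* sketch only handles whole 8-float vectors */
    for (size_t i = 0; i < n; i += 8) {
        /* Loop-invariant broadcast; in the real reproducer, many other
         * live vectors surround this and force x out of a register. */
        __m256 x = _mm256_set1_ps(arr[0]);
        __m256 v = _mm256_loadu_ps(in + i);
        _mm256_storeu_ps(out + i, _mm256_mul_ps(v, x));
    }
}
```

To see the behavior described above, the loop body would need enough additional live __m256 values to exceed the 16 ymm registers, as the Godbolt sample does.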
This has been on the backlog for a long time now. For constants, at least, we made progress by adding X86FixupVectorConstantsPass, and I started work on removing constant pool broadcasts from the DAG entirely with #73509. However, addressing all the regressions for AVX512VL is a slog, and handling the regressions for basic AVX was even worse (plus we need to handle the optsize constant cases).
For non-constant loads, we might be able to add a tweak to MachineLICM for the loop-hoisting cases as a starting point.