AVX mem broadcasts are cached on the stack #120015

Open
KyleSiefring opened this issue Dec 15, 2024 · 2 comments

@KyleSiefring

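After the vector registers are exhausted inside a loop, clang spills the result of a broadcast to the stack. This is inefficient, since a broadcast from memory is as fast as a plain load, so the value could be re-broadcast from memory instead of being reloaded from a stack slot.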

Consider the following pseudo code:

float *restrict arr = ...; // restrict prevents aliasing
loop {
     exhaust vector registers
     __m256 x = _mm256_set1_ps(arr[0]); // loop-invariant broadcast from memory
     use x
}

When clang compiles this, the broadcast of arr[0] is hoisted out of the loop, and x is then spilled to the stack.

        vbroadcastss    ymm0, dword ptr [rdx]
        vmovups ymmword ptr [rsp - 72], ymm0
loop:
        ...
        load x from stack
        use x
        jmp loop

The expected behavior is:

loop:
        ...
        vbroadcastss    x, dword ptr [rdx]
        use x
        jmp loop

Obligatory Godbolt Sample: https://godbolt.org/z/v7MYcefxY (Sorry if my method of stressing register allocation results in too much asm/bytecode.)
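
For readers who want something smaller than the Godbolt sample, the pseudo code above can be sketched as concrete C. This is a hypothetical illustration, not the code behind the Godbolt link: the names `weighted_sum` and `NUM_ACC` are made up, and it assumes an x86-64 target where AVX provides 16 ymm registers. Keeping 16 loop-carried accumulators live, plus the broadcast value and a temporary, is meant to push register pressure past that limit. Whether a given clang version actually hoists and spills the broadcast depends on the compiler version and flags, so check the generated assembly.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical: one accumulator per ymm register (16 on x86-64 with AVX). */
#define NUM_ACC 16

/* Sums in[k] * arr[0] over iters * NUM_ACC * 8 floats.
 * The target attribute lets this compile even without -mavx. */
__attribute__((target("avx")))
float weighted_sum(const float *restrict in, const float *restrict arr,
                   size_t iters)
{
    __m256 acc[NUM_ACC];
    for (int i = 0; i < NUM_ACC; i++)
        acc[i] = _mm256_setzero_ps();

    for (size_t it = 0; it < iters; it++) {
        const float *block = in + it * NUM_ACC * 8;

        /* Loop-invariant broadcast from memory, as in the pseudo code. */
        __m256 x = _mm256_set1_ps(arr[0]);

        /* All accumulators are loop-carried, so they compete with x
         * (and the temporary v) for the available ymm registers. */
        for (int i = 0; i < NUM_ACC; i++) {
            __m256 v = _mm256_loadu_ps(block + i * 8);
            acc[i] = _mm256_add_ps(acc[i], _mm256_mul_ps(v, x));
        }
    }

    /* Reduce the accumulators to a single scalar. */
    __m256 total = acc[0];
    for (int i = 1; i < NUM_ACC; i++)
        total = _mm256_add_ps(total, acc[i]);

    float lanes[8];
    _mm256_storeu_ps(lanes, total);

    float sum = 0.0f;
    for (int i = 0; i < 8; i++)
        sum += lanes[i];
    return sum;
}
```

Compiling this with optimizations enabled and inspecting the loop body shows whether `x` is reloaded from a stack slot or re-broadcast from `arr`.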

@llvmbot
Member

llvmbot commented Dec 16, 2024

@llvm/issue-subscribers-backend-x86

Author: Kyle Siefring (KyleSiefring)


@RKSimon RKSimon self-assigned this Dec 16, 2024
@RKSimon
Collaborator

RKSimon commented Dec 16, 2024

This has been on the backlog for a long time now. For constants, at least, we made progress by adding X86FixupVectorConstantsPass, and I started work on removing constant pool broadcasts from the DAG entirely with #73509, but addressing all the regressions for AVX512VL is a slog, and handling the regressions for basic AVX was even worse (plus we need to handle the optsize constant cases).

For non-constant loads, it might be that we can add a tweak to MachineLICM for loop hoisting cases as a starting point.
