
_mm256_loadu_si256 only loads 128 bits when compiled with default cargo build --release #52636

Closed

djsweet opened this issue Jul 23, 2018 · 9 comments · Fixed by #55073


djsweet commented Jul 23, 2018

The AVX2 intrinsic _mm256_loadu_si256 fully loads all 256 bits from memory into the register when compiled without any optimization, but only loads 128 bits when compiled with the default cargo build --release option. This small program exhibits the issue:

use std::arch::x86_64;

fn main() {
    if is_x86_feature_detected!("avx2") {
        let load_bytes: [u8; 32] = [0x0f; 32];
        let lb_ptr = load_bytes.as_ptr();
        let reg_load = unsafe {
            x86_64::_mm256_loadu_si256(
                lb_ptr as *const x86_64::__m256i
            )
        };
        println!("{:?}", reg_load);
        let mut store_bytes: [u8; 32] = [0; 32];
        let sb_ptr = store_bytes.as_mut_ptr();
        unsafe {
            x86_64::_mm256_storeu_si256(sb_ptr as *mut x86_64::__m256i, reg_load);
        }
        assert_eq!(load_bytes, store_bytes);
    } else {
        println!("AVX2 is not supported on this machine/build.");
    }
}

When I run cargo run, this is the output:

   Compiling...
    Finished dev [unoptimized + debuginfo] target(s) in 0.33s
     Running `target/debug/avx2_bug_hunt`
__m256i(1085102592571150095, 1085102592571150095, 1085102592571150095, 1085102592571150095)

However, with cargo run --release, this is the output:

   Compiling...
    Finished release [optimized] target(s) in 0.26s
     Running `target/release/avx2_bug_hunt`
__m256i(1085102592571150095, 1085102592571150095, 0, 0)
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `[15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]`,
 right: `[15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]`', src/main.rs:18:9
note: Run with `RUST_BACKTRACE=1` for a backtrace.

I am on macOS 10.13.6 with a Core i7-4960HQ, and the output of rustc --version --verbose is

rustc 1.27.2
binary: rustc
commit-hash: unknown
commit-date: unknown
host: x86_64-apple-darwin
release: 1.27.2
LLVM version: 6.0

Curiously, when inspecting the assembly of main, the call to _mm256_loadu_si256 is not inlined, but instead generates this function:

avx2_bug_hunt`core::coresimd::x86::avx::_mm256_loadu_si256::hd7fc98ebefdce593:
avx2_bug_hunt[0x1000018a0] <+0>:  pushq  %rbp
avx2_bug_hunt[0x1000018a1] <+1>:  movq   %rsp, %rbp
avx2_bug_hunt[0x1000018a4] <+4>:  vmovaps %ymm0, (%rdi)
avx2_bug_hunt[0x1000018a8] <+8>:  popq   %rbp
avx2_bug_hunt[0x1000018a9] <+9>:  vzeroupper 
avx2_bug_hunt[0x1000018ac] <+12>: retq   
avx2_bug_hunt[0x1000018ad] <+13>: nopl   (%rax)

Note the vzeroupper instruction, which zeroes the upper 128 bits of the YMM registers. This is incorrect behavior: __m256i requires the full YMM register to be loaded unmodified. A similar spurious vzeroupper is also present in the assembly generated for _mm256_storeu_si256, but after the register is stored into memory.
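As an aside, a minimal sketch of the workaround that sidesteps this class of bug (the helper name copy_32_bytes is illustrative, not from the issue): hoist all 256-bit intrinsic work into a #[target_feature(enable = "avx")] function, so the __m256i value is created and consumed inside code that agrees on the vector ABI and never crosses a call boundary between functions with different feature sets.

```rust
#[cfg(target_arch = "x86_64")]
mod avx_copy {
    use std::arch::x86_64::*;

    // All 256-bit work happens inside this AVX-enabled function, so the
    // __m256i value never crosses a call boundary between functions that
    // disagree about target features.
    #[target_feature(enable = "avx")]
    pub unsafe fn copy_32_bytes(src: &[u8; 32], dst: &mut [u8; 32]) {
        let v = _mm256_loadu_si256(src.as_ptr() as *const __m256i);
        _mm256_storeu_si256(dst.as_mut_ptr() as *mut __m256i, v);
    }
}

#[cfg(target_arch = "x86_64")]
fn run() {
    if is_x86_feature_detected!("avx") {
        let load_bytes = [0x0fu8; 32];
        let mut store_bytes = [0u8; 32];
        // Safe: AVX support was verified at runtime just above.
        unsafe { avx_copy::copy_32_bytes(&load_bytes, &mut store_bytes) };
        assert_eq!(load_bytes, store_bytes);
        println!("round trip ok");
    } else {
        println!("AVX not available");
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn run() {}

fn main() {
    run();
}
```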

hellow554 commented Jul 23, 2018

Behavior on Windows is kind of strange: debug runs fine, but release errors out.

# rustc --version
rustc 1.27.2 (58cc626de 2018-07-18)
# cargo run --release
    Finished release [optimized] target(s) in 0.00s
     Running `target\release\foo.exe`
error: process didn't exit successfully: `target\release\foo.exe` (exit code: 3221225477)

I am not a Windows developer, so I have no clue why this happens :D Maybe some register clobbering.

I cannot reproduce this on the playground, though :/

It also breaks on my Linux machine 😮

hellow554 commented Jul 23, 2018

With rustc this only happens with opt-level=3, not 2.

Level 2:

_mm256_loadu_si256:
    vmovups (%rsi), %ymm0
    vmovaps %ymm0,(%rdi)
    mov    %rdi,%rax
    vzeroupper
    retq

_mm256_storeu_si256:
    vmovaps (%rsi), %ymm0
    vmovups %ymm0, (%rdi)
    vzeroupper
    retq

Level 3:

_mm256_loadu_si256:
    vmovups (%rsi), %ymm0
    vmovaps %ymm0, (%rdi)
    vzeroupper
    retq

_mm256_storeu_si256:
    vmovups %ymm0, (%rdi)
    vzeroupper
    retq

hanna-kruppe commented Jul 23, 2018

Note the vzeroupper instruction, which zeroes the upper 128 bits of the YMM registers. This is incorrect behavior: __m256i requires the full YMM register to be loaded unmodified. A similar spurious vzeroupper is also present in the assembly generated for _mm256_storeu_si256, but after the register is stored into memory.

I don't think this analysis is correct: the loadu function doesn't return the loaded value in a register (to avoid ABI mismatches) but rather returns it in memory, writing the contents of ymm0 to an address that's passed in (in this case, the address of reg_load). Indeed, the assembly never even mentions ymm registers or AVX instructions outside of loadu and storeu; in main, the 256-bit vectors are handled as a pair of xmm registers. It can't be using them, since the avx2 target feature is not enabled on the function. Incidentally, this target feature mismatch is also why loadu and storeu are not being inlined.

More importantly, this combined with #50154 is why the store doesn't work: storeu expects the value to be stored in a ymm register, but the caller doesn't have that register, so it passes the two halves of the vector in xmm0 and xmm1 instead. Normally rustc passes all vector arguments and return values through memory to avoid this problem, but as described in #50154 LLVM sometimes undoes this with disastrous consequences.

tl;dr duplicate of #50154
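To make the mismatch concrete, here is an illustrative sketch (the helper name splat_avx is made up, not from the thread): a __m256i that is created and consumed entirely inside one AVX-enabled function only ever crosses the call boundary behind a pointer, which travels in a general-purpose register, so the xmm0/xmm1 split described above cannot occur.

```rust
#[cfg(target_arch = "x86_64")]
mod sketch {
    use std::arch::x86_64::*;

    // The call boundary only carries `byte` and a pointer into `dst`; the
    // 256-bit vector value is created and stored entirely inside this
    // AVX-enabled function, so no ymm-vs-xmm ABI disagreement can arise.
    #[target_feature(enable = "avx")]
    pub unsafe fn splat_avx(byte: i8, dst: &mut [u8; 32]) {
        let v = _mm256_set1_epi8(byte);
        _mm256_storeu_si256(dst.as_mut_ptr() as *mut __m256i, v);
    }
}

#[cfg(target_arch = "x86_64")]
fn run() {
    if is_x86_feature_detected!("avx") {
        let mut out = [0u8; 32];
        // Safe: AVX support was verified at runtime just above.
        unsafe { sketch::splat_avx(0x0f, &mut out) };
        assert!(out.iter().all(|&b| b == 0x0f));
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn run() {}

fn main() {
    run();
}
```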

djsweet commented Jul 23, 2018

@rkruppe You're right that this is related to #50154, but based on your comments here, we might have established the true cause of these AVX bugs.

My use of is_x86_feature_detected!("avx2") is a bit of a red herring. I discovered this bug while attempting to use AVX2-specific intrinsics, but _mm256_loadu_si256 is a part of AVX as per the Intel Intrinsics Guide, and is only enabled in stdsimd if the avx feature flag is enabled.

the loadu function doesn't return the loaded value in a register (to avoid ABI mismatches) but rather returns it in memory (writing the contents of ymm0 to an address that's passed in, in this case the address of reg_load).

You appear to have _mm256_loadu_si256 and _mm256_storeu_si256 conflated. The only argument passed explicitly to _mm256_loadu_si256, reg_load, is used to load the register, even as per the stdsimd source. If the intent is to perform a memory-to-memory copy to avoid ABI mismatch, this has failed here, too. The assembly listing for _mm256_loadu_si256 takes the input argument at %rdi and loads 256 bits from it into %ymm0. This aligns with the Rust-level definition of the function, and unless the __m256 structs have special compiler support, would be what I would expect from reading that definition.

I find the idea that registers should be "demoted" to memory regions to fit an ABI suspect. If the target ABI does not allow for the use of said registers, it would be more ergonomic for the compiler to fail early and often, notifying the user that their target ABI does not support the registers they are attempting to use.

hanna-kruppe commented Jul 23, 2018

My use of is_x86_feature_detected!("avx2") is a bit of a red herring. I discovered this bug while attempting to use AVX2-specific intrinsics, but _mm256_loadu_si256 is a part of AVX as per the Intel Intrinsics Guide, and is only enabled in stdsimd if the avx feature flag is enabled.

I don't see how that's relevant to this issue.

You appear to have _mm256_loadu_si256 and _mm256_storeu_si256 conflated. The only argument passed explicitly to _mm256_loadu_si256, reg_load, is used to load the register, even as per the stdsimd source.

I am talking about the LLVM IR and assembly generated, not about the source code. However, it appears we're talking about different machine code: on the playground and in @hellow554's experiments, the body of loadu is essentially:

	vmovups	(%rsi), %ymm0
	vmovaps	%ymm0, (%rdi)

while your original post quotes a disassembly that contains only vmovaps %ymm0, (%rdi) ... huh, that's a store? That can't be right. Please double-check.

In any case, #50154 would also explain why storeu is miscompiled.

unless the __m256 structs have special compiler support

They have, not just in the sense required for allocating them to ymm registers and using AVX instructions on them, but yes also in the sense you mean here: they are passed in memory rather than as immediates.

I find the idea that registers should be "demoted" to memory regions to fit an ABI suspect. If the target ABI does not allow for the use of said registers, it would be more ergonomic for the compiler to fail early and often, notifying the user that their target ABI does not support the registers they are attempting to use.

Unfortunately, the ABI mismatch problems are real and difficult to solve, and we can't very well ignore them, so this is the only feasible approach for the time being.

djsweet commented Jul 23, 2018

@rkruppe The goofy AT&T ASM syntax always trips me up. In Intel-flavor mnemonics,

avx2_bug_hunt`core::coresimd::x86::avx::_mm256_load_si256::h86383795cdef4461:
avx2_bug_hunt[0x100000cc0] <+0>:  push   rbp
avx2_bug_hunt[0x100000cc1] <+1>:  mov    rbp, rsp
avx2_bug_hunt[0x100000cc4] <+4>:  vmovaps ymmword ptr [rdi], ymm0
avx2_bug_hunt[0x100000cc8] <+8>:  pop    rbp
avx2_bug_hunt[0x100000cc9] <+9>:  vzeroupper 
avx2_bug_hunt[0x100000ccc] <+12>: ret    
avx2_bug_hunt[0x100000ccd] <+13>: nop    dword ptr [rax]

avx2_bug_hunt`core::coresimd::x86::avx::_mm256_store_si256::h1b513ee5f14081f3:
avx2_bug_hunt[0x100000cd0] <+0>:  push   rbp
avx2_bug_hunt[0x100000cd1] <+1>:  mov    rbp, rsp
avx2_bug_hunt[0x100000cd4] <+4>:  vmovaps ymmword ptr [rdi], ymm0
avx2_bug_hunt[0x100000cd8] <+8>:  pop    rbp
avx2_bug_hunt[0x100000cd9] <+9>:  vzeroupper 
avx2_bug_hunt[0x100000cdc] <+12>: ret    
avx2_bug_hunt[0x100000cdd] <+13>: nop    dword ptr [rax]

For some reason, load and store are compiling identically on optimization level 3.

EDIT: It's worth noting that in order to ensure it was the dereference causing issues and not explicitly written memory copying, I rewrote the test program earlier using only _mm256_load_si256 and _mm256_store_si256, with similar issues unless explicitly compiled with the AVX ABI.

The updated program:

use std::arch::x86_64;

#[repr(align(32))]
struct BytePair {
    load_bytes: [u8; 32],
    store_bytes: [u8; 32]
}

fn main() {
    let mut byte_pair = BytePair{
        load_bytes: [0x0f; 32],
        store_bytes: [0; 32]
    };
    let lb_ptr = byte_pair.load_bytes.as_ptr();
    let reg_load = unsafe {
        x86_64::_mm256_load_si256(
            lb_ptr as *const x86_64::__m256i
        )
    };
    println!("{:?}", reg_load);
    let sb_ptr = byte_pair.store_bytes.as_mut_ptr();
    unsafe {
        x86_64::_mm256_store_si256(sb_ptr as *mut x86_64::__m256i, reg_load);
    }
    assert_eq!(&byte_pair.load_bytes, &byte_pair.store_bytes);
}

EDIT 2: The disassembly of main up to the x86_64::_mm256_load_si256 call:

   100000ce0:   55                      push   rbp
   100000ce1:   48 89 e5                mov    rbp,rsp
   100000ce4:   41 56                   push   r14
   100000ce6:   53                      push   rbx
   100000ce7:   48 83 e4 e0             and    rsp,0xffffffffffffffe0
   100000ceb:   48 81 ec e0 00 00 00    sub    rsp,0xe0
   100000cf2:   0f 28 05 87 d3 03 00    movaps xmm0,XMMWORD PTR [rip+0x3d387]        # 10003e080 <__ZN53_$LT$$RF$$u27$a$u20$T$u20$as$u20$core..fmt..Debug$GT$3fmt17h743e723bc811d43eE+0x680>
   100000cf9:   0f 29 84 24 90 00 00    movaps XMMWORD PTR [rsp+0x90],xmm0
   100000d00:   00 
   100000d01:   0f 29 84 24 80 00 00    movaps XMMWORD PTR [rsp+0x80],xmm0
   100000d08:   00 
   100000d09:   4c 8d b4 24 a0 00 00    lea    r14,[rsp+0xa0]
   100000d10:   00 
   100000d11:   0f 57 c0                xorps  xmm0,xmm0
   100000d14:   0f 29 84 24 b0 00 00    movaps XMMWORD PTR [rsp+0xb0],xmm0
   100000d1b:   00 
   100000d1c:   0f 29 84 24 a0 00 00    movaps XMMWORD PTR [rsp+0xa0],xmm0
   100000d23:   00 
   100000d24:   0f 28 05 55 d3 03 00    movaps xmm0,XMMWORD PTR [rip+0x3d355]        # 10003e080 <__ZN53_$LT$$RF$$u27$a$u20$T$u20$as$u20$core..fmt..Debug$GT$3fmt17h743e723bc811d43eE+0x680>
   100000d2b:   48 8d 9c 24 c0 00 00    lea    rbx,[rsp+0xc0]
   100000d32:   00 
   100000d33:   48 89 df                mov    rdi,rbx
   100000d36:   0f 28 c8                movaps xmm1,xmm0
   100000d39:   e8 82 ff ff ff          call   100000cc0 <__ZN4core8coresimd3x863avx17_mm256_load_si25617h86383795cdef4461E>

And the contents of 0x10003e080:

(lldb) x 0x10003e080
0x10003e080: 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f  ................
0x10003e090: 00 0a 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
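djsweet mentions above that the failure goes away when the program is "explicitly compiled with the AVX ABI". A sketch of one common way to do that with rustc's codegen flags (the project setup is assumed; this is not a command from the thread):

```shell
# Compile the entire crate with the AVX target feature enabled, so every
# function (including main) agrees on the 256-bit vector ABI. Note the
# resulting binary will only run correctly on CPUs that support AVX.
RUSTFLAGS="-C target-feature=+avx" cargo run --release

# Alternatively, enable everything the build machine's CPU supports:
RUSTFLAGS="-C target-cpu=native" cargo run --release
```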

alexcrichton commented
I've confirmed that this is the same bug as #50154, which is the same as the upstream LLVM bug https://bugs.llvm.org/show_bug.cgi?id=37358, as @rkruppe mentioned.

djsweet commented Jul 28, 2018

@alexcrichton The LLVM bug report was opened in May 2018 with seemingly no progress since then. In the intervening period, the SIMD features were marked "stable" in general (not just at the API level) and shipped even with this bug present. Are there any plans to address the bug at the rustc level, or are consumers of rustc also having to wait for a bugfix from LLVM upstream at this point?

alexcrichton commented
@djsweet I don't personally know how we could fix this at the rustc level, but it may be good to ping the LLVM issue if you're interested in stirring up activity!

alexcrichton added a commit to alexcrichton/rust that referenced this issue Oct 14, 2018
The issue of passing around SIMD types as values between functions has
seen [quite a lot] of [discussion], and although we thought [we fixed
it][quite a lot] it [wasn't]! This PR is a change to rustc to, again,
try to fix this issue.

The fundamental problem here remains the same: if a SIMD vector argument
is passed by-value in LLVM's function type and the caller and callee
disagree on target features, a miscompile happens. We solve this
by never passing SIMD vectors by-value, but LLVM will still thwart us
with its argument promotion pass to promote by-ref SIMD arguments to
by-val SIMD arguments.

This commit is an attempt to thwart LLVM thwarting us. We, just before
codegen, will take yet another look at the LLVM module and demote any
by-value SIMD arguments we see. This is a very manual attempt by us to
ensure the codegen for a module keeps working, and it unfortunately is
likely producing suboptimal code, even in release mode. The saving grace
for this, in theory, is that if SIMD types are passed by-value across
a boundary in release mode it's pretty unlikely to be performance
sensitive (as it's already doing a load/store, and otherwise
perf-sensitive bits should be inlined).

The implementation here is basically a big wad of C++. It was largely
copied from LLVM's own argument promotion pass, only doing the reverse.
In local testing this...

Closes rust-lang#50154
Closes rust-lang#52636
Closes rust-lang#54583
Closes rust-lang#55059

[quite a lot]: rust-lang#47743
[discussion]: rust-lang#44367
[wasn't]: rust-lang#50154
bors added a commit that referenced this issue Oct 14, 2018
rustc: Fix (again) simd vectors by-val in ABI

alexcrichton added a commit to alexcrichton/rust that referenced this issue Oct 16, 2018

kennytm added a commit to kennytm/rust that referenced this issue Oct 18, 2018
rustc: Fix (again) simd vectors by-val in ABI

alexcrichton added a commit to alexcrichton/rust that referenced this issue Oct 19, 2018

Manishearth added a commit to Manishearth/rust that referenced this issue Oct 20, 2018