
_mm256_loadu_si256 only loads 128 bits when compiled with default cargo build --release #52636

Closed

djsweet opened this issue Jul 23, 2018 · 9 comments · Fixed by #55073


djsweet commented Jul 23, 2018

The AVX2 intrinsic _mm256_loadu_si256 fully loads all 256 bits from memory into the register when compiled without any optimization, but only loads 128 bits when compiled with the default cargo build --release option. This small program exhibits the issue:

use std::arch::x86_64;

fn main() {
    if is_x86_feature_detected!("avx2") {
        let load_bytes: [u8; 32] = [0x0f; 32];
        let lb_ptr = load_bytes.as_ptr();
        let reg_load = unsafe {
            x86_64::_mm256_loadu_si256(
                lb_ptr as *const x86_64::__m256i
            )
        };
        println!("{:?}", reg_load);
        let mut store_bytes: [u8; 32] = [0; 32];
        let sb_ptr = store_bytes.as_mut_ptr();
        unsafe {
            x86_64::_mm256_storeu_si256(sb_ptr as *mut x86_64::__m256i, reg_load);
        }
        assert_eq!(load_bytes, store_bytes);
    } else {
        println!("AVX2 is not supported on this machine/build.");
    }
}

When I run cargo run, this is the output:

   Compiling...
    Finished dev [unoptimized + debuginfo] target(s) in 0.33s
     Running `target/debug/avx2_bug_hunt`
__m256i(1085102592571150095, 1085102592571150095, 1085102592571150095, 1085102592571150095)

However, with cargo run --release, this is the output:

   Compiling...
    Finished release [optimized] target(s) in 0.26s
     Running `target/release/avx2_bug_hunt`
__m256i(1085102592571150095, 1085102592571150095, 0, 0)
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `[15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]`,
 right: `[15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]`', src/main.rs:18:9
note: Run with `RUST_BACKTRACE=1` for a backtrace.

I am on macOS 10.13.6 with a Core i7-4960HQ, and the output of rustc --version --verbose is

rustc 1.27.2
binary: rustc
commit-hash: unknown
commit-date: unknown
host: x86_64-apple-darwin
release: 1.27.2
LLVM version: 6.0

Curiously, when inspecting the assembly of main, the call to _mm256_loadu_si256 is not inlined, but instead generates this function:

avx2_bug_hunt`core::coresimd::x86::avx::_mm256_loadu_si256::hd7fc98ebefdce593:
avx2_bug_hunt[0x1000018a0] <+0>:  pushq  %rbp
avx2_bug_hunt[0x1000018a1] <+1>:  movq   %rsp, %rbp
avx2_bug_hunt[0x1000018a4] <+4>:  vmovaps %ymm0, (%rdi)
avx2_bug_hunt[0x1000018a8] <+8>:  popq   %rbp
avx2_bug_hunt[0x1000018a9] <+9>:  vzeroupper 
avx2_bug_hunt[0x1000018ac] <+12>: retq   
avx2_bug_hunt[0x1000018ad] <+13>: nopl   (%rax)

Note the vzeroupper instruction, which zeroes the upper 128 bits of the YMM registers. This is incorrect behavior: __m256i requires the full YMM register to be loaded unmodified. A similar spurious vzeroupper is also present in the assembly generated for _mm256_storeu_si256, but after the register is stored into memory.
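As an aside, a minimal sketch of the workaround that sidesteps this class of bug (the helper name copy_32_bytes is illustrative, not from the issue): hoist all 256-bit intrinsic work into a #[target_feature(enable = "avx")] function, so the __m256i value is created and consumed inside code that agrees on the vector ABI and never crosses a call boundary between functions with different feature sets.

```rust
#[cfg(target_arch = "x86_64")]
mod avx_copy {
    use std::arch::x86_64::*;

    // All 256-bit work happens inside this AVX-enabled function, so the
    // __m256i value never crosses a call boundary between functions that
    // disagree about target features.
    #[target_feature(enable = "avx")]
    pub unsafe fn copy_32_bytes(src: &[u8; 32], dst: &mut [u8; 32]) {
        let v = _mm256_loadu_si256(src.as_ptr() as *const __m256i);
        _mm256_storeu_si256(dst.as_mut_ptr() as *mut __m256i, v);
    }
}

#[cfg(target_arch = "x86_64")]
fn run() {
    if is_x86_feature_detected!("avx") {
        let load_bytes = [0x0fu8; 32];
        let mut store_bytes = [0u8; 32];
        // Safe: AVX support was verified at runtime just above.
        unsafe { avx_copy::copy_32_bytes(&load_bytes, &mut store_bytes) };
        assert_eq!(load_bytes, store_bytes);
        println!("round trip ok");
    } else {
        println!("AVX not available");
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn run() {}

fn main() {
    run();
}
```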

hellow554 commented Jul 23, 2018

Behavior on Windows is kind of strange: debug runs fine, but release errors out.

# rustc --version
rustc 1.27.2 (58cc626de 2018-07-18)
# cargo run --release
    Finished release [optimized] target(s) in 0.00s
     Running `target\release\foo.exe`
error: process didn't exit successfully: `target\release\foo.exe` (exit code: 3221225477)

I am not a Windows developer, so I have no clue why this happens :D Maybe some register clobbering.

I cannot reproduce this on the playground, though :/

It also breaks on my Linux machine 😮

hellow554 commented Jul 23, 2018

With rustc this only happens with opt-level=3, not 2.

Level 2:

_mm256_loadu_si256:
    vmovups (%rsi), %ymm0
    vmovaps %ymm0,(%rdi)
    mov    %rdi,%rax
    vzeroupper
    retq

_mm256_storeu_si256:
    vmovaps (%rsi), %ymm0
    vmovups %ymm0, (%rdi)
    vzeroupper
    retq

Level 3:

_mm256_loadu_si256:
    vmovups (%rsi), %ymm0
    vmovaps %ymm0, (%rdi)
    vzeroupper
    retq

_mm256_storeu_si256:
    vmovups %ymm0, (%rdi)
    vzeroupper
    retq

hanna-kruppe commented Jul 23, 2018

Note the vzeroupper instruction, which zeroes the upper 128 bits of the YMM registers. This is incorrect behavior: __m256i requires the full YMM register to be loaded unmodified. A similar spurious vzeroupper is also present in the assembly generated for _mm256_storeu_si256, but after the register is stored into memory.

I don't think this analysis is correct: the loadu function doesn't return the loaded value in a register (to avoid ABI mismatches) but rather returns it in memory, writing the contents of ymm0 to an address that's passed in (in this case, the address of reg_load). Indeed, the assembly never even mentions ymm registers or AVX instructions outside of loadu and storeu; in main, the 256-bit vectors are handled as a pair of xmm registers. It can't be using them, since the avx2 target feature is not enabled on the function. Incidentally, this target feature mismatch is also why loadu and storeu are not being inlined.

More importantly, this combined with #50154 is why the store doesn't work: storeu expects the value to be stored in a ymm register, but the caller doesn't have that register, so it passes the two halves of the vector in xmm0 and xmm1 instead. Normally rustc passes all vector arguments and return values through memory to avoid this problem, but as described in #50154 LLVM sometimes undoes this with disastrous consequences.

tl;dr duplicate of #50154
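To make the mismatch concrete, here is an illustrative sketch (the helper name splat_avx is made up, not from the thread): a __m256i that is created and consumed entirely inside one AVX-enabled function only ever crosses the call boundary behind a pointer, which travels in a general-purpose register, so the xmm0/xmm1 split described above cannot occur.

```rust
#[cfg(target_arch = "x86_64")]
mod sketch {
    use std::arch::x86_64::*;

    // The call boundary only carries `byte` and a pointer into `dst`; the
    // 256-bit vector value is created and stored entirely inside this
    // AVX-enabled function, so no ymm-vs-xmm ABI disagreement can arise.
    #[target_feature(enable = "avx")]
    pub unsafe fn splat_avx(byte: i8, dst: &mut [u8; 32]) {
        let v = _mm256_set1_epi8(byte);
        _mm256_storeu_si256(dst.as_mut_ptr() as *mut __m256i, v);
    }
}

#[cfg(target_arch = "x86_64")]
fn run() {
    if is_x86_feature_detected!("avx") {
        let mut out = [0u8; 32];
        // Safe: AVX support was verified at runtime just above.
        unsafe { sketch::splat_avx(0x0f, &mut out) };
        assert!(out.iter().all(|&b| b == 0x0f));
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn run() {}

fn main() {
    run();
}
```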

djsweet commented Jul 23, 2018

@rkruppe You're right that this is related to #50154, but based on your comments here, we might have established the true cause of these AVX bugs.

My use of is_x86_feature_detected!("avx2") is a bit of a red herring. I discovered this bug while attempting to use AVX2-specific intrinsics, but _mm256_loadu_si256 is a part of AVX as per the Intel Intrinsics Guide, and is only enabled in stdsimd if the avx feature flag is enabled.

the loadu function doesn't return the loaded value in a register (to avoid ABI mismatches) but rather returns it in memory (writing the contents of ymm0 to an address that's passed in, in this case the address of reg_load).

You appear to have _mm256_loadu_si256 and _mm256_storeu_si256 conflated. The only argument passed explicitly to _mm256_loadu_si256, reg_load, is used to load the register, even as per the stdsimd source. If the intent is to perform a memory-to-memory copy to avoid ABI mismatch, this has failed here, too. The assembly listing for _mm256_loadu_si256 takes the input argument at %rdi and loads 256 bits from it into %ymm0. This aligns with the Rust-level definition of the function, and unless the __m256 structs have special compiler support, would be what I would expect from reading that definition.

I find the idea that registers should be "demoted" to memory regions to fit an ABI suspect. If the target ABI does not allow for the use of said registers, it would be more ergonomic for the compiler to fail early and often, notifying the user that their target ABI does not support the registers they are attempting to use.

hanna-kruppe commented Jul 23, 2018

My use of is_x86_feature_detected!("avx2") is a bit of a red herring. I discovered this bug while attempting to use AVX2-specific intrinsics, but _mm256_loadu_si256 is a part of AVX as per the Intel Intrinsics Guide, and is only enabled in stdsimd if the avx feature flag is enabled.

I don't see how that's relevant to this issue.

You appear to have _mm256_loadu_si256 and _mm256_storeu_si256 conflated. The only argument passed explicitly to _mm256_loadu_si256, reg_load, is used to load the register, even as per the stdsimd source.

I am talking about the LLVM IR and assembly generated, not about the source code. However, it appears we're talking about different machine code: on the playground and in @hellow554's experiments, the body of loadu is essentially:

	vmovups	(%rsi), %ymm0
	vmovaps	%ymm0, (%rdi)

while your original post quotes a disassembly that contains only vmovaps %ymm0, (%rdi) ... huh, that's a store? That can't be right. Please double-check.

In any case, #50154 would also explain why storeu is miscompiled.

unless the __m256 structs have special compiler support

They have, not just in the sense required for allocating them to ymm registers and using AVX instructions on them, but yes also in the sense you mean here: they are passed in memory rather than as immediates.

I find the idea that registers should be "demoted" to memory regions to fit an ABI suspect. If the target ABI does not allow for the use of said registers, it would be more ergonomic for the compiler to fail early and often, notifying the user that their target ABI does not support the registers they are attempting to use.

Unfortunately, the ABI mismatch problems are real and difficult to solve, and we can't very well ignore them, so this is the only feasible approach for the time being.

djsweet commented Jul 23, 2018

@rkruppe The goofy AT&T ASM syntax always trips me up. In Intel-flavor mnemonics,

avx2_bug_hunt`core::coresimd::x86::avx::_mm256_load_si256::h86383795cdef4461:
avx2_bug_hunt[0x100000cc0] <+0>:  push   rbp
avx2_bug_hunt[0x100000cc1] <+1>:  mov    rbp, rsp
avx2_bug_hunt[0x100000cc4] <+4>:  vmovaps ymmword ptr [rdi], ymm0
avx2_bug_hunt[0x100000cc8] <+8>:  pop    rbp
avx2_bug_hunt[0x100000cc9] <+9>:  vzeroupper 
avx2_bug_hunt[0x100000ccc] <+12>: ret    
avx2_bug_hunt[0x100000ccd] <+13>: nop    dword ptr [rax]

avx2_bug_hunt`core::coresimd::x86::avx::_mm256_store_si256::h1b513ee5f14081f3:
avx2_bug_hunt[0x100000cd0] <+0>:  push   rbp
avx2_bug_hunt[0x100000cd1] <+1>:  mov    rbp, rsp
avx2_bug_hunt[0x100000cd4] <+4>:  vmovaps ymmword ptr [rdi], ymm0
avx2_bug_hunt[0x100000cd8] <+8>:  pop    rbp
avx2_bug_hunt[0x100000cd9] <+9>:  vzeroupper 
avx2_bug_hunt[0x100000cdc] <+12>: ret    
avx2_bug_hunt[0x100000cdd] <+13>: nop    dword ptr [rax]

For some reason, load and store are compiling identically on optimization level 3.

EDIT: It's worth noting that in order to ensure it was the dereference causing issues and not explicitly written memory copying, I rewrote the test program earlier using only _mm256_load_si256 and _mm256_store_si256, with similar issues unless explicitly compiled with the AVX ABI.

The updated program:

use std::arch::x86_64;

#[repr(align(32))]
struct BytePair {
    load_bytes: [u8; 32],
    store_bytes: [u8; 32]
}

fn main() {
    let mut byte_pair = BytePair{
        load_bytes: [0x0f; 32],
        store_bytes: [0; 32]
    };
    let lb_ptr = byte_pair.load_bytes.as_ptr();
    let reg_load = unsafe {
        x86_64::_mm256_load_si256(
            lb_ptr as *const x86_64::__m256i
        )
    };
    println!("{:?}", reg_load);
    let sb_ptr = byte_pair.store_bytes.as_mut_ptr();
    unsafe {
        x86_64::_mm256_store_si256(sb_ptr as *mut x86_64::__m256i, reg_load);
    }
    assert_eq!(&byte_pair.load_bytes, &byte_pair.store_bytes);
}

EDIT 2: The disassembly of main up to the x86_64::_mm256_load_si256 call:

   100000ce0:   55                      push   rbp
   100000ce1:   48 89 e5                mov    rbp,rsp
   100000ce4:   41 56                   push   r14
   100000ce6:   53                      push   rbx
   100000ce7:   48 83 e4 e0             and    rsp,0xffffffffffffffe0
   100000ceb:   48 81 ec e0 00 00 00    sub    rsp,0xe0
   100000cf2:   0f 28 05 87 d3 03 00    movaps xmm0,XMMWORD PTR [rip+0x3d387]        # 10003e080 <__ZN53_$LT$$RF$$u27$a$u20$T$u20$as$u20$core..fmt..Debug$GT$3fmt17h743e723bc811d43eE+0x680>
   100000cf9:   0f 29 84 24 90 00 00    movaps XMMWORD PTR [rsp+0x90],xmm0
   100000d00:   00 
   100000d01:   0f 29 84 24 80 00 00    movaps XMMWORD PTR [rsp+0x80],xmm0
   100000d08:   00 
   100000d09:   4c 8d b4 24 a0 00 00    lea    r14,[rsp+0xa0]
   100000d10:   00 
   100000d11:   0f 57 c0                xorps  xmm0,xmm0
   100000d14:   0f 29 84 24 b0 00 00    movaps XMMWORD PTR [rsp+0xb0],xmm0
   100000d1b:   00 
   100000d1c:   0f 29 84 24 a0 00 00    movaps XMMWORD PTR [rsp+0xa0],xmm0
   100000d23:   00 
   100000d24:   0f 28 05 55 d3 03 00    movaps xmm0,XMMWORD PTR [rip+0x3d355]        # 10003e080 <__ZN53_$LT$$RF$$u27$a$u20$T$u20$as$u20$core..fmt..Debug$GT$3fmt17h743e723bc811d43eE+0x680>
   100000d2b:   48 8d 9c 24 c0 00 00    lea    rbx,[rsp+0xc0]
   100000d32:   00 
   100000d33:   48 89 df                mov    rdi,rbx
   100000d36:   0f 28 c8                movaps xmm1,xmm0
   100000d39:   e8 82 ff ff ff          call   100000cc0 <__ZN4core8coresimd3x863avx17_mm256_load_si25617h86383795cdef4461E>

And the contents of 0x10003e080:

(lldb) x 0x10003e080
0x10003e080: 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f  ................
0x10003e090: 00 0a 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
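djsweet mentions above that the failure goes away when the program is "explicitly compiled with the AVX ABI". A sketch of one common way to do that with rustc's codegen flags (the project setup is assumed; this is not a command from the thread):

```shell
# Compile the entire crate with the AVX target feature enabled, so every
# function (including main) agrees on the 256-bit vector ABI. Note the
# resulting binary will only run correctly on CPUs that support AVX.
RUSTFLAGS="-C target-feature=+avx" cargo run --release

# Alternatively, enable everything the build machine's CPU supports:
RUSTFLAGS="-C target-cpu=native" cargo run --release
```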

alexcrichton commented
I've confirmed that this is the same bug as #50154, which is the same as the upstream LLVM bug https://bugs.llvm.org/show_bug.cgi?id=37358, as @rkruppe mentioned.

djsweet commented Jul 28, 2018

@alexcrichton The LLVM bug report was opened in May 2018 with seemingly no progress since then. In the intervening period, the SIMD features were marked "stable" in general (not just at the API level) and shipped even with this bug present. Are there any plans to address the bug at the rustc level, or are consumers of rustc also having to wait for a bugfix from LLVM upstream at this point?

alexcrichton commented
@djsweet I don't personally know how we could fix this at the rustc level, but it may be good to ping the LLVM issue if you're interested in stirring up activity!

alexcrichton added a commit to alexcrichton/rust that referenced this issue Oct 14, 2018
The issue of passing around SIMD types as values between functions has
seen [quite a lot] of [discussion], and although we thought [we fixed
it][quite a lot] it [wasn't]! This PR is a change to rustc to, again,
try to fix this issue.

The fundamental problem here remains the same: if a SIMD vector argument
is passed by-value in LLVM's function type and the caller and callee
disagree on target features, a miscompile happens. We solve this
by never passing SIMD vectors by-value, but LLVM will still thwart us
with its argument promotion pass to promote by-ref SIMD arguments to
by-val SIMD arguments.

This commit is an attempt to thwart LLVM thwarting us. We, just before
codegen, will take yet another look at the LLVM module and demote any
by-value SIMD arguments we see. This is a very manual attempt by us to
ensure the codegen for a module keeps working, and it unfortunately is
likely producing suboptimal code, even in release mode. The saving grace
for this, in theory, is that if SIMD types are passed by-value across
a boundary in release mode it's pretty unlikely to be performance
sensitive (as it's already doing a load/store, and otherwise
perf-sensitive bits should be inlined).

The implementation here is basically a big wad of C++. It was largely
copied from LLVM's own argument promotion pass, only doing the reverse.
In local testing this...

Closes rust-lang#50154
Closes rust-lang#52636
Closes rust-lang#54583
Closes rust-lang#55059

[quite a lot]: rust-lang#47743
[discussion]: rust-lang#44367
[wasn't]: rust-lang#50154
bors added a commit that referenced this issue Oct 14, 2018
rustc: Fix (again) simd vectors by-val in ABI

alexcrichton added a commit to alexcrichton/rust that referenced this issue Oct 16, 2018

kennytm added a commit to kennytm/rust that referenced this issue Oct 18, 2018
rustc: Fix (again) simd vectors by-val in ABI

alexcrichton added a commit to alexcrichton/rust that referenced this issue Oct 19, 2018

Manishearth added a commit to Manishearth/rust that referenced this issue Oct 20, 2018