Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVX2 related regression introduced by rust 1.56.0 and onwards #91839

Closed
marmeladema opened this issue Dec 12, 2021 · 12 comments
Closed

AVX2 related regression introduced by rust 1.56.0 and onwards #91839

marmeladema opened this issue Dec 12, 2021 · 12 comments
Labels
C-bug Category: This is a bug. P-medium Medium priority regression-untriaged Untriaged performance or correctness regression. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@marmeladema
Copy link
Contributor

Code

I tried this code:

I expected to see this happen: tests pass with success
Instead, this happened: some tests are failing:

failures:
    tests::search_middle
    tests::search_multiple
    tests::search_prefix
    tests::search_suffix

Version it worked on

It most recently worked on: Rust 1.55.0

Version with regression

rustc +1.56.0 --version --verbose:

rustc 1.56.0 (09c42c458 2021-10-18)
binary: rustc
commit-hash: 09c42c45858d5f3aedfa670698275303a3d19afa
commit-date: 2021-10-18
host: x86_64-unknown-linux-gnu
release: 1.56.0
LLVM version: 13.0.0

and onwards (1.56.1, 1.57.0, beta and nightly)

How to reproduce

Just run the following command with the previously mentioned commit checked out:

$ cargo +1.56.0 test --release -- tests::search_middle

What does commit cloudflare/sliceslice-rs@a7d4556 do

It just moves method vector_search_in from the Avx2Searcher struct to the Searcher trait. Loosing the #[target_feature(enable = "avx2")] attribute at the same time.

I found two (unsatisfactory) way to make the tests pass:

    #[inline]
    // Uncommenting the following lines makes it work
    // #[target_feature(enable = "avx2")]
    unsafe fn vector_search_in<V: Vector>(

But that won't work ultimately because the trait is going to be implemented for other architectures.

        // Uncommenting the following lines makes it work
        // println!("[vector_search_in_chunk] hash.first={:?}", hash.first);
        println!("[vector_search_in_chunk] first={:?}", first);

Absolutely no idea why printing something would have any influence!

Everything works properly for rust versions <= 1.55.0

Bisection attempt

I have tried to use cargo-bisect-rustc to bisect and it seems introduced between nightly-2021-08-10 and nightly-2021-08-12 (there are no nightly for 2021-08-11).

Further (manual) bisection points to #87254 but I am not sure my bisection is actually correct so any help is welcome 👍

@marmeladema marmeladema added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Dec 12, 2021
@rustbot rustbot added regression-from-stable-to-stable Performance or correctness regression from one stable version to another. I-prioritize Issue: Indicates that prioritization has been requested for this issue. regression-untriaged Untriaged performance or correctness regression. and removed regression-untriaged Untriaged performance or correctness regression. regression-from-stable-to-stable Performance or correctness regression from one stable version to another. labels Dec 12, 2021
@Aaron1011 Aaron1011 pinned this issue Dec 12, 2021
@Aaron1011 Aaron1011 unpinned this issue Dec 12, 2021
@hkratz
Copy link
Contributor

hkratz commented Dec 13, 2021

I can confirm that the test failure can be tracked down to #87254, though I currently have no idea why that change would cause an issue here. A reduced testcase would help with the investigation.

As an aside, functions or trait methods will only be inlined if the caller is compiled with the same (or, unreliably, a superset of the) target features as the callee. E.g. after the change introduced in cloudflare/sliceslice-rs@820a28f calls from vector_search_in_chunk() to Vector::cmpeq_epi8(), Vector::loadu_si(), etc. are not inlined even with Rust 1.55. That is likely not what you want.

@marmeladema
Copy link
Contributor Author

marmeladema commented Dec 13, 2021

Thank you for looking into this @hkratz !

A reduced testcase would help with the investigation.

Here is a reduced testcase, even though it's still quite big
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

trait Vector: Copy + std::fmt::Debug {
    const LANES: usize;

    unsafe fn set1_epi8(a: i8) -> Self;

    unsafe fn loadu_si(a: *const u8) -> Self;

    unsafe fn cmpeq_epi8(a: Self, b: Self) -> Self;

    unsafe fn and_si(a: Self, b: Self) -> Self;

    unsafe fn movemask_epi8(a: Self) -> i32;
}

impl Vector for __m256i {
    const LANES: usize = 32;

    #[inline]
    #[target_feature(enable = "avx2")]
    unsafe fn set1_epi8(a: i8) -> Self {
        _mm256_set1_epi8(a)
    }

    #[inline]
    #[target_feature(enable = "avx2")]
    unsafe fn loadu_si(a: *const u8) -> Self {
        _mm256_loadu_si256(a as *const Self)
    }

    #[inline]
    #[target_feature(enable = "avx2")]
    unsafe fn cmpeq_epi8(a: Self, b: Self) -> Self {
        _mm256_cmpeq_epi8(a, b)
    }

    #[inline]
    #[target_feature(enable = "avx2")]
    unsafe fn and_si(a: Self, b: Self) -> Self {
        _mm256_and_si256(a, b)
    }

    #[inline]
    #[target_feature(enable = "avx2")]
    unsafe fn movemask_epi8(a: Self) -> i32 {
        _mm256_movemask_epi8(a)
    }
}

struct VectorHash<V: Vector> {
    first: V,
    last: V,
}

impl<V: Vector> VectorHash<V> {
    unsafe fn new(first: u8, last: u8) -> Self {
        Self {
            first: Vector::set1_epi8(first as i8),
            last: Vector::set1_epi8(last as i8),
        }
    }
}

pub struct Avx2Searcher {
    position: usize,
    avx2_hash: VectorHash<__m256i>,
    needle_len: usize,
}

impl Avx2Searcher {
    #[target_feature(enable = "avx2")]
    pub unsafe fn with_position(needle: &[u8], position: usize) -> Self {
        // Implicitly checks that the needle is not empty because position is an
        // unsized integer.
        assert!(position < needle.len());

        let avx2_hash = VectorHash::new(needle[0], needle[position]);

        Self {
            position,
            avx2_hash,
            needle_len: needle.len(),
        }
    }

    /// Performs a substring search for the `needle` within `haystack`.
    #[target_feature(enable = "avx2")]
    pub unsafe fn search_in(&self, haystack: &[u8]) -> bool {
        println!("[search_in] haystack({})={:?}", haystack.len(), haystack);

        let end = haystack.len() - self.needle_len + 1;

        self.vector_search_in(haystack, end, &self.avx2_hash)
    }
}

trait Searcher {
    fn position(&self) -> usize;

    #[inline]
    unsafe fn vector_search_in_chunk<V: Vector>(
        &self,
        hash: &VectorHash<V>,
        start: *const u8,
        mask: i32,
    ) -> bool {
        let first = Vector::loadu_si(start);
        let last = Vector::loadu_si(start.add(self.position()));

        // Uncommenting the following lines makes it work
        // println!("[vector_search_in_chunk] hash.first={:?}", hash.first);
        println!("[vector_search_in_chunk] first={:?}", first);
        let eq_first = Vector::cmpeq_epi8(hash.first, first);
        println!(
            "[vector_search_in_chunk] eq_first={:?}",
            eq_first
        );
        let eq_last = Vector::cmpeq_epi8(hash.last, last);
        println!(
            "[vector_search_in_chunk] eq_last ={:?}",
            eq_last
        );

        let eq = Vector::and_si(eq_first, eq_last);
        println!(
            "[vector_search_in_chunk] eq      ={:?}",
            eq
        );
        let eq = (Vector::movemask_epi8(eq) & mask) as u32;
        println!(
        	"[vector_search_in_chunk] eq      ={:?}",
        	eq
        );

        eq != 0
    }

    #[inline]
    // Uncommenting the following lines makes it work
    // #[target_feature(enable = "avx2")]
    unsafe fn vector_search_in<V: Vector>(
        &self,
        haystack: &[u8],
        end: usize,
        hash: &VectorHash<V>,
    ) -> bool {
        println!(
            "[vector_search_in] haystack({})={:?}, end={}",
            haystack.len(),
            haystack,
            end
        );
        // debug_assert!(haystack.len() >= self.needle().size());

        let mut chunks = haystack[..end].chunks_exact(V::LANES);
        for chunk in &mut chunks {
            println!("[vector_search_in] chunk({})={:?}", chunk.len(), chunk);
            if self.vector_search_in_chunk(hash, chunk.as_ptr(), -1) {
                return true;
            }
        }

        let remainder = chunks.remainder().len();
        println!("[vector_search_in] remainder: {}", remainder);
        if remainder > 0 {
            let start = haystack.as_ptr().add(end - V::LANES);
            let mask = -1 << (V::LANES - remainder);

            if self.vector_search_in_chunk(hash, start, mask) {
                return true;
            }
        }

        false
    }
}

impl Searcher for Avx2Searcher {
    #[inline(always)]
    fn position(&self) -> usize {
        self.position
    }
}

pub fn main() {
	let needle = b"consectetur";
	let haystack = b"Lorem ipsum dolor sit amet, consectetur adipiscing elit";
	let searcher = unsafe { Avx2Searcher::with_position(needle, 0) };
	assert!(unsafe { searcher.search_in(haystack) });
}

Compiled in debug mode with rust 1.57.0:

$ rustc repro.rs
$ ./repro 
[search_in] haystack(55)=[76, 111, 114, 101, 109, 32, 105, 112, 115, 117, 109, 32, 100, 111, 108, 111, 114, 32, 115, 105, 116, 32, 97, 109, 101, 116, 44, 32, 99, 111, 110, 115, 101, 99, 116, 101, 116, 117, 114, 32, 97, 100, 105, 112, 105, 115, 99, 105, 110, 103, 32, 101, 108, 105, 116]
[vector_search_in] haystack(55)=[76, 111, 114, 101, 109, 32, 105, 112, 115, 117, 109, 32, 100, 111, 108, 111, 114, 32, 115, 105, 116, 32, 97, 109, 101, 116, 44, 32, 99, 111, 110, 115, 101, 99, 116, 101, 116, 117, 114, 32, 97, 100, 105, 112, 105, 115, 99, 105, 110, 103, 32, 101, 108, 105, 116], end=45
[vector_search_in] chunk(32)=[76, 111, 114, 101, 109, 32, 105, 112, 115, 117, 109, 32, 100, 111, 108, 111, 114, 32, 115, 105, 116, 32, 97, 109, 101, 116, 44, 32, 99, 111, 110, 115]
[vector_search_in_chunk] first=__m256i(8100041059028070220, 8028914711526208883, 7881616507232526450, 8317708033332114533)
[vector_search_in_chunk] eq_first=__m256i(0, 0, 0, 1095216660480)
[vector_search_in_chunk] eq_last =__m256i(0, 0, 0, 1095216660480)
[vector_search_in_chunk] eq      =__m256i(0, 0, 0, 1095216660480)
[vector_search_in_chunk] eq      =268435456

In the working case, eq is properly not equal to zero.

Compiled with opt-level=3 with rust 1.57.0:

$ rustc repro.rs -C opt-level=3
$ ./repro 
[search_in] haystack(55)=[76, 111, 114, 101, 109, 32, 105, 112, 115, 117, 109, 32, 100, 111, 108, 111, 114, 32, 115, 105, 116, 32, 97, 109, 101, 116, 44, 32, 99, 111, 110, 115, 101, 99, 116, 101, 116, 117, 114, 32, 97, 100, 105, 112, 105, 115, 99, 105, 110, 103, 32, 101, 108, 105, 116]
[vector_search_in] haystack(55)=[76, 111, 114, 101, 109, 32, 105, 112, 115, 117, 109, 32, 100, 111, 108, 111, 114, 32, 115, 105, 116, 32, 97, 109, 101, 116, 44, 32, 99, 111, 110, 115, 101, 99, 116, 101, 116, 117, 114, 32, 97, 100, 105, 112, 105, 115, 99, 105, 110, 103, 32, 101, 108, 105, 116], end=45
[vector_search_in] chunk(32)=[76, 111, 114, 101, 109, 32, 105, 112, 115, 117, 109, 32, 100, 111, 108, 111, 114, 32, 115, 105, 116, 32, 97, 109, 101, 116, 44, 32, 99, 111, 110, 115]
[vector_search_in_chunk] first=__m256i(8100041059028070220, 8028914711526208883, 7881616507232526450, 8317708033332114533)
[vector_search_in_chunk] eq_first=__m256i(0, 0, 0, 1095216660480)
[vector_search_in_chunk] eq_last =__m256i(0, 0, 0, 0)
[vector_search_in_chunk] eq      =__m256i(0, 0, 0, 0)
[vector_search_in_chunk] eq      =0
[vector_search_in] remainder: 13
[vector_search_in_chunk] first=__m256i(8388362364150312047, 7142757887439102240, 8387237851300064879, 7597688451221189237)
[vector_search_in_chunk] eq_first=__m256i(0, -72057594037927936, 1095216660480, 0)
[vector_search_in_chunk] eq_last =__m256i(0, 0, 0, 0)
[vector_search_in_chunk] eq      =__m256i(0, 0, 0, 0)
[vector_search_in_chunk] eq      =0
thread 'main' panicked at 'assertion failed: unsafe { searcher.search_in(haystack) }', repro.rs:191:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Here, for whatever reason, eq ends up being equal to 0.

Not reproducible with opt-level=1 or opt-level=2

As an aside, functions or trait methods will only be inlined if the caller is compiled with the same (or, unreliably, a superset of the) target features as the callee. E.g. after the change introduced in cloudflare/sliceslice-rs@820a28f calls from vector_search_in_chunk() to Vector::cmpeq_epi8(), Vector::loadu_si(), etc. are not inlined even with Rust 1.55. That is likely not what you want.

That's kind of what I was experimenting with. Is there a technical reason that forbids inlining here? The trait is only called through methods marked with #[target_feature(enable = "avx2")], so I would assume it would locally inherit the target feature flag and that further calls to avx2 methods would get inlined properly too.

This a bit of a distraction compared to what this issue is about but what are my options here? Use macro_rules instead?

@marmeladema
Copy link
Contributor Author

Depending on the level optimization, it seems eq_last doesn't have the same value but printing either hash.first or hash.last makes it work somehow.

@hkratz
Copy link
Contributor

hkratz commented Dec 13, 2021

Thanks for the testcase. Will need some quiet time to look into this.

Is there a technical reason that forbids inlining here?

This is a restriction of both Rust (see RFC 2045) and LLVM (see e.g. the implementation for X86). Essentially the backend optimizes based on the assumption that functions are the smallest cohesive units for feature selection. What you would probably want here is for the compiler to prove that a monomorphised method can only be called from AVX2-enabled contexts and thus promote it to AVX2 as well. We don't have that (yet).

This a bit of a distraction compared to what this issue is about but what are my options here? Use macro_rules instead?

Either use macros like I did in simdutf8 or use something like @calebzulawski's multiversion crate.

@nikic
Copy link
Contributor

nikic commented Dec 13, 2021

FWIW miscompilations in this area can be related to target_features mismatches due to an LLVM bug, see #79865. I haven't checked whether this is the case here or not.

@hkratz
Copy link
Contributor

hkratz commented Dec 13, 2021

Minimized:

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

pub struct TwoVecs(__m256i, __m256i);

#[target_feature(enable = "avx2")]
pub unsafe fn avx2_any_upper_bit_set(vecs: &TwoVecs) -> bool {
    inlined_non_avx2(&vecs)
}

#[inline]
unsafe fn inlined_non_avx2(
    vecs: &TwoVecs
) -> bool {
    non_inlined_non_avx2(vecs)
}

#[inline(never)]
unsafe fn non_inlined_non_avx2(
    vecs: &TwoVecs,
) -> bool {
    let eq = _mm256_or_si256(vecs.0, vecs.1);
    _mm256_movemask_epi8(eq) != 0
}

pub fn main() {
    unsafe {
        let vecs = TwoVecs(_mm256_set1_epi8(0x00_u8 as i8), _mm256_set1_epi8(0x00_u8 as i8));
        assert!(!avx2_any_upper_bit_set(&vecs));
    }
}

Godbolt

This asserts in release but not debug. Essentially it is an ABI mismatch between avx2_any_upper_bit_set() putting parameters in ymm registers and non_inlined_non_avx2() expecting them in xmm registers. The LLVM IR looks OK. So it can probably be easily turned into a bug report to LLVM.

#87254 exposes this problem by removing zero-sized padding fields, which previously prevented the (broken) LLVM optimization of putting the struct values into ymm registers.

@marmeladema
Copy link
Contributor Author

Wow that was fast! Thank you so much @hkratz ! 👍

So it can probably be easily turned into a bug report to LLVM.

This kind of issue is really outside of my area of expertise. What's the process to create an llvm bug report? I must admit that I don't feel comfortable doing it myself.

@nikic
Copy link
Contributor

nikic commented Dec 13, 2021

Okay, that does look like the same issue as #79865. I'll file a bug with LLVM.

@nikic
Copy link
Contributor

nikic commented Dec 13, 2021

Opened llvm/llvm-project#52660.

@marmeladema
Copy link
Contributor Author

Thank you very much @nikic !

In the mean time, would it be possible (or even desirable) to detect such ABI mismatch cases on the rust compiler side and issue an error?

@apiraino apiraino added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Dec 23, 2021
@apiraino
Copy link
Contributor

Assigning priority as discussed in the Zulip thread of the Prioritization Working Group.

@rustbot label -I-prioritize +P-medium

@rustbot rustbot added P-medium Medium priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Dec 30, 2021
@nikic
Copy link
Contributor

nikic commented Jan 20, 2022

Closing this as a duplicate of #79865, as they have the same root cause.

@nikic nikic closed this as completed Jan 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug. P-medium Medium priority regression-untriaged Untriaged performance or correctness regression. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

5 participants