Regression in performance (poor codegen) between 1.81 and 1.82 #132636

Open
tobz opened this issue Nov 5, 2024 · 10 comments
Assignees
dianqk
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. P-medium Medium priority regression-untriaged Untriaged performance or correctness regression. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@tobz

tobz commented Nov 5, 2024

I've observed a fairly nasty performance regression in compression throughput when switching to 1.82, which appears to boil down to some very weird/suboptimal machine code being generated.

Digging in, code that previously compiled to something reasonable now turns into... well, something far less reasonable, executing a vastly higher number of instructions for the same work and slowing everything down. The MVCE (linked below) demonstrates this behavior on at least two instruction sets (x86-64 and ARM64) and several microarchitectures (AMD Zen 2, Zen 4, and Apple M2), and it reliably reproduces the regression.

Some other folks who were helping me debug this will also be posting some more information, but the biggest initial suspect looks to be the upgrade to LLVM 19 in 1.82.

Code

The MVCE is here: https://github.com/tobz/miniz-oxide-slowdown-repro.

It's a simple program that generates a deterministic input corpus, compresses it multiple times (using flate2/miniz_oxide), and then exits. Care was taken to keep the runs deterministic and reasonably representative of the change observed between Rust 1.81 and 1.82.

In my environment, each compress operation takes ~750ms when compiled with 1.81.0 or earlier and jumps to ~1.25s per operation on 1.82.0 and later, a ~60% slowdown.
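For reference, the hot loop of such a harness looks roughly like the sketch below (a minimal illustration with made-up corpus size and iteration count, not the actual repro code; it assumes flate2 with its default pure-Rust miniz_oxide backend):

```rust
use std::io::Write;
use std::time::Instant;

use flate2::{write::ZlibEncoder, Compression};

fn main() {
    // Deterministic corpus: the same bytes on every run, so timings are comparable.
    let corpus: Vec<u8> = (0u32..64 * 1024 * 1024)
        .map(|i| (i.wrapping_mul(2_654_435_761) >> 24) as u8)
        .collect();

    for run in 0..5 {
        let start = Instant::now();
        // With default features, flate2 is backed by the pure-Rust miniz_oxide crate.
        let mut encoder = ZlibEncoder::new(Vec::new(), Compression::default());
        encoder.write_all(&corpus).unwrap();
        let compressed = encoder.finish().unwrap();
        println!("run {run}: {} bytes in {:?}", compressed.len(), start.elapsed());
    }
}
```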

Version it worked on

This "worked" (ran well) on 1.81.0, 1.80.0, and 1.79.0 (as far back as I checked).

Version with regression

rustc --version --verbose:

rustc 1.82.0 (f6e511eec 2024-10-15)
binary: rustc
commit-hash: f6e511eec7342f59a25f7c0534f1dbea00d01b14
commit-date: 2024-10-15
host: x86_64-unknown-linux-gnu
release: 1.82.0
LLVM version: 19.1.1
@tobz tobz added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Nov 5, 2024
@rustbot rustbot added I-prioritize Issue: Indicates that prioritization has been requested for this issue. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Nov 5, 2024
@apiraino
Contributor

apiraino commented Nov 5, 2024

cc @nikic perhaps more insights

@rustbot label +A-LLVM +I-slow

@rustbot rustbot added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Nov 5, 2024
@bobrik

bobrik commented Nov 5, 2024

Some analysis from my end:

  • flamegraph on v1.81: [image]

  • flamegraph on v1.82: [image]

It's clear that read_u16_le stands out.

  • perf annotate on v1.81: [image]

  • perf annotate on v1.82: [image]

That's a lot of instructions! You can find them with objdump:

objdump -S --no-show-raw-insn ~/miniz-oxide-slowdown-repro.1.81 | perl -p -e 's/^\s+(\S+):\t//;' | c++filt | less
  • v1.81:
    slice[pos] as u16 | ((slice[pos + 1] as u16) << 8)
movzwl 0x1ffff(%rcx,%rax,1),%eax
mov    %ax,0x48(%rsp)
                if num_probes_left == 0 {
lea    0xa(%r11),%rax
mov    %rax,0x90(%rsp)
xor    %ebx,%ebx
mov    %r11,%r10
mov    %r8d,%eax
mov    %rbp,0x10(%rsp)
mov    %rcx,0x20(%rsp)
mov    %eax,%edx
nopw   0x0(%rax,%rax,1)
  • v1.82 (truncated):
    slice[pos] as u16 | ((slice[pos + 1] as u16) << 8)
movzwl -0x1(%r9,%rax,1),%eax
mov    %ax,0x40(%rsp)
lea    0x2(%r15),%rax
mov    %rax,0x280(%rsp)
and    $0x7fff,%eax
mov    %rax,0x278(%rsp)
lea    0xa(%r15),%rax
mov    %rax,0x270(%rsp)
and    $0x7fff,%eax
mov    %rax,0x268(%rsp)
lea    0x12(%r15),%rax
mov    %rax,0x260(%rsp)
and    $0x7fff,%eax
mov    %rax,0x258(%rsp)
lea    0x1a(%r15),%rax
mov    %rax,0x250(%rsp)
and    $0x7fff,%eax
mov    %rax,0x248(%rsp)
lea    0x22(%r15),%rax
mov    %rax,0x240(%rsp)
and    $0x7fff,%eax
mov    %rax,0x238(%rsp)
lea    0x2a(%r15),%rax
mov    %rax,0x230(%rsp)
and    $0x7fff,%eax
mov    %rax,0x228(%rsp)
lea    0x32(%r15),%rax
mov    %rax,0x220(%rsp)
and    $0x7fff,%eax
mov    %rax,0x218(%rsp)
lea    0x3a(%r15),%rax
mov    %rax,0x210(%rsp)
and    $0x7fff,%eax
mov    %rax,0x208(%rsp)
lea    0x42(%r15),%rax
mov    %rax,0x200(%rsp)
and    $0x7fff,%eax
mov    %rax,0x1f8(%rsp)
lea    0x4a(%r15),%rax
mov    %rax,0x1f0(%rsp)
and    $0x7fff,%eax
mov    %rax,0x1e8(%rsp)
lea    0x52(%r15),%rax
mov    %rax,0x1e0(%rsp)
and    $0x7fff,%eax
mov    %rax,0x1d8(%rsp)
lea    0x5a(%r15),%rax
mov    %rax,0x1d0(%rsp)
and    $0x7fff,%eax
mov    %rax,0x1c8(%rsp)

I'm no assembly wizard, but that seems excessive and is likely the reason for the slowdown.

@randomairborne

randomairborne commented Nov 5, 2024

This function is not actually the issue; see hanna-kruppe's comment below. Just want to add that [this function does claim to rely on the compiler having certain smarts](https://github.com/Frommi/miniz_oxide/blob/0a33effd414711b379e01b0613ba5ae85a0e14d0/miniz_oxide/src/deflate/core.rs#L406-L411), in case it's relevant:

```rust
// Read the two bytes starting at pos and interpret them as an u16.
#[inline]
const fn read_u16_le(slice: &[u8], pos: usize) -> u16 {
    // The compiler is smart enough to optimize this into an unaligned load.
    slice[pos] as u16 | ((slice[pos + 1] as u16) << 8)
}
```
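For comparison only (a sketch, not code from miniz_oxide): an equivalent formulation that spells out the little-endian load via from_le_bytes. Both forms should compile to the same single unaligned 16-bit load, though this variant cannot be a const fn.

```rust
use std::convert::TryInto;

// Hypothetical equivalent of read_u16_le, for illustration: same semantics,
// but the intent is expressed directly instead of relying on the optimizer
// to recognize the shift-and-or pattern.
fn read_u16_le_explicit(slice: &[u8], pos: usize) -> u16 {
    u16::from_le_bytes(slice[pos..pos + 2].try_into().unwrap())
}
```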

@jieyouxu jieyouxu added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Nov 5, 2024
@hanna-kruppe
Contributor

The read_u16_le function appears to compile to a single movzwl in both versions (and I would have been really surprised if that optimization had regressed). The other instructions that differ have no relation to read_u16_le and come from other parts of the source code; they just aren't attributed to the right source lines because debug information is very imperfect when compiling with heavy optimizations.

@saethlin saethlin removed the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Nov 5, 2024
@dianqk
Member

dianqk commented Nov 6, 2024

The function miniz_oxide::deflate::core::compress_inner has doubled its instruction count, which I suspect is related.

@dianqk dianqk self-assigned this Nov 7, 2024
@dianqk
Member

dianqk commented Nov 7, 2024

I bisected to llvm/llvm-project@166c184, but I believe that change is an improvement in its own right. It's related to loop unrolling.

If you run with RUSTFLAGS="-Copt-level=2" or RUSTFLAGS="-Cllvm-args=-unroll-threshold-aggressive=150", it will be fast again. LOL, O2 faster than O3 again.

You can also reproduce it with Rust 1.81.0 by applying the following patch:

diff --git a/miniz_oxide/src/deflate/core.rs b/miniz_oxide/src/deflate/core.rs
index 143dbe0..4af3758 100644
--- a/miniz_oxide/src/deflate/core.rs
+++ b/miniz_oxide/src/deflate/core.rs
@@ -1358,11 +1358,11 @@ impl DictOxide {
                 let p_data: u64 = self.read_unaligned_u64(p);
                 let q_data: u64 = self.read_unaligned_u64(q);
                 // Compare of 8 bytes at a time by using unaligned loads of 64-bit integers.
-                let xor_data = p_data ^ q_data;
-                if xor_data == 0 {
+                if p_data == q_data {
                     p += 8;
                     q += 8;
                 } else {
+                    let xor_data = p_data ^ q_data;
                     // If not all of the last 8 bytes matched, check how may of them did.
                     let trailing = xor_data.trailing_zeros();
 

I am not familiar with loop unrolling yet, but I guess we should not unroll for the weird edge: continue 'outer;.

@apiraino
Contributor

apiraino commented Nov 8, 2024

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-medium

@rustbot rustbot added P-medium Medium priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Nov 8, 2024
@dianqk
Member

dianqk commented Nov 11, 2024

TL;DR: This is due to LICM hoisting numerous loop invariants, which increases register pressure. We should revert that hoisting in MachineSink.

llvm/llvm-project@166c184 transforms p_data ^ q_data == 0 into p_data == q_data at miniz_oxide/src/deflate/core.rs#L1359-L1360. That opens up the opportunity to fully unroll the 32-iteration loop, which turns the successive p += 8 offsets into loop invariants that get hoisted, increasing register pressure.
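For readers who don't want to dig through the full reproducer below, the loop in question has roughly this shape (a distilled sketch; read8 stands in for DictOxide::read_unaligned_u64 and is not a real miniz_oxide function):

```rust
// Distilled shape of the 32-iteration comparison loop. LLVM 19 canonicalizes
// `(a ^ b) == 0` into `a == b`, which is what lets it fully unroll this loop
// at -O3; the hoisted per-iteration offsets then drive up register pressure.
fn match_extra_len(read8: impl Fn(usize) -> u64, mut p: usize, mut q: usize) -> Option<usize> {
    let start = p;
    for _ in 0..32 {
        let xor_data = read8(p) ^ read8(q);
        if xor_data == 0 {
            // All 8 bytes matched; advance both cursors.
            p += 8;
            q += 8;
        } else {
            // Not all 8 bytes matched; trailing_zeros tells us how many did.
            return Some(p - start + (xor_data.trailing_zeros() as usize >> 3));
        }
    }
    None // all 32 * 8 bytes matched
}
```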

A reproduced example is:

Details

use std::cmp;
use std::convert::TryInto;
use std::hint::black_box;

pub(crate) const LZ_DICT_SIZE: usize = 32_768;
const LZ_DICT_SIZE_MASK: usize = (LZ_DICT_SIZE as u32 - 1) as usize;
pub(crate) const MAX_MATCH_LEN: usize = 258;
/// Size of the buffer of lz77 encoded data.
pub const LZ_CODE_BUF_SIZE: usize = 64 * 1024;
/// Size of the output buffer.
pub const OUT_BUF_SIZE: usize = (LZ_CODE_BUF_SIZE * 13) / 10;
pub const LZ_DICT_FULL_SIZE: usize = LZ_DICT_SIZE + MAX_MATCH_LEN - 1 + 1;

/// Size of hash values in the hash chains.
pub const LZ_HASH_BITS: i32 = 15;
/// How many bits to shift when updating the current hash value.
pub const LZ_HASH_SHIFT: i32 = (LZ_HASH_BITS + 2) / 3;
/// Size of the chained hash tables.
pub const LZ_HASH_SIZE: usize = 1 << LZ_HASH_BITS;

#[inline]
const fn read_u16_le(slice: &[u8], pos: usize) -> u16 {
    // The compiler is smart enough to optimize this into an unaligned load.
    slice[pos] as u16 | ((slice[pos + 1] as u16) << 8)
}

pub struct HashBuffers {
    pub dict: [u8; LZ_DICT_FULL_SIZE],
    pub next: [u16; LZ_DICT_SIZE],
    pub hash: [u16; LZ_DICT_SIZE],
}

impl HashBuffers {
    #[inline]
    pub fn reset(&mut self) {
        *self = HashBuffers::default();
    }
}

impl Default for HashBuffers {
    fn default() -> HashBuffers {
        HashBuffers {
            dict: [0; LZ_DICT_FULL_SIZE],
            next: [0; LZ_DICT_SIZE],
            hash: [0; LZ_DICT_SIZE],
        }
    }
}

struct DictOxide {
    /// The maximum number of checks in the hash chain, for the initial,
    /// and the lazy match respectively.
    pub max_probes: [u32; 2],
    /// Buffer of input data.
    /// Padded with 1 byte to simplify matching code in `compress_fast`.
    pub b: Box<HashBuffers>,
}

impl DictOxide {
    fn new() -> Self {
        DictOxide {
            max_probes: [0, 0],
            b: Box::default(),
        }
    }

    /// Do an unaligned read of the data at `pos` in the dictionary and treat it as if it was of
    /// type T.
    #[inline]
    fn read_unaligned_u64(&self, pos: usize) -> u64 {
        // Help evade bounds/panic code check by masking the position value
        // This provides a small speedup at the cost of an instruction or two instead of
        // having to use unsafe.
        let pos = pos & LZ_DICT_SIZE_MASK;
        let bytes: [u8; 8] = self.b.dict[pos..pos + 8].try_into().unwrap();
        u64::from_le_bytes(bytes)
    }

    /// Do an unaligned read of the data at `pos` in the dictionary and treat it as if it was of
    /// type T.
    #[inline]
    fn read_as_u16(&self, pos: usize) -> u16 {
        read_u16_le(&self.b.dict[..], pos)
    }

    #[inline(never)]
    #[no_mangle]
    fn find_match(
        &self,
        lookahead_pos: usize,
        max_dist: usize,
        max_match_len: u32,
        mut match_dist: u32,
        mut match_len: u32,
    ) -> (u32, u32) {
        // Clamp the match len and max_match_len to be valid. (It should be when this is called, but
        // do it for now just in case for safety reasons.)
        // This should normally end up as at worst conditional moves,
        // so it shouldn't slow us down much.
        // TODO: Statically verify these so we don't need to do this.
        let max_match_len = cmp::min(MAX_MATCH_LEN as u32, max_match_len);
        match_len = cmp::max(match_len, 1);

        let pos = lookahead_pos & LZ_DICT_SIZE_MASK;
        let mut probe_pos = pos;
        // Number of probes into the hash chains.
        let mut num_probes_left = self.max_probes[(match_len >= 32) as usize];

        // If we already have a match of the full length don't bother searching for another one.
        if max_match_len <= match_len {
            return (match_dist, match_len);
        }

        // Read the last byte of the current match, and the next one, used to compare matches.
        let mut c01: u16 = self.read_as_u16(pos + match_len as usize - 1);
        // Read the two bytes at the end position of the current match.
        let s01: u16 = self.read_as_u16(pos);

        'outer: loop {
            let mut dist;
            'found: loop {
                num_probes_left -= 1;
                if num_probes_left == 0 {
                    // We have done as many probes in the hash chain as the current compression
                    // settings allow, so return the best match we found, if any.
                    return (match_dist, match_len);
                }

                for _ in 0..3 {
                    let next_probe_pos = self.b.next[probe_pos] as usize;

                    dist = (lookahead_pos - next_probe_pos) & 0xFFFF;
                    if next_probe_pos == 0 || dist > max_dist {
                        // We reached the end of the hash chain, or the next value is further away
                        // than the maximum allowed distance, so return the best match we found, if
                        // any.
                        return (match_dist, match_len);
                    }

                    // Mask the position value to get the position in the hash chain of the next
                    // position to match against.
                    probe_pos = next_probe_pos & LZ_DICT_SIZE_MASK;

                    if self.read_as_u16(probe_pos + match_len as usize - 1) == c01 {
                        break 'found;
                    }
                }
            }

            if dist == 0 {
                // We've looked through the whole match range, so return the best match we
                // found.
                return (match_dist, match_len);
            }

            // Check if the two first bytes match.
            if self.read_as_u16(probe_pos) != s01 {
                continue;
            }

            let mut p = pos + 2;
            let mut q = probe_pos + 2;
            // The first two bytes matched, so check the full length of the match.
            for _ in 0..32 {
                let p_data: u64 = self.read_unaligned_u64(p);
                let q_data: u64 = self.read_unaligned_u64(q);
                // Compare of 8 bytes at a time by using unaligned loads of 64-bit integers.
                let xor_data = p_data ^ q_data;
                if xor_data == 0 {
                    p += 8;
                    q += 8;
                } else {
                    // If not all of the last 8 bytes matched, check how many of them did.
                    let trailing = xor_data.trailing_zeros();

                    let probe_len = p - pos + (trailing as usize >> 3);
                    if probe_len > match_len as usize {
                        match_dist = dist as u32;
                        match_len = cmp::min(max_match_len, probe_len as u32);
                        if match_len == max_match_len {
                            // We found a match that had the maximum allowed length,
                            // so there is no point searching further.
                            return (match_dist, match_len);
                        }
                        // We found a better match, so save the last two bytes for further match
                        // comparisons.
                        c01 = self.read_as_u16(pos + match_len as usize - 1)
                    }
                    continue 'outer;
                }
            }

            return (dist as u32, cmp::min(max_match_len, MAX_MATCH_LEN as u32));
        }
    }
}

use std::time::Instant;

fn main() {
    let start = Instant::now();
    let dict_oxide = DictOxide::new();
    let mut dist = 0;
    for _ in 0..black_box(i32::MAX >> 5) {
        dist += dict_oxide
            .find_match(
                black_box(0),
                black_box(0),
                black_box(256),
                black_box(0),
                black_box(0),
            )
            .0;
    }
    let duration = start.elapsed();
    println!("{duration:?},{dist}");
}

@dianqk
Member

dianqk commented Nov 12, 2024

Upstream issue: llvm/llvm-project#115862.

@oyvindln
Contributor

I've worked around it for now in miniz_oxide with this silly workaround; maybe there is a better way to work around it, I don't know. In case anyone wants to reproduce the issue, use the extracted example in #132636 (comment) or pin miniz_oxide to a version prior to 0.8.4.

Wasn't aware of this regression until recently.
