Regression in performance (poor codegen) between 1.81 and 1.82 #132636

Open
tobz opened this issue Nov 5, 2024 · 10 comments
Assignees
dianqk
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. P-medium Medium priority regression-untriaged Untriaged performance or correctness regression. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@tobz

tobz commented Nov 5, 2024

I've observed a fairly nasty performance regression in compression throughput when switching to 1.82, which appears to boil down to some very weird/suboptimal machine code being generated.

Digging in, code that previously compiled to something reasonable now turns into... well, something far less reasonable, executing a vastly higher number of instructions for the same work and slowing everything down. The MVCE (linked below) demonstrates this behavior on at least two instruction sets (x86-64 and ARM64) and several microarchitectures (AMD Zen 2, Zen 4, and Apple M2), and it reliably reproduces the regression.

Some other folks who were helping me debug this will also be posting some more information, but the biggest initial suspect looks to be the upgrade to LLVM 19 in 1.82.

Code

The MVCE is here: https://github.com/tobz/miniz-oxide-slowdown-repro.

It's a simple program that generates a deterministic input corpus, compresses it multiple times (using flate2/miniz_oxide), and then exits. Care was taken to keep the runs deterministic and reasonably representative of the change observed between Rust 1.81 and 1.82.

In my environment, each compress operation takes ~750ms when compiled with 1.81.0 or earlier and jumps to ~1.25s per operation on 1.82.0 and later, a ~60% slowdown.
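For reference, the hot loop of such a harness looks roughly like the sketch below (a minimal illustration with made-up corpus size and iteration count, not the actual repro code; it assumes flate2 with its default pure-Rust miniz_oxide backend):

```rust
use std::io::Write;
use std::time::Instant;

use flate2::{write::ZlibEncoder, Compression};

fn main() {
    // Deterministic corpus: the same bytes on every run, so timings are comparable.
    let corpus: Vec<u8> = (0u32..64 * 1024 * 1024)
        .map(|i| (i.wrapping_mul(2_654_435_761) >> 24) as u8)
        .collect();

    for run in 0..5 {
        let start = Instant::now();
        // With default features, flate2 is backed by the pure-Rust miniz_oxide crate.
        let mut encoder = ZlibEncoder::new(Vec::new(), Compression::default());
        encoder.write_all(&corpus).unwrap();
        let compressed = encoder.finish().unwrap();
        println!("run {run}: {} bytes in {:?}", compressed.len(), start.elapsed());
    }
}
```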

Version it worked on

This "worked" (ran well) on 1.81.0, 1.80.0, and 1.79.0 (as far back as I checked).

Version with regression

rustc --version --verbose:

rustc 1.82.0 (f6e511eec 2024-10-15)
binary: rustc
commit-hash: f6e511eec7342f59a25f7c0534f1dbea00d01b14
commit-date: 2024-10-15
host: x86_64-unknown-linux-gnu
release: 1.82.0
LLVM version: 19.1.1
@tobz tobz added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Nov 5, 2024
@rustbot rustbot added I-prioritize Issue: Indicates that prioritization has been requested for this issue. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Nov 5, 2024
@apiraino
Contributor

apiraino commented Nov 5, 2024

cc @nikic perhaps more insights

@rustbot label +A-LLVM +I-slow

@rustbot rustbot added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Nov 5, 2024
@bobrik

bobrik commented Nov 5, 2024

Some analysis from my end:

  • flamegraph on v1.81: [image]

  • flamegraph on v1.82: [image]

It's clear that read_u16_le stands out.

  • perf annotate on v1.81: [image]

  • perf annotate on v1.82: [image]

That's a lot of instructions! You can find them with objdump:

objdump -S --no-show-raw-insn ~/miniz-oxide-slowdown-repro.1.81 | perl -p -e 's/^\s+(\S+):\t//;' | c++filt | less
  • v1.81:
    slice[pos] as u16 | ((slice[pos + 1] as u16) << 8)
movzwl 0x1ffff(%rcx,%rax,1),%eax
mov    %ax,0x48(%rsp)
                if num_probes_left == 0 {
lea    0xa(%r11),%rax
mov    %rax,0x90(%rsp)
xor    %ebx,%ebx
mov    %r11,%r10
mov    %r8d,%eax
mov    %rbp,0x10(%rsp)
mov    %rcx,0x20(%rsp)
mov    %eax,%edx
nopw   0x0(%rax,%rax,1)
  • v1.82 (truncated):
    slice[pos] as u16 | ((slice[pos + 1] as u16) << 8)
movzwl -0x1(%r9,%rax,1),%eax
mov    %ax,0x40(%rsp)
lea    0x2(%r15),%rax
mov    %rax,0x280(%rsp)
and    $0x7fff,%eax
mov    %rax,0x278(%rsp)
lea    0xa(%r15),%rax
mov    %rax,0x270(%rsp)
and    $0x7fff,%eax
mov    %rax,0x268(%rsp)
lea    0x12(%r15),%rax
mov    %rax,0x260(%rsp)
and    $0x7fff,%eax
mov    %rax,0x258(%rsp)
lea    0x1a(%r15),%rax
mov    %rax,0x250(%rsp)
and    $0x7fff,%eax
mov    %rax,0x248(%rsp)
lea    0x22(%r15),%rax
mov    %rax,0x240(%rsp)
and    $0x7fff,%eax
mov    %rax,0x238(%rsp)
lea    0x2a(%r15),%rax
mov    %rax,0x230(%rsp)
and    $0x7fff,%eax
mov    %rax,0x228(%rsp)
lea    0x32(%r15),%rax
mov    %rax,0x220(%rsp)
and    $0x7fff,%eax
mov    %rax,0x218(%rsp)
lea    0x3a(%r15),%rax
mov    %rax,0x210(%rsp)
and    $0x7fff,%eax
mov    %rax,0x208(%rsp)
lea    0x42(%r15),%rax
mov    %rax,0x200(%rsp)
and    $0x7fff,%eax
mov    %rax,0x1f8(%rsp)
lea    0x4a(%r15),%rax
mov    %rax,0x1f0(%rsp)
and    $0x7fff,%eax
mov    %rax,0x1e8(%rsp)
lea    0x52(%r15),%rax
mov    %rax,0x1e0(%rsp)
and    $0x7fff,%eax
mov    %rax,0x1d8(%rsp)
lea    0x5a(%r15),%rax
mov    %rax,0x1d0(%rsp)
and    $0x7fff,%eax
mov    %rax,0x1c8(%rsp)

I'm no assembly wizard, but that seems excessive and is likely the reason for the slowdown.

@randomairborne

randomairborne commented Nov 5, 2024

This function is not actually the issue; see hanna-kruppe's comment below. Just want to add that [this function does claim to rely on the compiler having certain smarts](https://github.com/Frommi/miniz_oxide/blob/0a33effd414711b379e01b0613ba5ae85a0e14d0/miniz_oxide/src/deflate/core.rs#L406-L411), in case it's relevant:

```rust
// Read the two bytes starting at pos and interpret them as an u16.
#[inline]
const fn read_u16_le(slice: &[u8], pos: usize) -> u16 {
    // The compiler is smart enough to optimize this into an unaligned load.
    slice[pos] as u16 | ((slice[pos + 1] as u16) << 8)
}
```
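For comparison only (a sketch, not code from miniz_oxide): an equivalent formulation that spells out the little-endian load via from_le_bytes. Both forms should compile to the same single unaligned 16-bit load, though this variant cannot be a const fn.

```rust
use std::convert::TryInto;

// Hypothetical equivalent of read_u16_le, for illustration: same semantics,
// but the intent is expressed directly instead of relying on the optimizer
// to recognize the shift-and-or pattern.
fn read_u16_le_explicit(slice: &[u8], pos: usize) -> u16 {
    u16::from_le_bytes(slice[pos..pos + 2].try_into().unwrap())
}
```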

@jieyouxu jieyouxu added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Nov 5, 2024
@hanna-kruppe
Contributor

The read_u16_le function appears to compile to a single movzwl in both versions (and I would have been really surprised if that optimization had regressed). The other instructions that differ have no relation to read_u16_le and come from other parts of the source code; they just aren't attributed to the right source lines because debug information is very imperfect when compiling with heavy optimizations.

@saethlin saethlin removed the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Nov 5, 2024
@dianqk
Member

dianqk commented Nov 6, 2024

The function miniz_oxide::deflate::core::compress_inner has doubled its instruction count, which I suspect is related.

@dianqk dianqk self-assigned this Nov 7, 2024
@dianqk
Member

dianqk commented Nov 7, 2024

I bisected to llvm/llvm-project@166c184, but I believe that change is an improvement in its own right. It's related to loop unrolling.

If you run with RUSTFLAGS="-Copt-level=2" or RUSTFLAGS="-Cllvm-args=-unroll-threshold-aggressive=150", it will be fast again. LOL, O2 faster than O3 again.

You can also reproduce it with Rust 1.81.0 by applying the following patch:

diff --git a/miniz_oxide/src/deflate/core.rs b/miniz_oxide/src/deflate/core.rs
index 143dbe0..4af3758 100644
--- a/miniz_oxide/src/deflate/core.rs
+++ b/miniz_oxide/src/deflate/core.rs
@@ -1358,11 +1358,11 @@ impl DictOxide {
                 let p_data: u64 = self.read_unaligned_u64(p);
                 let q_data: u64 = self.read_unaligned_u64(q);
                 // Compare of 8 bytes at a time by using unaligned loads of 64-bit integers.
-                let xor_data = p_data ^ q_data;
-                if xor_data == 0 {
+                if p_data == q_data {
                     p += 8;
                     q += 8;
                 } else {
+                    let xor_data = p_data ^ q_data;
                     // If not all of the last 8 bytes matched, check how may of them did.
                     let trailing = xor_data.trailing_zeros();
 

I am not familiar with loop unrolling yet, but I guess we should not unroll for the weird edge: continue 'outer;.

@apiraino
Contributor

apiraino commented Nov 8, 2024

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-medium

@rustbot rustbot added P-medium Medium priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Nov 8, 2024
@dianqk
Member

dianqk commented Nov 11, 2024

TL;DR: This is due to LICM hoisting numerous loop invariants, which increases register pressure. We should revert that hoisting in MachineSink.

llvm/llvm-project@166c184 transforms p_data ^ q_data == 0 into p_data == q_data at miniz_oxide/src/deflate/core.rs#L1359-L1360. That opens up the opportunity to fully unroll the 32-iteration loop, which turns the successive p += 8 offsets into loop invariants that get hoisted, increasing register pressure.
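For readers who don't want to dig through the full reproducer below, the loop in question has roughly this shape (a distilled sketch; read8 stands in for DictOxide::read_unaligned_u64 and is not a real miniz_oxide function):

```rust
// Distilled shape of the 32-iteration comparison loop. LLVM 19 canonicalizes
// `(a ^ b) == 0` into `a == b`, which is what lets it fully unroll this loop
// at -O3; the hoisted per-iteration offsets then drive up register pressure.
fn match_extra_len(read8: impl Fn(usize) -> u64, mut p: usize, mut q: usize) -> Option<usize> {
    let start = p;
    for _ in 0..32 {
        let xor_data = read8(p) ^ read8(q);
        if xor_data == 0 {
            // All 8 bytes matched; advance both cursors.
            p += 8;
            q += 8;
        } else {
            // Not all 8 bytes matched; trailing_zeros tells us how many did.
            return Some(p - start + (xor_data.trailing_zeros() as usize >> 3));
        }
    }
    None // all 32 * 8 bytes matched
}
```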

A reproduced example is:

Details

use std::cmp;
use std::convert::TryInto;
use std::hint::black_box;

pub(crate) const LZ_DICT_SIZE: usize = 32_768;
const LZ_DICT_SIZE_MASK: usize = (LZ_DICT_SIZE as u32 - 1) as usize;
pub(crate) const MAX_MATCH_LEN: usize = 258;
/// Size of the buffer of lz77 encoded data.
pub const LZ_CODE_BUF_SIZE: usize = 64 * 1024;
/// Size of the output buffer.
pub const OUT_BUF_SIZE: usize = (LZ_CODE_BUF_SIZE * 13) / 10;
pub const LZ_DICT_FULL_SIZE: usize = LZ_DICT_SIZE + MAX_MATCH_LEN - 1 + 1;

/// Size of hash values in the hash chains.
pub const LZ_HASH_BITS: i32 = 15;
/// How many bits to shift when updating the current hash value.
pub const LZ_HASH_SHIFT: i32 = (LZ_HASH_BITS + 2) / 3;
/// Size of the chained hash tables.
pub const LZ_HASH_SIZE: usize = 1 << LZ_HASH_BITS;

#[inline]
const fn read_u16_le(slice: &[u8], pos: usize) -> u16 {
    // The compiler is smart enough to optimize this into an unaligned load.
    slice[pos] as u16 | ((slice[pos + 1] as u16) << 8)
}

pub struct HashBuffers {
    pub dict: [u8; LZ_DICT_FULL_SIZE],
    pub next: [u16; LZ_DICT_SIZE],
    pub hash: [u16; LZ_DICT_SIZE],
}

impl HashBuffers {
    #[inline]
    pub fn reset(&mut self) {
        *self = HashBuffers::default();
    }
}

impl Default for HashBuffers {
    fn default() -> HashBuffers {
        HashBuffers {
            dict: [0; LZ_DICT_FULL_SIZE],
            next: [0; LZ_DICT_SIZE],
            hash: [0; LZ_DICT_SIZE],
        }
    }
}

struct DictOxide {
    /// The maximum number of checks in the hash chain, for the initial,
    /// and the lazy match respectively.
    pub max_probes: [u32; 2],
    /// Buffer of input data.
    /// Padded with 1 byte to simplify matching code in `compress_fast`.
    pub b: Box<HashBuffers>,
}

impl DictOxide {
    fn new() -> Self {
        DictOxide {
            max_probes: [0, 0],
            b: Box::default(),
        }
    }

    /// Do an unaligned read of the data at `pos` in the dictionary and treat it as if it was of
    /// type T.
    #[inline]
    fn read_unaligned_u64(&self, pos: usize) -> u64 {
        // Help evade bounds/panic code check by masking the position value
        // This provides a small speedup at the cost of an instruction or two instead of
        // having to use unsafe.
        let pos = pos & LZ_DICT_SIZE_MASK;
        let bytes: [u8; 8] = self.b.dict[pos..pos + 8].try_into().unwrap();
        u64::from_le_bytes(bytes)
    }

    /// Do an unaligned read of the data at `pos` in the dictionary and treat it as if it was of
    /// type T.
    #[inline]
    fn read_as_u16(&self, pos: usize) -> u16 {
        read_u16_le(&self.b.dict[..], pos)
    }

    #[inline(never)]
    #[no_mangle]
    fn find_match(
        &self,
        lookahead_pos: usize,
        max_dist: usize,
        max_match_len: u32,
        mut match_dist: u32,
        mut match_len: u32,
    ) -> (u32, u32) {
        // Clamp the match len and max_match_len to be valid. (It should be when this is called, but
        // do it for now just in case for safety reasons.)
        // This should normally end up as at worst conditional moves,
        // so it shouldn't slow us down much.
        // TODO: Statically verify these so we don't need to do this.
        let max_match_len = cmp::min(MAX_MATCH_LEN as u32, max_match_len);
        match_len = cmp::max(match_len, 1);

        let pos = lookahead_pos & LZ_DICT_SIZE_MASK;
        let mut probe_pos = pos;
        // Number of probes into the hash chains.
        let mut num_probes_left = self.max_probes[(match_len >= 32) as usize];

        // If we already have a match of the full length don't bother searching for another one.
        if max_match_len <= match_len {
            return (match_dist, match_len);
        }

        // Read the last byte of the current match, and the next one, used to compare matches.
        let mut c01: u16 = self.read_as_u16(pos + match_len as usize - 1);
        // Read the two bytes at the end position of the current match.
        let s01: u16 = self.read_as_u16(pos);

        'outer: loop {
            let mut dist;
            'found: loop {
                num_probes_left -= 1;
                if num_probes_left == 0 {
                    // We have done as many probes in the hash chain as the current compression
                    // settings allow, so return the best match we found, if any.
                    return (match_dist, match_len);
                }

                for _ in 0..3 {
                    let next_probe_pos = self.b.next[probe_pos] as usize;

                    dist = (lookahead_pos - next_probe_pos) & 0xFFFF;
                    if next_probe_pos == 0 || dist > max_dist {
                        // We reached the end of the hash chain, or the next value is further away
                        // than the maximum allowed distance, so return the best match we found, if
                        // any.
                        return (match_dist, match_len);
                    }

                    // Mask the position value to get the position in the hash chain of the next
                    // position to match against.
                    probe_pos = next_probe_pos & LZ_DICT_SIZE_MASK;

                    if self.read_as_u16(probe_pos + match_len as usize - 1) == c01 {
                        break 'found;
                    }
                }
            }

            if dist == 0 {
                // We've looked through the whole match range, so return the best match we
                // found.
                return (match_dist, match_len);
            }

            // Check if the two first bytes match.
            if self.read_as_u16(probe_pos) != s01 {
                continue;
            }

            let mut p = pos + 2;
            let mut q = probe_pos + 2;
            // The first two bytes matched, so check the full length of the match.
            for _ in 0..32 {
                let p_data: u64 = self.read_unaligned_u64(p);
                let q_data: u64 = self.read_unaligned_u64(q);
                // Compare of 8 bytes at a time by using unaligned loads of 64-bit integers.
                let xor_data = p_data ^ q_data;
                if xor_data == 0 {
                    p += 8;
                    q += 8;
                } else {
                    // If not all of the last 8 bytes matched, check how many of them did.
                    let trailing = xor_data.trailing_zeros();

                    let probe_len = p - pos + (trailing as usize >> 3);
                    if probe_len > match_len as usize {
                        match_dist = dist as u32;
                        match_len = cmp::min(max_match_len, probe_len as u32);
                        if match_len == max_match_len {
                            // We found a match that had the maximum allowed length,
                            // so there is no point searching further.
                            return (match_dist, match_len);
                        }
                        // We found a better match, so save the last two bytes for further match
                        // comparisons.
                        c01 = self.read_as_u16(pos + match_len as usize - 1)
                    }
                    continue 'outer;
                }
            }

            return (dist as u32, cmp::min(max_match_len, MAX_MATCH_LEN as u32));
        }
    }
}

use std::time::Instant;

fn main() {
    let start = Instant::now();
    let dict_oxide = DictOxide::new();
    let mut dist = 0;
    for _ in 0..black_box(i32::MAX >> 5) {
        dist += dict_oxide
            .find_match(
                black_box(0),
                black_box(0),
                black_box(256),
                black_box(0),
                black_box(0),
            )
            .0;
    }
    let duration = start.elapsed();
    println!("{duration:?},{dist}");
}

@dianqk
Member

dianqk commented Nov 12, 2024

Upstream issue: llvm/llvm-project#115862.

@oyvindln
Contributor

I've worked around it for now in miniz_oxide with this silly workaround; maybe there is a better way to work around it, I don't know. In case anyone wants to reproduce the issue, use the extracted example in #132636 (comment) or pin miniz_oxide to a version prior to 0.8.4.

Wasn't aware of this regression until recently.
