Regression in performance (poor codegen) between 1.81 and 1.82 #132636
Comments
Some analysis from my end:

It's clear that …

That's a lot of instructions! You can find them with …

I'm no assembly wizard, but it seems excessive and likely the reason.
This function is not actually this issue, see hanna-kruppe's comment.

Just wanna add: [this function does claim to rely on the compiler having certain smarts](https://github.com/Frommi/miniz_oxide/blob/0a33effd414711b379e01b0613ba5ae85a0e14d0/miniz_oxide/src/deflate/core.rs#L406-L411), in case it's relevant:

```rust
// Read the two bytes starting at pos and interpret them as an u16.
#[inline]
const fn read_u16_le(slice: &[u8], pos: usize) -> u16 {
    // The compiler is smart enough to optimize this into an unaligned load.
    slice[pos] as u16 | ((slice[pos + 1] as u16) << 8)
}
```
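As a point of comparison (my own sketch, not part of miniz_oxide): the same read written with `u16::from_le_bytes`, which states the intended little-endian load explicitly instead of relying on the optimizer to recognize the shift-and-or pattern:

```rust
// Equivalent read with the little-endian load made explicit; panics on an
// out-of-bounds `pos + 1`, matching the indexing behaviour of the original.
#[inline]
fn read_u16_le_explicit(slice: &[u8], pos: usize) -> u16 {
    u16::from_le_bytes([slice[pos], slice[pos + 1]])
}
```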
I bisected to llvm/llvm-project@166c184, but I believe it is an improvement. If you run with …

You can also reproduce it with rust 1.81.0 by using the following patch:

```diff
diff --git a/miniz_oxide/src/deflate/core.rs b/miniz_oxide/src/deflate/core.rs
index 143dbe0..4af3758 100644
--- a/miniz_oxide/src/deflate/core.rs
+++ b/miniz_oxide/src/deflate/core.rs
@@ -1358,11 +1358,11 @@ impl DictOxide {
                 let p_data: u64 = self.read_unaligned_u64(p);
                 let q_data: u64 = self.read_unaligned_u64(q);
                 // Compare of 8 bytes at a time by using unaligned loads of 64-bit integers.
-                let xor_data = p_data ^ q_data;
-                if xor_data == 0 {
+                if p_data == q_data {
                     p += 8;
                     q += 8;
                 } else {
+                    let xor_data = p_data ^ q_data;
                     // If not all of the last 8 bytes matched, check how may of them did.
                     let trailing = xor_data.trailing_zeros();
```
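For context, my reading of why this patch reproduces the 1.82 behaviour on 1.81 (an interpretation, not something the thread states outright): the bisected LLVM commit lets `(p ^ q) == 0` be rewritten to `p == q` even though the xor result has another use in the `else` branch, so on 1.82 both spellings below compile to the same shape, and writing the rewritten form directly in the source shows 1.81's older LLVM the same shape. The `matched_bytes_*` names are illustrative, not miniz_oxide functions:

```rust
// Two spellings of the same 8-byte comparison step.
fn matched_bytes_xor_first(p: u64, q: u64) -> usize {
    let xor_data = p ^ q;
    if xor_data == 0 {
        8 // all eight bytes matched
    } else {
        xor_data.trailing_zeros() as usize >> 3 // how many low bytes matched
    }
}

fn matched_bytes_eq_first(p: u64, q: u64) -> usize {
    if p == q {
        8
    } else {
        (p ^ q).trailing_zeros() as usize >> 3
    }
}
```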
WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-medium
TL;DR: This is due to LICM hoisting numerous loop invariants, which increases register pressure. We should revert it in MachineSink. llvm/llvm-project@166c184 transformed …
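To illustrate the failure mode in general terms (a generic sketch of loop-invariant code motion, not the IR from this issue; the function and names are mine):

```rust
// Every `a*` value below is loop-invariant, so LICM may compute all of
// them once before the loop. Each hoisted value must then stay live in a
// register across the whole loop body; with enough of them, the loop runs
// out of registers and starts spilling to the stack.
fn licm_pressure_demo(base: u64, n: u64) -> u64 {
    let mut acc = 0u64;
    for i in 0..n {
        let a0 = base.wrapping_mul(3);
        let a1 = base.wrapping_mul(5);
        let a2 = base.wrapping_mul(7);
        let a3 = base.wrapping_mul(11);
        acc = acc.wrapping_add((a0 ^ a1 ^ a2 ^ a3).wrapping_add(i));
    }
    acc
}
```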
A reproduced example is:

```rust
use std::cmp;
use std::convert::TryInto;
use std::hint::black_box;
use std::time::Instant;

pub(crate) const LZ_DICT_SIZE: usize = 32_768;
const LZ_DICT_SIZE_MASK: usize = (LZ_DICT_SIZE as u32 - 1) as usize;
pub(crate) const MAX_MATCH_LEN: usize = 258;
/// Size of the buffer of lz77 encoded data.
pub const LZ_CODE_BUF_SIZE: usize = 64 * 1024;
/// Size of the output buffer.
pub const OUT_BUF_SIZE: usize = (LZ_CODE_BUF_SIZE * 13) / 10;
pub const LZ_DICT_FULL_SIZE: usize = LZ_DICT_SIZE + MAX_MATCH_LEN - 1 + 1;
/// Size of hash values in the hash chains.
pub const LZ_HASH_BITS: i32 = 15;
/// How many bits to shift when updating the current hash value.
pub const LZ_HASH_SHIFT: i32 = (LZ_HASH_BITS + 2) / 3;
/// Size of the chained hash tables.
pub const LZ_HASH_SIZE: usize = 1 << LZ_HASH_BITS;

#[inline]
const fn read_u16_le(slice: &[u8], pos: usize) -> u16 {
    // The compiler is smart enough to optimize this into an unaligned load.
    slice[pos] as u16 | ((slice[pos + 1] as u16) << 8)
}

pub struct HashBuffers {
    pub dict: [u8; LZ_DICT_FULL_SIZE],
    pub next: [u16; LZ_DICT_SIZE],
    pub hash: [u16; LZ_DICT_SIZE],
}

impl HashBuffers {
    #[inline]
    pub fn reset(&mut self) {
        *self = HashBuffers::default();
    }
}

impl Default for HashBuffers {
    fn default() -> HashBuffers {
        HashBuffers {
            dict: [0; LZ_DICT_FULL_SIZE],
            next: [0; LZ_DICT_SIZE],
            hash: [0; LZ_DICT_SIZE],
        }
    }
}

struct DictOxide {
    /// The maximum number of checks in the hash chain, for the initial,
    /// and the lazy match respectively.
    pub max_probes: [u32; 2],
    /// Buffer of input data.
    /// Padded with 1 byte to simplify matching code in `compress_fast`.
    pub b: Box<HashBuffers>,
}

impl DictOxide {
    fn new() -> Self {
        DictOxide {
            max_probes: [0, 0],
            b: Box::default(),
        }
    }

    /// Do an unaligned read of the data at `pos` in the dictionary and treat it as if it was of
    /// type T.
    #[inline]
    fn read_unaligned_u64(&self, pos: usize) -> u64 {
        // Help evade bounds/panic code check by masking the position value
        // This provides a small speedup at the cost of an instruction or two instead of
        // having to use unsafe.
        let pos = pos & LZ_DICT_SIZE_MASK;
        let bytes: [u8; 8] = self.b.dict[pos..pos + 8].try_into().unwrap();
        u64::from_le_bytes(bytes)
    }

    /// Do an unaligned read of the data at `pos` in the dictionary and treat it as if it was of
    /// type T.
    #[inline]
    fn read_as_u16(&self, pos: usize) -> u16 {
        read_u16_le(&self.b.dict[..], pos)
    }

    #[inline(never)]
    #[no_mangle]
    fn find_match(
        &self,
        lookahead_pos: usize,
        max_dist: usize,
        max_match_len: u32,
        mut match_dist: u32,
        mut match_len: u32,
    ) -> (u32, u32) {
        // Clamp the match len and max_match_len to be valid. (It should be when this is called, but
        // do it for now just in case for safety reasons.)
        // This should normally end up as at worst conditional moves,
        // so it shouldn't slow us down much.
        // TODO: Statically verify these so we don't need to do this.
        let max_match_len = cmp::min(MAX_MATCH_LEN as u32, max_match_len);
        match_len = cmp::max(match_len, 1);

        let pos = lookahead_pos & LZ_DICT_SIZE_MASK;
        let mut probe_pos = pos;
        // Number of probes into the hash chains.
        let mut num_probes_left = self.max_probes[(match_len >= 32) as usize];

        // If we already have a match of the full length don't bother searching for another one.
        if max_match_len <= match_len {
            return (match_dist, match_len);
        }

        // Read the last byte of the current match, and the next one, used to compare matches.
        let mut c01: u16 = self.read_as_u16(pos + match_len as usize - 1);
        // Read the two bytes at the end position of the current match.
        let s01: u16 = self.read_as_u16(pos);

        'outer: loop {
            let mut dist;
            'found: loop {
                num_probes_left -= 1;
                if num_probes_left == 0 {
                    // We have done as many probes in the hash chain as the current compression
                    // settings allow, so return the best match we found, if any.
                    return (match_dist, match_len);
                }

                for _ in 0..3 {
                    let next_probe_pos = self.b.next[probe_pos] as usize;

                    dist = (lookahead_pos - next_probe_pos) & 0xFFFF;
                    if next_probe_pos == 0 || dist > max_dist {
                        // We reached the end of the hash chain, or the next value is further away
                        // than the maximum allowed distance, so return the best match we found, if
                        // any.
                        return (match_dist, match_len);
                    }

                    // Mask the position value to get the position in the hash chain of the next
                    // position to match against.
                    probe_pos = next_probe_pos & LZ_DICT_SIZE_MASK;

                    if self.read_as_u16(probe_pos + match_len as usize - 1) == c01 {
                        break 'found;
                    }
                }
            }

            if dist == 0 {
                // We've looked through the whole match range, so return the best match we
                // found.
                return (match_dist, match_len);
            }

            // Check if the two first bytes match.
            if self.read_as_u16(probe_pos) != s01 {
                continue;
            }

            let mut p = pos + 2;
            let mut q = probe_pos + 2;
            // The first two bytes matched, so check the full length of the match.
            for _ in 0..32 {
                let p_data: u64 = self.read_unaligned_u64(p);
                let q_data: u64 = self.read_unaligned_u64(q);
                // Compare 8 bytes at a time by using unaligned loads of 64-bit integers.
                let xor_data = p_data ^ q_data;
                if xor_data == 0 {
                    p += 8;
                    q += 8;
                } else {
                    // If not all of the last 8 bytes matched, check how many of them did.
                    let trailing = xor_data.trailing_zeros();

                    let probe_len = p - pos + (trailing as usize >> 3);
                    if probe_len > match_len as usize {
                        match_dist = dist as u32;
                        match_len = cmp::min(max_match_len, probe_len as u32);
                        if match_len == max_match_len {
                            // We found a match that had the maximum allowed length,
                            // so there is no point searching further.
                            return (match_dist, match_len);
                        }
                        // We found a better match, so save the last two bytes for further match
                        // comparisons.
                        c01 = self.read_as_u16(pos + match_len as usize - 1)
                    }
                    continue 'outer;
                }
            }

            return (dist as u32, cmp::min(max_match_len, MAX_MATCH_LEN as u32));
        }
    }
}

fn main() {
    let start = Instant::now();
    let dict_oxide = DictOxide::new();
    let mut dist = 0;
    for _ in 0..black_box(i32::MAX >> 5) {
        dist += dict_oxide
            .find_match(
                black_box(0),
                black_box(0),
                black_box(256),
                black_box(0),
                black_box(0),
            )
            .0;
    }
    let duration = start.elapsed();
    println!("{duration:?},{dist}");
}
```
Upstream issue: llvm/llvm-project#115862.
I've worked around it for now in miniz_oxide with this silly workaround; maybe there is a better way to work around it, I don't know. So in case anyone wants to reproduce the issue, use the extracted example in #132636 (comment) or pin miniz_oxide to a version prior to 0.8.4. I wasn't aware of this regression until recently.
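The workaround link didn't survive here, so the snippet below is only a hypothetical sketch of the general shape such a fix can take (`compare_chunks` and the use of `std::hint::black_box` are my inventions, not what miniz_oxide 0.8.4 actually ships): hide the intermediate value from the optimizer so the problematic rewrite never fires.

```rust
use std::hint::black_box;

// Hypothetical workaround sketch (NOT the real miniz_oxide 0.8.4 change):
// `black_box` makes `xor_data` opaque to the optimizer, so the
// `xor_data == 0` test cannot be rewritten into `p_data == q_data`,
// keeping the pre-LLVM-19 code shape at the cost of a tiny barrier.
#[inline]
fn compare_chunks(p_data: u64, q_data: u64) -> Option<u32> {
    let xor_data = black_box(p_data ^ q_data);
    if xor_data == 0 {
        None // all 8 bytes equal, keep scanning
    } else {
        Some(xor_data.trailing_zeros()) // position of first mismatching bit
    }
}
```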
I've observed a fairly nasty performance regression in compression when switching to 1.82, which appears to boil down to some very weird/suboptimal machine code being generated.
Delving into things, it looks like what was previously reasonable generated code is now turning into... well, something way less reasonable, and this results in executing a vastly higher number of instructions for the same task, in turn slowing everything down. The MVCE (mentioned below) demonstrates this behavior across at least two instruction sets (x86-64 and ARM64) and a number of different microarchitectures (AMD Zen2, Zen4, and Apple M2) and reliably shows a performance regression.
Some other folks who were helping me debug this will also be posting some more information, but the biggest initial suspect looks to be the upgrade to LLVM 19 in 1.82.
Code
The MVCE is here: https://github.com/tobz/miniz-oxide-slowdown-repro.
It's a simple program that generates a deterministic input corpus, compresses it multiple times (using `flate2`/`miniz_oxide`), and then exits. Efforts were made to try and ensure things are deterministic and mostly representative of the observed change between Rust 1.81 and 1.82.

On my environment, this all boils down to each compress operation taking ~750ms when compiled on 1.81.0 or earlier, jumping to ~1.25s per operation on 1.82.0 and later, representing a ~60% slowdown.
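For readers who don't want to clone the repo, the benchmark loop presumably looks roughly like this (an assumed sketch using `flate2`'s encoder API; the corpus generation and sizes here are placeholders — see the linked repository for the real harness):

```rust
use std::io::Write;
use std::time::Instant;

use flate2::write::ZlibEncoder;
use flate2::Compression;

// Assumed shape of the MVCE (placeholder corpus and sizes): build a
// deterministic input, compress it several times, and time each pass.
fn main() {
    let input: Vec<u8> = (0u32..16 * 1024 * 1024).map(|i| (i % 251) as u8).collect();
    for run in 0..5 {
        let start = Instant::now();
        let mut encoder = ZlibEncoder::new(Vec::new(), Compression::default());
        encoder.write_all(&input).unwrap();
        let compressed = encoder.finish().unwrap();
        println!("run {run}: {:?} ({} bytes)", start.elapsed(), compressed.len());
    }
}
```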
Version it worked on
This "worked" (ran well) on 1.81.0, 1.80.0, and 1.79.0 (as far back as I checked).
Version with regression
`rustc --version --verbose`: