Refactor checksum algorithm to process 4 bytes at a time #2
…allowing for increased instruction-level parallelism. Benchmarks demonstrate performance improvements upwards of 70% on my development machine.
…s by default aligned to 1, not >=N)
Oh, this is great, thanks! I've looked into doing this with manual SIMD but didn't get anywhere so far, so this is nice to have. I can confirm most benchmarks now do >5 GB/s instead of the 3-4 GB/s I was getting previously. However, the shorter benchmarks with 100-byte chunks slow down quite a bit (from ~3.4 GB/s to ~2.4 GB/s). Is that something that could be fixed? I expect short messages or blocks to be somewhat common.
Manual SIMD is the next logical step, but doing it properly, with portability, runtime detection, etc., is such a pain (at least until the std::simd RFCs stabilize). I'd expect some regression for very small buffer sizes, since this algorithm has greater loop overhead, but not to the extent you're seeing. My own bench results aligned with that hypothesis, but judging by the throughput numbers I'm seeing relative to yours, my system is just too bottlenecked to notice the change. A simple mitigation might be bailing out to the serial algorithm for short slices, but finding the crossover point would require careful tuning. I will investigate further.
Modulo operations are computationally expensive, and small inputs cannot amortize their cost effectively. The solution is to special-case short slices with a faster algorithm; a sketch of the dispatch follows below.
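Roughly like this (the names and the cutoff value are placeholders for illustration, not the PR's actual code):

```rust
const MOD: u32 = 65521;
// Hypothetical crossover point; finding the real one requires tuning.
const SERIAL_CUTOFF: usize = 64;

// The original byte-at-a-time loop (modulo folded in for brevity).
fn update_serial(mut a: u32, mut b: u32, bytes: &[u8]) -> (u32, u32) {
    for &byte in bytes {
        a = (a + u32::from(byte)) % MOD;
        b = (b + a) % MOD;
    }
    (a, b)
}

// Stand-in for the 4-way loop introduced in this PR.
fn update_4way(_a: u32, _b: u32, _bytes: &[u8]) -> (u32, u32) {
    unimplemented!("see the PR's implementation")
}

fn update(a: u32, b: u32, bytes: &[u8]) -> (u32, u32) {
    if bytes.len() < SERIAL_CUTOFF {
        update_serial(a, b, bytes) // low fixed overhead wins on short inputs
    } else {
        update_4way(a, b, bytes) // higher throughput once the overhead amortizes
    }
}
```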
I think Criterion has functionality for benchmarking and plotting with a variable input parameter. It should be possible to use that to get more insight into how performance behaves across different input lengths. I've never used it, though.
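If it does what I think, it would look something along these lines (an untested sketch based on Criterion's documented `benchmark_group`/`bench_with_input` API; the checksum function here is a stand-in for the one under test):

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};

// Stand-in for the checksum being benchmarked.
fn adler32_serial(bytes: &[u8]) -> u32 {
    const MOD: u32 = 65521;
    let (mut a, mut b) = (1u32, 0u32);
    for &byte in bytes {
        a = (a + u32::from(byte)) % MOD;
        b = (b + a) % MOD;
    }
    (b << 16) | a
}

fn bench_input_sizes(c: &mut Criterion) {
    let mut group = c.benchmark_group("adler32");
    for size in [100usize, 1_024, 16_384, 1_048_576] {
        let data = vec![0xA5u8; size];
        // report throughput so results show up as bytes/s, not just ns/iter
        group.throughput(Throughput::Bytes(size as u64));
        group.bench_with_input(BenchmarkId::from_parameter(size), &data, |b, data| {
            b.iter(|| adler32_serial(data));
        });
    }
    group.finish();
}

criterion_group!(benches, bench_input_sizes);
criterion_main!(benches);
```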
Seems like it doesn't, or at least doesn't really support what I'd like to do. Oh well.
I haven't looked into it much yet, but from a quick search I thought this seemed to line up with what you had in mind. I see that the usage of …
Gave this a first deeper look; the code looks mostly good to me.
I plugged this into miniz_oxide and saw that the benchmarks got slightly worse; can you confirm that?
Ah yeah, that should work.
👍
* input-size comparison test
* simpler + faster code
* place the new implementation behind an "unsafe" feature flag
* fix typos
I ran the miniz_oxide benchmarks with 0cb41e6, and saw slight improvements. In order to see how the use of …

Baseline (commit: dee8f1d):

```rust
let mut a = u32::from(self.a);
let mut b = u32::from(self.b);
for chunk in bytes.chunks(CHUNK_SIZE) {
    for byte in chunk {
        let val = u32::from(*byte);
        a += val;
        b += a;
    }
    a %= MOD;
    b %= MOD;
}
self.a = a as u16;
self.b = b as u16;
```

Results:
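For context, the snippets in this comment assume constants along these lines (my reconstruction: `MOD` is fixed by the Adler-32 definition, while the exact `CHUNK_SIZE` the crate uses may differ):

```rust
// Adler-32 reduces modulo the largest prime below 2^16.
const MOD: u32 = 65521;
// zlib's NMAX: the most bytes that can be summed before `b` could
// overflow a u32, so `%` only needs to run once per CHUNK_SIZE bytes.
const CHUNK_SIZE: usize = 5552;
```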
Simple 4-way version:

```rust
let (bytes, remainder) = bytes.split_at(bytes.len() - (bytes.len() % 4));
let mut a = u32::from(self.a);
let mut b = u32::from(self.b);
let mut a_vec = U32X4([0; 4]);
let mut b_vec = a_vec;
// iterate 4 bytes at a time over 4 separate sums
for chunks in bytes.chunks(CHUNK_SIZE * 4) {
    for byte_vec in chunks.chunks(4) {
        for ((&byte, av), bv) in byte_vec
            .iter()
            .zip(a_vec.0.iter_mut())
            .zip(b_vec.0.iter_mut())
        {
            *av += u32::from(byte);
            *bv += *av;
        }
    }
    b += a * chunks.len() as u32;
    a_vec %= MOD;
    b_vec %= MOD;
    b %= MOD;
}
// combine the sub-sum results
b_vec *= 4;
b_vec.0[1] += MOD - a_vec.0[1];
b_vec.0[2] += (MOD - a_vec.0[2]) * 2;
b_vec.0[3] += (MOD - a_vec.0[3]) * 3;
for &av in a_vec.0.iter() {
    a += av;
}
for &bv in b_vec.0.iter() {
    b += bv;
}
// add any remaining bytes in serial
for &byte in remainder {
    a += u32::from(byte);
    b += a;
}
a %= MOD;
b %= MOD;
self.a = a as u16;
self.b = b as u16;
```

Results:
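A short derivation of why that recombination is correct (my addition, not part of the PR): with $m$ 4-byte groups ($n = 4m$ bytes) and incoming state $a_0, b_0$, the byte in lane $i$ of group $k$ sits at overall offset $4k + i$, so the serial algorithm would weight it by $4(m-k) - i$ in `b`, while the per-lane sum `b_vec[i]` weighted it by $m - k$. Hence

$$b \equiv b_0 + n\,a_0 + \sum_{i=0}^{3} \bigl( 4\,b_{\mathrm{vec}}[i] - i\,a_{\mathrm{vec}}[i] \bigr) \pmod{65521},$$

which is exactly what the code computes; it adds `(MOD - a_vec.0[i]) * i` rather than subtracting so the intermediate values stay non-negative, which is congruent mod `MOD`.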
Simple 4-way version with unchecked indexing:

```rust
// Measure 1.
let (bytes, remainder) = bytes.split_at(bytes.len() - (bytes.len() % 4));
let mut a = u32::from(self.a);
let mut b = u32::from(self.b);
let mut a_vec = U32X4([0; 4]);
let mut b_vec = a_vec;
// iterate 4 bytes at a time over 4 separate sums
for chunks in bytes.chunks(CHUNK_SIZE * 4) {
    // Measure 2.
    let byte_vecs = chunks.chunks_exact(4);
    debug_assert_eq!(0, byte_vecs.remainder().len());
    for byte_vec in chunks.chunks(4) {
        unsafe {
            // only safe if we can guarantee byte_vec.len() >= 4,
            // which holds for this inner loop as per Measure 1.
            // and Measure 2. above
            a_vec.add_four_from_slice(byte_vec); // calls byte_vec.get_unchecked(i)
        }
        b_vec += a_vec;
    }
    b += a * chunks.len() as u32;
    a_vec %= MOD;
    b_vec %= MOD;
    b %= MOD;
}
// combine the sub-sum results
b_vec *= 4;
b_vec.0[1] += MOD - a_vec.0[1];
b_vec.0[2] += (MOD - a_vec.0[2]) * 2;
b_vec.0[3] += (MOD - a_vec.0[3]) * 3;
for &av in a_vec.0.iter() {
    a += av;
}
for &bv in b_vec.0.iter() {
    b += bv;
}
// add any remaining bytes in serial
for &byte in remainder {
    a += u32::from(byte);
    b += a;
}
a %= MOD;
b %= MOD;
self.a = a as u16;
self.b = b as u16;
```

Results:
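For completeness, here is my guess at the shape of `add_four_from_slice` (the PR's actual implementation may differ; the loop comment above only says it uses `get_unchecked`):

```rust
pub struct U32X4(pub [u32; 4]);

impl U32X4 {
    /// Safety: the caller must guarantee `slice.len() >= 4`.
    pub unsafe fn add_four_from_slice(&mut self, slice: &[u8]) {
        // Unchecked indexing removes the bounds checks from the hot
        // loop, which is the entire point of this `unsafe` variant.
        for i in 0..4 {
            self.0[i] += u32::from(*slice.get_unchecked(i));
        }
    }
}
```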
Simple 4-way version using …
Hello there, I was working on a project that's highly dependent on flate2 performance and fell down a deep, dark hole... so here I am. 😄 I was just playing around with this PR and got pretty reasonable results without using …

```rust
fn checksum_loop(a: u32, b: u32, bytes: &[u8]) -> (u32, u32) {
    let mut a = a;
    let mut b = b;
    let mut a_vec = split_sum::U32X4([0; 4]);
    let mut b_vec = a_vec;
    let chunk_iter = bytes.chunks_exact(4 * CHUNK_SIZE);
    let post_bytes = chunk_iter.remainder();
    for chunk in chunk_iter {
        for byte_vec in chunk.chunks_exact(4) {
            a_vec += split_sum::U32X4::from(byte_vec);
            b_vec += a_vec;
        }
        b += a * 4 * CHUNK_SIZE as u32;
        a_vec %= MOD;
        b_vec %= MOD;
        b %= MOD;
    }
```

Plus modifying …

I benchmarked this with …

edit: I just checked …
Adding a second clean-up loop seems to have helped. I'm still seeing a mild performance hit (compared to this PR) on the small samples; a sketch of the full structure is below.
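For illustration, one way the two clean-up loops could fit together (my reconstruction under the same assumed constants as earlier, with plain arrays standing in for `U32X4`; not the commenter's actual code):

```rust
const MOD: u32 = 65521;
const CHUNK_SIZE: usize = 5552;

fn checksum(mut a: u32, mut b: u32, bytes: &[u8]) -> (u32, u32) {
    let mut a_vec = [0u32; 4];
    let mut b_vec = [0u32; 4];

    // main loop: full 4 * CHUNK_SIZE blocks
    let mut chunk_iter = bytes.chunks_exact(4 * CHUNK_SIZE);
    for chunk in &mut chunk_iter {
        for byte_vec in chunk.chunks_exact(4) {
            for i in 0..4 {
                a_vec[i] += u32::from(byte_vec[i]);
                b_vec[i] += a_vec[i];
            }
        }
        b += a * (4 * CHUNK_SIZE) as u32;
        for i in 0..4 {
            a_vec[i] %= MOD;
            b_vec[i] %= MOD;
        }
        b %= MOD;
    }

    // clean-up loop 1: drain the remaining whole 4-byte groups
    let post_bytes = chunk_iter.remainder();
    let mut tail_iter = post_bytes.chunks_exact(4);
    let serial_tail = tail_iter.remainder();
    for byte_vec in &mut tail_iter {
        for i in 0..4 {
            a_vec[i] += u32::from(byte_vec[i]);
            b_vec[i] += a_vec[i];
        }
    }
    b += a * (post_bytes.len() - serial_tail.len()) as u32;
    for i in 0..4 {
        a_vec[i] %= MOD;
        b_vec[i] %= MOD;
    }

    // recombine the lane sums (same correction as in the PR)
    for i in 0..4 {
        b += b_vec[i] * 4 + (MOD - a_vec[i]) * i as u32;
        a += a_vec[i];
    }
    a %= MOD;
    b %= MOD;

    // clean-up loop 2: the last 0-3 bytes in serial
    for &byte in serial_tail {
        a = (a + u32::from(byte)) % MOD;
        b = (b + a) % MOD;
    }
    (a, b)
}
```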
Wow, thank you for catching this! Looking back at my writeup, you can see exactly where I went wrong:

```rust
let byte_vecs = chunks.chunks_exact(4);
debug_assert_eq!(0, byte_vecs.remainder().len());
for byte_vec in chunks.chunks(4) {
    // ...
```

That last line obviously should read:

```rust
for byte_vec in byte_vecs {
    // ...
```

No wonder; I only added unchecked indexing to that example because I was disappointed at how little …

Welp, as the saying goes: I bow to safe rustc/stdlib.
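The takeaway for readers: `chunks_exact(4)` hands the optimizer a length it can see at compile time, so safe indexing can compile to the same code as the unchecked version. A minimal illustration (hypothetical helper, not from the crate):

```rust
// With `chunks_exact(4)`, every `byte_vec` is provably 4 bytes long, so
// the compiler can typically elide the bounds checks below; with
// `chunks(4)` the last chunk may be shorter, and the checks remain.
fn add_lanes(a_vec: &mut [u32; 4], bytes: &[u8]) {
    for byte_vec in bytes.chunks_exact(4) {
        for i in 0..4 {
            a_vec[i] += u32::from(byte_vec[i]);
        }
    }
}
```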
…from @jamestwebber. Add `#![forbid(unsafe_code)]`. This greatly simplifies code changes relative to v0.2.2.
Oh hah, I didn't even notice that when I was looking at your post. I think adding SSE etc. instructions to this implementation would not be too bad, although it would be going back into …
Great job, this looks really good now! The 100-byte benchmark still regresses from 3.3622 GiB/s to 3.1368 GiB/s, but I think that's okay. I've tried the miniz_oxide benchmarks again and found that they're incredibly noisy (in a way that isn't reflected by the confidence interval printed in the bench results), so I'd say it's fine to ignore them for now. It's probably best to hold off on SIMD until there's a strong motivation for it (it would have to improve performance massively), or until there's a more ergonomic way to do SIMD in Rust.