This repository has been archived by the owner on Mar 26, 2024. It is now read-only.

Refactor checksum algorithm to process 4 bytes at a time #2

Merged
7 commits merged into master on Jul 10, 2020

Conversation

NathanaelLane
Contributor

This allows for increased instruction-level parallelism. Benchmarks demonstrate performance improvements upwards of 70% on my development machine.

@jonas-schievink
Owner

Oh this is great, thanks! I've looked into doing this with manual SIMD but didn't get anywhere so far, so this is nice to have.

I can confirm most benchmarks can now do >5 GB/s instead of the 3-4 I was getting previously. However, the shorter benchmarks with 100-byte chunks slow down quite a bit (from ~3.4 GB/s to ~2.4); is that something that could be fixed? I expect short messages or blocks to be somewhat common.

@NathanaelLane
Contributor Author

Manual SIMD is the next logical step, but doing it properly with portability, runtime detection, etc. is such a pain (at least until the std::simd RFCs stabilize). I'd expect some regression for very small buffer sizes -- there is greater loop overhead with this algorithm -- but not to the extent that you're seeing. My own bench results aligned with this hypothesis, but based on the throughput numbers I'm seeing relative to yours, it appears my system is just too bottlenecked to notice the change.

A simple mitigation might be bailing out to the serial algorithm for short slices, but finding the crossover point would require careful tuning. I will investigate further.
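If it helps, here is a minimal sketch of what that bail-out could look like (the cutoff value and all names here are placeholders of mine, not code from this crate):

const MOD: u32 = 65521;
// placeholder cutoff; the real crossover point would need benchmarking
const SHORT_SLICE_CUTOFF: usize = 64;

struct Adler32 { a: u32, b: u32 }

impl Adler32 {
    fn write_slice(&mut self, bytes: &[u8]) {
        if bytes.len() < SHORT_SLICE_CUTOFF {
            self.write_serial(bytes);
        } else {
            self.write_unrolled(bytes);
        }
    }

    // the original byte-at-a-time update
    fn write_serial(&mut self, bytes: &[u8]) {
        for &byte in bytes {
            self.a = (self.a + u32::from(byte)) % MOD;
            self.b = (self.b + self.a) % MOD;
        }
    }

    // stand-in for the 4-way loop from this PR; deferring to the serial
    // version keeps the sketch self-contained
    fn write_unrolled(&mut self, bytes: &[u8]) {
        self.write_serial(bytes);
    }
}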

Modulo operations are computationally expensive, and small inputs are not able to amortize the cost effectively.
The solution is to special-case short slices with a faster algorithm.
@jonas-schievink
Owner

I think Criterion has functionality for benchmarking and plotting using a variable input parameter. It should be possible to use that to get more insight into how performance behaves across different input lengths. I've never used it though.
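Something along these lines should be possible (untested sketch; checksum() is just a stand-in for whatever entry point the crate exposes):

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};

// placeholder for the real checksum routine under test
fn checksum(data: &[u8]) -> u32 {
    data.iter().fold(1u32, |acc, &b| acc.wrapping_add(u32::from(b)))
}

fn bench_by_size(c: &mut Criterion) {
    let mut group = c.benchmark_group("checksum-by-size");
    for &size in &[100usize, 1_024, 16_384, 1 << 20] {
        let data = vec![0xffu8; size];
        // report throughput so Criterion prints GiB/s per input size
        group.throughput(Throughput::Bytes(size as u64));
        group.bench_with_input(BenchmarkId::from_parameter(size), &data, |b, data| {
            b.iter(|| checksum(data))
        });
    }
    group.finish();
}

criterion_group!(benches, bench_by_size);
criterion_main!(benches);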

@jonas-schievink
Owner

I think Criterion has functionality for benchmarking and plotting using a variable input parameter.

Seems like it doesn't, or at least doesn't really support what I'd like to do. Oh well.

@NathanaelLane
Contributor Author

NathanaelLane commented Jul 5, 2020

I haven't looked into it much yet, but from a quick search I thought this seemed to line up with what you had in mind.

I see that the usage of align_to() is an issue for miniz_oxide -- it's technically not fundamental; I'll test performance without it when I have time to work on this tomorrow. Even if there is a noticeable performance difference, I think the approach suggested here is reasonable.

Owner

@jonas-schievink left a comment


Gave this a first deeper look; the code looks mostly good to me.

I plugged this into miniz_oxide and saw that the benchmarks got slightly worse; can you confirm that?

@jonas-schievink
Owner

I haven't looked into it much yet, but from a quick search I thought this seemed to line up with what you had in mind.

Ah yeah that should work.

I see that the usage of align_to() is an issue for miniz_oxide -- it's technically not fundamental; I'll test performance without it when I have time to work on this tomorrow. Even if there is a noticeable performance difference, I think the approach suggested here is reasonable.

👍

* input-size comparison test
* simpler + faster code
* place the new implementation behind an "unsafe" feature flag
* fix typos
@NathanaelLane
Contributor Author

NathanaelLane commented Jul 6, 2020

I plugged this into miniz_oxide and saw that the benchmarks got slightly worse; can you confirm that?

I ran the miniz_oxide benchmarks with 0cb41e6, and saw slight improvements.

In order to see how the use of unsafe {} impacts performance, I walked through a series of implementations, benchmarking along the way. My findings are as follows:

Baseline (commit: dee8f1d)

let mut a = u32::from(self.a);
let mut b = u32::from(self.b);
for chunk in bytes.chunks(CHUNK_SIZE) {
    for byte in chunk {
        let val = u32::from(*byte);
        a += val;
        b += a;
    }

    a %= MOD;
    b %= MOD;
}
self.a = a as u16;
self.b = b as u16;

Results:

simple-100b/ones-100    time:   [38.992 ns 39.578 ns 40.371 ns]                                  
                        thrpt:  [2.3069 GiB/s 2.3531 GiB/s 2.3885 GiB/s]

simple-1m/zeroes-1m     time:   [401.06 us 402.16 us 403.50 us]                                
                        thrpt:  [2.4203 GiB/s 2.4283 GiB/s 2.4349 GiB/s]

Simple 4-way version:

let (bytes, remainder) = bytes.split_at(bytes.len() - (bytes.len() % 4));
let mut a = u32::from(self.a);
let mut b = u32::from(self.b);
let mut a_vec = U32X4([0; 4]);
let mut b_vec = a_vec;

// iterate 4 bytes at a time over 4 separate sums
for chunks in bytes.chunks(CHUNK_SIZE * 4) {
    for byte_vec in chunks.chunks(4) {
        for ((&byte, av), bv) in byte_vec
            .iter()
            .zip(a_vec.0.iter_mut())
            .zip(b_vec.0.iter_mut())
        {
            *av += u32::from(byte);
            *bv += *av;
        }
    }
    b += a * chunks.len() as u32;
    a_vec %= MOD;
    b_vec %= MOD;
    b %= MOD;
}

// combine the sub-sum results
b_vec *= 4;
b_vec.0[1] += MOD - a_vec.0[1];
b_vec.0[2] += (MOD - a_vec.0[2]) * 2;
b_vec.0[3] += (MOD - a_vec.0[3]) * 3;
for &av in a_vec.0.iter() {
    a += av;
}
for &bv in b_vec.0.iter() {
    b += bv;
}

// add any remaining bytes in serial
for &byte in remainder {
    a += u32::from(byte);
    b += a;
}
a %= MOD;
b %= MOD;
self.a = a as u16;
self.b = b as u16;

Results:

simple-100b/ones-100    time:   [125.20 ns 125.45 ns 125.75 ns]                                 
                        thrpt:  [758.37 MiB/s 760.20 MiB/s 761.74 MiB/s]

simple-1m/zeroes-1m     time:   [1.1630 ms 1.1636 ms 1.1644 ms]                                 
                        thrpt:  [858.81 MiB/s 859.38 MiB/s 859.88 MiB/s]
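(Aside, for anyone checking the math behind the "combine the sub-sum results" step: writing the whole 4-aligned portion of the input as $n = 4m$ bytes $x_0,\dots,x_{n-1}$, lane $j \in \{0,1,2,3\}$ ends up holding, modulo MOD,

$$A_j = \sum_{k=0}^{m-1} x_{4k+j}, \qquad B_j = \sum_{k=0}^{m-1} (m-k)\,x_{4k+j}.$$

The serial algorithm would add $\sum_{i=0}^{n-1} (n-i)\,x_i$ to b (the extra $n \cdot a$ term for the incoming scalar a is what the b += a * chunks.len() lines accumulate), and since $n - (4k+j) = 4(m-k) - j$,

$$\sum_{i=0}^{n-1} (n-i)\,x_i = 4\sum_{j=0}^{3} B_j - \sum_{j=0}^{3} j\,A_j,$$

which is exactly what b_vec *= 4 followed by the three weighted MOD - a_vec[j] adjustments computes; the added multiples of MOD only keep the subtractions from underflowing before the final reduction.)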

Simple 4-way version with unchecked indexing:

// Measure 1.
let (bytes, remainder) = bytes.split_at(bytes.len() - (bytes.len() % 4));

let mut a = u32::from(self.a);
let mut b = u32::from(self.b);
let mut a_vec = U32X4([0; 4]);
let mut b_vec = a_vec;

// iterate 4 bytes at a time over 4 separate sums
for chunks in bytes.chunks(CHUNK_SIZE * 4) {
    // Measure 2.
    let byte_vecs = chunks.chunks_exact(4);
    debug_assert_eq!(0, byte_vecs.remainder().len());

    for byte_vec in chunks.chunks(4) {
        unsafe {
            // only safe if we can guarantee byte_vec.len()
            // >= 4, which holds for this inner loop as per
            // Measure 1. and Measure 2. above
            a_vec.add_four_from_slice(byte_vec); // calls byte_vec.get_unchecked(i) 
        }
        b_vec += a_vec;
    }
    b += a * chunks.len() as u32;
    a_vec %= MOD;
    b_vec %= MOD;
    b %= MOD;
}

// combine the sub-sum results
b_vec *= 4;
b_vec.0[1] += MOD - a_vec.0[1];
b_vec.0[2] += (MOD - a_vec.0[2]) * 2;
b_vec.0[3] += (MOD - a_vec.0[3]) * 3;
for &av in a_vec.0.iter() {
    a += av;
}
for &bv in b_vec.0.iter() {
    b += bv;
}

// add any remaining bytes in serial
for &byte in remainder {
    a += u32::from(byte);
    b += a;
}
a %= MOD;
b %= MOD;
self.a = a as u16;
self.b = b as u16;

Results:

simple-100b/ones-100    time:   [78.569 ns 78.741 ns 78.949 ns]                                 
                        thrpt:  [1.1796 GiB/s 1.1828 GiB/s 1.1854 GiB/s]

simple-1m/zeroes-1m     time:   [731.80 us 734.17 us 736.74 us]                                
                        thrpt:  [1.3255 GiB/s 1.3302 GiB/s 1.3345 GiB/s]

Simple 4-way version using align_to():

let (pre_bytes, aligned_bytes, post_bytes) = unsafe {
    // safe to do because we aren't actually reinterpreting bytes--
    // only making sure we do aligned accesses in the main loop
    bytes.align_to::<[u8; 4]>()
};

let mut a = u32::from(self.a);
let mut b = u32::from(self.b);
let mut a_vec = U32X4([0; 4]);
let mut b_vec = a_vec;

// we expect this loop to be short enough that per-element
// mod is cheaper on average
for &byte in pre_bytes {
    a += u32::from(byte);
    a -= (a >= MOD) as u32 * MOD;
    b += a;
    b -= (b >= MOD) as u32 * MOD;
}

for chunk in aligned_bytes.chunks(CHUNK_SIZE) {
    for byte_vec in chunk {
        a_vec += U32X4::from(byte_vec);
        b_vec += a_vec;
    }
    b += a * 4 * chunk.len() as u32;
    a_vec %= MOD;
    b_vec %= MOD;
    b %= MOD;
}

// combine the sub-sum results
b_vec *= 4;
b_vec.0[1] += MOD - a_vec.0[1];
b_vec.0[2] += (MOD - a_vec.0[2]) * 2;
b_vec.0[3] += (MOD - a_vec.0[3]) * 3;
for &av in a_vec.0.iter() {
    a += av;
}
for &bv in b_vec.0.iter() {
    b += bv;
}
a %= MOD;
b %= MOD;

// we expect this loop to be short enough that per-element
// mod is cheaper on average
for &byte in post_bytes {
    a += u32::from(byte);
    a -= (a >= MOD) as u32 * MOD;
    b += a;
    b -= (b >= MOD) as u32 * MOD;
}

self.a = a as u16;
self.b = b as u16;

Results:

simple-100b/ones-100    time:   [35.169 ns 35.239 ns 35.316 ns]                                  
                        thrpt:  [2.6371 GiB/s 2.6429 GiB/s 2.6481 GiB/s]

simple-1m/zeroes-1m     time:   [226.25 us 226.75 us 227.26 us]                                
                        thrpt:  [4.2971 GiB/s 4.3068 GiB/s 4.3162 GiB/s]

Conclusion:

There is a good chance that my safer 4-way implementations are not optimal, and I welcome any suggestions to improve the performance there without resorting to align_to() :-)

However, as the results stand, it seems that using align_to() is critical for unlocking performance beyond the straightforward serial sum implementation. Our specific usage can theoretically be made safe, but would likely require an API built around pi-types/const-generics.

@jamestwebber

jamestwebber commented Jul 10, 2020

Hello there, I was working on a project that's highly dependent on flate2 performance and fell down a deep dark hole... so here I am. 😄

I was just playing around with this PR and got pretty reasonable results without using align_to, instead using chunks_exact and the remainder. The code change is pretty simple, and I suspect performance could be improved:

fn checksum_loop(a: u32, b: u32, bytes: &[u8]) -> (u32, u32) {
    let mut a = a;
    let mut b = b;
    let mut a_vec = split_sum::U32X4([0; 4]);
    let mut b_vec = a_vec;
    
    let chunk_iter = bytes.chunks_exact(4 * CHUNK_SIZE);
    let post_bytes = chunk_iter.remainder();

    for chunk in chunk_iter {
        for byte_vec in chunk.chunks_exact(4) {
            a_vec += split_sum::U32X4::from(byte_vec);
            b_vec += a_vec;
        }
        b += a * 4 * CHUNK_SIZE as u32;
        a_vec %= MOD;
        b_vec %= MOD;
        b %= MOD;
    }
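    // ... remainder handling and the sub-sum combination continue as in the PR ...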

Plus modifying U32X4::from to take &[u8] instead of [u8; 4], which is fine.
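For concreteness, that conversion could look something like this (a sketch assuming the tuple-struct layout shown earlier; the plain indexing is fine because chunks_exact(4) only ever hands it 4-byte slices):

pub struct U32X4(pub [u32; 4]);

impl From<&[u8]> for U32X4 {
    fn from(bytes: &[u8]) -> Self {
        // caller guarantees bytes.len() >= 4 via chunks_exact(4)
        U32X4([
            u32::from(bytes[0]),
            u32::from(bytes[1]),
            u32::from(bytes[2]),
            u32::from(bytes[3]),
        ])
    }
}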

I benchmarked this with simple-1m. There was maybe a slight hit on simple-1m/ones-1m but it was around 3% and the zeros case was fine.

edit: I just checked simple-100b and saw bigger differences. An obvious thing I didn't bother with was trying to "vectorize" the remainder chunk, which can get pretty large.

@jamestwebber

Adding a second clean-up loop seems to have helped. Still getting a mild performance hit (compared to this PR) on the small samples.

    let chunk_iter = bytes.chunks_exact(4 * CHUNK_SIZE);
    let chunk_remainder = chunk_iter.remainder();
    
    ...  main loop ...

    let byte_iter = chunk_remainder.chunks_exact(4);
    let post_bytes = byte_iter.remainder();
    b += a * 4 * byte_iter.len() as u32;
    b %= MOD;
    
    for byte_vec in byte_iter {
        a_vec += split_sum::U32X4::from(byte_vec);
        b_vec += a_vec;
    }
    a_vec %= MOD;
    b_vec %= MOD;

    // combine the sub-sum results into the main sum
    ...

@NathanaelLane
Contributor Author

NathanaelLane commented Jul 10, 2020

Wow, thank you for catching this!

Looking back at my writeup, you can see exactly where I went wrong:

    let byte_vecs = chunks.chunks_exact(4);
    debug_assert_eq!(0, byte_vecs.remainder().len());

    for byte_vec in chunks.chunks(4) {
    //...

That last line obviously should read

    for byte_vec in byte_vecs {
    //...

No wonder -- I only added unchecked indexing to that example because I was disappointed at how little chunks_exact() appeared to help. (Turns out it does help, but only if you actually use it!) Fixing that, and using chunks_exact() for the outer loop as well, per your suggestion, indeed brings performance roughly on par with the unsafe {} code.

Welp, as the saying goes,

I fought the crab, and the crab won.

I bow to safe rustc/stdlib.

…from @jamestwebber. Add "#![forbid(unsafe_code)]". This greatly simplifies code changes relative to v0.2.2.
@jamestwebber

Looking back at my writeup, you can see exactly where I went wrong:

Oh hah I didn't even notice that when I was looking at your post.

I think adding SSE etc. instructions to this implementation would not be too bad, although it would mean going back into unsafe land, as I believe all of those intrinsics are unsafe.
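In case it helps later, a hypothetical skeleton of what that would involve (not code from this PR; update_ssse3 here just falls back to the scalar loop, while a real version would use core::arch intrinsics such as _mm_sad_epu8 / _mm_maddubs_epi16 in its body):

const MOD: u32 = 65521;

pub fn update(a: &mut u32, b: &mut u32, bytes: &[u8]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("ssse3") {
            // SAFETY: only reached after the runtime feature check above
            unsafe { update_ssse3(a, b, bytes) };
            return;
        }
    }
    update_scalar(a, b, bytes);
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn update_ssse3(a: &mut u32, b: &mut u32, bytes: &[u8]) {
    // placeholder body; the unsafe SIMD inner loop would live here
    update_scalar(a, b, bytes);
}

fn update_scalar(a: &mut u32, b: &mut u32, bytes: &[u8]) {
    for &byte in bytes {
        *a = (*a + u32::from(byte)) % MOD;
        *b = (*b + *a) % MOD;
    }
}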

@jonas-schievink
Owner

Great job, this looks really good now!

The 100-byte benchmark still regresses from 3.3622 GiB/s to 3.1368 GiB/s, but I think that's okay. I've tried the miniz_oxide benchmarks again and found out that they're incredibly noisy (in a way that isn't reflected by the confidence interval printed in the bench results), so I'd say it's fine to ignore them for now.

It's probably best to hold off on SIMD until there's a strong motivation for it (it would have to improve performance massively), or until there's a more ergonomic way to do SIMD in Rust.

@jonas-schievink merged commit 1aad4db into jonas-schievink:master on Jul 10, 2020