This repository has been archived by the owner on Mar 26, 2024. It is now read-only.

Refactor checksum algorithm to process 4 bytes at a time #2

Merged
7 commits merged into master on Jul 10, 2020

Conversation

NathanaelLane
Contributor

This allows for increased instruction-level parallelism. Benchmarks demonstrate performance improvements upwards of 70% on my development machine.

@jonas-schievink
Owner

Oh this is great, thanks! I've looked into doing this with manual SIMD but didn't get anywhere so far, so this is nice to have.

I can confirm most benchmarks can now do >5 GB/s instead of the 3-4 I was getting previously. However, the shorter benchmarks with 100-byte chunks slow down quite a bit (from ~3.4 GB/s to ~2.4); is that something that could be fixed? I expect short messages or blocks to be somewhat common.

@NathanaelLane
Contributor Author

Manual SIMD is the next logical step, but doing it properly with portability, runtime detection, etc. is such a pain (at least until the std::simd RFCs stabilize). I'd expect some regression for very small buffer sizes -- there is greater loop overhead with this algorithm -- but not to the extent that you're seeing. My own bench results aligned with this hypothesis, but based on the throughput numbers I'm seeing relative to yours, it appears my system is just too bottlenecked to notice the change.

A simple mitigation might be bailing out to the serial algorithm for short slices, but finding the crossover point would require careful tuning. I will investigate further.
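If it helps, here is a minimal sketch of what that bail-out could look like (the cutoff value and all names here are placeholders of mine, not code from this crate):

const MOD: u32 = 65521;
// placeholder cutoff; the real crossover point would need benchmarking
const SHORT_SLICE_CUTOFF: usize = 64;

struct Adler32 { a: u32, b: u32 }

impl Adler32 {
    fn write_slice(&mut self, bytes: &[u8]) {
        if bytes.len() < SHORT_SLICE_CUTOFF {
            self.write_serial(bytes);
        } else {
            self.write_unrolled(bytes);
        }
    }

    // the original byte-at-a-time update
    fn write_serial(&mut self, bytes: &[u8]) {
        for &byte in bytes {
            self.a = (self.a + u32::from(byte)) % MOD;
            self.b = (self.b + self.a) % MOD;
        }
    }

    // stand-in for the 4-way loop from this PR; deferring to the serial
    // version keeps the sketch self-contained
    fn write_unrolled(&mut self, bytes: &[u8]) {
        self.write_serial(bytes);
    }
}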

Modulo operations are computationally expensive, and small inputs are not able to amortize the cost effectively.
The solution is to special-case short slices with a faster algorithm.
@jonas-schievink
Owner

I think Criterion has functionality for benchmarking and plotting using a variable input parameter. It should be possible to use that to get more insight into how performance behaves across different input lengths. I've never used it though.
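Something along these lines should be possible (untested sketch; checksum() is just a stand-in for whatever entry point the crate exposes):

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};

// placeholder for the real checksum routine under test
fn checksum(data: &[u8]) -> u32 {
    data.iter().fold(1u32, |acc, &b| acc.wrapping_add(u32::from(b)))
}

fn bench_by_size(c: &mut Criterion) {
    let mut group = c.benchmark_group("checksum-by-size");
    for &size in &[100usize, 1_024, 16_384, 1 << 20] {
        let data = vec![0xffu8; size];
        // report throughput so Criterion prints GiB/s per input size
        group.throughput(Throughput::Bytes(size as u64));
        group.bench_with_input(BenchmarkId::from_parameter(size), &data, |b, data| {
            b.iter(|| checksum(data))
        });
    }
    group.finish();
}

criterion_group!(benches, bench_by_size);
criterion_main!(benches);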

@jonas-schievink
Owner

I think Criterion has functionality for benchmarking and plotting using a variable input parameter.

Seems like it doesn't, or at least doesn't really support what I'd like to do. Oh well.

@NathanaelLane
Contributor Author

NathanaelLane commented Jul 5, 2020

I haven't looked into it much yet, but from a quick search I thought this seemed to line up with what you had in mind.

I see that the usage of align_to() is an issue for miniz_oxide -- it's technically not fundamental; I'll test performance without it when I have time to work on this tomorrow. Even if there is a noticeable performance difference, I think the approach suggested here is reasonable.

Owner

@jonas-schievink left a comment


Gave this a first deeper look; the code looks mostly good to me.

I plugged this into miniz_oxide and saw that the benchmarks got slightly worse; can you confirm that?

@jonas-schievink
Owner

I haven't looked into it much yet, but from a quick search I thought this seemed to line up with what you had in mind.

Ah yeah that should work.

I see that the usage of align_to() is an issue for miniz_oxide -- it's technically not fundamental; I'll test performance without it when I have time to work on this tomorrow. Even if there is a noticeable performance difference, I think the approach suggested here is reasonable.

👍

* input-size comparison test
* simpler + faster code
* place the new implementation behind an "unsafe" feature flag
* fix typos
@NathanaelLane
Contributor Author

NathanaelLane commented Jul 6, 2020

I plugged this into miniz_oxide and saw that the benchmarks got slightly worse; can you confirm that?

I ran the miniz_oxide benchmarks with 0cb41e6, and saw slight improvements.

In order to see how the use of unsafe {} impacts performance, I walked through a series of implementations, benchmarking along the way. My findings are as follows:

Baseline (commit: dee8f1d)

let mut a = u32::from(self.a);
let mut b = u32::from(self.b);
for chunk in bytes.chunks(CHUNK_SIZE) {
    for byte in chunk {
        let val = u32::from(*byte);
        a += val;
        b += a;
    }

    a %= MOD;
    b %= MOD;
}
self.a = a as u16;
self.b = b as u16;

Results:

simple-100b/ones-100    time:   [38.992 ns 39.578 ns 40.371 ns]                                  
                        thrpt:  [2.3069 GiB/s 2.3531 GiB/s 2.3885 GiB/s]

simple-1m/zeroes-1m     time:   [401.06 us 402.16 us 403.50 us]                                
                        thrpt:  [2.4203 GiB/s 2.4283 GiB/s 2.4349 GiB/s]

Simple 4-way version:

let (bytes, remainder) = bytes.split_at(bytes.len() - (bytes.len() % 4));
let mut a = u32::from(self.a);
let mut b = u32::from(self.b);
let mut a_vec = U32X4([0; 4]);
let mut b_vec = a_vec;

// iterate 4 bytes at a time over 4 separate sums
for chunks in bytes.chunks(CHUNK_SIZE * 4) {
    for byte_vec in chunks.chunks(4) {
        for ((&byte, av), bv) in byte_vec
            .iter()
            .zip(a_vec.0.iter_mut())
            .zip(b_vec.0.iter_mut())
        {
            *av += u32::from(byte);
            *bv += *av;
        }
    }
    b += a * chunks.len() as u32;
    a_vec %= MOD;
    b_vec %= MOD;
    b %= MOD;
}

// combine the sub-sum results
b_vec *= 4;
b_vec.0[1] += MOD - a_vec.0[1];
b_vec.0[2] += (MOD - a_vec.0[2]) * 2;
b_vec.0[3] += (MOD - a_vec.0[3]) * 3;
for &av in a_vec.0.iter() {
    a += av;
}
for &bv in b_vec.0.iter() {
    b += bv;
}

// add any remaining bytes in serial
for &byte in remainder {
    a += u32::from(byte);
    b += a;
}
a %= MOD;
b %= MOD;
self.a = a as u16;
self.b = b as u16;

Results:

simple-100b/ones-100    time:   [125.20 ns 125.45 ns 125.75 ns]                                 
                        thrpt:  [758.37 MiB/s 760.20 MiB/s 761.74 MiB/s]

simple-1m/zeroes-1m     time:   [1.1630 ms 1.1636 ms 1.1644 ms]                                 
                        thrpt:  [858.81 MiB/s 859.38 MiB/s 859.88 MiB/s]
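(Aside, for anyone checking the math behind the "combine the sub-sum results" step: writing the whole 4-aligned portion of the input as $n = 4m$ bytes $x_0,\dots,x_{n-1}$, lane $j \in \{0,1,2,3\}$ ends up holding, modulo MOD,

$$A_j = \sum_{k=0}^{m-1} x_{4k+j}, \qquad B_j = \sum_{k=0}^{m-1} (m-k)\,x_{4k+j}.$$

The serial algorithm would add $\sum_{i=0}^{n-1} (n-i)\,x_i$ to b (the extra $n \cdot a$ term for the incoming scalar a is what the b += a * chunks.len() lines accumulate), and since $n - (4k+j) = 4(m-k) - j$,

$$\sum_{i=0}^{n-1} (n-i)\,x_i = 4\sum_{j=0}^{3} B_j - \sum_{j=0}^{3} j\,A_j,$$

which is exactly what b_vec *= 4 followed by the three weighted MOD - a_vec[j] adjustments computes; the added multiples of MOD only keep the subtractions from underflowing before the final reduction.)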

Simple 4-way version with unchecked indexing:

// Measure 1.
let (bytes, remainder) = bytes.split_at(bytes.len() - (bytes.len() % 4));

let mut a = u32::from(self.a);
let mut b = u32::from(self.b);
let mut a_vec = U32X4([0; 4]);
let mut b_vec = a_vec;

// iterate 4 bytes at a time over 4 separate sums
for chunks in bytes.chunks(CHUNK_SIZE * 4) {
    // Measure 2.
    let byte_vecs = chunks.chunks_exact(4);
    debug_assert_eq!(0, byte_vecs.remainder().len());

    for byte_vec in chunks.chunks(4) {
        unsafe {
            // only safe if we can guarantee byte_vec.len()
            // >= 4, which holds for this inner loop as per
            // Measure 1. and Measure 2. above
            a_vec.add_four_from_slice(byte_vec); // calls byte_vec.get_unchecked(i) 
        }
        b_vec += a_vec;
    }
    b += a * chunks.len() as u32;
    a_vec %= MOD;
    b_vec %= MOD;
    b %= MOD;
}

// combine the sub-sum results
b_vec *= 4;
b_vec.0[1] += MOD - a_vec.0[1];
b_vec.0[2] += (MOD - a_vec.0[2]) * 2;
b_vec.0[3] += (MOD - a_vec.0[3]) * 3;
for &av in a_vec.0.iter() {
    a += av;
}
for &bv in b_vec.0.iter() {
    b += bv;
}

// add any remaining bytes in serial
for &byte in remainder {
    a += u32::from(byte);
    b += a;
}
a %= MOD;
b %= MOD;
self.a = a as u16;
self.b = b as u16;

Results:

simple-100b/ones-100    time:   [78.569 ns 78.741 ns 78.949 ns]                                 
                        thrpt:  [1.1796 GiB/s 1.1828 GiB/s 1.1854 GiB/s]

simple-1m/zeroes-1m     time:   [731.80 us 734.17 us 736.74 us]                                
                        thrpt:  [1.3255 GiB/s 1.3302 GiB/s 1.3345 GiB/s]

Simple 4-way version using align_to():

let (pre_bytes, aligned_bytes, post_bytes) = unsafe {
    // safe to do because we aren't actually reinterpreting bytes--
    // only making sure we do aligned accesses in the main loop
    bytes.align_to::<[u8; 4]>()
};

let mut a = u32::from(self.a);
let mut b = u32::from(self.b);
let mut a_vec = U32X4([0; 4]);
let mut b_vec = a_vec;

// we expect this loop to be short enough that per-element
// mod is cheaper on average
for &byte in pre_bytes {
    a += u32::from(byte);
    a -= (a >= MOD) as u32 * MOD;
    b += a;
    b -= (b >= MOD) as u32 * MOD;
}

for chunk in aligned_bytes.chunks(CHUNK_SIZE) {
    for byte_vec in chunk {
        a_vec += U32X4::from(byte_vec);
        b_vec += a_vec;
    }
    b += a * 4 * chunk.len() as u32;
    a_vec %= MOD;
    b_vec %= MOD;
    b %= MOD;
}

// combine the sub-sum results
b_vec *= 4;
b_vec.0[1] += MOD - a_vec.0[1];
b_vec.0[2] += (MOD - a_vec.0[2]) * 2;
b_vec.0[3] += (MOD - a_vec.0[3]) * 3;
for &av in a_vec.0.iter() {
    a += av;
}
for &bv in b_vec.0.iter() {
    b += bv;
}
a %= MOD;
b %= MOD;

// we expect this loop to be short enough that per-element
// mod is cheaper on average
for &byte in post_bytes {
    a += u32::from(byte);
    a -= (a >= MOD) as u32 * MOD;
    b += a;
    b -= (b >= MOD) as u32 * MOD;
}

self.a = a as u16;
self.b = b as u16;

Results:

simple-100b/ones-100    time:   [35.169 ns 35.239 ns 35.316 ns]                                  
                        thrpt:  [2.6371 GiB/s 2.6429 GiB/s 2.6481 GiB/s]

simple-1m/zeroes-1m     time:   [226.25 us 226.75 us 227.26 us]                                
                        thrpt:  [4.2971 GiB/s 4.3068 GiB/s 4.3162 GiB/s]

Conclusion:

There is a good chance that my safer 4-way implementations are not optimal, and I welcome any suggestions to improve the performance there without resorting to align_to() :-)

However, as the results stand, it seems that using align_to() is critical for unlocking performance beyond the straightforward serial sum implementation. Our specific usage can theoretically be made safe, but would likely require an API built around pi-types/const-generics.

@jamestwebber

jamestwebber commented Jul 10, 2020

Hello there, I was working on a project that's highly dependent on flate2 performance and fell down a deep dark hole... so here I am. 😄

I was just playing around with this PR and got pretty reasonable results without using align_to, instead using chunks_exact and the remainder. The code change is pretty simple, and I suspect performance could be improved:

fn checksum_loop(a: u32, b: u32, bytes: &[u8]) -> (u32, u32) {
    let mut a = a;
    let mut b = b;
    let mut a_vec = split_sum::U32X4([0; 4]);
    let mut b_vec = a_vec;
    
    let chunk_iter = bytes.chunks_exact(4 * CHUNK_SIZE);
    let post_bytes = chunk_iter.remainder();

    for chunk in chunk_iter {
        for byte_vec in chunk.chunks_exact(4) {
            a_vec += split_sum::U32X4::from(byte_vec);
            b_vec += a_vec;
        }
        b += a * 4 * CHUNK_SIZE as u32;
        a_vec %= MOD;
        b_vec %= MOD;
        b %= MOD;
    }
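    // ... remainder handling and the sub-sum combination continue as in the PR ...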

Plus modifying U32X4::from to take &[u8] instead of [u8; 4], which is fine.
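For concreteness, that conversion could look something like this (a sketch assuming the tuple-struct layout shown earlier; the plain indexing is fine because chunks_exact(4) only ever hands it 4-byte slices):

pub struct U32X4(pub [u32; 4]);

impl From<&[u8]> for U32X4 {
    fn from(bytes: &[u8]) -> Self {
        // caller guarantees bytes.len() >= 4 via chunks_exact(4)
        U32X4([
            u32::from(bytes[0]),
            u32::from(bytes[1]),
            u32::from(bytes[2]),
            u32::from(bytes[3]),
        ])
    }
}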

I benchmarked this with simple-1m. There was maybe a slight hit on simple-1m/ones-1m but it was around 3% and the zeros case was fine.

edit: I just checked simple-100b and saw bigger differences. An obvious thing I didn't bother with was trying to "vectorize" the remainder chunk, which can get pretty large.

@jamestwebber

Adding a second clean-up loop seems to have helped. Still getting a mild performance hit (compared to this PR) on the small samples.

    let chunk_iter = bytes.chunks_exact(4 * CHUNK_SIZE);
    let chunk_remainder = chunk_iter.remainder();
    
    ...  main loop ...

    let byte_iter = chunk_remainder.chunks_exact(4);
    let post_bytes = byte_iter.remainder();
    b += a * 4 * byte_iter.len() as u32;
    b %= MOD;
    
    for byte_vec in byte_iter {
        a_vec += split_sum::U32X4::from(byte_vec);
        b_vec += a_vec;
    }
    a_vec %= MOD;
    b_vec %= MOD;

    // combine the sub-sum results into the main sum
    ...

@NathanaelLane
Contributor Author

NathanaelLane commented Jul 10, 2020

Wow, thank you for catching this!

Looking back at my writeup, you can see exactly where I went wrong:

    let byte_vecs = chunks.chunks_exact(4);
    debug_assert_eq!(0, byte_vecs.remainder().len());

    for byte_vec in chunks.chunks(4) {
    //...

That last line obviously should read

    for byte_vec in byte_vecs {
    //...

No wonder -- I only added unchecked indexing to that example because I was disappointed at how little chunks_exact() appeared to help. (Turns out it does help, but only if you actually use it!) Fixing that, and using chunks_exact() for the outer loop as well, per your suggestion, indeed brings performance roughly on par with the unsafe {} code.

Welp, as the saying goes,

I fought the crab, and the crab won.

I bow to safe rustc/stdlib.

…from @jamestwebber. Add "#![forbid(unsafe_code)]". This greatly simplifies code changes relative to v0.2.2.
@jamestwebber

Looking back at my writeup, you can see exactly where I went wrong:

Oh hah I didn't even notice that when I was looking at your post.

I think adding SSE etc. instructions to this implementation would not be too bad, although it would mean going back into unsafe land, as I believe all of those intrinsics are unsafe.
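In case it helps later, a hypothetical skeleton of what that would involve (not code from this PR; update_ssse3 here just falls back to the scalar loop, while a real version would use core::arch intrinsics such as _mm_sad_epu8 / _mm_maddubs_epi16 in its body):

const MOD: u32 = 65521;

pub fn update(a: &mut u32, b: &mut u32, bytes: &[u8]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("ssse3") {
            // SAFETY: only reached after the runtime feature check above
            unsafe { update_ssse3(a, b, bytes) };
            return;
        }
    }
    update_scalar(a, b, bytes);
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn update_ssse3(a: &mut u32, b: &mut u32, bytes: &[u8]) {
    // placeholder body; the unsafe SIMD inner loop would live here
    update_scalar(a, b, bytes);
}

fn update_scalar(a: &mut u32, b: &mut u32, bytes: &[u8]) {
    for &byte in bytes {
        *a = (*a + u32::from(byte)) % MOD;
        *b = (*b + *a) % MOD;
    }
}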

@jonas-schievink
Owner

Great job, this looks really good now!

The 100-byte benchmark still regresses from 3.3622 GiB/s to 3.1368 GiB/s, but I think that's okay. I've tried the miniz_oxide benchmarks again and found out that they're incredibly noisy (in a way that isn't reflected by the confidence interval printed in the bench results), so I'd say it's fine to ignore them for now.

It's probably best to hold off on SIMD until there's a strong motivation for it (it would have to improve performance massively), or until there's a more ergonomic way to do SIMD in Rust.

@jonas-schievink merged commit 1aad4db into jonas-schievink:master on Jul 10, 2020