Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request for a new Release #53

Closed
sagar-solana opened this issue Sep 16, 2019 · 10 comments
Closed

Request for a new Release #53

sagar-solana opened this issue Sep 16, 2019 · 10 comments
Milestone

Comments

@sagar-solana
Copy link

Hi @darrenldl!

We over at https://github.com/solana-labs/solana are seeing some major slowdowns with reed-solomon-erasure's use of rayon on 3.1.1 (possibly related to rayon-rs/rfcs#5).

Our use case hurts particularly badly since our data size is much smaller(set to MTU) than those used in the rse-benchmarks project.

We were about to fork this project to get rid of the rayon usage but I just noticed that master has moved on from rayon and doesn't rely on it anymore.
Is it possible that you can make a release any time soon ?

@sagar-solana
Copy link
Author

sagar-solana commented Sep 16, 2019

On second thought it seems as though master is much slower than the current 3.1.1 release without rayon.

Here's a comparison of master with rse_benchmark vs 3.1.1 without rayon.

On Master

benchmark_encode(500, 16, 4, 1_232, ParallelParam::new(32768)); //pparam is unused

encode :
    shards           : 16 / 4
    shard length     : 1232
    bytes per encode : 32768
    time taken       : 0.020329589
    byte count       : 9856000
    MB/s             : 462.35140624338254

On 3.1.1 the only change I've made is

--- a/reed-solomon-erasure/src/lib.rs
+++ b/reed-solomon-erasure/src/lib.rs
@@ -712,7 +712,7 @@ impl ReedSolomon {
                          input       : &[u8],
                          outputs     : &mut [&mut [u8]]) {
         outputs
-            .into_par_iter()
+            .into_iter()

and it benchmarks really well

benchmark_encode(500, 16, 4, 1_232, ParallelParam::new(32768));

encode :
    shards           : 16 / 4
    shard length     : 1232
    bytes per encode : 32768
    time taken       : 0.001602317
    byte count       : 9856000
    MB/s             : 5866.138886687216

without the into_par_iter change

benchmark_encode(500, 16, 4, 1_232, ParallelParam::new(32768)); 

encode :
    shards           : 16 / 4
    shard length     : 1232
    bytes per encode : 32768
    time taken       : 0.062636174
    byte count       : 9856000
    MB/s             : 150.06366867969936

I guess for now we're going to fork this repo and run without rayon.

@darrenldl
Copy link
Contributor

Sorry I've just started my PhD, so schedule's a bit hectic. But I am trying to clear issues this week as promised to others as well.

Rayon should be strictly an improvement, so I need to think a bit more as to why it leads to performance degradation. On my machine, Rayon scales the performance linear to number of parity shards up to number of CPU cores.
Okay I just read your comment again and saw the Rayon issue, whoops.

On second thought it seems as though master is much slower than the current 3.1.1 release without rayon.

The current master core code is implemented for polymorphic type with trait bound (I think), which is done by the various people who added the Galois 16 backend. I wasn't previously sure of the performance implication of not implementing code for u8 directly, but now you mentioned it, maybe we'll need to revert some changes.

Do you have some numbers on how slow master is compared to your 3.1.1 branch without Rayon?

@sagar-solana
Copy link
Author

Oh that's pretty great! I can imagine finding time for pet projects must be hard :)

My previous comment shows the benchmark(parameters included) on master vs the benchmark on 3.1.1 and also the massive improvement by disabling the outer rayon loop. Take a look at it again? Lmk if I can help clarify.
Tl;Dr with our parameters master benches about 5x faster than 3.1.1. But by disabling rayon on 3.1.1 it speeds up 50x.

@darrenldl
Copy link
Contributor

Whoops sorry! Yep I read it properly now.

I just remembered master (here) disables SIMD by default now, mind if you add feature `"simd-accel" for master and try again in rse-benchmark? That might remove a large part of the discrepancy.

@sagar-solana
Copy link
Author

Alright I'll give that a shot. I'm afk for the next few hours but I'll post here when I've got the numbers :)

@darrenldl
Copy link
Contributor

Thanks very much!

@sagar-solana
Copy link
Author

@darrenldl that seems to have done it!

cargo run --release --features=simd-accel

encode :
    shards           : 16 / 4
    shard length     : 1232
    bytes per encode : 32768
    time taken       : 0.001646524
    byte count       : 9856000
    MB/s             : 5708.6407865904175

Super fast now and lines up with the non-rayon impl I forked. So that makes sense now. :)

@darrenldl
Copy link
Contributor

Thanks!

Since master here doesn't use rayon as well, I'm guessing the trait way of doing things are causing slight slow down maybe, but benchmarking this lib has always been very sensitive to noise, so I'll confirm later.

In any case, 100MB/s performance degradation seems acceptable if it means the architecture can support more backends (gf8 + gf16).

@sagar-solana
Copy link
Author

I'm sure it needs a wider data set of testing to confirm but this definitely speeds everything up within margin of error of our fork of 3.1.1.

I think as soon as you're able to make a new release we'll drop our fork and switch to this.

@darrenldl darrenldl added this to the 4.0.0 release milestone Sep 17, 2019
@darrenldl
Copy link
Contributor

4.0.0 has been released, closing this issue

But feel free to let me know if there are stuff still to be resolved via here or other channels

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants