Compare BLAS and non-BLAS performance #228
Is there any way to get Intel's MKL optimizations in pure Rust without using BLAS? My guess is those would be quite difficult to replicate.
Intel MKL libraries have their own ASM routines with architecture-specific optimizations, so it's not viable to completely reproduce them in pure Rust. There's definitely room for improvement in the Rust library, though.
Thank you, that's what I thought. Perhaps someday we can get Intel itself to provide a Rust + assembly version.
With guidance I am willing to do a BLAS vs non-BLAS comparison for linfa-clustering.
Since benchmarks already exist for
@YuhanLiin Linfa Clustering results attached as zip.

Conditions:
Commands:
System specs:
hardware:
OS:
Update: linfa-preprocessing results below. These benchmarks use iai instead of criterion, so they do not automatically tell you whether the difference is statistically significant. The last set of results here has the percentage delta between the two commands. Additionally, these linfa-preprocessing benchmarks were run under WSL due to an OS-level dependency of iai. Mentioning this because I am not sure if it matters.

cargo bench -p linfa-preprocessing -F intel-mkl-static -q
cargo bench -p linfa-preprocessing -q
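Since iai does not report statistical significance, the percentage delta between the two runs has to be computed by hand. A minimal sketch of such a helper (the name `percent_delta` and the sample numbers are hypothetical, not part of the benchmark suite):

```rust
/// Percentage change from `baseline` to `candidate`:
/// positive means the candidate run used more cycles/instructions.
fn percent_delta(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}

fn main() {
    // Hypothetical estimated-cycle counts for a BLAS and a non-BLAS run.
    let blas = 1_200_000.0;
    let non_blas = 1_350_000.0;
    println!("delta: {:.1}%", percent_delta(blas, non_blas));
}
```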
Also, for linfa-linear, is the expectation that the written benchmark just use the diabetes dataset (n = 1599)? Or do we want to do something akin to a small, medium, large benchmark, say n = 1_000, 10_000, 100_000? I imagine the latter would allow us to know whether the BLAS vs non-BLAS tradeoff is also a function of sample size.
We should go with the latter. BTW, for the
Okay, I'll start a benchmarking PR for linfa-linear soon with iai per #103. That is correct: for linfa-clustering the first benchmark is with BLAS enabled and the second is without.
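The small/medium/large idea above can be sketched with plain std timing; the `workload` function below is only a stand-in for fitting the actual estimator on a generated dataset of n samples (the real version would go through criterion or iai rather than timing by hand):

```rust
use std::time::Instant;

// Stand-in workload; in the real benchmark this would be the
// model fit on a generated dataset with n samples.
fn workload(n: usize) -> f64 {
    (0..n).map(|i| (i as f64).sqrt()).sum()
}

fn main() {
    // Small, medium, large sample sizes as discussed above.
    for &n in &[1_000usize, 10_000, 100_000] {
        let start = Instant::now();
        let checksum = workload(n);
        println!("n = {:>7}: {:?} (checksum {:.1})", n, start.elapsed(), checksum);
    }
}
```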
Context
Run: cargo bench -p linfa-ica
Run: cargo bench -p linfa-ica -F intel-mkl-static
The zip file is also attached. Will dump flamegraph in #262.

Context

System specs:
hardware:
OS:
cargo bench -p linfa-ica -q
cargo bench -p linfa-ica -q -F intel-mkl-static
cargo bench -p linfa-linear -q
cargo bench -p linfa-linear -q -F intel-mkl-static
cargo bench -p linfa-pls -q
cargo bench -p linfa-pls -q -F intel-mkl-static
Why are the ICA results you posted different from mine? In your post the discrepancy for ICA isn't as high.
Some thoughts
If the post you're referring to is mine from the related PR, then it might just be the subtle difference in the conditions under which both were run, per point 4.
It actually might be a good idea to raise the significance level to .99 or 1, since we don't care about small changes to performance but rather large ones that warrant the additional code complexity from maintaining BLAS support. Additionally, it might be best practice to run the benchmarks twice to offset the chance of a spurious result.
Quick question: in the non-BLAS case we are using linfa-linalg, correct? If so, has that package itself been benchmarked against the BLAS alternative? I'm just wondering if that is a more direct comparison getting at what we are interested in here.

Rationale
It seems there are only 2 cases that would cause the performance difference between BLAS and non-BLAS.

In my view it might be better to benchmark the Rust implementation vs BLAS directly, since it would be a dependency for future algorithms. This way, if we imagine linfa-linalg has a more efficient SVD implementation than BLAS, we'd be confident that all algorithms using it instead of BLAS are going to perform better. The other approach of benchmarking the algorithms makes it seem like we'd have to benchmark each algorithm against the BLAS vs non-BLAS version to be confident that performance doesn't take a hit, as opposed to being confident that anything using the non-BLAS SVD implementation will perform better than the BLAS version.

Scenarios
Further, we can imagine two scenarios where the following holds:

Scenario I - We happen to implement algorithms that rely on algs A and B. In this case we falsely rule that we can remove BLAS dependencies. Also, by removing we have a one-time benchmark to compare performance. The community would question this as the project evolves.

Scenario II - We happen to implement algorithms dependent on algs C and D. In this case we again falsely rule that BLAS performance is overall better. However, here we update the crate that is meant to replace BLAS and now have a benchmark in place to compare the two as releases are made. In this case we confidently release, remove the BLAS dependency, and can easily track from release to release whether BLAS ever becomes substantially better over the long term (in such a case BLAS support would be warranted). Additionally, it could increase community adoption for the BLAS alternative.

Hope the explanations make sense.
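Benchmarking the linalg layer directly could start as a micro-benchmark of the shared routines themselves. Below is a pure-std stand-in using a naive matrix multiply; a real comparison would time linfa-linalg's SVD against the BLAS-backed equivalent instead:

```rust
use std::time::Instant;

// Naive n-by-n matrix multiply in row-major order, as a stand-in
// for a shared linalg kernel (the real target would be SVD etc.).
fn matmul(a: &[f64], b: &[f64], n: usize) -> Vec<f64> {
    let mut c = vec![0.0; n * n];
    for i in 0..n {
        for k in 0..n {
            let aik = a[i * n + k];
            for j in 0..n {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    c
}

fn main() {
    let n = 128;
    let a = vec![1.0; n * n];
    let b = vec![2.0; n * n];
    let start = Instant::now();
    let c = matmul(&a, &b, n);
    println!("n = {}: {:?} (c[0] = {})", n, start.elapsed(), c[0]);
}
```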
The only difference between the first and second ICA benchmarks is the charge level (the first one is half power and the second one is full power). I guess something like this can make a big difference. I'll update the ICA issue with the new results. I think what you want is to reduce the significance level, not increase it, and I'm willing to accept that. You can also increase the noise threshold. Instead of running the benchmarks twice we can just increase the sample size and measurement time to run more iterations and increase confidence in our results. Running benchmarks twice just adds confusion when we get two different results.
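Rather than eyeballing two separate runs, Criterion's baseline flags can compare a BLAS run against a non-BLAS run of the same benchmarks directly (a sketch; the flags are Criterion's, the package choice is just an example):

```shell
# Save the BLAS-enabled run as a named baseline...
cargo bench -p linfa-ica -F intel-mkl-static -- --save-baseline blas
# ...then benchmark without BLAS, comparing against that baseline.
cargo bench -p linfa-ica -- --baseline blas
```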
So how about the below to increase confidence?
1. Up the confidence level to 0.97 (default 0.95) - yielding 97% confidence that the true runtime lies within our estimated confidence interval
2. Up the measurement time to 10s (default 5s) - time allotted to finish a sample; if it is too low, Criterion will automatically increase it
3. Up the sample size to 200 (default 100) - number of samples in a run
4. Up the warmup time to 10s (default 3s) - time given to the machine to adjust to load
5. Increase the noise threshold to 5% (default 0.01) - changes less than 5% will be ignored
6. Decrease the significance level to .02 (default .05) - for hypothesis testing; in Criterion's context the default means about 5% of identical benchmarks will be considered different due to noise

If agreed I'll update the benchmarking section of the Contribute.md
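As a sketch, the settings above map onto Criterion's builder API roughly like this (assuming a criterion benchmark target; the group and function names are placeholders, and the body being benchmarked is elided):

```rust
use std::time::Duration;

use criterion::{criterion_group, criterion_main, Criterion};

fn config() -> Criterion {
    Criterion::default()
        .confidence_level(0.97)                    // default 0.95
        .measurement_time(Duration::from_secs(10)) // default 5s
        .sample_size(200)                          // default 100
        .warm_up_time(Duration::from_secs(10))     // default 3s
        .noise_threshold(0.05)                     // default 0.01
        .significance_level(0.02)                  // default 0.05
}

fn bench_fit(c: &mut Criterion) {
    c.bench_function("fit", |b| b.iter(|| { /* algorithm under test */ }));
}

criterion_group! {
    name = benches;
    config = config();
    targets = bench_fit
}
criterion_main!(benches);
```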
Try running a few benchmarks with these changes. If they don't take too long, then we can accept them. Regarding benchmarking: new algorithms using linalg routines may perform worse without BLAS than with BLAS. Practically speaking, after BLAS removal from existing algorithms, we won't care about the BLAS performance of new algorithms. Rather, we'll use benchmarks and profiling to improve the non-BLAS performance to an acceptable level. This is acceptable because most new algorithms don't use BLAS at all, so it's unlikely that we'll run into this problem in the first place.
Gotcha, that makes sense. As it pertains to the new settings, here are the results. Although I upped the measurement time to 15, I still got a warning message about either reducing the sample size or increasing the measurement time slightly for one of the parameterized benchmarks. Normal - 1min 39secs. That warning disappeared at 20 and doesn't appear if I drop the sample size down to 150 and keep the measurement time at 15. Testing was done with linfa-ica.
I don't really care about that warning. We can keep the sample size and measurement time the same. Can you make a PR for the benchmark changes so we don't need to discuss them here?
Context
System specs:
hardware:
OS:
Algorithms
First run was without BLAS, second run was with BLAS, for both algorithm results depicted below. Also ran with the updated criterion branch.

Linfa PLS
Linfa-linear
Linfa-Ica
These new benchmark results look strange. Some of the benchmarks have around a +40% runtime, which makes it seem like BLAS is a lot slower.
hmmm I reran it as a sanity check with the following changes to get the results below:
New results
Just noticed the above doesn't contain graphs. If you'd still like those, I can rerun with cargo-criterion to get them and update the benchmarking section of the contribute.md.

Linfa-ica
Linfa-linear
Linfa-pls
Now that we've made BLAS support optional on several linfa crates, we should compare the performance of those crates with and without BLAS. Doing this requires those crates to have a complete set of benchmarks that represent realistic workloads. If BLAS turns out to have no performance improvements, we can even remove BLAS support, improving code quality.

Benchmark status for each crate that supports BLAS:

- linfa-clustering:
- linfa-ica:
- linfa-reduction:
- linfa-preprocessing:
- linfa-pls:
- linfa-linear:
- linfa-elasticnet:

I'm not 100% sure about removing the optional BLAS support, because that's technically a breaking change. That type of change would be unacceptable on a crate like serde, even on a breaking release, but we might be able to get away with it.