Rust: Use rayon and better performance #203
base: master
Conversation
The issue with Rayon, or any dependency, is that it brings a lot of new dependencies that need to be audited or at least vetted. This is very undesirable in a cryptographic product securing billions of assets, and it is especially critical when supply chain attacks are increasing in appeal. Furthermore, a simple threadpool written with cryptographic/security applications in mind and fully controlled by the developers is significantly easier to maintain and update if there is a security issue.
I generally agree with you on this one. However, as I have shown in the description already, there were some questionable decisions from a soundness point of view around the thread pool as it was used, and performance was lower than it could have been. You 100% can rewrite the original code and make it faster without rayon, but achieving rayon's features in a safe, misuse-resistant way is non-trivial, and I'd rather use something that is maintained by Rust experts for this purpose (check the top contributors to the rayon project, they are literally working on Rust itself) than see every library invent its own slightly flawed version of the same features that doesn't integrate as well with the rest of the Rust ecosystem. I would estimate the risk of relying on …
```diff
@@ -25,8 +25,7 @@ macro_rules! pippenger_test_mod {
         let mut scalars = Box::new([0u8; nbytes * npoints]);
         ChaCha20Rng::from_seed([0u8; 32]).fill_bytes(scalars.as_mut());
 
-        let mut points: Vec<$point> = Vec::with_capacity(npoints);
-        unsafe { points.set_len(points.capacity()) };
+        let mut points = vec![<$point>::default(); npoints];
```
ditto: avoid useless zero init
Left because it is test code
```diff
@@ -61,8 +60,7 @@ macro_rules! pippenger_test_mod {
         let mut scalars = Box::new([0u8; nbytes * npoints]);
         ChaCha20Rng::from_seed([0u8; 32]).fill_bytes(scalars.as_mut());
 
-        let mut points: Vec<$point> = Vec::with_capacity(npoints);
-        unsafe { points.set_len(points.capacity()) };
+        let mut points = vec![<$point>::default(); npoints];
```
ditto: avoid useless zero init
Left because it is test code
Thanks for the review @mratsim! If the rest of the stuff is fine, I can try to remove zero initialization and see if it makes a performance difference; if yes, then I'll look into refactoring things in a way that makes the code easier to audit.
Having documentation at least in the C code would have been great, ideally in the bindings as well. Right now, without expanding the C macros in my head, I have no idea that `blst_p1s_to_affine` initializes memory internally and is guaranteed to never read it prior to that, which to me is horrifying from a security point of view, especially when there is an easy way to avoid it.
… newer version of Rust
…dependency and unnecessary/unused `no-threads` feature
I refactored `blst_fp12::miller_loop_n` into something that is close to the original implementation.
For zero-initialized vectors I wrote this simple bench:
```rust
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_alloc(c: &mut Criterion) {
    let mut group = c.benchmark_group("allocation");
    // Allocate and expose the buffer via `set_len` without initializing it.
    group.bench_function("alloc/set_size", |b| {
        b.iter(|| {
            let mut points = Vec::<u8>::with_capacity(criterion::black_box(1_000_000));
            unsafe { points.set_len(points.capacity()) };
            criterion::black_box(points);
        });
    });
    // Allocate the same buffer zero-initialized.
    group.bench_function("alloc/zeroed", |b| {
        b.iter(|| {
            let points = vec![0u8; criterion::black_box(1_000_000)];
            criterion::black_box(points);
        });
    });
    group.finish();
}

criterion_group!(benches, bench_alloc);
criterion_main!(benches);
```
The results were like this:
```
allocation/alloc/set_size    time: [12.233 ns 12.276 ns 12.321 ns]
allocation/alloc/zeroed      time: [11.259 µs 11.263 µs 11.267 µs]
```
It seemed significant, so I decided to partially revert to `set_len`, but I moved some of those calls to after the function that initializes the memory, to make it clearer to the reader what is actually happening there. I only left zero initialization in 2 places in tests where performance doesn't matter as much.
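For illustration, here is a minimal sketch of how that pattern can be wrapped so that `set_len` only ever runs after the memory has been written. The helper name `init_with` and its closure-based API are hypothetical, not code from this PR; an FFI initializer such as `blst_p1s_to_affine` would be the thing called inside the closure:

```rust
use std::mem::MaybeUninit;

/// Allocate `n` elements without zeroing, let `init` fill them, and only then
/// publish the length. Hypothetical helper, not part of this PR.
fn init_with<T>(n: usize, init: impl FnOnce(&mut [MaybeUninit<T>])) -> Vec<T> {
    let mut v: Vec<T> = Vec::with_capacity(n);
    // Hand the uninitialized spare capacity to the initializer.
    init(&mut v.spare_capacity_mut()[..n]);
    // SAFETY: `init` is required to have fully initialized all `n` elements.
    unsafe { v.set_len(n) };
    v
}
```

This keeps the `unsafe` confined to one place while still avoiding the zeroing cost measured above.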
And I applied relaxed ordering in the places where it was suggested.
@mratsim, I appreciate the careful review; hopefully you'll have time to check the diff.
```diff
@@ -25,8 +25,7 @@ macro_rules! pippenger_test_mod {
         let mut scalars = Box::new([0u8; nbytes * npoints]);
         ChaCha20Rng::from_seed([0u8; 32]).fill_bytes(scalars.as_mut());
 
-        let mut points: Vec<$point> = Vec::with_capacity(npoints);
-        unsafe { points.set_len(points.capacity()) };
+        let mut points = vec![<$point>::default(); npoints];
```
Left because it is test code
```diff
@@ -61,8 +60,7 @@ macro_rules! pippenger_test_mod {
         let mut scalars = Box::new([0u8; nbytes * npoints]);
         ChaCha20Rng::from_seed([0u8; 32]).fill_bytes(scalars.as_mut());
 
-        let mut points: Vec<$point> = Vec::with_capacity(npoints);
-        unsafe { points.set_len(points.capacity()) };
+        let mut points = vec![<$point>::default(); npoints];
```
Left because it is test code
Sorry, I forgot to add my review.
On the technical side, LGTM.
Now on whether to use Rayon or keep BLST's own Threadpool, I'll leave the decision to the team.
Just confirmed that the performance improvements here translate to performance improvements downstream in our usage of https://github.com/sifraitech/rust-kzg (uses blst) in https://github.com/subspace/subspace, both of which also take advantage of Rayon.
I don't know whether commitment, proof generation or verification is the bottleneck for your use-case, but @sauliusgrigaitis might be able to point you to how best to use rust-kzg, i.e. either through native BLST or reimplemented+parallelized BLST (i.e. using their own multithreading strategy). Also, they've been working on a new set of benchmarks. I have my own in mratsim/constantine#304 (comment), but I don't cover parallel BLST.
@dot-asm is there something I can do to interest you in reviewing this?
This is honestly frustrating. There are merge conflicts again and no review for clear performance and API improvements in half a year 😕 @dot-asm, any chance you can look into this any time soon? Any feedback at all would be nice.
As correctly suggested by @mratsim, the threadpool was chosen for having minimum dependencies. Another important reason is that the same parallelization pattern is used in both the Rust and Go bindings, as a way to minimize the maintenance cost: we only need to understand one pattern, and solving a problem applies to both implementations. It also gives better control by being more predictable. For example, consider:

```rust
use std::sync::atomic::*;

fn main() {
    rayon::spawn(move || {
        loop {}
    });
    let counter = AtomicUsize::new(0);
    rayon::scope(|scope| {
        scope.spawn_broadcast(|_, _| {
            println!("{}", counter.fetch_add(1, Ordering::Relaxed));
        });
    });
    println!("Hello, {} cpus!", counter.load(Ordering::Acquire));
}
```
The key point is that `scope.spawn_broadcast` never finishes here. In other words, the invocations in question depend on progress being made by independent threads, which is unsustainable in a real-life application context.

As for benchmark performance, it appears to be system-dependent and even circumstantial. On some systems I failed to note significant improvements. On some it's indeed better, but only marginally. On some systems I can observe significant variations from run to run with either approach, with the original occasionally outperforming the suggested approach on some points. It ought to be because of unfortunate interplay with the OS scheduler and power management... Either way, it made me dive in until my eardrums ruptured, and I found that, if modified, the threadpool can perform way better on larger systems. Here is a sample from a 128-core one:
vs. this PR's (which is about the same as the main branch on average)
On smaller systems, such as personal computers, it's kind of all over the place. In some points the difference is not as significant, in others it is... If you want to test it on your system, add
to your .cargo/config.toml. It needs more work and testing before bringing it to the maintainers' attention, but one can test... I mean, do test :-)
I understand the appeal of minimal dependencies, but as described above
This makes sense as well, but I should note that this shouldn't be a hard requirement since languages are in fact substantially different and translating code directly from one language to another can and does often lead to awkward ergonomics and even performance issues. Not to say that tricky
That code contains literally an infinite loop; it is a problem one way or another. The whole point is that those are NOT independent threads, they are specific threads on a fixed-size thread pool that is context-dependent. It is a much more flexible approach, as I have described in the original post, than assuming the library can use all of the cores and letting the OS scheduler figure out where to run it, or forcing all usages onto a single CPU core, with nothing in between.
I consider this a good thing, because it didn't regress and it provides substantial flexibility that is not possible otherwise without changing the library. I do notice large improvements in performance from pinning threads to cores on Threadripper/Epyc and even Ryzen 7/9 CPUs in many cases (crossing chiplets and sockets is very costly sometimes), yet currently, because the library uses its own independent thread pool, I can't do anything about it without forking.
Very nice! The changes make a lot of sense to me; for sensitive contexts a channel is not the most efficient data structure, and a simple Mutex/Condvar pair is all you really need. I wouldn't be surprised if similar gains can be made in rayon as well.
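For readers unfamiliar with the pattern being referred to, here is a generic, hedged sketch of a Mutex/Condvar completion wait; it is illustrative only and not the actual blst threadpool code (it spawns plain threads rather than using a pool):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Run each job on its own thread and block until all of them have finished,
/// using a counter guarded by a Mutex plus a Condvar instead of a channel.
fn run_and_wait(jobs: Vec<Box<dyn FnOnce() + Send>>) {
    let state = Arc::new((Mutex::new(jobs.len()), Condvar::new()));
    for job in jobs {
        let state = Arc::clone(&state);
        thread::spawn(move || {
            job();
            let (remaining, done) = &*state;
            let mut remaining = remaining.lock().unwrap();
            *remaining -= 1;
            if *remaining == 0 {
                done.notify_one();
            }
        });
    }
    // Wait until every job has reported completion.
    let (remaining, done) = &*state;
    let mut remaining = remaining.lock().unwrap();
    while *remaining > 0 {
        remaining = done.wait(remaining).unwrap();
    }
}
```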
Come on, just imagine, say, a socket listener instead of the infinite loop. Or replace it with a sleep for a second and perform signature verification in the `spawn_broadcast`. Are you OK with the signature verification taking a full second? I'm not. A library simply may not bring this upon the user.
You misunderstand what rayon is and what the use cases for it are. I'll quote the first line of their readme:
If you use it for a socket listener you're using it wrong. Rayon is for compute-intensive work; for I/O you'll be using a regular thread or a dedicated I/O thread pool (that can in fact use …
This is another great example of using the wrong tool for the job. Neither of these examples is directly applicable to what this PR does. Yes, rayon will use the global thread pool by default, but you also have the option to create a custom one as described in the very first post, such that random sleeps and I/O in other parts of the app are not fundamentally blocking any work that is being done by …
The examples were obviously exaggerated to illuminate the thesis that `scope.spawn_broadcast` is considered an inappropriate choice for this library, because we don't make the assumption that the application makes calls from the same thread, and the trouble is that faster operations get serialized by slower ones. In case you say that the current approach serializes operations too: well, not if you have, say, a 30-thread operation and a 2-thread operation on a 32-core system.
Every single example you provided was a literal misuse of the library for things that it is not designed to be used for, while this PR uses it exactly for its direct purpose: relatively short-lived parallelizable compute that doesn't have sleeps or I/O, etc. I have no idea how you came to the conclusion "considered an inappropriate choice for this library"; I can't connect the dots 😕
Being afraid that someone uses it for sockets or puts useless sleeps in …
To give you a very specific example, we use blst as an indirect dependency in https://github.com/autonomys/space-acres. In this desktop app we have an option to limit CPU usage to just half of the cores, so the user can keep the app running and not experience a sluggish desktop due to high CPU utilization, even though the app is still making decent progress. Right now we use blst from a fork with this PR included so we can reduce the thread pool size and pin it to a specific set of cores depending on CPU topology (NUMA, L3/L2 cache locality, SMT/no-SMT cores on hybrid Intel CPUs, etc.) to ensure a great user experience. Out of probably a thousand Rust dependencies … If the PR is not accepted we will either have to switch to upstream and affect the user experience, or maintain a repeatedly diverging fork with this change set in it ourselves. Neither of these options is ideal, which is why I opened a PR and tried to not only keep the performance the same, but even improve it in some cases, as well as improve code quality.
Once again, if you don't assume that all calls to this library are made from the same thread, which we don't, then a `spawn_broadcast` call [made by this library] that otherwise takes a pair of threads and 1 ms, made 1 ms after a 5-ms, otherwise 30-thread one [also made by this library], will finish in 4+ ms. As for affinity, the threadpool does take the process' affinity mask into consideration upon instantiation.
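To make the effect being described concrete, here is a small, hedged sketch of that scenario on rayon's global pool; the durations and `sleep` stand-ins are illustrative, and this is not code from either implementation:

```rust
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    // A long, pool-wide operation issued from one application thread.
    let long = thread::spawn(|| {
        rayon::scope(|s| {
            s.spawn_broadcast(|_, _| thread::sleep(Duration::from_millis(5)));
        });
    });

    // 1 ms later, another application thread issues a short pool-wide operation.
    thread::sleep(Duration::from_millis(1));
    let started = Instant::now();
    rayon::scope(|s| {
        s.spawn_broadcast(|_, _| thread::sleep(Duration::from_millis(1)));
    });
    // The second broadcast cannot run until every pool thread is free, so this
    // typically prints something closer to 5 ms than to 1 ms.
    println!("short operation took {:?}", started.elapsed());

    long.join().unwrap();
}
```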
What you're writing makes sense hypothetically, in a vacuum. I already explained in this thread a few times why this is a hypothetical issue. If you really have designed an app where you have literally no idea what is running, how and where (in which case I'm really sorry), you can still create an explicit thread pool and call blst under it to ensure no unexpected overlap with other operations:

```rust
let thread_pool = rayon::ThreadPoolBuilder::default().build().unwrap();

thread_pool.install(|| {
    // Do blst stuff on a dedicated thread pool here
});
```

Read the docs about it here: https://docs.rs/rayon/latest/rayon/struct.ThreadPool.html#method.install
That is not a replacement for what I'm asking, not even close. To give you another example, in https://github.com/autonomys/subspace we also check the topology of the CPU and split work into multiple thread pools, each pinned to a specific NUMA node (mostly Intel Xeon) or CCX (on AMD Threadripper/Epyc) to ensure threads do not do expensive migration, which results in significant performance wins. All this is possible with the changes I'm proposing and impossible with the stock blst. You just can't make those decisions at the process level. You know about compute performance way more than I do; not sure why I have to explain this in so much detail.
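As a hedged sketch of the kind of caller-side control being described (assuming this PR's behavior where blst picks up the ambient rayon pool; the `core_affinity` crate and the core-selection logic are illustrative, not code from subspace):

```rust
use core_affinity::CoreId;
use rayon::{ThreadPool, ThreadPoolBuilder};

/// Build a rayon pool whose worker threads are pinned to the given cores,
/// e.g. the cores of a single NUMA node or CCX.
fn pinned_pool(cores: Vec<usize>) -> ThreadPool {
    ThreadPoolBuilder::new()
        .num_threads(cores.len())
        .start_handler(move |index| {
            // Pin each worker thread to one of the selected cores.
            core_affinity::set_for_current(CoreId { id: cores[index] });
        })
        .build()
        .expect("failed to build pinned thread pool")
}

fn main() {
    let pool = pinned_pool(vec![0, 1, 2, 3]);
    pool.install(|| {
        // With this PR applied, blst operations called here would run on the
        // pinned pool instead of a library-private one.
    });
}
```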
Just to give some perspective on my claims, I did benchmarks of this PR on an AMD 7970X (32C64T, 4 CCDs). Using all CPU cores (at the mercy of default thread scheduling):
Limiting to just 1 CCD (8C16T):
As you can see, depending on the benchmark there is potential to do 1.5x-3.5x more work with this PR in the same app by strategically pinning CPU cores for different thread pools, which is not possible with the current threadpool.

Is there any chance this PR can be merged, or am I wasting my time here?
This PR introduces a bunch of changes that target better performance and higher code quality.
Performance
Before:
After:
The reason is most likely higher parallelism and not using channels in those operations.
Rayon
Rayon doesn't always result in better performance than manual thread management (see rayon-rs/rayon#1124), but the changes in this PR do result in higher performance.
There are many great benefits to rayon in the way it is used here, though:

- safe API: the previous thread pool usage was either unsound or sound by accident; comments were claiming things like:
which, while it worked by accident, was a lie (nothing was actually joined in most cases, but it worked in practice due to the use of channels to send work back to the parent thread), a fundamentally unsound API (a parent thread panic would result in worker threads accessing an undefined memory location), and generally a major footgun for anyone who would try to change the code in any way
- support for customizing the number of threads using environment variables (see the sketch after this list)
- support for customizing the number of threads and their pinning to CPU cores depending on the thread pool from which operations are called (by default the global thread pool is used, and the default behavior is equivalent to the old code)
- avoids conflicting/overlapping with other threads if used from within parallel code that already uses rayon
- no need to special-case platforms like WASM; if a specific WASM environment supports threading, blst will now enjoy it automatically
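A short sketch of the thread-count customization mentioned in the list above; `RAYON_NUM_THREADS` and `build_global` are rayon's standard mechanisms, and the blst call site is only implied:

```rust
fn main() {
    // Either export RAYON_NUM_THREADS=4 in the environment, or configure the
    // global pool programmatically before it is first used:
    rayon::ThreadPoolBuilder::new()
        .num_threads(4)
        .build_global()
        .expect("global thread pool was already initialized");

    println!("rayon will use {} threads", rayon::current_num_threads());
    // Any blst operation parallelized via rayon now uses at most 4 threads.
}
```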
Various cleanups

I changed all but one `channel` usage to cleaner code using `Mutex`, `Arc` usage was removed as unnecessary, and some other bookkeeping was cleaned up after understanding what the code does and rewriting the intention in more idiomatic Rust code (see `fn from` for example). The only place where `channel` still remains is `fn mult`, but it is quite difficult code and I didn't have the will to decipher what it tries to do to get rid of the channel; hopefully the examples in this PR will prompt someone to clean it up at some later point.

Some `unsafe` blocks with `vec.set_len()` are moved to after the memory is actually initialized, and some places questionable in terms of soundness were refactored such that they are quite obviously sound now. Some remaining places can also be fixed, but they require pointer arithmetic that might require newer versions of Rust, so I left TODOs there.

I also took the liberty to refactor things like `&q[0]` to `q.as_ptr()`, since while both result in the same pointer, the former is deceptive because it suggests to the reader that only the first element is used, while in practice the whole `q` is used by the function that is called later (documentation is almost non-existent, so it wasn't obvious at first, and I hope this improves readability).
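To make the `&q[0]` vs `q.as_ptr()` point concrete, a tiny standalone illustration using a plain `u64` buffer instead of blst's point types:

```rust
fn main() {
    let q: Vec<u64> = vec![1, 2, 3];
    // Both expressions yield the same address...
    let first: *const u64 = &q[0];
    let whole: *const u64 = q.as_ptr();
    assert_eq!(first, whole);
    // ...but only `as_ptr()` reads as "pointer to the whole buffer", and
    // `&q[0]` would additionally panic if the vector were empty.
}
```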
no-threads

`no-threads` was a hack and non-idiomatic, because it violates the "additivity" assumptions of Rust features. Thankfully it is no longer necessary and can be safely removed, but since this is a breaking change (the only one in this PR to the best of my knowledge), I bumped the crate version from `0.3.11` to `0.4.0`.

How to review this PR
This PR is split into self-contained, logically structured commits, and it is recommended to review the commits one by one instead of the complete diff; this way it is much easier to understand what was changed and why.
Let me know if there are any questions, I'm happy to explain each line change in detail.