Stack-based thread-local storage #941
One thing to keep in mind is that Rayon has absolutely no idea about your stack or what you're borrowing, because we don't have any compiler integration here. As a pure library, we only know basic type-system details about your code, and even that is only expressed in constraints like `Send` and `Sync`. In your example, Rayon doesn't know anything about the locals your closures are borrowing from. That doesn't mean this is impossible, per se, just that we may need to reframe how to think about this.
I'm not sure what you mean, because generally an inner closure can borrow from outer locals. But in a parallel context, you usually can't borrow mutably, as that makes the closure `FnMut` rather than the `Fn` that parallel calls require. Can you sketch out a more rayon-like example, and try to give details on how you think that thread-local would work?
I think the challenge is that what I want to do requires that I have a per-thread context I can use from within a task. I think that means I need to use a `thread_local!`. The alternative, which is closer to what I posted in my code example, is that I get the ability to create per-thread closures that then explicitly grab work from a queue (so that they can pass stack-local things to it). Rayon does allow me to create threads and it also has automatic work-stealing, but I don't see where they explicitly meet in any exposed interface.
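For comparison, the standard `thread_local!` route looks something like this (a minimal sketch; `Scratch` is a hypothetical type standing in for the per-thread context):

```rust
use std::cell::RefCell;

// Hypothetical scratch type standing in for the per-thread context.
struct Scratch { buf: Vec<u8> }

thread_local! {
    // Statically declared and lazily allocated per thread -- it outlives
    // the parallel call, which is exactly the downside under discussion.
    static SCRATCH: RefCell<Scratch> = RefCell::new(Scratch { buf: Vec::new() });
}

fn work(item: u64) {
    SCRATCH.with(|s| {
        let mut s = s.borrow_mut();
        s.buf.clear();
        // ... use s.buf as scratch space for this item ...
        let _ = item;
    });
}
```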
I JUST ran into this issue. I need to run an algorithm that needs some scratch space, but I only want to allocate one scratch space per thread. Then the algorithm can just reuse that scratch space. This works without rayon, but with rayon there's no way to "get the scratch space for this iteration/thread".
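For what it's worth, rayon's `for_each_init` adaptor (and its `map_init` sibling) targets roughly this use case: the init closure builds a scratch value that rayon reuses across many items, though per internal job rather than strictly once per thread. A minimal sketch:

```rust
use rayon::prelude::*;

fn main() {
    let data: Vec<u64> = (0..1_000).collect();

    data.par_iter().for_each_init(
        // Called to create scratch space; rayon reuses it across many
        // items, so allocation happens far less than once per item.
        || Vec::<u64>::with_capacity(64),
        |scratch, &x| {
            scratch.clear();
            scratch.push(x * 2);
            // ... run the algorithm using `scratch` ...
        },
    );
}
```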
Rayon would have difficulty providing mutability as well, for much the same reason -- a globally accessible resource could be accessed by re-entrant code, even when "global" is thread-specific. Work-stealing makes this even worse, because code that is not at all recursive can still end up nested on the stack, when one part gets blocked and we steal another part of the "same" code to execute in the meantime.
I think you would like the `for_each_with`/`map_with` adaptors for this.
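Assuming the suggestion is the `*_with` family: those adaptors take a value that is cloned at every split point, so each parallel job gets its own copy to mutate. A sketch:

```rust
use rayon::prelude::*;

fn main() {
    let results: Vec<u64> = (0..1_000u64)
        .into_par_iter()
        .map_with(
            // Cloned whenever rayon splits the work, so each job
            // mutates its own buffer -- no sharing, no locks.
            Vec::<u64>::new(),
            |scratch, x| {
                scratch.clear();
                scratch.push(x);
                scratch.iter().sum::<u64>()
            },
        )
        .collect();
    assert_eq!(results.len(), 1_000);
}
```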
In my example, each invocation of the closure passed to `for_each` would get access to the scratch space. It wouldn't be globally accessible, just inside the closure. Like: you give an initializer to rayon which initializes the scratch space, to be executed in each thread. The type itself wouldn't need to be Send or Sync because it's not being shared between or sent across threads, though the initializer would need to be Send/Sync to be invoked by each thread. It could be kind of like "zip" in that it passes an extra value to the `for_each` closure. I'm bad at explaining but hopefully you can see my thought process.
That depends on what you mean by "each invocation" -- where are each of these coming from? If it's a `for_each` on a parallel iterator, then work-stealing determines which thread ends up running each invocation, so the mapping from invocations to threads isn't something you can predict up front.
I'm not saying there is no work-stealing, just that the work-stealing is not relevant. At the end of the day, each thread in the pool invokes the `for_each` closure a certain number of times. It can pass that closure a mutable reference to a piece of data kept by that thread. That data would have to be initialized separately for each thread in the thread pool, so it is effectively a "thread local". In reality it could easily be stored on the thread's stack and then dropped after the `for_each` is complete.
It is relevant because work-stealing makes your code implicitly re-entrant. If the first call to your code is holding that mutable reference when it blocks on a nested parallel call, the same thread can steal another piece of the "same" work and try to take the reference again while it is still borrowed.
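A sketch of that hazard, assuming a per-thread slot guarded by `RefCell`: if the outer borrow is still live when the worker steals a nested piece of the same job, the inner borrow panics (whether it actually fires depends on scheduling, so this is illustrative rather than deterministic):

```rust
use std::cell::RefCell;
use rayon::prelude::*;

thread_local! {
    static SCRATCH: RefCell<Vec<u64>> = RefCell::new(Vec::new());
}

fn main() {
    (0..100u64).into_par_iter().for_each(|x| {
        SCRATCH.with(|s| {
            let mut scratch = s.borrow_mut(); // outer borrow held...
            scratch.push(x);
            // ...while this nested parallel call blocks the task. The
            // worker thread may steal another `for_each` item and hit
            // `borrow_mut` again -> BorrowMutError panic.
            let _sum: u64 = (0..10u64).into_par_iter().sum();
        });
    });
}
```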
I didn't know that rayon did coroutine stuff, that's wacky. Thanks for explaining. (Maybe this is solvable by, instead of using just one thread-local, using a thread-local "Vec of however many of these are needed at any one time" that is grown on demand... or actually some other data structure that can be grown without invalidating references.)
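A sketch of that grow-on-demand idea: keep a per-thread pool of scratch buffers and pop/push around each use, so a re-entrant call takes (or creates) a fresh buffer instead of aliasing the one already borrowed:

```rust
use std::cell::RefCell;

thread_local! {
    // A pool of scratch buffers per thread, grown on demand.
    static POOL: RefCell<Vec<Vec<u64>>> = RefCell::new(Vec::new());
}

fn with_scratch<R>(f: impl FnOnce(&mut Vec<u64>) -> R) -> R {
    // Take a buffer out of the pool (or create one). Because it is
    // removed from the pool while in use, a re-entrant call on this
    // thread gets a different buffer instead of a conflicting borrow.
    let mut buf = POOL.with(|p| p.borrow_mut().pop()).unwrap_or_default();
    let result = f(&mut buf);
    buf.clear();
    POOL.with(|p| p.borrow_mut().push(buf));
    result
}
```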
I will add that this solves my problem, because my algorithm is constant-time, so I can just split the work evenly. As long as there's an easy way to get the number of threads to do so.
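Splitting evenly by thread count is straightforward with `rayon::current_num_threads()` and `par_chunks`, e.g.:

```rust
use rayon::prelude::*;

fn main() {
    let data: Vec<u64> = (0..1_000).collect();

    // One chunk per pool thread (rounded up), one scratch per chunk.
    let threads = rayon::current_num_threads();
    let chunk = (data.len() + threads - 1) / threads;

    data.par_chunks(chunk).for_each(|slice| {
        let mut scratch: Vec<u64> = Vec::with_capacity(slice.len());
        scratch.extend_from_slice(slice);
        // ... run the algorithm on `scratch` ...
    });
}
```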
This isn't stack-based, but:

```rust
use std::collections::VecDeque;

// One scratch slot per pool thread, indexed by the current thread.
let mut opses = vec![Some(VecDeque::new()); rayon::current_num_threads()];
let get_scratch_opt = {
    // `unique::Unique` (from the `unique` crate) wraps the raw pointer,
    // presumably so the closure can be shared across threads.
    let opses_uniq =
        unique::Unique::new(&mut opses[..] as *mut [Option<VecDeque<_>>]).unwrap();
    move || {
        let opses_ptr = opses_uniq.as_ptr();
        let thread = rayon::current_thread_index().unwrap();
        // Each pool thread touches only its own slot -- modulo the
        // re-entrancy caveat discussed above.
        unsafe { (*opses_ptr).get_mut(thread).unwrap() }
    }
};
```

Fun!
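Usage would look roughly like this (assuming the crate's `Unique` makes the closure shareable across threads; note it hands out `&mut` across invocations, so the work-stealing re-entrancy caveat still applies):

```rust
(0..1_000u64).into_par_iter().for_each(|x| {
    // `scratch` is this pool thread's `&mut Option<VecDeque<_>>`.
    let scratch = get_scratch_opt();
    scratch.get_or_insert_with(VecDeque::new).push_back(x);
});
```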
I started a thread on users.rust-lang.org on this and I thought it would be good to bring some of the takeaways here.
The key thing is that there are cases where you want thread-local storage for the lifetime of certain tasks. The various thread-local storage crates available are a possibility, but statically or heap-allocated storage has undesirable qualities when a stack-based solution is possible.
This is my (non-Rayon) solution; the `r` unbounded channel is the work-range queue. The final send is not ideal for performing a reduction, but one could imagine implementing a tree reduction with a similar mechanism.
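The original code isn't reproduced here; below is a minimal sketch of the described shape, assuming crossbeam channels and std scoped threads: each worker owns stack-local scratch, pulls index ranges from the unbounded channel `r`, and performs the final send of its partial result when the queue drains:

```rust
use crossbeam_channel::unbounded;

fn main() {
    let data: Vec<u64> = (0..1_000).collect();
    let (s, r) = unbounded::<std::ops::Range<usize>>();
    let (res_s, res_r) = unbounded::<u64>();

    // Fill the work-range queue up front, then drop the sender so
    // workers see a closed channel when the queue drains.
    for start in (0..data.len()).step_by(100) {
        s.send(start..(start + 100).min(data.len())).unwrap();
    }
    drop(s);

    std::thread::scope(|scope| {
        for _ in 0..std::thread::available_parallelism().map_or(4, |n| n.get()) {
            let (r, res_s, data) = (r.clone(), res_s.clone(), &data);
            scope.spawn(move || {
                // Stack-local scratch, owned by this worker for its lifetime.
                let mut scratch: Vec<u64> = Vec::new();
                let mut partial = 0u64;
                while let Ok(range) = r.recv() {
                    scratch.clear();
                    scratch.extend_from_slice(&data[range]);
                    partial += scratch.iter().sum::<u64>();
                }
                // The final send: one partial result per worker.
                res_s.send(partial).unwrap();
            });
        }
    });
    drop(res_s);

    let total: u64 = res_r.iter().sum();
    assert_eq!(total, (0..1_000u64).sum());
}
```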
I think the missing abstraction in Rayon for this is a nested task construct, where closures like the innermost ones above can borrow thread-local variables. I'm pretty new to Rust and Rayon, so I don't have an appreciation for how this fits with the implementation strategies used therein, but I do think this is a relatively common pattern. I'd be happy to work with you to explore possible solutions.
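One existing construct that comes close is `rayon::broadcast`, which runs a closure once on every thread in the pool; each closure can keep scratch on its own stack and pull ranges from a shared queue. A sketch under those assumptions:

```rust
use crossbeam_channel::unbounded;

fn main() {
    let (s, r) = unbounded::<std::ops::Range<u64>>();
    for start in (0..1_000).step_by(100) {
        s.send(start..start + 100).unwrap();
    }
    drop(s);

    // `rayon::broadcast` runs the closure once on every pool thread, so
    // each gets its own stack-local scratch and drains the shared queue.
    let partials: Vec<u64> = rayon::broadcast(|_ctx| {
        let mut scratch: Vec<u64> = Vec::new();
        let mut partial = 0u64;
        while let Ok(range) = r.recv() {
            scratch.clear();
            scratch.extend(range);
            partial += scratch.iter().sum::<u64>();
        }
        partial
    });
    let total: u64 = partials.into_iter().sum();
    println!("{total}");
}
```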