Simplify the iterator adaptive splitting strategy #857
base: main
Conversation
Here are my benchmark results with a 5% threshold. It's rather mixed at the extremes, so I'm not sure how to evaluate this...
cc @wagnerf42
hi, i'd like to take some time looking at the benches. off the top of my head, here are some raw comments:
if you can give me a bit more time (1 week?) i will take a closer look.
Yes, I agree that needs will vary, but I hope this change makes a better default.
Yeah, it's hard. Even with the larger benchmarks (
I'm not in a rush, so I'm happy to let you dig further -- thanks!
hi, so here is one example i had in mind:

```rust
let r = (0..*size)
    .into_par_iter()
    .map(|e| (0..e).map(|x| x % 2).sum::<u64>())
    .sum::<u64>();
```

so, why that one? if everything is nicely balanced then i don't see how the proposed modification could be bad. in this example, when the parallel range gets divided in two, the left part contains 1/4 of the total work and the right part 3/4. for good performance the scheduler will need to keep dividing regularly throughout the execution (log(n) times).

i wonder about the following: if you need to keep generating tasks on one branch during the execution, then i think at some point it is possible that only one single stealable task remains with the new mechanism. what does that mean? well, it is still enough to distribute work to all threads, but now all stealers must try around p times unsuccessfully before achieving a successful steal.

i did a run of this code on a 32-core machine using 64 threads and the new code was slower (40%) there (sizes 100k to 300k). i'd still need to take a look inside the run to see what is really going on. i'll also try some more benches next week.
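For reference, a self-contained version of that benchmark could look like the sketch below; the 64-thread pool and the concrete sizes are assumptions taken from this thread, not code from the PR.

```rust
use rayon::prelude::*;

fn main() {
    // Fixed-size pool so runs are comparable across machines; 64 threads is
    // only an assumption mirroring the setup described above.
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(64)
        .build()
        .unwrap();

    for size in [100_000u64, 200_000, 300_000] {
        let r: u64 = pool.install(|| {
            (0..size)
                .into_par_iter()
                // Work per item grows with `e`, so the left half of the range
                // holds roughly 1/4 of the total work and the right half 3/4.
                .map(|e| (0..e).map(|x| x % 2).sum::<u64>())
                .sum::<u64>()
        });
        println!("size = {size}: r = {r}");
    }
}
```

Because the per-item cost grows linearly with `e`, the two halves of the range carry unequal work, which is exactly the imbalance the comment is pointing at.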
Thank you for your investigation! The slowdown is surprising and disappointing, but I hope you'll be able to get a complete picture of why that is. We should also encode that knowledge in source comments at least, because this is all currently very opaque.
To add another data point to the discussion, I have an example at GeoStat-Framework/GSTools-Core#6 where this makes the difference between a slowdown (serial versus parallel) and a speed-up, because without it around 500 large temporary arrays are needed for a
Before, when an iterator job was stolen, we would reset the split count all the way back to `current_num_threads`, to adaptively split jobs more aggressively when threads seem to need more work. This ends up splitting much further than many people expect, especially in the tail end of a computation when threads are fighting over what's left. Excess splitting can also be harmful for things like `fold` or `map_with` that want to share state as much as possible.

We can get a much lazier "adaptive" effect by simply not updating the split count when we split a stolen job, effectively giving it only _one_ extra boost of splitting.
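To make the change concrete, here is a minimal sketch of the splitting logic as described above; the names and structure are illustrative and simplified, not rayon's exact internals.

```rust
/// Illustrative split-count bookkeeping; hypothetical names, not rayon's
/// real plumbing types.
struct Splitter {
    /// Remaining number of splits we still want to perform.
    splits: usize,
}

impl Splitter {
    fn new(num_threads: usize) -> Self {
        // Start with roughly one split per thread.
        Splitter { splits: num_threads }
    }

    /// Decide whether the current job should be split again.
    fn try_split(&mut self, stolen: bool) -> bool {
        if stolen {
            // Old behavior (described above): reset `self.splits` back to the
            // thread count, re-splitting aggressively after every steal.
            //
            // New behavior: split this stolen job once more, but leave the
            // count untouched, so the steal grants only one extra split.
            true
        } else if self.splits > 0 {
            // Normal case: spend part of the remaining split budget.
            self.splits /= 2;
            true
        } else {
            // Budget exhausted and not stolen: stop splitting.
            false
        }
    }
}
```

With this shape, every ordinary split halves the remaining budget, while a steal adds a single unconditional split without refilling the budget, which is the "one extra boost" described above.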