-
Notifications
You must be signed in to change notification settings - Fork 409
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multithreaded sequential programs spend most their active time waiting #184
Comments
I think the goal should be to avoid unproductive stealing with suitable heuristics, well-placed back-offs, etc. If threads spend a lot of effort stealing the work of each-other and no parallelism is had, that should be identifiable so we can just stop doing that - right? |
I think the exponential backoff should be doing that, but it seems that they're able to steal tasks just often enough that they don't just keep sleeping forever and instead all threads just run slowly. Another problem of the current implementation is that after reducing a term to WHNF the backoff object is deleted and created again to reduce the subterms, so the heuristical knowledge is lost, although that doesn't happen in this specific example |
Regarding The threads spawned aren't the only threads running on the system. Unless each thread gets pinned to a core, the OS can move the producer to the core of the spinning consumer (either randomly or to improve temporal cache locality) and the consumer can waste the core's quota spinning even though the producer (on the same core) would have been ready to execute, so the more spinning stealer threads there are, the more potential for slowing down producers. Because to the CPU, a spinning thread looks like it's being very productive, even busier than actually productive threads who are intermittently blocked (due to a cache miss usually, but possibly due to branch mispredict or even a long-latency operation). Scheduling a task involves both queueing it to be popped & telling other worker threads to pop and perform it. "telling others to pop a task" without a bunch of overhead is its own challenge:
If we want to get something that's robust fast, we can use rayon's scoped pool :) |
I was messing around with my own interaction net evaluator for educational purposes ( not working yet or anything ), and I can testify to Rayon being extremely nice to use, and has so far prevented me from reaching for |
For programs that are dominated by expressions that must be reduced in sequential order, there is a very large performance drop proportional to the number of threads used.
Take this example program:
Running with one thread, I get ~30MR/s, while with 2 threads I get ~20MR/s and with 16 threads ~10MR/s.
Analizing what the program spends its time on using
perf
, it looks like most of the time the threads are trying to steal a job from another task, failing (because this program is inherently sequential and there is not enough parallel work for all the threads), and the waiting to try again.While it's true that we should try to write parallelizable programs for the HVM, for tasks that are mostly sequential we should have better behaviour.
The text was updated successfully, but these errors were encountered: