Replies: 2 comments 1 reply
-
Hello Jeremy, at Google we have some other schedulers for our internal targets that do something similar: they include heuristics to expose more overlap opportunities, plus a memory-pressure tracking system to keep memory pressure from getting out of hand (reordering instructions can increase memory pressure in unexpected ways). Since you need something very similar here, we could eventually open-source this pass so it also serves the open-source targets. It will require a bit of refactoring to make sure it's ready for open source. That way you would have another option to evaluate on your workloads as well, though (as I mentioned) it might take a bit more time to get it open-sourced. WDYT? Does this plan sound good to you?
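To make the idea above concrete, here is a minimal, hypothetical sketch of a pressure-aware greedy list scheduler. All names and the memory model are invented for illustration (this is not the internal pass described above): the scheduler prefers to issue async collective starts early to expose overlap, but skips a candidate when issuing it would push tracked memory pressure past a limit.

```python
# Hypothetical sketch of a pressure-aware list scheduler.
# The Instr fields and the memory model are illustrative only;
# real schedulers also model when buffers are freed.
from dataclasses import dataclass, field

@dataclass
class Instr:
    name: str
    bytes_out: int                      # bytes the result keeps live
    deps: list = field(default_factory=list)
    is_async_start: bool = False        # e.g. AllReduceStart
    is_async_done: bool = False         # e.g. AllReduceDone

def schedule(instrs, mem_limit):
    """Greedily pick ready instructions, preferring async starts (to
    expose overlap) and deferring async dones, unless issuing a pick
    would push live memory past mem_limit."""
    scheduled, live, done = [], 0, set()
    remaining = list(instrs)
    while remaining:
        ready = [i for i in remaining if all(d in done for d in i.deps)]
        # Starts first, dones last, smaller allocations first among ties.
        ready.sort(key=lambda i: (i.is_async_done,
                                  not i.is_async_start,
                                  i.bytes_out))
        # Take the best pick that fits under the pressure limit;
        # fall back to the best pick overall if nothing fits.
        pick = next((i for i in ready
                     if live + i.bytes_out <= mem_limit), ready[0])
        live += pick.bytes_out
        done.add(pick.name)
        scheduled.append(pick.name)
        remaining.remove(pick)
    return scheduled
```

With a tight `mem_limit`, the fallback path delays an async start past independent compute instead of exceeding the limit; with a loose limit, starts hoist early and dones sink late, leaving compute in between to overlap the collective.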
-
PR is here:
-
We have noticed that, in some cases, NCCL kernels launched by HLO asynchronous collective operations (e.g. `AllReduce`, which is lowered to `AllReduceStart` and `AllReduceDone`) are "exposed": no compute kernels execute concurrently, even though the HLO instruction dependencies and the available GPU resources indicate that they could.

We have an alternative implementation of the memory-schedule postprocessor that is currently used to "move" `AllReduceStart` and `AllReduceDone` instructions. Some notes on the new implementation and some preliminary (anecdotal) improvement results are attached.

Since changes to the GPU scheduler will potentially have a broad impact, would it be best to open a pull request with this scheduler enabled via an opt-in flag, or should the PR simply replace the existing postprocessor?
XLA_sched_notes.pdf
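As a toy illustration of what "moving" the `Done` side of an async pair can look like (a hypothetical sketch, not the actual postprocessor described in the attached notes): sink each `-done` op just before its first user, so that any independent compute scheduled in between runs while the collective is still in flight.

```python
# Illustrative sketch: given a linear schedule (a list of op names) and a
# users map (op -> set of ops that consume its result), sink each "-done"
# op to just before its first user. Op naming is invented for the example.
def sink_dones(schedule, users):
    out = list(schedule)
    for op in schedule:
        if not op.endswith("-done"):
            continue
        i = out.index(op)
        # Index of the first later instruction that consumes this done op;
        # if it has no user, sink it to the end of the schedule.
        first_use = next((j for j in range(i + 1, len(out))
                          if out[j] in users.get(op, set())), len(out))
        # Move the done op to just before its first user, letting the
        # ops in between overlap with the in-flight collective.
        out.insert(first_use - 1, out.pop(i))
    return out
```

For example, `["ar-start", "ar-done", "mul", "add"]` with `users = {"ar-done": {"add"}}` becomes `["ar-start", "mul", "ar-done", "add"]`, so `mul` overlaps the collective. A real postprocessor would additionally hoist the `-start` ops as early as their operands allow and respect memory-pressure constraints.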