Broadcast-like operations are poorly scheduled (widely-shared dependencies) #6570
Labels: memory, performance, scheduling, stability
Graphs like this are not currently scheduled well:
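A minimal sketch of a graph with this shape (the names `x` and `big`, and the shapes and chunk sizes, are illustrative rather than taken from the original snippet): a single-chunk array `x` whose lone `random_sample` task is a dependency of every `mul` task, broadcast against a chunked array whose `random_sample` tasks each feed exactly one `mul` task.

```python
import dask.array as da

# One chunk: its single `random_sample` task is shared by every `mul` task.
x = da.random.random(10, chunks=-1)

# Ten chunks: each `random_sample` task feeds exactly one `mul` task.
big = da.random.random((100, 10), chunks=(10, 10))

# Broadcasting `x` against `big` produces ten `mul` tasks that all share `x`.
result = big * x
# result.visualize()  # renders the fan-out from `x` into every `mul`
```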
The `.` tasks should definitely take into account the location of the `*` data when scheduling. But if we have 5 workers, every worker will have `*` data on it, but only 2 workers will have an `a` or `b`. In scheduling the first few `.`s, there's a tug-of-war between the `a` and the `*`: which do we want to schedule near? We want a way to disregard the `a`.

Say `(*, 0)` completes first, and `a` is already complete, on a different worker. Each `*` is the same size as (or smaller than) `a`. We now schedule `(., 0)`. If we choose to go to `a`, we might have a short-term gain, but we've taken a spot that could have gone to better use in the near future. Say the worker holding `a` is already running `(*, 6)`. Now, `(., 6)` may get scheduled on yet another worker, because `(., 0)` is already running where it should have gone, and the scheduler prioritizes "where can I start this task soonest" over "how can I minimize data transfer". This can cascade through all the `.`s, until we've transferred most root tasks to different workers (on top of `a`, which we have to transfer everywhere no matter what). What could have been a nearly-zero-transfer operation instead ends up transferring nearly every piece of input data to a different worker, greatly increasing memory usage.
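To make the "disregard the `a`" idea concrete, here is a minimal sketch of that kind of worker-selection heuristic. It is not the actual `decide_worker` logic in `distributed`; the `WorkerInfo`/`DepInfo` structures and the dependents-vs-workers threshold are assumptions for illustration. The idea: ignore any dependency with roughly as many dependents as there are workers, since it must be transferred everywhere anyway, and localize only against the narrowly-shared inputs (the per-chunk `*` data).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerInfo:
    """Hypothetical stand-in for per-worker scheduler state."""
    address: str
    n_processing: int = 0


@dataclass(frozen=True)
class DepInfo:
    """Hypothetical stand-in for a dependency's task state."""
    nbytes: int
    n_dependents: int
    who_has: frozenset = frozenset()  # addresses of workers holding the data


def pick_worker(deps, workers):
    """Choose a worker for a task, disregarding broadcast-like dependencies."""
    n_workers = len(workers)

    # A dependency with about as many dependents as there are workers (like
    # `a` above) will be replicated everywhere regardless of where this task
    # runs, so don't let it pull the task toward its current worker.
    narrow = [d for d in deps if d.n_dependents < n_workers]

    if not narrow:
        # Nothing left to localize against: fall back to the least-busy worker.
        return min(workers, key=lambda w: w.n_processing)

    # Otherwise go where the most bytes of the narrowly-shared inputs
    # (the per-chunk `*` data) already live.
    return max(
        workers,
        key=lambda w: sum(d.nbytes for d in narrow if w.address in d.who_has),
    )


# Toy example: the `a`-like piece is larger and lives on w2, but because it is
# shared by every task it is ignored, and the task follows its `*` chunk to w1.
workers = [WorkerInfo("w1"), WorkerInfo("w2")]
deps = [
    DepInfo(nbytes=100, n_dependents=10, who_has=frozenset({"w2"})),  # like `a`
    DepInfo(nbytes=80, n_dependents=1, who_has=frozenset({"w1"})),    # like `(*, 0)`
]
assert pick_worker(deps, workers).address == "w1"
```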
This pattern shows up any time you broadcast one thing against another in a binary operation, which can happen with arrays, dataframes, bags, etc. In the above case, the `mul` tasks will tend to "dogpile" onto the one worker that holds the middle `random_sample` task (`x`). @crusaderky has also observed cases where this "dogpile" effect causes what should be an embarrassingly parallel operation to get scheduled entirely on one worker, overwhelming it.
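A rough way to observe this placement on a local cluster, assuming the sketch above (the 5-worker setup and the key-name matching are illustrative, and exact behavior varies across scheduler versions):

```python
from collections import Counter

import dask.array as da
from dask.distributed import Client, LocalCluster, wait

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=5, threads_per_worker=1)
    client = Client(cluster)

    x = da.random.random(10, chunks=-1)                  # the broadcast piece
    big = da.random.random((100, 10), chunks=(10, 10))   # ten chunks
    result = (big * x).persist()
    wait(result)

    # Count how many `mul` outputs each worker holds. With the dogpiling
    # described above, the counts skew toward the worker that produced `x`
    # rather than landing at ~2 per worker. (Key names vary by dask version.)
    counts = Counter()
    for worker, keys in client.has_what().items():
        counts[worker] += sum("mul-" in key for key in keys)
    print(counts)

    client.close()
    cluster.close()
```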
#5325 was a heuristic attempt to fix this, but there are probably better ways to approach it.