We're double counting estimated network cost in multiple places
First, we're calculating the estimated network cost of dependencies a worker needs to fetch in `_set_duration_estimate` and are storing the result in `WorkerState.processing`, i.e. `processing = compute + comm`.
This is also used to set the worker's occupancy.
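In pseudocode, the estimate amounts to something like this (a sketch of the logic, not the actual `_set_duration_estimate` implementation; the exact attribute access is an assumption based on the scheduler state):

```python
# Estimated comm cost: bytes of all dependencies the worker still has to fetch
comm = (
    sum(dts.nbytes for dts in ts.dependencies if ws not in dts.who_has)
    / self.bandwidth
)
# Estimated compute cost based on past measurements of similar tasks
compute = self.get_task_duration(ts)

ws.processing[ts] = compute + comm  # processing = compute + comm
ws.occupancy += compute + comm      # the same sum feeds the worker's occupancy
```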
When making a scheduling decision, we're typically using `Scheduler.worker_objective`, which calculates a `start_time` (distributed/scheduler.py, lines 3000 to 3001 in b133009) that is defined as sketched below.
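Reconstructed from the referenced lines (the permalink above is authoritative):

```python
# occupancy already includes the estimated comm cost from above...
stack_time = ws.occupancy / ws.nthreads
# ...and the transfer cost of this task's dependencies is added once more
start_time = stack_time + comm_bytes / self.bandwidth
```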
Since the occupancy already includes comm, this amounts to `start_time ~ (compute + comm) / nthreads + comm`, which is wrong on two counts:

- comm cost should be constant and not scale with `nthreads`
- we should only account for the comm cost once
A similar double counting is introduced on the work-stealing side when calculating the `cost_multiplier`:
```python
compute_time = ws.processing[ts]  # occupancy, i.e. compute + comm
transfer_time = nbytes / self.scheduler.bandwidth + LATENCY
cost_multiplier = transfer_time / compute_time
```

If we ignore latency for now, this yields something like

```
cost_multiplier ~ NBytes / (Bandwidth * duration_average + NBytes)
                = (NBytes / Bandwidth) / (duration_average + NBytes / Bandwidth)
```
i.e. for network-heavy tasks, this converges towards 1, which is quite the opposite of what this ratio is supposed to encode: the more expensive the transfer, the higher the multiplier should be.
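A minimal numeric sketch of the saturation (hypothetical numbers):

```python
bandwidth = 100e6        # bytes/s
duration_average = 0.01  # s of actual compute

for nbytes in (1e6, 1e8, 1e10):
    transfer_time = nbytes / bandwidth
    # the occupancy-based compute_time already contains the transfer time
    compute_time = duration_average + transfer_time
    print(f"{nbytes:.0e}: {transfer_time / compute_time:.4f}")

# 1e+06: 0.5000
# 1e+08: 0.9901
# 1e+10: 0.9999  -> saturates at 1 instead of growing with transfer cost
```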
There is another double/multiple counting problem in `_set_duration_estimate` that concerns tasks with shared dependencies.
`_set_duration_estimate` is evaluated once per task without any regard for shared dependencies. Therefore, specifically for graphs where N tasks share one common node, this node's transfer cost is vastly overestimated since it is counted N times (see the sketch below).
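A minimal sketch of the overcounting with hypothetical numbers (not the actual `_set_duration_estimate` code):

```python
# One shared dependency of 1 GB, fanned out to N = 100 tasks on the same worker
bandwidth = 100e6  # bytes/s
dep_nbytes = 1e9
n_tasks = 100

transfer_once = dep_nbytes / bandwidth        # 10 s: the dependency is fetched once
per_task_estimate = dep_nbytes / bandwidth    # but the cost is charged per task
total_estimate = n_tasks * per_task_estimate  # 1000 s of estimated comm

print(transfer_once, total_estimate)  # 10.0 vs 1000.0: a 100x overestimate
```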
This double counting can be catastrophic for cases where the transfer cost is potentially larger than or comparable to the compute cost. Apart from an erroneous `worker_objective`, this can lead to misclassification of idle workers, which then causes very aggressive work stealing where all tasks are stolen by the worker holding the dependency. An extreme example is #6573.