Scheduler underestimates data transfer cost for small transfers #5324
If a dependency is already on every worker, or will end up on every worker regardless because many things depend on it, we should ignore it when selecting our candidate workers. Otherwise, we'll end up considering every worker as a candidate, which is 1) slow and 2) often leads to poor choices (xref dask#5253, dask#5324).

Just like with root-ish tasks, this is particularly important at the beginning. Say we have a bunch of tasks `x, 0`..`x, 10` that each depend on `root, 0`..`root, 10` respectively, but every `x` also depends on one task called `everywhere`. If `x, 0` is ready first, but `root, 0` and `everywhere` live on different workers, it appears as though we have a legitimate choice to make: do we schedule near `root, 0`, or near `everywhere`? But if we choose to go closer to `everywhere`, we might get a short-term gain while taking a spot that could have gone to better use in the near future. Say the `everywhere` worker is about to complete `root, 6`. Now `x, 6` may run on yet another worker (because `x, 0` is already running where it should have gone). This can cascade through all the `x`s, until we've transferred most `root` tasks to different workers (on top of `everywhere`, which we have to transfer everywhere no matter what).

The principle here is the same as dask#4967: be more forward-looking in worker assignment and accept a little short-term slowness to ensure that downstream tasks have to transfer less data.

This PR is a binary choice, but I think we could actually generalize to some weight in `worker_objective`: the more dependents or replicas a task has, the less weight we should give to the workers that hold it. I wonder if, barring significant data transfer imbalance, having stronger affinity for the more "rare" keys will tend to lead to better placement.
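A minimal sketch of the candidate-filtering idea, assuming hypothetical `dependencies`/`who_has` attributes and an illustrative `replication_cutoff`; this is not the actual `decide_worker` code:

```python
def candidate_workers(ts, all_workers, replication_cutoff=0.5):
    """Workers holding at least one "rare" dependency of task ``ts``.

    ``ts.dependencies`` is assumed to be an iterable of dependency tasks,
    each with a ``who_has`` set of the workers holding its data.
    """
    candidates = set()
    for dep in ts.dependencies:
        # A dependency already replicated on most of the cluster (or that
        # will end up everywhere because many tasks depend on it) barely
        # constrains placement; skip it so it doesn't drag every worker
        # into the candidate set.
        if len(dep.who_has) > replication_cutoff * len(all_workers):
            continue
        candidates.update(dep.who_has)
    # If every dependency was ubiquitous, any worker is as good as another.
    return candidates or set(all_workers)
```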
This might also be connected to some bandwidth bias I discovered in #4962. We only take into account bandwidth measurements if the byte size is beyond a threshold (`distributed/worker.py` lines 2432 to 2441 at `3f86e58`), biasing our measurements. If the bandwidth is perceived to be too large, the cost is underestimated.
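For context, the thresholded bandwidth update being referenced looks roughly like the sketch below; the threshold and smoothing factors here are illustrative, not the exact values at the permalink:

```python
def update_bandwidth_estimate(worker, total_bytes, duration, threshold=1_000_000):
    """Paraphrase of the thresholded bandwidth update discussed above."""
    measured = total_bytes / duration
    if total_bytes > threshold:
        # Only "large" transfers feed the exponential moving average.
        worker.bandwidth = worker.bandwidth * 0.95 + measured * 0.05
    # Small transfers are dominated by latency, so they're excluded --
    # which keeps the recorded bandwidth high and makes the scheduler's
    # ``nbytes / bandwidth`` cost estimate too low for small transfers.
```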
Maybe we want to add 10ms?
Yeah, I experimented a little with turning off that threshold. What was the motivation for it? Do we have a measure of latency on the scheduler side? Workers have a `latency` measurement.

That sounds generally reasonable. It could be nice to have something latency-based too, though.

The worker measures the latency upon every heartbeat, but it isn't used anywhere. It should be easy enough to sync this to the scheduler.

Sure. One small thing I like about the latency measure on the worker side is that it's updated as soon as the worker connects to the scheduler (it doesn't have to wait for a heartbeat cycle). Ideally it would be nice to have the same on the scheduler side (get it on first connection without waiting for the first heartbeat), mostly just for tests. But for now I'll send worker latency in metrics. Do we want to track a single average latency on the scheduler, or one per WorkerState?

I would suggest doing this per worker, since some may be colocated, others may be on a very busy host, slow network, etc. I'm not sure if this makes a huge difference, but it's very easy to measure, after all.

It definitely is easier. Though the worker's latency measurement is of it talking to the scheduler, whereas the latency we'll actually incur is the worker talking to one of its peers. So if network topology is inconsistent, this estimate could be inaccurate. But the whole thing is pretty inaccurate anyway, so I think it's fine.
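A rough sketch of what syncing the worker-measured latency to the scheduler could look like via heartbeat metrics; the metric key, `record_latency` helper, and `WorkerState.latency` attribute are assumptions for illustration, not the existing API:

```python
# Worker side: include the most recent latency measurement in the
# metrics dict sent with each heartbeat.
def heartbeat_metrics(worker):
    return {"latency": worker.latency}


DEFAULT_LATENCY = 0.001  # ~1 ms placeholder until the first heartbeat arrives


# Scheduler side: store it per WorkerState when the heartbeat arrives,
# so each worker keeps its own estimate (colocated vs. remote, busy host,
# slow network, ...).
def record_latency(scheduler, address, metrics):
    ws = scheduler.workers[address]
    ws.latency = metrics.get("latency", DEFAULT_LATENCY)
```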
I feel like this has been discussed tangentially in other places, but I couldn't find an issue just for this. The scheduler generally seems a little too eager to schedule tasks in a way that requires data transfer, when transfer could be avoided (#5253 aims to address another aspect of this).
In our `worker_objective` function, used to estimate how long it will take before a given task can start running on a given worker, we basically just consider `n_bytes_to_transfer / self.bandwidth`. But there's a lot more than that that has to happen in reality.

Here are some things we're not accounting for:
- Should `Worker.data_needed` be priority-ordered? (#5323)

Now I'm not at all saying we should try to actually measure these things. And when the data to transfer is big, what we have is pretty accurate.
But we should probably add some fixed-cost penalty for any transfer (covering network latency, GIL-pausing, etc.). Ideally at least latency can be measured (is it already on the scheduler?).
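As a sketch of that fixed-cost idea (attribute names and the 10 ms constant are illustrative, and this is simplified from the real `worker_objective`):

```python
TRANSFER_PENALTY = 0.010  # ~10 ms per dependency that must be fetched

def worker_objective(scheduler, ts, ws):
    """Estimate how soon task ``ts`` could start on worker ``ws`` (sketch)."""
    deps_to_fetch = [dep for dep in ts.dependencies if ws not in dep.who_has]
    comm_bytes = sum(dep.nbytes for dep in deps_to_fetch)

    # Current behavior: transfer cost is just bytes / bandwidth, which
    # rounds to ~0 for small keys. Adding a fixed per-transfer penalty
    # accounts for latency, GIL pauses, (de)serialization, etc.
    transfer_time = comm_bytes / scheduler.bandwidth
    transfer_time += len(deps_to_fetch) * TRANSFER_PENALTY

    stack_time = ws.occupancy / ws.nthreads  # time until a thread frees up
    return (stack_time + transfer_time, ws.nbytes)
```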
We should also just make `decide_worker` less inclined to cause data transfer. Maybe, if one worker already holds all the dependencies for a task, only pick a different worker if:

Test to look at estimated vs actual transfer times
cc @fjetter @mrocklin