Description
TL;DR
We currently allow stealing of split-shuffle tasks. That's a really bad idea...
Intro
Our distributed shuffle algorithm currently employs three kinds of tasks, repeated a few times over multiple stages:
group-shuffle
split-shuffle
shuffle
where group-shuffle performs an actual groupby per chunk and outputs the result to a tuple or list, while split-shuffle simply performs a getitem on that list. split-shuffle is therefore incredibly cheap and benefits greatly from data locality. Not only is a replica of a group-shuffle result useless, the dependent shuffle can usually not benefit greatly from data locality itself. Therefore, moving a split-shuffle or assigning it to a suboptimal worker is a horrible decision (see the sketch below).
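To make the three task kinds concrete, here is a rough, self-contained sketch of what each of them computes. The function names and the hash-based partition assignment are purely illustrative and do not correspond to the actual dask graph:

import pandas as pd

# group-shuffle: an expensive groupby on one input partition, emitting one
# piece per output partition as a list
def group_shuffle(partition, column, npartitions):
    assignment = partition[column].map(hash) % npartitions
    grouped = partition.groupby(assignment)
    return [
        grouped.get_group(i) if i in grouped.groups else partition.iloc[:0]
        for i in range(npartitions)
    ]

# split-shuffle: a trivially cheap getitem on that list; it only stays cheap
# if it runs on the worker that already holds the group-shuffle output
def split_shuffle(groups, i):
    return groups[i]

# shuffle: concatenate the i-th piece from every group-shuffle output
def shuffle(pieces):
    return pd.concat(pieces)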
Problem
We currently have two mechanisms which (re-)assign a task to a worker:
A) decide_worker
This is the first stop for a task, and we typically choose a worker which holds one of its dependencies. However, we are currently opening up this logic to allow for more flexible assignments (#4925). In particular, we'll include idle workers in the candidate list, which exposes us to the risk of moving split-shuffle tasks to the wrong worker, IIUC.
cc @gjoseph92
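For illustration, a greatly simplified sketch of the concern (this is not the actual scheduler code; the names are only meant to mirror its shape):

# Greatly simplified sketch, not the real decide_worker implementation.
def decide_worker_sketch(ts, idle_workers, objective):
    # Today: prefer workers that already hold a dependency of the task.
    candidates = {ws for dep in ts.dependencies for ws in dep.who_has}
    # With the more flexible assignment (#4925), idle workers join the pool ...
    candidates |= set(idle_workers)
    # ... so the objective can pick a worker holding none of the dependencies,
    # which is disastrous for a getitem-only task like split-shuffle.
    return min(candidates, key=objective)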
B) Work stealing
There are two mechanisms in work stealing which serve to protect us from assigning tasks like split-shuffle to the wrong worker.
B1) There is a dedicated blacklist for this specific task, see
distributed/distributed/stealing.py
Line 455 in 9f4165a
fast_tasks = {"shuffle-split"}
The problem, as you can see, is that the blacklisted task prefix shuffle-split is the reverse of the actual name. This is a regression which was introduced at some point (I remember fixing this one or two years ago; I apparently wrote a bad unit test :) ).
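In other words (a minimal illustration of the mismatch, with the strings taken from the description above):

fast_tasks = {"shuffle-split"}   # what stealing.py currently blacklists
actual_prefix = "split-shuffle"  # the prefix the tasks actually carry

# The membership test in stealing.py therefore never fires for these tasks,
# and they remain stealable:
assert actual_prefix not in fast_tasks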
B2) There is a heuristic which excludes tasks that compute in under 5ms, or whose data is relatively costly to move compared to their compute time, from stealing, see
distributed/distributed/stealing.py
Lines 120 to 126 in 9f4165a
if split in fast_tasks:
    return None, None

ws = ts.processing_on
compute_time = ws.processing[ts]
if compute_time < 0.005:  # 5ms, just give up
    return None, None
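For context, the surrounding heuristic roughly amounts to the following (a paraphrased sketch of how I read it, not the exact scheduler code):

def steal_time_ratio_sketch(ts, split, fast_tasks, bandwidth):
    # Blacklisted prefixes are never stolen.
    if split in fast_tasks:
        return None, None
    # Tasks computing in under 5ms are never stolen either.
    ws = ts.processing_on
    compute_time = ws.processing[ts]
    if compute_time < 0.005:  # 5ms, just give up
        return None, None
    # Otherwise weigh the cost of moving the dependencies against the
    # compute time; large ratios should make stealing unattractive.
    transfer_cost = sum(dep.nbytes for dep in ts.dependencies) / bandwidth
    cost_multiplier = transfer_cost / compute_time
    return cost_multiplier, compute_time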
I'm currently investigating an issue where this logic might need to be removed or relaxed, see #4920 and #4471, which would remove the last layer of protection.
To see how bad this can become, I executed a shuffle computation on my local machine, see the performance report below. My OS kernel seemed to be pretty busy with process management, which slowed everything down. Great: this led to the splits being whitelisted for stealing and caused incredibly costly transfers.
from distributed.client import performance_report
import time
import distributed
from dask.datasets import timeseries

if __name__ == "__main__":
    client = distributed.Client(n_workers=1, threads_per_worker=1, memory_limit="16G")
    print(client.dashboard_link)
    data = timeseries(
        start="2000-01-01",
        end="2002-12-31",
        freq="1s",
        partition_freq="1d",
    )
    # This includes one str col which causes non-trivial transfer cost
    data = data.shuffle("id", shuffle="tasks")
    # drop the string col such that the sum doesn't consume too much CPU
    data = data[["id", "x", "y"]]
    result = data.sum()
    print(len(result.dask))
    with performance_report("gh4471_dataframe_shuffle_main.html"):
        print("submit computation")
        future = client.compute(result)
        print("started computation")
        time.sleep(20)
        print("scaling to 10 workers")
        client.cluster.scale(10)
        _ = future.result()
Future
Short term, I'll add split-shuffle to the blacklist of tasks that are exempt from stealing. However, we should ensure that both decide_worker and any future work stealing respect this. I would like to come up with a heuristic to identify tasks like this; one possible shape is sketched below. While split-shuffles are probably the most prominent (and maybe most impactful) examples of this, I would imagine that this pattern is not unusual.
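One possible shape for such a heuristic (purely a sketch of the idea; none of these attributes or thresholds are settled):

def should_pin_to_dependency(ts, expected_compute_time, bandwidth):
    # A cheap, getitem-like task with a single large dependency that also
    # feeds many sibling tasks should stay where the data already lives.
    if len(ts.dependencies) != 1:
        return False
    dep = next(iter(ts.dependencies))
    transfer_cost = dep.nbytes / bandwidth
    fan_out = len(dep.dependents)
    return (
        expected_compute_time < 0.005
        and transfer_cost > expected_compute_time
        and fan_out > 1
    )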
cc @madsbk I believe you have been looking into shuffle operations in the past and might be interested in this.
Related PRs contributing to a fix