Worker assignment for split-shuffle tasks #4962

@fjetter

Description

TL;DR

We currently allow stealing of split-shuffle tasks. That's a really bad idea...

Intro

Our distributed shuffle algorithm currently employs three kinds of tasks, repeated a few times over multiple stages:

  1. group-shuffle
  2. split-shuffle
  3. shuffle

where group-shuffle performs an actual groupby per chunk and outputs the result as a tuple or list, while split-shuffle simply performs a getitem on that list. split-shuffle is therefore incredibly cheap and benefits greatly from data locality: not only would a replica of its group-shuffle input be useless, the dependent shuffle tasks usually cannot benefit much from data locality themselves. Moving a split-shuffle, or assigning it to a suboptimal worker in the first place, is therefore a horrible decision.
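To make the pattern concrete, here is a rough sketch of one such stage expressed as a raw dask graph. The key names mirror the task prefixes above, but group_and_split, concat, and the toy inputs are made-up stand-ins for illustration, not what dask.dataframe actually generates:

import operator
import dask
import pandas as pd

def group_and_split(df, n):
    # Stand-in for the expensive group-shuffle work: partition one chunk
    # into n groups by key (the real tasks use a hash-based groupby).
    return [df[df["id"] % n == i] for i in range(n)]

def concat(parts):
    # Stand-in for reassembling one output partition.
    return pd.concat(parts)

graph = {
    ("input", 0): pd.DataFrame({"id": [0, 1, 2, 3]}),
    ("input", 1): pd.DataFrame({"id": [4, 5, 6, 7]}),
    ("group-shuffle", 0): (group_and_split, ("input", 0), 2),
    ("group-shuffle", 1): (group_and_split, ("input", 1), 2),
    # split-shuffle is nothing but a getitem on its single dependency ...
    ("split-shuffle", 0, 0): (operator.getitem, ("group-shuffle", 0), 0),
    ("split-shuffle", 0, 1): (operator.getitem, ("group-shuffle", 0), 1),
    ("split-shuffle", 1, 0): (operator.getitem, ("group-shuffle", 1), 0),
    ("split-shuffle", 1, 1): (operator.getitem, ("group-shuffle", 1), 1),
    # ... so running it anywhere but on the worker holding its
    # group-shuffle output means shipping the whole list for a no-op.
    ("shuffle", 0): (concat, [("split-shuffle", 0, 0), ("split-shuffle", 1, 0)]),
    ("shuffle", 1): (concat, [("split-shuffle", 0, 1), ("split-shuffle", 1, 1)]),
}

out0, out1 = dask.get(graph, [("shuffle", 0), ("shuffle", 1)])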

Problem

We currently have two mechanisms which (re-)assign a task to a worker:

A) decide_worker

This is the first stop for a task; we typically choose a worker which already holds one of its dependencies. However, we are currently opening up this logic to allow for more flexible assignments (#4925). In particular, we'll include idle workers in the candidate list which, iiuc, exposes us to the risk of assigning split-shuffle tasks to the wrong worker.
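For reference, here is a heavily simplified sketch of the locality-based part of that decision. It is not the actual scheduler code (the real objective also weighs occupancy and expected start time); Dep, decide_worker_sketch, and the worker names are made up for illustration:

from dataclasses import dataclass

@dataclass(frozen=True)
class Dep:
    nbytes: int
    who_has: frozenset  # names of workers currently holding this data

def decide_worker_sketch(dependencies, all_workers):
    # Restrict candidates to workers already holding a dependency, then
    # pick the one that would have to fetch the fewest bytes. Widening
    # `candidates` to idle workers (as discussed in #4925) is exactly
    # what risks misplacing a split-shuffle task.
    candidates = {w for dep in dependencies for w in dep.who_has}
    if not candidates:
        candidates = set(all_workers)
    def bytes_to_fetch(w):
        return sum(dep.nbytes for dep in dependencies if w not in dep.who_has)
    return min(candidates, key=bytes_to_fetch)

# A 500MB group-shuffle output pins its split-shuffle to worker-a:
big = Dep(nbytes=500_000_000, who_has=frozenset({"worker-a"}))
assert decide_worker_sketch([big], ["worker-a", "worker-b"]) == "worker-a"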

cc @gjoseph92

B) Work stealing

There are two mechanisms in work stealing which serve to protect us from assigning tasks like split-shuffle to the wrong worker.

B1) There is a dedicated blacklist for this specific task prefix, see

fast_tasks = {"shuffle-split"}

The problem, as you can see, is that the blacklisted task prefix shuffle-split is the reverse of the actual name, split-shuffle. This is a regression which was introduced at some point (I remember fixing this one or two years ago; apparently I wrote a bad unit test :) )
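The immediate fix should then be a one-liner (plus, this time, a unit test that asserts the prefix actually matches real task names):

fast_tasks = {"split-shuffle"}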

B2) There is a heuristic which blacklists tasks that compute in under 5ms, as well as tasks whose data is relatively costly to move compared to their compute time, see

if split in fast_tasks:
    return None, None
ws = ts.processing_on
compute_time = ws.processing[ts]
if compute_time < 0.005:  # 5ms, just give up
    return None, None
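For context, the part of the heuristic that would remain is essentially a transfer-vs-compute ratio. Roughly (a simplified sketch in the same spirit, not the verbatim stealing.py code):

# bytes that would have to move if the task ran on a different worker
nbytes = sum(dep.nbytes for dep in ts.dependencies)
transfer_time = nbytes / bandwidth + latency
cost_multiplier = transfer_time / compute_time
# the larger cost_multiplier, the less attractive the steal; for a
# split-shuffle, compute_time is tiny, so this would normally protect it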

I'm currently investigating an issue where this logic might need to be removed or relaxed (see #4920 and #4471), which would remove the last layer of protection.

To see how bad this can become, I executed a shuffle computation on my local machine, see the perf report below. My OS kernel seemed to be pretty busy with process management, which slowed down everything. Great, this led to the split tasks being whitelisted for stealing and caused incredibly expensive transfers.

https://gistcdn.githack.com/fjetter/c444f51e378331fb70aa8dd09d66ff63/raw/80bc5796687eccfa3ad328b63cae81f97bd70b4b/gh4471_dataframe_shuffle_main.html

from distributed.client import performance_report
import time
import distributed
from dask.datasets import timeseries

if __name__ == "__main__":
    client = distributed.Client(n_workers=1, threads_per_worker=1, memory_limit="16G")

    print(client.dashboard_link)

    data = timeseries(
        start="2000-01-01",
        end="2002-12-31",
        freq="1s",
        partition_freq="1d",
    )
    # This includes one str col which causes non-trivial transfer_cost
    data = data.shuffle("id", shuffle="tasks")
    # drop the string col such that the sum doesn't consume too much CPU
    data = data[["id", "x", "y"]]

    result = data.sum()

    print(len(result.dask))

    with performance_report("gh4471_dataframe_shuffle_main.html"):
        print("submit computation")
        future = client.compute(result)

        print("started computation")

        time.sleep(20)
        print("scaling to 10 workers")
        client.cluster.scale(10)

        _ = future.result()

Future

Short term, I'll add split-shuffle to the blacklist of tasks eligible for stealing. However, we should ensure that this is respected both by decide_worker and by future work stealing. I would like to come up with a heuristic to identify tasks like this. While split-shuffle is probably the most prominent (and maybe most impactful) example, I would imagine that this pattern is not unusual.
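One shape such a heuristic could take, purely as a hypothetical sketch (is_locality_critical and the 10x margin are invented here; duration_average is the per-prefix rolling mean the scheduler already tracks):

def is_locality_critical(ts, bandwidth):
    # Hypothetical rule: never move a task whose expected compute time is
    # dwarfed by the cost of shipping its inputs, with no absolute floor.
    nbytes = sum(dep.nbytes for dep in ts.dependencies)
    transfer_time = nbytes / bandwidth
    return ts.prefix.duration_average < 0.1 * transfer_time

Something along these lines would catch split-shuffle (trivial compute, one huge dependency) without hard-coding task names.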

cc @madsbk I believe you have been looking into shuffle operations in the past and might be interested in this


Related PRs contributing to a fix
