Worker assignment for split-shuffle tasks #4962

@fjetter

Description

TL;DR

We currently allow stealing of split-shuffle tasks. That's a really bad idea...

Intro

Our distributed shuffle algorithm currently employs three kinds of tasks, repeated a few times over multiple stages:

  1. group-shuffle
  2. split-shuffle
  3. shuffle

where group-shuffle performs an actual groupby per chunk and outputs the result as a tuple or list, while split-shuffle simply performs a getitem on that list. split-shuffle is therefore incredibly cheap and benefits greatly from data locality: not only would a replica of its group-shuffle input be useless, the dependent shuffle tasks usually cannot benefit much from data locality themselves. Moving a split-shuffle, or assigning it to a suboptimal worker in the first place, is therefore a horrible decision.
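To make the pattern concrete, here is a rough sketch of one such stage expressed as a raw dask graph. The key names mirror the task prefixes above, but group_and_split, concat, and the toy inputs are made-up stand-ins for illustration, not what dask.dataframe actually generates:

import operator
import dask
import pandas as pd

def group_and_split(df, n):
    # Stand-in for the expensive group-shuffle work: partition one chunk
    # into n groups by key (the real tasks use a hash-based groupby).
    return [df[df["id"] % n == i] for i in range(n)]

def concat(parts):
    # Stand-in for reassembling one output partition.
    return pd.concat(parts)

graph = {
    ("input", 0): pd.DataFrame({"id": [0, 1, 2, 3]}),
    ("input", 1): pd.DataFrame({"id": [4, 5, 6, 7]}),
    ("group-shuffle", 0): (group_and_split, ("input", 0), 2),
    ("group-shuffle", 1): (group_and_split, ("input", 1), 2),
    # split-shuffle is nothing but a getitem on its single dependency ...
    ("split-shuffle", 0, 0): (operator.getitem, ("group-shuffle", 0), 0),
    ("split-shuffle", 0, 1): (operator.getitem, ("group-shuffle", 0), 1),
    ("split-shuffle", 1, 0): (operator.getitem, ("group-shuffle", 1), 0),
    ("split-shuffle", 1, 1): (operator.getitem, ("group-shuffle", 1), 1),
    # ... so running it anywhere but on the worker holding its
    # group-shuffle output means shipping the whole list for a no-op.
    ("shuffle", 0): (concat, [("split-shuffle", 0, 0), ("split-shuffle", 1, 0)]),
    ("shuffle", 1): (concat, [("split-shuffle", 0, 1), ("split-shuffle", 1, 1)]),
}

out0, out1 = dask.get(graph, [("shuffle", 0), ("shuffle", 1)])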

Problem

We currently have two mechanisms which (re-)assign a task to a worker:

A) decide_worker

This is the first stop for a task; we typically choose a worker which already holds one of its dependencies. However, we are currently opening up this logic to allow for more flexible assignments (#4925). In particular, we'll include idle workers in the candidate list which, iiuc, exposes us to the risk of assigning split-shuffle tasks to the wrong worker.
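For reference, here is a heavily simplified sketch of the locality-based part of that decision. It is not the actual scheduler code (the real objective also weighs occupancy and expected start time); Dep, decide_worker_sketch, and the worker names are made up for illustration:

from dataclasses import dataclass

@dataclass(frozen=True)
class Dep:
    nbytes: int
    who_has: frozenset  # names of workers currently holding this data

def decide_worker_sketch(dependencies, all_workers):
    # Restrict candidates to workers already holding a dependency, then
    # pick the one that would have to fetch the fewest bytes. Widening
    # `candidates` to idle workers (as discussed in #4925) is exactly
    # what risks misplacing a split-shuffle task.
    candidates = {w for dep in dependencies for w in dep.who_has}
    if not candidates:
        candidates = set(all_workers)
    def bytes_to_fetch(w):
        return sum(dep.nbytes for dep in dependencies if w not in dep.who_has)
    return min(candidates, key=bytes_to_fetch)

# A 500MB group-shuffle output pins its split-shuffle to worker-a:
big = Dep(nbytes=500_000_000, who_has=frozenset({"worker-a"}))
assert decide_worker_sketch([big], ["worker-a", "worker-b"]) == "worker-a"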

cc @gjoseph92

B) Work stealing

There are two mechanisms in work stealing which serve to protect us from assigning tasks like split-shuffle to the wrong worker.

B1) There is a dedicated blacklist for this specific task prefix, see

fast_tasks = {"shuffle-split"}

The problem, as you can see, is that the blacklisted task prefix shuffle-split is the reverse of the actual name, split-shuffle. This is a regression which was introduced at some point (I remember fixing this one or two years ago; apparently I wrote a bad unit test :) )
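The immediate fix should then be a one-liner (plus, this time, a unit test that asserts the prefix actually matches real task names):

fast_tasks = {"split-shuffle"}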

B2) There is a heuristic which blacklists tasks that compute in under 5ms, as well as tasks whose data is relatively costly to move compared to their compute time, see

if split in fast_tasks:
    return None, None
ws = ts.processing_on
compute_time = ws.processing[ts]
if compute_time < 0.005:  # 5ms, just give up
    return None, None
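For context, the part of the heuristic that would remain is essentially a transfer-vs-compute ratio. Roughly (a simplified sketch in the same spirit, not the verbatim stealing.py code):

# bytes that would have to move if the task ran on a different worker
nbytes = sum(dep.nbytes for dep in ts.dependencies)
transfer_time = nbytes / bandwidth + latency
cost_multiplier = transfer_time / compute_time
# the larger cost_multiplier, the less attractive the steal; for a
# split-shuffle, compute_time is tiny, so this would normally protect it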

I'm currently investigating an issue where this logic might need to be removed or relaxed (see #4920 and #4471), which would remove the last layer of protection.

To see how bad this can become, I executed a shuffle computation on my local machine, see the perf report below. My OS kernel seemed to be pretty busy with process management, which slowed down everything. Great, this led to the split tasks being whitelisted for stealing and caused incredibly expensive transfers.

https://gistcdn.githack.com/fjetter/c444f51e378331fb70aa8dd09d66ff63/raw/80bc5796687eccfa3ad328b63cae81f97bd70b4b/gh4471_dataframe_shuffle_main.html

from distributed.client import performance_report
import time
import distributed
from dask.datasets import timeseries

if __name__ == "__main__":
    client = distributed.Client(n_workers=1, threads_per_worker=1, memory_limit="16G")

    print(client.dashboard_link)

    data = timeseries(
        start="2000-01-01",
        end="2002-12-31",
        freq="1s",
        partition_freq="1d",
    )
    # This includes one str col which causes non-trivial transfer_cost
    data = data.shuffle("id", shuffle="tasks")
    # drop the string col such that the sum doesn't consume too much CPU
    data = data[["id", "x", "y"]]

    result = data.sum()

    print(len(result.dask))

    with performance_report("gh4471_dataframe_shuffle_main.html"):
        print("submit computation")
        future = client.compute(result)

        print("started computation")

        time.sleep(20)
        print("scaling to 10 workers")
        client.cluster.scale(10)

        _ = future.result()

Future

Short term, I'll add split-shuffle to the blacklist of tasks eligible for stealing. However, we should ensure that this is respected both by decide_worker and by future work stealing. I would like to come up with a heuristic to identify tasks like this. While split-shuffle is probably the most prominent (and maybe most impactful) example, I would imagine that this pattern is not unusual.
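One shape such a heuristic could take, purely as a hypothetical sketch (is_locality_critical and the 10x margin are invented here; duration_average is the per-prefix rolling mean the scheduler already tracks):

def is_locality_critical(ts, bandwidth):
    # Hypothetical rule: never move a task whose expected compute time is
    # dwarfed by the cost of shipping its inputs, with no absolute floor.
    nbytes = sum(dep.nbytes for dep in ts.dependencies)
    transfer_time = nbytes / bandwidth
    return ts.prefix.duration_average < 0.1 * transfer_time

Something along these lines would catch split-shuffle (trivial compute, one huge dependency) without hard-coding task names.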

cc @madsbk I believe you have been looking into shuffle operations in the past and might be interested in this


Related PRs contributing to a fix
