
[DISCUSS] Improve client/scheduler performance during shuffling #6163

Open
rjzamora opened this issue May 1, 2020 · 2 comments
Labels: dataframe, discussion, scheduler

Comments

rjzamora (Member) commented May 1, 2020

Let's use this issue to coordinate some ongoing efforts to improve client/scheduler graph performance for large-scale shuffle operations.

In order to rearrange data between partitions in dask.dataframe (for parallel merge/sort/shuffle routines), the rearrange_by_column_tasks routine is used to build a task graph for staged shuffling. Since the number of tasks scales as O(n log n) in the number of partitions, the time required both for graph creation and for execution of the graph itself can be quite significant.
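For reference, here is a minimal sketch that makes the scaling concrete by counting the tasks a task-based shuffle generates (assuming a recent dask; __dask_graph__ is the standard collection protocol for retrieving a collection's graph, and len() of the returned graph counts its tasks):

from dask.datasets import timeseries
from dask.dataframe.shuffle import shuffle

for end in ("2000-02-01", "2001-01-01", "2005-01-01"):
    ddf = timeseries(start="2000-01-01", end=end, partition_freq="1d")
    shuffled = shuffle(ddf, "id", shuffle="tasks")
    # The task count is dominated by the repeated shuffle_group/getitem
    # stages and should grow roughly as n * log(n) in the partition count n.
    print(f"{ddf.npartitions:5d} partitions -> {len(shuffled.__dask_graph__()):7d} tasks")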

Note that a detailed explanation of a nearly identical "staged shuffle" is given in this discussion. One component of the algorithm that clearly dominates the size of the graph is the repetition of shuffle_group tasks (which output dictionaries of pd/cudf DataFrame objects) and getitem tasks (which select elements of the shuffle-group output). It is my understanding that some people may have promising ideas to improve performance here.
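To make the shuffle_group/getitem pattern concrete, here is a simplified pandas-only sketch of what one such task pair computes. shuffle_group_sketch is a hypothetical stand-in for illustration, not the actual dask.dataframe.shuffle.shuffle_group implementation, which handles staging, hashing, and dtype details differently:

import pandas as pd

def shuffle_group_sketch(df, column, k):
    # Assign each row to one of k output splits by hashing the shuffle
    # column, then return a dict mapping split id -> sub-frame. Real
    # shuffle_group tasks emit a dict like this per input partition.
    part = pd.util.hash_pandas_object(df[column], index=False) % k
    return {i: df[part == i] for i in range(k)}

df = pd.DataFrame({"id": [3, 1, 4, 1, 5, 9, 2, 6], "x": range(8)})
groups = shuffle_group_sketch(df, "id", k=4)

# In the task graph, each downstream getitem task selects one entry of
# this dict -- one getitem per output split, per input partition, per
# stage -- which is where much of the graph-size blow-up comes from.
part0 = groups[0]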

cc @kkraus14 @quasiben @mrocklin (Please do cc others as well..)

kkraus14 (Member) commented May 1, 2020

cc @madsbk who had some ideas in this area

kkraus14 (Member) commented May 1, 2020

Also, taken from #6137 (comment), here's a standalone example where client/scheduler graph performance is problematic:

from distributed import Client
from dask.datasets import timeseries
from dask.dataframe.shuffle import shuffle

client = Client()

# 1827 daily partitions over five years
ddf_d = timeseries(start='2000-01-01', end='2005-01-01', partition_freq='1d')
ddf_d_2 = shuffle(ddf_d, "id", shuffle="tasks")

%time ddf_d_2 = ddf_d_2.persist()  # ~8s on my machine
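When profiling this, it can also help to separate client-side graph materialization from scheduling and execution. A rough way to do that in the same session (dict() works here because the graph returned by __dask_graph__ is a Mapping; exact timings will of course vary):

# Time client-side graph materialization alone, without submitting work
%time graph = dict(ddf_d_2.__dask_graph__())
print(len(graph), "tasks")  # the task count driving both build and schedule time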
