You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Lets use this issue to coordinate some ongoing efforts to improve client/scheduler graph performance related to large-scale shuffle operations.
In order to rearrange data between partitions in dask.dataframe (for parallel merge/sort/shuffle routines), the rearrange_by_column_tasks routine is used to build a task graph for staged shuffling. Since this logic represents n log(n) scaling, the time required for graph creation and execution itself can be quite significant.
Note that a detailed explanation of a nearly identical "staged shuffle" is described in this discussion. One component of the algorithm that is clearly dominating the size of the graph is the repetition of shuffle_group tasks (which output dictionaries of pd/cudf DataFrame objects) and getitem tasks (which select elements of the shuffle-group output). It is my understanding that some people may have promising ideas to improve performance here.
Lets use this issue to coordinate some ongoing efforts to improve client/scheduler graph performance related to large-scale shuffle operations.
In order to rearrange data between partitions in
dask.dataframe
(for parallel merge/sort/shuffle routines), therearrange_by_column_tasks
routine is used to build a task graph for staged shuffling. Since this logic representsn log(n)
scaling, the time required for graph creation and execution itself can be quite significant.Note that a detailed explanation of a nearly identical "staged shuffle" is described in this discussion. One component of the algorithm that is clearly dominating the size of the graph is the repetition of
shuffle_group
tasks (which output dictionaries of pd/cudf DataFrame objects) andgetitem
tasks (which select elements of theshuffle-group
output). It is my understanding that some people may have promising ideas to improve performance here.cc @kkraus14 @quasiben @mrocklin (Please do cc others as well..)
The text was updated successfully, but these errors were encountered: