Add multi-partition Shuffle operation to cuDF Polars #17744
Conversation
cc @wence- - It may make sense to get this in before
Some suggestions for discussion
A Shuffle node may have either one or two children. In both cases, the first child corresponds to the DataFrame we are shuffling. The optional second child corresponds to a distinct DataFrame to extract the shuffle keys from. For example, it may be useful to reference a distinct DataFrame in the case of sorting.

The type of argument `keys` controls whether or not hash partitioning will be applied. If `keys` is a tuple, we assume that the corresponding columns must be hashed. If `keys` is a `NamedExpr`, we assume that the corresponding column already contains a direct partition mapping.
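As a minimal, hypothetical sketch of the two `keys` cases described above (the `partition_mapping` helper, plain-Python `hash`, and the dict-of-lists "DataFrame" are illustrative stand-ins, not the actual cuDF-Polars implementation, where the second case would be a `NamedExpr` rather than a column name):

```python
from __future__ import annotations

def partition_mapping(
    df: dict[str, list], keys: tuple[str, ...] | str, num_partitions: int
) -> list[int]:
    """Return the output partition for each row of ``df``."""
    if isinstance(keys, tuple):
        # Tuple of column names: hash-partition on those columns.
        rows = zip(*(df[name] for name in keys))
        return [hash(row) % num_partitions for row in rows]
    # Otherwise, assume ``keys`` refers to a column that already
    # contains the destination partition of every row.
    return [int(v) for v in df[keys]]

df = {"A": [1, 2, 3, 4], "map": [0, 1, 0, 1]}
print(partition_mapping(df, ("A",), 2))  # hash-based mapping
print(partition_mapping(df, "map", 2))   # [0, 1, 0, 1] (direct mapping)
```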
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under what circumstances do we shuffle one dataframe with the keys/expressions from another dataframe? In the case of a `sortby`, all the referenced columns must live in the same dataframe.
It seems like this would be simpler if we always took a dataframe that is being shuffled and a dataframe that is being used to compute the partitioning keys (these can be the same), along with a `NamedExpr` (or just an `Expr`) that can produce the partition mapping?
> Under what circumstances do we shuffle one dataframe with the keys/expressions from another dataframe? In the case of a `sortby`, all the referenced columns must live in the same dataframe.
My thinking is that we want the `Shuffle` design to be something that we can use to "lower" both a hash-based shuffle (for a join or groupby) and a sortby. In the case of sortby, we don't actually care whether the referenced columns live in the same dataframe being sorted, because we need to do something like a global quantiles calculation on the referenced columns to figure out which partition each row corresponds to. Therefore, when we sort `df` on column `"A"`, we will probably want to add a new graph that transforms `df["A"]` into the final partition mapping.
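A rough sketch of that two-stage idea follows. The helper name and plain-Python lists are hypothetical, and it uses exact quantiles for illustration; a real implementation would use an approximate, distributed quantile calculation:

```python
import bisect

def lower_sort_shuffle(
    partitions: list[list[int]], npartitions: int
) -> list[list[int]]:
    """Map every row of every input partition to an output partition."""
    # Stage 1 (non-pointwise): a global calculation on the key column
    # picks the boundaries between output partitions. Exact quantiles
    # here, for illustration only.
    allvals = sorted(v for part in partitions for v in part)
    boundaries = [
        allvals[(i * len(allvals)) // npartitions] for i in range(1, npartitions)
    ]
    # Stage 2 (pointwise, given the boundaries): each row's output
    # partition follows from a binary search against the boundaries.
    return [
        [bisect.bisect_right(boundaries, v) for v in part] for part in partitions
    ]

print(lower_sort_shuffle([[8, 4], [10, 2, 1]], 3))  # [[2, 1], [2, 1, 0]]
```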
Hmmm, I guess somehow the thing we're using to shuffle the dataframe does come from that dataframe (otherwise it seems like you would have had to do a join first, at least morally).
So are you kind of asking for an extension of the expression language to express the computation on the input dataframe that results in a new column with appropriate partition keys?
> So are you kind of asking for an extension of the expression language to express the computation on the input dataframe that results in a new column with appropriate partition keys?
Yes, that is probably a reasonable way to think about it. For a simple hash-based shuffle, the hypothetical expression for finding the output partition of each row is pointwise. In the case of a sort, the expression requires global data movement (i.e. the histogram/quantiles).
At the moment, it's trivial to evaluate a pointwise expression to calculate the partition mapping. However, it is not possible to evaluate a non-pointwise expression without offloading that calculation to a distinct `IR` node.

Relevant context: We don't currently support multi-partition expressions unless they are "pointwise". We spent some time refactoring the `IR` class so that we can "lower" the evaluation of an `IR` node into tasks that execute the (static) `IR.do_evaluate` method. However, we cannot do this for `Expr.do_evaluate` yet. My impression was that we are not planning to refactor the `Expr` class. If so, we will probably need to decompose a single `IR` node containing a non-pointwise expression into one or more `IR` nodes that we know how to map onto a task graph.
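To illustrate the shape of the decomposition being described (none of these node names are real cuDF-Polars IR classes; this is only a sketch of one possible rewrite):

```
Sort(df, by="A")
    boundaries = GlobalQuantiles(df["A"])           # non-pointwise, own IR node
    mapping    = SearchSorted(df["A"], boundaries)  # pointwise, given boundaries
    Shuffle(df, keys=mapping)                       # the actual data movement
```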
Thanks for all your work so far @rjzamora! My apologies, I don't have anything to add to the review. I'm adding this comment just to check my understanding.
> At the moment, it's trivial to evaluate a pointwise expression to calculate the partition mapping.

So we've got hash-based shuffles, which are pointwise. This makes it relatively straightforward to determine the partition mapping. E.g. `hash(df["A"]) % num_partitions` only depends on each row's value in column `"A"`.

Sort-based shuffles are non-pointwise because you'd need to know the ranges that divide the dataframe into partitions. E.g. `[8, 4, 10, 2, 1]` into 3 partitions -> `{0: [1, 2], 1: [4], 2: [8, 10]}`. How would we calculate the boundaries? (which I think is the quantile calculation)
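Working that example by hand (the boundary values below are hypothetical, standing in for whatever the quantile calculation would produce):

```python
import bisect

values = [8, 4, 10, 2, 1]
boundaries = [4, 8]  # hypothetical output of a quantile calculation
mapping = [bisect.bisect_right(boundaries, v) for v in values]
print(mapping)  # [2, 1, 2, 0, 0]
# i.e. {0: [2, 1], 1: [4], 2: [8, 10]} -- the partitions in the example
# above, up to ordering within each partition.
```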
> However, it is not possible to evaluate a non-pointwise expression without offloading that calculation to a distinct IR node.
Would you use multiple IR nodes to do the calculation?
Sorry for the delayed response here @Matt711!
> So we've got hash-based shuffles which are pointwise.
Exactly right. Just to state this a slightly different way: Any shuffle operation is actually two distinct operations. First, we need to figure out where each row is going, then we perform the actual shuffle. Let's call that first step the "partition-mapping" calculation. For a hash-based shuffle, the partition-mapping step is indeed pointwise. For a sort, the partition-mapping step is not.
> Sort-based shuffles ... How would we calculate the boundaries? (which I think is the quantile calculation)
In Dask DataFrame, we essentially calculate a list of N quantiles on each partition independently (where N >= the number of output partitions). Since the data may not be balanced, we then compute approximate "global" quantiles by merging these independent quantile calculations together (the code is generally in dask/dataframe/partitionquantiles.py).

In Dask DataFrame, we reduce these "global" quantiles on the client. However, for cudf-polars we may want to write it as more of an all-reduce pattern (TBD).
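A heavily simplified sketch of that merge step (loosely modeled on the idea in dask/dataframe/partitionquantiles.py; the helper names, oversampling factor, and re-sampling strategy are illustrative assumptions, not the real algorithm):

```python
import numpy as np

def local_quantiles(col: np.ndarray, n: int) -> np.ndarray:
    # Each partition computes its own quantiles independently.
    return np.quantile(col, np.linspace(0.0, 1.0, n))

def global_boundaries(partitions: list[np.ndarray], nparts: int) -> np.ndarray:
    # Merge the per-partition quantiles, then re-sample to approximate
    # global quantiles; the interior values become shuffle boundaries.
    merged = np.sort(
        np.concatenate([local_quantiles(p, 2 * nparts) for p in partitions])
    )
    picks = np.linspace(0, merged.size - 1, nparts + 1).astype(int)
    return merged[picks][1:-1]

parts = [np.array([8.0, 4.0]), np.array([10.0, 2.0, 1.0])]
print(global_boundaries(parts, 3))
```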
> Would you use multiple IR nodes to do the calculation?
Yes, I think so. But this is just a design choice that allows us to keep "Shuffle" logic separate from "partition-mapping" logic. There is no fundamental requirement for us to do this.
@wence- - As we discussed offline, I decided to simplify the
@wence- Are we good here? (should I re-target 25.04?)
Some small suggestions, but let's go for 25.04
self.schema = schema
self.keys = keys
self.options = options
self._non_child_args = ()
Should this be `(schema, keys, options)`?
I feel that a `Shuffle` IR node is a "special" case where we don't actually want the `do_evaluate` method to be used at all. I actually just changed `Shuffle.do_evaluate` to raise a `NotImplementedError`, since a single-partition shuffle should never occur.
FWIW, I think it would be useful to be able to evaluate it, because then one can test the rewrites on a single partition independent of the partitioning and dask backend
Okay, seems reasonable to me. I changed `Shuffle.do_evaluate` to be a no-op for now.
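A minimal sketch of that no-op behavior (the class and signature here are stand-ins, not the real cuDF-Polars `Shuffle` node, whose `do_evaluate` takes the node's actual arguments):

```python
class Shuffle:
    @staticmethod
    def do_evaluate(df):
        # Single-partition "shuffle": every row is already in the only
        # partition, so there is nothing to move; return the input.
        return df
```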
Thanks @galipremsagar - Does anyone know what's going on with the "pre-commit.ci" check? Do I need to do something to update my local pre-commit hooks?
CI is unblocked. They are optional for now. But @bdice will know more about it.
You need to merge the latest changes in from 25.02. 25.04 is a bit behind because the forward merger was blocked. We should be able to get that resolved this morning.
@wence- We happy here once CI is clear?
/merge
Description
This PR pulls out the `Shuffle` logic from #17518 to simplify the review process. The goal is to establish the shuffle groundwork for multi-partition `Join` and `Sort` operations.
Checklist