
Add multi-partition Shuffle operation to cuDF Polars #17744

Merged
merged 23 commits into from
Jan 29, 2025

Conversation

rjzamora
Member

Description

This PR pulls out the Shuffle logic from #17518 to simplify the review process.

The goal is to establish the shuffle groundwork for multi-partition Join and Sort operations.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora added feature request New feature or request 3 - Ready for Review Ready for review by team non-breaking Non-breaking change cudf.polars Issues specific to cudf.polars labels Jan 15, 2025
@rjzamora rjzamora self-assigned this Jan 15, 2025
@rjzamora rjzamora requested review from a team as code owners January 15, 2025 17:26
@rjzamora rjzamora requested review from bdice and mroeschke January 15, 2025 17:26
@github-actions github-actions bot added the Python Affects Python cuDF API. label Jan 15, 2025
@rjzamora
Member Author

cc @wence- - It may make sense to get this in before Join or Groupby support. The Join logic largely depends on shuffling, and GroupBy may take a bit longer to clean up. (I can also push on Sort once the shuffling foundation is in place).

Contributor

@wence- wence- left a comment


Some suggestions for discussion

python/cudf_polars/cudf_polars/experimental/base.py (outdated; resolved)
python/cudf_polars/cudf_polars/experimental/shuffle.py (outdated; resolved)
Comment on lines 36 to 47
A Shuffle node may have either one or two children. In both
cases, the first child corresponds to the DataFrame we are
shuffling. The optional second child corresponds to a distinct
DataFrame to extract the shuffle keys from. For example, it
may be useful to reference a distinct DataFrame in the case
of sorting.

The type of argument `keys` controls whether or not hash
partitioning will be applied. If `keys` is a tuple, we
assume that the corresponding columns must be hashed. If
`keys` is a `NamedExpr`, we assume that the corresponding
column already contains a direct partition mapping.
Contributor

Under what circumstances do we shuffle one dataframe with the keys/expressions from another dataframe?

In the case of a sortby then all the referenced columns must live in the same dataframe.

Contributor

It seems like this would be simpler if we always took a dataframe that is being shuffled and a dataframe that is being used to compute the partitioning keys (these can be the same), along with a NamedExpr (or just an Expr) that can produce the partition mapping?

Member Author

Under what circumstances do we shuffle one dataframe with the keys/expressions from another dataframe?
In the case of a sortby then all the referenced columns must live in the same dataframe.

My thinking is that we want the Shuffle design to be something that we can use to "lower" both a hash-based shuffle (for a join or groupby) and a sortby. In the case of sortby, we don't actually care whether the referenced columns live in the same dataframe being sorted, because we need to do something like a global quantiles calculation on the referenced columns to figure out which partition each row corresponds to. Therefore, when we sort df on column "A", we will probably want to add a new graph that transforms df["A"] into the final partition mapping.

Contributor

Hmmm, I guess somehow the thing we're using to shuffle the dataframe does come from that dataframe (otherwise it seems like you would have had to do a join first, at least morally).

So are you kind of asking for an extension of the expression language to express the computation on the input dataframe that results in a new column with appropriate partition keys?

Member Author

So are you kind of asking for an extension of the expression language to express the computation on the input dataframe that results in a new column with appropriate partition keys?

Yes, that is probably a reasonable way to think about it. For a simple hash-based shuffle, the hypothetical expression for finding the output partition of each row is pointwise. In the case of a sort, the expression requires global data movement (i.e. the histogram/quantiles).
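To make the pointwise case concrete, here is a minimal sketch of what a hash-based partition mapping could look like. This is illustrative only (plain pandas, not the cudf-polars API); the function name and signature are hypothetical:

```python
import pandas as pd

def hash_partition_mapping(
    df: pd.DataFrame, keys: list[str], num_partitions: int
) -> pd.Series:
    """Pointwise mapping: each row's output partition depends only on that row."""
    # Hash the key columns row-wise, then modulo into the partition count.
    hashed = pd.util.hash_pandas_object(df[keys], index=False)
    return (hashed % num_partitions).astype("int64")

df = pd.DataFrame({"A": ["x", "y", "x", "z"]})
mapping = hash_partition_mapping(df, ["A"], num_partitions=3)
# Rows with identical keys always land in the same partition.
assert mapping.iloc[0] == mapping.iloc[2]
```

Because the mapping for each row is independent of every other row, this expression can be evaluated per-partition with no global data movement.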

At the moment, it's trivial to evaluate a pointwise expression to calculate the partition mapping. However, it is not possible to evaluate a non-pointwise expression without offloading that calculation to a distinct IR node.

Relevant context: We don't currently support multi-partition expression unless they are "pointwise". We spent some time refactoring the IR class so that we can "lower" the evaluation of an IR node into tasks that execute the (static) IR.do_evaluate method. However, we cannot do this for Expr.do_evaluate yet. My impression was that we are not planning to refactor the Expr class. If so, we will probably need to decompose a single IR node containing a non-pointwise expression into one or more IR nodes that we know how to map onto a task graph.

Contributor

Thanks for all your work so far @rjzamora! My apologies, I don't have anything to add to the review. I'm adding this comment just to check my understanding.

At the moment, it's trivial to evaluate a pointwise expression to calculate the partition mapping.

So we've got hash-based shuffles which are pointwise. This makes it relatively straightforward to determine the partition mapping. E.g. hash(df["A"]) % num_partitions only depends on column "A".

Sort-based shuffles are non-pointwise because you'd need to know the ranges that divide the dataframe into partitions. E.g. [8, 4, 10, 2, 1] into 3 partitions -> {0: [1, 2], 1: [4], 2: [8, 10]}. How would we calculate the boundaries? (which I think is the quantile calculation)
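Assuming the boundary values were already known, the lookup step of that example could be sketched like this (the boundaries array here is hypothetical; computing it is the hard, non-pointwise part):

```python
import numpy as np

values = np.array([8, 4, 10, 2, 1])
# Hypothetical precomputed split points dividing the values into 3 partitions:
# partition 0: < 4, partition 1: [4, 8), partition 2: >= 8
boundaries = np.array([4, 8])
# searchsorted maps each value to the partition whose range contains it.
mapping = np.searchsorted(boundaries, values, side="right")
# mapping is [2, 1, 2, 0, 0], matching {0: [1, 2], 1: [4], 2: [8, 10]}
```

The lookup itself is pointwise once the boundaries exist; only the boundary (quantile) calculation requires global information.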

However, it is not possible to evaluate a non-pointwise expression without offloading that calculation to a distinct IR node.

Would you use multiple IR nodes to do the calculation?

Member Author

Sorry for the delayed response here @Matt711 !

So we've got hash-based shuffles which are pointwise.

Exactly right. Just to state this a slightly different way: Any shuffle operation is actually two distinct operations. First, we need to figure out where each row is going, then we perform the actual shuffle. Let's call that first step the "partition-mapping" calculation. For a hash-based shuffle, the partition-mapping step is indeed pointwise. For a sort, the partition-mapping step is not.

Sort-based shuffles ... How would we calculate the boundaries? (which I think is the quantile calculation)

In Dask DataFrame, we essentially calculate a list of N quantiles on each partition independently (where N is >= the number of output partitions). Since the data may not be balanced, we then calculate approximate "global" quantiles by merging these independent quantile calculations together (the code is generally in dask/dataframe/partitionquantiles.py).

In Dask DataFrame, we reduce these "global" quantiles on the client. However, for cudf-polars we may want to write it as more of an all-reduce pattern (TBD).
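A much-simplified, single-process sketch of that idea (Dask's real implementation in dask/dataframe/partitionquantiles.py weights the merge by partition size; this version just pools the local quantiles and re-quantiles them):

```python
import numpy as np

def local_quantiles(part: np.ndarray, n: int) -> np.ndarray:
    # Each partition computes n quantiles independently (cheap and local).
    return np.quantile(part, np.linspace(0, 1, n))

def merge_quantiles(local_qs: list[np.ndarray], num_partitions: int) -> np.ndarray:
    # Approximate global split points: pool the per-partition quantiles,
    # then take (num_partitions - 1) interior quantiles of the pooled values.
    pooled = np.sort(np.concatenate(local_qs))
    interior = np.linspace(0, 1, num_partitions + 1)[1:-1]
    return np.quantile(pooled, interior)

parts = [np.array([8, 4, 10]), np.array([2, 1, 7])]
boundaries = merge_quantiles(
    [local_quantiles(p, 5) for p in parts], num_partitions=3
)
# boundaries is a sorted array of 2 split points for 3 output partitions.
```

In a distributed setting, the local_quantiles step runs as one task per partition and the merge step is the reduction (client-side in Dask DataFrame, or an all-reduce as suggested above).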

Would you use multiple IR nodes to do the calculation?

Yes, I think so. But this is just a design choice that allows us to keep "Shuffle" logic separate from "partition-mapping" logic. There is no fundamental requirement for us to do this.

python/cudf_polars/tests/experimental/test_shuffle.py (outdated; resolved)
python/cudf_polars/cudf_polars/experimental/shuffle.py (outdated; resolved)
@rjzamora
Member Author

@wence- - As we discussed offline, I decided to simplify the Shuffle class for now (to focus on hash-based shuffling). We can tackle the sorting problem after we have the (basic) join and groupby support out of the way.

@rjzamora
Member Author

@wence- Are we good here? (should I re-target 25.04?)

Contributor

@wence- wence- left a comment


Some small suggestions, but let's go for 25.04

self.schema = schema
self.keys = keys
self.options = options
self._non_child_args = ()
Contributor

Should this be (schema, keys, options)?

Member Author

I feel that a Shuffle IR node is a "special" case where we don't actually want the do_evaluate method to be used at all. I actually just changed Shuffle.do_evaluate to raise NotImplementedError, since a single-partition shuffle should never occur.

Contributor

FWIW, I think it would be useful to be able to evaluate it, because then one can test the rewrites on a single partition independent of the partitioning and dask backend

Member Author

Okay, seems reasonable to me. I changed Shuffle.do_evaluate to be a no-op for now.

python/cudf_polars/cudf_polars/experimental/shuffle.py (outdated; resolved)
@rjzamora rjzamora changed the base branch from branch-25.02 to branch-25.04 January 27, 2025 17:04
@galipremsagar
Contributor

@rjzamora I cancelled the most recent workflow to free up resources to unblock all of cudf CI for this PR: #17771

I'll rerun once #17771 is merged.

@rjzamora
Member Author

Thanks @galipremsagar - Does anyone know what's going on with the "pre-commit.ci" check? Do I need to do something to update my local pre-commit hooks?

@galipremsagar
Contributor

Thanks @galipremsagar - Does anyone know what's going on with the "pre-commit.ci" check? Do I need to do something to update my local pre-commit hooks?

CI is unblocked. They are optional for now. But @bdice will know more about it.

@vyasr
Contributor

vyasr commented Jan 28, 2025

You need to merge the latest changes in from 25.02. 25.04 is a bit behind because the forward merger was blocked. We should be able to get that resolved this morning.

@rjzamora
Member Author

@wence- Are we happy here once CI is clear?

@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Jan 29, 2025
@rjzamora
Member Author

/merge

@rapids-bot rapids-bot bot merged commit a6f90f0 into rapidsai:branch-25.04 Jan 29, 2025
108 checks passed
@rjzamora rjzamora deleted the cudf-polars-multi-shuffle branch January 29, 2025 14:35