Add multi-partition `Shuffle` operation to cuDF Polars #17744

Under what circumstances do we shuffle one dataframe with the keys/expressions from another dataframe? In the case of a `sortby`, all the referenced columns must live in the same dataframe.

It seems like this would be simpler if we always took a dataframe that is being shuffled and a dataframe that is being used to compute the partitioning keys (these can be the same), along with a `NamedExpr` (or just an `Expr`) that can produce the partition mapping?

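A minimal sketch of what that interface might look like (all names here are hypothetical, not the actual cudf-polars classes):

```python
# Hypothetical sketch of the proposed Shuffle interface, not the real
# cudf-polars implementation.
from dataclasses import dataclass
from typing import Any


@dataclass
class ShuffleSketch:
    df: Any              # IR node producing the dataframe being shuffled
    keys: Any            # IR node producing the partitioning keys (may be `df`)
    partition_expr: Any  # Expr/NamedExpr yielding one partition id per row
```
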
My thinking is that we want the `Shuffle` design to be something that we can use to "lower" both a hash-based shuffle (for a join or groupby) and a sortby. In the case of sortby, we don't actually care whether the referenced columns live in the same dataframe being sorted, because we need to do something like a global quantiles calculation on the referenced columns to figure out which partition each row corresponds to. Therefore, when we sort `df` on column `"A"`, we will probably want to add a new graph that transforms `df["A"]` into the final partition mapping.

Hmmm, I guess somehow the thing we're using to shuffle the dataframe does come from that dataframe (otherwise it seems like you would have had to do a join first, at least morally).

So are you kind of asking for an extension of the expression language to express the computation on the input dataframe that results in a new column with appropriate partition keys?

Yes, that is probably a reasonable way to think about it. For a simple hash-based shuffle, the hypothetical expression for finding the output partition of each row is pointwise. In the case of a sort, the expression requires global data movement (i.e. the histogram/quantiles).

At the moment, it's trivial to evaluate a pointwise expression to calculate the partition mapping. However, it is not possible to evaluate a non-pointwise expression without offloading that calculation to a distinct `IR` node.

Relevant context: We don't currently support multi-partition expressions unless they are "pointwise". We spent some time refactoring the `IR` class so that we can "lower" the evaluation of an `IR` node into tasks that execute the (static) `IR.do_evaluate` method. However, we cannot do this for `Expr.do_evaluate` yet. My impression was that we are not planning to refactor the `Expr` class. If so, we will probably need to decompose a single `IR` node containing a non-pointwise expression into one or more `IR` nodes that we know how to map onto a task graph.

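To make the pointwise/non-pointwise distinction concrete, here is a hedged sketch (hypothetical helper names, not the real lowering code): a pointwise partition-mapping expression lowers to one independent task per partition, while a non-pointwise one needs an extra stage that combines information from every partition.

```python
# Hypothetical illustration of the lowering constraint, not actual
# cudf-polars code.

def lower_pointwise(evaluate, partitions):
    # Pointwise (e.g. a hash-based mapping): each partition can be
    # processed as an independent task, with no communication.
    return [evaluate(part) for part in partitions]


def lower_non_pointwise(local_step, merge_step, partitions):
    # Non-pointwise (e.g. sort boundaries): each partition computes a
    # local summary, then a separate reduction-like stage merges them.
    local_summaries = [local_step(part) for part in partitions]
    return merge_step(local_summaries)
```
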
Thanks for all your work so far @rjzamora! My apologies, I don't have anything to add to the review. I'm adding this comment just to check my understanding.

So we've got hash-based shuffles, which are pointwise. This makes it relatively straightforward to determine the partition mapping. E.g. `hash(df["A"]) % num_partitions` only depends on each row's value in column `"A"`.

Sort-based shuffles are non-pointwise because you'd need to know the ranges that divide the dataframe into partitions. E.g. `[8, 4, 10, 2, 1]` into 3 partitions -> `{0: [1, 2], 1: [4], 2: [8, 10]}`. How would we calculate the boundaries? (Which I think is the quantile calculation.) Would you use multiple `IR` nodes to do the calculation?

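For illustration, a plain-Python sketch of the two mapping styles (not the actual implementation); the sort boundaries are assumed here rather than computed:

```python
import bisect

num_partitions = 3
values = [8, 4, 10, 2, 1]

# Hash-based (pointwise): each row's destination depends only on its own key.
hash_mapping = [hash(v) % num_partitions for v in values]

# Sort-based (non-pointwise): destinations depend on global boundaries,
# which in practice would come from a quantile calculation over ALL rows.
boundaries = [4, 8]  # assumed for this example
sort_mapping = [bisect.bisect_right(boundaries, v) for v in values]
# sort_mapping == [2, 1, 2, 0, 0], i.e. {0: [2, 1], 1: [4], 2: [8, 10]}
```
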
Sorry for the delayed response here @Matt711!

Exactly right. Just to state this a slightly different way: any shuffle operation is actually two distinct operations. First, we need to figure out where each row is going; then we perform the actual shuffle. Let's call that first step the "partition-mapping" calculation. For a hash-based shuffle, the partition-mapping step is indeed pointwise. For a sort, the partition-mapping step is not.

In Dask DataFrame, we essentially calculate a list of N quantiles on each partition independently (where N is >= the number of output partitions). Since the data may not be balanced, we then compute approximate "global" quantiles by merging these independent quantile calculations together (the code is generally in dask/dataframe/partitionquantiles.py).

In Dask DataFrame, we reduce these "global" quantiles on the client. However, for cudf-polars we may want to write it as more of an all-reduce pattern (TBD).

Yes, I think so. But this is just a design choice that allows us to keep "Shuffle" logic separate from "partition-mapping" logic. There is no fundamental requirement for us to do this.

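A rough sketch of that merge step (heavily simplified relative to dask/dataframe/partitionquantiles.py, which uses a weighted merge to handle skew; this just shows the idea, with NumPy):

```python
import numpy as np


def approx_global_quantiles(partitions, num_output_partitions):
    """Approximate global partition boundaries from per-partition quantiles."""
    q = np.linspace(0, 1, num_output_partitions + 1)
    # Each partition computes its quantiles independently (embarrassingly
    # parallel), ...
    local = [np.quantile(np.asarray(p, dtype=float), q) for p in partitions]
    # ... then the local results are merged. Here we naively pool and
    # re-quantile; Dask weights the local quantiles by partition size.
    merged = np.concatenate(local)
    return np.quantile(merged, q)[1:-1]  # keep only interior boundaries


boundaries = approx_global_quantiles([[8, 4], [10, 2, 1]], 3)
```
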
Should this be `(schema, keys, options)`?

I feel that a `Shuffle` IR node is a "special" case where we don't actually want the `do_evaluate` method to be used at all. I actually just changed `Shuffle.do_evaluate` to raise a `NotImplementedError`, since a single-partition shuffle should never occur.

FWIW, I think it would be useful to be able to evaluate it, because then one can test the rewrites on a single partition, independent of the partitioning and the Dask backend.

Okay, seems reasonable to me. I changed `Shuffle.do_evaluate` to be a no-op for now.

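Presumably something along these lines (a hypothetical sketch, not the exact cudf-polars method or signature): with a single partition there is nothing to move, so evaluation simply passes the child dataframe through.

```python
class Shuffle:
    # Hypothetical no-op sketch; the real method takes additional
    # arguments (e.g. schema, keys, options) per the IR protocol.
    @staticmethod
    def do_evaluate(df):
        # Single-partition case: shuffling moves no data, so the input
        # dataframe is returned unchanged.
        return df
```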