Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for arrays in hashaggregate [databricks] #7465
Add support for arrays in hashaggregate [databricks] #7465
Changes from 5 commits
6c6df93
70e8e2e
d30d96c
5e04466
7bbf970
30c822b
9378dce
12e5164
5098efd
7273f30
6733b42
b7f3da2
c48837b
56724eb
49455d4
d81b551
59fb0e7
9a27733
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a follow on issue to fix this? Have we tested that this does not work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sameerz is hashing function for Array[Struct] supported in cudf or is there an issue tracking that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In cudf, lists of structs and structs of lists are not yet supported (tracking issue rapidsai/cudf#11222).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also spark-rapid tracking issue: #5109
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ttnghia, the cudf issue you have tagged rapidsai/cudf#11222 seems to be for sorting a list of structs, is that the same for hashing? I tested passing a list of structs as groupBy key and cudf didn't like it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand what you are saying correctly, I am aware that we don't have any control in the plugin to pick which aggregation to use, cudf will pick Hash or Sort.
But the line this comment is referencing is explicitly talking about
HashPartitioning
so I think this very explicitly only concerns that. To prove this, I uncommented this check and tried to do a groupBy on a List[Struct] and got an error from cudf saying that murmur hash is not implemented for List[Struct]. This is why I was confused why issue rapidsai/cudf#11222, which is completed, is tagged here when it deals with sorting a list and clearly doesn't fix the problem with groupBy on List[Struct]There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we have to answer the question about what does
HashPartitioning
do for groupby?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If groupBy uses
HashAggregate
then it will useHashPartitioning
to create buckets with buffers pointing to values. So in this case, it will try to calculate the hash for List[Struct] which isn't supported atm by cudf.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've filed an issue for it: #8676
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Is there a corresponding cudf issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make sure to cleanup all files.