-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for arrays in hashaggregate [databricks] #7465
Add support for arrays in hashaggregate [databricks] #7465
Conversation
…VIDIA#6066)" (NVIDIA#6679)" This reverts commit c05ac2d. Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
build |
Please hold off a bit. There is a bug in cudf that incorrectly sort arrays (rapidsai/cudf#12298). We should better wait for it to be fixed first. |
Agree, marking this draft until that is resolved. |
@razajafri we have fixed list sorting and it has been merged rapidsai/cudf#12538 |
…gg-array Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
build |
@@ -559,7 +559,7 @@ case class GpuBasicMin(child: Expression) extends GpuMin(child) | |||
*/ | |||
case class GpuFloatMin(child: Expression) extends GpuMin(child) | |||
with GpuReplaceWindowFunction { | |||
|
|||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make sure to cleanup all files.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/AggregateFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/AggregateFunctions.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had one minor thing, but I am actually very concerned about putting this into 23.02. In order for sorting of arrays to work we need to ensure that there are no non-empty nulls. CUDF is working through fixing this for their code, and we have not really finished this either. I am fine with putting this in, but we have to have a way to fix up the input in the worst case. There are CUDF APIs to check if the input is bad and fix it up as needed, but we don't have JNI versions of those APIs yet to use in 23.02. Ideally if we were running tests we would throw an exception if we saw that the input was bad, but if we were not running tests we would just check to see if we might need to fix it up and if we do, then we would fix it. This is a data corruption issue.
We might be able to put the changes we need in spark-rapids-jni in the short term for 23.02. But I don't want to put this in without it.
}) | ||
) | ||
if (arrayWithStructsHashing) { | ||
willNotWorkOnGpu("hashing arrays with structs is not supported") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a follow on issue to fix this? Have we tested that this does not work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sameerz is hashing function for Array[Struct] supported in cudf or is there an issue tracking that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In cudf, lists of structs and structs of lists are not yet supported (tracking issue rapidsai/cudf#11222).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also spark-rapid tracking issue: #5109
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ttnghia, the cudf issue you have tagged rapidsai/cudf#11222 seems to be for sorting a list of structs, is that the same for hashing? I tested passing a list of structs as groupBy key and cudf didn't like it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand what you are saying correctly, I am aware that we don't have any control in the plugin to pick which aggregation to use, cudf will pick Hash or Sort.
But the line this comment is referencing is explicitly talking about HashPartitioning
so I think this very explicitly only concerns that. To prove this, I uncommented this check and tried to do a groupBy on a List[Struct] and got an error from cudf saying that murmur hash is not implemented for List[Struct]. This is why I was confused why issue rapidsai/cudf#11222, which is completed, is tagged here when it deals with sorting a list and clearly doesn't fix the problem with groupBy on List[Struct]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we have to answer the question about what does HashPartitioning
do for groupby?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If groupBy uses HashAggregate
then it will use HashPartitioning
to create buckets with buffers pointing to values. So in this case, it will try to calculate the hash for List[Struct] which isn't supported atm by cudf.
scala> df.groupBy("_1","_2").count().explain
== Physical Plan ==
*(2) HashAggregate(keys=[_1#3, _2#4], functions=[count(1)])
+- Exchange hashpartitioning(_1#3, _2#4, 200), ENSURE_REQUIREMENTS, [id=#55]
+- *(1) HashAggregate(keys=[_1#3, _2#4], functions=[partial_count(1)])
+- *(1) LocalTableScan [_1#3, _2#4]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've filed an issue for it: #8676
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Is there a corresponding cudf issue?
@revans2 why do we need to ensure there are no non empty NULLs to sort arrays? We solved this in libcudf we the most recent PR I linked, didn't we? Are you seeing any more bugs since then? |
@divyegala we should sync up because I thought that the latest commits fixed null ordering and did some to mitigate non-empty nulls. But I thought that the sort still had issues if it saw rows with non-empty nulls in them. I know that we still have some issues with producing bad data. Also I thought that the plan was to move to not producing bad data instead of fixing it up after it was produced. If that is true then I still would like to have guard rails in place, at least when we are running tests to verify that we are not doing something wrong. |
@divyegala and I spoke and it was a small misunderstanding on non-empty nulls. This needs to wait until we can get asserts/fixup in place. |
What asserts/fixups are we talking about? Is there already a PR for them or an issue that we can use to depend this upon? |
@razajafri please retarget to 23.06. |
ebfa339
to
12e5164
Compare
build |
build |
CI failing because groupby on lists for most of the types won't fallback to the CPU now. Will post an update shortly |
build |
Regenerated supported_ops |
build |
build |
looks like you didn't restore test_hash_grpby_sum_count_action exactly the same as before
|
build |
premerge is stuck. Will kick it off again |
build |
This PR brings a previous change which was reverted due to a lack of support for sorting lists column. More info about the reverted PR can be found here
With rapidsai/cudf#5890 and rapidsai/cudf#11129 merged, we can now support this feature.
fixes #6680