Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for arrays in hashaggregate [databricks] #7465

Merged
merged 18 commits into from
Jul 13, 2023

Conversation

razajafri
Copy link
Collaborator

This PR brings a previous change which was reverted due to a lack of support for sorting lists column. More info about the reverted PR can be found here

With rapidsai/cudf#5890 and rapidsai/cudf#11129 merged, we can now support this feature.

fixes #6680

…VIDIA#6066)" (NVIDIA#6679)"

This reverts commit c05ac2d.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri razajafri self-assigned this Jan 6, 2023
@razajafri
Copy link
Collaborator Author

build

@sameerz sameerz added the feature request New feature or request label Jan 6, 2023
@sameerz sameerz linked an issue Jan 6, 2023 that may be closed by this pull request
@razajafri razajafri requested a review from gerashegalov January 9, 2023 18:41
@ttnghia
Copy link
Collaborator

ttnghia commented Jan 10, 2023

Please hold off a bit. There is a bug in cudf that incorrectly sort arrays (rapidsai/cudf#12298). We should better wait for it to be fixed first.

@jlowe jlowe marked this pull request as draft January 12, 2023 21:18
@jlowe
Copy link
Contributor

jlowe commented Jan 12, 2023

There is a bug in cudf that incorrectly sort arrays

Agree, marking this draft until that is resolved.

@jlowe jlowe added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Jan 12, 2023
@divyegala
Copy link

@razajafri we have fixed list sorting and it has been merged rapidsai/cudf#12538

…gg-array

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Copy link
Collaborator Author

build

@@ -559,7 +559,7 @@ case class GpuBasicMin(child: Expression) extends GpuMin(child)
*/
case class GpuFloatMin(child: Expression) extends GpuMin(child)
with GpuReplaceWindowFunction {

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure to cleanup all files.

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had one minor thing, but I am actually very concerned about putting this into 23.02. In order for sorting of arrays to work we need to ensure that there are no non-empty nulls. CUDF is working through fixing this for their code, and we have not really finished this either. I am fine with putting this in, but we have to have a way to fix up the input in the worst case. There are CUDF APIs to check if the input is bad and fix it up as needed, but we don't have JNI versions of those APIs yet to use in 23.02. Ideally if we were running tests we would throw an exception if we saw that the input was bad, but if we were not running tests we would just check to see if we might need to fix it up and if we do, then we would fix it. This is a data corruption issue.

We might be able to put the changes we need in spark-rapids-jni in the short term for 23.02. But I don't want to put this in without it.

})
)
if (arrayWithStructsHashing) {
willNotWorkOnGpu("hashing arrays with structs is not supported")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a follow on issue to fix this? Have we tested that this does not work?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sameerz is hashing function for Array[Struct] supported in cudf or is there an issue tracking that?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In cudf, lists of structs and structs of lists are not yet supported (tracking issue rapidsai/cudf#11222).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also spark-rapid tracking issue: #5109

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ttnghia, the cudf issue you have tagged rapidsai/cudf#11222 seems to be for sorting a list of structs, is that the same for hashing? I tested passing a list of structs as groupBy key and cudf didn't like it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand what you are saying correctly, I am aware that we don't have any control in the plugin to pick which aggregation to use, cudf will pick Hash or Sort.
But the line this comment is referencing is explicitly talking about HashPartitioning so I think this very explicitly only concerns that. To prove this, I uncommented this check and tried to do a groupBy on a List[Struct] and got an error from cudf saying that murmur hash is not implemented for List[Struct]. This is why I was confused why issue rapidsai/cudf#11222, which is completed, is tagged here when it deals with sorting a list and clearly doesn't fix the problem with groupBy on List[Struct]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we have to answer the question about what does HashPartitioning do for groupby?

Copy link
Collaborator Author

@razajafri razajafri Jul 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If groupBy uses HashAggregate then it will use HashPartitioning to create buckets with buffers pointing to values. So in this case, it will try to calculate the hash for List[Struct] which isn't supported atm by cudf.

scala> df.groupBy("_1","_2").count().explain
== Physical Plan ==
*(2) HashAggregate(keys=[_1#3, _2#4], functions=[count(1)])
+- Exchange hashpartitioning(_1#3, _2#4, 200), ENSURE_REQUIREMENTS, [id=#55]
   +- *(1) HashAggregate(keys=[_1#3, _2#4], functions=[partial_count(1)])
      +- *(1) LocalTableScan [_1#3, _2#4]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've filed an issue for it: #8676

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Is there a corresponding cudf issue?

@divyegala
Copy link

@revans2 why do we need to ensure there are no non empty NULLs to sort arrays? We solved this in libcudf we the most recent PR I linked, didn't we? Are you seeing any more bugs since then?

@revans2
Copy link
Collaborator

revans2 commented Jan 30, 2023

@divyegala we should sync up because I thought that the latest commits fixed null ordering and did some to mitigate non-empty nulls. But I thought that the sort still had issues if it saw rows with non-empty nulls in them. I know that we still have some issues with producing bad data. Also I thought that the plan was to move to not producing bad data instead of fixing it up after it was produced. If that is true then I still would like to have guard rails in place, at least when we are running tests to verify that we are not doing something wrong.

@revans2
Copy link
Collaborator

revans2 commented Jan 30, 2023

@divyegala and I spoke and it was a small misunderstanding on non-empty nulls. This needs to wait until we can get asserts/fixup in place.

@razajafri
Copy link
Collaborator Author

@divyegala and I spoke and it was a small misunderstanding on non-empty nulls. This needs to wait until we can get asserts/fixup in place.

What asserts/fixups are we talking about? Is there already a PR for them or an issue that we can use to depend this upon?

@razajafri razajafri marked this pull request as draft February 2, 2023 18:29
@razajafri razajafri changed the base branch from branch-23.02 to branch-23.04 February 7, 2023 18:41
@sameerz
Copy link
Collaborator

sameerz commented Apr 14, 2023

@razajafri please retarget to 23.06.

@razajafri razajafri changed the base branch from branch-23.04 to branch-23.06 April 14, 2023 13:22
@razajafri razajafri force-pushed the SP-6680-hashagg-array branch from ebfa339 to 12e5164 Compare June 22, 2023 00:49
@razajafri
Copy link
Collaborator Author

build

@razajafri razajafri marked this pull request as ready for review July 6, 2023 00:21
@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

CI failing because groupby on lists for most of the types won't fallback to the CPU now. Will post an update shortly

@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

Regenerated supported_ops

@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

@ttnghia @revans2 PTAL

@razajafri
Copy link
Collaborator Author

build

@revans2
Copy link
Collaborator

revans2 commented Jul 12, 2023

2023-07-12T02:46:29.0192503Z [2023-07-12T02:45:08.716Z] �[36m�[1m=========================== short test summary info ============================�[0m
2023-07-12T02:46:29.0193090Z [2023-07-12T02:45:08.716Z] �[31mFAILED�[0m ../../src/main/python/hash_aggregate_test.py::�[1mtest_hash_grpby_sum_count_action[('a', Long)][INJECT_OOM]�[0m - AttributeError: 'tuple' object has no attribute 'nullable'
2023-07-12T02:46:29.0193632Z [2023-07-12T02:45:08.716Z] �[31mFAILED�[0m ../../src/main/python/hash_aggregate_test.py::�[1mtest_hash_grpby_sum_count_action[('b', Integer)][INJECT_OOM]�[0m - AttributeError: 'tuple' object has no attribute 'nullable'
2023-07-12T02:46:29.0194138Z [2023-07-12T02:45:08.716Z] �[31mFAILED�[0m ../../src/main/python/hash_aggregate_test.py::�[1mtest_hash_grpby_sum_count_action[('c', Long)]�[0m - AttributeError: 'tuple' object has no attribute 'nullable'

looks like you didn't restore test_hash_grpby_sum_count_action exactly the same as before

-  pytest.mark.parametrize('data_gen', [_longs_with_nulls], ids=idfn)
+ @pytest.mark.parametrize('data_gen', _longs_with_nulls, ids=idfn)

@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

premerge is stuck. Will kick it off again

@razajafri
Copy link
Collaborator Author

build

@razajafri razajafri merged commit e7817e4 into NVIDIA:branch-23.08 Jul 13, 2023
@razajafri razajafri deleted the SP-6680-hashagg-array branch July 13, 2023 05:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support hashaggregate for Array[Any] [FEA] Hash partitioning on ArrayType
6 participants