Add support for arrays in hashaggregate [databricks] #7465

razajafri · 2023-01-06T02:55:47Z

This PR brings back a previous change that was reverted due to a lack of support for sorting list columns. More info about the reverted PR can be found here

With rapidsai/cudf#5890 and rapidsai/cudf#11129 merged, we can now support this feature.

fixes #6680

…VIDIA#6066)" (NVIDIA#6679)" This reverts commit c05ac2d. Signed-off-by: Raza Jafri <rjafri@nvidia.com>

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

razajafri · 2023-01-06T02:55:58Z

build

ttnghia · 2023-01-10T00:27:46Z

Please hold off a bit. There is a bug in cudf that sorts arrays incorrectly (rapidsai/cudf#12298). We had better wait for it to be fixed first.

jlowe · 2023-01-12T21:18:09Z

> There is a bug in cudf that incorrectly sort arrays

Agree, marking this draft until that is resolved.

divyegala · 2023-01-20T21:59:15Z

@razajafri we have fixed list sorting, and the fix has been merged in rapidsai/cudf#12538

…gg-array Signed-off-by: Raza Jafri <rjafri@nvidia.com>

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

razajafri · 2023-01-25T01:23:42Z

build

ttnghia · 2023-01-30T14:59:58Z

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/AggregateFunctions.scala

@@ -559,7 +559,7 @@ case class GpuBasicMin(child: Expression) extends GpuMin(child)
 */
 case class GpuFloatMin(child: Expression) extends GpuMin(child)
  with GpuReplaceWindowFunction {
-
+  


Please make sure to clean up all files.

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/AggregateFunctions.scala

revans2

I had one minor thing, but I am actually very concerned about putting this into 23.02. In order for sorting of arrays to work we need to ensure that there are no non-empty nulls. CUDF is working through fixing this for their code, and we have not really finished this either. I am fine with putting this in, but we have to have a way to fix up the input in the worst case. There are CUDF APIs to check if the input is bad and fix it up as needed, but we don't have JNI versions of those APIs yet to use in 23.02. Ideally if we were running tests we would throw an exception if we saw that the input was bad, but if we were not running tests we would just check to see if we might need to fix it up and if we do, then we would fix it. This is a data corruption issue.

We might be able to put the changes we need in spark-rapids-jni in the short term for 23.02. But I don't want to put this in without it.
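To make the concern concrete, here is a minimal pure-Python sketch of what a non-empty null is and of the "throw when testing, fix up otherwise" behaviour described above. It assumes an Arrow-style list layout (offsets, a validity flag per row, and a flat child buffer). The function names and the `testing` flag are hypothetical and are not the cudf or JNI APIs.

```python
def has_nonempty_nulls(offsets, validity):
    """Return True if any null row still spans one or more child elements."""
    return any(
        not valid and offsets[i + 1] > offsets[i]
        for i, valid in enumerate(validity)
    )


def purge_nonempty_nulls(offsets, validity, child):
    """Rebuild offsets and child so that every null row has zero length."""
    new_offsets, new_child = [0], []
    for i, valid in enumerate(validity):
        if valid:
            new_child.extend(child[offsets[i]:offsets[i + 1]])
        new_offsets.append(len(new_child))
    return new_offsets, new_child


def sanitize(offsets, validity, child, testing=False):
    """Hypothetical guard rail: fail loudly under test, repair otherwise."""
    if not has_nonempty_nulls(offsets, validity):
        return offsets, child
    if testing:
        raise ValueError("list column contains non-empty nulls")
    return purge_nonempty_nulls(offsets, validity, child)


# Rows: [1, 2], null (but its offsets still cover two stale elements), [3]
offsets = [0, 2, 4, 5]
validity = [True, False, True]
child = [1, 2, 9, 9, 3]

clean_offsets, clean_child = sanitize(offsets, validity, child)
```

In this sketch the null row keeps its place in the column but no longer points at stale child data, so a sort that compares child elements would not see the leftover values.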

revans2 · 2023-01-30T15:07:59Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala

+              })
+          )
+          if (arrayWithStructsHashing) {
+            willNotWorkOnGpu("hashing arrays with structs is not supported")


Is there a follow on issue to fix this? Have we tested that this does not work?

@sameerz is a hashing function for Array[Struct] supported in cudf, or is there an issue tracking that?

In cudf, lists of structs and structs of lists are not yet supported (tracking issue rapidsai/cudf#11222).

Also, the spark-rapids tracking issue: #5109
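For readers following the fallback check in the hunk above, this is a rough Python sketch of how a recursive "array containing a struct" test over a nested type could look. The `DType` class and both helper names are hypothetical, and the actual condition computed in GpuOverrides.scala may differ.

```python
class DType:
    """Hypothetical, simplified type tree: kind is 'array', 'struct', or a primitive name."""

    def __init__(self, kind, children=()):
        self.kind = kind
        self.children = list(children)


def contains_kind(dtype, kind):
    """True if dtype itself, or anything nested inside it, has the given kind."""
    return dtype.kind == kind or any(contains_kind(c, kind) for c in dtype.children)


def array_with_struct(dtype):
    """True if any array in the tree has a struct somewhere beneath it."""
    if dtype.kind == "array" and any(contains_kind(c, "struct") for c in dtype.children):
        return True
    return any(array_with_struct(c) for c in dtype.children)


int_type = DType("int")
array_of_int = DType("array", [int_type])
array_of_struct = DType("array", [DType("struct", [int_type])])
struct_of_array = DType("struct", [array_of_int])
nested_array_of_struct = DType("array", [array_of_struct])
```

Under these assumptions only `array_of_struct` and `nested_array_of_struct` would be flagged for CPU fallback; a struct that merely contains an array of primitives would not.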

@ttnghia, the cudf issue you have tagged rapidsai/cudf#11222 seems to be for sorting a list of structs, is that the same for hashing? I tested passing a list of structs as groupBy key and cudf didn't like it.

If I understand what you are saying correctly, I am aware that we don't have any control in the plugin to pick which aggregation to use, cudf will pick Hash or Sort.
But the line this comment is referencing is explicitly talking about HashPartitioning, so I think this very explicitly only concerns that. To prove this, I uncommented this check and tried to do a groupBy on a List[Struct] and got an error from cudf saying that murmur hash is not implemented for List[Struct]. This is why I was confused that issue rapidsai/cudf#11222, which is completed, is tagged here when it deals with sorting a list and clearly doesn't fix the problem with groupBy on List[Struct].

Then we have to answer the question: what does HashPartitioning do for groupBy?

If groupBy uses HashAggregate then it will use HashPartitioning to create buckets with buffers pointing to values. So in this case, it will try to calculate the hash for List[Struct], which isn't supported at the moment by cudf.

scala> df.groupBy("_1","_2").count().explain
== Physical Plan ==
*(2) HashAggregate(keys=[_1#3, _2#4], functions=[count(1)])
+- Exchange hashpartitioning(_1#3, _2#4, 200), ENSURE_REQUIREMENTS, [id=#55]
   +- *(1) HashAggregate(keys=[_1#3, _2#4], functions=[partial_count(1)])
      +- *(1) LocalTableScan [_1#3, _2#4]
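The three stages in that plan (partial aggregate, hash-partitioned exchange, final aggregate) can be sketched in plain Python as follows. This is only an illustration: CRC32 over `repr` stands in for the real hash function, array keys are modelled as tuples so they are hashable, and all function names are hypothetical.

```python
import zlib
from collections import Counter


def partition_id(key, num_partitions):
    """Map a grouping key to a bucket. Stand-in hash: CRC32 of repr, not Murmur3."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def partial_count(rows):
    """Stage 1: count rows per key inside a single input partition."""
    return Counter(rows)


def exchange(partials, num_partitions):
    """Stage 2: route each (key, partial count) pair to the bucket chosen by its hash."""
    buckets = [[] for _ in range(num_partitions)]
    for partial in partials:
        for key, count in partial.items():
            buckets[partition_id(key, num_partitions)].append((key, count))
    return buckets


def final_count(bucket):
    """Stage 3: merge the partial counts that landed in one bucket."""
    total = Counter()
    for key, count in bucket:
        total[key] += count
    return total


# Each row is the grouping key (_1, _2); _1 is an array modelled as a tuple.
input_partitions = [
    [((1, 2), "x"), ((3,), "y"), ((1, 2), "x")],
    [((1, 2), "x"), ((3,), "y")],
]

partials = [partial_count(p) for p in input_partitions]
buckets = exchange(partials, num_partitions=4)

result = {}
for bucket in buckets:
    result.update(final_count(bucket))
```

Because equal keys always hash to the same bucket, each key is finalized in exactly one place. That is why the exchange needs a hash implementation for every key type, including arrays.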

I've filed an issue for it: #8676

Thanks. Is there a corresponding cudf issue?

divyegala · 2023-01-30T17:05:02Z

@revans2 why do we need to ensure there are no non-empty nulls to sort arrays? We solved this in libcudf with the most recent PR I linked, didn't we? Are you seeing any more bugs since then?

revans2 · 2023-01-30T18:59:22Z

@divyegala we should sync up because I thought that the latest commits fixed null ordering and did some work to mitigate non-empty nulls. But I thought that the sort still had issues if it saw rows with non-empty nulls in them. I know that we still have some issues with producing bad data. Also I thought that the plan was to move to not producing bad data instead of fixing it up after it was produced. If that is true then I still would like to have guard rails in place, at least when we are running tests, to verify that we are not doing something wrong.

revans2 · 2023-01-30T20:34:18Z

@divyegala and I spoke and it was a small misunderstanding on non-empty nulls. This needs to wait until we can get asserts/fixup in place.

razajafri · 2023-02-02T18:28:44Z

> @divyegala and I spoke and it was a small misunderstanding on non-empty nulls. This needs to wait until we can get asserts/fixup in place.

What asserts/fixups are we talking about? Is there already a PR for them, or an issue that this PR can depend on?

sameerz · 2023-04-14T04:14:41Z

@razajafri please retarget to 23.06.

razajafri · 2023-07-01T00:21:10Z

build

razajafri · 2023-07-06T00:47:13Z

build

razajafri · 2023-07-06T17:44:29Z

CI is failing because groupby on lists of most types no longer falls back to the CPU. Will post an update shortly.

razajafri · 2023-07-06T18:18:20Z

build

razajafri · 2023-07-06T20:12:15Z

Regenerated supported_ops

razajafri · 2023-07-06T20:12:19Z

build

razajafri · 2023-07-07T00:02:29Z

@ttnghia @revans2 PTAL

integration_tests/src/main/python/hash_aggregate_test.py

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala

integration_tests/src/main/python/hash_aggregate_test.py

razajafri · 2023-07-12T01:12:58Z

build

revans2 · 2023-07-12T13:35:55Z

2023-07-12T02:46:29.0192503Z [2023-07-12T02:45:08.716Z] =========================== short test summary info ============================
2023-07-12T02:46:29.0193090Z [2023-07-12T02:45:08.716Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_hash_grpby_sum_count_action[('a', Long)][INJECT_OOM] - AttributeError: 'tuple' object has no attribute 'nullable'
2023-07-12T02:46:29.0193632Z [2023-07-12T02:45:08.716Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_hash_grpby_sum_count_action[('b', Integer)][INJECT_OOM] - AttributeError: 'tuple' object has no attribute 'nullable'
2023-07-12T02:46:29.0194138Z [2023-07-12T02:45:08.716Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_hash_grpby_sum_count_action[('c', Long)] - AttributeError: 'tuple' object has no attribute 'nullable'

Looks like you didn't restore test_hash_grpby_sum_count_action exactly the same as before:

- @pytest.mark.parametrize('data_gen', [_longs_with_nulls], ids=idfn)
+ @pytest.mark.parametrize('data_gen', _longs_with_nulls, ids=idfn)
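A small pytest-free sketch of why dropping the brackets likely produces the failures above: with a single argument name, parametrize creates one test case per element of the values it is given, so a bare list of (name, generator) pairs becomes three cases that each receive a tuple. `FakeGen`, `parametrize_cases`, and `column_names` are hypothetical stand-ins, and the branching in `column_names` is an assumption about how the test helpers treat their input.

```python
class FakeGen:
    """Hypothetical stand-in for a data generator; only `nullable` matters here."""

    def __init__(self, type_name, nullable=True):
        self.type_name = type_name
        self.nullable = nullable


# Assumed shape of the fixture: ONE spec made of (column name, generator) pairs.
_longs_with_nulls = [("a", FakeGen("Long")), ("b", FakeGen("Integer")), ("c", FakeGen("Long"))]


def parametrize_cases(argvalues):
    """Mimic parametrize with a single argname: one test case per element."""
    return list(argvalues)


def column_names(data_gen):
    """Assumed helper behaviour: a list is a full spec, anything else is one generator."""
    if isinstance(data_gen, list):
        return [name for name, _ in data_gen]
    assert not data_gen.nullable  # a bare tuple has no such attribute
    return [data_gen.type_name]


def outcome(case):
    try:
        return column_names(case)
    except AttributeError as e:
        return f"AttributeError: {e}"


wrapped_results = [outcome(c) for c in parametrize_cases([_longs_with_nulls])]
bare_results = [outcome(c) for c in parametrize_cases(_longs_with_nulls)]
```

With the brackets there is a single case that receives the whole spec. Without them there are three cases, one per tuple, which lines up with the three failing test IDs in the log.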

razajafri · 2023-07-12T17:39:18Z

build

razajafri · 2023-07-13T00:46:08Z

premerge is stuck. Will kick it off again

razajafri · 2023-07-13T00:46:13Z

build

razajafri added 3 commits January 5, 2023 17:56

Revert "Revert "Add support for arrays in hashaggregate [databricks] (N…

6c6df93

…VIDIA#6066)" (NVIDIA#6679)" This reverts commit c05ac2d. Signed-off-by: Raza Jafri <rjafri@nvidia.com>

Add test for aggregation on array

70e8e2e

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

renamed test

d30d96c

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

razajafri self-assigned this Jan 6, 2023

sameerz added the feature request New feature or request label Jan 6, 2023

sameerz linked an issue Jan 6, 2023 that may be closed by this pull request

[FEA] Hash partitioning on ArrayType #4887

Closed

razajafri requested a review from gerashegalov January 9, 2023 18:41

jlowe marked this pull request as draft January 12, 2023 21:18

jlowe added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Jan 12, 2023

razajafri added 2 commits January 24, 2023 17:07

Merge remote-tracking branch 'origin/branch-23.02' into SP-6680-hasha…

5e04466

…gg-array Signed-off-by: Raza Jafri <rjafri@nvidia.com>

resolve merge conflicts

7bbf970

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

razajafri marked this pull request as ready for review January 25, 2023 19:46

gerashegalov mentioned this pull request Jan 26, 2023

[BUG] test_mod_mixed decimal test fails on 330db (Databricks 11.3) and TBD 340 #7595

Closed

ttnghia reviewed Jan 30, 2023

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/AggregateFunctions.scala (outdated)

ttnghia reviewed Jan 30, 2023

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/AggregateFunctions.scala (outdated)

revans2 requested changes Jan 30, 2023

razajafri marked this pull request as draft February 2, 2023 18:29

razajafri changed the base branch from branch-23.02 to branch-23.04 February 7, 2023 18:41

razajafri changed the base branch from branch-23.04 to branch-23.06 April 14, 2023 13:22

razajafri force-pushed the SP-6680-hashagg-array branch from ebfa339 to 12e5164 Compare June 22, 2023 00:49

Merge remote-tracking branch 'origin/branch-23.08' into HEAD

5098efd

code clean up

7273f30

razajafri marked this pull request as ready for review July 6, 2023 00:21

added more functions to the agg test

6733b42

update fallback test

b7f3da2

update supported_ops

c48837b

revans2 reviewed Jul 11, 2023

addressed review comments

56724eb

revans2 reviewed Jul 11, 2023

integration_tests/src/main/python/hash_aggregate_test.py (outdated)

integration_tests/src/main/python/hash_aggregate_test.py

integration_tests/src/main/python/hash_aggregate_test.py (outdated)

razajafri added 3 commits July 11, 2023 17:24

Fixed tests and added tests for hash partitioning

49455d4

added docs

d81b551

added tools docs

59fb0e7

reverted the unnecessary test method change

9a27733

revans2 approved these changes Jul 12, 2023

razajafri merged commit e7817e4 into NVIDIA:branch-23.08 Jul 13, 2023

razajafri deleted the SP-6680-hashagg-array branch July 13, 2023 05:31

ttnghia mentioned this pull request Jul 13, 2023

Support nested arrays for min/max aggregations in groupby and reduction #8689

Merged
