
Integrate with kudo #11724

Open · liurenjie1024 wants to merge 5 commits into branch-24.12
Conversation

liurenjie1024 (Collaborator):

This PR integrates the kudo serialization format into spark-rapids; for the epic issue, see #11590.

liurenjie1024 (Collaborator, Author):

Currently blocked by NVIDIA/spark-rapids-jni#2596, but it's ready for review.

Signed-off-by: liurenjie1024 <liurenjie2008@gmail.com>
.internal()
.startupOnly()
.booleanConf
.createWithDefault(true)
liurenjie1024 (Collaborator, Author):

We enable this by default so that it's easier to run integration tests for now; we will revert it to false before merging.

Member:

We need a way to continue testing this, and the JCudfSerialization approach, while they continue to coexist. Just testing during premerge isn't enough.

Note that we don't have to run the entire test suite for both shuffle approaches. IMHO taking the tests in repart_test and adding a "with kudo" dimension would suffice, and/or we could do the same tactic we use for the RAPIDS caching shuffle where we run the mortgage test queries with that shuffle type enabled.
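The "with kudo dimension" idea above can be sketched abstractly: run every test case once per serializer mode while the two shuffle approaches coexist. This is a hypothetical illustration in plain Java, not the actual spark-rapids integration-test machinery (which is pytest-based):

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: enumerate both shuffle serializer modes and run
// each test case under each one. Mode names and the harness shape are
// illustrative assumptions, not real spark-rapids APIs.
public class SerializerDimension {
    enum Mode { JCUDF, KUDO }

    // Runs every test case once per serializer mode; returns the total
    // number of (case, mode) executions performed.
    static int runUnderAllModes(List<Consumer<Mode>> cases) {
        int runs = 0;
        for (Mode mode : Mode.values()) {
            for (Consumer<Mode> testCase : cases) {
                testCase.accept(mode);
                runs++;
            }
        }
        return runs;
    }
}
```

The same shape covers the alternative tactic mentioned: a fixed query suite (e.g. the mortgage queries) is just another list of cases run under the kudo mode.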

liurenjie1024 (Collaborator, Author):

I'm planning to add tests for join_test, hash_aggregate_test, and repart_test. They helped identify many bugs in early testing; I will do this before merging.

private static void visit(DataType[] dataTypes, Schema.Builder builder, int level) {
  for (int idx = 0; idx < dataTypes.length; idx++) {
    DataType dt = dataTypes[idx];
    String name = "_col_" + level + "_" + idx;
Member:

It's annoying that we need to spend any time building unique column names when they are never used. We may want to consider allowing Schema to build a "data type only" schema (or have a separate class for that) that doesn't have any column names anywhere, which matches the case we're using it for here.
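The naming overhead being discussed is just the synthetic placeholder scheme from the visit() snippet, extracted here as a standalone sketch for clarity (the helper name is hypothetical):

```java
// Standalone sketch of the placeholder-name scheme used in visit() above:
// every column gets a synthetic "_col_<level>_<idx>" name purely to satisfy
// the Schema builder; the names are never read back afterwards.
public class PlaceholderNames {
    static String placeholderName(int level, int idx) {
        return "_col_" + level + "_" + idx;
    }
}
```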

liurenjie1024 (Collaborator, Author):

There are two approaches here:

  1. We allow Schema to accept empty field names.
  2. We make a new class that mimics the behavior of Spark's DataType system.

I lean towards option 2 since it would be cleaner. What do you think?

Member:

Personally I'd prefer 1, especially since cudf doesn't care about column names in most contexts. cc: @revans2 and @abellina who may have a strong opinion on this.
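Option 1 can be sketched as a builder that simply tolerates empty column names, so positional serialization paths need not invent unique placeholders. The class below is a hypothetical stand-in, not the actual cudf Schema API:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of option 1: a schema builder that accepts empty column
// names, since cudf doesn't care about names in most contexts. The
// TypeOnlyBuilder class and its methods are assumptions for illustration.
public class TypeOnlyBuilder {
    private final List<String> names = new ArrayList<>();
    private final List<String> typeIds = new ArrayList<>();

    TypeOnlyBuilder column(String typeId) {
        names.add("");       // empty name is accepted; only the type matters
        typeIds.add(typeId);
        return this;
    }

    int numColumns() {
        return typeIds.size();
    }
}
```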

@sameerz added the performance (A performance related task/issue) label on Nov 16, 2024

liurenjie1024 commented Nov 18, 2024

Currently, hash_aggregate_test, join_test, and repart_test pass in my local dev environment.

The build is broken while waiting for NVIDIA/spark-rapids-jni#2601 to be merged.

@@ -194,13 +205,80 @@ class JCudfTableOperator extends SerializedTableOperator[SerializedTableColumn]
}
}

case class KudoHostMergeResultWrapper(inner: KudoHostMergeResult,
                                      dataSize: Long) extends CoalescedHostResult {
Member:

Should dataSize be a constructor argument? I'm not sure there's a case where we want getDataSize to return a value different from inner.getDataLength, so it's safer not to give the caller the chance to get this wrong.
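The suggestion can be sketched as deriving the size from the wrapped result, so the two values can never disagree. The sketch below is Java for illustration, with simplified stand-in types rather than the real KudoHostMergeResult / CoalescedHostResult interfaces:

```java
// Sketch of the review suggestion: report the size of the wrapped merge
// result directly instead of taking it as a constructor argument. The
// MergeResult interface is a simplified stand-in, not the actual API.
public class MergeResultWrapper {
    interface MergeResult {
        long getDataLength();
    }

    private final MergeResult inner;

    MergeResultWrapper(MergeResult inner) {
        this.inner = inner;
    }

    long getDataSize() {
        return inner.getDataLength();  // always consistent with inner
    }
}
```

Removing the redundant parameter shrinks the constructor's surface and eliminates a class of bugs where the cached size drifts from the wrapped result's actual length.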

Labels: performance (A performance related task/issue)
3 participants