Integrate with kudo #11724
base: branch-24.12
Conversation
Currently blocked by NVIDIA/spark-rapids-jni#2596, but it's ready for review.
Signed-off-by: liurenjie1024 <liurenjie2008@gmail.com>
Force-pushed from 54b8dad to dae6f4d.
    .internal()
    .startupOnly()
    .booleanConf
    .createWithDefault(true)
We enable this by default for now so that it's easier to do integration tests; we will revert it to false before merging.
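For context, the snippet above is the tail of a RapidsConf entry. A minimal sketch of what the full definition might look like is below; the conf key, val name, and doc string are assumptions for illustration, not copied from this diff:

    val SHUFFLE_KUDO_SERIALIZER_ENABLED = conf("spark.rapids.shuffle.kudo.serializer.enabled")
      .doc("Enable the kudo serialization format for shuffle while it coexists " +
        "with JCudfSerialization.")
      .internal()
      .startupOnly()
      .booleanConf
      .createWithDefault(true) // per the comment above, to be flipped back to false before merging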
We need a way to continue testing this, and the JCudfSerialization approach, while they continue to coexist. Just testing during premerge isn't enough.
Note that we don't have to run the entire test suite for both shuffle approaches. IMHO taking the tests in repart_test and adding a "with kudo" dimension would suffice, and/or we could use the same tactic we use for the RAPIDS caching shuffle, where we run the mortgage test queries with that shuffle type enabled.
I'm planning to add tests for join_test, hash_aggregate_test, and repart_test. They helped to identify many bugs in early testing; I will do this before merging.
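A rough sketch of how a "with kudo" dimension could be wired up in ScalaTest is below; the conf key and suite shape are illustrative assumptions, not the final API:

    import org.apache.spark.SparkConf
    import org.scalatest.funsuite.AnyFunSuite

    class KudoShuffleDimensionSuite extends AnyFunSuite {
      // Run the same assertions once per shuffle serialization approach.
      for (kudoEnabled <- Seq(true, false)) {
        test(s"repartition round trip (kudo=$kudoEnabled)") {
          val conf = new SparkConf()
            .set("spark.rapids.shuffle.kudo.serializer.enabled", kudoEnabled.toString)
          // build a Spark session from `conf` and run the repart_test-style checks here
        }
      }
    }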
(Two resolved review threads on sql-plugin/src/main/java/com/nvidia/spark/rapids/GpuColumnVector.java, now outdated.)
private static void visit(DataType[] dataTypes, Schema.Builder builder, int level) {
  for (int idx = 0; idx < dataTypes.length; idx++) {
    DataType dt = dataTypes[idx];
    // Generate a unique placeholder name, since the Schema builder requires one.
    String name = "_col_" + level + "_" + idx;
It's annoying that we need to spend any time building unique column names when they are never used. We may want to consider allowing Schema to build a "data type only" schema (or have a separate class for that) that doesn't have any column names anywhere, which matches the case we're using it for here.
There are two approaches here:
- We allow Schema to accept empty field names.
- We make a new class which mimics the behavior of Spark's DataType system.
I lean towards option 2 since it would be cleaner. What do you think?
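A minimal sketch of what option 2 could look like, with all names hypothetical: a type tree that mirrors Spark's DataType but carries no field names, so placeholder names like "_col_<level>_<idx>" are never needed.

    object KudoTypeSketch {
      sealed trait KudoDataType
      case object KudoIntType extends KudoDataType
      case object KudoStringType extends KudoDataType
      final case class KudoListType(element: KudoDataType) extends KudoDataType
      final case class KudoStructType(children: Seq[KudoDataType]) extends KudoDataType

      // Example: struct<int, list<string>>, expressed without any column names.
      val example: KudoDataType =
        KudoStructType(Seq(KudoIntType, KudoListType(KudoStringType)))
    }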
(Three resolved review threads on sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuColumnarBatchSerializer.scala and one on sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffleCoalesceExec.scala, now outdated.)
(Resolved review threads on ...c/main/spark330db/scala/org/apache/spark/sql/rapids/execution/GpuBroadcastHashJoinExec.scala and tests/src/test/scala/com/nvidia/spark/rapids/GpuShuffledHashJoinExecSuite.scala, now outdated.)
Currently, in my local dev env, the build breaks while waiting for NVIDIA/spark-rapids-jni#2601 to be merged.
@@ -194,13 +205,80 @@ class JCudfTableOperator extends SerializedTableOperator[SerializedTableColumn]
  }
}

case class KudoHostMergeResultWrapper(inner: KudoHostMergeResult,
    dataSize: Long) extends CoalescedHostResult {
Should this be a constructor argument? I'm not sure there's a case where we want getDataSize to return a value different than inner.getDataLength, and therefore it's safer to not give the caller the chance to screw this up.
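A sketch of that suggestion, assuming KudoHostMergeResult exposes getDataLength as the comment implies:

    // Derive the size from the wrapped result instead of trusting a caller-supplied
    // value, so dataSize can never drift from inner.getDataLength.
    case class KudoHostMergeResultWrapper(inner: KudoHostMergeResult)
        extends CoalescedHostResult {
      override def getDataSize: Long = inner.getDataLength
    }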
This PR introduces integration with the kudo serialization format into spark-rapids; for the epic issue, see #11590.