[BUG] AQE Crashing Spark RAPIDS when using filter() and union() #4216

Closed
laatopi opened this issue Nov 25, 2021 · 4 comments · Fixed by #4257

laatopi commented Nov 25, 2021

Hey,

the following code crashes Spark RAPIDS.
The output of the job, including the error messages, looks like this.
I believe this is a bug, since the same code works fine with RAPIDS disabled.

An example input file that reproduces the bug can be found here.
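
Since the code itself is only linked rather than inlined, here is a hedged sketch of the failing pattern (a filter() feeding a union() under AQE). The column names "left" and "mid" and the aggregate at the end are assumptions taken from the workaround snippets and the stack trace later in this thread, not the actual reproduction code.

  // Hypothetical sketch only; the real reproduction is in the linked files.
  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  object AqeFilterUnionRepro {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("aqe-filter-union-repro").getOrCreate()

      // Hypothetical input path; the linked example input would be read here.
      val bDF = spark.read.parquet(args(0))
      // The filter on "line 16" that the workarounds below refer to.
      val cDF = bDF.filter(col("left") === 0).select(col("mid"))
      // Union with another projection of the same input, then aggregate;
      // the aggregate is assumed from the GpuHashAggregateExec frame in the trace below.
      cDF.union(bDF.select(col("mid")))
        .groupBy(col("mid"))
        .count()
        .collect()

      spark.stop()
    }
  }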

The following are used:

- Standalone single node
- Spark 3.2.0
- RAPIDS v21.10.0
- NVIDIA Tesla V100 SXM2 32 GB

The configurations used to launch the job were as follows:

--conf spark.executor.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR}
--conf spark.driver.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR}        
--conf spark.rapids.sql.concurrentGpuTasks=3        
--driver-memory 24G     
--conf spark.executor.memory=250G        
--conf spark.executor.cores=5        
--conf spark.task.cpus=1        
--conf spark.executor.resource.gpu.amount=1        
--conf spark.task.resource.gpu.amount=1        
--conf spark.rapids.memory.pinnedPool.size=4G        
--conf spark.locality.wait=0s        
--conf spark.sql.files.maxPartitionBytes=512m        
--conf spark.plugins=com.nvidia.spark.SQLPlugin        
--conf spark.sql.shuffle.partitions=32        

Important notes regarding the crash
The crash can be avoided, resulting in a successful job, by making any one of the following changes to the code or configuration:

  • Disabling AQE by adding the following to the configuration:
--conf spark.sql.adaptive.enabled=false

  • Removing the filter operation (although this does not give the expected output) from line 16 of the code:
val cDF = bDF.select(col("mid"))
  • Adding a cache operation to line 16 of the code:
val cDF = bDF.filter(col("left") === 0).select(col("mid")).cache()

There may be other changes that result in a successful job; these are just the ones I personally found that affected whether the job succeeds or fails.
The code also works as expected with RAPIDS disabled, resulting in a successful job without any of the changes mentioned above.
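
As an aside, the AQE workaround can also be applied at runtime from session code rather than on the submit command line; a minimal sketch, assuming a live SparkSession named spark:

  // Runtime equivalent of passing --conf spark.sql.adaptive.enabled=false.
  spark.conf.set("spark.sql.adaptive.enabled", "false")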

Feel free to ask for more details in case I missed something.
Thanks!

laatopi added the "? - Needs Triage" and "bug" labels on Nov 25, 2021
tgravescs (Collaborator) commented:

RAPIDS v21.08.0 does not support Spark 3.2.0.

However, running the code sample with the latest version results in:

java.lang.UnsupportedOperationException
  at com.nvidia.spark.rapids.shims.v2.GpuShuffleExchangeExec.getShuffleRDD(GpuShuffleExchangeExec.scala:45)
  at org.apache.spark.sql.execution.adaptive.AQEShuffleReadExec.shuffleRDD$lzycompute(AQEShuffleReadExec.scala:247)
  at org.apache.spark.sql.execution.adaptive.AQEShuffleReadExec.shuffleRDD(AQEShuffleReadExec.scala:243)
  at org.apache.spark.sql.execution.adaptive.AQEShuffleReadExec.doExecuteColumnar(AQEShuffleReadExec.scala:258)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:211)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:222)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:219)
  at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:207)
  at com.nvidia.spark.rapids.HostColumnarToGpu.doExecuteColumnar(HostColumnarToGpu.scala:447)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:211)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:222)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:219)
  at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:207)
  at com.nvidia.spark.rapids.GpuHashAggregateExec.doExecuteColumnar(aggregate.scala:1443)

We will have to investigate further.

tgravescs added the "P0" (Must have for release) label on Nov 29, 2021
revans2 (Collaborator) commented Nov 29, 2021

It looks like the change went in as part of #1206, but it was not mentioned anywhere in the PR or the review. It appears to be a missed requirement.

The good news is that it looks like it should not be hard to fix the bug; I think we have everything we need to do it.

The Spark code simply returns:

  override def getShuffleRDD(partitionSpecs: Array[ShufflePartitionSpec]): RDD[InternalRow] = {
    new ShuffledRowRDD(shuffleDependency, readMetrics, partitionSpecs)
  }

It looks like our own ShuffledBatchRDD can also take a partitionSpecs parameter, so we should be able to make this work; a rough sketch of that direction is below.
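
The following is a hedged sketch of the shape of the fix, not the actual patch that landed in #4257. It assumes ShuffledBatchRDD's constructor mirrors the ShuffledRowRDD call above; the names shuffleDependency and readMetrics are placeholders copied from the Spark snippet.

  // Hypothetical sketch only; see #4257 for the real change.
  override def getShuffleRDD(partitionSpecs: Array[ShufflePartitionSpec]): RDD[_] = {
    // Return a columnar shuffle RDD restricted to the given partition specs
    // instead of throwing UnsupportedOperationException.
    new ShuffledBatchRDD(shuffleDependency, readMetrics, partitionSpecs)
  }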

laatopi (Author) commented Nov 29, 2021

> RAPIDS v21.08.0 does not support Spark 3.2.0.

Ah, my bad, this was a typo on my part. The version I used was indeed v21.10.0. I have edited the original post to the correct version.

jlowe (Member) commented Dec 2, 2021

Fixed by #4257

jlowe closed this as completed on Dec 2, 2021