[BUG] AQE Crashing Spark RAPIDS when using filter() and union() #4216

Closed
laatopi opened this issue Nov 25, 2021 · 4 comments · Fixed by #4257

laatopi commented Nov 25, 2021

Hey,

the following code crashes Spark RAPIDS.
The output of the job, including the error messages, looks like this.
I believe this is a bug, since the same code works fine with RAPIDS disabled.

An example input file that reproduces the bug can be found here.
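
Since the code itself is only linked rather than inlined, here is a hedged sketch of the failing pattern (a filter() feeding a union() under AQE). The column names "left" and "mid" and the aggregate at the end are assumptions taken from the workaround snippets and the stack trace later in this thread, not the actual reproduction code.

  // Hypothetical sketch only; the real reproduction is in the linked files.
  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  object AqeFilterUnionRepro {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("aqe-filter-union-repro").getOrCreate()

      // Hypothetical input path; the linked example input would be read here.
      val bDF = spark.read.parquet(args(0))
      // The filter on "line 16" that the workarounds below refer to.
      val cDF = bDF.filter(col("left") === 0).select(col("mid"))
      // Union with another projection of the same input, then aggregate;
      // the aggregate is assumed from the GpuHashAggregateExec frame in the trace below.
      cDF.union(bDF.select(col("mid")))
        .groupBy(col("mid"))
        .count()
        .collect()

      spark.stop()
    }
  }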

The following are used:

- Standalone single node
- Spark 3.2.0
- RAPIDS v21.10.0
- NVIDIA Tesla V100 SXM2 32 GB

The configurations used to launch the job were as follows:

--conf spark.executor.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR}
--conf spark.driver.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR}        
--conf spark.rapids.sql.concurrentGpuTasks=3        
--driver-memory 24G     
--conf spark.executor.memory=250G        
--conf spark.executor.cores=5        
--conf spark.task.cpus=1        
--conf spark.executor.resource.gpu.amount=1        
--conf spark.task.resource.gpu.amount=1        
--conf spark.rapids.memory.pinnedPool.size=4G        
--conf spark.locality.wait=0s        
--conf spark.sql.files.maxPartitionBytes=512m        
--conf spark.plugins=com.nvidia.spark.SQLPlugin        
--conf spark.sql.shuffle.partitions=32        

Important notes regarding the crash
The crash can be avoided, resulting in a successful job, by making any one of the following changes to the code or configuration:

  • Disabling AQE by adding the following to the configuration:
--conf spark.sql.adaptive.enabled=false

  • Removing the filter operation (although this does not give the expected output) from line 16 of the code:
val cDF = bDF.select(col("mid"))
  • Adding a cache operation to line 16 of the code:
val cDF = bDF.filter(col("left") === 0).select(col("mid")).cache()

There may be other changes that result in a successful job; these are just the ones I personally found that affected whether the job succeeds or fails.
The code also works as expected with RAPIDS disabled, resulting in a successful job without any of the changes mentioned above.
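
As an aside, the AQE workaround can also be applied at runtime from session code rather than on the submit command line; a minimal sketch, assuming a live SparkSession named spark:

  // Runtime equivalent of passing --conf spark.sql.adaptive.enabled=false.
  spark.conf.set("spark.sql.adaptive.enabled", "false")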

Feel free to ask for more details in case I missed something.
Thanks!

laatopi added the "? - Needs Triage" and "bug" labels on Nov 25, 2021
tgravescs (Collaborator) commented:

RAPIDS v21.08.0 does not support Spark 3.2.0.

However, running the code sample with the latest version results in:

java.lang.UnsupportedOperationException
  at com.nvidia.spark.rapids.shims.v2.GpuShuffleExchangeExec.getShuffleRDD(GpuShuffleExchangeExec.scala:45)
  at org.apache.spark.sql.execution.adaptive.AQEShuffleReadExec.shuffleRDD$lzycompute(AQEShuffleReadExec.scala:247)
  at org.apache.spark.sql.execution.adaptive.AQEShuffleReadExec.shuffleRDD(AQEShuffleReadExec.scala:243)
  at org.apache.spark.sql.execution.adaptive.AQEShuffleReadExec.doExecuteColumnar(AQEShuffleReadExec.scala:258)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:211)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:222)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:219)
  at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:207)
  at com.nvidia.spark.rapids.HostColumnarToGpu.doExecuteColumnar(HostColumnarToGpu.scala:447)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:211)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:222)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:219)
  at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:207)
  at com.nvidia.spark.rapids.GpuHashAggregateExec.doExecuteColumnar(aggregate.scala:1443)

We will have to investigate further.

tgravescs added the "P0" (Must have for release) label on Nov 29, 2021
revans2 (Collaborator) commented Nov 29, 2021

It looks like the change went in as part of #1206, but it was not mentioned anywhere in the PR or the review. It appears to be a missed requirement.

The good news is that it looks like it should not be hard to fix the bug; I think we have everything we need to do it.

The Spark code simply returns:

  override def getShuffleRDD(partitionSpecs: Array[ShufflePartitionSpec]): RDD[InternalRow] = {
    new ShuffledRowRDD(shuffleDependency, readMetrics, partitionSpecs)
  }

It looks like our own ShuffledBatchRDD can also take a partitionSpecs parameter, so we should be able to make this work; a rough sketch of that direction is below.
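
The following is a hedged sketch of the shape of the fix, not the actual patch that landed in #4257. It assumes ShuffledBatchRDD's constructor mirrors the ShuffledRowRDD call above; the names shuffleDependency and readMetrics are placeholders copied from the Spark snippet.

  // Hypothetical sketch only; see #4257 for the real change.
  override def getShuffleRDD(partitionSpecs: Array[ShufflePartitionSpec]): RDD[_] = {
    // Return a columnar shuffle RDD restricted to the given partition specs
    // instead of throwing UnsupportedOperationException.
    new ShuffledBatchRDD(shuffleDependency, readMetrics, partitionSpecs)
  }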

laatopi (Author) commented Nov 29, 2021

> RAPIDS v21.08.0 does not support Spark 3.2.0.

Ah, my bad, this was a typo on my part. The version I used was indeed v21.10.0. I have edited the original post to the correct version.

jlowe (Member) commented Dec 2, 2021

Fixed by #4257

jlowe closed this as completed on Dec 2, 2021