With apache/spark#43627 we eliminated the need to add the plugin jar via spark.executor.extraClassPath and paved the way to the simplified Boolean switch useRSM=true/false. Now would be a good time to do this work. At a minimum, we need to fix the NullPointerException resulting from the initialization order change.
Thanks for filing this. I do not know why we got an NPE here. I didn't get one when I tested the Apache issue, so I am now worried that there's a bug somewhere.
Our plugin init code currently assumes that the lazy shuffle manager instance SparkEnv.get.shuffleManager has already been created and set, and uses it to execute some validation and initialization. Now that the order of SM instantiation and Plugin initialization is reversed in 4.0.0, we need to do the validation steps in the executor/driver plugin init without assuming that an instance exists, and set some flag to force eager initialization at SM instantiation time. I think we can write this code without shimming, but worst case we do it with shimming.
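A minimal sketch of the reordering described above, assuming plugin init can validate from configuration alone and leave a flag for the shuffle manager to pick up when it is constructed later. Everything here (`InitOrderSketch`, `FakeShuffleManager`, `pluginInit`, `createShuffleManager`, `eagerInitRequested`) is a hypothetical stand-in for illustration, not the actual Spark or plugin code.

```scala
// Hypothetical model of the 4.0.0 startup order: plugin init runs first,
// the shuffle manager (SM) is instantiated afterwards.
object InitOrderSketch {
  // Set during plugin init, read at SM instantiation time (assumed mechanism).
  @volatile private var eagerInitRequested = false

  // Stand-in for a shuffle manager; not a real Spark class.
  final class FakeShuffleManager(val className: String) {
    var initialized = false
    def initialize(): Unit = { initialized = true }
  }

  // Plugin init: validate using the configuration only, never an SM instance,
  // because no instance exists yet under the new order.
  def pluginInit(conf: Map[String, String], expectedClass: String): Unit = {
    val configured = conf.getOrElse("spark.shuffle.manager", "")
    require(configured == expectedClass,
      s"Expected shuffle manager $expectedClass but found '$configured'")
    eagerInitRequested = true
  }

  // SM instantiation: honor the flag left behind by plugin init.
  def createShuffleManager(conf: Map[String, String]): FakeShuffleManager = {
    val sm = new FakeShuffleManager(conf("spark.shuffle.manager"))
    if (eagerInitRequested) sm.initialize()
    sm
  }
}
```

Calling `pluginInit` before `createShuffleManager` yields an initialized manager, while a manager created without a prior `pluginInit` stays uninitialized. The point of the sketch is only that neither step dereferences an instance that has not been created yet.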
24/12/20 06:59:04 ERROR RapidsExecutorPlugin: Exception in the executor plugin, shutting down!
java.lang.NullPointerException
at org.apache.spark.sql.rapids.GpuShuffleEnv$.initShuffleManager(GpuShuffleEnv.scala:112)
at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:544)
at org.apache.spark.internal.plugin.ExecutorPluginContainer.$anonfun$executorPlugins$1(PluginContainer.scala:127)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.internal.plugin.ExecutorPluginContainer.<init>(PluginContainer.scala:115)
Steps/Code to reproduce bug
Start a local-cluster with RSM
JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64 \
~/dist/spark-4.0.0-preview1-bin-hadoop3/bin/spark-shell \
  --jars scala2.13/dist/target/rapids-4-spark_2.13-24.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.explain=ALL \
  --conf spark.rapids.memory.gpu.allocSize=1536m \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark400.RapidsShuffleManager \
  --master local-cluster[2,2,1024]
Note: the command above does not set
--conf spark.executor.extraClassPath=$PWD/scala2.13/dist/target/rapids-4-spark_2.13-24.08.0-SNAPSHOT-cuda11.jar
Run
Check the executor log
Additional context
[SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order
razajafri#3