[BUG] Executor falls back to cudaMalloc if the pool can't be initialized #5242

abellina · 2022-04-13T14:16:48Z

While debugging #5183, I was confused because some of our runs with 22.06 would "work" before @jlowe added a fix to cuDF to specify that RMM use the statically linked cudart.

The reason our runs worked boiled down to us using a different allocator than the one we thought we were using. The ASYNC allocator would fail to start with:

ai.rapids.cudf.CudfException: RMM failure at: /home/jenkins/agent/workspace/jenkins-cudf_nightly-dev-github-669-cuda11/cpp/build/_deps/rmm-src/include/rmm/mr/device/cuda_async_memory_resource.hpp:67: cudaMallocAsync not supported with this CUDA driver/runtime version
        at ai.rapids.cudf.Rmm.initializeInternal(Native Method)

Despite this failure, everything still started, and RMM created a CUDA memory resource by default, which it used for the duration of the test (and the test passed).

We should let this exception tear down the executor, instead of the catch-and-log pattern we have now: https://github.com/NVIDIA/spark-rapids/blob/branch-22.06/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala#L304.
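
To illustrate why catch-and-log hides the problem, here is a minimal, self-contained Java sketch. It does not use the real cuDF or RMM API: `initializePool`, `activeResource`, and `startExecutorSwallowing` are hypothetical stand-ins, and the "cuda" default only models the fallback described above.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class SwallowedInitDemo {
    private static final Logger LOG = Logger.getLogger("SwallowedInitDemo");

    // Hypothetical stand-in for the allocator state (not the real RMM API).
    // "cuda" models the default resource backed by plain cudaMalloc.
    static String activeResource = "cuda";

    // Hypothetical stand-in for pool initialization: fails when the
    // requested ASYNC allocator is not supported by the driver/runtime.
    static void initializePool(String requested, boolean asyncSupported) {
        if (requested.equals("async") && !asyncSupported) {
            throw new IllegalStateException(
                "cudaMallocAsync not supported with this CUDA driver/runtime version");
        }
        activeResource = requested;
    }

    // Catch-and-log: the failure is recorded, but startup continues
    // with whatever resource was active before the call.
    static String startExecutorSwallowing(String requested, boolean asyncSupported) {
        activeResource = "cuda"; // reset to the modeled default
        try {
            initializePool(requested, asyncSupported);
        } catch (RuntimeException e) {
            LOG.log(Level.SEVERE, "Could not initialize RMM", e);
        }
        return activeResource;
    }

    public static void main(String[] args) {
        // ASYNC was requested, but the executor ends up on the default resource.
        System.out.println(startExecutorSwallowing("async", false));
    }
}
```

In this model the caller asked for "async" and silently got the default resource, which is the situation that let the tests pass.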


abellina · 2022-04-13T14:43:15Z

If I just remove the catch, the executor bails, but I believe we should also log something from our side saying we are exiting now.

For example, this is the output without the extra log. If people have strong opinions we can keep this behavior; otherwise I would add an extra log message (like: "Failed to initialize executor GPU memory, exiting!") before throwing that exception back up:

22/04/13 14:39:05 ERROR RapidsExecutorPlugin: Exception in the executor plugin
ai.rapids.cudf.CudfException: RMM failure at: /home/jenkins/agent/workspace/jenkins-cudf-for-dev-32-cuda11/cpp/build/_deps/rmm-src/include/rmm/mr/device/cuda_async_memory_resource.hpp:67: cudaMallocAsync not supported with this CUDA driver/runtime version 
  at ai.rapids.cudf.Rmm.initializeInternal(Native Method)
  at ai.rapids.cudf.Rmm.initialize(Rmm.java:119)
  at com.nvidia.spark.rapids.GpuDeviceManager$.initializeRmm(GpuDeviceManager.scala:300)
  at com.nvidia.spark.rapids.GpuDeviceManager$.initializeMemory(GpuDeviceManager.scala:330)
  at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpuAndMemory(GpuDeviceManager.scala:137)
  at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:222)
  at org.apache.spark.internal.plugin.ExecutorPluginContainer.$anonfun$executorPlugins$1(PluginContainer.scala:125)
  at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
  at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
  at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
  at org.apache.spark.internal.plugin.ExecutorPluginContainer.<init>(PluginContainer.scala:113)
  at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:211)
  at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:199)
  at org.apache.spark.executor.Executor.$anonfun$plugins$1(Executor.scala:253)
  at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:222)
  at org.apache.spark.executor.Executor.<init>(Executor.scala:253)
  at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:159)
  at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
  at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
  at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
  at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
  at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
22/04/13 14:39:05 ERROR Utils: Uncaught exception in thread shutdown-hook-0
java.lang.NullPointerException
  at org.apache.spark.executor.Executor.$anonfun$stop$3(Executor.scala:332)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:222)
  at org.apache.spark.executor.Executor.stop(Executor.scala:332)
  at org.apache.spark.executor.Executor.$anonfun$new$2(Executor.scala:76)
  at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
  at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
  at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at scala.util.Try$.apply(Try.scala:213)
  at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
  at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
22/04/13 14:39:05 INFO DiskBlockManager: Shutdown hook called
22/04/13 14:39:05 INFO ShutdownHookManager: Shutdown hook called
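
The log-then-rethrow option described above could look roughly like the following Java sketch. It is not the plugin's actual code: `initializePool` and `initializeMemoryOrFail` are hypothetical names, and the exception type is a stand-in for `CudfException`. Only the log message text comes from the proposal.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class FailFastInitDemo {
    private static final Logger LOG = Logger.getLogger("FailFastInitDemo");

    // Hypothetical stand-in for the native initialization call.
    static void initializePool(boolean asyncSupported) {
        if (!asyncSupported) {
            throw new IllegalStateException(
                "cudaMallocAsync not supported with this CUDA driver/runtime version");
        }
    }

    // Log a message from our side, then rethrow so the caller
    // (the executor plugin init, in the real system) fails too.
    static void initializeMemoryOrFail(boolean asyncSupported) {
        try {
            initializePool(asyncSupported);
        } catch (RuntimeException e) {
            LOG.log(Level.SEVERE, "Failed to initialize executor GPU memory, exiting!", e);
            throw e;
        }
    }

    public static void main(String[] args) {
        try {
            initializeMemoryOrFail(false);
        } catch (IllegalStateException e) {
            // In the real executor this exception would propagate and stop startup.
            System.out.println("startup aborted: " + e.getMessage());
        }
    }
}
```

Rethrowing the original exception keeps the stack trace intact, so the log still points at the failing initialization call.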

abellina added the "bug" (Something isn't working) and "? - Needs Triage" (Need team to review and classify) labels Apr 13, 2022

abellina changed the title ~~[BUG] Executor fallsback to cudaMalloc if the pool can't be initialized~~ [BUG] Executor falls back to cudaMalloc if the pool can't be initialized Apr 13, 2022

abellina self-assigned this Apr 13, 2022

abellina mentioned this issue Apr 13, 2022

Throw again after logging that RMM could not intialize #5243

Merged

abellina closed this as completed in #5243 Apr 14, 2022

sameerz removed the "? - Needs Triage" (Need team to review and classify) label Apr 18, 2022
