
[BUG] Executor falls back to cudaMalloc if the pool can't be initialized #5242

Closed
abellina opened this issue Apr 13, 2022 · 1 comment · Fixed by #5243
Assignees
Labels
bug Something isn't working

Comments

@abellina
Collaborator

While debugging #5183, I was confused because some of our runs with 22.06 would "work" before @jlowe added a fix to cuDF to make RMM use the statically linked cudart.

The reason those runs worked boiled down to us using a different allocator than the one we thought we were using: the ASYNC allocator would fail to start with:

ai.rapids.cudf.CudfException: RMM failure at: /home/jenkins/agent/workspace/jenkins-cudf_nightly-dev-github-669-cuda11/cpp/build/_deps/rmm-src/include/rmm/mr/device/cuda_async_memory_resource.hpp:67: cudaMallocAsync not supported with this CUDA driver/runtime version
        at ai.rapids.cudf.Rmm.initializeInternal(Native Method)

As a result, everything still started, and RMM fell back to creating a default cuda memory resource, which it then used for the duration of the test (and the test passed).

We should let this exception tear down the executor instead of the catch-and-log pattern we have now: https://github.com/NVIDIA/spark-rapids/blob/branch-22.06/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala#L304.
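
For reference, the pattern in question looks roughly like the following (a simplified, hypothetical sketch: initializePool() stands in for the Rmm.initialize call made from GpuDeviceManager.initializeRmm, and the log wording is illustrative, not the actual code):

object CatchAndLogSketch {
  // Hypothetical stand-in for the real Rmm.initialize(...) call; here it just
  // fails the same way the ASYNC allocator did in the run above.
  private def initializePool(): Unit =
    throw new RuntimeException(
      "cudaMallocAsync not supported with this CUDA driver/runtime version")

  def main(args: Array[String]): Unit = {
    try {
      initializePool()
    } catch {
      case e: Exception =>
        // The failure is only logged, so executor startup continues and RMM
        // silently falls back to a default cuda memory resource (cudaMalloc).
        System.err.println(s"Could not initialize RMM: ${e.getMessage}")
    }
    println("Executor keeps running on the fallback allocator...")
  }
}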

@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify labels Apr 13, 2022
@abellina abellina changed the title [BUG] Executor fallsback to cudaMalloc if the pool can't be initialized [BUG] Executor falls back to cudaMalloc if the pool can't be initialized Apr 13, 2022
@abellina abellina self-assigned this Apr 13, 2022
@abellina
Collaborator Author

abellina commented Apr 13, 2022

If I just remove the catch, the executor bails as desired, but I believe we should also log something from our side saying we are exiting.

For example, this is the output without the extra log. If people have strong opinions we can keep this behavior, or do what I am proposing and emit the extra log message (something like: "Failed to initialize executor GPU memory, exiting!") before throwing the exception back up (see the sketch after the log below):

22/04/13 14:39:05 ERROR RapidsExecutorPlugin: Exception in the executor plugin
ai.rapids.cudf.CudfException: RMM failure at: /home/jenkins/agent/workspace/jenkins-cudf-for-dev-32-cuda11/cpp/build/_deps/rmm-src/include/rmm/mr/device/cuda_async_memory_resource.hpp:67: cudaMallocAsync not supported with this CUDA driver/runtime version 
  at ai.rapids.cudf.Rmm.initializeInternal(Native Method)
  at ai.rapids.cudf.Rmm.initialize(Rmm.java:119)
  at com.nvidia.spark.rapids.GpuDeviceManager$.initializeRmm(GpuDeviceManager.scala:300)
  at com.nvidia.spark.rapids.GpuDeviceManager$.initializeMemory(GpuDeviceManager.scala:330)
  at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpuAndMemory(GpuDeviceManager.scala:137)
  at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:222)
  at org.apache.spark.internal.plugin.ExecutorPluginContainer.$anonfun$executorPlugins$1(PluginContainer.scala:125)
  at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
  at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
  at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
  at org.apache.spark.internal.plugin.ExecutorPluginContainer.<init>(PluginContainer.scala:113)
  at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:211)
  at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:199)
  at org.apache.spark.executor.Executor.$anonfun$plugins$1(Executor.scala:253)
  at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:222)
  at org.apache.spark.executor.Executor.<init>(Executor.scala:253)
  at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:159)
  at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
  at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
  at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
  at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
  at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
22/04/13 14:39:05 ERROR Utils: Uncaught exception in thread shutdown-hook-0
java.lang.NullPointerException
  at org.apache.spark.executor.Executor.$anonfun$stop$3(Executor.scala:332)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:222)
  at org.apache.spark.executor.Executor.stop(Executor.scala:332)
  at org.apache.spark.executor.Executor.$anonfun$new$2(Executor.scala:76)
  at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
  at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
  at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at scala.util.Try$.apply(Try.scala:213)
  at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
  at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
22/04/13 14:39:05 INFO DiskBlockManager: Shutdown hook called
22/04/13 14:39:05 INFO ShutdownHookManager: Shutdown hook called
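
Concretely, the change I have in mind is roughly this (a minimal sketch under the same assumption as above, with a hypothetical initializePool() standing in for the Rmm.initialize call; not the actual patch in #5243):

object LogAndRethrowSketch {
  // Hypothetical stand-in for the real Rmm.initialize(...) call.
  private def initializePool(): Unit =
    throw new RuntimeException(
      "cudaMallocAsync not supported with this CUDA driver/runtime version")

  def main(args: Array[String]): Unit = {
    try {
      initializePool()
    } catch {
      case e: Exception =>
        // Emit a clear message from our side, then rethrow so plugin init
        // fails and the executor is torn down instead of limping along.
        System.err.println("Failed to initialize executor GPU memory, exiting!")
        throw e
    }
  }
}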
