[BUG] Executor falls back to cudaMalloc if the pool can't be initialized #5242
Labels: bug (Something isn't working)
Comments
abellina added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Apr 13, 2022
abellina changed the title from "[BUG] Executor fallsback to cudaMalloc if the pool can't be initialized" to "[BUG] Executor falls back to cudaMalloc if the pool can't be initialized" on Apr 13, 2022
If I just remove the catch, the executor bails, but I believe we should also log something from our side saying that we are exiting. If people have strong opinions we can keep the current behavior; otherwise I would add an extra log message (like: "Failed to initialize executor GPU memory, exiting!") before throwing the exception back up. This is what it looks like without the extra log:
While debugging #5183, I was confused because some of our runs with 22.06 would "work" before @jlowe added a fix to cuDF to specify that RMM use the statically linked cudart.
Those runs worked because we were not using the allocator we thought we were using. ASYNC would fail to start with:
This allowed everything to start, with RMM creating a CUDA memory resource by default, which it used for the duration of the test (and the test passed).
We should let this exception tear down the executor instead of using the catch-and-log pattern we have now: https://github.com/NVIDIA/spark-rapids/blob/branch-22.06/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala#L304.
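To make the difference concrete, here is a minimal Scala sketch of the two patterns. It is not the code in `GpuDeviceManager.scala`: `PoolInitSketch`, `PoolInitException`, `initializePool`, `initCatchAndLog`, and `initFailFast` are hypothetical names, the pool failure is simulated, and the logger is a plain `String => Unit` function rather than the plugin's logging.

```scala
import scala.util.control.NonFatal

// Hypothetical stand-ins for illustration only; none of these names come
// from spark-rapids, cuDF, or RMM.
object PoolInitSketch {
  final class PoolInitException(msg: String) extends RuntimeException(msg)

  // Simulates a pool allocator that cannot start.
  def initializePool(): Unit =
    throw new PoolInitException("simulated: pool failed to start")

  // Catch-and-log: the failure is logged and swallowed, so the caller keeps
  // running with whatever default allocator happens to be in place.
  def initCatchAndLog(log: String => Unit): Unit =
    try initializePool()
    catch {
      case NonFatal(e) => log(s"Could not initialize pool: ${e.getMessage}")
    }

  // Fail fast: log that we are exiting, then rethrow so the failure
  // propagates to the caller instead of being hidden.
  def initFailFast(log: String => Unit): Unit =
    try initializePool()
    catch {
      case NonFatal(e) =>
        log("Failed to initialize executor GPU memory, exiting!")
        throw e
    }
}
```

With `initCatchAndLog` the caller cannot tell that initialization failed unless it reads the log, which matches the confusing "passing" runs described above. With `initFailFast` the caller sees the exception and can tear the executor down.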