Kokkos error reporting failures with CUDA GPUs in exclusive mode #2471
Comments
@crtrott Do we have access to a testbed where this is enabled or where we can enable this? The only machine I have root on is my laptop, and CUDA doesn't work on Macs...
I can test this on Apollo or Kokkos-Dev-2.
Actually, I can set the second GPU on kokkos-dev-2 to exclusive mode for testing purposes. You then need to launch with CUDA_VISIBLE_DEVICES=1.
This is set up and confirmed. The same executable passes when run on device 0 (which is in default mode), while on device 1 (which is now in exclusive mode) it fails.
I believe this is caused by our use of gtest for a "death_test", not by how Kokkos reports failures. I bet gtest death tests spawn a child process, which the original process then checks to have died. The test fails because the child dies with the wrong error message (i.e. it couldn't get a GPU, instead of hitting the expected assert). My guess is that in exclusive mode we just need to disable death tests ...
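To illustrate the suspected mechanism, here is a minimal POSIX sketch of how a death-test style check can work: fork a child, capture its stderr, and require both that it died and that its message matches a pattern. This is not gtest's actual implementation. The helper names and both error messages are hypothetical, and the "no GPU" child only simulates a process that cannot acquire a device.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cstdio>
#include <cstdlib>
#include <functional>
#include <regex>
#include <string>

struct ChildResult {
  bool died;             // killed by a signal or exited with non-zero status
  std::string err_text;  // everything the child wrote to stderr
};

// Run `body` in a forked child process and capture its stderr (POSIX only).
ChildResult run_in_child(const std::function<void()>& body) {
  int fds[2];
  if (pipe(fds) != 0) { std::perror("pipe"); std::exit(EXIT_FAILURE); }
  std::fflush(nullptr);  // do not duplicate buffered parent output in the child
  const pid_t pid = fork();
  if (pid < 0) { std::perror("fork"); std::exit(EXIT_FAILURE); }
  if (pid == 0) {
    close(fds[0]);
    dup2(fds[1], STDERR_FILENO);  // child's stderr now feeds the pipe
    close(fds[1]);
    body();
    _exit(0);  // body returned normally, so the child did not die
  }
  close(fds[1]);
  std::string text;
  char buf[256];
  ssize_t n;
  while ((n = read(fds[0], buf, sizeof buf)) > 0) {
    text.append(buf, static_cast<std::size_t>(n));
  }
  close(fds[0]);
  int status = 0;
  waitpid(pid, &status, 0);
  const bool died = WIFSIGNALED(status) ||
                    (WIFEXITED(status) && WEXITSTATUS(status) != 0);
  return {died, text};
}

// The check passes only if the child died AND its stderr matches `pattern`.
bool death_check_passes(const std::function<void()>& body,
                        const std::string& pattern) {
  const ChildResult r = run_in_child(body);
  return r.died && std::regex_search(r.err_text, std::regex(pattern));
}

// Hypothetical child bodies; the messages are made up for illustration.
void die_with_expected_assert() {
  std::fprintf(stderr, "assertion failed: index out of bounds\n");
  std::abort();
}

void die_without_gpu() {
  // Simulates a child that cannot get the GPU because the parent holds it.
  std::fprintf(stderr, "no CUDA device available\n");
  _exit(1);
}
```

With these definitions, `death_check_passes(die_with_expected_assert, "assertion failed")` succeeds, while `death_check_passes(die_without_gpu, "assertion failed")` does not: the child still dies, but with a different message than the one the check expects, which is the failure pattern described above.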
I am tentatively marking this as an enhancement, not a bug, since I am pretty convinced that it only affects testing, and by fiat I declare that running our unit tests on a GPU in exclusive mode is currently not supported.
Alternatively, we could give all death tests a common name prefix, which would give systems in exclusive mode an easy way of excluding those tests.
We merged this with a suffix, following the recommendations from gtest. Thus in exclusive mode you can now simply exclude the death tests. We may come back to this and try to disable death_tests internally when we detect that GPUs are in exclusive mode.
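As a rough sketch of the "disable internally" idea, the helper below maps a GPU compute mode to a gtest filter string. It assumes the merged suffix follows gtest's `DeathTest` naming convention, which this thread does not spell out. The helper name is hypothetical, and the enum values are copied from the CUDA runtime's `cudaComputeMode` so that the sketch compiles without the CUDA toolkit.

```cpp
#include <string>

// Mirrors the values of the CUDA runtime's cudaComputeMode enum.
enum ComputeMode {
  kDefault = 0,
  kExclusiveThread = 1,  // legacy exclusive mode
  kProhibited = 2,
  kExclusiveProcess = 3
};

// Hypothetical helper: pick a --gtest_filter value for the given compute mode.
// Assumes death test suites carry "DeathTest" in their names.
std::string gtest_filter_for(int compute_mode) {
  const bool exclusive = compute_mode == kExclusiveThread ||
                         compute_mode == kExclusiveProcess;
  // A leading '-' makes the patterns that follow negative (excluded) in gtest.
  return exclusive ? "-*DeathTest*" : "*";
}
```

In a real test driver the mode could presumably be read with `cudaDeviceGetAttribute(&mode, cudaDevAttrComputeMode, device)` and the result applied to gtest's filter flag before `RUN_ALL_TESTS()`. Until something like that exists, passing `--gtest_filter=-*DeathTest*` on the command line has the same effect under the same naming assumption.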
CUDA allows GPUs to be placed in an "exclusive process" mode that permits at most one process to use a GPU at a time. This appears to cause problems with how Kokkos reports errors, as seen in the Kokkos test suite:
yields this:
The entire test suite runs fine if I switch the GPU back to the default compute mode with: