Troubleshooting help? #27
Comments
Hi Kevin, could you please re-launch the test with the environment variable NCCL_DEBUG set to "WARN"? NCCL should display a clear error message before returning the error. Thanks, |
Here's the output.
|
That's strange. It looks like it cannot find cuIpcGetMemHandle in libcuda.so. But I think the test is linked against libcuda.so, so libcuda should be there. |
Sorry, I read that wrong. It is not a symbol problem, just cudaIpcGetMemHandle returning an error. |
I didn't want to close this right away. Re-opening. @kmatzen can you check if that commit fixes your problem? Thanks. |
NCCL_DEBUG=WARN ./build/test/single/all_reduce_test 10000000 |
The message you get ("invalid device function") means that one of the GPUs cannot execute the compiled code. Indeed, the GTX760 has a compute capability of 3.0 (https://developer.nvidia.com/cuda-gpus), and the default NVCC_GENCODE in the Makefile only compiles for 3.5 and later. Can you try to recompile with:
(compute capability 5.2 is for the GTX960, 3.0 for the GTX760) |
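For anyone hitting the same "invalid device function" error, here is a minimal sketch of how the flags for a given set of GPUs could be assembled. It assumes nvcc's standard `-gencode=arch=compute_XY,code=sm_XY` syntax; the helper name `gencode_flags` is hypothetical, and this is not necessarily the exact command suggested in this thread.

```shell
# Hypothetical helper: turn compute capabilities (e.g. 3.0 5.2) into
# nvcc -gencode flags, one flag per architecture.
gencode_flags() {
  flags=""
  for cc in "$@"; do
    arch=$(printf '%s' "$cc" | tr -d '.')   # "3.0" becomes "30"
    flags="$flags -gencode=arch=compute_${arch},code=sm_${arch}"
  done
  printf '%s\n' "${flags# }"                # drop the leading space
}

# Flags for a GTX760 (3.0) and a GTX960 (5.2)
gencode_flags 3.0 5.2
```

The result could then be passed to the build, for example `make NVCC_GENCODE="$(gencode_flags 3.0 5.2)"`, assuming the Makefile lets NVCC_GENCODE be overridden from the command line.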
Thank you!
|
That's another way to do it, indeed. Actually, building for many architectures also increases build times (for everyone), so you may want to compile only for the architectures you care about. |
Closing. |
I followed the instructions from the readme, but I can't get the tests to run. Is there any additional advice someone can give me?