
test_api_gpu fails every time with CUDA_EXCEPTION_15 #2

Closed
sdorminey opened this issue May 14, 2018 · 4 comments

sdorminey commented May 14, 2018

test_api_gpu dies for me, every time, with Invalid Managed Memory Access while evaluating the NAND gate (before bootstrapping occurs). It looks like this code runs on the host thread, but the underlying data (in unified memory) is mapped to the GPU, causing the error.

Would love a workaround, since this project looks really neat! Let me know if you need more info.


System setup:

  • Ubuntu LTS 16.04
  • NVidia drivers, 390.48
  • CUDA Toolkit, v7.5
  • NVidia GeForce 940M (Compute Capability 5.0)

Output:
------ Key Generation ------
------ Test Encryption/Decryption ------
Number of tests: 96
PASS
------ Initilizating Data on GPU(s) ------
------ Test NAND Gate ------
Number of tests: 96
(crashes here)

Stack trace:
Thread [1] 14501 [core: 2] (Suspended : Signal : CUDA_EXCEPTION_15:Invalid Managed Memory Access)
cufhe::Nand() at cufhe_gates_gpu.cu:50 0x7ffff7b18223
main() at test_api_gpu.cu:116 0x4048c1

WeiDaiWD (Collaborator) commented May 14, 2018

Thank you very much for your report. I think I have found the reason for this crash.

Since we launch several NAND gates concurrently on a single device, one NAND gate may be running a kernel that accesses some unified memory while another NAND gate accesses other unified memory from the host. This is not allowed on devices with compute capability < 6.x; see "Unified memory coherency and concurrency" in the CUDA documentation.
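
For illustration, a minimal sketch of the problematic pattern (made-up names, not the actual cuFHE code): a kernel is running on one managed buffer while the host writes to a different managed buffer.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that keeps the device busy on managed buffer `a`.
__global__ void SpinKernel(int* a, int iters) {
  for (int i = 0; i < iters; ++i) a[threadIdx.x] += i;
}

int main() {
  int *a, *b;
  cudaMallocManaged(&a, 256 * sizeof(int));
  cudaMallocManaged(&b, 256 * sizeof(int));

  SpinKernel<<<1, 256>>>(a, 1 << 20);  // kernel touches only `a`

  // On compute capability < 6.0, launching ANY kernel gives the device
  // exclusive access to ALL managed allocations, so this host write to `b`
  // faults (cuda-gdb reports CUDA_EXCEPTION_15). On 6.x+ it is allowed.
  b[0] = 42;

  cudaDeviceSynchronize();
  printf("b[0] = %d\n", b[0]);
  cudaFree(a);
  cudaFree(b);
  return 0;
}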

I am working on a work-around that allocates both host and device memory and transfers data when needed. Hopefully I will post it tomorrow, once the new fix passes on a Titan X (compute capability 5.2).
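
Roughly, the work-around pattern looks like this (a sketch with made-up names, not the actual fix): replace cudaMallocManaged with separate host and device buffers plus explicit copies, so the host never touches device-resident memory while kernels run.

#include <cuda_runtime.h>

// Hypothetical ciphertext buffer with mirrored host/device storage.
struct CtxtBuf {
  char*  host;
  char*  dev;
  size_t bytes;
};

void CtxtAlloc(CtxtBuf& c, size_t bytes) {
  c.bytes = bytes;
  cudaMallocHost(reinterpret_cast<void**>(&c.host), bytes);  // pinned host memory
  cudaMalloc(reinterpret_cast<void**>(&c.dev), bytes);
}

void CtxtToDevice(const CtxtBuf& c, cudaStream_t st) {
  cudaMemcpyAsync(c.dev, c.host, c.bytes, cudaMemcpyHostToDevice, st);
}

void CtxtToHost(const CtxtBuf& c, cudaStream_t st) {
  cudaMemcpyAsync(c.host, c.dev, c.bytes, cudaMemcpyDeviceToHost, st);
  // Caller must synchronize the stream before reading c.host.
}

void CtxtFree(CtxtBuf& c) {
  cudaFreeHost(c.host);
  cudaFree(c.dev);
}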

If that does not work on your device, then I will need more info from your side. Thanks again.

WeiDaiWD (Collaborator) commented

OK, this is not a perfect fix. Please try to compile/run the code in the new branch. This fix does not use unified memory, and I see no crash on a Titan X. Let me know if it still does not work on your system.

Ironically, I now see a new issue, which is the reason I didn't merge it to master: after the fix, fewer than 0.5% of gates give the wrong result. I do not have much of a clue here. It could be a problem with using pinned memory; I will have to test with pageable memory instead and see. If you have any idea about this, please shed some light here. I would very much appreciate that.

WeiDaiWD (Collaborator) commented

I have temporarily created another branch for pre-Pascal GPUs. Performance is much slower since I have to disable concurrent kernel launching for now. The results are correct and safe to play with. I am working on a proper fix now.
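
Roughly what "no concurrency" means here (a sketch with made-up names, not the branch's actual code): each gate kernel is launched into the default stream and the device is synchronized before the host touches any data again, so kernels never overlap with host access.

#include <cuda_runtime.h>

// Hypothetical ciphertext handle and gate kernel, for illustration only.
struct Ctxt { char* dev; };

__global__ void NandKernel(Ctxt out, Ctxt a, Ctxt b) {
  // ... bootstrapped NAND would go here ...
}

void NandSerialized(Ctxt out, Ctxt a, Ctxt b) {
  NandKernel<<<64, 128>>>(out, a, b);  // default stream: gates run one at a time
  cudaDeviceSynchronize();             // host access is safe only after this
}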

sdorminey (Author) commented

Awesome! test_api_gpu now succeeds, and I get ~22 ms per gate, testing with both the hot-fix and older_than_6.0_no_concurrency branches. Thank you for the speedy workaround!

I'm going to play with the python bindings next - I'll let you know if I run into any issues.
