Does "an illegal memory access was encountered" on CUDA ever occur due to overloading the GPU memory? #2330
Comments
@bartlettroscoe : a little clarification, does |
When you run with I guess I could purposefully overload the GPUs by running with a high FYI: Kitware is working on trying to improve our running of multiple tests with multiple MPI ranks at the same time on GPU machines. You can see their current effort in PR trilinos/Trilinos#5598. |
@bartlettroscoe to answer the original question, I think the answer is no. All the calls to |
@ibaned, are the calls to I don't want developers to continue to use the excuse that they think that a test is failing due to overloading the GPU (because we are using too high of a |
@bartlettroscoe we can start with a pure CUDA program to confirm that overloading causes a failure, after that I'm not sure how necessary a wrapper would be. A simple |
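For reference, a minimal sketch of that kind of pure CUDA probe (the 1 GiB block size and the output format here are illustrative, not taken from this thread): it keeps calling cudaMalloc until the device refuses and then reports which error code comes back.

```cuda
// Sketch: repeatedly cudaMalloc 1 GiB blocks until allocation fails, then
// report the error.  Illustrative only; not the exact program from the thread.
#include <cstddef>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
  const size_t oneGiB = size_t(1) << 30;
  std::vector<void*> blocks;
  cudaError_t err = cudaSuccess;
  while (true) {
    void* p = nullptr;
    err = cudaMalloc(&p, oneGiB);
    if (err != cudaSuccess) break;
    blocks.push_back(p);
  }
  // Plain cudaMalloc normally fails gracefully with cudaErrorMemoryAllocation
  // ("out of memory"), not with an illegal-memory-access error.
  std::printf("allocated %zu GiB before failure: %s\n",
              blocks.size(), cudaGetErrorString(err));
  for (void* p : blocks) cudaFree(p);
  return 0;
}
```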
FYI, UVM allocations can overcommit, just like CPU allocations (given Linux overcommit). |
confirmed, trying to allocate |
Worse still, I just ran a program that allocates 100 |
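A minimal sketch of the UVM overcommit behavior described above (the 100 GiB figure and the kernel are illustrative, assuming a 64-bit Linux host with memory overcommit enabled): cudaMallocManaged can hand back far more memory than the device physically has, and pages only migrate to the GPU as they are touched, so the allocation itself does not fail.

```cuda
// Sketch of UVM overcommit: ask for ~100 GiB of managed memory on a GPU with
// far less physical memory.  With Linux overcommit the allocation typically
// succeeds and pages migrate to the device on demand; this by itself does not
// produce an "illegal memory access" error.  Sizes are illustrative.
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(char* data, size_t n) {
  size_t i = blockIdx.x * size_t(blockDim.x) + threadIdx.x;
  if (i < n) data[i] = 1;  // faulting a page migrates it to the GPU
}

int main() {
  const size_t bytes = size_t(100) << 30;  // 100 GiB, more than the device has
  char* data = nullptr;
  cudaError_t err = cudaMallocManaged(&data, bytes);
  std::printf("cudaMallocManaged(100 GiB): %s\n", cudaGetErrorString(err));
  if (err == cudaSuccess) {
    touch<<<1024, 256>>>(data, bytes);   // touches only a small prefix
    err = cudaDeviceSynchronize();
    std::printf("kernel: %s\n", cudaGetErrorString(err));
    cudaFree(data);
  }
  return 0;
}
```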
I unfortunately still don't have an answer to @bartlettroscoe's original question since I can't trigger any kind of error with horribly oversubscribed UVM memory. Let's stop using UVM. |
@crtrott says oversubscription shouldn't be able to cause illegal address accesses |
Given that Trilinos doesn't seem to be moving any closer to UVM-free execution, I think the right short-term move (for many reasons besides this one) is to move to |
@ibaned, okay, to be clear, because of UVM usage, we could be running out of memory on the GPU and that could trigger strange errors like these "illegal memory access was encountered" errors? |
If only one process is using the GPU, it looks like CUDA will swap UVM pages in and out of the GPU as they are used, so there shouldn't be any strange errors. Also, if we're not running with the MPS server, then only one process can run on the GPU at a time. |
Even with
I have asked Kitware staff member Kyle Edwards to look into the MPS server and try it out. See: (if you don't have access just ask me.) |
The MPS server makes it so kernels from different processes can run concurrently, otherwise the GPU is essentially timesliced between processes. Illegal memory access should generally not be coming from oversubscribing memory. But who knows. Typically illegal memory accesses are stuff like accessing static variables, constexpr thingies in some cases, dereferencing something pointing to host stack variables, accessing out of bounds shared memory, or accessing out of bounds device stack variables. |
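As a concrete illustration of one item on that list (a made-up example, not taken from any Trilinos test): dereferencing a pointer to a host stack variable inside a kernel typically yields exactly this cudaErrorIllegalAddress, i.e. "an illegal memory access was encountered".

```cuda
// Sketch of a classic illegal memory access: the kernel dereferences a
// pointer to a host stack variable, which is not valid device memory.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void readHostPointer(const int* p, int* out) {
  *out = *p;  // p points to host stack memory -> illegal address on the device
}

int main() {
  int hostValue = 42;          // lives on the host stack
  int* devOut = nullptr;
  cudaMalloc(&devOut, sizeof(int));

  readHostPointer<<<1, 1>>>(&hostValue, devOut);  // wrong: &hostValue is a host address
  cudaError_t err = cudaDeviceSynchronize();

  // Typically prints "an illegal memory access was encountered"
  // (cudaErrorIllegalAddress), the same message discussed in this issue.
  std::printf("%s\n", cudaGetErrorString(err));
  cudaFree(devOut);
  return 0;
}
```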
Is |
@jjellio asked:
Yes. See: I believe this was needed to allow some Tpetra tests to run at the same time as other tests. |
@crtrott said:
Some recent experiments by Kitware staff member @KyleFromKitware show some good speedup from using the MPS server on 'waterman'. For those with access from SNL, see: It is not clear if the MPS server will cause the kernels to spread out over the two GPUs on 'waterman', for example. Should we expect the MPS server to spread work onto multiple GPUs automatically, or does that need to be done manually with the CTest/Kokkos allocation work being done in trilinos/Trilinos#5598? |
I am not quite sure actually, there is something funky around CUDA_VISIBLE_DEVICES |
@crtrott said:
We can research this more as part of the FY20 Kitware contract on this. |
Hi @ibaned and @bartlettroscoe -- Has this issue been resolved? If so, may I close it? If not, would you please detail what else needs to be done? |
I think the Trilinos team now runs one test at a time for CUDA builds, which avoids this issue |
Correct, see trilinos/Trilinos#6840. Therefore, we would not be seeing this in the regular automated builds of Trilinos (and the ATDM Trilinos builds). But regular users may be seeing this when they are not setting up to run on a GPU system. |
Hello Kokkos developers,
Is it ever the case that overloading the GPU memory can trigger errors that look like "an illegal memory access was encountered"?
There are several Trilinos tests that randomly fail in the ATDM Trilinos CUDA builds as shown in this query.
We have seen this "illegal memory access was encountered" error in several ATDM Trilinos issues including trilinos/Trilinos#5179, trilinos/Trilinos#5002, trilinos/Trilinos#4551, trilinos/Trilinos#4123, trilinos/Trilinos#3543, trilinos/Trilinos#3438, and trilinos/Trilinos#2827. In the case of trilinos/Trilinos#3542, we know this error was caused by code that is not designed to run with CUDA on the GPU, but in other cases it was caused by bugs in code.
It seems like we have seen out-of-bounds errors when overloading the GPU memory.
So can we expect to see "illegal memory access was encountered" errors when running out of GPU memory? Is there some better way to detect when the GPU memory has been exhausted and provide a better error message?
The reason this is important is that some developers have reported seeing errors that make them think the GPU is being overloaded, and they then simply discount and ignore the ATDM Trilinos GitHub issues that are being created. Therefore, we would like a reliable way to detect when GPU memory might be getting exhausted so that we can adjust the parallel test execution level with ctest. (Otherwise, we are going to need to switch to ctest -j1 on all of the ATDM Trilinos CUDA builds and just suffer the wasted CPU/GPU wall-clock time that results from this.)
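One possible way to get a better diagnostic for the question above, sketched under the assumption that allocations can be routed through a wrapper (the function name and message below are hypothetical, not part of Kokkos or Trilinos): query cudaMemGetInfo when an allocation fails and report how much device memory was actually free at that point.

```cuda
// Sketch of a wrapper that reports free vs. total device memory when an
// allocation fails, to make memory exhaustion easier to diagnose.  The
// function name and threshold-free design are hypothetical.
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

cudaError_t mallocWithDiagnostics(void** ptr, size_t bytes) {
  cudaError_t err = cudaMalloc(ptr, bytes);
  if (err != cudaSuccess) {
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);
    std::fprintf(stderr,
                 "cudaMalloc of %zu bytes failed (%s); device has %zu of %zu "
                 "bytes free -- GPU memory may be exhausted or oversubscribed\n",
                 bytes, cudaGetErrorString(err), freeB, totalB);
  }
  return err;
}

int main() {
  void* p = nullptr;
  if (mallocWithDiagnostics(&p, size_t(1) << 30) == cudaSuccess) cudaFree(p);
  return 0;
}
```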