
Does "an illegal memory access was encountered" on CUDA ever occur due to overloading the GPU memory? #2330

Closed
bartlettroscoe opened this issue Sep 12, 2019 · 23 comments

@bartlettroscoe
Contributor

Hello Kokkos developers,

Is it ever the case that overloading the GPU memory can trigger errors that look like:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered <file-name>:<line-number>

There are several Trilinos tests that randomly fail in the ATDM Trilinos CUDA builds, as shown in this query.

We have seen this "illegal memory access was encountered" error in several ATDM Trilinos issues including trilinos/Trilinos#5179, trilinos/Trilinos#5002, trilinos/Trilinos#4551, trilinos/Trilinos#4123, trilinos/Trilinos#3543, trilinos/Trilinos#3438, and trilinos/Trilinos#2827. In the case of trilinos/Trilinos#3542, we know this error was caused by code that is not designed to run with CUDA on the GPU, but in other cases it was caused by bugs in the code.

It seems like we have seen these out-of-bounds errors when overloading the GPU memory.

So can we expect to see "illegal memory access was encountered" errors when running out of GPU memory? Is there some better way to detect when the GPU memory has been exhausted and provide a better error message?

The reason this is important is that some developers have reported seeing errors that make them think the GPU is being overloaded, and they then simply discount and ignore the ATDM Trilinos GitHub issues that are created. Therefore, we would like a reliable way to detect when GPU memory might be getting exhausted so that we can adjust the parallel test execution level with ctest. (Otherwise, we are going to need to switch to ctest -j1 on all of the ATDM Trilinos CUDA builds and just suffer the wasted CPU/GPU wall-clock time that results.)

@DavidPoliakoff
Contributor

@bartlettroscoe: a little clarification: does ctest -j1 fix the problem because the problem arises from multiple processes using the GPU? It's fairly easy to fix if one process uses too much GPU memory; if it's a contention-between-processes issue, things get a little more interesting. I couldn't tell one way or the other from skimming a couple of those bugs.

@bartlettroscoe
Contributor Author

@DavidPoliakoff,

When you run with ctest -j1, ctest will run all of the defined tests, one at a time. Many tests use multiple MPI processes, so a single test can invoke multiple kernels from different processes on the same GPU. For example, many Trilinos tests use 4 MPI ranks per test, so even with ctest -j1 you are still getting 4 MPI ranks launching kernels on the same GPU at the same time. When running with, say, ctest -j8 (which we commonly run with on many GPU machines), it can run as many as 8 MPI ranks at a time (e.g., two tests with 4 MPI ranks per test running at the same time).

I guess I could purposefully overload the GPUs by running with a high ctest -j<N> and then look at the errors that were produced but I thought that Kokkos developers would already have an FAQ or something that mentions this. (And again, if Kokkos could provide a robust way to detect OOM conditions on the GPU and report them as such, that would be a great help.)

FYI: Kitware is working on trying to improve our running of multiple tests with multiple MPI ranks at the same time on GPU machines. You can see their current effort in PR trilinos/Trilinos#5598.

@ibaned
Contributor

ibaned commented Sep 13, 2019

@bartlettroscoe to answer the original question, I think the answer is no. All the calls to cudaMalloc and friends in Kokkos should have their return codes checked, and running out of GPU memory should result in a non-success return code from one of these functions. The failure point should be there, instead of accessing a bad address somewhere down the line. However, I don't regularly test what the behavior is when overallocating, so I can't say this for sure.
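
Roughly, the check in question amounts to this pattern (a minimal sketch against the raw CUDA runtime API, not the actual Kokkos error-handling code):

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <stdexcept>
#include <string>

// Allocate device memory and turn a failed cudaMalloc into an exception,
// instead of letting a bad pointer flow into a later kernel launch.
void* checked_device_alloc(std::size_t bytes) {
  void* ptr = nullptr;
  cudaError_t err = cudaMalloc(&ptr, bytes);
  if (err != cudaSuccess) {
    // Running out of device memory shows up here as cudaErrorMemoryAllocation,
    // not as an "illegal memory access" further down the line.
    throw std::runtime_error(std::string("cudaMalloc failed: ") +
                             cudaGetErrorString(err));
  }
  return ptr;
}
```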

@bartlettroscoe
Contributor Author

bartlettroscoe commented Sep 13, 2019

All the calls to cudaMalloc and friends in Kokkos should have their return codes checked, and running out of GPU memory should result in a non-success return code from one of these functions.

@ibaned, are the calls to cudaMalloc() fairly isolated so we can test them when we overload memory? Could one put in a mock cudaMalloc() with which one could trigger an out-of-memory error and then verify that Kokkos catches and responds to it correctly? A very thin (inline) kokkosCudaMalloc() function that all of Kokkos calls (and that, in unit-test mode, simulates out-of-memory conditions) should do.
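
Something along these lines is what I have in mind; kokkosCudaMalloc and the test hook below are purely hypothetical names for illustration, not anything that exists in Kokkos today:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical test hook: when set to true, the wrapper pretends the device
// is out of memory so the error-handling path can be exercised in a unit test.
static bool g_simulate_device_oom = false;

inline cudaError_t kokkosCudaMalloc(void** ptr, std::size_t bytes) {
  if (g_simulate_device_oom) {
    *ptr = nullptr;
    return cudaErrorMemoryAllocation;  // mimic the real out-of-memory code
  }
  return cudaMalloc(ptr, bytes);       // normal path
}
```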

I don't want developers to continue to use the excuse that they think a test is failing due to overloading the GPU (because we are using too high of a ctest -j<N>) and then ignore Trilinos GitHub issues. I know most software is not unit tested for OOM conditions but this seems like a major problem with running test suites on GPUs.

@ibaned
Contributor

ibaned commented Sep 13, 2019

@bartlettroscoe we can start with a pure CUDA program to confirm that overloading causes a failure, after that I'm not sure how necessary a wrapper would be. A simple grep search shows 6 places where cudaMalloc is called and the only one not checked is allocating a single integer once at the beginning of the program. If we know out of memory returns a code and we check all of these, I think that's proof enough that it's not the issue.
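
Something like this should be enough to see what return code comes back (a rough sketch; the 100 GiB request is just an arbitrary size larger than any current device's memory):

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

int main() {
  void* ptr = nullptr;
  // Request far more than the GPU has, to force the out-of-memory path.
  const std::size_t huge = std::size_t(100) << 30;  // 100 GiB
  cudaError_t err = cudaMalloc(&ptr, huge);
  std::printf("cudaMalloc(100 GiB) returned %d: %s\n",
              static_cast<int>(err), cudaGetErrorString(err));
  return 0;
}
```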

@mhoemmen
Contributor

I know most software is not unit tested for OOM conditions but this seems like a major problem with running test suites on GPUs.

FYI, UVM allocations can overcommit, just like CPU allocations (given Linux overcommit).

@ibaned
Contributor

ibaned commented Sep 13, 2019

Confirmed: trying to allocate 100GB with cudaMalloc gives a nice return code; however, cudaMallocManaged (which underpins all the UVM allocations that Trilinos requires) gives no such error code. Thus, as long as we are using UVM, it is actually possible that oversubscribing memory leads to way-down-the-line segfaults instead of nice error messages. This seems like another extremely compelling reason to put all Tpetra/MueLu resources into eliminating UVM completely.

@ibaned
Contributor

ibaned commented Sep 13, 2019

Worse still, I just ran a program that makes 100 1GB allocations and then fills them all with actual data, and that still didn't run into any error conditions! I assume the GPU is "conveniently" swapping pages out to CPU memory in order to make this happen. Maybe in @mhoemmen's article the applications appreciate this behavior, but I think for HPC this much unchecked magic is not worth the trouble it introduces.
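
The test was roughly of this shape (a sketch, not the exact program; touching the pages from the host here is just one way to force them to be populated):

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
  const std::size_t one_gb = std::size_t(1) << 30;
  for (int i = 0; i < 100; ++i) {
    void* ptr = nullptr;
    cudaError_t err = cudaMallocManaged(&ptr, one_gb);
    if (err != cudaSuccess) {
      std::printf("allocation %d failed: %s\n", i, cudaGetErrorString(err));
      return 1;
    }
    // Fill the allocation with real data; UVM migrates pages between host
    // and device as needed, so this keeps "succeeding" far past the
    // physical GPU memory size.
    std::memset(ptr, 0xab, one_gb);
  }
  std::printf("allocated and filled 100 x 1GB of managed memory, no error\n");
  return 0;
}
```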

@ibaned
Contributor

ibaned commented Sep 13, 2019

I unfortunately still don't have an answer to @bartlettroscoe's original question, since I can't trigger any kind of error with horribly oversubscribed UVM memory. Let's stop using UVM.

@ibaned ibaned self-assigned this Sep 13, 2019
@ibaned
Contributor

ibaned commented Sep 13, 2019

@crtrott says oversubscription shouldn't be able to cause illegal address accesses.

@ibaned ibaned added the Question For Kokkos internal and external contributors and users label Sep 13, 2019
@ibaned
Contributor

ibaned commented Sep 13, 2019

Given that Trilinos doesn't seem to be moving any closer to UVM-free execution, I think the right short-term move (for many reasons besides this one) is to move to ctest -j1 for testing.

@bartlettroscoe
Contributor Author

Given that Trilinos doesn't seem to be moving any closer to UVM-free execution, I think the right short-term move (for many reasons besides this one) is to move to ctest -j1 for testing.

@ibaned, okay, to be clear, because of UVM usage, we could be running out of memory on the GPU and that could trigger strange errors like these "illegal memory access was encountered" errors?

@mhoemmen
Contributor

mhoemmen commented Sep 13, 2019

... because of UVM usage, we could be running out of memory on the GPU and that could trigger strange errors like these "illegal memory access was encountered" errors?

If only one process is using the GPU, it looks like CUDA will swap UVM pages in and out of the GPU as they are used, so there shouldn't be any strange errors. Also, if we're not running with the MPS server, then only one process can run on the GPU at a time.

@bartlettroscoe
Contributor Author

If only one process is using the GPU, it looks like CUDA will swap UVM pages in and out of the GPU as they are used, so there shouldn't be any strange errors.

Even with ctest -j1, there are many tests with 4 MPI ranks, and therefore each of those ranks may be running a kernel on the GPU at the same time.

Also, if we're not running with the MPS server, then only one process can run on the GPU at a time.

I have asked Kitware staff member Kyle Edwards to look into the MPS server and try it out. See:

(If you don't have access, just ask me.)

@crtrott
Member

crtrott commented Sep 14, 2019

The MPS server makes it so kernels from different processes can run concurrently, otherwise the GPU is essentially timesliced between processes. Illegal memory access should generally not be coming from oversubscribing memory. But who knows. Typically illegal memory accesses are stuff like accessing static variables, constexpr thingies in some cases, dereferencing something pointing to host stack variables, accessing out of bounds shared memory, or accessing out of bounds device stack variables.
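
For example, a contrived sketch like this one, which passes a host stack address into a kernel, will typically report cudaErrorIllegalAddress at the next synchronize regardless of how much device memory is free (assuming a system without host/device address translation):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void deref(int* p) {
  // p points at host stack memory, which the device cannot access,
  // so this dereference is an illegal memory access.
  *p = 42;
}

int main() {
  int host_value = 0;
  deref<<<1, 1>>>(&host_value);  // pass a host stack address to the kernel
  cudaError_t err = cudaDeviceSynchronize();
  std::printf("cudaDeviceSynchronize returned: %s\n", cudaGetErrorString(err));
  return 0;
}
```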

@jjellio
Contributor

jjellio commented Sep 15, 2019

Is CUDA_LAUNCH_BLOCKING=1 being set?

@bartlettroscoe
Contributor Author

@jjellio asked:

Is CUDA_LAUNCH_BLOCKING=1 being set?

Yes. See:

https://github.com/trilinos/Trilinos/blob/e4fd7f9289c3ba3a1e8fd6bada0a76546534ce5e/cmake/std/atdm/ride/environment.sh#L191

I believe this was needed to allow some Tpetra tests to run at the same time as other tests.

@bartlettroscoe
Contributor Author

@crtrott said:

The MPS server makes it so kernels from different processes can run concurrently, otherwise the GPU is essentially timesliced between processes.

Some recent experiments by Kitware staff member @KyleFromKitware show a good speedup when using the MPS server on 'waterman'. For those with access from SNL, see:

It is not clear if the MPS server will cause the kernels to spread out over the two GPUs on 'waterman', for example. Should we expect the MPS server to spread work out onto multiple GPUs automatically, or does that need to be done manually, with the CTest/Kokkos allocation work being done in trilinos/Trilinos#5598?

@crtrott
Member

crtrott commented Sep 25, 2019

I am not quite sure, actually; there is something funky around CUDA_VISIBLE_DEVICES.

@bartlettroscoe
Contributor Author

@crtrott said:

Should we expect the MPS server to spread work out onto multiple GPUs automatically, or does that need to be done manually, with the CTest/Kokkos allocation work being done in trilinos/Trilinos#5598?

I am not quite sure, actually; there is something funky around CUDA_VISIBLE_DEVICES.

We can research this more as part of the FY20 Kitware contract on this.

@ajpowelsnl
Contributor

Hi @ibaned and @bartlettroscoe -- Has this issue been resolved? If so, may I close it? If not, would you please detail what else needs to be done?

@ibaned
Contributor

ibaned commented Nov 30, 2022

I think the Trilinos team now runs one test at a time for CUDA builds, which avoids this issue.

@bartlettroscoe
Contributor Author

I think the Trilinos team now runs one test at a time for CUDA builds, which avoids this issue.

Correct, see trilinos/Trilinos#6840.

Therefore, we would not be seeing this in the regular automated builds of Trilinos (and the ATDM Trilinos builds). But regular users may be seeing this when they have not set things up correctly to run on a GPU system.
