
Does "an illegal memory access was encountered" on CUDA ever occur due to overloading the GPU memory? #2330

Closed
bartlettroscoe opened this issue Sep 12, 2019 · 23 comments

@bartlettroscoe
Contributor

Hello Kokkos developers,

Is it ever the case that overloading the GPU memory can trigger errors that look like:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered <file-name>:<line-number>

There are several Trilinos tests that randomly fail in the ATDM Trilinos CUDA builds, as shown in this query.

We have seen this "illegal memory access was encountered" error in several ATDM Trilinos issues including trilinos/Trilinos#5179, trilinos/Trilinos#5002, trilinos/Trilinos#4551, trilinos/Trilinos#4123, trilinos/Trilinos#3543, trilinos/Trilinos#3438, and trilinos/Trilinos#2827. In the case of trilinos/Trilinos#3542, we know this error was caused by code that is not designed to run with CUDA on the GPU, but in other cases it was caused by bugs in the code.

It seems like we have seen these out-of-bounds errors when overloading the GPU memory.

So can we expect to see "illegal memory access was encountered" errors when running out of GPU memory? Is there some better way to detect when the GPU memory has been exhausted and provide a better error message?

The reason this is important is that some developers have reported seeing errors that make them think the GPU is being overloaded, and they then simply discount and ignore the ATDM Trilinos GitHub issues that are created. Therefore, we would like a reliable way to detect when GPU memory might be getting exhausted so that we can adjust the parallel test execution level with ctest. (Otherwise, we are going to need to switch to ctest -j1 on all of the ATDM Trilinos CUDA builds and just suffer the wasted CPU/GPU wall-clock time that results.)

@DavidPoliakoff
Contributor

@bartlettroscoe: a little clarification: does ctest -j1 fix the problem because the problem arises from multiple processes using the GPU? It's fairly easy to fix if one process uses too much GPU memory; if it's a contention-between-processes issue, things get a little more interesting. I couldn't tell one way or the other from skimming a couple of those bugs.

@bartlettroscoe
Contributor Author

@DavidPoliakoff,

When you run with ctest -j1, ctest will run all of the defined tests, one at a time. Many tests use multiple MPI processes, so a single test can invoke multiple kernels from different processes on the same GPU. For example, many Trilinos tests use 4 MPI ranks per test, so even with ctest -j1 you are still getting 4 MPI ranks launching kernels on the same GPU at the same time. When running with, say, ctest -j8 (which we commonly run with on many GPU machines), it can run as many as 8 MPI ranks at a time (e.g., two tests with 4 MPI ranks per test running at the same time).

I guess I could purposefully overload the GPUs by running with a high ctest -j<N> and then look at the errors that were produced but I thought that Kokkos developers would already have an FAQ or something that mentions this. (And again, if Kokkos could provide a robust way to detect OOM conditions on the GPU and report them as such, that would be a great help.)

FYI: Kitware is working on trying to improve our running of multiple tests with multiple MPI ranks at the same time on GPU machines. You can see their current effort in PR trilinos/Trilinos#5598.

@ibaned
Contributor

ibaned commented Sep 13, 2019

@bartlettroscoe to answer the original question, I think the answer is no. All the calls to cudaMalloc and friends in Kokkos should have their return codes checked, and running out of GPU memory should result in a non-success return code from one of these functions. The failure point should be there, instead of accessing a bad address somewhere down the line. However, I don't regularly test what the behavior is when overallocating, so I can't say this for sure.
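
Roughly, the check in question amounts to this pattern (a minimal sketch against the raw CUDA runtime API, not the actual Kokkos error-handling code):

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <stdexcept>
#include <string>

// Allocate device memory and turn a failed cudaMalloc into an exception,
// instead of letting a bad pointer flow into a later kernel launch.
void* checked_device_alloc(std::size_t bytes) {
  void* ptr = nullptr;
  cudaError_t err = cudaMalloc(&ptr, bytes);
  if (err != cudaSuccess) {
    // Running out of device memory shows up here as cudaErrorMemoryAllocation,
    // not as an "illegal memory access" further down the line.
    throw std::runtime_error(std::string("cudaMalloc failed: ") +
                             cudaGetErrorString(err));
  }
  return ptr;
}
```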

@bartlettroscoe
Contributor Author

bartlettroscoe commented Sep 13, 2019

All the calls to cudaMalloc and friends in Kokkos should have their return codes checked, and running out of GPU memory should result in a non-success return code from one of these functions.

@ibaned, are the calls to cudaMalloc() fairly isolated so we can test them when we overload memory? Could one put in a mock cudaMalloc() with which one could trigger an out-of-memory error and then verify that Kokkos catches and responds to it correctly? A very thin (inline) kokkosCudaMalloc() function that all of Kokkos calls (and that, in unit-test mode, simulates out-of-memory conditions) should do.
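
Something along these lines is what I have in mind; kokkosCudaMalloc and the test hook below are purely hypothetical names for illustration, not anything that exists in Kokkos today:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical test hook: when set to true, the wrapper pretends the device
// is out of memory so the error-handling path can be exercised in a unit test.
static bool g_simulate_device_oom = false;

inline cudaError_t kokkosCudaMalloc(void** ptr, std::size_t bytes) {
  if (g_simulate_device_oom) {
    *ptr = nullptr;
    return cudaErrorMemoryAllocation;  // mimic the real out-of-memory code
  }
  return cudaMalloc(ptr, bytes);       // normal path
}
```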

I don't want developers to continue to use the excuse that they think a test is failing due to overloading the GPU (because we are using too high of a ctest -j<N>) and then ignore Trilinos GitHub issues. I know most software is not unit tested for OOM conditions but this seems like a major problem with running test suites on GPUs.

@ibaned
Contributor

ibaned commented Sep 13, 2019

@bartlettroscoe we can start with a pure CUDA program to confirm that overloading causes a failure, after that I'm not sure how necessary a wrapper would be. A simple grep search shows 6 places where cudaMalloc is called and the only one not checked is allocating a single integer once at the beginning of the program. If we know out of memory returns a code and we check all of these, I think that's proof enough that it's not the issue.
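
Something like this should be enough to see what return code comes back (a rough sketch; the 100 GiB request is just an arbitrary size larger than any current device's memory):

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

int main() {
  void* ptr = nullptr;
  // Request far more than the GPU has, to force the out-of-memory path.
  const std::size_t huge = std::size_t(100) << 30;  // 100 GiB
  cudaError_t err = cudaMalloc(&ptr, huge);
  std::printf("cudaMalloc(100 GiB) returned %d: %s\n",
              static_cast<int>(err), cudaGetErrorString(err));
  return 0;
}
```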

@mhoemmen
Contributor

I know most software is not unit tested for OOM conditions but this seems like a major problem with running test suites on GPUs.

FYI, UVM allocations can overcommit, just like CPU allocations (given Linux overcommit).

@ibaned
Contributor

ibaned commented Sep 13, 2019

Confirmed: trying to allocate 100GB with cudaMalloc gives a nice return code; however, cudaMallocManaged (which underpins all the UVM allocations that Trilinos requires) gives no such error code. Thus, as long as we are using UVM, it is actually possible that oversubscribing memory leads to way-down-the-line segfaults instead of nice error messages. This seems like another extremely compelling reason to put all Tpetra/MueLu resources into eliminating UVM completely.

@ibaned
Contributor

ibaned commented Sep 13, 2019

Worse still, I just ran a program that makes 100 1GB allocations and then fills them all with actual data, and that still didn't run into any error conditions! I assume the GPU is "conveniently" swapping pages out to CPU memory in order to make this happen. Maybe in @mhoemmen's article the applications appreciate this behavior, but I think for HPC this much unchecked magic is not worth the trouble it introduces.
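
The test was roughly of this shape (a sketch, not the exact program; touching the pages from the host here is just one way to force them to be populated):

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
  const std::size_t one_gb = std::size_t(1) << 30;
  for (int i = 0; i < 100; ++i) {
    void* ptr = nullptr;
    cudaError_t err = cudaMallocManaged(&ptr, one_gb);
    if (err != cudaSuccess) {
      std::printf("allocation %d failed: %s\n", i, cudaGetErrorString(err));
      return 1;
    }
    // Fill the allocation with real data; UVM migrates pages between host
    // and device as needed, so this keeps "succeeding" far past the
    // physical GPU memory size.
    std::memset(ptr, 0xab, one_gb);
  }
  std::printf("allocated and filled 100 x 1GB of managed memory, no error\n");
  return 0;
}
```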

@ibaned
Contributor

ibaned commented Sep 13, 2019

I unfortunately still don't have an answer to @bartlettroscoe's original question, since I can't trigger any kind of error with horribly oversubscribed UVM memory. Let's stop using UVM.

@ibaned ibaned self-assigned this Sep 13, 2019
@ibaned
Contributor

ibaned commented Sep 13, 2019

@crtrott says oversubscription shouldn't be able to cause illegal address accesses.

@ibaned ibaned added the Question For Kokkos internal and external contributors and users label Sep 13, 2019
@ibaned
Contributor

ibaned commented Sep 13, 2019

Given that Trilinos doesn't seem to be moving any closer to UVM-free execution, I think the right short-term move (for many reasons besides this one) is to move to ctest -j1 for testing.

@bartlettroscoe
Contributor Author

Given that Trilinos doesn't seem to be moving any closer to UVM-free execution, I think the right short-term move (for many reasons besides this one) is to move to ctest -j1 for testing.

@ibaned, okay, to be clear, because of UVM usage, we could be running out of memory on the GPU and that could trigger strange errors like these "illegal memory access was encountered" errors?

@mhoemmen
Contributor

mhoemmen commented Sep 13, 2019

... because of UVM usage, we could be running out of memory on the GPU and that could trigger strange errors like these "illegal memory access was encountered" errors?

If only one process is using the GPU, it looks like CUDA will swap UVM pages in and out of the GPU as they are used, so there shouldn't be any strange errors. Also, if we're not running with the MPS server, then only one process can run on the GPU at a time.

@bartlettroscoe
Contributor Author

If only one process is using the GPU, it looks like CUDA will swap UVM pages in and out of the GPU as they are used, so there shouldn't be any strange errors.

Even with ctest -j1, there are many tests with 4 MPI ranks, and therefore each of those ranks may be running a kernel on the GPU at the same time.

Also, if we're not running with the MPS server, then only one process can run on the GPU at a time.

I have asked Kitware staff member Kyle Edwards to look into the MPS server and try it out. See:

(If you don't have access, just ask me.)

@crtrott
Member

crtrott commented Sep 14, 2019

The MPS server makes it so kernels from different processes can run concurrently, otherwise the GPU is essentially timesliced between processes. Illegal memory access should generally not be coming from oversubscribing memory. But who knows. Typically illegal memory accesses are stuff like accessing static variables, constexpr thingies in some cases, dereferencing something pointing to host stack variables, accessing out of bounds shared memory, or accessing out of bounds device stack variables.
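
For example, a contrived sketch like this one, which passes a host stack address into a kernel, will typically report cudaErrorIllegalAddress at the next synchronize regardless of how much device memory is free (assuming a system without host/device address translation):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void deref(int* p) {
  // p points at host stack memory, which the device cannot access,
  // so this dereference is an illegal memory access.
  *p = 42;
}

int main() {
  int host_value = 0;
  deref<<<1, 1>>>(&host_value);  // pass a host stack address to the kernel
  cudaError_t err = cudaDeviceSynchronize();
  std::printf("cudaDeviceSynchronize returned: %s\n", cudaGetErrorString(err));
  return 0;
}
```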

@jjellio
Contributor

jjellio commented Sep 15, 2019

Is CUDA_LAUNCH_BLOCKING=1 being set?

@bartlettroscoe
Contributor Author

@jjellio asked:

Is CUDA_LAUNCH_BLOCKING=1 being set?

Yes. See:

https://github.com/trilinos/Trilinos/blob/e4fd7f9289c3ba3a1e8fd6bada0a76546534ce5e/cmake/std/atdm/ride/environment.sh#L191

I believe this was needed to allow some Tpetra tests to run at the same time as other tests.

@bartlettroscoe
Contributor Author

@crtrott said:

The MPS server makes it so kernels from different processes can run concurrently, otherwise the GPU is essentially timesliced between processes.

Some recent experiments by Kitware staff member @KyleFromKitware show a good speedup when using the MPS server on 'waterman'. For those with access from SNL, see:

It is not clear if the MPS server will cause the kernels to spread out over the two GPUs on 'waterman', for example. Should we expect the MPS server to spread work out onto multiple GPUs automatically, or does that need to be done manually, with the CTest/Kokkos allocation work being done in trilinos/Trilinos#5598?

@crtrott
Member

crtrott commented Sep 25, 2019

I am not quite sure, actually; there is something funky around CUDA_VISIBLE_DEVICES.

@bartlettroscoe
Contributor Author

@crtrott said:

Should we expect the MPS server to spread work out onto multiple GPUs automatically, or does that need to be done manually, with the CTest/Kokkos allocation work being done in trilinos/Trilinos#5598?

I am not quite sure, actually; there is something funky around CUDA_VISIBLE_DEVICES.

We can research this more as part of the FY20 Kitware contract on this.

@ajpowelsnl
Contributor

Hi @ibaned and @bartlettroscoe -- Has this issue been resolved? If so, may I close it? If not, would you please detail what else needs to be done?

@ibaned
Contributor

ibaned commented Nov 30, 2022

I think the Trilinos team now runs one test at a time for CUDA builds, which avoids this issue.

@bartlettroscoe
Contributor Author

I think the Trilinos team now runs one test at a time for CUDA builds, which avoids this issue.

Correct, see trilinos/Trilinos#6840.

Therefore, we would not be seeing this in the regular automated builds of Trilinos (and the ATDM Trilinos builds). But regular users may be seeing this when they have not set things up correctly to run on a GPU system.
