
Allocate host memory through cudaMallocHost #2398

Open: wants to merge 8 commits into master from use_cuda_malloc_host
Conversation

cypof
Member

@cypof cypof commented Apr 30, 2015

It doesn't seem to make a performance difference but it is the recommended practice. It also seemed to help when we were debugging stability issues on the parallel branch. @thatguymike could comment.

@flx42
Contributor

flx42 commented Apr 30, 2015

It's unclear to me how you handle the problem described in the comment you're deleting at the beginning of the file.

@cypof
Member Author

cypof commented Apr 30, 2015

It's only called in GPU mode, otherwise it falls back to malloc.
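That fallback can be sketched as follows. This is an illustrative stand-in, not the patch itself: `Mode` and `caffe_malloc_host_sketch` are hypothetical names, and plain `malloc` stands in for the pinned allocator so the control flow can run on a machine with no GPU (the real code checks `Caffe::mode()` and calls `CUDA_CHECK(cudaMallocHost(ptr, size))`).

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-in for Caffe::mode(); only the branch structure mirrors the PR.
enum Mode { CPU, GPU };

// In GPU mode, use the pinned allocator; otherwise fall back to ordinary malloc.
void* caffe_malloc_host_sketch(std::size_t size, Mode mode, bool* used_pinned) {
  if (mode == GPU) {
    *used_pinned = true;        // real code: CUDA_CHECK(cudaMallocHost(ptr, size));
    return std::malloc(size);   // malloc stands in for cudaMallocHost here
  }
  *used_pinned = false;         // CPU mode: ordinary pageable allocation
  return std::malloc(size);
}
```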

@cypof
Member Author

cypof commented Apr 30, 2015

I can probably remove the ptr check on cudaSuccess.

@longjon
Contributor

longjon commented Apr 30, 2015

The concern here (which I think is what @flx42 had in mind above) is: does this still work if you do not build with CPU_ONLY set, but run on a machine with no GPU physically present (in CPU mode)? That's an odd case but not one I think we should break. (I'm not sure this is the same, but maybe try with CUDA_VISIBLE_DEVICES set to the empty string).

@cypof
Member Author

cypof commented Apr 30, 2015

Yes, it's fine; that's what the build system already relies on. Travis machines have no GPU, but as long as you don't call CUDA code, it works.

@flx42
Contributor

flx42 commented Apr 30, 2015

Yes that's what I had in mind, thanks for the clarification!

@cypof cypof force-pushed the use_cuda_malloc_host branch from 7636ffe to 3f4b52e on April 30, 2015 02:47
@longjon
Contributor

longjon commented Apr 30, 2015

Ah, I see, I forgot the GPU mode check. That ought to work fine then, although it creates some subtleties: currently you're allowed to construct a net in CPU mode, then switch to GPU mode, and you'll be running on the GPU, but you won't have pinned memory.

Mode is, however, due for an update, so maybe that won't be an issue soon...

#ifndef CPU_ONLY
  if (Caffe::mode() == Caffe::GPU) {
    CUDA_CHECK(cudaMallocHost(ptr, size));
    return;


@cypof: can you explain the need to allocate page-locked memory? Does this memory have to be directly accessible by the device?

Contributor


With multiple GPUs in the system and larger models, you need the host-side buffers to be pinned for DMA transfers, or you will find your machine "swap locking" while it tries to find enough contiguous memory to pin. We hit this originally in the parallel branch, and it was a beast to track down why machines were apparently hanging in the kernel. Large contiguous buffers, combined with LMDB's aggressive memory use, made things interesting.


@thatguymike Thanks for the explanation.
@cypof: could you please include the explanation that the memory needs to be pinned for the copy engine/DMA in the source, for clarity?

  • Does the above change have a significant impact on single-GPU mode? We are currently using an AWS g2.2 instance for training. Thanks

Contributor


@vimalthilak
This only changes how host memory is allocated. I think the only difference would arise if you allocate large amounts of host memory in your program, close to the RAM capacity: if you have 16 GB of RAM, malloc(3) can still hand a single program 16 GB thanks to overcommit and swap, but that won't be possible with cudaMallocHost.

But I think this situation is unlikely to occur in Caffe, especially since the change only applies in GPU mode, where the CPU memory requirement should not be excessive. I don't see how this could impact single-GPU mode.

Contributor


There is a small performance benefit to having the memory pre-pinned for data transfers, even in single-GPU mode. With large numbers of GPUs and large buffers, it gets tough on the OS to successfully pin buffers dynamically.

In CPU mode, there is some unneeded overhead in allocating the data as pinned: pinning the memory puts pressure on other parts of the system under high memory contention.


@thatguymike /Mike: Thanks

Understood about the benefits of pinning memory even on a single-GPU system. My assumption was, and is, that as long as there is sufficient system RAM, overall performance should be okay. I will do a test run when I get a chance, unless someone else has already done so. Thanks once again.

As noted earlier, Mike's comments should be recorded as part of this PR.

Member Author


Thanks Mike. Yes I will add this to the code.

cypof added 8 commits May 18, 2015 17:24
- Interrupt the thread before waiting on join
- Provide a method for looping threads to exit on demand
- CHECK if start and stop succeed instead of returning an error
- Makes sure each solver accesses a different subset of the data
- Sequential reading of DB for performance
- Prefetches a configurable amount of data to host memory
- Distributes data to solvers in round-robin way for determinism
@cypof cypof force-pushed the use_cuda_malloc_host branch from 3f4b52e to 2de355a on May 19, 2015 03:07
@cypof cypof mentioned this pull request May 19, 2015
@flx42
Contributor

flx42 commented May 21, 2015

This causes some tests to fail (make runtest), as reported in #2114 by @lukeyeager.

For instance:

$ build/test/test_all.testbin --gtest_filter='MathFunctionsTest/1.TestCopyGPU'
[...]
F0520 18:12:53.099692 23912 syncedmem.cpp:28] Check failed: error == cudaSuccess (11 vs. 0)  invalid argument
[backtrace]

In this patch, Caffe::mode() is used to decide whether memory should be allocated with malloc or cudaMallocHost, and whether it should be freed with free or cudaFreeHost. The issue is that if the mode changes during execution, we get a mismatch: for instance, a pointer allocated with malloc that we then try to free with cudaFreeHost. This is what causes the error above.
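The mismatch can be illustrated with a tiny simulation; nothing here is Caffe code, and the names are hypothetical. The point is only that when both the allocation and the free consult the current mode, switching the mode between the two calls pairs the wrong functions:

```cpp
#include <cassert>

// 0 = malloc/free pair, 1 = cudaMallocHost/cudaFreeHost pair.
enum Mode { CPU, GPU };

inline int allocator_for(Mode m) { return m == GPU ? 1 : 0; }

// True when the allocator chosen at alloc time matches the one chosen at
// free time; a mode switch in between makes them disagree.
inline bool paired_correctly(Mode mode_at_alloc, Mode mode_at_free) {
  return allocator_for(mode_at_alloc) == allocator_for(mode_at_free);
}
```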

This failing test is changing the mode during execution:
https://github.com/BVLC/caffe/blob/master/src/caffe/test/test_math_functions.cpp#L223
That's actually violating the documentation of set_mode:
https://github.com/BVLC/caffe/blob/master/include/caffe/common.hpp#L140-L143
This comment documents exactly the problem we have here.

By setting a breakpoint in caffe::CaffeMallocHost, I can see that some data was initially allocated in the SetUp function of this test set:

#0  0x00007ffff23c3310 in caffe::CaffeMallocHost(void**, unsigned long)@plt () from /home/felix/git/caffe/.build_debug/test/../lib/libcaffe.so
#1  0x00007ffff25514c8 in caffe::SyncedMemory::to_cpu (this=0x53b66c0) at src/caffe/syncedmem.cpp:51
#2  0x00007ffff25512c2 in caffe::SyncedMemory::mutable_cpu_data (this=0x53b66c0) at src/caffe/syncedmem.cpp:128
#3  0x00007ffff25498c9 in caffe::Blob<double>::mutable_cpu_data (this=0xd1bc20) at src/caffe/blob.cpp:103
#4  0x000000000048f731 in caffe::GaussianFiller<double>::Fill (this=0x7fffffffd920, blob=0xd1bc20) at ./include/caffe/filler.hpp:71
#5  0x000000000059447e in caffe::MathFunctionsTest<double>::SetUp (this=0xd1b9e0) at src/caffe/test/test_math_functions.cpp:33

The corresponding line:
https://github.com/BVLC/caffe/blob/master/src/caffe/test/test_math_functions.cpp#L33

The obvious solution here is to add a flag that remembers how the memory was allocated, but then CaffeMallocHost and CaffeFreeHost won't be free-standing functions anymore, since they will require this extra information.
But first, we need to fix this inconsistency in the code and finally decide whether calling set_mode halfway through execution should be valid or undefined behavior.
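A minimal sketch of that flag-based fix, under stated assumptions: `HostBuffer`, `alloc_host`, and `free_host` are illustrative names rather than the eventual Caffe API, and plain malloc/free stand in for the CUDA calls noted in the comments. The key idea is that the free dispatches on a kind recorded at allocation time, never on the current mode:

```cpp
#include <cassert>
#include <cstdlib>

// Which allocator produced this pointer.
enum class HostAlloc { Malloc, CudaMallocHost };

struct HostBuffer {
  void* ptr;
  HostAlloc kind;  // remembered once; a later mode switch cannot cause a mismatch
};

HostBuffer alloc_host(std::size_t size, bool gpu_mode) {
  if (gpu_mode) {
    // real code: CUDA_CHECK(cudaMallocHost(&p, size));
    return HostBuffer{std::malloc(size), HostAlloc::CudaMallocHost};
  }
  return HostBuffer{std::malloc(size), HostAlloc::Malloc};
}

void free_host(const HostBuffer& buf) {
  // Dispatch on the recorded kind, NOT on the current Caffe::mode().
  if (buf.kind == HostAlloc::CudaMallocHost) {
    std::free(buf.ptr);  // real code: CUDA_CHECK(cudaFreeHost(buf.ptr));
  } else {
    std::free(buf.ptr);
  }
}
```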

@flx42
Contributor

flx42 commented May 22, 2015

I started working on a fix for the illegal Caffe::set_mode calls that were causing the tests to fail:
https://github.com/flx42/caffe/commits/fix_illegal_mode_changes
I would like your opinion on the changes, for instance whether you approve of how I split the CPU and GPU tests.

However I'm still seeing some random failures like this one:

[ RUN      ] MultinomialLogisticLossLayerTest/0.TestGradientCPU
*** Error in `.build_debug/test/test_all.testbin': free(): invalid pointer: 0x0000000203600000 ***

These failures are non-deterministic, I suspect the change of allocation function simply revealed a bug that was already there. I will investigate more.

@flx42
Contributor

flx42 commented May 23, 2015

So, it was actually the reciprocal problem: memory allocated by cudaMallocHost but released using free. It was harder to narrow down because it was non-deterministic. The mode is not reset between tests, and tests are shuffled; in some runs a previous test had set the mode to GPU, so the constructor of MultinomialLogisticLossLayerTest allocated memory using cudaMallocHost, then one of the tests set the mode to CPU, and at cleanup the memory was released using free.
I expected this couldn't fail because CPU is the default mode, but that's not true, since the mode is not reset between tests.

An even nastier side effect of this bug is that some tests are possibly very wrong. For instance AccuracyLayerTest/TestForwardCPUTopK does not call set_mode() and thus could very well run on the GPU if the previous test modified the mode.

My suggestion is to remove ALL occurrences of set_mode() from the individual tests; it is too dangerous. That would go one step further than my branch above. I'm working on it right now: my goal is to do something like the existing MultiDeviceTest class and have tests derive from CPUDeviceTest if they only run on the CPU, GPUDeviceTest if they only run on the GPU, and MultiDeviceTest if we need to share code between CPU and GPU tests (like MathFunctionsTest).
It should be done soon; expect all the tests to be changed.
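A rough sketch of that fixture layout, with trivial stand-ins for googletest's fixtures and for Caffe's global mode (all names here are illustrative): the device is fixed once in SetUp() by the fixture a test derives from, so individual test bodies never call set_mode() themselves.

```cpp
#include <cassert>

// Stand-in for Caffe's global mode state.
enum Mode { CPU, GPU };
inline Mode& current_mode() { static Mode m = CPU; return m; }

// CPU-only tests derive from this; the mode is set by the fixture, not the body.
struct CPUDeviceTest {
  void SetUp() { current_mode() = CPU; }
};

// GPU-only tests derive from this.
struct GPUDeviceTest {
  void SetUp() { current_mode() = GPU; }
};
```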

Once again, feedback is welcome.
