
Allocate host memory through cudaMallocHost #2398

Open: wants to merge 8 commits into master from use_cuda_malloc_host
Conversation

cypof
Member

@cypof cypof commented Apr 30, 2015

It doesn't seem to make a performance difference but it is the recommended practice. It also seemed to help when we were debugging stability issues on the parallel branch. @thatguymike could comment.

@flx42
Contributor

flx42 commented Apr 30, 2015

It's unclear to me how you handle the problem described in the comment you're deleting at the beginning of the file.

@cypof
Member Author

cypof commented Apr 30, 2015

It's only called in GPU mode, otherwise it falls back to malloc.
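That fallback can be sketched as follows. This is an illustrative stand-in, not the patch itself: `Mode` and `caffe_malloc_host_sketch` are hypothetical names, and plain `malloc` stands in for the pinned allocator so the control flow can run on a machine with no GPU (the real code checks `Caffe::mode()` and calls `CUDA_CHECK(cudaMallocHost(ptr, size))`).

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-in for Caffe::mode(); only the branch structure mirrors the PR.
enum Mode { CPU, GPU };

// In GPU mode, use the pinned allocator; otherwise fall back to ordinary malloc.
void* caffe_malloc_host_sketch(std::size_t size, Mode mode, bool* used_pinned) {
  if (mode == GPU) {
    *used_pinned = true;        // real code: CUDA_CHECK(cudaMallocHost(ptr, size));
    return std::malloc(size);   // malloc stands in for cudaMallocHost here
  }
  *used_pinned = false;         // CPU mode: ordinary pageable allocation
  return std::malloc(size);
}
```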

@cypof
Member Author

cypof commented Apr 30, 2015

I can probably remove the ptr check on cudaSuccess.

@longjon
Contributor

longjon commented Apr 30, 2015

The concern here (which I think is what @flx42 had in mind above) is: does this still work if you do not build with CPU_ONLY set, but run on a machine with no GPU physically present (in CPU mode)? That's an odd case but not one I think we should break. (I'm not sure this is the same, but maybe try with CUDA_VISIBLE_DEVICES set to the empty string).

@cypof
Member Author

cypof commented Apr 30, 2015

Yes, it's fine; that's what the build system already relies on. Travis machines have no GPU, but as long as you don't call CUDA code, it works.

@flx42
Contributor

flx42 commented Apr 30, 2015

Yes that's what I had in mind, thanks for the clarification!

@cypof cypof force-pushed the use_cuda_malloc_host branch from 7636ffe to 3f4b52e on April 30, 2015 02:47
@longjon
Contributor

longjon commented Apr 30, 2015

Ah, I see, I forgot the GPU mode check. That ought to work fine then, although it creates some subtleties: currently you're allowed to construct a net in CPU mode, then switch to GPU mode, and you'll be running on the GPU, but you won't have pinned memory.

Mode is, however, due for an update, so maybe that won't be an issue soon...

#ifndef CPU_ONLY
  if (Caffe::mode() == Caffe::GPU) {
    CUDA_CHECK(cudaMallocHost(ptr, size));
    return;


@cypof: can you explain the need to allocate page-locked memory? Does this memory have to be directly accessible by the device?

Contributor


With multiple GPUs in the system and larger models, you need the host-side buffers to be pinned for DMA transfers, or you will find your machine "swap locking" while it tries to find enough contiguous memory to pin. We hit this originally in the parallel branch, and it was a beast to track down why machines were apparently hanging in the kernel. Large contiguous buffers, combined with LMDB's aggressive memory use, made things interesting.


@thatguymike Thanks for the explanation.
@cypof: could you please include the explanation that the memory needs to be pinned for the copy engine/DMA in the source, for clarity?

  • Does the above change have a significant impact on single-GPU mode? We are currently using an AWS g2.2 instance for training. Thanks

Contributor


@vimalthilak
This only changes how host memory is allocated. I think the only difference would arise if you allocate large amounts of host memory in your program, close to the RAM capacity: if you have 16 GB of RAM, malloc(3) can still hand a single program 16 GB thanks to overcommit and swap, but that won't be possible with cudaMallocHost.

But I think this situation is unlikely to occur in Caffe, especially since the change only applies in GPU mode, where the CPU memory requirement should not be excessive. I don't see how this could impact single-GPU mode.

Contributor


There is a small performance benefit to having the memory pre-pinned for data transfers, even in single-GPU mode. With large numbers of GPUs and large buffers, it gets tough on the OS to successfully pin buffers dynamically.

In CPU mode, there is some unneeded overhead in allocating the data as pinned: pinning the memory puts pressure on other parts of the system under high memory contention.


@thatguymike /Mike: Thanks

Understood about the benefits of pinning memory even on a single-GPU system. My assumption was, and is, that as long as there is sufficient system RAM, overall performance should be okay. I will do a test run when I get a chance, unless someone else has already done so. Thanks once again.

As noted earlier, Mike's comments should be recorded as part of this PR.

Member Author


Thanks Mike. Yes I will add this to the code.

cypof added 8 commits May 18, 2015 17:24
- Interrupt the thread before waiting on join
- Provide a method for looping threads to exit on demand
- CHECK if start and stop succeed instead of returning an error
- Makes sure each solver accesses a different subset of the data
- Sequential reading of DB for performance
- Prefetches a configurable amount of data to host memory
- Distributes data to solvers in round-robin way for determinism
@cypof cypof force-pushed the use_cuda_malloc_host branch from 3f4b52e to 2de355a on May 19, 2015 03:07
@cypof cypof mentioned this pull request May 19, 2015
@flx42
Contributor

flx42 commented May 21, 2015

This causes some tests to fail (make runtest), as reported in #2114 by @lukeyeager.

For instance:

$ build/test/test_all.testbin --gtest_filter='MathFunctionsTest/1.TestCopyGPU'
[...]
F0520 18:12:53.099692 23912 syncedmem.cpp:28] Check failed: error == cudaSuccess (11 vs. 0)  invalid argument
[backtrace]

In this patch, Caffe::mode() is used to decide whether memory should be allocated with malloc or cudaMallocHost, and whether it should be freed with free or cudaFreeHost. The issue is that if the mode changes during execution, we get a mismatch: for instance, a pointer allocated with malloc that we then try to free with cudaFreeHost. This is what causes the error above.
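The mismatch can be illustrated with a tiny simulation; nothing here is Caffe code, and the names are hypothetical. The point is only that when both the allocation and the free consult the current mode, switching the mode between the two calls pairs the wrong functions:

```cpp
#include <cassert>

// 0 = malloc/free pair, 1 = cudaMallocHost/cudaFreeHost pair.
enum Mode { CPU, GPU };

inline int allocator_for(Mode m) { return m == GPU ? 1 : 0; }

// True when the allocator chosen at alloc time matches the one chosen at
// free time; a mode switch in between makes them disagree.
inline bool paired_correctly(Mode mode_at_alloc, Mode mode_at_free) {
  return allocator_for(mode_at_alloc) == allocator_for(mode_at_free);
}
```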

This failing test is changing the mode during execution:
https://github.com/BVLC/caffe/blob/master/src/caffe/test/test_math_functions.cpp#L223
That's actually violating the documentation of set_mode:
https://github.com/BVLC/caffe/blob/master/include/caffe/common.hpp#L140-L143
This comment documents exactly the problem we have here.

By setting a breakpoint in caffe::CaffeMallocHost, I can see that some data was initially allocated in the SetUp function of this test set:

#0  0x00007ffff23c3310 in caffe::CaffeMallocHost(void**, unsigned long)@plt () from /home/felix/git/caffe/.build_debug/test/../lib/libcaffe.so
#1  0x00007ffff25514c8 in caffe::SyncedMemory::to_cpu (this=0x53b66c0) at src/caffe/syncedmem.cpp:51
#2  0x00007ffff25512c2 in caffe::SyncedMemory::mutable_cpu_data (this=0x53b66c0) at src/caffe/syncedmem.cpp:128
#3  0x00007ffff25498c9 in caffe::Blob<double>::mutable_cpu_data (this=0xd1bc20) at src/caffe/blob.cpp:103
#4  0x000000000048f731 in caffe::GaussianFiller<double>::Fill (this=0x7fffffffd920, blob=0xd1bc20) at ./include/caffe/filler.hpp:71
#5  0x000000000059447e in caffe::MathFunctionsTest<double>::SetUp (this=0xd1b9e0) at src/caffe/test/test_math_functions.cpp:33

The corresponding line:
https://github.com/BVLC/caffe/blob/master/src/caffe/test/test_math_functions.cpp#L33

The obvious solution here is to add a flag that remembers how the memory was allocated, but then CaffeMallocHost and CaffeFreeHost won't be free-standing functions anymore, since they will require this extra information.
But first, we need to fix this inconsistency in the code and finally decide whether calling set_mode halfway through execution should be valid or undefined behavior.
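A minimal sketch of that flag-based fix, under stated assumptions: `HostBuffer`, `alloc_host`, and `free_host` are illustrative names rather than the eventual Caffe API, and plain malloc/free stand in for the CUDA calls noted in the comments. The key idea is that the free dispatches on a kind recorded at allocation time, never on the current mode:

```cpp
#include <cassert>
#include <cstdlib>

// Which allocator produced this pointer.
enum class HostAlloc { Malloc, CudaMallocHost };

struct HostBuffer {
  void* ptr;
  HostAlloc kind;  // remembered once; a later mode switch cannot cause a mismatch
};

HostBuffer alloc_host(std::size_t size, bool gpu_mode) {
  if (gpu_mode) {
    // real code: CUDA_CHECK(cudaMallocHost(&p, size));
    return HostBuffer{std::malloc(size), HostAlloc::CudaMallocHost};
  }
  return HostBuffer{std::malloc(size), HostAlloc::Malloc};
}

void free_host(const HostBuffer& buf) {
  // Dispatch on the recorded kind, NOT on the current Caffe::mode().
  if (buf.kind == HostAlloc::CudaMallocHost) {
    std::free(buf.ptr);  // real code: CUDA_CHECK(cudaFreeHost(buf.ptr));
  } else {
    std::free(buf.ptr);
  }
}
```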

@flx42
Contributor

flx42 commented May 22, 2015

I started working on a fix for the illegal Caffe::set_mode calls that were causing the tests to fail:
https://github.com/flx42/caffe/commits/fix_illegal_mode_changes
I would like your opinion on the changes, for instance whether you approve of how I split the CPU and GPU tests.

However I'm still seeing some random failures like this one:

[ RUN      ] MultinomialLogisticLossLayerTest/0.TestGradientCPU
*** Error in `.build_debug/test/test_all.testbin': free(): invalid pointer: 0x0000000203600000 ***

These failures are non-deterministic, I suspect the change of allocation function simply revealed a bug that was already there. I will investigate more.

@flx42
Contributor

flx42 commented May 23, 2015

So, it was actually the reciprocal problem: memory allocated by cudaMallocHost but released using free. It was harder to narrow down because it was non-deterministic. The mode is not reset between tests, and tests are shuffled; in some runs a previous test had set the mode to GPU, so the constructor of MultinomialLogisticLossLayerTest allocated memory using cudaMallocHost, then one of the tests set the mode to CPU, and at cleanup the memory was released using free.
I expected this couldn't fail because CPU is the default mode, but that's not true, since the mode is not reset between tests.

An even nastier side effect of this bug is that some tests are possibly very wrong. For instance AccuracyLayerTest/TestForwardCPUTopK does not call set_mode() and thus could very well run on the GPU if the previous test modified the mode.

My suggestion is to remove ALL occurrences of set_mode() from the individual tests; it is too dangerous. That would go one step further than my branch above. I'm working on it right now: my goal is to do something like the existing MultiDeviceTest class and have tests derive from CPUDeviceTest if they only run on the CPU, GPUDeviceTest if they only run on the GPU, and MultiDeviceTest if we need to share code between CPU and GPU tests (like MathFunctionsTest).
It should be done soon; expect all the tests to be changed.
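A rough sketch of that fixture layout, with trivial stand-ins for googletest's fixtures and for Caffe's global mode (all names here are illustrative): the device is fixed once in SetUp() by the fixture a test derives from, so individual test bodies never call set_mode() themselves.

```cpp
#include <cassert>

// Stand-in for Caffe's global mode state.
enum Mode { CPU, GPU };
inline Mode& current_mode() { static Mode m = CPU; return m; }

// CPU-only tests derive from this; the mode is set by the fixture, not the body.
struct CPUDeviceTest {
  void SetUp() { current_mode() = CPU; }
};

// GPU-only tests derive from this.
struct GPUDeviceTest {
  void SetUp() { current_mode() = GPU; }
};
```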

Once again, feedback is welcome.
