Allocate host memory through cudaMallocHost #2398
base: master
Conversation
It seems unclear to me how you handle the problem described in the comment you're deleting at the beginning of the file.
It's only called in GPU mode; otherwise it falls back to malloc.
I can probably remove the ptr check on cudaSuccess.
The concern here (which I think is what @flx42 had in mind above) is: does this still work if you do not build with …
Yes, it's fine. That's what the build system is doing. Travis machines have no GPU, but as long as you don't call CUDA code it works.
Yes, that's what I had in mind, thanks for the clarification!
Force-pushed from 7636ffe to 3f4b52e.
Ah, I see, I forgot that. The mode is, however, due for an update, so maybe that won't be an issue soon...
#ifndef CPU_ONLY
  if (Caffe::mode() == Caffe::GPU) {
    CUDA_CHECK(cudaMallocHost(ptr, size));
    return;
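For context, a minimal sketch of the full allocate/free pair under this scheme, assuming the CPU path keeps plain malloc/free and that caffe/common.hpp provides Caffe, CUDA_CHECK and CHECK; this is a sketch of the idea, not necessarily the exact code in the PR.

// Sketch: pinned host allocation in GPU mode, plain malloc/free otherwise.
// Note that the free side re-checks the mode, which is exactly what the
// discussion further down identifies as fragile when tests switch modes.
inline void CaffeMallocHost(void** ptr, size_t size) {
#ifndef CPU_ONLY
  if (Caffe::mode() == Caffe::GPU) {
    CUDA_CHECK(cudaMallocHost(ptr, size));  // page-locked (pinned) memory
    return;
  }
#endif
  *ptr = malloc(size);
  CHECK(*ptr) << "host allocation of size " << size << " failed";
}

inline void CaffeFreeHost(void* ptr) {
#ifndef CPU_ONLY
  if (Caffe::mode() == Caffe::GPU) {
    CUDA_CHECK(cudaFreeHost(ptr));  // must pair with cudaMallocHost
    return;
  }
#endif
  free(ptr);
}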
@cypof : can you explain the need to allocate page locked memory? Does this memory have to be directly accessible by the device?
With multiple GPUs in the system and larger models, you need the host-side buffers to be pinned for transfer (DMA), or you will find your machine will "swap lock" trying to find enough contiguous memory to pin. We hit this originally in the parallel branch, and it was a beast to track down why machines were apparently hanging in the kernel. Large contiguous buffers, combined with LMDB's aggressive memory use, made things interesting.
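To make the DMA point concrete, here is a small illustrative CUDA sketch (not Caffe code; sizes are arbitrary): an asynchronous host-to-device copy can only DMA directly from the host buffer when that buffer is page-locked; with pageable memory the driver has to stage and pin memory behind the scenes, which is what stresses the OS when many large buffers are in flight.

// Illustrative only: async H2D copy from a pinned buffer.
#include <cuda_runtime.h>

int main() {
  const size_t bytes = 64ULL << 20;  // 64 MB, arbitrary example size
  float* h_pinned = nullptr;
  float* d_buf = nullptr;
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cudaMallocHost(reinterpret_cast<void**>(&h_pinned), bytes);  // page-locked
  cudaMalloc(reinterpret_cast<void**>(&d_buf), bytes);
  // The copy engine can DMA straight out of the pinned buffer, overlapping
  // with other work queued on the stream.
  cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);
  cudaFree(d_buf);
  cudaFreeHost(h_pinned);
  cudaStreamDestroy(stream);
  return 0;
}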
@thatguymike Thanks for the explanation.
@cypof:
- Could you please include the explanation that the memory needs to be pinned for the copy engine/DMA in the source, for clarity?
- Does the above change have a significant impact on single-GPU mode? We are currently using an AWS g2.2 instance for training. Thanks
@vimalthilak
This only changes how host memory is allocated; I think the only difference would be in the case where you allocate large amounts of host memory in your program, close to the RAM capacity. If you have 16 GB of RAM, with malloc(3) you should still be able to allocate 16 GB in a single program because of overcommit/swap; that won't be possible with cudaMallocHost.
But I think this situation is unlikely to occur in Caffe, especially since the pinned path is only taken in GPU mode, where the CPU memory requirement should not be excessive.
I don't see how this could impact single-GPU mode.
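A rough illustration of that difference, assuming a 16 GB machine (actual behavior depends on the OS overcommit settings, and the size is only an example):

// Illustrative only: near RAM capacity, malloc can succeed thanks to Linux
// overcommit (pages are only backed when touched), while cudaMallocHost has
// to page-lock the whole range up front and may fail instead.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
  const size_t huge = 16ULL << 30;  // 16 GB, assumed close to physical RAM

  void* pageable = std::malloc(huge);
  std::printf("malloc:         %s\n", pageable ? "succeeded" : "failed");

  void* pinned = nullptr;
  cudaError_t err = cudaMallocHost(&pinned, huge);
  std::printf("cudaMallocHost: %s\n", cudaGetErrorString(err));

  if (err == cudaSuccess) cudaFreeHost(pinned);
  std::free(pageable);
  return 0;
}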
There is a small performance benefit to having the memory pre-pinned for data transfers, even in single-GPU mode. With large numbers of GPUs and large buffers, it gets tough on the OS to successfully pin buffers dynamically.
In CPU mode, there is some unneeded overhead from allocating the data as pinned. Pinning the memory puts pressure on other parts of the system under high memory-use contention.
@thatguymike / Mike: Thanks.
Understood about the benefits of pinning memory even for a single-GPU system. My assumption here was/is that as long as there is sufficient system RAM, the overall system performance should be okay. I will do a test run when I get a chance, unless someone else has already done so. Thanks once again.
As noted earlier, Mike's comments should be noted down as a part of this PR.
Thanks Mike. Yes I will add this to the code.
- Interrupt the thread before waiting on join
- Provide a method for looping threads to exit on demand
- CHECK if start and stop succeed instead of returning an error
- Makes sure each solver accesses a different subset of the data
- Sequential reading of DB for performance
- Prefetches a configurable amount of data to host memory
- Distributes data to solvers in round-robin way for determinism
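A minimal sketch of the round-robin idea described in the commit message above; the names (Batch, SolverQueue, Distribute) are hypothetical and are not the actual Caffe classes.

// Sketch: hand out prefetched batches to solver queues in a fixed order so
// each run is deterministic and each solver sees a disjoint subset of the
// sequentially-read data.
#include <cstddef>
#include <queue>
#include <vector>

struct Batch { /* prefetched host-side data */ };
using SolverQueue = std::queue<Batch>;

void Distribute(std::vector<Batch>& prefetched, std::vector<SolverQueue>& solvers) {
  for (std::size_t i = 0; i < prefetched.size(); ++i) {
    solvers[i % solvers.size()].push(prefetched[i]);
  }
}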
Force-pushed from 3f4b52e to 2de355a.
This causes some tests to fail. For instance:
In this patch, allocation depends on the current Caffe mode. This failing test changes the mode during execution. By setting a breakpoint in caffe::CaffeMallocHost, I can see that some data was initially allocated in the SetUp function of this test set:
The corresponding line: …
The obvious solution here is to add an additional flag to remember how the memory was allocated, but CaffeMallocHost and CaffeFreeHost won't be free-standing functions anymore, since they will require this extra information.
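One possible shape for that flag, sketched as an extra in/out parameter whose value the caller (e.g. SyncedMemory) would have to store; this is an assumption about the fix, not necessarily what the PR ends up doing.

// Sketch: record how the memory was allocated and free it with the matching
// call, instead of re-checking Caffe::mode() at free time.
inline void CaffeMallocHost(void** ptr, size_t size, bool* use_cuda) {
#ifndef CPU_ONLY
  if (Caffe::mode() == Caffe::GPU) {
    CUDA_CHECK(cudaMallocHost(ptr, size));
    *use_cuda = true;
    return;
  }
#endif
  *ptr = malloc(size);
  *use_cuda = false;
  CHECK(*ptr) << "host allocation of size " << size << " failed";
}

inline void CaffeFreeHost(void* ptr, bool use_cuda) {
#ifndef CPU_ONLY
  if (use_cuda) {
    CUDA_CHECK(cudaFreeHost(ptr));
    return;
  }
#endif
  free(ptr);
}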
I started working on a fix for the illegal free. However, I'm still seeing some random failures like this one:
These failures are non-deterministic; I suspect the change of allocation function simply revealed a bug that was already there. I will investigate more.
So, it was actually the reciprocal problem: memory allocated by one function was being released by the other. An even nastier side effect of this bug is that some tests are possibly very wrong. For instance: …
My suggestion is to remove ALL the occurrences of … from the tests. Once again, feedback is welcome.
It doesn't seem to make a performance difference but it is the recommended practice. It also seemed to help when we were debugging stability issues on the parallel branch. @thatguymike could comment.