Optimizer CPU offload for single GPU training #584
To check that we are on the same page: the `non_blocking=True` here means that the host (CPU) is not blocked on this D2H copy. However, there is nothing for these D2H copies to overlap with, so the main benefit you are getting here is that copying D2H with `non_blocking=True` will copy directly to pinned memory.

Otherwise, the CPU side should look like issuing a D2H copy for each gradient and then blocking via the `torch.cuda.synchronize()` for all D2H copies to finish.
For these H2D copies, the `non_blocking=True` here only means that the CPU will not be blocked. The `p_cpu` is already in pinned memory, so there is no further pinned-memory consideration.

Passing `non_blocking=True` allows the CPU to proceed to the next logic, whether that is logging, the next iteration's data loading, or whatever else. However, subsequent CUDA kernels issued on the default stream will still serialize with the H2D copies.
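A minimal sketch of that H2D direction, assuming each `p_cpu` tensor was allocated in pinned memory (the function name is illustrative):

```python
import torch

@torch.no_grad()
def copy_params_back_to_gpu(params, cpu_params):
    # H2D copies from pinned CPU buffers. non_blocking=True here only
    # means the CPU is not blocked; the source is already pinned, so no
    # extra staging copy is needed. The CPU can move on to logging or
    # data loading, but kernels issued later on the default stream will
    # still serialize after these copies.
    for p, p_cpu in zip(params, cpu_params):
        p.copy_(p_cpu, non_blocking=True)

# Illustrative usage (only meaningful on a machine with a CUDA device).
if torch.cuda.is_available():
    params = [torch.zeros(1024, device="cuda") for _ in range(4)]
    cpu_params = [torch.ones(1024, pin_memory=True) for _ in params]
    copy_params_back_to_gpu(params, cpu_params)
```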
I will still mention that this `non_blocking` is still beneficial, as it allows the CPU to enqueue all the copies and much better saturate the bandwidth, even if there is no overlap with compute.
@albanD I wanted to understand this point better.

If you call with `non_blocking=False`, then there is a `cudaDeviceSynchronize` after each copy, blocking the CPU until the copy finishes. After that, the CPU will proceed to issue the next copy, so there may be some slight gaps between each H2D copy.

The part that I am not clear on is: are you suggesting that these gaps are exactly what would hurt the overall copy bandwidth, or do you mean that if you issue back-to-back H2D memcpys, there is some kind of batching effect across copies that improves bandwidth? (The latter would be non-intuitive to me, so I wanted to check.)
I guess for `non_blocking=False`, the additional `cudaDeviceSynchronize` is coupled with having to copy through pageable memory as well, so that is also slower than copying to pinned memory.
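One way to observe the difference being discussed is a rough micro-benchmark (a sketch, not code from the PR): time a batch of H2D copies once from pageable memory with blocking copies, and once from pinned memory with `non_blocking=True` plus a single final sync.

```python
import time

import torch

def time_h2d(dsts, srcs, non_blocking):
    # Time a batch of H2D copies. With non_blocking=False, each copy
    # blocks the CPU until it completes (and a pageable source also
    # forces an internal staging copy). With non_blocking=True from
    # pinned memory, all copies are enqueued back to back and we pay
    # for a single synchronize at the end.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for d, s in zip(dsts, srcs):
        d.copy_(s, non_blocking=non_blocking)
    torch.cuda.synchronize()
    return time.perf_counter() - t0

if torch.cuda.is_available():
    n, size = 64, 1 << 20
    dsts = [torch.empty(size, device="cuda") for _ in range(n)]
    pageable = [torch.randn(size) for _ in range(n)]
    pinned = [t.pin_memory() for t in pageable]
    t_pageable = time_h2d(dsts, pageable, non_blocking=False)
    t_pinned = time_h2d(dsts, pinned, non_blocking=True)
    print(f"pageable blocking: {t_pageable:.4f}s, pinned non-blocking: {t_pinned:.4f}s")
```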