Reusable zeroed memory #1984
Conversation
// For serial grid reduction we test that re-using semaphore buffers works
// properly. This will cause a failure if serial grid reduction does not
// properly clean up its semaphores.
std::vector<bool> reuse_zeroed_memory_values{false};
if (serial) {
  reuse_zeroed_memory_values.push_back(true);
}
for (bool reuse_zeroed_memory : reuse_zeroed_memory_values) {
A new loop nest is introduced, so use "Hide whitespace" to view the diff.
void releaseZeroedMemory() {
  for (Arena& a : arenas) {
    a.reset();
  }
}
A more complicated alternative would be to return a custom subclass of `at::Tensor` whose destructor marks the corresponding region as free. But given our intended usage pattern, where it's always safe to release all the memory after each kernel launch, I decided to keep things simple.
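For illustration only, a minimal sketch of what that alternative might look like; all names here (`SketchArena`, `markFree`, `ZeroedSlice`) are hypothetical and not part of this PR:

```cpp
// Hypothetical sketch of the more complicated alternative (NOT this PR's
// design): hand out a wrapper whose destructor marks its own region free,
// instead of releasing the whole arena after each kernel launch.
#include <ATen/ATen.h>

#include <cstdint>
#include <utility>
#include <vector>

class SketchArena {
 public:
  // Record that the slice starting at `offset` may be handed out again.
  void markFree(int64_t offset) {
    free_offsets_.push_back(offset);
  }

 private:
  std::vector<int64_t> free_offsets_;
};

// RAII handle: exposes the underlying tensor, returns its slice on destruction.
class ZeroedSlice {
 public:
  ZeroedSlice(at::Tensor view, SketchArena* arena, int64_t offset)
      : view_(std::move(view)), arena_(arena), offset_(offset) {}
  ~ZeroedSlice() {
    arena_->markFree(offset_); // per-slice bookkeeping
  }
  const at::Tensor& tensor() const {
    return view_;
  }

 private:
  at::Tensor view_;
  SketchArena* arena_;
  int64_t offset_;
};
```

Compared with this, a single bulk `releaseZeroedMemory()` after each launch needs no per-slice tracking, which is the simplicity being traded for here.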
  return arenas[device_num].getTensor(sizes, aten_dtype, device);
}

void releaseZeroedMemory() {
Does it actually release the memory? It looks like it only sets the buffer size to zero.
I guess it's just me getting confused by some of the names. "release" here does not really release the allocated memory but returns the allocation to the Arena.
Exactly. It is confusing but I wasn't sure what exactly to name it. I am open to suggestions of course.
No need to change the name but a quick comment would be helpful.
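Something along these lines, for example (the comment wording is mine, not taken from the PR):

```cpp
// "Releases" every zeroed allocation back to its arena so getZeroedTensor()
// can hand the memory out again. This does NOT free the underlying
// at::Tensor storage; each Arena only rewinds its bookkeeping. Callers must
// guarantee that all kernels using these buffers have finished and have
// reset the memory (e.g. semaphores) back to zero.
void releaseZeroedMemory() {
  for (Arena& a : arenas) {
    a.reset();
  }
}
```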
intermediate_buffer = at::zeros(
    unexpanded_sizes,
    at::TensorOptions().dtype(buf_info.type).device(options_.device));
if (isOptionEnabled(EnableOption::ReuseZeroedMemory)) {
Would there be any way to remove this option and automatically decide which to use? We know that the serial reduction can use this, so we should be able to tell the executor to do so.
Good idea. I think we could do that by adding a flag to `kir::Allocate` and plumbing it through. Would it ever be preferable to launch a memset kernel rather than cleaning the semaphore? Even though a kernel might not itself be persistent, the fusion will likely be run multiple times.
> Would it ever be preferable to launch a memset kernel rather than cleaning the semaphore?
I think it's highly unlikely.
LGTM.
This doesn't need to be in this PR:
https://github.com/NVIDIA/Fuser/pull/1984/files#r1536246159
As suggested by @naoyam, this will allow us to incrementally enable reuse of zeroed memory (see #1984) on a case-by-case basis. To do so, when creating an `Allocate` node, if it is guaranteed to be "cleaned up" (i.e. reset to all zero values) by the end of the kernel, then you can pass `true` for the `resets_to_zero` argument in the `kir::Allocate` constructor. This will automatically enable reuse of zeroed memory for this buffer and avoid a memset call, regardless of whether `EnableOption::ReuseZeroedMemory` is enabled or not. Eventually this will let us get rid of that option entirely, once all of our global memory semaphores are guaranteed to be reset.
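Roughly what that looks like at an allocation site; only the `resets_to_zero` flag itself is taken from this PR, while the other arguments and their order are assumptions about the `kir::Allocate` constructor:

```cpp
// Sketch of passing resets_to_zero when building a semaphore allocation.
// Everything except the resets_to_zero flag is illustrative.
auto* semaphores = IrBuilder::create<kir::Allocate>(
    semaphore_buffer,         // global-memory buffer holding the semaphores
    MemoryType::Global,
    buffer_size,              // size of the allocation
    /*zero_init=*/true,       // must start out zero-filled
    /*resets_to_zero=*/true); // kernel guarantees it is all zeros again on exit
```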
This introduces a thread-local global memory allocator for each device and uses it whenever an intermediate tensor is needed that requires zero-initialization.

To enable it, use `NVFUSER_ENABLE=reuse_zeroed_memory`. You can monitor the allocator using `NVFUSER_DUMP=global_zeroed_memory`. Before we enable this feature by default, we need to have high confidence that every kernel using zero-initialized memory will always clean up its semaphores. This is currently only the case for serial grid reductions, as far as I know.
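Conceptually, the per-device allocator behaves like the sketch below (a simplified, hypothetical `ZeroedArena`; the real implementation is the allocator added in this PR). A request reuses a cached zero-filled tensor when a large enough one is available, otherwise it grows the pool with `at::zeros`; `reset()` simply rewinds the cursor, relying on kernels having zeroed the memory again.

```cpp
// Simplified sketch of a thread-local, per-device pool of zeroed memory.
// ZeroedArena / getTensor / reset mirror the idea, not the exact code.
#include <ATen/ATen.h>

#include <cstdint>
#include <vector>

class ZeroedArena {
 public:
  // Return a zero-filled tensor with the given sizes, reusing cached storage
  // when possible and falling back to a fresh at::zeros allocation otherwise.
  at::Tensor getTensor(
      const std::vector<int64_t>& sizes,
      at::ScalarType dtype,
      at::Device device) {
    int64_t numel = 1;
    for (int64_t s : sizes) {
      numel *= s;
    }
    if (next_ < slots_.size() && slots_[next_].numel() >= numel &&
        slots_[next_].scalar_type() == dtype) {
      // Reuse: callers promise this memory has been reset to zero.
      return slots_[next_++].narrow(0, 0, numel).view(sizes);
    }
    // Grow the pool: this is the path that calls at::zeros again when a
    // larger buffer is requested than anything currently cached.
    at::Tensor t =
        at::zeros({numel}, at::TensorOptions().dtype(dtype).device(device));
    slots_.push_back(t);
    next_ = slots_.size();
    return t.view(sizes);
  }

  // Rewind so cached tensors can be handed out again. Their contents are
  // assumed to already be zero; nothing is actually freed or memset here.
  void reset() {
    next_ = 0;
  }

 private:
  std::vector<at::Tensor> slots_;
  size_t next_ = 0;
};
```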
This enables the basic functionality of #1829. However, it does not modify existing algorithms to clean up their memory. See `NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling`, which succeeds when using serial grid reduction but fails (in debug mode) when using `gridReduce` (note that this test is updated to behave differently in this PR).

This test runs with serial grid reduction, then with `gridReduce`. Each time it runs two grid reductions. Both serial grid reductions succeed because the semaphore buffer is properly zeroed. The `gridReduce` succeeds the first time since the memory pool calls `at::zeros` again to request a larger buffer size (`gridReduce` requires more semaphores since there is one per thread segment vs. one per block segment). However, the second call to `gridReduce` fails because it has not cleaned up its semaphores. Hacking that function to force `PERSISTENT=1` would clean up the semaphores, resulting in success in this case. I'm leaving those kinds of modifications for a follow-up.
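For intuition only, the failure mode maps onto the hypothetical `ZeroedArena` sketch above roughly like this (the real allocator's debug check is what actually fires in the test):

```cpp
// Conceptual demo of why a second gridReduce-style launch fails when
// semaphores are not cleaned up. Uses the hypothetical ZeroedArena sketch.
bool demoStaleSemaphores(ZeroedArena& arena, at::Device device) {
  at::Tensor sem1 = arena.getTensor({128}, at::kInt, device);
  sem1.fill_(1); // stand-in for a kernel that signals but never resets

  arena.reset(); // the pool now assumes this memory is all zeros again

  at::Tensor sem2 = arena.getTensor({128}, at::kInt, device);
  // sem2 aliases sem1's storage, so it still holds stale non-zero counts;
  // a consumer expecting zero-initialized semaphores would misbehave here.
  return sem2.ne(0).any().item<bool>(); // true: the memory was not cleaned up
}
```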