
Reusable zeroed memory #1984
Merged: 6 commits, Mar 22, 2024
Conversation

@jacobhinkle (Collaborator) commented Mar 22, 2024

This introduces a thread-local global memory allocator for each device and uses it whenever an intermediate tensor requiring zero-initialization is needed.

To enable it, set NVFUSER_ENABLE=reuse_zeroed_memory. You can monitor the allocator using NVFUSER_DUMP=global_zeroed_memory.

Before we enable this feature by default, we need high confidence that every kernel using zero-initialized memory will always clean up its semaphores. As far as I know, this is currently only the case for serial grid reductions.

This enables the basic functionality of #1829. However, it does not modify existing algorithms to clean up their memory. See NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling, which succeeds when using serial grid reduction but fails (in debug mode) when using gridReduce (note that this PR updates the test to behave differently):

# NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling                                                       
Running main() from /opt/pytorch/nvfuser/third_party/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = SerialGridReductionTest.Scheduling
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from SerialGridReductionTest
[ RUN      ] SerialGridReductionTest.Scheduling
[global zeroed memory] Resizing arena to 512 bytes
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Resizing arena to 16384 bytes
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
unknown file: Failure
C++ exception with description "nnz.equal(0) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/global_allocator.cpp":88, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Global memory arena was not properly zeroed. Found 2048 bytes that are not zero
Exception raised from checkZeroed at /opt/pytorch/nvfuser/csrc/global_allocator.cpp:88 (most recent call first):
frame #0: <unknown function> + 0x2fde9e (0x556cdb95de9e in build/nvfuser_tests)
frame #1: <unknown function> + 0x2fe0df (0x556cdb95e0df in build/nvfuser_tests)
frame #2: <unknown function> + 0x3f3720 (0x556cdba53720 in build/nvfuser_tests)
frame #3: <unknown function> + 0x3f33df (0x556cdba533df in build/nvfuser_tests)
frame #4: <unknown function> + 0x3f38ed (0x556cdba538ed in build/nvfuser_tests)
frame #5: <unknown function> + 0x315e67 (0x556cdb975e67 in build/nvfuser_tests)
frame #6: <unknown function> + 0x7c5780 (0x556cdbe25780 in build/nvfuser_tests)
frame #7: <unknown function> + 0x7c5877 (0x556cdbe25877 in build/nvfuser_tests)
frame #8: <unknown function> + 0x138f8cc (0x556cdc9ef8cc in build/nvfuser_tests)
frame #9: <unknown function> + 0x1457f0b (0x556cdcab7f0b in build/nvfuser_tests)
frame #10: <unknown function> + 0x14519fd (0x556cdcab19fd in build/nvfuser_tests)
frame #11: <unknown function> + 0x142de24 (0x556cdca8de24 in build/nvfuser_tests)
frame #12: <unknown function> + 0x142e93f (0x556cdca8e93f in build/nvfuser_tests)
frame #13: <unknown function> + 0x142f345 (0x556cdca8f345 in build/nvfuser_tests)
frame #14: <unknown function> + 0x143f86c (0x556cdca9f86c in build/nvfuser_tests)
frame #15: <unknown function> + 0x1458e98 (0x556cdcab8e98 in build/nvfuser_tests)
frame #16: <unknown function> + 0x1452ac7 (0x556cdcab2ac7 in build/nvfuser_tests)
frame #17: <unknown function> + 0x143de6d (0x556cdca9de6d in build/nvfuser_tests)
frame #18: <unknown function> + 0x1407ca0 (0x556cdca67ca0 in build/nvfuser_tests)
frame #19: <unknown function> + 0x1407c19 (0x556cdca67c19 in build/nvfuser_tests)
frame #20: <unknown function> + 0x29d90 (0x7f616c7d4d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f616c7d4e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #22: <unknown function> + 0x11e9d5 (0x556cdb77e9d5 in build/nvfuser_tests)
" thrown in the test body.

To reproduce: NVFUSER_TEST_RANDOM_SEED=1711120799 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='SerialGridReductionTest.Scheduling'
[  FAILED  ] SerialGridReductionTest.Scheduling (5669 ms)
[----------] 1 test from SerialGridReductionTest (5669 ms total)

This test runs with serial grid reduction, then with gridReduce. Each time, it runs two grid reductions. Both serial grid reductions succeed because the semaphore buffer is properly zeroed. The gridReduce succeeds the first time since the memory pool calls at::zeros again to request a larger buffer size (gridReduce requires more semaphores since there is one per thread segment vs. one per block segment). However, the second call to gridReduce fails because it has not cleaned up its semaphores. Hacking that function to force PERSISTENT=1 would clean up the semaphores, resulting in success in this case. I'm leaving those kinds of modifications for a follow-up.
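The arena behavior visible in the log above (grow-and-zero, hand out byte ranges, reset the allocated count between launches) can be sketched in plain C++. This is a hypothetical, ATen-free simplification of what csrc/global_allocator.cpp is described as doing, not the actual implementation; the class and method names here are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Minimal sketch of a reusable zeroed-memory arena. Allocations are bump-
// pointer ranges; reset() returns all ranges to the arena without freeing.
class Arena {
 public:
  // Return a pointer to `bytes` of zeroed memory, growing the arena if needed.
  std::byte* allocate(size_t bytes) {
    if (allocated_ + bytes > buffer_.size()) {
      // Growing re-creates the buffer with value-initialized (zero) bytes,
      // analogous to requesting a larger buffer via at::zeros.
      buffer_.assign(allocated_ + bytes, std::byte{0});
    }
    std::byte* ptr = buffer_.data() + allocated_;
    // Debug check: a kernel that did not clean up its semaphores leaves
    // nonzero bytes behind, which would poison the next allocation.
    for (size_t i = 0; i < bytes; ++i) {
      if (ptr[i] != std::byte{0}) {
        throw std::runtime_error("Global memory arena was not properly zeroed");
      }
    }
    allocated_ += bytes;
    return ptr;
  }

  // "Release": return all allocations to the arena without freeing memory.
  void reset() { allocated_ = 0; }

  size_t capacity() const { return buffer_.size(); }

 private:
  std::vector<std::byte> buffer_;
  size_t allocated_ = 0;
};
```

This also mirrors why gridReduce passes the first time but not the second: a request that forces regrowth gets freshly zeroed memory, while a same-size request after reset() reuses bytes that the previous kernel was trusted to have re-zeroed.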

@jacobhinkle jacobhinkle marked this pull request as ready for review March 22, 2024 16:06
Comment on lines +40 to +47
// For serial grid reduction we test that re-using semaphore buffers works
// properly. This will cause a failure if serial grid reduction does not
// properly clean up its semaphores.
std::vector<bool> reuse_zeroed_memory_values{false};
if (serial) {
reuse_zeroed_memory_values.push_back(true);
}
for (bool reuse_zeroed_memory : reuse_zeroed_memory_values) {
jacobhinkle (Collaborator, author):
A new loop nest is introduced, so use "Hide whitespace" to view the diff.

Comment on lines +121 to +125
void releaseZeroedMemory() {
for (Arena& a : arenas) {
a.reset();
}
}
jacobhinkle (Collaborator, author):
A more complicated alternative would be to return a custom subclass of at::Tensor whose destructor marks the corresponding region as free. But given our intended usage pattern where it's always safe to release all the memory after each kernel launch, I decided to keep things simple.
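The rejected alternative above (a tensor subclass whose destructor frees its region) can be contrasted with the simpler chosen pattern: release everything in bulk once per launch. A small scope guard makes that pattern explicit; this is an illustrative sketch, not code from the PR, and only releaseZeroedMemory is a real name from the diff.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// RAII guard: invokes a release callback when the scope ends, mirroring
// "it's always safe to release all the memory after each kernel launch".
class ZeroedMemoryScope {
 public:
  explicit ZeroedMemoryScope(std::function<void()> release)
      : release_(std::move(release)) {}
  ~ZeroedMemoryScope() {
    if (release_) {
      release_();
    }
  }
  ZeroedMemoryScope(const ZeroedMemoryScope&) = delete;
  ZeroedMemoryScope& operator=(const ZeroedMemoryScope&) = delete;

 private:
  std::function<void()> release_;
};
```

Usage would wrap each launch, e.g. `ZeroedMemoryScope guard(releaseZeroedMemory);` before launching, so the bulk release runs even on early returns, without per-tensor bookkeeping.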

return arenas[device_num].getTensor(sizes, aten_dtype, device);
}

void releaseZeroedMemory() {
Collaborator:

Does it actually release the memory? It looks like it only sets the buffer size to zero.

Collaborator:

I guess it's just me getting confused by some of the names. "release" here does not really release the allocated memory; it returns the allocation to the Arena.

jacobhinkle (Collaborator, author):

Exactly. It is confusing, but I wasn't sure what to name it. I am open to suggestions, of course.

Collaborator:

No need to change the name but a quick comment would be helpful.

intermediate_buffer = at::zeros(
unexpanded_sizes,
at::TensorOptions().dtype(buf_info.type).device(options_.device));
if (isOptionEnabled(EnableOption::ReuseZeroedMemory)) {
Collaborator:

Would there be any way to remove this option and automatically decide which to use? We know that the serial reduction can use this, so we should be able to tell the executor to do so.

jacobhinkle (Collaborator, author):

Good idea. I think we could do that by adding a flag to kir::Allocate and plumbing it through. Would it ever be preferable to launch a memset kernel rather than cleaning the semaphore? Even though a kernel might not itself be persistent, the fusion will likely be run multiple times.

@naoyam (Collaborator), Mar 22, 2024:

Would it ever be preferable to launch a memset kernel rather than cleaning the semaphore?

I think it's highly unlikely.

@naoyam left a comment.

@jacobhinkle jacobhinkle merged commit 8db85de into main Mar 22, 2024
4 checks passed
@jacobhinkle jacobhinkle deleted the reuse_zeroed_memory branch March 22, 2024 23:15
jacobhinkle added a commit that referenced this pull request Mar 25, 2024
This will allow us to incrementally enable reuse of zeroed memory
(see #1984) on a case-by-case basis. To do so, when creating an
`Allocate` node, if it is guaranteed to be "cleaned up" (i.e. reset to
all zero values) by the end of the kernel, then you can pass `true` for
the `resets_to_zero` argument in the `kir::Allocate` constructor. This
will automatically enable reuse of zeroed memory for this buffer and
avoid a memset call, regardless of whether
`EnableOption::ReuseZeroedMemory` is enabled or not. Eventually this
will let us get rid of that option entirely, once all of our global
memory semaphores are guaranteed to be reset.
jacobhinkle added a commit that referenced this pull request Mar 26, 2024
As suggested by @naoyam, this will allow us to incrementally enable
reuse of zeroed memory (see #1984) on a case-by-case basis. To do so,
when creating an `Allocate` node, if it is guaranteed to be "cleaned up"
(i.e. reset to all zero values) by the end of the kernel, then you can
pass `true` for the `resets_to_zero` argument in the `kir::Allocate`
constructor. This will automatically enable reuse of zeroed memory for
this buffer and avoid a memset call, regardless of whether
`EnableOption::ReuseZeroedMemory` is enabled or not. Eventually this
will let us get rid of that option entirely, once all of our global
memory semaphores are guaranteed to be reset.
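The per-buffer decision described in the commit messages above can be sketched as a small policy function. The names resets_to_zero, kir::Allocate, and EnableOption::ReuseZeroedMemory come from the source; the struct and function below are a hypothetical simplification, not NVFuser's actual node or executor code.

```cpp
#include <cassert>

// Simplified stand-in for the relevant kir::Allocate properties.
struct AllocateNode {
  bool zero_init = false;      // buffer must start zeroed (e.g. semaphores)
  bool resets_to_zero = false; // kernel guarantees it is zero again on exit
};

enum class Strategy { ArenaReuse, FreshZeros, PlainAlloc };

// A resets_to_zero buffer can always come from the reusable zeroed arena,
// regardless of the global option; other zero-init buffers reuse the arena
// only when the option is on, and otherwise fall back to a fresh at::zeros-
// style allocation (i.e. a memset).
Strategy chooseAllocator(const AllocateNode& n, bool reuse_option_enabled) {
  if (!n.zero_init) {
    return Strategy::PlainAlloc;
  }
  if (n.resets_to_zero || reuse_option_enabled) {
    return Strategy::ArenaReuse;
  }
  return Strategy::FreshZeros;
}
```

Under this scheme, once every zero-initialized buffer is marked resets_to_zero, the reuse_option_enabled parameter (and with it the NVFUSER_ENABLE flag) becomes dead and can be dropped, which is the end state the commit message anticipates.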