
Reusable zeroed memory #1984
Merged: 6 commits, Mar 22, 2024
Conversation

@jacobhinkle (Collaborator) commented Mar 22, 2024

This introduces a thread-local global memory allocator for each device and uses it whenever an intermediate tensor requiring zero-initialization is needed.

To enable it, set NVFUSER_ENABLE=reuse_zeroed_memory. You can monitor the allocator using NVFUSER_DUMP=global_zeroed_memory.

Before we enable this feature by default, we need high confidence that every kernel using zero-initialized memory will always clean up its semaphores. As far as I know, this is currently only the case for serial grid reductions.

This enables the basic functionality of #1829. However, it does not modify existing algorithms to clean up their memory. See NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling, which succeeds when using serial grid reduction but fails (in debug mode) when using gridReduce (note that this PR updates the test to behave differently):

# NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling                                                       
Running main() from /opt/pytorch/nvfuser/third_party/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = SerialGridReductionTest.Scheduling
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from SerialGridReductionTest
[ RUN      ] SerialGridReductionTest.Scheduling
[global zeroed memory] Resizing arena to 512 bytes
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Resizing arena to 16384 bytes
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
unknown file: Failure
C++ exception with description "nnz.equal(0) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/global_allocator.cpp":88, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Global memory arena was not properly zeroed. Found 2048 bytes that are not zero
Exception raised from checkZeroed at /opt/pytorch/nvfuser/csrc/global_allocator.cpp:88 (most recent call first):
frame #0: <unknown function> + 0x2fde9e (0x556cdb95de9e in build/nvfuser_tests)
frame #1: <unknown function> + 0x2fe0df (0x556cdb95e0df in build/nvfuser_tests)
frame #2: <unknown function> + 0x3f3720 (0x556cdba53720 in build/nvfuser_tests)
frame #3: <unknown function> + 0x3f33df (0x556cdba533df in build/nvfuser_tests)
frame #4: <unknown function> + 0x3f38ed (0x556cdba538ed in build/nvfuser_tests)
frame #5: <unknown function> + 0x315e67 (0x556cdb975e67 in build/nvfuser_tests)
frame #6: <unknown function> + 0x7c5780 (0x556cdbe25780 in build/nvfuser_tests)
frame #7: <unknown function> + 0x7c5877 (0x556cdbe25877 in build/nvfuser_tests)
frame #8: <unknown function> + 0x138f8cc (0x556cdc9ef8cc in build/nvfuser_tests)
frame #9: <unknown function> + 0x1457f0b (0x556cdcab7f0b in build/nvfuser_tests)
frame #10: <unknown function> + 0x14519fd (0x556cdcab19fd in build/nvfuser_tests)
frame #11: <unknown function> + 0x142de24 (0x556cdca8de24 in build/nvfuser_tests)
frame #12: <unknown function> + 0x142e93f (0x556cdca8e93f in build/nvfuser_tests)
frame #13: <unknown function> + 0x142f345 (0x556cdca8f345 in build/nvfuser_tests)
frame #14: <unknown function> + 0x143f86c (0x556cdca9f86c in build/nvfuser_tests)
frame #15: <unknown function> + 0x1458e98 (0x556cdcab8e98 in build/nvfuser_tests)
frame #16: <unknown function> + 0x1452ac7 (0x556cdcab2ac7 in build/nvfuser_tests)
frame #17: <unknown function> + 0x143de6d (0x556cdca9de6d in build/nvfuser_tests)
frame #18: <unknown function> + 0x1407ca0 (0x556cdca67ca0 in build/nvfuser_tests)
frame #19: <unknown function> + 0x1407c19 (0x556cdca67c19 in build/nvfuser_tests)
frame #20: <unknown function> + 0x29d90 (0x7f616c7d4d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f616c7d4e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #22: <unknown function> + 0x11e9d5 (0x556cdb77e9d5 in build/nvfuser_tests)
" thrown in the test body.

To reproduce: NVFUSER_TEST_RANDOM_SEED=1711120799 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='SerialGridReductionTest.Scheduling'
[  FAILED  ] SerialGridReductionTest.Scheduling (5669 ms)
[----------] 1 test from SerialGridReductionTest (5669 ms total)

This test runs with serial grid reduction, then with gridReduce. Each time, it runs two grid reductions. Both serial grid reductions succeed because the semaphore buffer is properly zeroed. The gridReduce succeeds the first time since the memory pool calls at::zeros again to request a larger buffer size (gridReduce requires more semaphores since there is one per thread segment vs. one per block segment). However, the second call to gridReduce fails because it has not cleaned up its semaphores. Hacking that function to force PERSISTENT=1 would clean up the semaphores, resulting in success in this case. I'm leaving those kinds of modifications for a follow-up.
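The arena behavior visible in the log above (grow-and-zero, hand out byte ranges, reset the allocated count between launches) can be sketched in plain C++. This is a hypothetical, ATen-free simplification of what csrc/global_allocator.cpp is described as doing, not the actual implementation; the class and method names here are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Minimal sketch of a reusable zeroed-memory arena. Allocations are bump-
// pointer ranges; reset() returns all ranges to the arena without freeing.
class Arena {
 public:
  // Return a pointer to `bytes` of zeroed memory, growing the arena if needed.
  std::byte* allocate(size_t bytes) {
    if (allocated_ + bytes > buffer_.size()) {
      // Growing re-creates the buffer with value-initialized (zero) bytes,
      // analogous to requesting a larger buffer via at::zeros.
      buffer_.assign(allocated_ + bytes, std::byte{0});
    }
    std::byte* ptr = buffer_.data() + allocated_;
    // Debug check: a kernel that did not clean up its semaphores leaves
    // nonzero bytes behind, which would poison the next allocation.
    for (size_t i = 0; i < bytes; ++i) {
      if (ptr[i] != std::byte{0}) {
        throw std::runtime_error("Global memory arena was not properly zeroed");
      }
    }
    allocated_ += bytes;
    return ptr;
  }

  // "Release": return all allocations to the arena without freeing memory.
  void reset() { allocated_ = 0; }

  size_t capacity() const { return buffer_.size(); }

 private:
  std::vector<std::byte> buffer_;
  size_t allocated_ = 0;
};
```

This also mirrors why gridReduce passes the first time but not the second: a request that forces regrowth gets freshly zeroed memory, while a same-size request after reset() reuses bytes that the previous kernel was trusted to have re-zeroed.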

@jacobhinkle jacobhinkle marked this pull request as ready for review March 22, 2024 16:06
Comment on lines +40 to +47
// For serial grid reduction we test that re-using semaphore buffers works
// properly. This will cause a failure if serial grid reduction does not
// properly clean up its semaphores.
std::vector<bool> reuse_zeroed_memory_values{false};
if (serial) {
reuse_zeroed_memory_values.push_back(true);
}
for (bool reuse_zeroed_memory : reuse_zeroed_memory_values) {
jacobhinkle (Collaborator, author):
A new loop nest is introduced, so use "Hide whitespace" to view the diff.

Comment on lines +121 to +125
void releaseZeroedMemory() {
for (Arena& a : arenas) {
a.reset();
}
}
jacobhinkle (Collaborator, author):
A more complicated alternative would be to return a custom subclass of at::Tensor whose destructor marks the corresponding region as free. But given our intended usage pattern where it's always safe to release all the memory after each kernel launch, I decided to keep things simple.
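The rejected alternative above (a tensor subclass whose destructor frees its region) can be contrasted with the simpler chosen pattern: release everything in bulk once per launch. A small scope guard makes that pattern explicit; this is an illustrative sketch, not code from the PR, and only releaseZeroedMemory is a real name from the diff.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// RAII guard: invokes a release callback when the scope ends, mirroring
// "it's always safe to release all the memory after each kernel launch".
class ZeroedMemoryScope {
 public:
  explicit ZeroedMemoryScope(std::function<void()> release)
      : release_(std::move(release)) {}
  ~ZeroedMemoryScope() {
    if (release_) {
      release_();
    }
  }
  ZeroedMemoryScope(const ZeroedMemoryScope&) = delete;
  ZeroedMemoryScope& operator=(const ZeroedMemoryScope&) = delete;

 private:
  std::function<void()> release_;
};
```

Usage would wrap each launch, e.g. `ZeroedMemoryScope guard(releaseZeroedMemory);` before launching, so the bulk release runs even on early returns, without per-tensor bookkeeping.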

return arenas[device_num].getTensor(sizes, aten_dtype, device);
}

void releaseZeroedMemory() {
Collaborator:

Does it actually release the memory? It looks like it only sets the buffer size to zero.

Collaborator:

I guess it's just me getting confused by some of the names. "release" here does not really release the allocated memory; it returns the allocation to the Arena.

jacobhinkle (Collaborator, author):

Exactly. It is confusing, but I wasn't sure what to name it. I am open to suggestions, of course.

Collaborator:

No need to change the name but a quick comment would be helpful.

intermediate_buffer = at::zeros(
unexpanded_sizes,
at::TensorOptions().dtype(buf_info.type).device(options_.device));
if (isOptionEnabled(EnableOption::ReuseZeroedMemory)) {
Collaborator:

Would there be any way to remove this option and automatically decide which to use? We know that the serial reduction can use this, so we should be able to tell the executor to do so.

jacobhinkle (Collaborator, author):

Good idea. I think we could do that by adding a flag to kir::Allocate and plumbing it through. Would it ever be preferable to launch a memset kernel rather than cleaning the semaphore? Even though a kernel might not itself be persistent, the fusion will likely be run multiple times.

@naoyam (Collaborator), Mar 22, 2024:

Would it ever be preferable to launch a memset kernel rather than cleaning the semaphore?

I think it's highly unlikely.

@naoyam left a comment.

@jacobhinkle jacobhinkle merged commit 8db85de into main Mar 22, 2024
4 checks passed
@jacobhinkle jacobhinkle deleted the reuse_zeroed_memory branch March 22, 2024 23:15
jacobhinkle added a commit that referenced this pull request Mar 25, 2024
This will allow us to incrementally enable reuse of zeroed memory
(see #1984) on a case-by-case basis. To do so, when creating an
`Allocate` node, if it is guaranteed to be "cleaned up" (i.e. reset to
all zero values) by the end of the kernel, then you can pass `true` for
the `resets_to_zero` argument in the `kir::Allocate` constructor. This
will automatically enable reuse of zeroed memory for this buffer and
avoid a memset call, regardless of whether
`EnableOption::ReuseZeroedMemory` is enabled or not. Eventually this
will let us get rid of that option entirely, once all of our global
memory semaphores are guaranteed to be reset.
jacobhinkle added a commit that referenced this pull request Mar 26, 2024
As suggested by @naoyam, this will allow us to incrementally enable
reuse of zeroed memory (see #1984) on a case-by-case basis. To do so,
when creating an `Allocate` node, if it is guaranteed to be "cleaned up"
(i.e. reset to all zero values) by the end of the kernel, then you can
pass `true` for the `resets_to_zero` argument in the `kir::Allocate`
constructor. This will automatically enable reuse of zeroed memory for
this buffer and avoid a memset call, regardless of whether
`EnableOption::ReuseZeroedMemory` is enabled or not. Eventually this
will let us get rid of that option entirely, once all of our global
memory semaphores are guaranteed to be reset.
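The per-buffer decision described in the commit messages above can be sketched as a small policy function. The names resets_to_zero, kir::Allocate, and EnableOption::ReuseZeroedMemory come from the source; the struct and function below are a hypothetical simplification, not NVFuser's actual node or executor code.

```cpp
#include <cassert>

// Simplified stand-in for the relevant kir::Allocate properties.
struct AllocateNode {
  bool zero_init = false;      // buffer must start zeroed (e.g. semaphores)
  bool resets_to_zero = false; // kernel guarantees it is zero again on exit
};

enum class Strategy { ArenaReuse, FreshZeros, PlainAlloc };

// A resets_to_zero buffer can always come from the reusable zeroed arena,
// regardless of the global option; other zero-init buffers reuse the arena
// only when the option is on, and otherwise fall back to a fresh at::zeros-
// style allocation (i.e. a memset).
Strategy chooseAllocator(const AllocateNode& n, bool reuse_option_enabled) {
  if (!n.zero_init) {
    return Strategy::PlainAlloc;
  }
  if (n.resets_to_zero || reuse_option_enabled) {
    return Strategy::ArenaReuse;
  }
  return Strategy::FreshZeros;
}
```

Under this scheme, once every zero-initialized buffer is marked resets_to_zero, the reuse_option_enabled parameter (and with it the NVFUSER_ENABLE flag) becomes dead and can be dropped, which is the end state the commit message anticipates.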