Reusable zeroed memory #1829

Open

jacobhinkle opened this issue Feb 24, 2024 · 2 comments · May be fixed by #2081
Labels
enhancement New feature or request

Comments


jacobhinkle commented Feb 24, 2024

We currently need zeroed global memory buffers for cross-CTA communication. The executor calls at::zeros to initialize such a buffer before each launch of an nvFuser kernel, adding a handful of microseconds per launch. Instead, each executor, kernel runtime, or perhaps each process could hold one zeroed buffer per device and reuse it without the memset. This would require us to always clean up our semaphores after each use, as we already do for persistent kernels.
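As a rough sketch of the idea (hypothetical `ZeroedArena` type and raw CUDA runtime calls for illustration; the real implementation would sit in the executor and use ATen), the arena pays for a memset only when it grows, and resetting it merely rewinds an offset, on the assumption that every kernel returned its semaphores to zero before exiting:

```
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical per-device arena of guaranteed-zero device memory.
// The only cudaMemset happens when the arena grows; in steady state,
// buffers are handed out and reused with no fill kernel at all.
class ZeroedArena {
 public:
  // Hands out `bytes` of zeroed device memory from the arena.
  void* allocate(size_t bytes) {
    if (offset_ + bytes > capacity_) {
      grow(offset_ + bytes);
    }
    void* ptr = static_cast<char*>(base_) + offset_;
    offset_ += bytes;
    return ptr;
  }

  // Called once the kernel has completed. No memset: we rely on the
  // kernel having reset its semaphores to zero before exiting.
  void reset() { offset_ = 0; }

 private:
  void grow(size_t bytes) {
    // Old contents are only transient semaphores, so they need not be
    // preserved across a resize (error checking elided for brevity).
    cudaFree(base_);
    cudaMalloc(&base_, bytes);
    cudaMemset(base_, 0, bytes);
    capacity_ = bytes;
  }

  void* base_ = nullptr;
  size_t capacity_ = 0;
  size_t offset_ = 0;
};
```

A debug build could additionally scan the arena for non-zero bytes at reset time to catch kernels that forgot to clean up, which appears to be what the `checkZeroed` assertion in the trace further down does.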

jacobhinkle added the enhancement label on Feb 24, 2024

jacobhinkle commented Mar 22, 2024

For example, we see the following trace:

```
$ nsys nvprof --print-gpu-trace build/nvfuser_bench --benchmark_filter=NvFuserScheduler_Matmul_Manual/nvfuser_splitk_TN/M:1024/N:2048/K:50304/warps:4/stages:3/splitk_factor:2/manual_time
 Start (ns)  Duration (ns)  CorrId  GrdX   GrdY  GrdZ  BlkX  BlkY  BlkZ  Reg/Trd  StcSMem (MB)  DymSMem (MB)  Bytes (MB)  Throughput (MB/s)  SrcMemKd  DstMemKd           Device            Ctx  GreenCtx  Strm                                                  Name
 ----------  -------------  ------  -----  ----  ----  ----  ----  ----  -------  ------------  ------------  ----------  -----------------  --------  --------  -------------------------  ---  --------  ----  ----------------------------------------------------------------------------------------------------
...
 5738218680           2048   31462       1     1     1   128     1     1       16         0.000         0.000                                                     NVIDIA A100 80GB PCIe (0)    1               7  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<long>, at::detail::A…
 5738245336        2461850   31475       8    16     2    32     2     2      240         0.000         0.049                                                     NVIDIA A100 80GB PCIe (0)    1               7  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)2, (int)2>, <unnamed>…
...
```

The first kernel is the at::zeros() fill. Including the latency between the two launches, it adds 5738245336 - 5738218680 = 26656 ns, which is 1.1% of the runtime of the actual kernel (2.46 ms). This is not a particularly small problem (we commonly have kernels with runtimes of 100-200 us), and since the latency penalty is fixed, the relative impact is larger for smaller problems. It can negate the main benefit of single-kernel split-K over two-kernel split-K: removing a separate kernel launch. It's worth noting that cuBLAS reuses zeroed workspace memory, so its single-kernel split-K traces show a single kernel:

```
$ nsys nvprof --print-gpu-trace build/nvfuser_bench --benchmark_filter=Baseline_Matmul/eagermode_.*_TN/M:1024/N:2048/K:50304/half_reduction:1/manual_time
 Start (ns)  Duration (ns)  CorrId  GrdX   GrdY  GrdZ  BlkX  BlkY  BlkZ  Reg/Trd  StcSMem (MB)  DymSMem (MB)  Bytes (MB)  Throughput (MB/s)  SrcMemKd  DstMemKd           Device            Ctx  GreenCtx  Strm                                                  Name
 ----------  -------------  ------  -----  ----  ----  ----  ----  ----  -------  ------------  ------------  ----------  -----------------  --------  --------  -------------------------  ---  --------  ----  ----------------------------------------------------------------------------------------------------
...
 1704585906        1069789   59125     8     8     3   256     1     1      238         0.049         0.098                                                     NVIDIA A100 80GB PCIe (0)    1               7  ampere_fp16_s16816gemm_fp16_256x128_ldg8_f2f_stages_64x3_tn  
...
```

(Disregard the timing of the kernel itself, since cuBLAS uses a half-precision reduction in this case; see #1719.)

jacobhinkle added a commit that referenced this issue Mar 22, 2024
This introduces a thread-local global memory allocator for each device
and uses it whenever an intermediate tensor requires zero-initialization.

To enable it, use `NVFUSER_ENABLE=reuse_zeroed_memory`. You can monitor
the allocator using `NVFUSER_DUMP=global_zeroed_memory`.

Before we enable this feature by default, we need to have high
confidence that every kernel using zero-initialized memory will always
clean up their semaphores. This is currently only the case for serial
grid reductions, as far as I know.

This enables the basic functionality of #1829. However, it does not
modify existing algorithms to clean up their memory. See
`NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory
build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling`,
which succeeds when using serial grid reduction, but fails (in debug
mode) when using `gridReduce` (note that this test is updated to behave
differently in this PR):
```
# NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling                                                       
Running main() from /opt/pytorch/nvfuser/third_party/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = SerialGridReductionTest.Scheduling
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from SerialGridReductionTest
[ RUN      ] SerialGridReductionTest.Scheduling
[global zeroed memory] Resizing arena to 512 bytes
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Resizing arena to 16384 bytes
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
unknown file: Failure
C++ exception with description "nnz.equal(0) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/global_allocator.cpp":88, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Global memory arena was not properly zeroed. Found 2048 bytes that are not zero
Exception raised from checkZeroed at /opt/pytorch/nvfuser/csrc/global_allocator.cpp:88 (most recent call first):
frame #0: <unknown function> + 0x2fde9e (0x556cdb95de9e in build/nvfuser_tests)
frame #1: <unknown function> + 0x2fe0df (0x556cdb95e0df in build/nvfuser_tests)
frame #2: <unknown function> + 0x3f3720 (0x556cdba53720 in build/nvfuser_tests)
frame #3: <unknown function> + 0x3f33df (0x556cdba533df in build/nvfuser_tests)
frame #4: <unknown function> + 0x3f38ed (0x556cdba538ed in build/nvfuser_tests)
frame #5: <unknown function> + 0x315e67 (0x556cdb975e67 in build/nvfuser_tests)
frame #6: <unknown function> + 0x7c5780 (0x556cdbe25780 in build/nvfuser_tests)
frame #7: <unknown function> + 0x7c5877 (0x556cdbe25877 in build/nvfuser_tests)
frame #8: <unknown function> + 0x138f8cc (0x556cdc9ef8cc in build/nvfuser_tests)
frame #9: <unknown function> + 0x1457f0b (0x556cdcab7f0b in build/nvfuser_tests)
frame #10: <unknown function> + 0x14519fd (0x556cdcab19fd in build/nvfuser_tests)
frame #11: <unknown function> + 0x142de24 (0x556cdca8de24 in build/nvfuser_tests)
frame #12: <unknown function> + 0x142e93f (0x556cdca8e93f in build/nvfuser_tests)
frame #13: <unknown function> + 0x142f345 (0x556cdca8f345 in build/nvfuser_tests)
frame #14: <unknown function> + 0x143f86c (0x556cdca9f86c in build/nvfuser_tests)
frame #15: <unknown function> + 0x1458e98 (0x556cdcab8e98 in build/nvfuser_tests)
frame #16: <unknown function> + 0x1452ac7 (0x556cdcab2ac7 in build/nvfuser_tests)
frame #17: <unknown function> + 0x143de6d (0x556cdca9de6d in build/nvfuser_tests)
frame #18: <unknown function> + 0x1407ca0 (0x556cdca67ca0 in build/nvfuser_tests)
frame #19: <unknown function> + 0x1407c19 (0x556cdca67c19 in build/nvfuser_tests)
frame #20: <unknown function> + 0x29d90 (0x7f616c7d4d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f616c7d4e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #22: <unknown function> + 0x11e9d5 (0x556cdb77e9d5 in build/nvfuser_tests)
" thrown in the test body.

To reproduce: NVFUSER_TEST_RANDOM_SEED=1711120799 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='SerialGridReductionTest.Scheduling'
[  FAILED  ] SerialGridReductionTest.Scheduling (5669 ms)
[----------] 1 test from SerialGridReductionTest (5669 ms total)
```
This test runs with serial grid reduction, then with `gridReduce`. Each
time it runs two grid reductions. Both serial grid reductions succeed
because the semaphore buffer is properly zeroed. The `gridReduce` run
succeeds the first time since the memory pool calls `at::zeros` again to
request a larger buffer size (`gridReduce` requires more semaphores,
since there is one per thread segment vs. one per block segment).
However, the second call to `gridReduce` fails because it has not
cleaned up its semaphores. Hacking that function to force `PERSISTENT=1`
would clean up the semaphores, resulting in success in this case. I'm
leaving those kinds of modifications for a follow-up.
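For reference, here is a minimal sketch of the cleanup discipline described above. This is illustrative CUDA, not nvFuser's actual `gridReduce` or serial grid reduction code; the kernel name and shapes are hypothetical. The last block to arrive combines the partial results and then stores zero back to the semaphore, so the arena really is all-zero again when the kernel exits:

```
#include <cuda_runtime.h>

// Hypothetical grid reduction that leaves its semaphore zeroed on exit.
// `semaphore` comes from the zeroed arena and must be zero again when
// the kernel finishes. Assumes power-of-two blockDim.x and a 1D grid.
__global__ void reduceWithCleanup(float* out,
                                  const float* in,
                                  float* partials,
                                  unsigned* semaphore) {
  extern __shared__ float smem[];
  int tid = threadIdx.x;

  // Phase 1: block-local tree reduction into partials[blockIdx.x].
  smem[tid] = in[blockIdx.x * blockDim.x + tid];
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
      smem[tid] += smem[tid + s];
    }
    __syncthreads();
  }

  bool is_last = false;
  if (tid == 0) {
    partials[blockIdx.x] = smem[0];
    __threadfence();  // publish the partial before signaling arrival
    // The semaphore arrived zeroed, so the block that bumps the count
    // to gridDim.x - 1 is the last one to finish phase 1.
    is_last = (atomicAdd(semaphore, 1u) == gridDim.x - 1);
  }

  // Phase 2: only the last block combines the partials.
  if (__syncthreads_or(is_last) && tid == 0) {
    float total = 0.f;
    for (unsigned b = 0; b < gridDim.x; ++b) {
      total += partials[b];
    }
    *out = total;
    *semaphore = 0;  // cleanup: the arena is reusable without a memset
  }
}
```

No block ever waits on the semaphore here, so the cleanup store cannot hang anyone; the difficulty with `gridReduce` (see the comment below) comes from semaphores that peer blocks still spin on.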
jacobhinkle linked a pull request (#2081) on Apr 16, 2024 that will close this issue
jacobhinkle commented

This is complete for split-K matmul (serial grid reduction). However, we still launch a memset kernel before doing grid reductions. I have attempted to modify gridReduce to clean up the semaphore, but have not yet found a working approach that doesn't deadlock. Until then, I'm leaving this open.
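To illustrate the hazard (my reconstruction of a plausible failure mode, not the actual `gridReduce` code): when peer blocks spin on the same semaphore that the cleanup writes to, an early reset can re-arm a peer's wait loop and hang the grid.

```
// Hypothetical all-block barrier showing why naive in-kernel cleanup
// can deadlock. `semaphore` starts at zero.
__global__ void barrierWithNaiveCleanup(unsigned* semaphore) {
  if (threadIdx.x == 0) {
    unsigned ticket = atomicAdd(semaphore, 1u);  // count arriving blocks
    // Spin until all gridDim.x blocks have arrived.
    while (atomicAdd(semaphore, 0u) < gridDim.x) {
    }
    // Naive cleanup: the last arriver resets the flag for reuse. This
    // races with the spin loop above -- if the reset lands before some
    // peer block has observed the full count, that peer spins forever.
    if (ticket == gridDim.x - 1) {
      atomicExch(semaphore, 0u);
    }
  }
  __syncthreads();
}
```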
