use regs for multi reduction per block to avoid smem race #3423

liqiangxl · 2024-11-16T04:41:37Z

When total_reduction_numel <= 1024, the scheduler may use multiple reductions per block with bdimy > 1. This leads to a race condition in shared memory when using async copy. Adding cp.async.wait_all after the first async copy avoids the race, but we need to figure out the root cause before we can safely rely on it. So, for now, this PR puts all buffers in registers.

The race was detected with:

NVFUSER_DUMP=scheduler_params,cuda_to_file NVFUSER_ENABLE=kernel_debug PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer --tool racecheck --racecheck-detect-level info  ./nvfuser_tests --gtest_filter='CombinedSchedulerTest.LayerNormBackward/dtype_double_batch_216_hidden_96'
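
To make the condition in the description concrete, here is a minimal sketch of the check it implies. The function name `forceRegisterBuffers`, the constant `kMultiReductionMaxNumel`, and the signature are hypothetical and only for illustration; they are not taken from the nvFuser code base, and the actual change in this PR may be structured differently.

```cpp
#include <cstdint>

// Hypothetical constant: the 1024 threshold quoted in the PR description.
constexpr int64_t kMultiReductionMaxNumel = 1024;

// Hypothetical helper, not nvFuser's API.
// Returns true when the scheduler is in the configuration where the race was
// observed: a small reduction (total_reduction_numel <= 1024) scheduled as
// multiple reductions per block (bdimy > 1). In that case all buffers would
// be kept in registers instead of shared memory.
constexpr bool forceRegisterBuffers(
    int64_t total_reduction_numel,
    int64_t bdimy) {
  return total_reduction_numel <= kMultiReductionMaxNumel && bdimy > 1;
}
```

With this sketch, `forceRegisterBuffers(96, 4)` is true (a reduction of 96 elements with an assumed bdimy of 4), while `forceRegisterBuffers(96, 1)` and `forceRegisterBuffers(2048, 4)` are false, so the storage choice in those cases is left to the scheduler's existing logic.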

liqiangxl · 2024-11-16T04:42:00Z

!test

use regs for multi reduction per block to avoid smem race

a9396e7