Kernel args patch to show zero_init buffer (#1809)
Updated the kernel args print to indicate zero_init buffers, which explains the extra elementwise (fill) kernels seen before the fused kernel.

Changes from this

```
Reduction and semaphore buffers:
  Float [16]
  Long [1]
```

To

```
Reduction and semaphore buffers:
  Float [16] is_zero_initialized: 0
  Long [1] is_zero_initialized: 1
```

An `is_zero_initialized: 1` on a given buffer means an extra init kernel is needed to zero that buffer before the codegen kernel runs.
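The new printout format can be sketched in Python (an illustrative sketch only, not nvfuser's actual C++ code; the function name is made up for this example):

```python
# Illustrative sketch of the new buffer printout: each reduction/semaphore
# buffer carries a zero_init flag; a value of 1 signals that a separate fill
# kernel must zero the buffer before the codegen kernel runs.
def format_buffer_args(buffers):
    lines = ["Reduction and semaphore buffers:"]
    for dtype, sizes, zero_init in buffers:
        lines.append(f"  {dtype} {sizes} is_zero_initialized: {int(zero_init)}")
    return "\n".join(lines)

print(format_buffer_args([("Float", [16], False), ("Long", [1], True)]))
```

Running this reproduces the "after" output shown above, with the `Long [1]` semaphore buffer flagged as zero-initialized.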
jjsjann123 committed Jul 7, 2022
1 parent 037a75a commit 3ed8330
Showing 2 changed files with 10 additions and 3 deletions.
2 changes: 1 addition & 1 deletion torch/csrc/jit/codegen/cuda/README.md
@@ -187,7 +187,7 @@ There are a few debug dumps that can be turned on via environment variables. Loo
 1. `dump_eff_bandwidth`: print out the effective bandwidth of each generated kernel. This naively measures kernel time divided by I/O buffer size and is a good, simple performance metric for bandwidth-bound kernels
 2. `cuda_kernel`: print out generated cuda kernels
 3. `launch_param`: print out the launch config of generated kernels
-4. `print_args`: print out input output tensors of executed codegen kernels
+4. `kernel_args`: print out input/output/buffer tensors of all executed codegen kernels; note that for buffers, we indicate whether they are zero-initialized, which hints at an extra kernel to fill the tensor before the codegen kernel
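Enabling these dumps might look like the following (a sketch: `PYTORCH_NVFUSER_DUMP` is the environment variable the CUDA fuser reads for these options in PyTorch of this era, but verify the name against your PyTorch version):

```shell
# Enable one or more dump options (comma-separated) before running a script
# that exercises the CUDA fuser. The variable name is assumed here, not
# stated in this diff; the option names come from the list above.
export PYTORCH_NVFUSER_DUMP=kernel_args,launch_param
```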

### FAQs

11 changes: 9 additions & 2 deletions torch/csrc/jit/codegen/cuda/executor.cpp
@@ -790,13 +790,15 @@ std::vector<at::Tensor> FusionExecutor::runFusion(
             at::TensorOptions()
                 .dtype(executor_entry->buffer_types[i])
                 .device(options_.device)));
+        global_buffers.zero_init.push_back(true);
       } else {
         global_buffers.buffers.push_back(at::native::empty_cuda(
             executor_entry->buffer_sizes[i],
             executor_entry->buffer_types[i],
             c10::nullopt,
             options_.device,
             c10::nullopt));
+        global_buffers.zero_init.push_back(false);
       }
     }
@@ -984,9 +986,14 @@ std::vector<at::Tensor> FusionExecutor::runFusion(
               << " (strides = " << output.strides() << ")" << std::endl;
     }
     std::cout << "Reduction and semaphore buffers:" << std::endl;
-    for (const auto& buffer : global_buffers.buffers) {
+    TORCH_INTERNAL_ASSERT(
+        global_buffers.buffers.size() == global_buffers.zero_init.size(),
+        "global_buffer buffer & zero_init container should have identical sizes");
+    for (const auto i : c10::irange(global_buffers.buffers.size())) {
+      const auto& buffer = global_buffers.buffers[i];
+      const auto& zero_init = global_buffers.zero_init[i];
       std::cout << "  " << buffer.scalar_type() << " " << buffer.sizes()
-                << std::endl;
+                << " is_zero_initialized: " << zero_init << std::endl;
     }
}

