Kernel args patch to show zero_init buffer (#1809)
Updated the kernel args print to indicate zero_init buffers, which explains the extra elementwise (fill) kernels seen before the fused kernel.

Changes from this

```
Reduction and semaphore buffers:
  Float [16]
  Long [1]
```

To

```
Reduction and semaphore buffers:
  Float [16] is_zero_initialized: 0
  Long [1] is_zero_initialized: 1
```

An `is_zero_initialized: 1` on a given buffer means an extra init kernel is needed to zero that buffer before the codegen kernel runs.
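The new printout format can be sketched in Python (an illustrative sketch only, not nvfuser's actual C++ code; the function name is made up for this example):

```python
# Illustrative sketch of the new buffer printout: each reduction/semaphore
# buffer carries a zero_init flag; a value of 1 signals that a separate fill
# kernel must zero the buffer before the codegen kernel runs.
def format_buffer_args(buffers):
    lines = ["Reduction and semaphore buffers:"]
    for dtype, sizes, zero_init in buffers:
        lines.append(f"  {dtype} {sizes} is_zero_initialized: {int(zero_init)}")
    return "\n".join(lines)

print(format_buffer_args([("Float", [16], False), ("Long", [1], True)]))
```

Running this reproduces the "after" output shown above, with the `Long [1]` semaphore buffer flagged as zero-initialized.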
jjsjann123 committed Jul 7, 2022
1 parent 037a75a commit 3ed8330
Showing 2 changed files with 10 additions and 3 deletions.
2 changes: 1 addition & 1 deletion torch/csrc/jit/codegen/cuda/README.md
@@ -187,7 +187,7 @@ There are a few debug dumps that can be turned on via environment variables. Loo
 1. `dump_eff_bandwidth`: print out the effective bandwidth of each generated kernel. This naively measures kernel time divided by I/O buffer size and is a good, simple performance metric for bandwidth-bound kernels
 2. `cuda_kernel`: print out generated cuda kernels
 3. `launch_param`: print out the launch config of generated kernels
-4. `print_args`: print out input output tensors of executed codegen kernels
+4. `kernel_args`: print out input/output/buffer tensors of all executed codegen kernels; note that for buffers, we indicate whether they are zero-initialized, which hints at an extra kernel to fill the tensor before the codegen kernel
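Enabling these dumps might look like the following (a sketch: `PYTORCH_NVFUSER_DUMP` is the environment variable the CUDA fuser reads for these options in PyTorch of this era, but verify the name against your PyTorch version):

```shell
# Enable one or more dump options (comma-separated) before running a script
# that exercises the CUDA fuser. The variable name is assumed here, not
# stated in this diff; the option names come from the list above.
export PYTORCH_NVFUSER_DUMP=kernel_args,launch_param
```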

### FAQs

11 changes: 9 additions & 2 deletions torch/csrc/jit/codegen/cuda/executor.cpp
@@ -790,13 +790,15 @@ std::vector<at::Tensor> FusionExecutor::runFusion(
             at::TensorOptions()
                 .dtype(executor_entry->buffer_types[i])
                 .device(options_.device)));
+        global_buffers.zero_init.push_back(true);
       } else {
         global_buffers.buffers.push_back(at::native::empty_cuda(
             executor_entry->buffer_sizes[i],
             executor_entry->buffer_types[i],
             c10::nullopt,
             options_.device,
             c10::nullopt));
+        global_buffers.zero_init.push_back(false);
       }
     }
@@ -984,9 +986,14 @@ std::vector<at::Tensor> FusionExecutor::runFusion(
               << " (strides = " << output.strides() << ")" << std::endl;
     }
     std::cout << "Reduction and semaphore buffers:" << std::endl;
-    for (const auto& buffer : global_buffers.buffers) {
+    TORCH_INTERNAL_ASSERT(
+        global_buffers.buffers.size() == global_buffers.zero_init.size(),
+        "global_buffer buffer & zero_init container should have identical sizes");
+    for (const auto i : c10::irange(global_buffers.buffers.size())) {
+      const auto& buffer = global_buffers.buffers[i];
+      const auto& zero_init = global_buffers.zero_init[i];
       std::cout << "  " << buffer.scalar_type() << " " << buffer.sizes()
-                << std::endl;
+                << " is_zero_initialized: " << zero_init << std::endl;
     }
}

