-
Notifications
You must be signed in to change notification settings - Fork 77
Description
Problem Description
Context: This issue describes behavior observed in the HipKittens framework (https://github.com/HazyResearch/HipKittens) when compiling kernels with ROCm/LLVM.
Summary
Based on our observations, the attention backwards kernel in HipKittens appears to require explicit use of Accumulator GPRs (AGPRs) to achieve optimal performance. Our understanding is that these registers will not be utilized unless manually specified in the inline assembly code using the clobber functions provided in HipKittens' include/common/macros.cuh. We would appreciate confirmation or correction of this behavior.
Background
What are AGPRs?
AGPRs (Accumulator General Purpose Registers) are a special class of registers on AMD CDNA architectures (MI300X, MI350X). They provide:
- 256 additional registers (a0-a255) beyond the standard VGPRs (v0-v255)
- Optimized data paths for MFMA (Matrix Fused Multiply-Add) operations
- Reduced register pressure when performing complex matrix operations
Register Encoding Convention in HipKittens
In the HipKittens codebase, we use a special encoding to distinguish between VGPRs and AGPRs:
- VGPRs: Register numbers 0-255 (e.g.,
v[0]throughv[255]) - AGPRs: Register numbers 256-511 (e.g., register 256 maps to
a[0], 384 maps toa[128], etc.)
The Problem
Observed Compiler Behavior
In our testing, we have observed that the ROCm LLVM compiler does not appear to automatically allocate or use AGPRs even when:
- The kernel has high register pressure
- MFMA instructions are present that could benefit from accumulator registers
- The
__attribute__((amdgpu_num_vgpr(N)))attribute is specified
We would like to understand if this is expected behavior or if there are compiler flags or attributes we may be missing that would enable automatic AGPR allocation.
Current Implementation in HipKittens Attention Backwards
In HipKittens' kernels/attn/gqa_backwards/attn_bkwd_non_causal.cpp, we explicitly use AGPRs for several critical register tiles:
// AGPR allocations (register numbers >= 256)
using Q_ranges = ducks::art::split_many_t<ducks::art::type_list<ducks::art::range<368, 383>>, 4>; // a[112:127]
using K_ranges = ducks::art::split_many_t<ducks::art::type_list<ducks::art::range<256, 303>, ...>, 4>; // a[0:47] + VGPRs
using V_ranges = ducks::art::split_many_t<ducks::art::type_list<ducks::art::range<304, 367>>, 4>; // a[48:111]
using dK_ranges = ducks::art::split_many_t<ducks::art::type_list<ducks::art::range<384, 511>>, 16>; // a[128:255]These ranges are then used with explicit clobber calls:
ducks::art::clobber<Q_ranges>();
ducks::art::clobber<K_ranges>();
ducks::art::clobber<V_ranges>();
ducks::art::clobber<dK_ranges>();to clobber the register "a" or "v" based on the index.
How Automatic AGPR Clobbering Works in HipKittens
The Clobber Mechanism
When register tiles are defined with GPR indices >= 256, the ducks::art::clobber<> template automatically invokes the appropriate AGPR clobber functions. This happens through compile-time template metaprogramming:
-
Register Range Definition: When you define a range like
ducks::art::range<384, 511>, this specifies registers a[128:255] (384-256=128, 511-256=255) -
Automatic Clobber Invocation: The
ducks::art::clobber<>template iterates through all registers in the range and callsclobber_gpr<N>()for each one -
Compile-Time Branching: Each
clobber_gpr<N>()call usesif constexpr (GPR >= 256)to determine whether to emit AGPR or VGPR clobber assembly
The Clobber Function Implementation in HipKittens
The clobber functions in HipKittens' include/common/macros.cuh use inline assembly to inform the compiler about register usage:
template<int GPR>
__device__ __forceinline__ void clobber_gpr() {
if constexpr (GPR >= 256) {
constexpr int reg = GPR - 256; // Convert to AGPR index (0-255)
switch (reg) {
CLOBBER_AREG_CASE(0) // asm volatile("" ::: "a0");
CLOBBER_AREG_CASE(1) // asm volatile("" ::: "a1");
// ... up to a255
}
} else {
constexpr int reg = GPR; // Use VGPR index directly (0-255)
switch (reg) {
CLOBBER_VREG_CASE(0) // asm volatile("" ::: "v0");
CLOBBER_VREG_CASE(1) // asm volatile("" ::: "v1");
// ... up to v255
}
}
}Example: Automatic AGPR Clobbering in Action
// Define a register tile using AGPRs (indices 384-399 map to a[128:143])
using dK_ranges = ducks::art::split_many_t<
ducks::art::type_list<ducks::art::range<384, 511>>, 16
>;
// This single call automatically clobbers a[128] through a[255]
ducks::art::clobber<dK_ranges>();
// Internally, this expands to calls like:
// clobber_gpr<384>() -> emits asm volatile("" ::: "a128");
// clobber_gpr<385>() -> emits asm volatile("" ::: "a129");
// ... and so on for all 128 registersThe key insight is that the developer only needs to specify GPR indices >= 256 in the range definition, and the clobber mechanism automatically handles the rest.
Our Understanding of Why Manual Specification Appears Necessary
Explicit Register Specification in Operations
All operations in HipKittens' macros.cuh that support AGPRs use compile-time conditionals to generate the correct assembly:
template<int GPR_START>
__device__ __forceinline__ void ds_read_b128(const uint32_t smem_ptr, const int offset) {
constexpr int GPR_END = GPR_START + 3;
if constexpr (GPR_START >= 256) {
// Use AGPR syntax: a[N:M]
asm volatile("ds_read_b128 a[%0:%1], %2 offset:%3"
: : "n"(GPR_START - 256), "n"(GPR_END - 256), "v"(smem_ptr), "i"(offset)
: "memory");
} else {
// Use VGPR syntax: v[N:M]
asm volatile("ds_read_b128 v[%0:%1], %2 offset:%3"
: : "n"(GPR_START), "n"(GPR_END), "v"(smem_ptr), "i"(offset)
: "memory");
}
}MFMA Instructions with Mixed Register Types
The MFMA operations support all combinations of VGPR and AGPR operands:
template<int GPR_START_A, int GPR_START_B, int GPR_START_C, int GPR_START_D>
__device__ __forceinline__ void mfma_f32_16x16x32_bf16() {
if constexpr (GPR_START_D >= 256 && GPR_START_A >= 256 && GPR_START_B >= 256 && GPR_START_C >= 256) {
// All AGPRs
asm volatile("v_mfma_f32_16x16x32_bf16 a[%0:%1], a[%2:%3], a[%4:%5], a[%6:%7]" ...);
} else if constexpr (...) {
// Various VGPR/AGPR combinations (16 total permutations)
}
}Impact on Performance
Without AGPRs
- Limited to 256 VGPRs total
- Severe register spilling to shared/global memory
- Reduced occupancy due to high VGPR usage
- Significantly degraded performance
With AGPRs
- Access to 512 total registers (256 VGPRs + 256 AGPRs)
- Eliminates register spilling for accumulation buffers
- Maintains high occupancy
- Optimal performance for complex kernels like attention backwards
Current Usage in HipKittens Attention Backwards
The HipKittens attention backwards kernel uses AGPRs for:
- Input tiles:
Q_i(a[112:127]),K_j(a[0:47]),V_j(a[48:111]) - Gradient accumulators:
dK_j_T(a[128:255]) - 128 registers for accumulating dK gradients - Intermediate results: Various AGPR ranges for temporary computations
These allocations are critical because:
- The kernel processes 4 dot slices per iteration
- Multiple gradient tensors (dQ, dK, dV) must be accumulated simultaneously
- Without AGPRs, the kernel would require ~400+ VGPRs, causing massive spilling
We would greatly appreciate any feedback on:
- Whether our understanding of the compiler's AGPR allocation behavior is correct
- If there are recommended approaches or compiler features we should be using instead
- Whether automatic AGPR allocation is a feature that could be considered for future compiler versions
Operating System
Ubuntu 22.04.5 LTS (Jammy Jellyfish)
CPU
AMD EPYC 9575F 64-Core Processor
GPU
AMD Instinct MI350
ROCm Version
ROCM 7.0.0
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response