[Issue]: AGPR Usage Issue in HipKittens Attention Backwards Pass

### Problem Description

**Context**: This issue describes behavior observed in the HipKittens framework (https://github.com/HazyResearch/HipKittens) when compiling kernels with ROCm/LLVM.

## Summary

Based on our observations, the attention backwards kernel in HipKittens appears to require explicit use of Accumulator GPRs (AGPRs) to achieve optimal performance. Our understanding is that these registers will **not be utilized unless manually specified** in the inline assembly code using the clobber functions provided in HipKittens' `include/common/macros.cuh`. We would appreciate confirmation or correction of this behavior.

## Background

### What are AGPRs?

AGPRs (Accumulator General Purpose Registers) are a special class of registers on AMD CDNA architectures (MI300X, MI350X). They provide:

- 256 additional registers (a0-a255) beyond the standard VGPRs (v0-v255)
- Optimized data paths for MFMA (Matrix Fused Multiply-Add) operations
- Reduced register pressure when performing complex matrix operations

### Register Encoding Convention in HipKittens

In the HipKittens codebase, we use a special encoding to distinguish between VGPRs and AGPRs:
- **VGPRs**: Register numbers 0-255 (e.g., `v[0]` through `v[255]`)
- **AGPRs**: Register numbers 256-511 (e.g., register 256 maps to `a[0]`, 384 maps to `a[128]`, etc.)

## The Problem

### Observed Compiler Behavior

In our testing, we have observed that the ROCm LLVM compiler **does not appear to automatically allocate or use AGPRs** even when:
1. The kernel has high register pressure
2. MFMA instructions are present that could benefit from accumulator registers
3. The `__attribute__((amdgpu_num_vgpr(N)))` attribute is specified

We would like to understand if this is expected behavior or if there are compiler flags or attributes we may be missing that would enable automatic AGPR allocation.

### Current Implementation in HipKittens Attention Backwards

In HipKittens' `kernels/attn/gqa_backwards/attn_bkwd_non_causal.cpp`, we explicitly use AGPRs for several critical register tiles:

```cpp
// AGPR allocations (register numbers >= 256)
using Q_ranges = ducks::art::split_many_t<ducks::art::type_list<ducks::art::range<368, 383>>, 4>; // a[112:127]
using K_ranges = ducks::art::split_many_t<ducks::art::type_list<ducks::art::range<256, 303>, ...>, 4>; // a[0:47] + VGPRs
using V_ranges = ducks::art::split_many_t<ducks::art::type_list<ducks::art::range<304, 367>>, 4>; // a[48:111]
using dK_ranges = ducks::art::split_many_t<ducks::art::type_list<ducks::art::range<384, 511>>, 16>; // a[128:255]
```

These ranges are then used with explicit clobber calls:

```cpp
ducks::art::clobber<Q_ranges>();
ducks::art::clobber<K_ranges>();
ducks::art::clobber<V_ranges>();
ducks::art::clobber<dK_ranges>();
```
to clobber the register "a" or "v" based on the index.
## How Automatic AGPR Clobbering Works in HipKittens

### The Clobber Mechanism

When register tiles are defined with GPR indices >= 256, the `ducks::art::clobber<>` template automatically invokes the appropriate AGPR clobber functions. This happens through compile-time template metaprogramming:

1. **Register Range Definition**: When you define a range like `ducks::art::range<384, 511>`, this specifies registers a[128:255] (384-256=128, 511-256=255)

2. **Automatic Clobber Invocation**: The `ducks::art::clobber<>` template iterates through all registers in the range and calls `clobber_gpr<N>()` for each one

3. **Compile-Time Branching**: Each `clobber_gpr<N>()` call uses `if constexpr (GPR >= 256)` to determine whether to emit AGPR or VGPR clobber assembly

### The Clobber Function Implementation in HipKittens

The clobber functions in HipKittens' `include/common/macros.cuh` use inline assembly to inform the compiler about register usage:

```cpp
template<int GPR>
__device__ __forceinline__ void clobber_gpr() {
  if constexpr (GPR >= 256) {
    constexpr int reg = GPR - 256;  // Convert to AGPR index (0-255)
    switch (reg) {
      CLOBBER_AREG_CASE(0)  // asm volatile("" ::: "a0");
      CLOBBER_AREG_CASE(1)  // asm volatile("" ::: "a1");
      // ... up to a255
    }
  } else {
    constexpr int reg = GPR;  // Use VGPR index directly (0-255)
    switch (reg) {
      CLOBBER_VREG_CASE(0)  // asm volatile("" ::: "v0");
      CLOBBER_VREG_CASE(1)  // asm volatile("" ::: "v1");
      // ... up to v255
    }
  }
}
```

### Example: Automatic AGPR Clobbering in Action

```cpp
// Define a register tile using AGPRs (indices 384-399 map to a[128:143])
using dK_ranges = ducks::art::split_many_t<
  ducks::art::type_list<ducks::art::range<384, 511>>, 16
>;

// This single call automatically clobbers a[128] through a[255]
ducks::art::clobber<dK_ranges>();

// Internally, this expands to calls like:
// clobber_gpr<384>() -> emits asm volatile("" ::: "a128");
// clobber_gpr<385>() -> emits asm volatile("" ::: "a129");
// ... and so on for all 128 registers
```

The key insight is that **the developer only needs to specify GPR indices >= 256 in the range definition**, and the clobber mechanism automatically handles the rest.

## Our Understanding of Why Manual Specification Appears Necessary

### Explicit Register Specification in Operations

All operations in HipKittens' `macros.cuh` that support AGPRs use compile-time conditionals to generate the correct assembly:

```cpp
template<int GPR_START>
__device__ __forceinline__ void ds_read_b128(const uint32_t smem_ptr, const int offset) {
  constexpr int GPR_END = GPR_START + 3;
  
  if constexpr (GPR_START >= 256) {
    // Use AGPR syntax: a[N:M]
    asm volatile("ds_read_b128 a[%0:%1], %2 offset:%3"
      : : "n"(GPR_START - 256), "n"(GPR_END - 256), "v"(smem_ptr), "i"(offset)
      : "memory");
  } else {
    // Use VGPR syntax: v[N:M]
    asm volatile("ds_read_b128 v[%0:%1], %2 offset:%3"
      : : "n"(GPR_START), "n"(GPR_END), "v"(smem_ptr), "i"(offset)
      : "memory");
  }
}
```

### MFMA Instructions with Mixed Register Types

The MFMA operations support all combinations of VGPR and AGPR operands:

```cpp
template<int GPR_START_A, int GPR_START_B, int GPR_START_C, int GPR_START_D>
__device__ __forceinline__ void mfma_f32_16x16x32_bf16() {
  if constexpr (GPR_START_D >= 256 && GPR_START_A >= 256 && GPR_START_B >= 256 && GPR_START_C >= 256) {
    // All AGPRs
    asm volatile("v_mfma_f32_16x16x32_bf16 a[%0:%1], a[%2:%3], a[%4:%5], a[%6:%7]" ...);
  } else if constexpr (...) {
    // Various VGPR/AGPR combinations (16 total permutations)
  }
}
```

## Impact on Performance

### Without AGPRs
- Limited to 256 VGPRs total
- Severe register spilling to shared/global memory
- Reduced occupancy due to high VGPR usage
- Significantly degraded performance

### With AGPRs
- Access to 512 total registers (256 VGPRs + 256 AGPRs)
- Eliminates register spilling for accumulation buffers
- Maintains high occupancy
- Optimal performance for complex kernels like attention backwards

## Current Usage in HipKittens Attention Backwards

The HipKittens attention backwards kernel uses AGPRs for:

1. **Input tiles**: `Q_i` (a[112:127]), `K_j` (a[0:47]), `V_j` (a[48:111])
2. **Gradient accumulators**: `dK_j_T` (a[128:255]) - 128 registers for accumulating dK gradients
3. **Intermediate results**: Various AGPR ranges for temporary computations

These allocations are critical because:
- The kernel processes 4 dot slices per iteration
- Multiple gradient tensors (dQ, dK, dV) must be accumulated simultaneously
- Without AGPRs, the kernel would require ~400+ VGPRs, causing massive spilling

**We would greatly appreciate any feedback on:**
- Whether our understanding of the compiler's AGPR allocation behavior is correct
- If there are recommended approaches or compiler features we should be using instead
- Whether automatic AGPR allocation is a feature that could be considered for future compiler versions


### Operating System

Ubuntu 22.04.5 LTS (Jammy Jellyfish)

### CPU

AMD EPYC 9575F 64-Core Processor

### GPU

AMD Instinct MI350

### ROCm Version

ROCM 7.0.0

### ROCm Component

_No response_

### Steps to Reproduce

_No response_

### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

_No response_

### Additional Information

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Issue]: AGPR Usage Issue in HipKittens Attention Backwards Pass #1003

Problem Description

Summary

Background

What are AGPRs?

Register Encoding Convention in HipKittens

The Problem

Observed Compiler Behavior

Current Implementation in HipKittens Attention Backwards

How Automatic AGPR Clobbering Works in HipKittens

The Clobber Mechanism

The Clobber Function Implementation in HipKittens

Example: Automatic AGPR Clobbering in Action

Our Understanding of Why Manual Specification Appears Necessary

Explicit Register Specification in Operations

MFMA Instructions with Mixed Register Types

Impact on Performance

Without AGPRs

With AGPRs

Current Usage in HipKittens Attention Backwards

Operating System

CPU

GPU

ROCm Version

ROCm Component

Steps to Reproduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

Additional Information

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Issue]: AGPR Usage Issue in HipKittens Attention Backwards Pass #1003

Description

Problem Description

Summary

Background

What are AGPRs?

Register Encoding Convention in HipKittens

The Problem

Observed Compiler Behavior

Current Implementation in HipKittens Attention Backwards

How Automatic AGPR Clobbering Works in HipKittens

The Clobber Mechanism

The Clobber Function Implementation in HipKittens

Example: Automatic AGPR Clobbering in Action

Our Understanding of Why Manual Specification Appears Necessary

Explicit Register Specification in Operations

MFMA Instructions with Mixed Register Types

Impact on Performance

Without AGPRs

With AGPRs

Current Usage in HipKittens Attention Backwards

Operating System

CPU

GPU

ROCm Version

ROCm Component

Steps to Reproduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

Additional Information

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions