
[Issue]: AGPR Usage Issue in HipKittens Attention Backwards Pass #1003

@ryanswann-amd

Description

Problem Description

Context: This issue describes behavior observed in the HipKittens framework (https://github.com/HazyResearch/HipKittens) when compiling kernels with ROCm/LLVM.

Summary

Based on our observations, the attention backwards kernel in HipKittens appears to require explicit use of Accumulator GPRs (AGPRs) to achieve optimal performance. Our understanding is that these registers will not be utilized unless manually specified in the inline assembly code using the clobber functions provided in HipKittens' include/common/macros.cuh. We would appreciate confirmation or correction of this behavior.

Background

What are AGPRs?

AGPRs (Accumulator General Purpose Registers) are a special class of registers on AMD CDNA architectures (MI300X, MI350X). They provide:

  • 256 additional registers (a0-a255) beyond the standard VGPRs (v0-v255)
  • Optimized data paths for MFMA (Matrix Fused Multiply-Add) operations
  • Reduced register pressure when performing complex matrix operations

Register Encoding Convention in HipKittens

In the HipKittens codebase, we use a special encoding to distinguish between VGPRs and AGPRs (a minimal sketch of the mapping follows the list below):

  • VGPRs: Register numbers 0-255 (e.g., v[0] through v[255])
  • AGPRs: Register numbers 256-511 (e.g., register 256 maps to a[0], 384 maps to a[128], etc.)
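To make the convention concrete, here is a minimal compile-time sketch of the mapping. The gpr_info helper is a hypothetical illustration of our own, not part of HipKittens' actual API:

template<int GPR>
struct gpr_info {
  static_assert(GPR >= 0 && GPR < 512, "GPR index must be in [0, 511]");
  static constexpr bool is_agpr = (GPR >= 256);               // 256-511 -> AGPR file
  static constexpr int  phys    = is_agpr ? GPR - 256 : GPR;  // number within v[] or a[]
};

// Examples matching the convention above:
static_assert(!gpr_info<42>::is_agpr  && gpr_info<42>::phys  == 42);   // v[42]
static_assert( gpr_info<256>::is_agpr && gpr_info<256>::phys == 0);    // a[0]
static_assert( gpr_info<384>::is_agpr && gpr_info<384>::phys == 128);  // a[128]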

The Problem

Observed Compiler Behavior

In our testing, we have observed that the ROCm LLVM compiler does not appear to automatically allocate or use AGPRs even when:

  1. The kernel has high register pressure
  2. MFMA instructions are present that could benefit from accumulator registers
  3. The __attribute__((amdgpu_num_vgpr(N))) attribute is specified

We would like to understand if this is expected behavior or if there are compiler flags or attributes we may be missing that would enable automatic AGPR allocation.
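For reference, the sketch below shows the kind of annotation described in item 3 above. The kernel name, body, and attribute values are placeholders of our own, not HipKittens code:

#include <hip/hip_runtime.h>

// Placeholder kernel illustrating the attributes we experimented with.
__global__ void
__attribute__((amdgpu_num_vgpr(256)))                   // VGPR budget hint (placeholder value)
__attribute__((amdgpu_flat_work_group_size(256, 256)))  // fixed work-group size
attn_bwd_placeholder(float* out) {
  // In our testing, attributes like these did not lead to any automatic
  // AGPR allocation on the real attention backwards kernel.
  out[threadIdx.x] = 0.0f;
}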

Current Implementation in HipKittens Attention Backwards

In HipKittens' kernels/attn/gqa_backwards/attn_bkwd_non_causal.cpp, we explicitly use AGPRs for several critical register tiles:

// AGPR allocations (register numbers >= 256)
using Q_ranges = ducks::art::split_many_t<ducks::art::type_list<ducks::art::range<368, 383>>, 4>; // a[112:127]
using K_ranges = ducks::art::split_many_t<ducks::art::type_list<ducks::art::range<256, 303>, ...>, 4>; // a[0:47] + VGPRs
using V_ranges = ducks::art::split_many_t<ducks::art::type_list<ducks::art::range<304, 367>>, 4>; // a[48:111]
using dK_ranges = ducks::art::split_many_t<ducks::art::type_list<ducks::art::range<384, 511>>, 16>; // a[128:255]

These ranges are then used with explicit clobber calls:

ducks::art::clobber<Q_ranges>();
ducks::art::clobber<K_ranges>();
ducks::art::clobber<V_ranges>();
ducks::art::clobber<dK_ranges>();

Each call clobbers either an "a" (AGPR) or "v" (VGPR) register for every index in its range, depending on whether the index is at least 256.

How Automatic AGPR Clobbering Works in HipKittens

The Clobber Mechanism

When register tiles are defined with GPR indices >= 256, the ducks::art::clobber<> template automatically invokes the appropriate AGPR clobber functions. This happens through compile-time template metaprogramming:

  1. Register Range Definition: When you define a range like ducks::art::range<384, 511>, this specifies registers a[128:255] (384-256=128, 511-256=255)

  2. Automatic Clobber Invocation: The ducks::art::clobber<> template iterates through all registers in the range and calls clobber_gpr<N>() for each one (see the sketch after this list)

  3. Compile-Time Branching: Each clobber_gpr<N>() call uses if constexpr (GPR >= 256) to determine whether to emit AGPR or VGPR clobber assembly
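As a rough illustration of step 2, a clobber-style template can expand a range into per-register calls with an integer sequence. This is our own minimal sketch of the idea, not the actual ducks::art implementation:

#include <hip/hip_runtime.h>
#include <utility>

template<int GPR>
__device__ __forceinline__ void clobber_gpr();  // defined as shown in the next section

template<int FIRST, int LAST>
struct reg_range {};  // stand-in for ducks::art::range<FIRST, LAST> (inclusive)

template<int FIRST, int... Is>
__device__ __forceinline__ void clobber_each(std::integer_sequence<int, Is...>) {
  // Fold expression expands to clobber_gpr<FIRST>(), clobber_gpr<FIRST+1>(), ...
  (clobber_gpr<FIRST + Is>(), ...);
}

template<int FIRST, int LAST>
__device__ __forceinline__ void clobber_range(reg_range<FIRST, LAST>) {
  clobber_each<FIRST>(std::make_integer_sequence<int, LAST - FIRST + 1>{});
}

// e.g. clobber_range(reg_range<384, 511>{}) would clobber a[128] through a[255].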

The Clobber Function Implementation in HipKittens

The clobber functions in HipKittens' include/common/macros.cuh use inline assembly to inform the compiler about register usage:

template<int GPR>
__device__ __forceinline__ void clobber_gpr() {
  if constexpr (GPR >= 256) {
    constexpr int reg = GPR - 256;  // Convert to AGPR index (0-255)
    switch (reg) {
      CLOBBER_AREG_CASE(0)  // asm volatile("" ::: "a0");
      CLOBBER_AREG_CASE(1)  // asm volatile("" ::: "a1");
      // ... up to a255
    }
  } else {
    constexpr int reg = GPR;  // Use VGPR index directly (0-255)
    switch (reg) {
      CLOBBER_VREG_CASE(0)  // asm volatile("" ::: "v0");
      CLOBBER_VREG_CASE(1)  // asm volatile("" ::: "v1");
      // ... up to v255
    }
  }
}

Example: Automatic AGPR Clobbering in Action

// Define a register tile using AGPRs (indices 384-511 map to a[128:255])
using dK_ranges = ducks::art::split_many_t<
  ducks::art::type_list<ducks::art::range<384, 511>>, 16
>;

// This single call automatically clobbers a[128] through a[255]
ducks::art::clobber<dK_ranges>();

// Internally, this expands to calls like:
// clobber_gpr<384>() -> emits asm volatile("" ::: "a128");
// clobber_gpr<385>() -> emits asm volatile("" ::: "a129");
// ... and so on for all 128 registers

The key insight is that the developer only needs to specify GPR indices >= 256 in the range definition, and the clobber mechanism automatically handles the rest.

Our Understanding of Why Manual Specification Appears Necessary

Explicit Register Specification in Operations

All operations in HipKittens' macros.cuh that support AGPRs use compile-time conditionals to generate the correct assembly:

template<int GPR_START>
__device__ __forceinline__ void ds_read_b128(const uint32_t smem_ptr, const int offset) {
  constexpr int GPR_END = GPR_START + 3;
  
  if constexpr (GPR_START >= 256) {
    // Use AGPR syntax: a[N:M]
    asm volatile("ds_read_b128 a[%0:%1], %2 offset:%3"
      : : "n"(GPR_START - 256), "n"(GPR_END - 256), "v"(smem_ptr), "i"(offset)
      : "memory");
  } else {
    // Use VGPR syntax: v[N:M]
    asm volatile("ds_read_b128 v[%0:%1], %2 offset:%3"
      : : "n"(GPR_START), "n"(GPR_END), "v"(smem_ptr), "i"(offset)
      : "memory");
  }
}
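As a usage note, a call site might look like the following. The GPR indices and offsets are illustrative values assuming the ds_read_b128<> wrapper above, not taken from the kernel:

__device__ void load_fragments_example(uint32_t smem_ptr) {
  ds_read_b128<48>(smem_ptr, 0);    // 48  < 256 -> "ds_read_b128 v[48:51], ... offset:0"
  ds_read_b128<304>(smem_ptr, 64);  // 304 >= 256 -> "ds_read_b128 a[48:51], ... offset:64"
}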

MFMA Instructions with Mixed Register Types

The MFMA operations support all combinations of VGPR and AGPR operands:

template<int GPR_START_A, int GPR_START_B, int GPR_START_C, int GPR_START_D>
__device__ __forceinline__ void mfma_f32_16x16x32_bf16() {
  if constexpr (GPR_START_D >= 256 && GPR_START_A >= 256 && GPR_START_B >= 256 && GPR_START_C >= 256) {
    // All AGPRs
    asm volatile("v_mfma_f32_16x16x32_bf16 a[%0:%1], a[%2:%3], a[%4:%5], a[%6:%7]" ...);
  } else if constexpr (...) {
    // Various VGPR/AGPR combinations (16 total permutations)
  }
}
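For illustration, the all-AGPR branch could be written out as below. The operand widths (four registers each for the A/B fragments and the f32 accumulator of a 16x16x32 bf16 MFMA) and the constraint style are our assumptions, mirroring the ds_read_b128 wrapper above rather than quoting HipKittens' actual code:

template<int GPR_A, int GPR_B, int GPR_C, int GPR_D>
__device__ __forceinline__ void mfma_f32_16x16x32_bf16_all_agpr() {
  static_assert(GPR_A >= 256 && GPR_B >= 256 && GPR_C >= 256 && GPR_D >= 256,
                "this sketch covers only the all-AGPR branch");
  asm volatile("v_mfma_f32_16x16x32_bf16 a[%0:%1], a[%2:%3], a[%4:%5], a[%6:%7]"
    : // no constraint outputs: destination registers are named explicitly
    : "n"(GPR_D - 256), "n"(GPR_D - 256 + 3),   // D accumulator, assumed 4 regs
      "n"(GPR_A - 256), "n"(GPR_A - 256 + 3),   // A fragment, assumed 4 regs
      "n"(GPR_B - 256), "n"(GPR_B - 256 + 3),   // B fragment, assumed 4 regs
      "n"(GPR_C - 256), "n"(GPR_C - 256 + 3));  // C accumulator, assumed 4 regs
}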

Impact on Performance

Without AGPRs

  • Limited to 256 VGPRs total
  • Severe register spilling to scratch memory (backed by global memory)
  • Reduced occupancy due to high VGPR usage
  • Significantly degraded performance

With AGPRs

  • Access to 512 total registers (256 VGPRs + 256 AGPRs)
  • Eliminates register spilling for accumulation buffers
  • Maintains high occupancy
  • Optimal performance for complex kernels like attention backwards

Current Usage in HipKittens Attention Backwards

The HipKittens attention backwards kernel uses AGPRs for:

  1. Input tiles: Q_i (a[112:127]), K_j (a[0:47]), V_j (a[48:111])
  2. Gradient accumulators: dK_j_T (a[128:255]) - 128 registers for accumulating dK gradients
  3. Intermediate results: Various AGPR ranges for temporary computations

These allocations are critical because:

  • The kernel processes 4 dot slices per iteration
  • Multiple gradient tensors (dQ, dK, dV) must be accumulated simultaneously
  • Without AGPRs, the kernel would require roughly 400 or more VGPRs, well beyond the 256-entry VGPR file, causing massive spilling (see the register-count check after this list)
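As a quick sanity check on the register budget, the AGPR tiles listed above already account for the entire 256-entry accumulator file. The snippet below is pure bookkeeping of the ranges quoted earlier, not kernel code:

// Register counts implied by the ranges quoted earlier in this issue.
constexpr int q_regs  = 127 - 112 + 1;  // Q_i:    a[112:127] -> 16
constexpr int k_regs  = 47  - 0   + 1;  // K_j:    a[0:47]    -> 48
constexpr int v_regs  = 111 - 48  + 1;  // V_j:    a[48:111]  -> 64
constexpr int dk_regs = 255 - 128 + 1;  // dK_j_T: a[128:255] -> 128
static_assert(q_regs + k_regs + v_regs + dk_regs == 256,
              "these tiles alone fill the 256-entry AGPR file");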

We would greatly appreciate any feedback on:

  • Whether our understanding of the compiler's AGPR allocation behavior is correct
  • If there are recommended approaches or compiler features we should be using instead
  • Whether automatic AGPR allocation is a feature that could be considered for future compiler versions

Operating System

Ubuntu 22.04.5 LTS (Jammy Jellyfish)

CPU

AMD EPYC 9575F 64-Core Processor

GPU

AMD Instinct MI350

ROCm Version

ROCm 7.0.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
