Conversation
Pull request overview
This pull request fixes sliding window attention with Multi-Token Processing (MTP) in the paged attention decode implementation, adding support for KV_BLOCK_SIZE=1024 and improving the sliding window causal masking logic.
Changes:
- Added support for KV_BLOCK_SIZE=1024 in sliding window kernels with appropriate page offset calculations and windowing masks
- Fixed causal masking for sliding window to correctly handle per-query-position windows (see the mask sketch after this list)
- Reorganized kernel code for better performance by moving initialization earlier and consolidating the PS path
- Reduced MAX_CONTEXT_PARTITION_NUM from 16 to 8 to avoid exceeding shared memory limits
- Expanded test coverage for sliding window scenarios
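For reference, here is a minimal sketch of the per-query-position masking rule in plain Python. The function and argument names are illustrative, not the kernel's actual implementation: with MTP, the last `num_queries` tokens of a sequence are decoded together, and each gets a causal bound and a sliding window anchored at its own absolute position.

```python
# Sketch only (assumed names/shapes): per-query-position sliding window
# causal mask for MTP decode.
def sliding_window_causal_mask(context_len, num_queries, num_keys, sliding_window):
    mask = [[False] * num_keys for _ in range(num_queries)]
    for q in range(num_queries):
        q_pos = context_len - num_queries + q   # absolute position of query q
        for k in range(num_keys):
            causal = k <= q_pos                              # no future keys
            in_window = sliding_window <= 0 or q_pos - k < sliding_window
            mask[q][k] = causal and in_window
    return mask
```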
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| op_tests/triton_tests/test_pa_decode_gluon.py | Tightened diff tolerance from 8e-2 to 5e-2 and expanded test coverage with additional head dimensions, quantization modes, and configurations |
| aiter/ops/triton/gluon/pa_decode_gluon.py | Added KV_BLOCK_SIZE=1024 support with page offset handling, fixed sliding window causal masking, reorganized initialization code, reduced MAX_CONTEXT_PARTITION_NUM to 8, and moved PS kernel path to top of wrapper |
…kernel Co-authored-by: Cursor <cursoragent@cursor.com>
…o cache key in sliding window decode Co-authored-by: Cursor <cursoragent@cursor.com>
Motivation

`KV_BLOCK_SIZE=1024` is supported by the cache layout, but the PS (partitioned-softmax) decode path previously assumed smaller KV block sizes and could:
- mishandle `block_size=1024` when `sliding_window>0`
- fail for large `context_partition_num` due to Triton tensor size limits

This PR makes PS decode robust for `KV_BLOCK_SIZE=1024` and fixes PS reduction compilation/resource issues.

Technical Details
1) `paged_attention_decode_sliding_window`: add `KV_BLOCK_SIZE=1024` support
- Accept `KV_BLOCK_SIZE` in `[16, 64, 1024]`.
- For `KV_BLOCK_SIZE==1024`, treat the KV page as 4 tiles of 256 tokens: `KV_COMPUTE_BLOCK_SIZE = CONTEXT_PARTITION_SIZE (256)`.
- Compute `page_offset ∈ {0, 256, 512, 768}` and apply it to `stride_key_block_elem` when stepping through KV elements to match the actual key cache layout.
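As a rough illustration of the tiling above (helper and variable names here are assumptions, not the kernel's exact code), a 1024-token page is addressed as four 256-token tiles, with the tile's `page_offset` folded into the key-cache element stride:

```python
# Sketch only: map a 256-token context partition onto a 1024-token KV page.
CONTEXT_PARTITION_SIZE = 256
KV_BLOCK_SIZE = 1024

def locate_partition(partition_idx, block_table):
    token_start = partition_idx * CONTEXT_PARTITION_SIZE
    page_idx = token_start // KV_BLOCK_SIZE        # which 1024-token page
    page_offset = token_start % KV_BLOCK_SIZE      # one of {0, 256, 512, 768}
    physical_block = block_table[page_idx]
    # The kernel applies the offset via the element stride, conceptually:
    #   key_ptr = key_cache + physical_block * stride_key_block
    #                       + page_offset * stride_key_block_elem
    return physical_block, page_offset
```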
2) PS wrapper fixes
- Pass `ONE_SHOT=(num_splits <= 1)` into `paged_attention_decode_sliding_window`.
- For `KV_BLOCK_SIZE==1024`, use `waves_per_eu=1`; otherwise `waves_per_eu=4`.
- Use `num_stages=1`.
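A hedged sketch of what that wrapper-side selection amounts to (the helper name and return shape are placeholders; the real wrapper passes these as kernel constexprs and launch metadata):

```python
def select_ps_launch_params(num_splits, kv_block_size):
    """Sketch: choose PS decode launch parameters (illustrative names)."""
    one_shot = num_splits <= 1                        # ONE_SHOT=(num_splits <= 1)
    waves_per_eu = 1 if kv_block_size == 1024 else 4  # fewer waves for big pages
    return dict(ONE_SHOT=one_shot, waves_per_eu=waves_per_eu, num_stages=1)
```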
3) PS reduce kernel: avoid Triton `numel` limit and shared memory overflow
- `paged_attention_decode_ps_reduce_kernel` now reduces partitions in chunks (two-pass reduction), instead of materializing tensors sized by `next_power_of_2(context_partition_num)`.
- Each chunk covers `<= 8` partitions; this avoids `ValueError('numel (...) exceeds triton maximum tensor numel (1048576)')`, previously hit on large configurations (`qg=64, head=128`).
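To make the chunking concrete, here is a NumPy sketch of a two-pass partitioned-softmax reduction under assumed inputs (per-partition `max_logits`, `exp_sums`, and normalized partial outputs); the real kernel operates on Triton tensors, but the structure is the same: only chunk-sized tensors are ever live, so nothing scales with `next_power_of_2(context_partition_num)`.

```python
import numpy as np

CHUNK = 8  # reduce at most 8 partitions per step

def reduce_partitions(max_logits, exp_sums, partials):
    # max_logits: (P,), exp_sums: (P,), partials: (P, head_dim)
    P = max_logits.shape[0]
    global_max = -np.inf
    for s in range(0, P, CHUNK):                 # pass 1: global running max
        global_max = max(global_max, max_logits[s:s + CHUNK].max())
    acc = np.zeros(partials.shape[1])
    denom = 0.0
    for s in range(0, P, CHUNK):                 # pass 2: rescale + accumulate
        scale = np.exp(max_logits[s:s + CHUNK] - global_max) * exp_sums[s:s + CHUNK]
        acc += scale @ partials[s:s + CHUNK]
        denom += scale.sum()
    return acc / denom
```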
Test Plan
- `op_tests/triton_tests/test_pa_decode_gluon.py`: `block_size=1024`, `context_partition_size=256`, `kv_varlen=True`, `trans_v=False`
- `sliding_window=0` and `sliding_window=128`
- `batch_size=1` and `batch_size=128`
- `block_size=16` using the same harness.

Test Result
Submission Checklist