
Conversation

@vllmellm (Contributor) commented Apr 1, 2025

Description

This PR integrates AITER ops into vLLM to improve MLA functionality, using AITER's flash_attn_varlen_func and mla_decode_fwd. It also allows any upcoming optimizations in AITER kernels to be used and evaluated directly within the vLLM framework.

Implementation

ROCM_AITER_MLA is introduced as an additional attention backend type for the ROCm platform.
To support this backend, the classes below are implemented in vllm/attention/backends/rocm_aiter_mla.py:

  • AiterMLABackend inherits from MLACommonBackend.
  • AiterMLAMetadata inherits from MLACommonMetadata: note that the advance_step function in this class uses the advance_step_flashinfer function from vLLM custom ops.
  • AiterMLAMetadataBuilder inherits from MLACommonMetadataBuilder.
  • AiterMLAState inherits from MLACommonState.
  • AiterMLAImpl inherits from MLACommonImpl (see the sketch after this list).
    Important notes for this class:
    • The flash_attn_varlen_func (FA function) used in this class is the AITER FA implementation (flash_attn_varlen_func from the AITER package).
    • The _forward_decode function in this class uses the mla_decode_fwd kernel from the AITER package.
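
The sketch below shows, under stated assumptions, where the two AITER entry points plug into the implementation class. The base class here is a local stand-in rather than the real MLACommonImpl, the aiter import paths are assumptions, and the decode arguments are elided; only the function names come from this PR.

```python
# Hedged sketch: stand-in base class, assumed aiter import paths, elided signatures.
try:
    from aiter import flash_attn_varlen_func   # AITER FA implementation (path assumed)
    from aiter.mla import mla_decode_fwd        # AITER MLA decode kernel (path assumed)
except ImportError:                             # keeps the sketch importable without AITER
    flash_attn_varlen_func = mla_decode_fwd = None


class MLACommonImplStandIn:
    """Local stand-in for the common MLA implementation class."""

    def _forward_decode(self, *args, **kwargs):
        raise NotImplementedError


class AiterMLAImplSketch(MLACommonImplStandIn):
    def __init__(self) -> None:
        # Prefill uses AITER's varlen flash attention instead of the default FA.
        self.flash_attn_varlen_func = flash_attn_varlen_func

    def _forward_decode(self, *args, **kwargs):
        # Decode dispatches to the AITER kernel; the real override builds the
        # arguments (queries, paged KV cache, index pointers) from the metadata.
        return mla_decode_fwd(*args, **kwargs)
```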

The MLACommon module has been refactored to reduce code duplication in subclasses of advance_step: the call to ops.advance_step_flashattn is moved into a separate function, _ops_advance_step, that can be overridden by subclasses.
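
As a rough illustration of this hook pattern (stand-in classes and print placeholders, not the real vLLM ops or metadata classes):

```python
class MLACommonMetadataSketch:
    def advance_step(self, num_seqs: int, num_queries: int) -> None:
        # Shared pre-checks stay in the base class...
        assert num_seqs >= num_queries
        # ...while the kernel dispatch is isolated in an overridable hook.
        self._ops_advance_step(num_seqs, num_queries)

    def _ops_advance_step(self, num_seqs: int, num_queries: int) -> None:
        print("base: would call ops.advance_step_flashattn")


class AiterMLAMetadataSketch(MLACommonMetadataSketch):
    def _ops_advance_step(self, num_seqs: int, num_queries: int) -> None:
        # The AITER backend swaps in ops.advance_step_flashinfer here.
        print("aiter: would call ops.advance_step_flashinfer")
```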

To enable the backend, set the environment variable VLLM_ATTN_BACKEND to ROCM_AITER_MLA.
If the backend is not specified, rocm.py in vllm/platforms checks whether VLLM_ROCM_USE_AITER and VLLM_ROCM_USE_AITER_MLA are both enabled in order to use this backend; otherwise the selected backend is TRITON_MLA.
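
A minimal sketch of the selection rule described above, assuming simplified string inputs and illustrative environment-variable defaults; the actual logic lives in vllm/platforms/rocm.py and differs in detail:

```python
import os


def select_mla_backend(selected_backend: str | None) -> str:
    # An explicit request for the AITER backend wins.
    if selected_backend == "ROCM_AITER_MLA":
        return "ROCM_AITER_MLA"
    # Otherwise fall back to the AITER env switches (defaults here are illustrative).
    use_aiter = os.environ.get("VLLM_ROCM_USE_AITER", "0") == "1"
    use_aiter_mla = os.environ.get("VLLM_ROCM_USE_AITER_MLA", "0") == "1"
    if selected_backend is None and use_aiter and use_aiter_mla:
        return "ROCM_AITER_MLA"
    return "TRITON_MLA"
```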

Important Notes:

  • AITER MLA currently supports only block_size=1, and max_model_len=32768 has to be set (see the usage sketch after this list).
  • AITER MLA is suitable for DeepSeek models.
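
A hedged usage sketch tying the notes above together; the model name is only illustrative, and the environment variables are the ones introduced by this PR:

```python
import os

# Opt into the AITER MLA backend before vLLM selects an attention backend.
os.environ["VLLM_ATTN_BACKEND"] = "ROCM_AITER_MLA"
# Alternatively, leave VLLM_ATTN_BACKEND unset and enable both AITER switches:
# os.environ["VLLM_ROCM_USE_AITER"] = "1"
# os.environ["VLLM_ROCM_USE_AITER_MLA"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # illustrative DeepSeek MLA model
    block_size=1,                          # AITER MLA currently supports block_size=1 only
    max_model_len=32768,                   # required setting per the notes above
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```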

Testing

To ensure the correct attention backend is selected, the MLA backends have been added to the environment-variable-driven test cases in tests/kernels/test_attention_selector.py.

Performance

Benchmark Serving Results Comparison

| Metric | Triton MLA (ROCm Flash Attention) | Triton MLA (Triton Flash Attention) | ROCm AITER MLA (AITER Flash Attention) |
|---|---|---|---|
| **Overall Performance** | | | |
| Successful requests | 1000 | 1000 | 1000 |
| Benchmark duration (s) | 121.13 | 264.31 | 104.67 |
| Total input tokens | 1024000 | 1024000 | 1024000 |
| Total generated tokens | 39139 | 39899 | 40681 |
| Request throughput (req/s) | 8.26 | 3.78 | 9.55 |
| Output token throughput (tok/s) | 323.13 | 150.96 | 388.66 |
| Total token throughput (tok/s) | 8777.14 | 4025.23 | 10171.83 |
| **Time to First Token (TTFT)** | | | |
| Mean TTFT (ms) | 55437.36 | 116591.00 | 46060.49 |
| Median TTFT (ms) | 51164.28 | 101109.93 | 42263.81 |
| P99 TTFT (ms) | 114009.92 | 256545.57 | 96858.61 |
| **Time per Output Token (TPOT), excl. 1st** | | | |
| Mean TPOT (ms) | 2053.40 | 5737.02 | 2360.18 |
| Median TPOT (ms) | 713.78 | 1736.03 | 768.247 |
| P99 TPOT (ms) | 8271.61 | 23386.95 | 15863.46 |
| **Inter-token Latency (ITL)** | | | |
| Mean ITL (ms) | 534.54 | 1356.84 | 474.84 |
| Median ITL (ms) | 106.89 | 109.63 | 106.23 |
| P99 ITL (ms) | 7176.11 | 17639.02 | 5287.23 |

Lm Eval Results

| Tasks | Version | Filter | n-shot | Metric | Value (Without AITER) | Stderr (Without AITER) | Value (With AITER) | Stderr (With AITER) |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match ↑ | 0.95 | ±0.05 | 0.95 | ±0.05 |
| gsm8k | 3 | strict-match | 5 | exact_match ↑ | 0.95 | ±0.05 | 0.95 | ±0.05 |

Environment Setting

Updates in Dockerfile.rocm_base:
  • Added the AITER package.

vllmellm and others added 7 commits March 28, 2025 08:19

github-actions bot commented Apr 1, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀


mergify bot commented Apr 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vllmellm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 1, 2025
vllmellm added 3 commits April 2, 2025 04:51
@mergify mergify bot removed the needs-rebase label Apr 2, 2025
vllmellm added 3 commits April 3, 2025 04:41
@vllmellm vllmellm marked this pull request as ready for review April 3, 2025 05:44
"cpu": [],
}

DEVICE_NON_MLA_BACKENDS = {
Collaborator

nit: let's just call this DEVICE_REGULAR_ATTN_BACKENDS instead of MLA

@vllmellm (Contributor, Author) commented Apr 17, 2025

@LucasWilkinson This has been addressed. Thanks.

self.block_tables.extend([] * cuda_graph_pad_size)
num_decode_tokens = batch_size - self.num_prefill_tokens
self.slot_mapping.extend([PAD_SLOT_ID] * cuda_graph_pad_size)
self.block_tables.extend(self.__class__.BLOCK_TABLE_EXTENDER *
Collaborator

nit: why relocate these lines? Also can you please explain to me why we now need self.__class__.BLOCK_TABLE_EXTENDER

Contributor (Author)

self.__class__.BLOCK_TABLE_EXTENDER is a static class variable. The common builder previously hardcoded the extender as [] in the line below:
self.block_tables.extend([] * cuda_graph_pad_size)
In AiterMLAMetadataBuilder, graph capture needs [[]] instead of []. Moving the hardcoded extender into a class variable lets the subclass define its own extender value or simply inherit the parent's.

To review this file it is better to open the entire file, as the GitHub diff view does not represent the changes well.

Overall, as explained in the PR description summarizing the changes, some functions were refactored to accommodate the AITER MLA implementation, reduce code duplication, and allow more flexibility in subclasses.

The MLACommon module has been refactored to reduce code duplication in its subclasses. This was done by separating the attention output computation into two dedicated functions, _get_fwd_prefill_attn_output and _get_prefill_ctx_attn_output, which are used in _compute_prefill_context and _forward_prefill respectively.
Another refactoring separates the pre-assertion checks out of advance_step, so that advance_step can be overridden in subclasses without code duplication.

Contributor (Author)

@LucasWilkinson After resolving the merge conflict for this file, the only changes in common.py are as below:

  • ops.advance_step_flashattn is invoked in a separate function, _ops_advance_step, which is used by advance_step and can be overridden by subclasses.

  • use of "static" class variable as BLOCK_TABLE_EXTENDER: list[list[int]] = [] that is used to update self.block_tables in graph mode which eliminates the hardcoded "[]" self.block_tables.extend([] * cuda_graph_pad_size) to allow flexibility for the subclasses to override this update based on the class variable.

@tjtanaa tjtanaa force-pushed the aiter-mla-integration branch from 5e6ed9a to 6e48433 Compare April 17, 2025 13:17
@LucasWilkinson (Collaborator) left a comment

LGTM, thanks for the contribution

@LucasWilkinson LucasWilkinson added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 21, 2025
else:
raise ValueError(
f" The selected backend, {selected_backend.name},"
"does not support block size {block_size}.")
Collaborator

Suggested change
"does not support block size {block_size}.")
f"does not support block size {block_size}.")

Contributor (Author)

Thanks for pointing this out. I have applied the suggestion.

else:
raise ValueError(
f" The selected backend, {selected_backend.name},"
"does not support block size {block_size}."
Collaborator

Suggested change
"does not support block size {block_size}."
f"does not support block size {block_size}."

@gshtras gshtras mentioned this pull request Apr 21, 2025

mergify bot commented Apr 22, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vllmellm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 22, 2025
@mergify mergify bot removed the needs-rebase label Apr 22, 2025
… handle wrong backend selection when MLA is requested.

@vllm-bot vllm-bot merged commit 30bc3e0 into vllm-project:main Apr 22, 2025
43 of 46 checks passed
frieda-huang pushed a commit to frieda-huang/vllm that referenced this pull request Apr 23, 2025
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
adobrzyn pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 30, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025