[Refactor]: Make a common MLAAttention Layer and custom OP #24620

@LucasWilkinson

Description

One long-term overarching goal is to refactor the different attention types into their own dedicated layer implementations that are subclasses of AttentionLayerBase (currently we have the subclasses MambaBase and Attention).

Currently, Attention implements both MLA and the more standard MHA/GQA/MQA schemes; we should separate MLA out into its own AttentionLayerBase subclass using the existing MultiHeadLatentAttention.
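To make the shape of this concrete, here is a minimal sketch of what such a layer could look like. This is not vLLM's actual code: AttentionLayerBase is stubbed, and the MLAAttention constructor and forward signature are assumptions made purely for illustration.

```python
# Minimal sketch only: AttentionLayerBase is stubbed and the MLAAttention
# signatures are assumptions, not vLLM's actual API.
import torch
import torch.nn as nn


class AttentionLayerBase(nn.Module):
    """Stand-in for vLLM's AttentionLayerBase interface."""

    def forward(self, *args, **kwargs) -> torch.Tensor:
        raise NotImplementedError


class MLAAttention(AttentionLayerBase):
    """Hypothetical dedicated MLA layer.

    Because the layer type itself says "this is MLA", no use_mla flag is
    needed and backend selection can be MLA-specific (benefits 1 and 2 in
    the list below).
    """

    def __init__(self, num_heads: int, kv_lora_rank: int,
                 qk_nope_head_dim: int, qk_rope_head_dim: int,
                 scale: float) -> None:
        super().__init__()
        self.num_heads = num_heads
        self.kv_lora_rank = kv_lora_rank
        self.qk_nope_head_dim = qk_nope_head_dim
        self.qk_rope_head_dim = qk_rope_head_dim
        self.scale = scale
        # An MLA-specific backend selector would run here, independent of
        # the MHA/GQA/MQA selector.

    def forward(self, q_nope: torch.Tensor, q_rope: torch.Tensor,
                kv_c: torch.Tensor, k_rope: torch.Tensor) -> torch.Tensor:
        # Interface stub: a real implementation would write kv_c/k_rope into
        # the paged KV cache and invoke the selected MLA backend.
        raise NotImplementedError
```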

This will have a few material benefits:

  1. will allow us to drop the use_mla flag from Attention
  2. will allow us to have a backend selector for MLA backends that is separate from the one for MHA/GQA/MQA (like we have for Mamba)
  3. will allow us to pull concat_and_cache_mla out of the backend into its own custom op (sketched after this list), opening up fusion opportunities with RoPE (we would like to do something similar for MHA/GQA/MQA, but that uses reshape_and_cache_flash instead of concat_and_cache_mla, so this is much cleaner if they are separate layers)
  4. will open up the opportunity for MLA to have its own custom op instead of unified_attention (sketched after this list), allowing us to potentially explore passing q_nope and q_rope independently instead of concatenated
  5. will let us look at moving the decode and prefill splitting into the torch.compiled region (sketched after this list)
    • by using a new dynamic shape (returned from a new custom op) for n_decode_tokens, we could pull the decode/prefill batch split out of the custom op
    • that would allow the GEMMs that are currently separate for prefill and decode to live in the torch.compiled region, enabling new fusion opportunities and reducing python/pointwise op overhead
    • this could be tricky with piecewise cudagraphs because there would no longer be a single op we can mark as splitting/cudagraph-unsafe
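For point 3, the following is a hedged sketch of how concat_and_cache_mla could be exposed as a standalone custom op via torch.library.custom_op. The op namespace, tensor shapes, and cache layout are assumptions; a real version would dispatch to the existing CUDA kernel rather than the pure-PyTorch body shown here.

```python
# Sketch only: namespace, shapes, and cache layout are assumptions; the real
# op would call the existing concat_and_cache_mla CUDA kernel.
import torch


@torch.library.custom_op("vllm_sketch::concat_and_cache_mla",
                         mutates_args=("kv_cache",))
def concat_and_cache_mla(kv_c: torch.Tensor, k_rope: torch.Tensor,
                         kv_cache: torch.Tensor,
                         slot_mapping: torch.Tensor) -> None:
    # Concatenate the compressed KV latent with the rotary part and scatter
    # the rows into the flattened paged cache at the given slot indices.
    entry = torch.cat([kv_c, k_rope], dim=-1)
    kv_cache.view(-1, entry.shape[-1]).index_copy_(0, slot_mapping, entry)


@concat_and_cache_mla.register_fake
def _(kv_c, k_rope, kv_cache, slot_mapping) -> None:
    # No outputs: the op only mutates kv_cache in place.
    return None
```

With the cache write visible as its own op in the compiled graph, a fusion pass could pattern-match it together with the preceding RoPE op, which is the opportunity point 3 refers to.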
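For point 4, a dedicated MLA attention custom op (used in place of unified_attention for MLA layers) could accept q_nope and q_rope as separate tensors rather than a concatenated q. Again, the name, signature, and stubbed body below are assumptions, not an actual vLLM interface.

```python
# Sketch only: op name and signature are illustrative; a real implementation
# would run the selected MLA backend (prefill and decode paths) against the
# paged KV cache.
import torch


@torch.library.custom_op("vllm_sketch::mla_attention", mutates_args=())
def mla_attention(q_nope: torch.Tensor, q_rope: torch.Tensor,
                  kv_cache: torch.Tensor, scale: float) -> torch.Tensor:
    # Interface stub: the output shape here simply mirrors q_nope; the actual
    # head dimension depends on the backend's (absorbed) MLA formulation.
    return torch.empty_like(q_nope)


@mla_attention.register_fake
def _(q_nope, q_rope, kv_cache, scale) -> torch.Tensor:
    return torch.empty_like(q_nope)
```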
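For point 5, the sketch below only illustrates the shape of the idea: if n_decode_tokens were available as a dynamic size (e.g. returned by a new custom op), the batch split and the per-branch GEMMs could be expressed inside the torch.compiled region. The function, weights, and token packing order are hypothetical.

```python
# Sketch only: assumes decode tokens are packed at the front of the batch and
# that n_decode_tokens comes from a (hypothetical) op exposing it as a
# dynamic size to the compiler.
import torch


def mla_split_gemms(hidden_states: torch.Tensor, n_decode_tokens: int,
                    w_decode: torch.Tensor,
                    w_prefill: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # Split the flattened token batch into decode and prefill parts.
    decode = hidden_states[:n_decode_tokens]
    prefill = hidden_states[n_decode_tokens:]
    # These GEMMs are currently run outside the compiled region; with a
    # dynamic split size they could be captured (and fused) inside it.
    return decode @ w_decode, prefill @ w_prefill
```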

Metadata

Labels

actionable (There is clear action for a vLLM developer to take), help wanted (Extra attention is needed)

Status

Done
