[Refactor]: Make a common MLAAttention Layer and custom OP #24620

@LucasWilkinson

Description

One long-term overarching goal is to refactor the different attention types into their own dedicated layer implementations that are subclasses of AttentionLayerBase (currently we have the subclasses MambaBase and Attention).

Currently, Attention implements both MLA and the more standard MHA/GQA/MQA schemes; we should separate MLA out into its own AttentionLayerBase subclass using the existing MultiHeadLatentAttention.
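To make the shape of this concrete, here is a minimal sketch of what such a layer could look like. This is not vLLM's actual code: AttentionLayerBase is stubbed, and the MLAAttention constructor and forward signature are assumptions made purely for illustration.

```python
# Minimal sketch only: AttentionLayerBase is stubbed and the MLAAttention
# signatures are assumptions, not vLLM's actual API.
import torch
import torch.nn as nn


class AttentionLayerBase(nn.Module):
    """Stand-in for vLLM's AttentionLayerBase interface."""

    def forward(self, *args, **kwargs) -> torch.Tensor:
        raise NotImplementedError


class MLAAttention(AttentionLayerBase):
    """Hypothetical dedicated MLA layer.

    Because the layer type itself says "this is MLA", no use_mla flag is
    needed and backend selection can be MLA-specific (benefits 1 and 2 in
    the list below).
    """

    def __init__(self, num_heads: int, kv_lora_rank: int,
                 qk_nope_head_dim: int, qk_rope_head_dim: int,
                 scale: float) -> None:
        super().__init__()
        self.num_heads = num_heads
        self.kv_lora_rank = kv_lora_rank
        self.qk_nope_head_dim = qk_nope_head_dim
        self.qk_rope_head_dim = qk_rope_head_dim
        self.scale = scale
        # An MLA-specific backend selector would run here, independent of
        # the MHA/GQA/MQA selector.

    def forward(self, q_nope: torch.Tensor, q_rope: torch.Tensor,
                kv_c: torch.Tensor, k_rope: torch.Tensor) -> torch.Tensor:
        # Interface stub: a real implementation would write kv_c/k_rope into
        # the paged KV cache and invoke the selected MLA backend.
        raise NotImplementedError
```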

This will have a few material benefits:

  1. will allow us to drop the use_mla flag from Attention
  2. will allow us to have a backend selector for MLA backends that is separate from the one for MHA/GQA/MQA (like we have for Mamba)
  3. will allow us to pull concat_and_cache_mla out of the backend into its own custom op (sketched after this list), opening up fusion opportunities with RoPE (we would like to do something similar for MHA/GQA/MQA, but that uses reshape_and_cache_flash instead of concat_and_cache_mla, so this is much cleaner if they are separate layers)
  4. will open up the opportunity for MLA to have its own custom op instead of unified_attention (sketched after this list), allowing us to potentially explore passing q_nope and q_rope independently instead of concatenated
  5. will let us look at moving the decode and prefill splitting into the torch.compiled region (sketched after this list)
    • by using a new dynamic shape (returned from a new custom op) for n_decode_tokens, we could pull the decode/prefill batch split out of the custom op
    • that would allow the GEMMs that are currently separate for prefill and decode to live in the torch.compiled region, enabling new fusion opportunities and reducing python/pointwise op overhead
    • this could be tricky with piecewise cudagraphs because there would no longer be a single op we can mark as splitting/cudagraph-unsafe
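For point 3, the following is a hedged sketch of how concat_and_cache_mla could be exposed as a standalone custom op via torch.library.custom_op. The op namespace, tensor shapes, and cache layout are assumptions; a real version would dispatch to the existing CUDA kernel rather than the pure-PyTorch body shown here.

```python
# Sketch only: namespace, shapes, and cache layout are assumptions; the real
# op would call the existing concat_and_cache_mla CUDA kernel.
import torch


@torch.library.custom_op("vllm_sketch::concat_and_cache_mla",
                         mutates_args=("kv_cache",))
def concat_and_cache_mla(kv_c: torch.Tensor, k_rope: torch.Tensor,
                         kv_cache: torch.Tensor,
                         slot_mapping: torch.Tensor) -> None:
    # Concatenate the compressed KV latent with the rotary part and scatter
    # the rows into the flattened paged cache at the given slot indices.
    entry = torch.cat([kv_c, k_rope], dim=-1)
    kv_cache.view(-1, entry.shape[-1]).index_copy_(0, slot_mapping, entry)


@concat_and_cache_mla.register_fake
def _(kv_c, k_rope, kv_cache, slot_mapping) -> None:
    # No outputs: the op only mutates kv_cache in place.
    return None
```

With the cache write visible as its own op in the compiled graph, a fusion pass could pattern-match it together with the preceding RoPE op, which is the opportunity point 3 refers to.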
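For point 4, a dedicated MLA attention custom op (used in place of unified_attention for MLA layers) could accept q_nope and q_rope as separate tensors rather than a concatenated q. Again, the name, signature, and stubbed body below are assumptions, not an actual vLLM interface.

```python
# Sketch only: op name and signature are illustrative; a real implementation
# would run the selected MLA backend (prefill and decode paths) against the
# paged KV cache.
import torch


@torch.library.custom_op("vllm_sketch::mla_attention", mutates_args=())
def mla_attention(q_nope: torch.Tensor, q_rope: torch.Tensor,
                  kv_cache: torch.Tensor, scale: float) -> torch.Tensor:
    # Interface stub: the output shape here simply mirrors q_nope; the actual
    # head dimension depends on the backend's (absorbed) MLA formulation.
    return torch.empty_like(q_nope)


@mla_attention.register_fake
def _(q_nope, q_rope, kv_cache, scale) -> torch.Tensor:
    return torch.empty_like(q_nope)
```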
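For point 5, the sketch below only illustrates the shape of the idea: if n_decode_tokens were available as a dynamic size (e.g. returned by a new custom op), the batch split and the per-branch GEMMs could be expressed inside the torch.compiled region. The function, weights, and token packing order are hypothetical.

```python
# Sketch only: assumes decode tokens are packed at the front of the batch and
# that n_decode_tokens comes from a (hypothetical) op exposing it as a
# dynamic size to the compiler.
import torch


def mla_split_gemms(hidden_states: torch.Tensor, n_decode_tokens: int,
                    w_decode: torch.Tensor,
                    w_prefill: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # Split the flattened token batch into decode and prefill parts.
    decode = hidden_states[:n_decode_tokens]
    prefill = hidden_states[n_decode_tokens:]
    # These GEMMs are currently run outside the compiled region; with a
    # dynamic split size they could be captured (and fused) inside it.
    return decode @ w_decode, prefill @ w_prefill
```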

Metadata

Labels

actionable (There is clear action for a vLLM developer to take), help wanted (Extra attention is needed)

Status

Done
