One long-term, overarching goal is to refactor the different attention types into their own dedicated layer implementations, each a subclass of `AttentionLayerBase`; currently we have the subclasses `MambaBase` and `Attention`.
Currently `Attention` implements both MLA and the more standard MHA/GQA/MQA schemes; we should separate MLA out into its own `AttentionLayerBase` subclass, using the existing `MultiHeadLatentAttention`.
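
Roughly, the target hierarchy could look like the minimal sketch below (heavily simplified; the real `AttentionLayerBase`, `Attention`, `MambaBase`, and `MultiHeadLatentAttention` classes carry far more state, this only shows where MLA would sit):

```python
import torch.nn as nn

# Simplified sketch of the intended class hierarchy; not the actual vLLM interfaces.
class AttentionLayerBase(nn.Module):
    """Common base for every attention-like layer (KV-cache spec, backend hookup, ...)."""

class Attention(AttentionLayerBase):
    """Standard MHA/GQA/MQA attention; would no longer need a use_mla flag."""

class MambaBase(AttentionLayerBase):
    """Mamba / state-space layers, already their own subclass today."""

class MultiHeadLatentAttention(AttentionLayerBase):
    """MLA as a first-class layer type, with its own backend selector and cache op."""
```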
Separating MLA out this way will have a few material benefits:
- will allow us to drop the `use_mla` flag from `Attention`
- will allow us to have a separate backend selector for MLA backends, distinct from the one for MHA/GQA/MQA (like we have for Mamba)
- will allow us to pull `concat_and_cache_mla` out of the backend into its own custom op, opening up fusion opportunities with RoPE (we would like to do something similar for MHA/GQA/MQA, but that uses `reshape_and_cache_flash` instead of `concat_and_cache_mla`, so this is much cleaner if they are separate layers); see the `concat_and_cache_mla` sketch after this list
- will open up the opportunity for MLA to have its own custom op instead of `unified_attention`, allowing us to potentially explore passing `q_nope` and `q_rope` independently instead of concatenated
- look at moving the decode/prefill splitting into the torch.compiled section: by using a new dynamic shape (returned from a new custom op) for `n_decode_tokens`, we could unwrap the decode/prefill batch split from the custom op; see the dynamic-shape sketch after this list
  - that would allow the GEMMs that are currently separate for prefill and decode to live inside the torch.compiled region, which would enable new fusion opportunities and reduce Python/pointwise-op overhead
  - this could be tricky with piecewise cudagraphs, because suddenly there won't just be a single op we can mark as splitting/cudagraph-unsafe
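
As a rough illustration of the `concat_and_cache_mla` point, a standalone custom op could look something like the pure-PyTorch stand-in below. This is not vLLM's real kernel or signature; the op name, tensor names, and shapes are simplified assumptions:

```python
import torch
from torch.library import custom_op

# Hypothetical stand-in for a standalone concat_and_cache_mla custom op,
# once the cache write is pulled out of the attention backend.
@custom_op("demo::concat_and_cache_mla", mutates_args=("kv_cache",))
def concat_and_cache_mla(
    kv_c: torch.Tensor,          # [num_tokens, kv_lora_rank] compressed latent KV
    k_pe: torch.Tensor,          # [num_tokens, rope_dim] rotary part of the key
    kv_cache: torch.Tensor,      # [num_slots, kv_lora_rank + rope_dim]
    slot_mapping: torch.Tensor,  # [num_tokens] destination slot per token
) -> None:
    # Concatenate the latent and rotary parts and scatter them into the cache.
    kv_cache[slot_mapping] = torch.cat([kv_c, k_pe], dim=-1)

@concat_and_cache_mla.register_fake
def _(kv_c, k_pe, kv_cache, slot_mapping) -> None:
    return None  # the op mutates kv_cache and returns nothing
```

Because the cache write would then be a distinct op at the graph level rather than hidden inside the backend's forward, a compiler pass could in principle fuse it with the preceding RoPE kernel, which is the motivation above.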
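
For the `n_decode_tokens` idea, the mechanism would be a custom op whose fake impl returns a tensor with an unbacked dynamic size, so torch.compile can trace the decode/prefill split instead of it happening in Python outside the graph. The sketch below uses a hypothetical op name and a toy "seq_len == 1 means decode" rule purely for illustration:

```python
import torch
from torch.library import custom_op, register_fake

@custom_op("demo::decode_token_indices", mutates_args=())
def decode_token_indices(seq_lens: torch.Tensor) -> torch.Tensor:
    # Toy rule: treat every request scheduling a single new token as decode.
    return (seq_lens == 1).nonzero(as_tuple=False).squeeze(-1)

@register_fake("demo::decode_token_indices")
def _(seq_lens: torch.Tensor) -> torch.Tensor:
    ctx = torch.library.get_ctx()
    n_decode_tokens = ctx.new_dynamic_size()  # unbacked SymInt -> dynamic shape
    return seq_lens.new_empty(n_decode_tokens, dtype=torch.long)

# Eager usage; wiring this into the compiled MLA region (and keeping piecewise
# cudagraphs happy) is exactly the open question raised in the bullets above.
seq_lens = torch.tensor([1, 7, 1, 3])
print(torch.ops.demo.decode_token_indices(seq_lens))  # tensor([0, 2])
```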
 