
Conversation

chang-wenbin
Collaborator

  1. Support cache initialization for the MLA backend so the KV-cache GPU memory allocation is rationalized, raising the block num from 1500 to 4500 and the supported concurrency from 45 to 145 (see the block-budget sketch after this list).
  2. Fix a bug in the v1 scheduler that allowed the number of activated tokens to exceed max-num-batched-tokens (see the scheduler sketch below).
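
The gain in item 1 comes from MLA caching one compressed latent vector per token instead of full per-head K/V, so each cache block is far smaller and the same memory budget holds many more blocks. A minimal sketch of that block-budget arithmetic; every dimension name and number below is an illustrative assumption, not FastDeploy's actual configuration:

DTYPE_BYTES = 2        # bf16
BLOCK_SIZE = 64        # tokens per cache block (assumed)
NUM_LAYERS = 61        # assumed layer count
NUM_KV_HEADS = 128     # assumed
HEAD_DIM = 128         # assumed
KV_LORA_RANK = 512     # MLA compressed latent width (assumed)
ROPE_HEAD_DIM = 64     # decoupled RoPE slice cached with the latent (assumed)

def blocks_that_fit(budget_bytes: int, per_token_elems: int) -> int:
    per_block_bytes = BLOCK_SIZE * NUM_LAYERS * per_token_elems * DTYPE_BYTES
    return budget_bytes // per_block_bytes

budget = 60 * 1024**3  # memory reserved for the KV cache (assumed)

# Full-attention layout: K and V for every head of every token.
mha_blocks = blocks_that_fit(budget, 2 * NUM_KV_HEADS * HEAD_DIM)

# MLA layout: a single latent plus RoPE slice per token, shared across heads.
mla_blocks = blocks_that_fit(budget, KV_LORA_RANK + ROPE_HEAD_DIM)

print(mha_blocks, mla_blocks)  # the MLA layout fits far more blocks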
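
Item 2 amounts to making the v1 scheduler stop admitting work once the token budget is reached. A hedged sketch of such a guard; the loop shape and the pending/num_tokens names are assumptions, not the actual scheduler code:

def schedule(pending, max_num_batched_tokens):
    scheduled, batched_tokens = [], 0
    for req in pending:
        # Without this check, the activated-token count could exceed
        # max_num_batched_tokens, which is the bug being fixed.
        if batched_tokens + req.num_tokens > max_num_batched_tokens:
            break
        batched_tokens += req.num_tokens
        scheduled.append(req)
    return scheduled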


paddle-bot bot commented Sep 29, 2025

Thanks for your contribution!

# Flag whether the MLA attention backend is selected so the KV-cache
# allocation can be sized for the MLA layout.
from fastdeploy import envs

self.mla_cache = envs.FD_ATTENTION_BACKEND == "MLA_ATTN"
Collaborator


Is this environment variable set automatically for models that use MLA, or does it need to be set manually?

Collaborator Author


At the moment it is set manually in the launch script via export FD_ATTENTION_BACKEND="MLA_ATTN".
Later the backend will be set automatically based on the model_type in config.json; that change is planned to be submitted together with enabling tensor_core by default for MLA.
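
A minimal sketch of how that planned auto-detection might look; the model_type values checked and the APPEND_ATTN fallback are assumptions for illustration, not the actual FastDeploy logic:

import json
import os

def select_attention_backend(model_dir: str) -> str:
    # Read model_type from the checkpoint's config.json and pick a
    # backend; the model_type values here are assumed examples.
    with open(os.path.join(model_dir, "config.json")) as f:
        model_type = json.load(f).get("model_type", "")
    if model_type in ("deepseek_v2", "deepseek_v3"):  # MLA-based models (assumed)
        return "MLA_ATTN"
    return "APPEND_ATTN"  # assumed default backend name

# Only set the variable if the launch script has not set it already.
os.environ.setdefault("FD_ATTENTION_BACKEND", select_attention_backend("./model"))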
