
Conversation

chang-wenbin
Collaborator

  1. Support cache initialization for the MLA backend so the KV-cache GPU memory allocation is rationalized, raising the block num from 1500 to 4500 and the supported concurrency from 45 to 145 (see the block-budget sketch after this list).
  2. Fix a bug in the v1 scheduler that allowed the number of activated tokens to exceed max-num-batched-tokens (see the scheduler sketch below).
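
The gain in item 1 comes from MLA caching one compressed latent vector per token instead of full per-head K/V, so each cache block is far smaller and the same memory budget holds many more blocks. A minimal sketch of that block-budget arithmetic; every dimension name and number below is an illustrative assumption, not FastDeploy's actual configuration:

DTYPE_BYTES = 2        # bf16
BLOCK_SIZE = 64        # tokens per cache block (assumed)
NUM_LAYERS = 61        # assumed layer count
NUM_KV_HEADS = 128     # assumed
HEAD_DIM = 128         # assumed
KV_LORA_RANK = 512     # MLA compressed latent width (assumed)
ROPE_HEAD_DIM = 64     # decoupled RoPE slice cached with the latent (assumed)

def blocks_that_fit(budget_bytes: int, per_token_elems: int) -> int:
    per_block_bytes = BLOCK_SIZE * NUM_LAYERS * per_token_elems * DTYPE_BYTES
    return budget_bytes // per_block_bytes

budget = 60 * 1024**3  # memory reserved for the KV cache (assumed)

# Full-attention layout: K and V for every head of every token.
mha_blocks = blocks_that_fit(budget, 2 * NUM_KV_HEADS * HEAD_DIM)

# MLA layout: a single latent plus RoPE slice per token, shared across heads.
mla_blocks = blocks_that_fit(budget, KV_LORA_RANK + ROPE_HEAD_DIM)

print(mha_blocks, mla_blocks)  # the MLA layout fits far more blocks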
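
Item 2 amounts to making the v1 scheduler stop admitting work once the token budget is reached. A hedged sketch of such a guard; the loop shape and the pending/num_tokens names are assumptions, not the actual scheduler code:

def schedule(pending, max_num_batched_tokens):
    scheduled, batched_tokens = [], 0
    for req in pending:
        # Without this check, the activated-token count could exceed
        # max_num_batched_tokens, which is the bug being fixed.
        if batched_tokens + req.num_tokens > max_num_batched_tokens:
            break
        batched_tokens += req.num_tokens
        scheduled.append(req)
    return scheduled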


paddle-bot bot commented Sep 29, 2025

Thanks for your contribution!

# Flag whether the MLA attention backend is selected so the KV-cache
# allocation can be sized for the MLA layout.
from fastdeploy import envs

self.mla_cache = envs.FD_ATTENTION_BACKEND == "MLA_ATTN"
Collaborator


Is this environment variable set automatically for models that use MLA, or does it need to be set manually?

Collaborator Author


At the moment it is set manually in the launch script via export FD_ATTENTION_BACKEND="MLA_ATTN".
Later the backend will be set automatically based on the model_type in config.json; that change is planned to be submitted together with enabling tensor_core by default for MLA.
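
A minimal sketch of how that planned auto-detection might look; the model_type values checked and the APPEND_ATTN fallback are assumptions for illustration, not the actual FastDeploy logic:

import json
import os

def select_attention_backend(model_dir: str) -> str:
    # Read model_type from the checkpoint's config.json and pick a
    # backend; the model_type values here are assumed examples.
    with open(os.path.join(model_dir, "config.json")) as f:
        model_type = json.load(f).get("model_type", "")
    if model_type in ("deepseek_v2", "deepseek_v3"):  # MLA-based models (assumed)
        return "MLA_ATTN"
    return "APPEND_ATTN"  # assumed default backend name

# Only set the variable if the launch script has not set it already.
os.environ.setdefault("FD_ATTENTION_BACKEND", select_attention_backend("./model"))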
