[Model] Add LongCat-Flash #23991
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of tests runs automatically. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run full CI, reviewers can add the `ready` label to the PR. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces support for the LongCat-Flash model, including its architecture and a multi-token prediction (MTP) variant for speculative decoding. The changes are extensive, touching model implementation, configuration, and fused MoE kernels. The core logic for the new model seems well-integrated, reusing components from existing models like DeepseekV2 where appropriate. My main feedback concerns a function in the Fused MoE implementation that modifies its inputs in-place, which could be a source of bugs. Overall, the PR is a significant contribution, adding a complex new model to vLLM.
    expert_indices[normal_expert_mask] = 0
    expert_scales[normal_expert_mask] = 0.0
The function zero_experts_compute_triton modifies its input tensors expert_indices and expert_scales in-place. This side effect can be unexpected and lead to bugs if the caller reuses these tensors assuming they are unchanged. While this might be an intentional optimization to avoid extra memory allocations, it makes the code harder to reason about and maintain. To improve clarity and safety, consider returning the modified tensors instead of modifying them in-place. This would make the data flow explicit.
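For illustration, a minimal NumPy sketch of the out-of-place pattern the reviewer suggests (the function and variable names mirror the PR, but the real implementation is a Triton kernel operating on torch tensors; this is only a shape of the data flow, not the actual kernel):

```python
import numpy as np

def zero_experts_out_of_place(expert_indices, expert_scales, normal_expert_mask):
    """Return NEW arrays with masked entries zeroed, leaving inputs untouched."""
    new_indices = np.where(normal_expert_mask, 0, expert_indices)
    new_scales = np.where(normal_expert_mask, 0.0, expert_scales)
    return new_indices, new_scales

indices = np.array([3, 1, 2, 0])
scales = np.array([0.5, 0.2, 0.2, 0.1])
mask = np.array([True, False, True, False])

new_indices, new_scales = zero_experts_out_of_place(indices, scales, mask)
# The caller's original `indices` and `scales` are unchanged; the zeroed
# values must be adopted explicitly, making the side effect impossible.
```

In-place mutation may still be the right trade-off for a hot Triton kernel, but if it is kept, a docstring note stating that the inputs are modified would make the contract explicit.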
Force-pushed from 9bd1622 to 5af8af3.
Created a PR to support tool calls for the LongCat-Flash-Chat model: #24083

@youkaichao @ywang96 @simon-mo Just a friendly ping on this PR when you have a moment.

Can we proceed with merging this feature?
Force-pushed from 657153c to b95456d.
Please fix the test failure:

    [2025-09-24T14:25:50Z] FAILED models/test_initialization.py::test_can_initialize_large_subset[LongCatFlashMTPModel] - AttributeError: property 'num_hidden_layers' of 'LongcatFlashConfig' object has no setter
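This kind of failure arises when a config exposes `num_hidden_layers` as a read-only property and a test harness tries to assign to it (e.g. to shrink the model for initialization tests). A hedged sketch of the usual fix, using a hypothetical config class (how `LongcatFlashConfig` actually derives the value is not shown in this thread):

```python
class LongcatFlashConfigSketch:
    """Hypothetical config: num_hidden_layers is a view over another
    attribute, so a setter is needed to route writes back to it."""

    def __init__(self, num_layers=28):
        self.num_layers = num_layers

    @property
    def num_hidden_layers(self):
        # Derived, read-only view; without a setter below, any
        # assignment raises "property ... has no setter".
        return self.num_layers

    @num_hidden_layers.setter
    def num_hidden_layers(self, value):
        # Forward writes to the underlying attribute so harnesses
        # that patch the layer count keep working.
        self.num_layers = value

cfg = LongcatFlashConfigSketch()
cfg.num_hidden_layers = 1  # raises AttributeError without the setter
```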
Head branch was pushed to by a user without write access
Force-pushed from 30fac96 to 870a2b8.
done

Great work!
PR vllm-project#23991 uses another attribute from triton.language, which causes an import error in the TPU setup. Enhance the placeholder for the TPU environment. Signed-off-by: Weida Hong <wdhongtw@google.com>
vllm-project/vllm#23991 vllm-project/vllm#25613 --------- Signed-off-by: Chendi Xue <Chendi.Xue@intel.com>
vllm-project/vllm#23991 vllm-project/vllm#25613 --------- Signed-off-by: Chendi Xue <Chendi.Xue@intel.com> Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Signed-off-by: yangxurui <yangxurui@meituan.com> Co-authored-by: yangxurui <yangxurui@meituan.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: yangxurui <yangxurui@meituan.com> Co-authored-by: yangxurui <yangxurui@meituan.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Signed-off-by: yangxurui <yangxurui@meituan.com> Co-authored-by: yangxurui <yangxurui@meituan.com>
This PR implements support for the newly released LongCat-Flash model by Meituan.
The core implementation includes: