Deepseek-v3 Batch Invariant on 8xH100 #26609
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This is an impressive and comprehensive pull request that systematically tackles the challenge of achieving bitwise batch invariance for Deepseek-v3. The changes are well-thought-out, spanning from low-level kernel modifications and environment variable settings to high-level configuration overrides. The approach of centralizing the batch invariance logic and providing deterministic implementations for key operations like matmul, softmax, and RMSNorm is excellent. The significantly improved and more rigorous test suite is also a major contribution that will help ensure correctness and prevent regressions.
I have found one critical issue in the moe_align_block_size implementation where a deterministic path seems to be unintentionally disabled. Please see the specific comment for details.
Overall, this is a high-quality contribution that will be very valuable for reproducible research and production use cases. Great work on this complex feature!
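As background on why deterministic matmul, softmax, and RMSNorm implementations matter here (my illustration, not code from this PR): floating-point addition is not associative, so changing the reduction order — which is what a different batch size does to kernel tiling — can change results in the last bits, even though the math is identical. A minimal Python sketch:

```python
import random

# Illustrative sketch (mine, not from this PR): two reduction orders over
# the same numbers can disagree bitwise, which is exactly what varying
# tile shapes / split-K do inside a matmul when batch size changes.
rng = random.Random(0)
vals = [rng.uniform(-1.0, 1.0) for _ in range(100_000)]

serial = sum(vals)  # one left-to-right reduction order

# Simulate a different tiling: sum fixed-size chunks, then combine.
chunks = [sum(vals[i:i + 1024]) for i in range(0, len(vals), 1024)]
tiled = sum(chunks)

# The two sums agree to far more digits than any model needs, yet they
# may still differ in the final bits.
print(serial - tiled)
```

Batch invariance means pinning one reduction order regardless of batch size, so the "serial" and "tiled" paths above can never be mixed between runs.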
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:

- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"

If Codex has suggestions, it will comment; otherwise it will react with 👍.
Signed-off-by: Bram Wasti <bwasti@meta.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: Bram Wasti <bwasti@meta.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
Context: #24583 (comment)
          
Hi @SmartManoj ~ Yes, I noticed this PR. As I mentioned when consulting with you earlier, my vLLM was updated to the latest version on the main branch yesterday and already includes this commit. However, after reviewing Batch-invariant Inference (view), I found that this target does not appear to be in a
          
I made changes to the code and got it running. I am using Qwen3-30B-A3B with TP=8. The results are as follows:
Are there any tracebacks before that? Would you test invariance_test.py?
Signed-off-by: Bram Wasti <bwasti@meta.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Signed-off-by: Bram Wasti <bwasti@meta.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>

This PR replaces #26136 with a far more rigorous set of tests and implementation choices. It is, somewhat unfortunately, quite big. I will try to make it smaller over the weekend but would appreciate some initial eyes! (I'll also do a writeup, because this was quite a journey to get working and I think folks would benefit from it.)
Adds batch-invariant support for FLASH_ATTN, rms_norm, batched matmul, linear, fused_moe (the actual Triton implementation, not the native one), FLASH_ATTN_MLA, TRITON_MLA, and allreduce (on NCCL, not the custom all-reduce).
It also attempts to configure all relevant flags across the stack (including environment variables) so users don't need to specify things like "disable_custom_ar" and "enforce_eager" themselves.
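As a usage sketch of that "one switch configures everything" goal — the variable name and serve invocation below are my assumptions, so check the merged vLLM docs for the exact spelling:

```shell
# Hypothetical single switch; the PR's intent is that this alone selects
# the deterministic kernels and sets the dependent flags (eager mode,
# custom all-reduce disabled) without the user listing them one by one.
export VLLM_BATCH_INVARIANT=1
vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8
```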
Purpose
Fully support Deepseek-v3 batch invariance on 8xH100s. This has a large impact on mainstream models, including full multi-GPU support for models like Qwen3-30B-A3B.
Test Plan
The biggest test runs hundreds of queries, both individually and in batched form, and achieves exact bitwise alignment across every generated token (including sampling with temp=0.6).
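The shape of that test can be sketched as follows (a hypothetical harness of my own, not the PR's actual invariance_test.py): generate each prompt alone, then all prompts in one batch, and require exact token-for-token equality between the two runs.

```python
# Hypothetical harness (names are mine): a model is batch-invariant when
# running a prompt alone and running it inside a batch produce bitwise
# identical token sequences.
def bitwise_batch_invariant(generate, prompts):
    """generate(prompts) -> one list of token ids per prompt."""
    solo = [generate([p])[0] for p in prompts]
    batched = generate(prompts)
    return all(s == b for s, b in zip(solo, batched))

# Toy deterministic "model": token ids depend only on the prompt itself,
# so batching cannot change the output and the check passes.
def toy_generate(prompts):
    return [[ord(c) for c in p] for p in prompts]

print(bitwise_batch_invariant(toy_generate, ["hello", "world"]))  # True
```

With temp=0.6 sampling, the comparison still works because a seeded sampler over bitwise-identical logits draws identical tokens.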
Test Result
Pass.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.