Add FP16 configs and Support expert map #57
Conversation
| 👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add the `ready` label to the PR. 🚀 |
| @ElizaWszola @dsikka can you please take a look at the expert_map support part of the PR? Thanks 🙌 |
        
          
tests/kernels/test_cutlass_moe.py (Outdated)
Mostly refactor existing tests and add EP tests.
@ElizaWszola I see this function being called below in this file in fused_moe::__init__() - I updated this code to not raise any errors, as the expected behavior seems to be to fall back to the Triton impl. PTAL! Thanks!
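As a side note, here is a minimal sketch of the fallback pattern being described; the function name and the alignment check below are illustrative placeholders, not the actual vLLM API:

```python
import torch


def cutlass_moe_supported(hidden_size: int, intermediate_size: int) -> bool:
    """Illustrative capability check: return False instead of raising, so the
    caller can quietly fall back to the Triton MoE implementation."""
    # Hypothetical alignment requirement for the CUTLASS grouped-GEMM path.
    if hidden_size % 16 != 0 or intermediate_size % 16 != 0:
        return False
    return torch.cuda.is_available()


# Caller sketch: select the CUTLASS path only when supported; otherwise
# fall back to the Triton kernels instead of raising an error.
use_cutlass = cutlass_moe_supported(hidden_size=4096, intermediate_size=14336)
```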
good update!
@dsikka Can you please take a look at the compressed_tensors changes? Thanks!
Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>
Force-pushed from f1b859b to 1168828.
      c1 = torch.empty((m * topk, n * 2), device=device, dtype=out_dtype)
    - c2 = torch.empty((m * topk, k), device=device, dtype=out_dtype)
    + c2 = torch.zeros((m * topk, k), device=device, dtype=out_dtype)
Is this needed for correctness? `empty` should be faster than `zeros`.
OK, I see this now:
      // c2 is initialized to zeros, therefore by setting the output_permutation
      // to num_tokens, we are guaranteed to fill the moe outputs to zero
      // for "invalid" topk_ids.
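For intuition, here is a tiny standalone sketch (shapes, row indices, and the permutation are made up, not the kernel's actual data) of why c2 must start as zeros: rows that the grouped GEMM never writes are read back through the output permutation for "invalid" topk_ids, so they only come out as zero if the buffer was zero-initialized.

```python
import torch

m, topk, k = 3, 2, 4
c2 = torch.zeros((m * topk, k))            # with torch.empty these rows would hold garbage
written_rows = torch.tensor([0, 1, 2, 3])  # rows actually produced by the grouped GEMM
c2[written_rows] = torch.randn(len(written_rows), k)

# Invalid topk_ids are pointed at a row the GEMM never wrote; gathering
# through the permutation then yields zeros for those output slots.
out_perm = torch.tensor([0, 1, 2, 3, 5, 5])
moe_out = c2[out_perm]
assert torch.all(moe_out[4:] == 0)
```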
      a_map = torch.zeros((local_topk_ids.numel()),
                          dtype=torch.int32,
                          device=device)
      c_map = torch.zeros((local_topk_ids.numel()),
                          dtype=torch.int32,
                          device=device)
What about these two? Why do they need to be zeros?
`a_map` has to be zeros because we don't fill the indices related to "invalid" topk ids. `c_map` can actually be empty, since we fill all of its indices in `get_cutlass_moe_mm_data`. I'll make the change.
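A small standalone sketch of that distinction (the values and mapping are illustrative only): `a_map` is only partially written, since slots for "invalid" topk ids are skipped, so untouched slots must already be zero, whereas every slot of `c_map` gets written, so `torch.empty` suffices for it.

```python
import torch

local_topk_ids = torch.tensor([2, -1, 0, 1])  # -1 marks an "invalid" topk id
valid = local_topk_ids >= 0

# Partially filled: invalid slots are never touched, so start from zeros.
a_map = torch.zeros(local_topk_ids.numel(), dtype=torch.int32)
a_map[valid] = torch.arange(int(valid.sum()), dtype=torch.int32)

# Fully filled: every position is overwritten, so torch.empty is enough.
c_map = torch.empty(local_topk_ids.numel(), dtype=torch.int32)
c_map[:] = torch.argsort(local_topk_ids).to(torch.int32)
```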
Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>
Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>
      uint32_t const n = out_tensors.size(1);
      uint32_t const k = a_tensors.size(1);
These vars look unused now; it's OK to remove them.
| LGTM! I added a minor comment. |
| Merged through command line, I think it's safe to close now | 
| Merged through command line. |
Add new FP16 configs and support expert_map for EP.
E2E benchmark numbers: link
Micro benchmarks: link (there are some bald spots I am looking into).