- 
                Notifications
    
You must be signed in to change notification settings  - Fork 533
 
[Performance] Disable JIT and nd2nz to improve performance for Altlas 300I series #1591
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
          Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@             Coverage Diff             @@
##             main    #1591       +/-   ##
===========================================
+ Coverage   27.39%   52.35%   +24.95%     
===========================================
  Files          56       78       +22     
  Lines        6191     9633     +3442     
===========================================
+ Hits         1696     5043     +3347     
- Misses       4495     4590       +95     
 Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
  | 
    
          Before: After:  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| 
           This pull request has conflicts, please resolve those before we can evaluate the pull request.  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait for confirm the VocabParallelEmbedding
| for module in self.model.modules(): | ||
| if isinstance( | ||
| module, | ||
| (VocabParallelEmbedding, MergedColumnParallelLinear, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it is better to transfer NZ via process_weights_after_loading api. This function is called after weight-loading of models. For quantized methods, you can refer to implementations of process_weights_after_loading in vllm-ascend/vllm_ascend/quantization/w8a8.py. For unquantized methods, you can consider to patch process_weights_after_loading to UnquantizedLinearMethod.
Besides, word-embedding and lmhead both use VocabParallelEmbedding class. However, as far as I know, word-embedding calls embedding api to get embeddings according to input_ids. Since embedding api is not implemented with NZ format, it might incur an additional transdata operation to convert NZ to ND before embedding. For lmhead it is ok as lmhead is implemented with linear. You can check the profiling to see whether converting word-embedding to NZ will downgrade performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I will skip VocabParallelEmbedding in this PR and try process_weights_after_loading in new PR.
b99e976    to
    3400df0      
    Compare
  
    | 
           This pull request has conflicts, please resolve those before we can evaluate the pull request.  | 
    
Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
3400df0    to
    32aa8d8      
    Compare
  
    … 300I series (vllm-project#1591) ### What this PR does / why we need it? Since running on Altlas 300I Duo was initial supported after vllm-project#1333 , this PR will disable the JIT compiler for the 310P and changed the data format to NZ for the weight in the vocabulary embedding and QKV projection layers, which help improving performance. See vllm-project#1563 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test manually: vllm-project#1591 (comment) Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
… 300I series (vllm-project#1591) ### What this PR does / why we need it? Since running on Altlas 300I Duo was initial supported after vllm-project#1333 , this PR will disable the JIT compiler for the 310P and changed the data format to NZ for the weight in the vocabulary embedding and QKV projection layers, which help improving performance. See vllm-project#1563 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test manually: vllm-project#1591 (comment) Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
What this PR does / why we need it?
Since running on Altlas 300I Duo was initial supported after #1333 , this PR will disable the JIT compiler for the 310P and changed the data format to NZ for the weight in the vocabulary embedding and QKV projection layers, which help improving performance.
See #1563
Does this PR introduce any user-facing change?
How was this patch tested?
See #1591 (comment)