optimize the function of computing topk and topp in sampler. #970
Conversation
MengqingCao
left a comment
- Please add a PR description
- Run `bash format.sh` locally to fix lint failures
# Sort by local expert IDs
- sort_indices = torch.argsort(filtered_experts)
+ sort_indices = torch.argsort(filtered_experts.view(torch.float32))
Why do we change the dtype of filtered_experts to float32? And view will change the metadata of filtered_experts instead of creating a new tensor. Is this expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sort can run on AICore under float32, which gives better performance.
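For context, a minimal sketch of the bit-cast trick being discussed, assuming filtered_experts is a small non-negative int32 tensor (the actual dtype is not shown in this hunk). The view reinterprets the same storage rather than converting values, and the ordering is preserved only because the expert IDs are small non-negative integers whose float32 bit patterns sort in the same order.

```python
import torch

filtered_experts = torch.tensor([3, 0, 7, 2, 5], dtype=torch.int32)

# Bitwise reinterpretation of the int32 buffer as float32 (no copy, no value cast).
sort_indices = torch.argsort(filtered_experts.view(torch.float32))

# For small non-negative IDs the permutation matches a plain integer argsort.
assert torch.equal(sort_indices, torch.argsort(filtered_experts))
```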
vllm_ascend/ops/utils.py
Outdated
weight: torch.Tensor,
bias: Optional[torch.Tensor] = None):
import torch_npu
if torch_npu.get_npu_format(weight) != 29:
What format does 29 refer to? Let's add a comment on it.
fixed
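For reference, a hedged sketch of how the magic number could be named and documented; 29 mapping to the FRACTAL_NZ layout follows the ACL format enumeration used by torch_npu, and the helper name and body are illustrative rather than the exact code added by this PR.

```python
from typing import Optional

import torch
import torch_npu

# In the ACL format enumeration used by torch_npu, 29 denotes the FRACTAL_NZ weight layout.
ACL_FORMAT_FRACTAL_NZ = 29


def npu_matmul_add(x: torch.Tensor,
                   weight: torch.Tensor,
                   bias: Optional[torch.Tensor] = None) -> torch.Tensor:
    # Cast the weight to the NZ layout once; subsequent calls skip the conversion.
    if torch_npu.get_npu_format(weight) != ACL_FORMAT_FRACTAL_NZ:
        weight.data = torch_npu.npu_format_cast(weight.data, ACL_FORMAT_FRACTAL_NZ)
    output = torch.matmul(x, weight.t())
    if bias is not None:
        output = output + bias
    return output
```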
vllm_ascend/ops/utils.py
Outdated
return rocm_unquantized_gemm
return npu_matmul_add

unquantized_gemm = dispatch_unquantized_gemm
I guess you want to patch vllm.model_executor.layers.utils.dispatch_unquantized_gemm into a custom one?
If so, let's do this in vllm_ascend/patch and describe the details in vllm_ascend/patch/__init__.py
fixed
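A minimal sketch of the patch-module approach being requested, assuming the custom GEMM lives in vllm_ascend/ops/utils.py under the name npu_matmul_add; the exact module layout under vllm_ascend/patch is illustrative, not the PR's final structure.

```python
import vllm.model_executor.layers.utils as vllm_layer_utils

from vllm_ascend.ops.utils import npu_matmul_add  # assumed location of the custom GEMM


def dispatch_unquantized_gemm():
    # On Ascend NPUs, always route the unquantized GEMM to the NZ-aware kernel.
    return npu_matmul_add


# Monkey-patch the vLLM dispatcher so unquantized linear layers pick up the NPU
# implementation; the rationale would be documented in vllm_ascend/patch/__init__.py.
vllm_layer_utils.dispatch_unquantized_gemm = dispatch_unquantized_gemm
```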
vllm_ascend/sample/sampler.py
Outdated
s1.apply_min_p = apply_min_p
if envs.VLLM_ENABLE_TOPK_OPTIMZE:
    TopKTopPSampler.forward_native = topk_topp_forward_native
Ditto. Let's do this in vllm_ascend/patch and describe the details in vllm_ascend/patch/__init__.py
Please add a precision UT to check that _apply_top_k_top_p and apply_min_p calculate correctly.
fixed
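A sketch of the kind of precision UT being asked for, assuming apply_min_p is importable from vllm_ascend.sample.sampler and takes (self, logits, min_p); both the import path and the signature are guesses based on the diff above, not the final test added by this PR.

```python
import torch


def reference_min_p(logits: torch.Tensor, min_p: torch.Tensor) -> torch.Tensor:
    # Straightforward reference: mask out tokens whose probability is below
    # min_p times the per-row maximum probability.
    probs = torch.softmax(logits, dim=-1)
    top_probs = probs.max(dim=-1, keepdim=True).values
    tokens_to_remove = probs < (min_p.unsqueeze(-1) * top_probs)
    return logits.masked_fill(tokens_to_remove, -float("inf"))


def test_apply_min_p_matches_reference():
    from vllm_ascend.sample.sampler import apply_min_p  # assumed import path

    torch.manual_seed(0)
    logits = torch.randn(4, 128)
    min_p = torch.full((4, ), 0.1)

    expected = reference_min_p(logits.clone(), min_p)
    actual = apply_min_p(None, logits.clone(), min_p)  # assumed (self, logits, min_p) signature

    torch.testing.assert_close(actual, expected)
```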
vllm_ascend/sample/sampler.py
Outdated
# Convert logits to probability distribution
probability_values = torch.nn.functional.softmax(logits, dim=-1)
# Calculate maximum probabilities per sequence
max_probabilities = torch.amax(probability_values,
Do torch.nn.functional.softmax and torch.amax bring any performance gain compared with torch.softmax and torch.Tensor.max?
This function changes index_put to masked_fill to achieve the performance optimization. The operations mentioned in the comment are not modified.
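A minimal sketch of the index_put-to-masked_fill change the author describes; the tensor names follow the diff above, while the "before" line is my reconstruction of the indexing-based version, not a quote from the old code.

```python
import torch

logits = torch.randn(4, 128)
min_p = torch.full((4, ), 0.05)

# Convert logits to probability distribution (unchanged, per the reply above).
probability_values = torch.nn.functional.softmax(logits, dim=-1)
# Calculate maximum probabilities per sequence (unchanged, per the reply above).
max_probabilities = torch.amax(probability_values, dim=-1, keepdim=True)
tokens_to_remove = probability_values < (min_p.unsqueeze(-1) * max_probabilities)

# Before: advanced indexing, which lowers to index_put and is comparatively slow on NPU.
# logits[tokens_to_remove] = -float("inf")

# After: a single masked_fill over the boolean mask.
logits = logits.masked_fill(tokens_to_remove, -float("inf"))
```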
wangxiyuan
left a comment
Please add a DeepSeek test for multicard as well. Example: https://github.com/vllm-project/vllm-ascend/blob/main/tests/multicard/test_offline_inference_distributed.py#L49
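A sketch of what such a multicard DeepSeek test could look like, modeled on the linked test_offline_inference_distributed.py; the VllmRunner helper, the model id, and the parallel settings are assumptions for illustration, not the test that was ultimately added.

```python
import os

from tests.conftest import VllmRunner  # assumed test helper from the vllm-ascend suite


def test_models_distributed_topk() -> None:
    example_prompts = [
        "vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.",
    ]
    os.environ["VLLM_ENABLE_TOPK_OPTIMZE"] = "1"

    with VllmRunner(
            "deepseek-ai/DeepSeek-V2-Lite",  # placeholder model id
            dtype="half",
            tensor_parallel_size=4,
            distributed_executor_backend="mp",
    ) as vllm_model:
        vllm_model.generate_greedy(example_prompts, max_tokens=64)
```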
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Yikun
left a comment
LGTM except the env. Do we have any plan to remove VLLM_ASCEND_ENABLE_TOPK_OPTIMZE and enable it by default in the future?
max_tokens=64)


@patch.dict(os.environ, {"VLLM_ENABLE_TOPK_OPTIMZE": "1"})
Suggested change:
- @patch.dict(os.environ, {"VLLM_ENABLE_TOPK_OPTIMZE": "1"})
+ @patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_TOPK_OPTIMZE": "1"})
vllm_ascend/envs.py
Outdated
| "VLLM_ENABLE_TOPK_OPTIMZE": | ||
| lambda: bool(int(os.getenv("VLLM_ENABLE_TOPK_OPTIMZE", '0'))), |
| "VLLM_ENABLE_TOPK_OPTIMZE": | |
| lambda: bool(int(os.getenv("VLLM_ENABLE_TOPK_OPTIMZE", '0'))), | |
| "VLLM_ASCEND_ENABLE_TOPK_OPTIMZE": | |
| lambda: bool(int(os.getenv("VLLM_ASCEND_ENABLE_TOPK_OPTIMZE", '0'))), |
After testing, the tpu_apply_top_k_top_p function achieves optimal performance. Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com> Co-authored-by: ZhengWG <zwg0606@gmail.com>
What this PR does / why we need it?
Optimize the performance of the calculation logic in the sampler and deepseekv2.
Does this PR introduce any user-facing change?
Added the VLLM_ENABLE_TOPK_OPTIMZE config in the sampler.
How was this patch tested?
pytest test_sampler.py
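A minimal usage sketch of the new switch for offline inference; the model name is a placeholder, and the variable keeps the spelling used in this PR (before the suggested rename to VLLM_ASCEND_ENABLE_TOPK_OPTIMZE).

```python
import os

os.environ["VLLM_ENABLE_TOPK_OPTIMZE"] = "1"  # enable the optimized top-k/top-p path

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.8, top_p=0.95, top_k=40, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```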