
Conversation

@momo609 (Collaborator) commented May 27, 2025

What this PR does / why we need it?

Optimize the performance of the calculation logic in the sampler and deepseekv2.

Does this PR introduce any user-facing change?

Added the VLLM_ENABLE_TOPK_OPTIMZE config option for the sampler.

How was this patch tested?

pytest test_sampler.py

@MengqingCao (Collaborator) left a comment

  1. Please add a PR description.
  2. Run bash format.sh locally to fix the lint failures.


# Sort by local expert IDs
sort_indices = torch.argsort(filtered_experts)
sort_indices = torch.argsort(filtered_experts.view(torch.float32))
Collaborator commented:

Why do we change the dtype of filtered_experts to float32? Also, view only changes the tensor's metadata instead of creating a new tensor. Is this expected?

Collaborator Author commented:

Sorting float32 data can run on the AI Core, which gives better performance.
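For context, a minimal sketch of the reinterpretation trick, assuming filtered_experts holds small non-negative int32 expert IDs (the dtype handling and device placement in the PR itself may differ):

import torch

# Reinterpret the int32 bit pattern as float32 without copying the data.
# For small non-negative int32 values (well below the inf/NaN bit patterns)
# the IEEE-754 float32 interpretation increases monotonically with the integer
# value, so the sort order is preserved while the float32 sort can be
# dispatched to the AI Core.
filtered_experts = torch.tensor([3, 0, 2, 1], dtype=torch.int32)
sort_indices = torch.argsort(filtered_experts.view(torch.float32))
print(sort_indices)  # tensor([1, 3, 2, 0])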


weight: torch.Tensor,
bias: Optional[torch.Tensor] = None):
import torch_npu
if torch_npu.get_npu_format(weight) != 29:
Collaborator commented:

What format does 29 refer to? Let's add a comment on it.
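For reference, a sketch of what the requested comment or named constant could look like; the mapping of 29 to FRACTAL_NZ and the helper below are assumptions for illustration, not the PR's exact code:

import torch
import torch_npu

# Assumption: format id 29 corresponds to ACL_FORMAT_FRACTAL_NZ, the blocked
# "NZ" weight layout preferred by the NPU matmul kernels.
ACL_FORMAT_FRACTAL_NZ = 29

def ensure_nz_weight(weight: torch.Tensor) -> torch.Tensor:
    # Cast the weight to the NZ layout only when it is not already stored that way.
    if torch_npu.get_npu_format(weight) != ACL_FORMAT_FRACTAL_NZ:
        weight = torch_npu.npu_format_cast(weight, ACL_FORMAT_FRACTAL_NZ)
    return weight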

Collaborator Author commented:

fixed

return rocm_unquantized_gemm
return npu_matmul_add

unquantized_gemm = dispatch_unquantized_gemm
Collaborator commented:

I guess you want to patch vllm.model_executor.layers.utils.dispatch_unquantized_gemm into a custom one?

If so, let's do this in vllm_ascend/patch and describe the details in vllm_ascend/patch/__init__.py.
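As an illustration only, such a patch could look roughly like this under vllm_ascend/patch; the import path of npu_matmul_add and the no-argument signature of dispatch_unquantized_gemm are assumptions, not the PR's final layout:

import vllm.model_executor.layers.utils as layer_utils

from vllm_ascend.ops import npu_matmul_add  # hypothetical import path

def dispatch_unquantized_gemm():
    # Route unquantized GEMMs to the NPU fused matmul + add kernel.
    return npu_matmul_add

# Replace vLLM's dispatcher so existing callers pick up the NPU implementation.
layer_utils.dispatch_unquantized_gemm = dispatch_unquantized_gemm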

Collaborator Author commented:

fixed


s1.apply_min_p = apply_min_p
if envs.VLLM_ENABLE_TOPK_OPTIMZE:
    TopKTopPSampler.forward_native = topk_topp_forward_native
Collaborator commented:

Ditto. Let's do this in vllm_ascend/patch and describe the details in vllm_ascend/patch/__init__.py.

Collaborator commented:

Please add a precision UT to check that _apply_top_k_top_p and apply_min_p compute the correct results.
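A minimal sketch of such a precision test, assuming apply_min_p is importable from the module touched by this PR and follows the usual min-p semantics (drop tokens whose probability is below min_p times the per-row maximum):

import torch

def reference_min_p(logits: torch.Tensor, min_p: torch.Tensor) -> torch.Tensor:
    # Straightforward reference implementation used as the ground truth.
    probs = torch.softmax(logits, dim=-1)
    max_probs = probs.amax(dim=-1, keepdim=True)
    mask = probs < min_p.unsqueeze(-1) * max_probs
    return logits.masked_fill(mask, float("-inf"))

def test_apply_min_p_precision():
    torch.manual_seed(0)
    logits = torch.randn(8, 512)
    min_p = torch.full((8,), 0.05)
    expected = reference_min_p(logits.clone(), min_p)
    actual = apply_min_p(logits.clone(), min_p)  # implementation under test (assumed import)
    torch.testing.assert_close(actual, expected)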

Collaborator Author commented:

fixed

# Convert logits to probability distribution
probability_values = torch.nn.functional.softmax(logits, dim=-1)
# Calculate maximum probabilities per sequence
max_probabilities = torch.amax(probability_values,
Collaborator commented:

Do torch.nn.functional.softmax and torch.amax bring any performance gain compared with torch.softmax and torch.Tensor.max?

@momo609 (Collaborator Author) commented Jun 4, 2025

This function changes an index_put into a masked_fill to achieve the performance optimization; the operations quoted in the comment are not modified.
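To illustrate the kind of change being described (illustrative tensors only, not the PR's exact code):

import torch

logits = torch.randn(4, 128)
mask = torch.rand(4, 128) > 0.9  # positions to filter out

# index_put-style update: boolean indexing computes indices and scatters values.
filtered = logits.clone()
filtered[mask] = float("-inf")

# masked_fill expresses the same filtering without the explicit index
# computation, which is the cheaper path on the NPU per the comment above.
filtered_fast = logits.masked_fill(mask, float("-inf"))

assert torch.equal(filtered, filtered_fast)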

@momo609 force-pushed the main branch 3 times, most recently from 52aff53 to c35b678 on May 30, 2025 at 07:06
@momo609 (Collaborator Author) commented May 30, 2025

@wangxiyuan

@momo609 force-pushed the main branch 4 times, most recently from 90ae7ec to 07c6282 on May 30, 2025 at 08:47
@momo609 force-pushed the main branch 4 times, most recently from 77b79f7 to 7940e8e on June 3, 2025 at 03:47
@wangxiyuan (Collaborator) left a comment

@momo609 force-pushed the main branch 8 times, most recently from da1af56 to e2bc926 on June 3, 2025 at 09:35
@momo609 force-pushed the main branch 4 times, most recently from ba565b9 to 950db4a on June 5, 2025 at 01:24
github-actions bot commented Jun 5, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@Yikun (Collaborator) left a comment

LGTM except for the env var. Do we have any plan to remove VLLM_ASCEND_ENABLE_TOPK_OPTIMZE and enable this by default in the future?

max_tokens=64)


@patch.dict(os.environ, {"VLLM_ENABLE_TOPK_OPTIMZE": "1"})
Collaborator commented:

Suggested change
@patch.dict(os.environ, {"VLLM_ENABLE_TOPK_OPTIMZE": "1"})
@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_TOPK_OPTIMZE": "1"})

Comment on lines 39 to 40
"VLLM_ENABLE_TOPK_OPTIMZE":
lambda: bool(int(os.getenv("VLLM_ENABLE_TOPK_OPTIMZE", '0'))),
Collaborator commented:

Suggested change
"VLLM_ENABLE_TOPK_OPTIMZE":
lambda: bool(int(os.getenv("VLLM_ENABLE_TOPK_OPTIMZE", '0'))),
"VLLM_ASCEND_ENABLE_TOPK_OPTIMZE":
lambda: bool(int(os.getenv("VLLM_ASCEND_ENABLE_TOPK_OPTIMZE", '0'))),

@momo609 force-pushed the main branch 3 times, most recently from ee63d93 to 3ce47a2 on June 5, 2025 at 06:24
After testing, the tpu_apply_top_k_top_p function achieves optimal
performance.

Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: ZhengWG <zwg0606@gmail.com>
@wangxiyuan wangxiyuan merged commit 908a851 into vllm-project:main Jun 5, 2025
21 of 23 checks passed
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
…ject#970)

Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
…ject#970)
