@farawayboat (Contributor) commented Jul 2, 2025

What this PR does / why we need it?

Since running on the Atlas 300I Duo was initially supported in #1333, this PR disables the JIT compiler on the 310P and changes the data format of the weights in the vocabulary embedding and QKV projection layers to NZ, which improves performance.

See #1563
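
For reference, the two tweaks can be sketched in plain Python; `is_310p`, `format_cast`, and the format-id constant below are stand-ins, not the actual torch_npu/vllm-ascend API (the real code uses torch_npu's compile-mode and format-cast controls):

```python
# Sketch of the two 310P-specific tweaks this PR describes, with stand-in
# helpers (`FakeWeight`, `format_cast`) since torch_npu is not assumed here.
ACL_FORMAT_FRACTAL_NZ = 29  # Ascend fractal-NZ format id (assumption)

class FakeWeight:
    """Stand-in for an NPU tensor that records its data format."""
    def __init__(self):
        self.fmt = "ND"

def format_cast(weight, fmt_id):
    # Real code would call a torch_npu format cast; here we only
    # record the requested format on the stand-in tensor.
    weight.fmt = "NZ" if fmt_id == ACL_FORMAT_FRACTAL_NZ else "ND"
    return weight

def apply_310p_tweaks(weights, is_310p, options):
    if not is_310p:
        return
    # 1) Disable the JIT compiler on the 310P.
    options["jit_compile"] = False
    # 2) Convert eligible layer weights from ND to NZ.
    for w in weights:
        format_cast(w, ACL_FORMAT_FRACTAL_NZ)

opts = {"jit_compile": True}
ws = [FakeWeight(), FakeWeight()]
apply_310p_tweaks(ws, is_310p=True, options=opts)
print(opts["jit_compile"], [w.fmt for w in ws])  # False ['NZ', 'NZ']
```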

Does this PR introduce any user-facing change?

No

How was this patch tested?

See #1591 (comment)

codecov bot commented Jul 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.35%. Comparing base (c30ddb8) to head (32aa8d8).
⚠️ Report is 613 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1591       +/-   ##
===========================================
+ Coverage   27.39%   52.35%   +24.95%     
===========================================
  Files          56       78       +22     
  Lines        6191     9633     +3442     
===========================================
+ Hits         1696     5043     +3347     
- Misses       4495     4590       +95     
Flag Coverage Δ
unittests 52.35% <ø> (+24.95%) ⬆️

Flags with carried forward coverage won't be shown.


@Yikun (Collaborator) commented Jul 2, 2025

export VLLM_USE_V1=1
vllm serve Qwen/Qwen3-8B \
    --tensor-parallel-size 1 \
    --enforce-eager \
    --dtype float16 \
    --compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'

python3 -m venv .venv-evalscope
source .venv-evalscope/bin/activate
pip install evalscope[perf] -U
evalscope perf \
    --url "http://localhost:8000/v1/chat/completions" \
    --parallel 5 \
    --model Qwen/Qwen3-8B \
    --number 20 \
    --api openai \
    --dataset openqa \
    --stream

Before:

Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          | 3553.09   |
+-----------------------------------+-----------+
| Number of concurrency             |    5      |
+-----------------------------------+-----------+
| Total requests                    |   20      |
+-----------------------------------+-----------+
| Succeed requests                  |   20      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |    6.8889 |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    |    7.0536 |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    0.0056 |
+-----------------------------------+-----------+
| Average latency (s)               |  787.369  |
+-----------------------------------+-----------+
| Average time to first token (s)   |   76.5634 |
+-----------------------------------+-----------+
| Average time per output token (s) |    0.575  |
+-----------------------------------+-----------+
| Average input tokens per request  |   29.25   |
+-----------------------------------+-----------+
| Average output tokens per request | 1223.85   |
+-----------------------------------+-----------+
| Average package latency (s)       |    0.5808 |
+-----------------------------------+-----------+
| Average package per request       | 1223.85   |
+-----------------------------------+-----------+
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     | 18.9568  | 0.4273  |  0.5009  |  571.2431   |      21      |      845      |     1.4097     |    1.4695     |
|     25%     | 66.1473  | 0.4291  |  0.5566  |  638.7671   |      26      |     1043      |     1.4573     |    1.4951     |
|     50%     |  67.481  | 0.4303  |  0.5649  |  818.9423   |      28      |     1222      |     1.5394     |    1.5689     |
|     66%     | 68.3655  |  0.431  |  0.6157  |  902.7492   |      31      |     1414      |     1.6395     |    1.6923     |
|     75%     | 71.3802  | 0.4314  |  0.634   |  972.2325   |      34      |     1479      |     1.6805     |    1.7313     |
|     80%     | 155.0879 | 0.4316  |  0.6359  |  997.7846   |      37      |     1484      |     1.7178     |    1.7625     |
|     90%     | 155.0886 | 0.4322  |  0.6639  |  1165.3984  |      41      |     1716      |     1.7486     |    1.8818     |
|     95%     | 155.0886 | 0.4326  |  0.7136  |  1222.6076  |      45      |     2048      |     1.9405     |    1.9761     |
|     98%     | 155.0886 | 0.4333  |  0.7136  |  1222.6076  |      45      |     2048      |     1.9405     |    1.9761     |
|     99%     | 155.0886 | 0.4338  |  0.7136  |  1222.6076  |      45      |     2048      |     1.9405     |    1.9761     |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+

After:

Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |  694.935  |
+-----------------------------------+-----------+
| Number of concurrency             |    5      |
+-----------------------------------+-----------+
| Total requests                    |   20      |
+-----------------------------------+-----------+
| Succeed requests                  |   20      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |   35.8523 |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    |   36.6941 |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    0.0288 |
+-----------------------------------+-----------+
| Average latency (s)               |  148.268  |
+-----------------------------------+-----------+
| Average time to first token (s)   |    0.6713 |
+-----------------------------------+-----------+
| Average time per output token (s) |    0.1185 |
+-----------------------------------+-----------+
| Average input tokens per request  |   29.25   |
+-----------------------------------+-----------+
| Average output tokens per request | 1245.75   |
+-----------------------------------+-----------+
| Average package latency (s)       |    0.1185 |
+-----------------------------------+-----------+
| Average package per request       | 1245.75   |
+-----------------------------------+-----------+

Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  0.243   | 0.1161  |  0.117   |  101.3687   |      21      |      845      |     8.3569     |    8.5102     |
|     25%     |  0.2439  | 0.1178  |  0.1179  |  123.5415   |      26      |     1035      |     8.3742     |    8.5287     |
|     50%     |  0.247   | 0.1188  |  0.1189  |  155.6198   |      28      |     1304      |     8.3987     |    8.6044     |
|     66%     |  0.2508  | 0.1194  |  0.1191  |  175.2364   |      31      |     1472      |     8.4032     |    8.6504     |
|     75%     |  1.8208  | 0.1196  |  0.1194  |  179.6964   |      34      |     1510      |     8.4097     |    8.6686     |
|     80%     |  1.9802  | 0.1198  |  0.1194  |  182.7204   |      37      |     1533      |     8.4098     |    8.7404     |
|     90%     |  1.9809  | 0.1204  |  0.1197  |  220.3238   |      41      |     1846      |     8.4632     |    8.9714     |
|     95%     |  1.9811  | 0.1208  |  0.1198  |  230.3884   |      45      |     1936      |     8.6505     |    9.0138     |
|     98%     |  1.9811  | 0.1211  |  0.1198  |  230.3884   |      45      |     1936      |     8.6505     |    9.0138     |
|     99%     |  1.9811  | 0.1215  |  0.1198  |  230.3884   |      45      |     1936      |     8.6505     |    9.0138     |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
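
The two summaries above work out to roughly a 5x end-to-end improvement; a quick check against the reported numbers:

```python
# Sanity-check the speedups implied by the before/after summaries above.
before = {"out_tps": 6.8889, "ttft_s": 76.5634, "avg_latency_s": 787.369}
after = {"out_tps": 35.8523, "ttft_s": 0.6713, "avg_latency_s": 148.268}

throughput_gain = after["out_tps"] / before["out_tps"]
latency_gain = before["avg_latency_s"] / after["avg_latency_s"]
ttft_gain = before["ttft_s"] / after["ttft_s"]
print(f"{throughput_gain:.1f}x output throughput, "
      f"{latency_gain:.1f}x lower latency, {ttft_gain:.0f}x faster TTFT")
# 5.2x output throughput, 5.3x lower latency, 114x faster TTFT
```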

@Yikun changed the title [Performance] Improve performance for Atlas 300I series → [Performance] Disable JIT and nd2nz to improve performance for Atlas 300I series Jul 2, 2025
@Yikun (Collaborator) left a comment

Please also take a look @Angazenn @leo-pony

github-actions bot commented Jul 3, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@Yikun Yikun force-pushed the feat-atlas-310p branch from 6a156ed to 68e0026 Compare July 3, 2025 13:27
@Yikun Yikun force-pushed the feat-atlas-310p branch from 68e0026 to b99e976 Compare July 3, 2025 13:42
@Yikun Yikun added the ready read for review label Jul 3, 2025
@Yikun (Collaborator) left a comment

Waiting to confirm the VocabParallelEmbedding handling:

for module in self.model.modules():
    if isinstance(
            module,
            (VocabParallelEmbedding, MergedColumnParallelLinear,
A collaborator commented:

Perhaps it is better to convert to NZ via the process_weights_after_loading API, which is called after a model's weights are loaded. For quantized methods, you can refer to the implementations of process_weights_after_loading in vllm-ascend/vllm_ascend/quantization/w8a8.py. For unquantized methods, you can consider patching process_weights_after_loading onto UnquantizedLinearMethod.

Besides, the word embedding and the LM head both use the VocabParallelEmbedding class. However, as far as I know, the word embedding calls the embedding api to look up embeddings by input_ids. Since the embedding api is not implemented for the NZ format, it might incur an additional transdata operation to convert NZ back to ND before the lookup. The LM head is fine, since it is implemented as a linear layer. You can check the profiling to see whether converting the word embedding to NZ degrades performance.
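
A minimal sketch of the suggested patching approach, using stand-in classes and a dict "layer" rather than vLLM's actual types (names here are illustrative, not vLLM's API):

```python
# Hypothetical sketch: patch a linear method's process_weights_after_loading
# so the ND->NZ cast happens once, right after weight loading.
class UnquantizedLinearMethodStub:
    """Stand-in for vLLM's UnquantizedLinearMethod."""
    def process_weights_after_loading(self, layer):
        pass  # default: no post-load processing

def cast_to_nz(layer):
    # Real code would format-cast the layer's weight tensor via torch_npu;
    # here we only record the format on a dict stand-in.
    layer["format"] = "NZ"

_orig = UnquantizedLinearMethodStub.process_weights_after_loading

def patched(self, layer):
    _orig(self, layer)  # keep the original post-load behaviour
    cast_to_nz(layer)   # then convert the loaded weight to NZ

UnquantizedLinearMethodStub.process_weights_after_loading = patched

layer = {"format": "ND"}
UnquantizedLinearMethodStub().process_weights_after_loading(layer)
print(layer["format"])  # NZ
```

This keeps the conversion out of the model-runner loop and ties it to weight loading, which is what the reviewer is suggesting.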

@farawayboat (Contributor, author) replied:

OK, I will skip VocabParallelEmbedding in this PR and try process_weights_after_loading in a new PR.

github-actions bot commented Jul 3, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions github-actions bot added merge-conflicts and removed ready read for review labels Jul 3, 2025
Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
@Yikun Yikun merged commit eb39054 into vllm-project:main Jul 5, 2025
20 checks passed
@Yikun Yikun mentioned this pull request Jul 8, 2025
45 tasks
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025