@farawayboat (Contributor) commented Jul 2, 2025

What this PR does / why we need it?

Since running on the Atlas 300I Duo was initially supported in #1333, this PR disables the JIT compiler on the 310P and changes the data format of the weights in the vocabulary embedding and QKV projection layers to NZ, which improves performance.

See #1563
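
For reference, the two tweaks can be sketched in plain Python; `is_310p`, `format_cast`, and the format-id constant below are stand-ins, not the actual torch_npu/vllm-ascend API (the real code uses torch_npu's compile-mode and format-cast controls):

```python
# Sketch of the two 310P-specific tweaks this PR describes, with stand-in
# helpers (`FakeWeight`, `format_cast`) since torch_npu is not assumed here.
ACL_FORMAT_FRACTAL_NZ = 29  # Ascend fractal-NZ format id (assumption)

class FakeWeight:
    """Stand-in for an NPU tensor that records its data format."""
    def __init__(self):
        self.fmt = "ND"

def format_cast(weight, fmt_id):
    # Real code would call a torch_npu format cast; here we only
    # record the requested format on the stand-in tensor.
    weight.fmt = "NZ" if fmt_id == ACL_FORMAT_FRACTAL_NZ else "ND"
    return weight

def apply_310p_tweaks(weights, is_310p, options):
    if not is_310p:
        return
    # 1) Disable the JIT compiler on the 310P.
    options["jit_compile"] = False
    # 2) Convert eligible layer weights from ND to NZ.
    for w in weights:
        format_cast(w, ACL_FORMAT_FRACTAL_NZ)

opts = {"jit_compile": True}
ws = [FakeWeight(), FakeWeight()]
apply_310p_tweaks(ws, is_310p=True, options=opts)
print(opts["jit_compile"], [w.fmt for w in ws])  # False ['NZ', 'NZ']
```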

Does this PR introduce any user-facing change?

No

How was this patch tested?

See #1591 (comment)

codecov bot commented Jul 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.35%. Comparing base (c30ddb8) to head (32aa8d8).
⚠️ Report is 613 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1591       +/-   ##
===========================================
+ Coverage   27.39%   52.35%   +24.95%     
===========================================
  Files          56       78       +22     
  Lines        6191     9633     +3442     
===========================================
+ Hits         1696     5043     +3347     
- Misses       4495     4590       +95     
Flag Coverage Δ
unittests 52.35% <ø> (+24.95%) ⬆️

Flags with carried forward coverage won't be shown.


@Yikun (Collaborator) commented Jul 2, 2025

export VLLM_USE_V1=1
vllm serve Qwen/Qwen3-8B \
    --tensor-parallel-size 1 \
    --enforce-eager \
    --dtype float16 \
    --compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'

python3 -m venv .venv-evalscope
source .venv-evalscope/bin/activate
pip install evalscope[perf] -U
evalscope perf \
    --url "http://localhost:8000/v1/chat/completions" \
    --parallel 5 \
    --model Qwen/Qwen3-8B \
    --number 20 \
    --api openai \
    --dataset openqa \
    --stream

Before:

Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          | 3553.09   |
+-----------------------------------+-----------+
| Number of concurrency             |    5      |
+-----------------------------------+-----------+
| Total requests                    |   20      |
+-----------------------------------+-----------+
| Succeed requests                  |   20      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |    6.8889 |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    |    7.0536 |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    0.0056 |
+-----------------------------------+-----------+
| Average latency (s)               |  787.369  |
+-----------------------------------+-----------+
| Average time to first token (s)   |   76.5634 |
+-----------------------------------+-----------+
| Average time per output token (s) |    0.575  |
+-----------------------------------+-----------+
| Average input tokens per request  |   29.25   |
+-----------------------------------+-----------+
| Average output tokens per request | 1223.85   |
+-----------------------------------+-----------+
| Average package latency (s)       |    0.5808 |
+-----------------------------------+-----------+
| Average package per request       | 1223.85   |
+-----------------------------------+-----------+
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     | 18.9568  | 0.4273  |  0.5009  |  571.2431   |      21      |      845      |     1.4097     |    1.4695     |
|     25%     | 66.1473  | 0.4291  |  0.5566  |  638.7671   |      26      |     1043      |     1.4573     |    1.4951     |
|     50%     |  67.481  | 0.4303  |  0.5649  |  818.9423   |      28      |     1222      |     1.5394     |    1.5689     |
|     66%     | 68.3655  |  0.431  |  0.6157  |  902.7492   |      31      |     1414      |     1.6395     |    1.6923     |
|     75%     | 71.3802  | 0.4314  |  0.634   |  972.2325   |      34      |     1479      |     1.6805     |    1.7313     |
|     80%     | 155.0879 | 0.4316  |  0.6359  |  997.7846   |      37      |     1484      |     1.7178     |    1.7625     |
|     90%     | 155.0886 | 0.4322  |  0.6639  |  1165.3984  |      41      |     1716      |     1.7486     |    1.8818     |
|     95%     | 155.0886 | 0.4326  |  0.7136  |  1222.6076  |      45      |     2048      |     1.9405     |    1.9761     |
|     98%     | 155.0886 | 0.4333  |  0.7136  |  1222.6076  |      45      |     2048      |     1.9405     |    1.9761     |
|     99%     | 155.0886 | 0.4338  |  0.7136  |  1222.6076  |      45      |     2048      |     1.9405     |    1.9761     |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+

After:

Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |  694.935  |
+-----------------------------------+-----------+
| Number of concurrency             |    5      |
+-----------------------------------+-----------+
| Total requests                    |   20      |
+-----------------------------------+-----------+
| Succeed requests                  |   20      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |   35.8523 |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    |   36.6941 |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    0.0288 |
+-----------------------------------+-----------+
| Average latency (s)               |  148.268  |
+-----------------------------------+-----------+
| Average time to first token (s)   |    0.6713 |
+-----------------------------------+-----------+
| Average time per output token (s) |    0.1185 |
+-----------------------------------+-----------+
| Average input tokens per request  |   29.25   |
+-----------------------------------+-----------+
| Average output tokens per request | 1245.75   |
+-----------------------------------+-----------+
| Average package latency (s)       |    0.1185 |
+-----------------------------------+-----------+
| Average package per request       | 1245.75   |
+-----------------------------------+-----------+

Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  0.243   | 0.1161  |  0.117   |  101.3687   |      21      |      845      |     8.3569     |    8.5102     |
|     25%     |  0.2439  | 0.1178  |  0.1179  |  123.5415   |      26      |     1035      |     8.3742     |    8.5287     |
|     50%     |  0.247   | 0.1188  |  0.1189  |  155.6198   |      28      |     1304      |     8.3987     |    8.6044     |
|     66%     |  0.2508  | 0.1194  |  0.1191  |  175.2364   |      31      |     1472      |     8.4032     |    8.6504     |
|     75%     |  1.8208  | 0.1196  |  0.1194  |  179.6964   |      34      |     1510      |     8.4097     |    8.6686     |
|     80%     |  1.9802  | 0.1198  |  0.1194  |  182.7204   |      37      |     1533      |     8.4098     |    8.7404     |
|     90%     |  1.9809  | 0.1204  |  0.1197  |  220.3238   |      41      |     1846      |     8.4632     |    8.9714     |
|     95%     |  1.9811  | 0.1208  |  0.1198  |  230.3884   |      45      |     1936      |     8.6505     |    9.0138     |
|     98%     |  1.9811  | 0.1211  |  0.1198  |  230.3884   |      45      |     1936      |     8.6505     |    9.0138     |
|     99%     |  1.9811  | 0.1215  |  0.1198  |  230.3884   |      45      |     1936      |     8.6505     |    9.0138     |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
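
The two summaries above work out to roughly a 5x end-to-end improvement; a quick check against the reported numbers:

```python
# Sanity-check the speedups implied by the before/after summaries above.
before = {"out_tps": 6.8889, "ttft_s": 76.5634, "avg_latency_s": 787.369}
after = {"out_tps": 35.8523, "ttft_s": 0.6713, "avg_latency_s": 148.268}

throughput_gain = after["out_tps"] / before["out_tps"]
latency_gain = before["avg_latency_s"] / after["avg_latency_s"]
ttft_gain = before["ttft_s"] / after["ttft_s"]
print(f"{throughput_gain:.1f}x output throughput, "
      f"{latency_gain:.1f}x lower latency, {ttft_gain:.0f}x faster TTFT")
# 5.2x output throughput, 5.3x lower latency, 114x faster TTFT
```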

@Yikun changed the title [Performance] Improve performance for Atlas 300I series → [Performance] Disable JIT and nd2nz to improve performance for Atlas 300I series Jul 2, 2025
@Yikun (Collaborator) left a comment

Please also take a look @Angazenn @leo-pony

github-actions bot commented Jul 3, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@Yikun Yikun force-pushed the feat-atlas-310p branch from 6a156ed to 68e0026 Compare July 3, 2025 13:27
@Yikun Yikun force-pushed the feat-atlas-310p branch from 68e0026 to b99e976 Compare July 3, 2025 13:42
@Yikun Yikun added the ready read for review label Jul 3, 2025
@Yikun (Collaborator) left a comment

Waiting to confirm the VocabParallelEmbedding handling:

for module in self.model.modules():
    if isinstance(
            module,
            (VocabParallelEmbedding, MergedColumnParallelLinear,
A collaborator commented:

Perhaps it is better to convert to NZ via the process_weights_after_loading API, which is called after a model's weights are loaded. For quantized methods, you can refer to the implementations of process_weights_after_loading in vllm-ascend/vllm_ascend/quantization/w8a8.py. For unquantized methods, you can consider patching process_weights_after_loading onto UnquantizedLinearMethod.

Besides, the word embedding and the LM head both use the VocabParallelEmbedding class. However, as far as I know, the word embedding calls the embedding api to look up embeddings by input_ids. Since the embedding api is not implemented for the NZ format, it might incur an additional transdata operation to convert NZ back to ND before the lookup. The LM head is fine, since it is implemented as a linear layer. You can check the profiling to see whether converting the word embedding to NZ degrades performance.
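
A minimal sketch of the suggested patching approach, using stand-in classes and a dict "layer" rather than vLLM's actual types (names here are illustrative, not vLLM's API):

```python
# Hypothetical sketch: patch a linear method's process_weights_after_loading
# so the ND->NZ cast happens once, right after weight loading.
class UnquantizedLinearMethodStub:
    """Stand-in for vLLM's UnquantizedLinearMethod."""
    def process_weights_after_loading(self, layer):
        pass  # default: no post-load processing

def cast_to_nz(layer):
    # Real code would format-cast the layer's weight tensor via torch_npu;
    # here we only record the format on a dict stand-in.
    layer["format"] = "NZ"

_orig = UnquantizedLinearMethodStub.process_weights_after_loading

def patched(self, layer):
    _orig(self, layer)  # keep the original post-load behaviour
    cast_to_nz(layer)   # then convert the loaded weight to NZ

UnquantizedLinearMethodStub.process_weights_after_loading = patched

layer = {"format": "ND"}
UnquantizedLinearMethodStub().process_weights_after_loading(layer)
print(layer["format"])  # NZ
```

This keeps the conversion out of the model-runner loop and ties it to weight loading, which is what the reviewer is suggesting.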

@farawayboat (Contributor, author) replied:

OK, I will skip VocabParallelEmbedding in this PR and try process_weights_after_loading in a new PR.

github-actions bot commented Jul 3, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions github-actions bot added merge-conflicts and removed ready read for review labels Jul 3, 2025
Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
@Yikun Yikun merged commit eb39054 into vllm-project:main Jul 5, 2025
20 checks passed
@Yikun Yikun mentioned this pull request Jul 8, 2025
45 tasks
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025