DefTruth (Member) commented Mar 24, 2025

fix vllm-project#14887

Hotfix for the gptq-marlin non-contiguous input error. The previous vllm-project#15319 fixed only the GPTQ code path, and vllm-project#14946 fixed only the AWQ code path. This PR pushes the non-contiguous fix down into the ops.gptq_marlin_gemm function, since gptq_marlin_gemm is shared by multiple code paths, both AWQ and GPTQ (a sketch of the idea follows the call paths below).

# GPTQ
MLA -> gptq_marlin.py:341 -> mixed_precision/marlin.py:123 (previous fixed here) -> marlin_utils.py:334 apply_gptq_marlin_linear -> _custom_ops.py:741 ops.gptq_marlin_gemm -> torch.ops._C.gptq_marlin_gemm
# AWQ
MLA -> awq_marlin.py:303 -> marlin_utils.py:379 apply_awq_marlin_linear -> _custom_ops.py:741 ops.gptq_marlin_gemm -> torch.ops._C.gptq_marlin_gemm
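
For reference, a minimal sketch of the idea (abbreviated and illustrative, not the exact wrapper signature): make the activation contiguous inside the shared ops.gptq_marlin_gemm wrapper before dispatching to the custom op, so every caller benefits.

# Sketch (illustrative only): handle non-contiguous activations once, inside
# the shared Python wrapper in vllm/_custom_ops.py, so both the AWQ and GPTQ
# paths are covered. The real wrapper takes many more arguments (quantized
# weights, scales, workspace, sizes, ...), elided here as *args / **kwargs.
import torch

def gptq_marlin_gemm(a: torch.Tensor, *args, **kwargs) -> torch.Tensor:
    # The MLA prefix-cache path can pass a strided view (e.g. a slice of a
    # fused kv workspace); the underlying CUDA kernel expects a contiguous
    # activation tensor, so copy only when needed.
    if not a.is_contiguous():
        a = a.contiguous()
    return torch.ops._C.gptq_marlin_gemm(a, *args, **kwargs)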

Performance with and without prefix cache for the R1-AWQ model (TTFT goes from 13701.38 ms without prefix cache to 1183.58 ms with prefix cache; this PR does not degrade TTFT performance):

# MLA + AWQ + w/o prefix cache + NVIDIA L20 + pp2 + tp8
python3 benchmark_serving.py \
        --backend vllm \
        --model /workspace/DeepSeek-R1-awq \
        --port 8862 \
        --endpoint /v1/completions \
        --dataset-name random \
        --dataset-path ${SHAREGPT_DATASET_PATH} \
        --random-prefix-len 0 \
        --random-input-len 4096 \
        --random-output-len 1024 \
        --ignore-eos \
        --max-concurrency 32 \
        --num-prompts 32

Maximum request concurrency: 32
============ Serving Benchmark Result ============
Successful requests:                     32
Benchmark duration (s):                  130.67
Total input tokens:                      131072
Total generated tokens:                  32768
Request throughput (req/s):              0.24
Output token throughput (tok/s):         250.77
Total Token throughput (tok/s):          1253.85
---------------Time to First Token----------------
Mean TTFT (ms):                          13701.38
Median TTFT (ms):                        14286.95
P99 TTFT (ms):                           24786.87
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          113.26
Median TPOT (ms):                        112.76
P99 TPOT (ms):                           124.87
---------------Inter-token Latency----------------
Mean ITL (ms):                           113.26
Median ITL (ms):                         104.52
P99 ITL (ms):                            853.17
==================================================

# MLA + AWQ + w/ prefix cache + NVIDIA L20 + pp2 + tp8
python3 benchmark_serving.py \
        --backend vllm \
        --model /workspace/DeepSeek-R1-awq \
        --port 8862 \
        --endpoint /v1/completions \
        --dataset-name random \
        --dataset-path ${SHAREGPT_DATASET_PATH} \
        --random-prefix-len 3072 \
        --random-input-len 1024 \
        --random-output-len 1024 \
        --ignore-eos \
        --max-concurrency 32 \
        --num-prompts 32

## TTFT: 13701.38 ms (w/o prefix cache) -> 1183.58 ms (w/ prefix cache)
Maximum request concurrency: 32
============ Serving Benchmark Result ============
Successful requests:                     32
Benchmark duration (s):                  112.73
Total input tokens:                      131072
Total generated tokens:                  32768
Request throughput (req/s):              0.28
Output token throughput (tok/s):         290.69
Total Token throughput (tok/s):          1453.45
---------------Time to First Token----------------
Mean TTFT (ms):                          1183.58
Median TTFT (ms):                        1198.95
P99 TTFT (ms):                           1573.84
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          108.95
Median TPOT (ms):                        108.94
P99 TPOT (ms):                           109.46
---------------Inter-token Latency----------------
Mean ITL (ms):                           108.95
Median ITL (ms):                         109.81
P99 ITL (ms):                            120.19
==================================================

This PR is only a temporary workaround. For performance, the better long-term fix is either to make gptq_marlin_gemm support non-contiguous input, or to manage the MLA context_chunk_workspace carefully (i.e., do not fuse the nope and pe KV-cache workspaces into a single tensor).
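
To illustrate the second alternative, a minimal sketch (the shapes are hypothetical, chosen only for demonstration) of why slicing a fused workspace yields a non-contiguous tensor, while separately allocated workspaces stay contiguous:

# Hypothetical shapes, for illustration only: 32 tokens, nope dim 512, pe dim 64.
import torch

fused = torch.empty(32, 512 + 64)    # nope and pe fused into a single workspace
nope_view = fused[:, :512]           # slicing the fused tensor gives a strided view
print(nope_view.is_contiguous())     # False -> triggers the gptq_marlin_gemm error

nope_only = torch.empty(32, 512)     # separately allocated nope workspace
print(nope_only.is_contiguous())     # True -> safe to pass to the kernel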

Signed-off-by: DefTruth <qiustudent_r@163.com>
DefTruth changed the title [Misc][VIP] hotfix for gptq-marlin non-contiguous error -> [Bugfix][VIP] hotfix for gptq-marlin non-contiguous error on Mar 24, 2025
DefTruth merged commit bb218a6 into main on Mar 24, 2025
DefTruth deleted the vipshop-dev branch on March 24, 2025 05:50
Successfully merging this pull request may close these issues:
[Bug]: 0.74 dev, the error occurred in the gptq_marlin_gemm function call