
fix: temporary solution for DeepSeek V2 H100 layout conversion issue #1060

Merged · 2 commits merged into sgl-project:main on Aug 13, 2024

Conversation

@zhyncs (Member) commented on Aug 12, 2024

Motivation

fix #913

Evaluated with Llama 3.1 8B Instruct on GSM8K; it works well:

python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --trust-remote-code --disable-radix-cache --disable-flashinfer
Macro average 0.8468536770280516
Meta Macro average 0.844579226686884
Micro average 0.8468536770280516
Meta Micro average 0.844579226686884
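
For context, the macro and micro averages reported by the eval are the two standard accuracy aggregations; a minimal sketch of their assumed semantics (illustrative, not the actual sglang benchmark script):

    # Macro vs. micro accuracy aggregation (assumed semantics; not the
    # actual sglang gsm8k benchmark code).
    def micro_accuracy(shards):
        # shards: list of (num_correct, num_total) pairs
        correct = sum(c for c, _ in shards)
        total = sum(t for _, t in shards)
        return correct / total

    def macro_accuracy(shards):
        # Unweighted mean of per-shard accuracies; it coincides with the
        # micro average when every shard has the same size, which is why
        # the two numbers above match.
        return sum(c / t for c, t in shards) / len(shards)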

Tested with DeepSeek V2 Lite:

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V2-Lite --port 30000 --trust-remote-code --disable-radix-cache --enable-mla
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     1000
Benchmark duration (s):                  60.98
Total input tokens:                      236142
Total generated tokens:                  215614
Total generated tokens (retokenized):    215378
Request throughput (req/s):              16.40
Input token throughput (tok/s):          3872.26
Output token throughput (tok/s):         3535.64
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   25694.61
Median E2E Latency (ms):                 22902.73
---------------Time to First Token----------------
Mean TTFT (ms):                          5935.37
Median TTFT (ms):                        5716.50
P99 TTFT (ms):                           10970.98
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          237.18
Median TPOT (ms):                        109.77
P99 TPOT (ms):                           1610.97
---------------Inter-token Latency----------------
Mean ITL (ms):                           94.47
Median ITL (ms):                         70.81
P99 ITL (ms):                            272.62
==================================================
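
The serving numbers above were presumably collected with sglang's bundled benchmark client (the same tool used later in this thread); a representative invocation, assumed rather than quoted from the run, would be:

python3 -m sglang.bench_serving --backend sglang --num-prompts 1000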

Modification

Temporary workaround for the DeepSeek V2 layout conversion failure on H100: the Triton kernel block layout on H100 is changed from 128/128 to 128/64 (see the follow-up comments below for the potential performance impact on other models).

Checklist

  1. Ensure pre-commit (pre-commit run --all-files) or other linting tools are used to fix potential lint issues.
  2. Confirm that modifications are covered by complete unit tests. If not, please add more unit tests for correctness.
  3. Modify documentation as needed, such as docstrings or example tutorials.

Co-authored-by: ispobock <ISPObaoke@163.com>
@zhyncs self-assigned this on Aug 12, 2024
@zhyncs changed the title from "fix: tmp resolve deepseek v2 h100 layout issue" to "fix: temporary solution for DeepSeek V2 H100 layout conversion issue" on Aug 12, 2024
@zhyncs changed the title from "fix: temporary solution for DeepSeek V2 H100 layout conversion issue" to "fix: temporary solution for DeepSeek V2 Lite H100 layout conversion issue" on Aug 12, 2024
@zhyncs (Member, Author) commented on Aug 12, 2024

DeepSeek V2 H100 TP8
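
The launch command for this run is not quoted in the thread; presumably it mirrors the Lite command above with tensor parallelism enabled, e.g.:

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V2 --tp 8 --trust-remote-code --disable-radix-cache --enable-mla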

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     1000
Benchmark duration (s):                  165.44
Total input tokens:                      236142
Total generated tokens:                  215614
Total generated tokens (retokenized):    215058
Request throughput (req/s):              6.04
Input token throughput (tok/s):          1427.38
Output token throughput (tok/s):         1303.30
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   72001.17
Median E2E Latency (ms):                 72469.93
---------------Time to First Token----------------
Mean TTFT (ms):                          40396.57
Median TTFT (ms):                        28940.14
P99 TTFT (ms):                           99405.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          178.97
Median TPOT (ms):                        160.95
P99 TPOT (ms):                           559.55
---------------Inter-token Latency----------------
Mean ITL (ms):                           149.20
Median ITL (ms):                         111.16
P99 ITL (ms):                            398.63
==================================================


============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  777.04
Total input tokens:                      1187865
Total generated tokens:                  1089941
Total generated tokens (retokenized):    1087011
Request throughput (req/s):              6.43
Input token throughput (tok/s):          1528.70
Output token throughput (tok/s):         1402.68
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   358316.01
Median E2E Latency (ms):                 365857.50
---------------Time to First Token----------------
Mean TTFT (ms):                          320752.33
Median TTFT (ms):                        323528.82
P99 TTFT (ms):                           670386.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          176.45
Median TPOT (ms):                        176.47
P99 TPOT (ms):                           272.25
---------------Inter-token Latency----------------
Mean ITL (ms):                           175.99
Median ITL (ms):                         128.65
P99 ITL (ms):                            517.99
==================================================
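
Scaling from 1,000 to 5,000 requests leaves per-token decode latency essentially flat (mean TPOT 178.97 ms vs 176.45 ms); the much larger TTFT reflects queueing delay, since all requests arrive at once (traffic request rate inf).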

@zhyncs changed the title from "fix: temporary solution for DeepSeek V2 Lite H100 layout conversion issue" to "fix: temporary solution for DeepSeek V2 H100 layout conversion issue" on Aug 13, 2024
@zhyncs merged commit 65915f9 into sgl-project:main on Aug 13, 2024 (3 of 4 checks passed)
@zhyncs deleted the fix branch on Aug 13, 2024
@zhyncs (Member, Author) commented on Aug 13, 2024

Potential issue: on H100, other models that could use the 128/128 layout now also get 128/64 directly, so there may be performance degradation; this needs to be verified and fixed.
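
Conceptually, the workaround is a hardware-conditional block-size fallback; a minimal sketch under that assumption (hypothetical names, not the actual sgl-project code):

    # Hypothetical sketch of the workaround described above; names are
    # illustrative, not the actual sglang implementation.
    import torch

    def choose_decode_block_sizes() -> tuple[int, int]:
        major, _minor = torch.cuda.get_device_capability()
        if major == 9:  # Hopper (H100, SM 9.0)
            # Temporary fallback: 128x64 sidesteps the failing Triton
            # layout conversion for DeepSeek V2, but also applies to
            # other models that could otherwise use 128x128.
            return 128, 64
        return 128, 128  # default layout on other GPUs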

@zhyncs (Member, Author) commented on Aug 13, 2024

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache --disable-flashinfer
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 1000 --random-input 8192 --random-output 8
# latest main
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     1000
Benchmark duration (s):                  146.91
Total input tokens:                      4064944
Total generated tokens:                  3930
Total generated tokens (retokenized):    4038
Request throughput (req/s):              6.81
Input token throughput (tok/s):          27670.26
Output token throughput (tok/s):         26.75
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   86202.47
Median E2E Latency (ms):                 84699.32
---------------Time to First Token----------------
Mean TTFT (ms):                          73186.68
Median TTFT (ms):                        72622.70
P99 TTFT (ms):                           144514.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4225.82
Median TPOT (ms):                        4279.84
P99 TPOT (ms):                           9710.69
---------------Inter-token Latency----------------
Mean ITL (ms):                           4277.28
Median ITL (ms):                         4344.91
P99 ITL (ms):                            15263.60
==================================================

# v0.2.12
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     1000
Benchmark duration (s):                  156.92
Total input tokens:                      4064944
Total generated tokens:                  3930
Total generated tokens (retokenized):    4038
Request throughput (req/s):              6.37
Input token throughput (tok/s):          25905.22
Output token throughput (tok/s):         25.05
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   93430.57
Median E2E Latency (ms):                 93655.56
---------------Time to First Token----------------
Mean TTFT (ms):                          79686.68
Median TTFT (ms):                        79479.00
P99 TTFT (ms):                           155250.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4289.53
Median TPOT (ms):                        4592.39
P99 TPOT (ms):                           8525.07
---------------Inter-token Latency----------------
Mean ITL (ms):                           4516.55
Median ITL (ms):                         4983.82
P99 ITL (ms):                            13701.18
==================================================
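
In this check, the latest main (which includes the 128/64 fallback) is slightly faster than v0.2.12 (6.81 vs 6.37 req/s; mean TTFT 73.2 s vs 79.7 s), so no regression is visible for Llama 3.1 8B with 8192-token prompts on this H100.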

@zhyncs (Member, Author) commented on Aug 13, 2024

env

Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
GPU 0: NVIDIA H100 80GB HBM3
GPU 0 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 550.54.15
PyTorch: 2.4.0+cu121
flashinfer: 0.1.4+cu121torch2.4
triton: 3.0.0
transformers: 4.44.0
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.3
aiohttp: 3.10.3
fastapi: 0.112.0
hf_transfer: 0.1.8
huggingface_hub: 0.24.5
interegular: 0.3.3
packaging: 23.2
PIL: 10.2.0
psutil: 5.9.8
pydantic: 2.8.2
uvicorn: 0.30.5
uvloop: 0.19.0
zmq: 24.0.1
vllm: 0.5.4
multipart: 0.0.9
openai: 1.40.6
anthropic: 0.33.1
NVIDIA Topology:
        GPU0    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PXB     PXB     52-103,156-207  1               N/A
NIC0    PXB      X      PIX
NIC1    PXB     PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1


ulimit soft: 1048576

Closes: [Bug] DeepSeek V2 H100 x8 Triton failure (#913)