
fix: temporary solution for DeepSeek V2 H100 layout conversion issue #1060

Merged · 2 commits merged into sgl-project:main on Aug 13, 2024

Conversation

@zhyncs (Member) commented on Aug 12, 2024

Motivation

fix #913

Evaluated with Llama 3.1 8B Instruct on GSM8K; it works well:

python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --trust-remote-code --disable-radix-cache --disable-flashinfer
Macro average 0.8468536770280516
Meta Macro average 0.844579226686884
Micro average 0.8468536770280516
Meta Micro average 0.844579226686884
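
For context, the macro and micro averages reported by the eval are the two standard accuracy aggregations; a minimal sketch of their assumed semantics (illustrative, not the actual sglang benchmark script):

    # Macro vs. micro accuracy aggregation (assumed semantics; not the
    # actual sglang gsm8k benchmark code).
    def micro_accuracy(shards):
        # shards: list of (num_correct, num_total) pairs
        correct = sum(c for c, _ in shards)
        total = sum(t for _, t in shards)
        return correct / total

    def macro_accuracy(shards):
        # Unweighted mean of per-shard accuracies; it coincides with the
        # micro average when every shard has the same size, which is why
        # the two numbers above match.
        return sum(c / t for c, t in shards) / len(shards)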

Tested with DeepSeek V2 Lite:

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V2-Lite --port 30000 --trust-remote-code --disable-radix-cache --enable-mla
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     1000
Benchmark duration (s):                  60.98
Total input tokens:                      236142
Total generated tokens:                  215614
Total generated tokens (retokenized):    215378
Request throughput (req/s):              16.40
Input token throughput (tok/s):          3872.26
Output token throughput (tok/s):         3535.64
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   25694.61
Median E2E Latency (ms):                 22902.73
---------------Time to First Token----------------
Mean TTFT (ms):                          5935.37
Median TTFT (ms):                        5716.50
P99 TTFT (ms):                           10970.98
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          237.18
Median TPOT (ms):                        109.77
P99 TPOT (ms):                           1610.97
---------------Inter-token Latency----------------
Mean ITL (ms):                           94.47
Median ITL (ms):                         70.81
P99 ITL (ms):                            272.62
==================================================
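
The serving numbers above were presumably collected with sglang's bundled benchmark client (the same tool used later in this thread); a representative invocation, assumed rather than quoted from the run, would be:

python3 -m sglang.bench_serving --backend sglang --num-prompts 1000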

Modification

Temporary workaround for the DeepSeek V2 layout conversion failure on H100: the Triton kernel block layout on H100 is changed from 128/128 to 128/64 (see the follow-up comments below for the potential performance impact on other models).

Checklist

  1. Ensure pre-commit (pre-commit run --all-files) or other linting tools are used to fix potential lint issues.
  2. Confirm that modifications are covered by complete unit tests. If not, please add more unit tests for correctness.
  3. Modify documentation as needed, such as docstrings or example tutorials.

Co-authored-by: ispobock <ISPObaoke@163.com>
@zhyncs self-assigned this on Aug 12, 2024
@zhyncs changed the title from "fix: tmp resolve deepseek v2 h100 layout issue" to "fix: temporary solution for DeepSeek V2 H100 layout conversion issue" on Aug 12, 2024
@zhyncs changed the title from "fix: temporary solution for DeepSeek V2 H100 layout conversion issue" to "fix: temporary solution for DeepSeek V2 Lite H100 layout conversion issue" on Aug 12, 2024
@zhyncs (Member, Author) commented on Aug 12, 2024

DeepSeek V2 H100 TP8
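
The launch command for this run is not quoted in the thread; presumably it mirrors the Lite command above with tensor parallelism enabled, e.g.:

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V2 --tp 8 --trust-remote-code --disable-radix-cache --enable-mla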

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     1000
Benchmark duration (s):                  165.44
Total input tokens:                      236142
Total generated tokens:                  215614
Total generated tokens (retokenized):    215058
Request throughput (req/s):              6.04
Input token throughput (tok/s):          1427.38
Output token throughput (tok/s):         1303.30
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   72001.17
Median E2E Latency (ms):                 72469.93
---------------Time to First Token----------------
Mean TTFT (ms):                          40396.57
Median TTFT (ms):                        28940.14
P99 TTFT (ms):                           99405.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          178.97
Median TPOT (ms):                        160.95
P99 TPOT (ms):                           559.55
---------------Inter-token Latency----------------
Mean ITL (ms):                           149.20
Median ITL (ms):                         111.16
P99 ITL (ms):                            398.63
==================================================


============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  777.04
Total input tokens:                      1187865
Total generated tokens:                  1089941
Total generated tokens (retokenized):    1087011
Request throughput (req/s):              6.43
Input token throughput (tok/s):          1528.70
Output token throughput (tok/s):         1402.68
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   358316.01
Median E2E Latency (ms):                 365857.50
---------------Time to First Token----------------
Mean TTFT (ms):                          320752.33
Median TTFT (ms):                        323528.82
P99 TTFT (ms):                           670386.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          176.45
Median TPOT (ms):                        176.47
P99 TPOT (ms):                           272.25
---------------Inter-token Latency----------------
Mean ITL (ms):                           175.99
Median ITL (ms):                         128.65
P99 ITL (ms):                            517.99
==================================================
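
Scaling from 1,000 to 5,000 requests leaves per-token decode latency essentially flat (mean TPOT 178.97 ms vs 176.45 ms); the much larger TTFT reflects queueing delay, since all requests arrive at once (traffic request rate inf).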

@zhyncs changed the title from "fix: temporary solution for DeepSeek V2 Lite H100 layout conversion issue" to "fix: temporary solution for DeepSeek V2 H100 layout conversion issue" on Aug 13, 2024
@zhyncs merged commit 65915f9 into sgl-project:main on Aug 13, 2024 (3 of 4 checks passed)
@zhyncs deleted the fix branch on Aug 13, 2024
@zhyncs (Member, Author) commented on Aug 13, 2024

Potential issue: on H100, other models that could use the 128/128 layout now also get 128/64 directly, so there may be performance degradation; this needs to be verified and fixed.
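
Conceptually, the workaround is a hardware-conditional block-size fallback; a minimal sketch under that assumption (hypothetical names, not the actual sgl-project code):

    # Hypothetical sketch of the workaround described above; names are
    # illustrative, not the actual sglang implementation.
    import torch

    def choose_decode_block_sizes() -> tuple[int, int]:
        major, _minor = torch.cuda.get_device_capability()
        if major == 9:  # Hopper (H100, SM 9.0)
            # Temporary fallback: 128x64 sidesteps the failing Triton
            # layout conversion for DeepSeek V2, but also applies to
            # other models that could otherwise use 128x128.
            return 128, 64
        return 128, 128  # default layout on other GPUs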

@zhyncs (Member, Author) commented on Aug 13, 2024

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache --disable-flashinfer
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 1000 --random-input 8192 --random-output 8
# latest main
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     1000
Benchmark duration (s):                  146.91
Total input tokens:                      4064944
Total generated tokens:                  3930
Total generated tokens (retokenized):    4038
Request throughput (req/s):              6.81
Input token throughput (tok/s):          27670.26
Output token throughput (tok/s):         26.75
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   86202.47
Median E2E Latency (ms):                 84699.32
---------------Time to First Token----------------
Mean TTFT (ms):                          73186.68
Median TTFT (ms):                        72622.70
P99 TTFT (ms):                           144514.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4225.82
Median TPOT (ms):                        4279.84
P99 TPOT (ms):                           9710.69
---------------Inter-token Latency----------------
Mean ITL (ms):                           4277.28
Median ITL (ms):                         4344.91
P99 ITL (ms):                            15263.60
==================================================

# v0.2.12
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     1000
Benchmark duration (s):                  156.92
Total input tokens:                      4064944
Total generated tokens:                  3930
Total generated tokens (retokenized):    4038
Request throughput (req/s):              6.37
Input token throughput (tok/s):          25905.22
Output token throughput (tok/s):         25.05
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   93430.57
Median E2E Latency (ms):                 93655.56
---------------Time to First Token----------------
Mean TTFT (ms):                          79686.68
Median TTFT (ms):                        79479.00
P99 TTFT (ms):                           155250.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4289.53
Median TPOT (ms):                        4592.39
P99 TPOT (ms):                           8525.07
---------------Inter-token Latency----------------
Mean ITL (ms):                           4516.55
Median ITL (ms):                         4983.82
P99 ITL (ms):                            13701.18
==================================================
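
In this check, the latest main (which includes the 128/64 fallback) is slightly faster than v0.2.12 (6.81 vs 6.37 req/s; mean TTFT 73.2 s vs 79.7 s), so no regression is visible for Llama 3.1 8B with 8192-token prompts on this H100.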

@zhyncs (Member, Author) commented on Aug 13, 2024

env

Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
GPU 0: NVIDIA H100 80GB HBM3
GPU 0 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 550.54.15
PyTorch: 2.4.0+cu121
flashinfer: 0.1.4+cu121torch2.4
triton: 3.0.0
transformers: 4.44.0
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.3
aiohttp: 3.10.3
fastapi: 0.112.0
hf_transfer: 0.1.8
huggingface_hub: 0.24.5
interegular: 0.3.3
packaging: 23.2
PIL: 10.2.0
psutil: 5.9.8
pydantic: 2.8.2
uvicorn: 0.30.5
uvloop: 0.19.0
zmq: 24.0.1
vllm: 0.5.4
multipart: 0.0.9
openai: 1.40.6
anthropic: 0.33.1
NVIDIA Topology:
        GPU0    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PXB     PXB     52-103,156-207  1               N/A
NIC0    PXB      X      PIX
NIC1    PXB     PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1


ulimit soft: 1048576

Closes: [Bug] DeepSeek V2 H100 x8 Triton failure (#913)