
Faster overlap mode scheduler #1738

Merged
merged 4 commits into main on Oct 21, 2024
Conversation

@merrymercy (Contributor) commented Oct 21, 2024

This PR improves the ordering of kernel launches and result fetching. The overlap scheduler now brings a ~10% throughput improvement even when the radix cache is turned off. With the radix cache turned on, we can expect a larger speedup.
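The reordering described above can be sketched in plain Python. This is a hypothetical illustration (the function names `launch`, `fetch`, `run_normal`, and `run_overlap` are illustrative, not sglang's real API): instead of fetching a batch's results immediately after launching it, the overlap loop launches the next batch's kernels first, so the device stays busy while the host handles results.

```python
def run_normal(batches, launch, fetch):
    """Baseline: launch, then immediately fetch -- the device idles while
    the host is busy fetching results."""
    outputs = []
    for batch in batches:
        handle = launch(batch)
        outputs.append(fetch(handle))
    return outputs

def run_overlap(batches, launch, fetch):
    """Overlap: keep one batch in flight; fetch batch i's results only
    after batch i+1's kernels have been launched."""
    outputs = []
    in_flight = None
    for batch in batches:
        handle = launch(batch)                # enqueue this batch's kernels first
        if in_flight is not None:
            outputs.append(fetch(in_flight))  # then drain the previous batch
        in_flight = handle
    if in_flight is not None:
        outputs.append(fetch(in_flight))      # drain the final batch
    return outputs
```

With stub `launch`/`fetch` callbacks that record events, `run_overlap` produces the interleaving launch(1), launch(2), fetch(1), launch(3), fetch(2), fetch(3), while `run_normal` strictly alternates launch/fetch.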

Benchmark results

Overlap mode: 51.03 req/s

python -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --disable-radix --enable-overlap
python -m sglang.bench_serving --model meta-llama/Llama-3.1-8B-Instruct --num-prompt 3000
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     3000
Benchmark duration (s):                  58.79
Total input tokens:                      673672
Total generated tokens:                  581627
Total generated tokens (retokenized):    581405
Request throughput (req/s):              51.03
Input token throughput (tok/s):          11459.26
Output token throughput (tok/s):         9893.56
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   28986.97
Median E2E Latency (ms):                 29088.28
---------------Time to First Token----------------
Mean TTFT (ms):                          14495.13
Median TTFT (ms):                        11312.61
P99 TTFT (ms):                           36408.59
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          144.25
Median TPOT (ms):                        86.74
P99 TPOT (ms):                           1081.64
---------------Inter-token Latency----------------
Mean ITL (ms):                           78.78
Median ITL (ms):                         32.48
P99 ITL (ms):                            529.30
==================================================

Normal mode: 46.06 req/s

python -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --disable-radix 
python -m sglang.bench_serving --model meta-llama/Llama-3.1-8B-Instruct --num-prompt 3000
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     3000
Benchmark duration (s):                  65.14
Total input tokens:                      673672
Total generated tokens:                  581627
Total generated tokens (retokenized):    581402
Request throughput (req/s):              46.06
Input token throughput (tok/s):          10342.28
Output token throughput (tok/s):         8929.19
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   31574.46
Median E2E Latency (ms):                 31581.12
---------------Time to First Token----------------
Mean TTFT (ms):                          15352.12
Median TTFT (ms):                        11615.68
P99 TTFT (ms):                           39444.51
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          157.51
Median TPOT (ms):                        96.38
P99 TPOT (ms):                           1131.20
---------------Inter-token Latency----------------
Mean ITL (ms):                           87.11
Median ITL (ms):                         37.10
P99 ITL (ms):                            554.28
==================================================

Notes

  1. We still use only multi-threading, which is limited by the GIL. We can expect a larger improvement if we move to multi-processing or disable the GIL.
  2. The overlap scheduler is an experimental feature. I verified its accuracy on GSM-8k, and it matches that of the normal scheduler. It works for standard decoding, but it does not support sampling penalizers (e.g., frequency and repetition penalties) or constrained decoding (e.g., regex, JSON).
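The copy-thread idea from note 1 (and the PR's original title, "Launch a copy thread for overlapped scheduler") can be sketched with the standard library. This is an assumption-laden illustration, not sglang's internals: a background thread performs the blocking result fetch so the scheduler thread only launches batches. This helps even under the GIL because blocking operations such as device-to-host copies typically release it.

```python
import queue
import threading

def copy_worker(pending: "queue.Queue", done: "queue.Ueue" if False else "queue.Queue", fetch):
    """Drain launched batches and fetch their results off the scheduler thread."""
    while True:
        handle = pending.get()
        if handle is None:          # sentinel: the scheduler is finished
            break
        done.put(fetch(handle))     # the blocking copy happens here

def schedule(batches, launch, fetch):
    """Scheduler thread: launch every batch, never block on fetching."""
    pending, done = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=copy_worker, args=(pending, done, fetch))
    worker.start()
    for batch in batches:
        pending.put(launch(batch))  # hand the in-flight batch to the copy thread
    pending.put(None)               # signal shutdown
    worker.join()
    return [done.get() for _ in batches]
```

Because a single worker drains a FIFO queue, results come back in launch order, matching the scheduler's expectations.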

@merrymercy merrymercy changed the title Launch a copy thread for overlapped scheduler Faster overlap mode scheduler Oct 21, 2024
@merrymercy merrymercy merged commit 7ce3606 into main Oct 21, 2024
9 of 10 checks passed
@merrymercy merrymercy deleted the multi-stream branch October 21, 2024 11:30
@merrymercy merrymercy mentioned this pull request Oct 23, 2024
31 tasks
qeternity pushed a commit to qeternity/sglang that referenced this pull request Oct 27, 2024