
Faster overlap mode scheduler #1738

Merged
merged 4 commits into main on Oct 21, 2024
Conversation

@merrymercy (Contributor) commented Oct 21, 2024

This PR improves the ordering of kernel launches and result fetching. The overlap scheduler now brings a ~10% throughput improvement even when the radix cache is turned off. With the radix cache turned on, we can expect a larger speedup.
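The reordering described above can be sketched in plain Python. This is a hypothetical illustration (the function names `launch`, `fetch`, `run_normal`, and `run_overlap` are illustrative, not sglang's real API): instead of fetching a batch's results immediately after launching it, the overlap loop launches the next batch's kernels first, so the device stays busy while the host handles results.

```python
def run_normal(batches, launch, fetch):
    """Baseline: launch, then immediately fetch -- the device idles while
    the host is busy fetching results."""
    outputs = []
    for batch in batches:
        handle = launch(batch)
        outputs.append(fetch(handle))
    return outputs

def run_overlap(batches, launch, fetch):
    """Overlap: keep one batch in flight; fetch batch i's results only
    after batch i+1's kernels have been launched."""
    outputs = []
    in_flight = None
    for batch in batches:
        handle = launch(batch)                # enqueue this batch's kernels first
        if in_flight is not None:
            outputs.append(fetch(in_flight))  # then drain the previous batch
        in_flight = handle
    if in_flight is not None:
        outputs.append(fetch(in_flight))      # drain the final batch
    return outputs
```

With stub `launch`/`fetch` callbacks that record events, `run_overlap` produces the interleaving launch(1), launch(2), fetch(1), launch(3), fetch(2), fetch(3), while `run_normal` strictly alternates launch/fetch.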

Benchmark results

Overlap mode: 51.03 req/s

python -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --disable-radix --enable-overlap
python -m sglang.bench_serving --model meta-llama/Llama-3.1-8B-Instruct --num-prompt 3000
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     3000
Benchmark duration (s):                  58.79
Total input tokens:                      673672
Total generated tokens:                  581627
Total generated tokens (retokenized):    581405
Request throughput (req/s):              51.03
Input token throughput (tok/s):          11459.26
Output token throughput (tok/s):         9893.56
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   28986.97
Median E2E Latency (ms):                 29088.28
---------------Time to First Token----------------
Mean TTFT (ms):                          14495.13
Median TTFT (ms):                        11312.61
P99 TTFT (ms):                           36408.59
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          144.25
Median TPOT (ms):                        86.74
P99 TPOT (ms):                           1081.64
---------------Inter-token Latency----------------
Mean ITL (ms):                           78.78
Median ITL (ms):                         32.48
P99 ITL (ms):                            529.30
==================================================

Normal mode: 46.06 req/s

python -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --disable-radix 
python -m sglang.bench_serving --model meta-llama/Llama-3.1-8B-Instruct --num-prompt 3000
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     3000
Benchmark duration (s):                  65.14
Total input tokens:                      673672
Total generated tokens:                  581627
Total generated tokens (retokenized):    581402
Request throughput (req/s):              46.06
Input token throughput (tok/s):          10342.28
Output token throughput (tok/s):         8929.19
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   31574.46
Median E2E Latency (ms):                 31581.12
---------------Time to First Token----------------
Mean TTFT (ms):                          15352.12
Median TTFT (ms):                        11615.68
P99 TTFT (ms):                           39444.51
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          157.51
Median TPOT (ms):                        96.38
P99 TPOT (ms):                           1131.20
---------------Inter-token Latency----------------
Mean ITL (ms):                           87.11
Median ITL (ms):                         37.10
P99 ITL (ms):                            554.28
==================================================

Notes

  1. We still use only multi-threading, which is limited by the GIL. We can expect a larger improvement if we move to multi-processing or disable the GIL.
  2. The overlap scheduler is an experimental feature. I verified its accuracy on GSM-8k, and it matches that of the normal scheduler. It works for standard decoding, but it does not support sampling penalizers (e.g., frequency and repetition penalties) or constrained decoding (e.g., regex, JSON).
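The copy-thread idea from note 1 (and the PR's original title, "Launch a copy thread for overlapped scheduler") can be sketched with the standard library. This is an assumption-laden illustration, not sglang's internals: a background thread performs the blocking result fetch so the scheduler thread only launches batches. This helps even under the GIL because blocking operations such as device-to-host copies typically release it.

```python
import queue
import threading

def copy_worker(pending: "queue.Queue", done: "queue.Ueue" if False else "queue.Queue", fetch):
    """Drain launched batches and fetch their results off the scheduler thread."""
    while True:
        handle = pending.get()
        if handle is None:          # sentinel: the scheduler is finished
            break
        done.put(fetch(handle))     # the blocking copy happens here

def schedule(batches, launch, fetch):
    """Scheduler thread: launch every batch, never block on fetching."""
    pending, done = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=copy_worker, args=(pending, done, fetch))
    worker.start()
    for batch in batches:
        pending.put(launch(batch))  # hand the in-flight batch to the copy thread
    pending.put(None)               # signal shutdown
    worker.join()
    return [done.get() for _ in batches]
```

Because a single worker drains a FIFO queue, results come back in launch order, matching the scheduler's expectations.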

@merrymercy merrymercy changed the title Launch a copy thread for overlapped scheduler Faster overlap mode scheduler Oct 21, 2024
@merrymercy merrymercy merged commit 7ce3606 into main Oct 21, 2024
9 of 10 checks passed
@merrymercy merrymercy deleted the multi-stream branch October 21, 2024 11:30
@merrymercy merrymercy mentioned this pull request Oct 23, 2024
31 tasks
qeternity pushed a commit to qeternity/sglang that referenced this pull request Oct 27, 2024