feat: SGLang FP8 improvements and vLLM benchmark enhancements #4675
Walkthrough
Configuration updates for SGLang backend optimization and benchmarking script refinements. The first file (1p_4d.sh) disables JIT DeepGEMM compilation. The second file (bench.sh) increases the benchmark timeout, refactors warmup concurrency list handling, and standardizes result filename construction.
Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~15 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
examples/backends/sglang/slurm_jobs/scripts/vllm/bench.sh (1)
43-63: Warmup concurrency list handling is sound, but consider the preferred ShellCheck patterns. The logic that ensures all chosen concurrencies are included in the warmup list (lines 45–57) is correct and avoids duplicates. However, line 60 triggers ShellCheck SC2207 (command substitution in an array assignment). The current approach works, but `mapfile` is the preferred pattern:
```bash
# Current approach:
IFS=$'\n' warmup_concurrency_list=($(sort -n <<<"${warmup_concurrency_list[*]}"))
unset IFS

# Preferred approach using mapfile:
mapfile -t warmup_concurrency_list < <(printf '%s\n' "${warmup_concurrency_list[@]}" | sort -n)
```
Additionally, line 63 triggers ShellCheck SC2145 (argument mixes string and array). That check concerns the `echo` itself, not the array assignment on line 60, so the mapfile refactor will not clear it. If the `echo` embeds `${warmup_concurrency_list[@]}` inside a quoted string, switching that expansion to `${warmup_concurrency_list[*]}` should resolve it.
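A minimal sketch of the warmup-list handling discussed above, assuming bash 4+ (required for `mapfile`). The bench.sh lines themselves are not shown, so `chosen_concurrencies` and all values are hypothetical. It appends each chosen concurrency only if it is absent, sorts numerically with the mapfile pattern, and prints with `${array[*]}` so the message is a single argument.

```bash
#!/usr/bin/env bash
# Sketch only: array contents and the name chosen_concurrencies are hypothetical.
warmup_concurrency_list=(1 8 64)
chosen_concurrencies=(64 16 512)

for c in "${chosen_concurrencies[@]}"; do
  found=false
  for w in "${warmup_concurrency_list[@]}"; do
    if [[ $w == "$c" ]]; then
      found=true
      break
    fi
  done
  # Append only values not already present, so no duplicates are introduced.
  if [[ $found == false ]]; then
    warmup_concurrency_list+=("$c")
  fi
done

# One element per line into sort, one line per element back into the array
# (no word splitting or globbing, which is what SC2207 warns about).
mapfile -t warmup_concurrency_list < <(printf '%s\n' "${warmup_concurrency_list[@]}" | sort -n)

# [*] inside a quoted string yields one argument, avoiding SC2145.
echo "Warmup concurrencies: ${warmup_concurrency_list[*]}"
```

`${array[*]}` joins the elements with the first character of `IFS` (a space by default), which suits a log message; `${array[@]}` is the right form only when the elements should be passed as separate arguments.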
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
examples/backends/sglang/slurm_jobs/scripts/gb200-fp8/disagg/1p_4d.sh (2 hunks)
examples/backends/sglang/slurm_jobs/scripts/vllm/bench.sh (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 0
File: :0-0
Timestamp: 2025-09-19T07:32:44.210Z
Learning: The SGLang backend has debug print statements in _get_input_param() and _process_text_stream() methods that should be removed for production as they cause synchronous I/O in async contexts.
🧬 Code graph analysis (1)
examples/backends/sglang/slurm_jobs/scripts/vllm/bench.sh (1)
examples/backends/sglang/slurm_jobs/scripts/benchmark_utils.sh (1)
wait_for_model(5-44)
🪛 Shellcheck (0.11.0)
examples/backends/sglang/slurm_jobs/scripts/vllm/bench.sh
[warning] 60-60: Prefer mapfile or read -a to split command output (or quote to avoid splitting).
(SC2207)
[error] 63-63: Argument mixes string and array. Use * or separate argument.
(SC2145)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: trtllm (amd64)
- GitHub Check: sglang (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: vllm (amd64)
- GitHub Check: Build and Test - dynamo
- GitHub Check: trtllm (amd64)
- GitHub Check: trtllm (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: sglang (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: vllm (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (3)
examples/backends/sglang/slurm_jobs/scripts/gb200-fp8/disagg/1p_4d.sh (1)
86-86: Environment variable addition for FP8 stability is well placed. Setting `SGLANG_ENABLE_JIT_DEEPGEMM=false` in both the prefill and decode branches, before the other SGLANG environment variables, keeps the configuration order consistent and directly addresses JIT compilation issues in FP8 inference by turning JIT DeepGEMM compilation off.
Also applies to: 144-144
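A sketch of the branch pattern this comment describes, under stated assumptions: the function name `launch_worker`, its `mode` argument, and the use of `env` as a stand-in for starting the server are hypothetical, since the surrounding 1p_4d.sh code is not shown. Exporting the variable at the top of each branch means any process launched from that branch inherits it.

```bash
#!/usr/bin/env bash
# Hypothetical launcher: only the variable name and value come from the PR.
launch_worker() {
  local mode=$1
  if [[ $mode == prefill ]]; then
    export SGLANG_ENABLE_JIT_DEEPGEMM=false
    # ...other prefill-specific SGLANG_* variables would follow here...
  elif [[ $mode == decode ]]; then
    export SGLANG_ENABLE_JIT_DEEPGEMM=false
    # ...other decode-specific SGLANG_* variables would follow here...
  else
    echo "unknown mode: $mode" >&2
    return 1
  fi
  # Stand-in for the real server launch: show what a child process inherits.
  env | grep '^SGLANG_ENABLE_JIT_DEEPGEMM='
}

prefill_env=$(launch_worker prefill)
decode_env=$(launch_worker decode)
echo "$prefill_env"
echo "$decode_env"
```

Each call runs inside `$(...)`, so the export stays in that subshell and does not leak between the two modes.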
examples/backends/sglang/slurm_jobs/scripts/vllm/bench.sh (2)
30-35: Timeout increase and parameter extraction improve readability and meet the PR's goals. Doubling the timeout from 1500s (25 minutes) to 3000s (50 minutes) aligns with the PR objective of allowing a longer warmup, and extracting the check and report intervals into named variables improves maintainability. The `wait_for_model` argument order is correct: model_host, model_port, n_prefill, n_decode, poll, timeout, report_every.
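For reference, a hypothetical poll-until-ready loop that takes its arguments in the same order as `wait_for_model`. The real function body (benchmark_utils.sh, lines 5-44) is not shown, so this sketches the general technique rather than that implementation. `wait_for_model_sketch`, the `model_is_ready` stub, the host, port, and worker counts, and the two interval variable names are all made up for illustration; only the 3000s timeout comes from the PR.

```bash
#!/usr/bin/env bash
attempts=0
model_is_ready() {
  # Hypothetical stub: reports "ready" on the third probe.
  # A real check would query the model server instead.
  (( ++attempts >= 3 ))
}

# Same parameter order as wait_for_model; the body is an illustrative sketch.
wait_for_model_sketch() {
  local model_host=$1 model_port=$2 n_prefill=$3 n_decode=$4
  local poll=$5 timeout=$6 report_every=$7
  local elapsed=0 since_report=0
  while (( elapsed < timeout )); do
    if model_is_ready "$model_host" "$model_port" "$n_prefill" "$n_decode"; then
      echo "Model ready after ${elapsed}s"
      return 0
    fi
    sleep "$poll"
    elapsed=$(( elapsed + poll ))
    since_report=$(( since_report + poll ))
    # Emit a progress line every report_every seconds of waiting.
    if (( since_report >= report_every )); then
      echo "Still waiting: ${elapsed}s of ${timeout}s"
      since_report=0
    fi
  done
  echo "Timed out after ${timeout}s" >&2
  return 1
}

wait_for_model_timeout=3000        # value from the PR
wait_for_model_check_interval=1    # hypothetical
wait_for_model_report_every=60     # hypothetical

result=$(wait_for_model_sketch localhost 8000 1 4 \
  "$wait_for_model_check_interval" "$wait_for_model_timeout" "$wait_for_model_report_every")
echo "$result"
```

With a 1s poll interval, the loop makes at most 3000 probes within the 3000s budget and prints a progress line about once a minute. Note that `elapsed` counts only sleep time, not time spent inside the probe, so real wall-clock time can run somewhat longer.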
99-99: Result filename format now includes clearer metadata and the total GPU count. The updated pattern, with explicit `ctx_`, `gen_`, and `gpus_` labels, makes the metadata clearer and more traceable than the previous format. This aligns with the PR objective and makes benchmark results easier to parse and organize.
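A small sketch of building the new label and reading it back, assuming `total_gpus` is the sum of `prefill_gpus` and `decode_gpus` (the PR does not show how it is computed). The GPU counts are example values, and any prefix or extension around the label in the real filename is omitted.

```bash
#!/usr/bin/env bash
# Example values; total_gpus = prefill + decode is an assumption.
prefill_gpus=4
decode_gpus=16
total_gpus=$(( prefill_gpus + decode_gpus ))

old_label="ctx${prefill_gpus}_gen${decode_gpus}"
new_label="ctx_${prefill_gpus}_gen_${decode_gpus}_gpus_${total_gpus}"
echo "old: $old_label"
echo "new: $new_label"

# The explicit labels let each field be recovered with a simple regex.
if [[ $new_label =~ ^ctx_([0-9]+)_gen_([0-9]+)_gpus_([0-9]+)$ ]]; then
  parsed_ctx=${BASH_REMATCH[1]}
  parsed_gen=${BASH_REMATCH[2]}
  parsed_total=${BASH_REMATCH[3]}
  echo "ctx=$parsed_ctx gen=$parsed_gen total=$parsed_total"
fi
```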
Aphoh left a comment
LGTM
/ok to test 4c6d013
Summary
Changes
SGLang FP8 Disaggregated Inference (`1p_4d.sh`)
- Set `SGLANG_ENABLE_JIT_DEEPGEMM=false` for both prefill and decode modes
vLLM Benchmark Script (`bench.sh`)
- Increased `wait_for_model_timeout` from 1500s (25 min) to 3000s (50 min)
- Changed the result filename format from `ctx${prefill_gpus}_gen${decode_gpus}` to `ctx_${prefill_gpus}_gen_${decode_gpus}_gpus_${total_gpus}`
- `set +e` for cleaner script termination
Test Plan
Generated with Claude Code
Summary by CodeRabbit
Chores