
Conversation

@ishandhanani ishandhanani (Contributor) commented Dec 1, 2025

Summary

  • Disable JIT DeepGEMM for FP8 disaggregated inference to improve stability
  • Extend model warmup timeout from 25 to 50 minutes for better reliability
  • Enhance warmup process to dynamically include all target concurrencies
  • Update benchmark result filename format for better clarity and metadata

Changes

SGLang FP8 Disaggregated Inference (1p_4d.sh)

  • Set SGLANG_ENABLE_JIT_DEEPGEMM=false for both prefill and decode modes
  • Prevents JIT compilation issues with FP8 inference
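
As a minimal sketch of the change (the mode handling below is a placeholder, not the real 1p_4d.sh launch logic):

```shell
# Hypothetical sketch: in the real script the export sits inside the
# existing prefill and decode branches, ahead of other SGLANG_* settings.
mode="${1:-prefill}"   # placeholder mode argument

case "$mode" in
  prefill|decode)
    # Disable JIT DeepGEMM to avoid JIT compilation issues with FP8.
    export SGLANG_ENABLE_JIT_DEEPGEMM=false
    ;;
esac

echo "mode=$mode SGLANG_ENABLE_JIT_DEEPGEMM=$SGLANG_ENABLE_JIT_DEEPGEMM"
```

Exporting the variable before the server process starts ensures both phases see the same setting.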

vLLM Benchmark Script (bench.sh)

  • Warmup improvements:
    • Increased wait_for_model_timeout from 1500s (25 min) to 3000s (50 min)
    • Warmup now dynamically includes all chosen concurrency values, not just predefined list
    • Warmup list is sorted numerically for consistent behavior
  • Filename format:
    • Changed from ctx${prefill_gpus}_gen${decode_gpus} to ctx_${prefill_gpus}_gen_${decode_gpus}_gpus_${total_gpus}
    • Improves parsing and adds total GPU count to metadata
  • Removed trailing set +e for cleaner script termination
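
The warmup-list handling above can be sketched as follows; the variable names (chosen_concurrencies, warmup_concurrency_list) follow the review discussion, and the concrete values are illustrative:

```shell
# Illustrative defaults; the real script derives these from its arguments.
warmup_concurrency_list=(1 8 64)
chosen_concurrencies=(4 64 256)

# Append any chosen concurrency missing from the warmup list (no duplicates).
for c in "${chosen_concurrencies[@]}"; do
  found=0
  for w in "${warmup_concurrency_list[@]}"; do
    if [ "$w" = "$c" ]; then found=1; break; fi
  done
  if [ "$found" -eq 0 ]; then
    warmup_concurrency_list+=("$c")
  fi
done

# Sort numerically for consistent warmup order (mapfile avoids SC2207).
mapfile -t warmup_concurrency_list < <(printf '%s\n' "${warmup_concurrency_list[@]}" | sort -n)

echo "${warmup_concurrency_list[*]}"   # → 1 4 8 64 256
```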

Test Plan

  • Verify FP8 disaggregated inference runs without JIT DeepGEMM errors
  • Confirm warmup completes successfully with extended timeout
  • Validate warmup includes all target concurrencies
  • Check benchmark result files use new naming format
  • Test with various concurrency configurations

Generated with Claude Code

Summary by CodeRabbit

Chores

  • Updated benchmark scripts with an extended timeout (1500s → 3000s) and improved warmup parameter handling for enhanced stability
  • Modified result output filenames to include additional tracking context for better result organization
  • Applied hardware-specific configuration optimizations for target deployment scenarios


@ishandhanani ishandhanani requested review from a team as code owners December 1, 2025 21:27
@github-actions github-actions bot added the feat label Dec 1, 2025
@ishandhanani ishandhanani enabled auto-merge (squash) December 1, 2025 21:30
coderabbitai bot commented Dec 1, 2025

Walkthrough

Configuration updates for the SGLang backend and refinements to the benchmarking script. The first file disables JIT DeepGEMM compilation; the second increases the benchmark timeout, refactors warmup concurrency list handling, and standardizes result filename construction.

Changes

SGLang Configuration
examples/backends/sglang/slurm_jobs/scripts/gb200-fp8/disagg/1p_4d.sh
  • Added environment variable SGLANG_ENABLE_JIT_DEEPGEMM=false in the prefill and decode branches to disable JIT DeepGEMM compilation.

Benchmark Script Enhancements
examples/backends/sglang/slurm_jobs/scripts/vllm/bench.sh
  • Increased the wait_for_model timeout from 1500 to 3000 seconds.
  • Introduced logic to ensure all chosen_concurrencies values exist in warmup_concurrency_list, sorting the list numerically.
  • Refactored the result filename pattern to include ctx, gen, and total_gpus identifiers.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

  • Focus areas:
    • vllm/bench.sh: Verify warmup concurrency list logic correctly appends missing values and sorts numerically without duplicates
    • vllm/bench.sh: Confirm new result filename pattern with ctx, gen, and total_gpus identifiers constructs correctly in both warmup and main benchmark loops
    • Validate wait_for_model timeout increase (3000 seconds) is appropriate for expected model initialization time

Poem

🐰 A script hops faster, timeout sets wider,
Warmup lists sorted like lettuce inside her,
JIT's turned to rest, filenames refined,
DeepGEMM disabled—optimization's designed! ✨

Pre-merge checks

✅ Passed checks (3 passed)
  • Title check: ✅ Passed. The title clearly summarizes the main changes (SGLang FP8 improvements and vLLM benchmark enhancements), directly matching the dual-file modifications in the changeset.
  • Description check: ✅ Passed. The pull request description includes all required template sections (Overview/Summary, Details/Changes, Related Issues), provides clear explanations of modifications, identifies files changed, and includes a comprehensive test plan.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate; docstring coverage check skipped.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
examples/backends/sglang/slurm_jobs/scripts/vllm/bench.sh (1)

43-63: Warmup concurrency list handling is sound, but consider preferred ShellCheck patterns.

The logic to ensure all chosen concurrencies are included in the warmup list (lines 45–57) is correct and avoids duplicates. However, line 60 triggers a ShellCheck SC2207 warning about command substitution with array assignment. While the current approach works, using mapfile is the preferred pattern:

```shell
# Current approach (triggers SC2207):
IFS=$'\n' warmup_concurrency_list=($(sort -n <<<"${warmup_concurrency_list[*]}"))
unset IFS

# Preferred approach using mapfile:
mapfile -t warmup_concurrency_list < <(printf '%s\n' "${warmup_concurrency_list[@]}" | sort -n)
```

Additionally, line 63 triggers a ShellCheck SC2145 warning about mixing string and array syntax. The current echo statement appears syntactically correct; please verify this resolves after the mapfile refactor or that it is a false positive in your environment.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5708b70 and 4c6d013.

📒 Files selected for processing (2)
  • examples/backends/sglang/slurm_jobs/scripts/gb200-fp8/disagg/1p_4d.sh (2 hunks)
  • examples/backends/sglang/slurm_jobs/scripts/vllm/bench.sh (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 0
File: :0-0
Timestamp: 2025-09-19T07:32:44.210Z
Learning: The SGLang backend has debug print statements in _get_input_param() and _process_text_stream() methods that should be removed for production as they cause synchronous I/O in async contexts.
🧬 Code graph analysis (1)
examples/backends/sglang/slurm_jobs/scripts/vllm/bench.sh (1)
examples/backends/sglang/slurm_jobs/scripts/benchmark_utils.sh (1)
  • wait_for_model (5-44)
🪛 Shellcheck (0.11.0)
examples/backends/sglang/slurm_jobs/scripts/vllm/bench.sh

[warning] 60-60: Prefer mapfile or read -a to split command output (or quote to avoid splitting).

(SC2207)


[error] 63-63: Argument mixes string and array. Use * or separate argument.

(SC2145)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: trtllm (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (3)
examples/backends/sglang/slurm_jobs/scripts/gb200-fp8/disagg/1p_4d.sh (1)

86-86: Environment variable addition for FP8 stability is well-placed.

Setting SGLANG_ENABLE_JIT_DEEPGEMM=false consistently in both prefill and decode branches before other SGLANG environment variables follows a logical configuration order and directly addresses JIT compilation issues in FP8 inference.

Also applies to: 144-144

examples/backends/sglang/slurm_jobs/scripts/vllm/bench.sh (2)

30-35: Timeout increase and parameter extraction improve readability and meet requirements.

Doubling the timeout from 1500s to 3000s (50 minutes) aligns with the PR objective for extended warmup, and extracting check/report intervals as named variables improves maintainability. The wait_for_model parameter order is correct: model_host, model_port, n_prefill, n_decode, poll, timeout, report_every.
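
A sketch of that call site under stated assumptions: wait_for_model is stubbed here for illustration (the real implementation lives in benchmark_utils.sh), and the host, port, and interval values are invented.

```shell
# Stub standing in for benchmark_utils.sh's wait_for_model; the real
# function polls the endpoint until the model is ready or the timeout hits.
wait_for_model() {
  echo "waiting on $1:$2 (prefill=$3 decode=$4 poll=${5}s timeout=${6}s report=${7}s)"
}

model_host=localhost          # assumed
model_port=8000               # assumed
check_interval=10             # assumed poll interval
wait_for_model_timeout=3000   # 50 minutes, per this PR
report_interval=60            # assumed report cadence

# Parameter order per the review: model_host, model_port, n_prefill,
# n_decode, poll, timeout, report_every.
wait_for_model "$model_host" "$model_port" 1 4 \
  "$check_interval" "$wait_for_model_timeout" "$report_interval"
```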


99-99: Result filename format now includes clearer metadata and total GPU count.

The updated filename pattern with explicit ctx_, gen_, and gpus_ labels improves metadata clarity and traceability compared to the previous format. This aligns well with the PR objective and makes benchmark results easier to parse and organize.
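
An illustrative reconstruction of the new pattern; the results_ prefix, .json suffix, and GPU counts are assumptions, and only the ctx/gen/gpus labels come from the diff.

```shell
# Hypothetical GPU counts; the real script derives these from its config.
prefill_gpus=4
decode_gpus=16
total_gpus=$((prefill_gpus + decode_gpus))

# New pattern: explicit ctx_/gen_/gpus_ labels plus the total GPU count.
result_file="results_ctx_${prefill_gpus}_gen_${decode_gpus}_gpus_${total_gpus}.json"
echo "$result_file"   # → results_ctx_4_gen_16_gpus_20.json
```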

@Aphoh Aphoh (Contributor) left a comment

LGTM

@ishandhanani (Contributor, Author)

/ok to test 4c6d013

@ishandhanani ishandhanani merged commit 01a634d into main Dec 2, 2025
30 of 32 checks passed
@ishandhanani ishandhanani deleted the ishan/sa-1.1-sgl-dsr1-fp8-merge branch December 2, 2025 17:52