
Conversation

@ishandhanani ishandhanani (Contributor) commented Dec 1, 2025

Summary

  • Disable JIT DeepGEMM for FP8 disaggregated inference to improve stability
  • Extend model warmup timeout from 25 to 50 minutes for better reliability
  • Enhance warmup process to dynamically include all target concurrencies
  • Update benchmark result filename format for better clarity and metadata

Changes

SGLang FP8 Disaggregated Inference (1p_4d.sh)

  • Set SGLANG_ENABLE_JIT_DEEPGEMM=false for both prefill and decode modes
  • Prevents JIT compilation issues with FP8 inference
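
As a minimal sketch of the change (the mode handling below is a placeholder, not the real 1p_4d.sh launch logic):

```shell
# Hypothetical sketch: in the real script the export sits inside the
# existing prefill and decode branches, ahead of other SGLANG_* settings.
mode="${1:-prefill}"   # placeholder mode argument

case "$mode" in
  prefill|decode)
    # Disable JIT DeepGEMM to avoid JIT compilation issues with FP8.
    export SGLANG_ENABLE_JIT_DEEPGEMM=false
    ;;
esac

echo "mode=$mode SGLANG_ENABLE_JIT_DEEPGEMM=$SGLANG_ENABLE_JIT_DEEPGEMM"
```

Exporting the variable before the server process starts ensures both phases see the same setting.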

vLLM Benchmark Script (bench.sh)

  • Warmup improvements:
    • Increased wait_for_model_timeout from 1500s (25 min) to 3000s (50 min)
    • Warmup now dynamically includes all chosen concurrency values, not just predefined list
    • Warmup list is sorted numerically for consistent behavior
  • Filename format:
    • Changed from ctx${prefill_gpus}_gen${decode_gpus} to ctx_${prefill_gpus}_gen_${decode_gpus}_gpus_${total_gpus}
    • Improves parsing and adds total GPU count to metadata
  • Removed trailing set +e for cleaner script termination
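
The warmup-list handling above can be sketched as follows; the variable names (chosen_concurrencies, warmup_concurrency_list) follow the review discussion, and the concrete values are illustrative:

```shell
# Illustrative defaults; the real script derives these from its arguments.
warmup_concurrency_list=(1 8 64)
chosen_concurrencies=(4 64 256)

# Append any chosen concurrency missing from the warmup list (no duplicates).
for c in "${chosen_concurrencies[@]}"; do
  found=0
  for w in "${warmup_concurrency_list[@]}"; do
    if [ "$w" = "$c" ]; then found=1; break; fi
  done
  if [ "$found" -eq 0 ]; then
    warmup_concurrency_list+=("$c")
  fi
done

# Sort numerically for consistent warmup order (mapfile avoids SC2207).
mapfile -t warmup_concurrency_list < <(printf '%s\n' "${warmup_concurrency_list[@]}" | sort -n)

echo "${warmup_concurrency_list[*]}"   # → 1 4 8 64 256
```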

Test Plan

  • Verify FP8 disaggregated inference runs without JIT DeepGEMM errors
  • Confirm warmup completes successfully with extended timeout
  • Validate warmup includes all target concurrencies
  • Check benchmark result files use new naming format
  • Test with various concurrency configurations

Generated with Claude Code

Summary by CodeRabbit

Chores

  • Updated benchmark scripts with an extended timeout (1500s → 3000s) and improved warmup parameter handling for enhanced stability
  • Modified result output filenames to include additional tracking context for better result organization
  • Applied hardware-specific configuration optimizations for target deployment scenarios


@ishandhanani ishandhanani requested review from a team as code owners December 1, 2025 21:27
@github-actions github-actions bot added the feat label Dec 1, 2025
@ishandhanani ishandhanani enabled auto-merge (squash) December 1, 2025 21:30
coderabbitai bot commented Dec 1, 2025

Walkthrough

Configuration updates for the SGLang backend and refinements to the benchmarking script. The first file disables JIT DeepGEMM compilation; the second increases the benchmark timeout, refactors warmup concurrency list handling, and standardizes result filename construction.

Changes

SGLang Configuration
examples/backends/sglang/slurm_jobs/scripts/gb200-fp8/disagg/1p_4d.sh
  • Added environment variable SGLANG_ENABLE_JIT_DEEPGEMM=false in the prefill and decode branches to disable JIT DeepGEMM compilation.

Benchmark Script Enhancements
examples/backends/sglang/slurm_jobs/scripts/vllm/bench.sh
  • Increased the wait_for_model timeout from 1500 to 3000 seconds.
  • Introduced logic to ensure all chosen_concurrencies values exist in warmup_concurrency_list, sorting the list numerically.
  • Refactored the result filename pattern to include ctx, gen, and total_gpus identifiers.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

  • Focus areas:
    • vllm/bench.sh: Verify warmup concurrency list logic correctly appends missing values and sorts numerically without duplicates
    • vllm/bench.sh: Confirm new result filename pattern with ctx, gen, and total_gpus identifiers constructs correctly in both warmup and main benchmark loops
    • Validate wait_for_model timeout increase (3000 seconds) is appropriate for expected model initialization time

Poem

🐰 A script hops faster, timeout sets wider,
Warmup lists sorted like lettuce inside her,
JIT's turned to rest, filenames refined,
DeepGEMM disabled—optimization's designed! ✨

Pre-merge checks

✅ Passed checks (3 passed)
  • Title check: ✅ Passed. The title clearly summarizes the main changes (SGLang FP8 improvements and vLLM benchmark enhancements), directly matching the dual-file modifications in the changeset.
  • Description check: ✅ Passed. The pull request description includes all required template sections (Overview/Summary, Details/Changes, Related Issues), provides clear explanations of modifications, identifies files changed, and includes a comprehensive test plan.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate; docstring coverage check skipped.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
examples/backends/sglang/slurm_jobs/scripts/vllm/bench.sh (1)

43-63: Warmup concurrency list handling is sound, but consider preferred ShellCheck patterns.

The logic to ensure all chosen concurrencies are included in the warmup list (lines 45–57) is correct and avoids duplicates. However, line 60 triggers a ShellCheck SC2207 warning about command substitution with array assignment. While the current approach works, using mapfile is the preferred pattern:

```shell
# Current approach (triggers SC2207):
IFS=$'\n' warmup_concurrency_list=($(sort -n <<<"${warmup_concurrency_list[*]}"))
unset IFS

# Preferred approach using mapfile:
mapfile -t warmup_concurrency_list < <(printf '%s\n' "${warmup_concurrency_list[@]}" | sort -n)
```

Additionally, line 63 triggers a ShellCheck SC2145 warning about mixing string and array syntax. The current echo statement appears syntactically correct; please verify this resolves after the mapfile refactor or that it is a false positive in your environment.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5708b70 and 4c6d013.

📒 Files selected for processing (2)
  • examples/backends/sglang/slurm_jobs/scripts/gb200-fp8/disagg/1p_4d.sh (2 hunks)
  • examples/backends/sglang/slurm_jobs/scripts/vllm/bench.sh (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 0
File: :0-0
Timestamp: 2025-09-19T07:32:44.210Z
Learning: The SGLang backend has debug print statements in _get_input_param() and _process_text_stream() methods that should be removed for production as they cause synchronous I/O in async contexts.
🧬 Code graph analysis (1)
examples/backends/sglang/slurm_jobs/scripts/vllm/bench.sh (1)
examples/backends/sglang/slurm_jobs/scripts/benchmark_utils.sh (1)
  • wait_for_model (5-44)
🪛 Shellcheck (0.11.0)
examples/backends/sglang/slurm_jobs/scripts/vllm/bench.sh

[warning] 60-60: Prefer mapfile or read -a to split command output (or quote to avoid splitting).

(SC2207)


[error] 63-63: Argument mixes string and array. Use * or separate argument.

(SC2145)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: trtllm (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (3)
examples/backends/sglang/slurm_jobs/scripts/gb200-fp8/disagg/1p_4d.sh (1)

86-86: Environment variable addition for FP8 stability is well-placed.

Setting SGLANG_ENABLE_JIT_DEEPGEMM=false consistently in both prefill and decode branches before other SGLANG environment variables follows a logical configuration order and directly addresses JIT compilation issues in FP8 inference.

Also applies to: 144-144

examples/backends/sglang/slurm_jobs/scripts/vllm/bench.sh (2)

30-35: Timeout increase and parameter extraction improve readability and meet requirements.

Doubling the timeout from 1500s to 3000s (50 minutes) aligns with the PR objective for extended warmup, and extracting check/report intervals as named variables improves maintainability. The wait_for_model parameter order is correct: model_host, model_port, n_prefill, n_decode, poll, timeout, report_every.
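
A sketch of that call site under stated assumptions: wait_for_model is stubbed here for illustration (the real implementation lives in benchmark_utils.sh), and the host, port, and interval values are invented.

```shell
# Stub standing in for benchmark_utils.sh's wait_for_model; the real
# function polls the endpoint until the model is ready or the timeout hits.
wait_for_model() {
  echo "waiting on $1:$2 (prefill=$3 decode=$4 poll=${5}s timeout=${6}s report=${7}s)"
}

model_host=localhost          # assumed
model_port=8000               # assumed
check_interval=10             # assumed poll interval
wait_for_model_timeout=3000   # 50 minutes, per this PR
report_interval=60            # assumed report cadence

# Parameter order per the review: model_host, model_port, n_prefill,
# n_decode, poll, timeout, report_every.
wait_for_model "$model_host" "$model_port" 1 4 \
  "$check_interval" "$wait_for_model_timeout" "$report_interval"
```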


99-99: Result filename format now includes clearer metadata and total GPU count.

The updated filename pattern with explicit ctx_, gen_, and gpus_ labels improves metadata clarity and traceability compared to the previous format. This aligns well with the PR objective and makes benchmark results easier to parse and organize.
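
An illustrative reconstruction of the new pattern; the results_ prefix, .json suffix, and GPU counts are assumptions, and only the ctx/gen/gpus labels come from the diff.

```shell
# Hypothetical GPU counts; the real script derives these from its config.
prefill_gpus=4
decode_gpus=16
total_gpus=$((prefill_gpus + decode_gpus))

# New pattern: explicit ctx_/gen_/gpus_ labels plus the total GPU count.
result_file="results_ctx_${prefill_gpus}_gen_${decode_gpus}_gpus_${total_gpus}.json"
echo "$result_file"   # → results_ctx_4_gen_16_gpus_20.json
```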

@Aphoh Aphoh (Contributor) left a comment

LGTM

@ishandhanani (Contributor, Author)

/ok to test 4c6d013

@ishandhanani ishandhanani merged commit 01a634d into main Dec 2, 2025
30 of 32 checks passed
@ishandhanani ishandhanani deleted the ishan/sa-1.1-sgl-dsr1-fp8-merge branch December 2, 2025 17:52