@mohammedabdulwahhab mohammedabdulwahhab commented Aug 26, 2025

Overview:

Cherry-pick of #2727.

Summary by CodeRabbit

  • New Features

    • Kubernetes CRDs add sharedMemory config for /dev/shm (enable/size).
    • Helm charts become componentType-aware (frontend/worker env, ports, health checks); add terminationDelay.
    • New multimodal LLAVA aggregated deployment example.
  • Improvements

    • Default model switched to Qwen/Qwen3-0.6B across samples and launch scripts.
    • Readiness gating prevents requests before model registration; tokenizer init auto-skipped.
    • Container updates: TensorRT-LLM 1.0.0rc6, vLLM 0.10.1.1, base images/UCX pinned; Prometheus included in runtimes.
  • Bug Fixes

    • Clear error when output tokens are absent; example script imports fixed; GPU limits moved to pod level.
  • Documentation

    • New Quickstart (local), Installation, Architecture, Examples; links refreshed; metrics guide updated; support matrix revised.

copy-pr-bot bot commented Aug 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mohammedabdulwahhab mohammedabdulwahhab changed the base branch from main to release/0.4.1 August 26, 2025 22:59
@nv-nmailhot nv-nmailhot merged commit ec66a42 into release/0.4.1 Aug 26, 2025
4 of 5 checks passed
@nv-nmailhot nv-nmailhot deleted the mabdulwahhab/cp-hello-world-fix branch August 26, 2025 23:10
coderabbitai bot commented Aug 26, 2025

Caution

Review failed

Failed to post review comments.

Walkthrough

Broad updates across docs, configs, and code: switch default demo model to Qwen/Qwen3-0.6B; add SGLang readiness gate and tokenizer-init enforcement; refine error handling; add CRD/operator “sharedMemory” support; rework Helm templates for componentType; bump TRT‑LLM/vLLM/UCX versions; remove local async-openai-macros crate; add examples and docs reorg.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **SGLang runtime flow**<br>`components/backends/sglang/src/dynamo/sglang/args.py`, `.../main.py`, `.../register.py`, `.../request_handlers/decode_handler.py` | Auto-enable `skip_tokenizer_init` with a warning; add a readiness gate that queues requests until model registration completes; make `register_llm_with_runtime_config` return `bool`; add defensive handling when `output_ids` is missing. |
| **SGLang deploy/launch**<br>`components/backends/sglang/deploy/*.yaml`, `components/backends/sglang/launch/*.sh`, `components/backends/sglang/slurm_jobs/scripts/*`, `components/backends/sglang/README.md`, `.../deploy/README.md`, `.../docs/*` | Switch the model to Qwen/Qwen3-0.6B across examples; add flags such as `--skip-tokenizer-init`, kv-events config, and disaggregation options; minor namespace/type/link fixes; update the hicache flag to `--hicache-ratio`. |
| **TRT-LLM configs and docs**<br>`components/backends/trtllm/deploy/*.yaml`, `components/backends/trtllm/engine_configs/llama4/eagle/*`, `components/backends/trtllm/README.md`, `.../gpt-oss.md`, `.../gemma3_sliding_window_attention.md`, `.../launch/*.sh` | Point deployments to Qwen/Qwen3-0.6B; adjust Eagle configs (delete some, tweak others including `cuda_graph_config` and token limits); consolidate multimodal docs into an external guide; add readiness/health docs. |
| **vLLM updates**<br>`components/backends/vllm/deploy/agg_router.yaml`, `container/Dockerfile.vllm`, `container/deps/vllm/install_vllm.sh` | Move the GPU limit to pod level; bump the vLLM ref to 0.10.1.1 and copy Prometheus/UCX into the runtime. |
| **Containers and build pins**<br>`container/Dockerfile*`, `container/build.sh`, `pyproject.toml`, `README.md` | Update UCX to v1.19.0; bump TRT-LLM base/runtime tags and deps to rc6; update Torch pins; copy Prometheus into runtime images; switch some install commands to `uv`. |
| **Operator/CRDs**<br>`deploy/cloud/helm/crds/templates/nvidia.com_*`, `deploy/cloud/operator/api/v1alpha1/*`, `.../internal/consts/consts.go`, `.../internal/dynamo/graph.go`, `.../internal/controller/*_test.go`, `deploy/helm/chart/templates/*` | Add a `sharedMemory` spec (`disabled`/`size`) across CRDs, API types, and deepcopy, with defaults (`/dev/shm`, 8Gi). Add `BackendFrameworkNoop` and componentType-aware Helm charts (env/ports/probes/commands). |
| **Docs reorganization**<br>`docs/conf.py`, `docs/index.rst`, `docs/_sections/*`, `docs/_includes/*`, `docs/hidden_toctree.rst`, `docs/support_matrix.md`, multiple moved/removed links | Rebrand docs config and simplify MyST extensions; restructure the index and sections; add install/quick-start snippets; update the support matrix to TRT-LLM rc6 and add an AL2023 footnote; replace/move various links. |
| **Examples**<br>`examples/runtime/hello_world/*`, `examples/basics/multinode/README.md`, `examples/multimodal/deploy/agg_llava.yaml` | Add a retry loop to the client; adjust probes/args and add `backendFramework`; fix a missing import; add a LLAVA aggregated deployment manifest. |
| **Rust workspace/macros**<br>`Cargo.toml`, `lib/async-openai-macros/*`, `lib/async-openai/Cargo.toml` | Remove the local `async-openai-macros` crate from the workspace; switch to published crate version 0.1.0. |
| **Tests**<br>`tests/serve/test_sglang.py`, `tests/serve/test_vllm.py`, `tests/kvbm/test_determinism.py` | Update the model to Qwen; increase the vLLM timeout; comment out certain pytest markers pending CI support. |
| **Attributions**<br>`ATTRIBUTIONS-Go.md` | Add two third-party license blocks (entries duplicated). |
| **Misc**<br>`deploy/inference-gateway/README.md`, `docs/components/backends/*` | Link/path fixes and small doc additions/removals. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant Client
  participant Frontend
  participant SGLang as SGLang Runtime
  participant Registrar as Model Registrar

  Note over Frontend: Readiness gate
  Frontend->>Registrar: register_llm_with_runtime_config(...)
  par Start endpoint immediately
    Frontend-->>Client: /v1/... generate endpoint available
  and Register model concurrently
    Registrar-->>Frontend: success(bool=true) or failure
  end

  alt Registration succeeds
    Frontend->>Frontend: ready_event.set()
    Client->>Frontend: generate(request)
    Frontend->>Frontend: wait until ready_event
    Frontend->>SGLang: handler.generate(request)
    SGLang-->>Frontend: stream chunks
    Frontend-->>Client: stream chunks
  else Registration fails
    Registrar->>Frontend: error
    Frontend->>SGLang: shutdown()
    Frontend-->>Client: error response
  end
```
```mermaid
sequenceDiagram
  autonumber
  participant Operator as Operator Graph Builder
  participant CRD as CRD Spec (sharedMemory)
  participant K8s as Kubernetes

  Operator->>CRD: read spec.sharedMemory {disabled,size}
  alt disabled == true
    Operator->>K8s: do not mount /dev/shm tmpfs
  else not set or false
    Operator->>K8s: create EmptyDir medium=Memory sizeLimit=(size or 8Gi)
    Operator->>K8s: mount at /dev/shm (default path)
  end
```
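The defaulting logic in the diagram above can be expressed compactly. This is a Python sketch of the decision flow only: the `disabled`/`size` field names and the `/dev/shm`, `Memory`, 8Gi defaults come from the walkthrough, while the returned dict shape is a simplified stand-in for the operator's actual Go types and the Kubernetes EmptyDir volume spec.

```python
DEFAULT_SHM_SIZE = "8Gi"
DEFAULT_SHM_PATH = "/dev/shm"

def shm_volume(shared_memory=None):
    """Return an EmptyDir-style volume description, or None when disabled."""
    shared_memory = shared_memory or {}
    if shared_memory.get("disabled"):
        return None  # disabled == true: do not mount /dev/shm tmpfs
    return {
        "emptyDir": {
            "medium": "Memory",  # tmpfs-backed volume
            "sizeLimit": shared_memory.get("size") or DEFAULT_SHM_SIZE,
        },
        "mountPath": DEFAULT_SHM_PATH,
    }

print(shm_volume(None))                # unset spec falls back to 8Gi at /dev/shm
print(shm_volume({"disabled": True}))  # prints None
print(shm_volume({"size": "2Gi"}))     # explicit size overrides the default
```

Keeping the default at 8Gi matters because container runtimes otherwise cap `/dev/shm` at 64Mi, which is too small for frameworks that use shared memory for inter-process tensor transfer.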

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes


Poem

Bun to the metal, I tweak and I tune,
Gates hold requests till models commune.
Qwen is the default, the routes are anew,
Shared mem grows comfy, with pods in a queue.
Docs shed their clutter, containers align—
Hippity-hop, ship it, all fine! 🐇✨

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.2.2)

```
Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions
```
