
Conversation

@alec-flowers
Contributor

@alec-flowers alec-flowers commented Sep 23, 2025

Overview:

Bumping the vLLM version to 0.10.2. A number of vLLM changes required fixes on our side:

  1. vLLM changed the KV events structure - commit. Every time they change this, it breaks our indexing. I wrote a new deserializer that handles arbitrary field additions and accepts both i64 and u64 block hashes.

  2. vLLM changed their multi-modal OpenAI spec. We have removed the corresponding field.

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Added a configurable CUDA version for vLLM builds.
    • Enabled optional overrides for DeepGEMM and FlashInfer versions.
    • Introduced flexible vLLM installation paths (PyPI or source) with architecture-aware handling and clearer install output.
  • Chores

    • Updated vLLM to version 0.10.2.
    • Removed legacy pins for DeepGEMM and FlashInfer to support optional configuration.
    • Streamlined dependency installation flow and improved configuration summaries.

@alec-flowers alec-flowers requested review from a team as code owners September 23, 2025 13:13
@copy-pr-bot

copy-pr-bot bot commented Sep 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Sep 23, 2025

Walkthrough

Introduces CUDA_VERSION, DEEPGEMM_REF, and FLASHINF_REF build args in Dockerfile; switches VLLM_REF to v0.10.2. Overhauls install_vllm.sh to support PyPI vs source installs based on VLLM_REF, adds CUDA-aware options, refactors dependency installs, and conditions LMCache/FlashInfer paths by arch. Updates pyproject vllm optional dependency to 0.10.2.
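As a rough sketch of the selection logic described above (paraphrased from the walkthrough rather than taken from install_vllm.sh; the variable defaults and exact install commands here are assumptions):

#!/usr/bin/env bash
# Sketch only: choose PyPI vs source based on whether VLLM_REF looks like a release tag
set -euo pipefail

VLLM_REF="${VLLM_REF:-v0.10.2}"       # release tag (vX.Y.Z) or a commit SHA
CUDA_VERSION="${CUDA_VERSION:-12.8}"  # assumed default, forwarded from the Dockerfile build arg

if [[ "$VLLM_REF" == v* ]]; then
    # Tagged release: install the published wheel from PyPI
    uv pip install "vllm==${VLLM_REF#v}"
else
    # Arbitrary commit: clone and build vLLM from source at that ref
    git clone https://github.com/vllm-project/vllm.git
    cd vllm && git checkout "$VLLM_REF" && uv pip install .
fi

In the real script, FlashInfer, the DeepGEMM/EP kernels, and (on amd64) LMCache are installed in separate steps after this branch.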

Changes

Cohort / File(s) Summary of Changes
VLLM Docker build args and flow
container/Dockerfile.vllm
Added top-level ARG CUDA_VERSION; added optional ARGs DEEPGEMM_REF, FLASHINF_REF (defaults empty); changed VLLM_REF from commit to tag v0.10.2; passed CUDA/version args into installer; removed prior pinned defaults and old commented pins; minor structural/formatting adjustments.
Installer control flow and dependency handling
container/deps/vllm/install_vllm.sh
Rewrote install logic: choose PyPI if VLLM_REF starts with v, else source build; added --torch-cuda-arch-list and --cuda-version; normalized arch detection; conditional LMCache install (amd64 only); updated FlashInfer path (PyPI/source, arch-aware); delegated DeepGEMM/EP Kernels installs; added verbose config and help updates.
Project dependency pin
pyproject.toml
Updated [project.optional-dependencies] entry: vllm[flashinfer]==0.10.2 (from ==0.10.1.1).

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Docker as Dockerfile.vllm
  participant Script as install_vllm.sh
  participant PyPI as PyPI
  participant Git as Git (vLLM repo)
  participant Flash as FlashInfer
  participant Deep as DeepGEMM/EP Kernels
  participant LM as LMCache

  rect rgba(230,245,255,0.5)
  note over Docker: Build args: VLLM_REF, CUDA_VERSION, DEEPGEMM_REF, FLASHINF_REF
  Docker->>Script: Run install_vllm.sh --cuda-version --torch-cuda-arch-list [args]
  end

  alt VLLM_REF starts with "v"
    Script->>PyPI: pip install vllm==<VLLM_REF> (flashinfer if applicable)
    note right of Script: PyPI path selected
  else Source build
    Script->>Git: git clone & checkout <VLLM_REF>
    Script->>Git: build/install vLLM from source
    note right of Script: Source path selected
  end

  par FlashInfer
    Script->>Flash: Install (PyPI or source, arch-aware)
  and DeepGEMM/EP
    Script->>Deep: Invoke dedicated installers (pass CUDA_VERSION, arch list)
  end

  opt amd64 only
    Script->>LM: Install LMCache
  end

  Script-->>Docker: Exit 0 on success

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

A rabbit taps at Docker’s door,
New args hop in—CUDA galore.
VLLM tagged, not hashes long,
Scripts now choose the proper song.
Flash and Deep in tidy queue,
On amd64, LMCache too.
Ship it swift—hippity, hoo! 🐇✨

Pre-merge checks

❌ Failed checks (1 warning)

  • Description Check (⚠️ Warning): The description includes the required template headings but leaves the Details and "Where should the reviewer start" sections empty and describes unrelated code changes instead of the Dockerfile and installation script updates, making it incomplete and inaccurate. Resolution: update the description to accurately describe the changes made, which include adding the CUDA_VERSION, DEEPGEMM_REF, and FLASHINF_REF args in the Dockerfile, revising install_vllm.sh to support PyPI/source installs, and bumping the dependency in pyproject.toml, and fill in the Details and "Where should the reviewer start" sections with specific file references.

✅ Passed checks (2 passed)

  • Docstring Coverage (✅ Passed): No functions found in the changes. Docstring coverage check skipped.
  • Title Check (✅ Passed): The title succinctly and accurately summarizes the primary change of bumping the vLLM version to 0.10.2, aligning with the main focus of the changeset.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
container/Dockerfile.vllm (1)

74-89: Add git to build deps to prevent install_vllm.sh clone failure.

install_vllm.sh performs git clone; this stage doesn’t install git. Builds will fail on fresh images.

Apply:

 RUN apt-get update -y \
     && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
         # Python runtime - CRITICAL for virtual environment to work
         python${PYTHON_VERSION}-dev \
         build-essential \
+        git \
         # vLLM build dependencies
         cmake \
         ibverbs-providers \
🧹 Nitpick comments (3)
container/Dockerfile.vllm (1)

199-201: Derive CUDA apt package from CUDA_VERSION to avoid future drift.

Hardcoding 12-8 can desync from ARG CUDA_VERSION.

Apply:

-        cuda-command-line-tools-12-8 && \
+        cuda-command-line-tools-${CUDA_VERSION/./-} && \
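For reference, ${CUDA_VERSION/./-} swaps the first dot for a dash, which is sufficient for the usual MAJOR.MINOR values; note this substitution is a bash-ism, so it assumes the stage's RUN shell is bash rather than plain sh. A quick sanity check:

CUDA_VERSION=12.8
echo "cuda-command-line-tools-${CUDA_VERSION/./-}"   # prints: cuda-command-line-tools-12-8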
container/deps/vllm/install_vllm.sh (2)

120-126: Minor: use printf/echo -e for newlines.

Raw “\n” is printed literally on many shells. Cosmetic only.

Example:

-echo "\n=== Configuration Summary ==="
+echo -e "\n=== Configuration Summary ==="

Apply similarly to other multiline echoes.
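Alternatively, printf sidesteps the echo portability question entirely (a drop-in sketch for the same banner):

printf '\n=== Configuration Summary ===\n'   # printf interprets \n consistently across shells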


171-175: Precompiled wheel URL hardcodes 0.10.2; guard by release tags or parameterize.

Using a 0.10.2 wheel for non-release refs can mismatch and fail when VLLM_REF != v0.10.2.

Option A (gate by tag):

-if [ "$EDITABLE" = "true" ]; then
+if [[ $VLLM_REF =~ ^v ]]; then
+    export VLLM_PRECOMPILED_WHEEL_LOCATION="https://vllm-wheels.s3.us-west-2.amazonaws.com/${VLLM_REF}/vllm-${VLLM_REF#v}-cp38-abi3-manylinux1_x86_64.whl"
+fi
+if [ "$EDITABLE" = "true" ]; then

Option B (disable for non-tag refs): only set VLLM_PRECOMPILED_WHEEL_LOCATION when VLLM_REF starts with v.

Please confirm the exact S3 wheel filename pattern for 0.10.2.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c433447 and 1db67f1.

📒 Files selected for processing (3)
  • container/Dockerfile.vllm (4 hunks)
  • container/deps/vllm/install_vllm.sh (4 hunks)
  • pyproject.toml (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: ptarasiewiczNV
PR: ai-dynamo/dynamo#2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The `--torch-backend=auto` flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.
📚 Learning: 2025-07-22T10:22:28.972Z
Learnt from: ptarasiewiczNV
PR: ai-dynamo/dynamo#2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The `--torch-backend=auto` flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.

Applied to files:

  • container/deps/vllm/install_vllm.sh
📚 Learning: 2025-08-18T16:52:15.659Z
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2489
File: container/deps/vllm/install_vllm.sh:151-152
Timestamp: 2025-08-18T16:52:15.659Z
Learning: The VLLM_PRECOMPILED_WHEEL_LOCATION environment variable, when exported, automatically triggers vLLM's build system to use the precompiled wheel instead of building from source, even when using standard `uv pip install .` commands in container/deps/vllm/install_vllm.sh.

Applied to files:

  • container/deps/vllm/install_vllm.sh
📚 Learning: 2025-07-21T00:10:56.947Z
Learnt from: zaristei
PR: ai-dynamo/dynamo#2020
File: container/deps/vllm/install_vllm.sh:115-118
Timestamp: 2025-07-21T00:10:56.947Z
Learning: Graceful fallback for PyTorch wheel installation is broken on ARM architecture, so immediate exit on pinned version failure is preferred over fallback mechanisms in container/deps/vllm/install_vllm.sh for ARM64.

Applied to files:

  • container/deps/vllm/install_vllm.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
container/Dockerfile.vllm (1)

143-145: Pass-through looks correct.

Conditional DEEPGEMM/FLASHINF args and CUDA_VERSION propagation are sound.

pyproject.toml (1)

57-58: Approve pin bump — flashinfer extra exists on vllm 0.10.2.
PyPI metadata for vllm 0.10.2 lists "flashinfer" in provides_extra.

Signed-off-by: Alec <aflowers@nvidia.com>
@alec-flowers
Contributor Author

/ok to test 359b093

Contributor

@krishung5 krishung5 left a comment


Lgtm, thanks

alec-flowers and others added 2 commits September 24, 2025 09:13
Co-authored-by: Kris Hung <krish@nvidia.com>
Signed-off-by: Alec <35311602+alec-flowers@users.noreply.github.com>
Signed-off-by: Alec <aflowers@nvidia.com>
@alec-flowers
Contributor Author

/ok to test bd3389c

Signed-off-by: Alec <aflowers@nvidia.com>
@alec-flowers
Contributor Author

/ok to test 24d3b15

Signed-off-by: krishung5 <krish@nvidia.com>
@krishung5
Contributor

/ok to test 210c43d

Signed-off-by: krishung5 <krish@nvidia.com>
@krishung5
Contributor

/ok to test 7f6ac97

krishung5 and others added 2 commits September 24, 2025 22:53
Signed-off-by: krishung5 <krish@nvidia.com>
Signed-off-by: Alec <aflowers@nvidia.com>
@alec-flowers
Contributor Author

/ok to test 5cfe1cb

Signed-off-by: Alec <aflowers@nvidia.com>
@alec-flowers
Contributor Author

/ok to test 85483a0

@rmccorm4 rmccorm4 requested a review from PeaBrane September 25, 2025 18:35
@PeaBrane
Contributor

PeaBrane commented Sep 25, 2025

Seems like this could potentially resolve #3169. Would this change break the KV events for the other engines? (It looks to be pretty general and backward compatible after reviewing the PR, so likely not; great work ensuring this.)

@PeaBrane
Contributor

PeaBrane commented Sep 25, 2025

Not really part of your PR, but can you update the mocker engines to emit KV events in the new vLLM format, so the new router + mocker e2e tests can also run with the new event publishing/subscribing formats and we keep a local record of how vLLM changed?

Or, at the very least, briefly mention in the PR description what changes vLLM made to their KV event publishing that necessitated these changes.

Signed-off-by: Alec <aflowers@nvidia.com>
@alec-flowers
Contributor Author

/ok to test fc88fcf

@alec-flowers
Contributor Author

/ok to test 89164de

@alec-flowers alec-flowers enabled auto-merge (squash) September 26, 2025 03:13
@alec-flowers alec-flowers merged commit 5bb7490 into main Sep 26, 2025
23 of 26 checks passed
@alec-flowers alec-flowers deleted the aflowers/bump-vllm branch September 26, 2025 04:01