chore: bump vllm version to 0.10.2 #3180
Conversation
Walkthrough

Introduces CUDA_VERSION, DEEPGEMM_REF, and FLASHINF_REF build args in the Dockerfile and switches VLLM_REF to v0.10.2. Overhauls install_vllm.sh to support PyPI vs. source installs based on VLLM_REF, adds CUDA-aware options, refactors dependency installs, and conditions the LMCache/FlashInfer paths by architecture. Updates the pyproject vllm optional dependency to 0.10.2.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Docker as Dockerfile.vllm
    participant Script as install_vllm.sh
    participant PyPI as PyPI
    participant Git as Git (vLLM repo)
    participant Flash as FlashInfer
    participant Deep as DeepGEMM/EP Kernels
    participant LM as LMCache
    rect rgba(230,245,255,0.5)
        note over Docker: Build args: VLLM_REF, CUDA_VERSION, DEEPGEMM_REF, FLASHINF_REF
        Docker->>Script: Run install_vllm.sh --cuda-version --torch-cuda-arch-list [args]
    end
    alt VLLM_REF starts with "v"
        Script->>PyPI: pip install vllm==<VLLM_REF> (flashinfer if applicable)
        note right of Script: PyPI path selected
    else Source build
        Script->>Git: git clone & checkout <VLLM_REF>
        Script->>Git: build/install vLLM from source
        note right of Script: Source path selected
    end
    par FlashInfer
        Script->>Flash: Install (PyPI or source, arch-aware)
    and DeepGEMM/EP
        Script->>Deep: Invoke dedicated installers (pass CUDA_VERSION, arch list)
    end
    opt amd64 only
        Script->>LM: Install LMCache
    end
    Script-->>Docker: Exit 0 on success
```
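For orientation, here is a minimal bash sketch of the ref-based dispatch the walkthrough and diagram describe. It is an illustration of the flow only, not the actual contents of install_vllm.sh; the clone path and the LMCache package name are assumptions.

```bash
# Sketch of the dispatch described above; details differ from the real script.
VLLM_REF="${VLLM_REF:-v0.10.2}"

if [[ "$VLLM_REF" == v* ]]; then
    # Release tag: install the published wheel from PyPI.
    uv pip install "vllm==${VLLM_REF#v}"
else
    # Arbitrary ref: clone and build vLLM from source.
    git clone https://github.com/vllm-project/vllm.git /tmp/vllm
    cd /tmp/vllm
    git checkout "$VLLM_REF"
    uv pip install .
fi

# Per the diagram, LMCache is installed only on amd64.
if [ "$(uname -m)" = "x86_64" ]; then
    uv pip install lmcache   # package name assumed for illustration
fi
```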
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Tip

👮 Agentic pre-merge checks are now available in preview! Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs. Please see the documentation for more information.

Example:

```yaml
reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).
```
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
container/Dockerfile.vllm (1)
74-89: Add git to build deps to prevent install_vllm.sh clone failure.

install_vllm.sh performs git clone; this stage doesn't install git, so builds will fail on fresh images.
Apply:
```diff
 RUN apt-get update -y \
     && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
         # Python runtime - CRITICAL for virtual environment to work
         python${PYTHON_VERSION}-dev \
         build-essential \
+        git \
         # vLLM build dependencies
         cmake \
         ibverbs-providers \
```
🧹 Nitpick comments (3)
container/Dockerfile.vllm (1)
199-201: Derive the CUDA apt package from CUDA_VERSION to avoid future drift.

Hardcoding 12-8 can desync from ARG CUDA_VERSION.
Apply:
```diff
-    cuda-command-line-tools-12-8 && \
+    cuda-command-line-tools-${CUDA_VERSION/./-} && \
```

container/deps/vllm/install_vllm.sh (2)
120-126: Minor: use printf/echo -e for newlines.

Raw "\n" is printed literally on many shells. Cosmetic only.
Example:
```diff
-echo "\n=== Configuration Summary ==="
+echo -e "\n=== Configuration Summary ==="
```

Apply similarly to other multiline echoes.
171-175: Precompiled wheel URL hardcodes 0.10.2; guard by release tags or parameterize.

Using a 0.10.2 wheel for non-release refs can mismatch and fail when VLLM_REF != v0.10.2.
Option A (gate by tag):
```diff
-if [ "$EDITABLE" = "true" ]; then
+if [[ $VLLM_REF =~ ^v ]]; then
+    export VLLM_PRECOMPILED_WHEEL_LOCATION="https://vllm-wheels.s3.us-west-2.amazonaws.com/${VLLM_REF}/vllm-${VLLM_REF#v}-cp38-abi3-manylinux1_x86_64.whl"
+fi
+if [ "$EDITABLE" = "true" ]; then
```

Option B (disable for non-tag refs): only set VLLM_PRECOMPILED_WHEEL_LOCATION when VLLM_REF starts with v.
Please confirm the exact S3 wheel filename pattern for 0.10.2.
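If parameterizing, one way to verify the wheel actually exists before exporting the variable is a HEAD request against the candidate URL. This is a sketch only; the URL pattern is copied from Option A above and the exact filename for 0.10.2 still needs confirmation.

```bash
# Probe the candidate wheel URL; fall back to a source build if it is missing.
WHEEL_URL="https://vllm-wheels.s3.us-west-2.amazonaws.com/${VLLM_REF}/vllm-${VLLM_REF#v}-cp38-abi3-manylinux1_x86_64.whl"
if curl -sfI "$WHEEL_URL" > /dev/null; then
    export VLLM_PRECOMPILED_WHEEL_LOCATION="$WHEEL_URL"
else
    echo "No precompiled wheel found for ${VLLM_REF}; building from source instead." >&2
fi
```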
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- container/Dockerfile.vllm (4 hunks)
- container/deps/vllm/install_vllm.sh (4 hunks)
- pyproject.toml (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: ptarasiewiczNV
PR: ai-dynamo/dynamo#2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The `--torch-backend=auto` flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.
📚 Learning: 2025-07-22T10:22:28.972Z
Learnt from: ptarasiewiczNV
PR: ai-dynamo/dynamo#2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The `--torch-backend=auto` flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.
Applied to files:
container/deps/vllm/install_vllm.sh
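For illustration, the flag is typically passed alongside a uv-based install like the following; the version pin is only an example, not taken from the script.

```bash
# Example only: let the install pick a PyTorch distribution matching the
# container's CUDA version, per the learning above.
uv pip install "vllm==0.10.2" --torch-backend=auto
```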
📚 Learning: 2025-08-18T16:52:15.659Z
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2489
File: container/deps/vllm/install_vllm.sh:151-152
Timestamp: 2025-08-18T16:52:15.659Z
Learning: The VLLM_PRECOMPILED_WHEEL_LOCATION environment variable, when exported, automatically triggers vLLM's build system to use the precompiled wheel instead of building from source, even when using standard `uv pip install .` commands in container/deps/vllm/install_vllm.sh.
Applied to files:
container/deps/vllm/install_vllm.sh
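A minimal sketch of the pattern this learning describes; the wheel URL is a placeholder, not the value used in the script.

```bash
# With the variable exported, the subsequent `uv pip install .` pulls the
# precompiled wheel instead of compiling vLLM from source.
export VLLM_PRECOMPILED_WHEEL_LOCATION="https://example.com/vllm-0.10.2-precompiled.whl"  # placeholder URL
cd vllm
uv pip install .
```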
📚 Learning: 2025-07-21T00:10:56.947Z
Learnt from: zaristei
PR: ai-dynamo/dynamo#2020
File: container/deps/vllm/install_vllm.sh:115-118
Timestamp: 2025-07-21T00:10:56.947Z
Learning: Graceful fallback for PyTorch wheel installation is broken on ARM architecture, so immediate exit on pinned version failure is preferred over fallback mechanisms in container/deps/vllm/install_vllm.sh for ARM64.
Applied to files:
container/deps/vllm/install_vllm.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
container/Dockerfile.vllm (1)
143-145: Pass-through looks correct.

Conditional DEEPGEMM/FLASHINF args and CUDA_VERSION propagation are sound.
pyproject.toml (1)
57-58: Approve pin bump; the `flashinfer` extra exists on vllm 0.10.2.
PyPI metadata for vllm 0.10.2 lists "flashinfer" in provides_extra.
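As a quick sanity check of that metadata, the extra can be requested directly when installing; the command below is for illustration, with the version and extra name taken from the comment above.

```bash
# uv resolves the flashinfer dependency declared by the vllm 0.10.2 extra.
uv pip install "vllm[flashinfer]==0.10.2"
```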
1db67f1 to 285e794
Signed-off-by: Alec <aflowers@nvidia.com>
285e794 to 359b093
/ok to test 359b093
krishung5 left a comment:
Lgtm, thanks
Co-authored-by: Kris Hung <krish@nvidia.com>
Signed-off-by: Alec <35311602+alec-flowers@users.noreply.github.com>
/ok to test bd3389c
Signed-off-by: Alec <aflowers@nvidia.com>
/ok to test 24d3b15
Signed-off-by: krishung5 <krish@nvidia.com>
/ok to test 210c43d
Signed-off-by: krishung5 <krish@nvidia.com>
/ok to test 7f6ac97
Signed-off-by: krishung5 <krish@nvidia.com>
Signed-off-by: Alec <aflowers@nvidia.com>
/ok to test 5cfe1cb
Signed-off-by: Alec <aflowers@nvidia.com>
/ok to test 85483a0
Seems like this could potentially resolve #3169. Would this change break KV events for the other engines? It looks to be pretty general and backward compatible after reviewing the PR, so likely not; great work ensuring this.
Not really part of your PR, but could you update the mocker engines to emit KV events in the new vLLM format, so the new router + mocker e2e tests can also run with the new event publishing/subscribing formats and we keep a local record of how vLLM changed? Or, at the very least, mention briefly in the PR description what changes vLLM made to their KV event publishing that necessitated these changes.
Signed-off-by: Alec <aflowers@nvidia.com>
/ok to test fc88fcf
/ok to test 89164de
Overview:
Bumping the vLLM version to 0.10.2. There were a number of vLLM changes that we needed to adapt to:
- vLLM changed the KV events structure (commit). Every time they change this it breaks our indexing, so I wrote a new deserializer that handles arbitrary field additions as well as both i64 and u64 block hashes.
- vLLM changed their Multi-Model OpenAI Spec. We have removed the appropriate field.
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Chores