chore: bump vllm version to 0.10.2 #3180
Conversation
Walkthrough

Introduces CUDA_VERSION, DEEPGEMM_REF, and FLASHINF_REF build args in the Dockerfile and switches VLLM_REF to v0.10.2. Overhauls install_vllm.sh to support PyPI vs. source installs based on VLLM_REF, adds CUDA-aware options, refactors dependency installs, and conditions the LMCache/FlashInfer paths by architecture. Updates the pyproject vllm optional dependency to 0.10.2.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Docker as Dockerfile.vllm
    participant Script as install_vllm.sh
    participant PyPI as PyPI
    participant Git as Git (vLLM repo)
    participant Flash as FlashInfer
    participant Deep as DeepGEMM/EP Kernels
    participant LM as LMCache
    rect rgba(230,245,255,0.5)
        note over Docker: Build args: VLLM_REF, CUDA_VERSION, DEEPGEMM_REF, FLASHINF_REF
        Docker->>Script: Run install_vllm.sh --cuda-version --torch-cuda-arch-list [args]
    end
    alt VLLM_REF starts with "v"
        Script->>PyPI: pip install vllm==<VLLM_REF> (flashinfer if applicable)
        note right of Script: PyPI path selected
    else Source build
        Script->>Git: git clone & checkout <VLLM_REF>
        Script->>Git: build/install vLLM from source
        note right of Script: Source path selected
    end
    par FlashInfer
        Script->>Flash: Install (PyPI or source, arch-aware)
    and DeepGEMM/EP
        Script->>Deep: Invoke dedicated installers (pass CUDA_VERSION, arch list)
    end
    opt amd64 only
        Script->>LM: Install LMCache
    end
    Script-->>Docker: Exit 0 on success
```
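For orientation, here is a minimal bash sketch of the ref-based dispatch the walkthrough and diagram describe. It is an illustration of the flow only, not the actual contents of install_vllm.sh; the clone path and the LMCache package name are assumptions.

```bash
# Sketch of the dispatch described above; details differ from the real script.
VLLM_REF="${VLLM_REF:-v0.10.2}"

if [[ "$VLLM_REF" == v* ]]; then
    # Release tag: install the published wheel from PyPI.
    uv pip install "vllm==${VLLM_REF#v}"
else
    # Arbitrary ref: clone and build vLLM from source.
    git clone https://github.com/vllm-project/vllm.git /tmp/vllm
    cd /tmp/vllm
    git checkout "$VLLM_REF"
    uv pip install .
fi

# Per the diagram, LMCache is installed only on amd64.
if [ "$(uname -m)" = "x86_64" ]; then
    uv pip install lmcache   # package name assumed for illustration
fi
```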
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Tip

👮 Agentic pre-merge checks are now available in preview! Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs. Please see the documentation for more information.

Example:

```yaml
reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).
```
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
container/Dockerfile.vllm (1)
74-89: Add git to build deps to prevent install_vllm.sh clone failure.

install_vllm.sh performs git clone; this stage doesn't install git, so builds will fail on fresh images.
Apply:
```diff
 RUN apt-get update -y \
     && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
         # Python runtime - CRITICAL for virtual environment to work
         python${PYTHON_VERSION}-dev \
         build-essential \
+        git \
         # vLLM build dependencies
         cmake \
         ibverbs-providers \
```
🧹 Nitpick comments (3)
container/Dockerfile.vllm (1)
199-201: Derive the CUDA apt package from CUDA_VERSION to avoid future drift.

Hardcoding 12-8 can desync from ARG CUDA_VERSION.
Apply:
```diff
-    cuda-command-line-tools-12-8 && \
+    cuda-command-line-tools-${CUDA_VERSION/./-} && \
```

container/deps/vllm/install_vllm.sh (2)
120-126: Minor: use printf/echo -e for newlines.

Raw "\n" is printed literally on many shells. Cosmetic only.
Example:
```diff
-echo "\n=== Configuration Summary ==="
+echo -e "\n=== Configuration Summary ==="
```

Apply similarly to other multiline echoes.
171-175: Precompiled wheel URL hardcodes 0.10.2; guard by release tags or parameterize.

Using a 0.10.2 wheel for non-release refs can mismatch and fail when VLLM_REF != v0.10.2.
Option A (gate by tag):
```diff
-if [ "$EDITABLE" = "true" ]; then
+if [[ $VLLM_REF =~ ^v ]]; then
+    export VLLM_PRECOMPILED_WHEEL_LOCATION="https://vllm-wheels.s3.us-west-2.amazonaws.com/${VLLM_REF}/vllm-${VLLM_REF#v}-cp38-abi3-manylinux1_x86_64.whl"
+fi
+if [ "$EDITABLE" = "true" ]; then
```

Option B (disable for non-tag refs): only set VLLM_PRECOMPILED_WHEEL_LOCATION when VLLM_REF starts with v.
Please confirm the exact S3 wheel filename pattern for 0.10.2.
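If parameterizing, one way to verify the wheel actually exists before exporting the variable is a HEAD request against the candidate URL. This is a sketch only; the URL pattern is copied from Option A above and the exact filename for 0.10.2 still needs confirmation.

```bash
# Probe the candidate wheel URL; fall back to a source build if it is missing.
WHEEL_URL="https://vllm-wheels.s3.us-west-2.amazonaws.com/${VLLM_REF}/vllm-${VLLM_REF#v}-cp38-abi3-manylinux1_x86_64.whl"
if curl -sfI "$WHEEL_URL" > /dev/null; then
    export VLLM_PRECOMPILED_WHEEL_LOCATION="$WHEEL_URL"
else
    echo "No precompiled wheel found for ${VLLM_REF}; building from source instead." >&2
fi
```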
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- container/Dockerfile.vllm (4 hunks)
- container/deps/vllm/install_vllm.sh (4 hunks)
- pyproject.toml (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: ptarasiewiczNV
PR: ai-dynamo/dynamo#2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The `--torch-backend=auto` flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.
📚 Learning: 2025-07-22T10:22:28.972Z
Learnt from: ptarasiewiczNV
PR: ai-dynamo/dynamo#2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The `--torch-backend=auto` flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.
Applied to files:
container/deps/vllm/install_vllm.sh
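For illustration, the flag is typically passed alongside a uv-based install like the following; the version pin is only an example, not taken from the script.

```bash
# Example only: let the install pick a PyTorch distribution matching the
# container's CUDA version, per the learning above.
uv pip install "vllm==0.10.2" --torch-backend=auto
```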
📚 Learning: 2025-08-18T16:52:15.659Z
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2489
File: container/deps/vllm/install_vllm.sh:151-152
Timestamp: 2025-08-18T16:52:15.659Z
Learning: The VLLM_PRECOMPILED_WHEEL_LOCATION environment variable, when exported, automatically triggers vLLM's build system to use the precompiled wheel instead of building from source, even when using standard `uv pip install .` commands in container/deps/vllm/install_vllm.sh.
Applied to files:
container/deps/vllm/install_vllm.sh
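A minimal sketch of the pattern this learning describes; the wheel URL is a placeholder, not the value used in the script.

```bash
# With the variable exported, the subsequent `uv pip install .` pulls the
# precompiled wheel instead of compiling vLLM from source.
export VLLM_PRECOMPILED_WHEEL_LOCATION="https://example.com/vllm-0.10.2-precompiled.whl"  # placeholder URL
cd vllm
uv pip install .
```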
📚 Learning: 2025-07-21T00:10:56.947Z
Learnt from: zaristei
PR: ai-dynamo/dynamo#2020
File: container/deps/vllm/install_vllm.sh:115-118
Timestamp: 2025-07-21T00:10:56.947Z
Learning: Graceful fallback for PyTorch wheel installation is broken on ARM architecture, so immediate exit on pinned version failure is preferred over fallback mechanisms in container/deps/vllm/install_vllm.sh for ARM64.
Applied to files:
container/deps/vllm/install_vllm.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
container/Dockerfile.vllm (1)
143-145: Pass-through looks correct.

Conditional DEEPGEMM/FLASHINF args and CUDA_VERSION propagation are sound.
pyproject.toml (1)
57-58: Approve pin bump; the `flashinfer` extra exists on vllm 0.10.2.
PyPI metadata for vllm 0.10.2 lists "flashinfer" in provides_extra.
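As a quick sanity check of that metadata, the extra can be requested directly when installing; the command below is for illustration, with the version and extra name taken from the comment above.

```bash
# uv resolves the flashinfer dependency declared by the vllm 0.10.2 extra.
uv pip install "vllm[flashinfer]==0.10.2"
```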
1db67f1 to 285e794
Signed-off-by: Alec <aflowers@nvidia.com>
285e794 to 359b093
/ok to test 359b093
krishung5 left a comment:
Lgtm, thanks
Co-authored-by: Kris Hung <krish@nvidia.com>
Signed-off-by: Alec <35311602+alec-flowers@users.noreply.github.com>
/ok to test bd3389c
Signed-off-by: Alec <aflowers@nvidia.com>
/ok to test 24d3b15
Signed-off-by: krishung5 <krish@nvidia.com>
/ok to test 210c43d
Signed-off-by: krishung5 <krish@nvidia.com>
/ok to test 7f6ac97
Signed-off-by: krishung5 <krish@nvidia.com>
Signed-off-by: Alec <aflowers@nvidia.com>
/ok to test 5cfe1cb
Signed-off-by: Alec <aflowers@nvidia.com>
/ok to test 85483a0
Seems like this could potentially resolve #3169. Would this change break KV events for the other engines? It looks to be pretty general and backward compatible after reviewing the PR, so likely not; great work ensuring this.
Not really part of your PR, but could you update the mocker engines to emit KV events in the new vLLM format, so the new router + mocker e2e tests can also run with the new event publishing/subscribing formats and we keep a local record of how vLLM changed? Or, at the very least, mention briefly in the PR description what changes vLLM made to their KV event publishing that necessitated these changes.
Signed-off-by: Alec <aflowers@nvidia.com>
/ok to test fc88fcf
/ok to test 89164de
Overview:
Bumping the vLLM version to 0.10.2. There were a number of vLLM changes that we needed to adapt to:
- vLLM changed the KV events structure (commit). Every time they change this it breaks our indexing, so I wrote a new deserializer that handles arbitrary field additions as well as both i64 and u64 block hashes.
- vLLM changed their Multi-Model OpenAI Spec. We have removed the appropriate field.
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Chores