Merged

45 commits
ed27f48
WIP
krishung5 Jun 17, 2025
5bbb2a5
Agg done. Disagg in progress
krishung5 Jun 25, 2025
1556c8a
chore: llama 4
GuanLuo Jul 10, 2025
fd61cc7
chore: update config
GuanLuo Jul 10, 2025
6fb99b6
chore: update config
GuanLuo Jul 11, 2025
737738f
debug: add log
GuanLuo Jul 14, 2025
8beb407
chore: add instruction and remove dead code
GuanLuo Jul 15, 2025
d611ede
chore: update components for llama 4
GuanLuo Jul 15, 2025
6fdb410
feat: consume image url directly
GuanLuo Jul 16, 2025
e8acf5b
fix: fix up
GuanLuo Jul 16, 2025
ba45908
chore: fix bug
GuanLuo Jul 16, 2025
390d7c5
doc: update readme
GuanLuo Jul 17, 2025
654180f
chore: revert debug changes
GuanLuo Jul 17, 2025
06ee79b
fix: fix up
GuanLuo Jul 17, 2025
805bbf2
style: add copyright. format
GuanLuo Jul 17, 2025
83fe60d
fix: update vLLM commit used for multi-modal
GuanLuo Jul 17, 2025
a3ca92e
style: format and typo
GuanLuo Jul 18, 2025
c25609d
chore: address comment
GuanLuo Jul 18, 2025
1071f74
chore: address comment
GuanLuo Jul 18, 2025
46b3fd9
feat: processor UX migration
GuanLuo Jul 22, 2025
9da48c5
fix: fix up
GuanLuo Jul 22, 2025
7f26477
wip: encoder
GuanLuo Jul 22, 2025
39e5aae
fix: fix up
GuanLuo Jul 23, 2025
201662f
wip: vllm workers
GuanLuo Jul 23, 2025
fcef6dc
feat: update vLLM worker
GuanLuo Jul 23, 2025
4adf600
chore: update launch script
GuanLuo Jul 23, 2025
b8974da
feat: add llama script
GuanLuo Jul 24, 2025
d6be750
fix: fix up
GuanLuo Jul 24, 2025
56dccde
docs: update scripts and README
GuanLuo Jul 24, 2025
30ad164
cleanup: remove unused files. fix up
GuanLuo Jul 24, 2025
3044043
fix: fix up
GuanLuo Jul 24, 2025
96ba9dc
chore: address comment
GuanLuo Jul 24, 2025
91f52f6
chore: use rebased vLLM commit
GuanLuo Jul 25, 2025
3f2a388
fix: address vLLM API changes
GuanLuo Jul 25, 2025
a2ea7b1
chore: address comment
GuanLuo Jul 25, 2025
0798dec
chore: style
GuanLuo Jul 25, 2025
1c5a626
Merge branch 'main' into gluo/multi-modal-ux
GuanLuo Aug 11, 2025
a96ed30
chore: update vLLM commit
GuanLuo Aug 11, 2025
3cdff4a
fix: fix main merge artifact
GuanLuo Aug 11, 2025
ec2373f
fix: remove dynamo SDK reference
GuanLuo Aug 11, 2025
7a62a6c
fix: new vLLM DeepEP installation requires arch list to be specified
GuanLuo Aug 12, 2025
52f362c
Merge branch 'main' into gluo/multi-modal-ux
GuanLuo Aug 12, 2025
8d31421
chore: update ingress launch command
GuanLuo Aug 12, 2025
0824ea5
chore: update vLLM commit to pick up more fix
GuanLuo Aug 12, 2025
7cf21e1
Merge branch 'main' into gluo/multi-modal-ux
GuanLuo Aug 12, 2025
3 changes: 2 additions & 1 deletion container/Dockerfile.vllm
@@ -12,7 +12,7 @@ ARG RUNTIME_IMAGE="nvcr.io/nvidia/cuda"
ARG RUNTIME_IMAGE_TAG="12.8.1-runtime-ubuntu24.04"

# Make sure to update the dependency version in pyproject.toml when updating this
ARG VLLM_REF="f4135232b9a8c4845f8961fb1cd17581c56ae2ce"
ARG VLLM_REF="ba81acbdc1eec643ba815a76628ae3e4b2263b76"
ARG TORCH_BACKEND="cu128"

# Match 0.10.0 vLLM release
@@ -186,6 +186,7 @@ RUN if [ "$ARCH" = "arm64" ]; then \
# Install vllm - keep this early in Dockerfile to avoid
# rebuilds from unrelated source code changes
ARG VLLM_REF
ARG VLLM_GIT_URL
ARG DEEPGEMM_REF
ARG FLASHINF_REF

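This change adds a `VLLM_GIT_URL` build argument so the container can be built against an alternative vLLM repository. A minimal sketch of overriding it at build time, assuming a plain `docker build` invocation from the repository root (the fork URL, ref, and image tag below are illustrative placeholders; the repository may also provide its own build wrapper):

```bash
# Build the vLLM container while overriding where vLLM is cloned from.
# VLLM_GIT_URL and VLLM_REF are build args defined in Dockerfile.vllm;
# the fork URL and branch name here are hypothetical placeholders.
docker build \
  -f container/Dockerfile.vllm \
  --build-arg VLLM_GIT_URL="https://github.com/my-org/vllm.git" \
  --build-arg VLLM_REF="my-feature-branch" \
  -t dynamo-vllm:dev \
  .
```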
11 changes: 8 additions & 3 deletions container/deps/vllm/install_vllm.sh
@@ -20,7 +20,8 @@ set -euo pipefail

# Parse arguments
EDITABLE=true
VLLM_REF="f4135232b9a8c4845f8961fb1cd17581c56ae2ce"
VLLM_REF="ba81acbdc1eec643ba815a76628ae3e4b2263b76"
VLLM_GIT_URL="https://github.com/vllm-project/vllm.git"
MAX_JOBS=16
INSTALLATION_DIR=/tmp
ARCH=$(uname -m)
@@ -49,6 +50,10 @@ while [[ $# -gt 0 ]]; do
VLLM_REF="$2"
shift 2
;;
--vllm-git-url)
VLLM_GIT_URL="$2"
shift 2
;;
--max-jobs)
MAX_JOBS="$2"
shift 2
@@ -113,7 +118,7 @@ uv pip install lmcache
# Create vllm directory and clone
mkdir -p $INSTALLATION_DIR
cd $INSTALLATION_DIR
git clone https://github.com/vllm-project/vllm.git
git clone $VLLM_GIT_URL vllm
cd vllm
git checkout $VLLM_REF

@@ -148,7 +153,7 @@ fi
# Install ep_kernels and DeepGEMM
echo "Installing ep_kernels and DeepGEMM"
cd tools/ep_kernels
bash install_python_libraries.sh # These libraries aren't pinned.
TORCH_CUDA_ARCH_LIST="9.0;10.0" bash install_python_libraries.sh # These libraries aren't pinned.
cd ep_kernels_workspace
git clone https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM
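The installer script gains a matching `--vllm-git-url` flag. A minimal sketch of invoking it with a custom fork, assuming it is run from the repository root in an environment that already has the required build toolchain (the fork URL is a placeholder):

```bash
# Install vLLM from a custom fork instead of the upstream repository.
# --vllm-git-url is the flag added by this change; --max-jobs is an existing option.
bash container/deps/vllm/install_vllm.sh \
  --vllm-git-url "https://github.com/my-org/vllm.git" \
  --max-jobs 8
```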
5 changes: 4 additions & 1 deletion examples/deployments/router_standalone/worker.py
@@ -42,7 +42,10 @@ def __init__(self, port: int) -> None:
logger.info(f"ZMQ publisher initialized on port {port}")

def record(
self, scheduler_stats: SchedulerStats, iteration_stats: Optional[IterationStats]
self,
scheduler_stats: SchedulerStats,
iteration_stats: Optional[IterationStats],
engine_idx: int = 0,
):
# Send metrics over ZMQ
metrics_data = {
328 changes: 328 additions & 0 deletions examples/multimodal_v1/README.md
@@ -0,0 +1,328 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Multimodal Deployment Examples

This directory provides example workflows and reference implementations for deploying a multimodal model using Dynamo and vLLM v1.

## Use the Latest Release

We recommend using the latest stable release of dynamo to avoid breaking changes:

[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)

You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:

```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```

## Multimodal Aggregated Serving

### Components

- workers: For aggregated serving, we have two workers, [VllmEncodeWorker](components/encode_worker.py) for encoding and [VllmPDWorker](components/worker.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the VllmEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.

### Graph

In this graph, we have two workers, [VllmEncodeWorker](components/encode_worker.py) and [VllmPDWorker](components/worker.py).
The VllmEncodeWorker is responsible for encoding the image and passing the embeddings to the VllmPDWorker via a combination of NATS and RDMA.
The work-complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
The VllmPDWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](/components/backends/vllm/README.md) example.
Separating the encode stage from the prefill and decode stages makes the deployment more flexible and lets the
VllmEncodeWorker scale independently from the prefill and decode workers if needed.

This figure shows the flow of the graph:
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> encode_worker
encode_worker --> processor
encode_worker --embeddings--> pd_worker
pd_worker --> encode_worker
```

```bash
cd $DYNAMO_HOME/examples/multimodal_v1
# Serve a LLaVA 1.5 7B model:
bash launch/agg.sh --model llava-hf/llava-1.5-7b-hf
# Serve a Qwen2.5-VL model:
# bash launch/agg.sh --model Qwen/Qwen2.5-VL-7B-Instruct
# Serve a Phi3V model:
# bash launch/agg.sh --model microsoft/Phi-3.5-vision-instruct
```
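Before sending chat requests, you can optionally check that the frontend is up. A minimal sketch, assuming the frontend exposes the standard OpenAI-compatible model listing endpoint on port 8080:

```bash
# List the models registered with the frontend (OpenAI-compatible endpoint).
curl http://localhost:8080/v1/models
```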

### Client

In another terminal:
```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "http://images.cocodataset.org/test2017/000000155781.jpg"
}
}
]
}
],
"max_tokens": 300,
"temperature": 0.0,
"stream": false
}'
```

If serving the example Qwen model, replace `"llava-hf/llava-1.5-7b-hf"` in the `"model"` field with `"Qwen/Qwen2.5-VL-7B-Instruct"`; for the example Phi3V model, use `"microsoft/Phi-3.5-vision-instruct"` instead.

You should see a response similar to this:
```json
{"id": "c37b946e-9e58-4d54-88c8-2dbd92c47b0c", "object": "chat.completion", "created": 1747725277, "model": "llava-hf/llava-1.5-7b-hf", "choices": [{"index": 0, "message": {"role": "assistant", "content": " In the image, there is a city bus parked on a street, with a street sign nearby on the right side. The bus appears to be stopped out of service. The setting is in a foggy city, giving it a slightly moody atmosphere."}, "finish_reason": "stop"}]}
```

## Multimodal Disaggregated Serving

### Components

- workers: For disaggregated serving, we have three workers, [VllmEncodeWorker](components/encode_worker.py) for encoding, [VllmDecodeWorker](components/worker.py) for decoding, and [VllmPDWorker](components/worker.py) for prefilling.
- processor: Tokenizes the prompt and passes it to the VllmEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.

### Graph

In this graph, we have three workers, [VllmEncodeWorker](components/encode_worker.py), [VllmDecodeWorker](components/worker.py), and [VllmPDWorker](components/worker.py).
For the LLaVA model, embeddings are only required during the prefill stage, so the VllmEncodeWorker is connected directly to the prefill worker.
The VllmEncodeWorker is responsible for encoding the image and passing the embeddings to the prefill worker via a combination of NATS and RDMA.
The work-complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
The prefill worker performs the prefilling step and forwards the KV cache to the decode worker for decoding.
For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](/components/backends/vllm/README.md) example.

This figure shows the flow of the graph:
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> encode_worker
encode_worker --> processor
encode_worker --embeddings--> prefill_worker
prefill_worker --> encode_worker
prefill_worker --> decode_worker
decode_worker --> prefill_worker
```

```bash
cd $DYNAMO_HOME/examples/multimodal_v1
bash launch/disagg.sh --model llava-hf/llava-1.5-7b-hf
```

### Client

In another terminal:
```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "http://images.cocodataset.org/test2017/000000155781.jpg"
}
}
]
}
],
"max_tokens": 300,
"temperature": 0.0,
"stream": false
}'
```

You should see a response similar to this:
```json
{"id": "c1774d61-3299-4aa3-bea1-a0af6c055ba8", "object": "chat.completion", "created": 1747725645, "model": "llava-hf/llava-1.5-7b-hf", "choices": [{"index": 0, "message": {"role": "assistant", "content": " This image shows a passenger bus traveling down the road near power lines and trees. The bus displays a sign that says \"OUT OF SERVICE\" on its front."}, "finish_reason": "stop"}]}
```

***Note***: Disaggregation is currently confirmed to work only with LLaVA; Qwen2.5-VL and Phi3V have not been verified.

## Llama 4 Family Serving

The Llama 4 family of models is natively multimodal; however, unlike LLaVA, these models do not consume image
embeddings directly as input (see the vLLM
[supported models table](https://docs.vllm.ai/en/latest/models/supported_models.html#text-generation_1)
for the multi-modal input types each model accepts).
Therefore, the encoder worker is not used in the following example, and encoding is performed alongside prefill.

`meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8` is used as the example model below, running on an 8xH100 system,
which can hold one instance of the model per node.

### Multimodal Aggregated Serving

#### Components

- workers: For aggregated serving, we have one worker, [VllmPDWorker](components/worker.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the VllmPDWorker.
- frontend: HTTP endpoint to handle incoming requests.

#### Graph

In this graph, we have a single [VllmPDWorker](components/worker.py), which encodes the image and then prefills and decodes the prompt, just like the [LLM aggregated serving](/components/backends/vllm/README.md) example.

This figure shows the flow of the graph:
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> pd_worker
pd_worker --> processor
```

```bash
cd $DYNAMO_HOME/examples/multimodal_v1
bash launch/agg_llama.sh
```

#### Client

In another terminal:
```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "http://images.cocodataset.org/test2017/000000155781.jpg"
}
}
]
}
],
"max_tokens": 300,
"temperature": 0.0,
"stream": false
}'
```

You should see a response similar to this:
```json
{"id": "b8f060fa95584e34b9204eaba7b105cc", "object": "chat.completion", "created": 1752706281, "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", "choices": [{"index": 0, "message": {"role": "assistant", "content": "The image depicts a street scene with a trolley bus as the central focus. The trolley bus is positioned on the left side of the road, facing the camera, and features a white and yellow color scheme. A prominent sign on the front of the bus reads \"OUT OF SERVICE\" in orange letters.\n\n**Key Elements:**\n\n* **Trolley Bus:** The bus is the main subject of the image, showcasing its distinctive design and color.\n* **Sign:** The \"OUT OF SERVICE\" sign is clearly visible on the front of the bus, indicating its current status.\n* **Street Scene:** The surrounding environment includes trees, buildings, and power lines, creating a sense of context and atmosphere.\n* **Lighting:** The image is characterized by a misty or foggy quality, with soft lighting that adds to the overall ambiance.\n\n**Overall Impression:**\n\nThe image presents a serene and somewhat melancholic scene, with the out-of-service trolley bus serving as a focal point. The misty atmosphere and soft lighting contribute to a dreamy or nostalgic feel, inviting the viewer to reflect on the scene."}, "finish_reason": "stop"}]}
```

### Multimodal Disaggregated Serving

#### Components

- workers: For disaggregated serving, we have two workers, [VllmDecodeWorker](components/worker.py) for decoding and [VllmPDWorker](components/worker.py) for encoding and prefilling.
- processor: Tokenizes the prompt and passes it to the VllmPDWorker.
- frontend: HTTP endpoint to handle incoming requests.

#### Graph

In this graph, we have two workers, [VllmDecodeWorker](components/worker.py) and [VllmPDWorker](components/worker.py).
The prefill worker performs the encoding and prefilling steps and forwards the KV cache to the decode worker for decoding.
For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](/components/backends/vllm/README.md) example.

This figure shows the flow of the graph:
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> prefill_worker
prefill_worker --> processor
prefill_worker --> decode_worker
decode_worker --> prefill_worker
```

```bash
cd $DYNAMO_HOME/examples/multimodal_v1
bash launch/disagg_llama.sh --head-node

# On a separate worker node that has completed the standard Dynamo setup,
# i.e. its NATS_SERVER and ETCD_ENDPOINTS environment variables point to the
# head node's external IP address for distributed coordination:
cd $DYNAMO_HOME/examples/multimodal_v1
bash launch/disagg_llama.sh
```
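For reference, a minimal sketch of the worker-node environment, assuming the head node's external IP is `10.0.0.1` and the default NATS (4222) and etcd (2379) ports are used; adjust these values for your deployment:

```bash
# Hypothetical head-node address; replace with your head node's external IP.
export HEAD_NODE_IP=10.0.0.1

# Point the worker node at the head node's NATS and etcd services
# (4222 and 2379 are the typical default ports).
export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
export ETCD_ENDPOINTS="http://${HEAD_NODE_IP}:2379"

# Then launch the worker-node processes as shown above.
cd $DYNAMO_HOME/examples/multimodal_v1
bash launch/disagg_llama.sh
```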

#### Client

In another terminal:
```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "http://images.cocodataset.org/test2017/000000155781.jpg"
}
}
]
}
],
"max_tokens": 300,
"temperature": 0.0,
"stream": false
}'
```

You should see a response similar to this:
```json
{"id": "6cc99123ad6948d685b8695428238d4b", "object": "chat.completion", "created": 1752708043, "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", "choices": [{"index": 0, "message": {"role": "assistant", "content": "The image depicts a street scene with a trolley bus as the central focus. The trolley bus is positioned on the left side of the road, facing the camera, and features a white and yellow color scheme. A prominent sign on the front of the bus reads \"OUT OF SERVICE\" in orange letters.\n\n**Key Elements:**\n\n* **Trolley Bus:** The bus is the main subject of the image, showcasing its distinctive design and color.\n* **Sign:** The \"OUT OF SERVICE\" sign is clearly visible on the front of the bus, indicating its current status.\n* **Street Scene:** The surrounding environment includes trees, buildings, and power lines, creating a sense of context and atmosphere.\n* **Lighting:** The image is characterized by a misty or foggy quality, with soft lighting that adds to the overall mood.\n\n**Overall Impression:**\n\nThe image presents a serene and somewhat melancholic scene, with the out-of-service trolley bus serving as a focal point. The misty atmosphere and soft lighting contribute to a contemplative ambiance, inviting the viewer to reflect on the situation."}, "finish_reason": "stop"}]}
```