
Commit 353146e

GuanLuo and krishung5 authored
feat: add vLLM v1 multi-modal example. Add llama4 Maverick example (#1990)
Signed-off-by: GuanLuo <41310872+GuanLuo@users.noreply.github.com>
Co-authored-by: krishung5 <krish@nvidia.com>
1 parent 1f07dab commit 353146e

File tree

23 files changed: +4569 -3 lines

container/Dockerfile.vllm

Lines changed: 7 additions & 3 deletions
```diff
@@ -167,7 +167,11 @@ RUN uv pip install /workspace/wheels/nixl/*.whl
 
 # Install vllm - keep this early in Dockerfile to avoid
 # rebuilds from unrelated source code changes
-ARG VLLM_REF="059d4cd"
+# [gluo NOTE] currently using a fork of vllm until the fix
+# for multi-modal disaggregated serving is merged upstream.
+# see https://github.com/vllm-project/vllm/pull/21074
+ARG VLLM_REPO=https://github.com/GuanLuo/vllm.git
+ARG VLLM_REF="eaadf838ebe93e29a38a6fc1bab5a9801abe7d2c"
 ARG MAX_JOBS=16
 ENV MAX_JOBS=$MAX_JOBS
 ENV CUDA_HOME=/usr/local/cuda
@@ -177,7 +181,7 @@ RUN --mount=type=bind,source=./container/deps/,target=/tmp/deps \
     uv pip install pip cuda-python && \
     mkdir /opt/vllm && \
     cd /opt/vllm && \
-    git clone https://github.com/vllm-project/vllm.git && \
+    git clone $VLLM_REPO && \
     cd vllm && \
     git checkout $VLLM_REF && \
     uv pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128 && \
@@ -198,7 +202,7 @@ RUN --mount=type=bind,source=./container/deps/,target=/tmp/deps \
     uv pip install pip cuda-python && \
     mkdir /opt/vllm && \
     cd /opt/vllm && \
-    git clone https://github.com/vllm-project/vllm.git && \
+    git clone $VLLM_REPO && \
     cd vllm && \
     git checkout $VLLM_REF && \
     VLLM_USE_PRECOMPILED=1 uv pip install -e . && \
```

examples/multimodal_v1/README.md

Lines changed: 337 additions & 0 deletions
@@ -0,0 +1,337 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Multimodal Deployment Examples

This directory provides example workflows and reference implementations for deploying a multimodal model using Dynamo and vLLM v1.

## Use the Latest Release

We recommend using the latest stable release of dynamo to avoid breaking changes:

[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)

You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:

```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```

## Multimodal Aggregated Serving

### Components

- workers: For aggregated serving, we have two workers, [VllmEncodeWorker](components/encode_worker.py) for encoding and [VllmPDWorker](components/worker.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the VllmEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.

### Graph

In this graph, we have two workers, [VllmEncodeWorker](components/encode_worker.py) and [VllmPDWorker](components/worker.py).
The VllmEncodeWorker is responsible for encoding the image and passing the embeddings to the VllmPDWorker via a combination of NATS and RDMA.
The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
The VllmPDWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](../llm/README.md) example.
By separating the encode stage from the prefill and decode stages, we get a more flexible deployment and can scale the
VllmEncodeWorker independently from the prefill and decode workers if needed.

This figure shows the flow of the graph:
```mermaid
flowchart LR
  HTTP --> processor
  processor --> HTTP
  processor --image_url--> encode_worker
  encode_worker --> processor
  encode_worker --embeddings--> pd_worker
  pd_worker --> encode_worker
```
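
To make the handoff described above more concrete, here is a minimal, hypothetical sketch of the pattern: the encode worker produces an embeddings tensor, registers it so the prefill/decode worker can pull it over RDMA, and publishes a small completion event. The helpers (`encode`, `register_for_rdma_read`, `publish_work_complete`) are placeholders standing in for the real NIXL and NATS plumbing in [VllmEncodeWorker](components/encode_worker.py); they are not the actual APIs used by this example.

```python
# Hypothetical sketch of the encode -> prefill/decode embeddings handoff.
# All transport helpers below are placeholders for the real NIXL/NATS code.
from dataclasses import dataclass

import torch


@dataclass
class EmbeddingsDescriptor:
    """Small metadata message: everything the PD worker needs to pull the tensor."""
    rdma_handle: str   # placeholder for a NIXL memory-registration handle
    shape: tuple
    dtype: str


def encode(image_url: str) -> torch.Tensor:
    """Placeholder for the vision encoder; returns a fake embeddings tensor."""
    return torch.randn(1, 576, 1024)


def register_for_rdma_read(tensor: torch.Tensor) -> str:
    """Placeholder: register the tensor with NIXL and return a transfer handle."""
    return "nixl://example-handle"


def publish_work_complete(desc: EmbeddingsDescriptor) -> None:
    """Placeholder: notify the PD worker over NATS that the embeddings are ready."""
    print(f"work complete: {desc}")


def encode_worker_step(image_url: str) -> EmbeddingsDescriptor:
    embeddings = encode(image_url)
    handle = register_for_rdma_read(embeddings)   # large tensor stays local, readable via RDMA
    desc = EmbeddingsDescriptor(handle, tuple(embeddings.shape), str(embeddings.dtype))
    publish_work_complete(desc)                   # only the small event goes over NATS
    return desc


if __name__ == "__main__":
    encode_worker_step("http://images.cocodataset.org/test2017/000000155781.jpg")
```

The point of the split is that only the small descriptor/event travels over NATS, while the large embeddings tensor is read over RDMA through NIXL.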

```bash
cd $DYNAMO_HOME/examples/multimodal_v1
# Serve a LLaVA 1.5 7B model:
dynamo serve graphs.agg:Frontend -f ./configs/agg-llava.yaml
# Serve a Qwen2.5-VL model:
# dynamo serve graphs.agg:Frontend -f ./configs/agg-qwen.yaml
# Serve a Phi3V model:
# dynamo serve graphs.agg:Frontend -f ./configs/agg-phi3v.yaml
```
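
For reference, `graphs.agg:Frontend` points at a graph module under `graphs/` that wires the components above together. The sketch below shows one possible shape of such a module, assuming the link-style composition used by other dynamo example graphs; the import paths and chain order are assumptions, so treat the actual `graphs/agg.py` in this directory as authoritative.

```python
# Hypothetical sketch of a graphs/agg.py-style module (assumed structure and
# assumed import paths; see the real graphs/agg.py for the authoritative wiring).
from components.frontend import Frontend
from components.processor import Processor
from components.encode_worker import VllmEncodeWorker
from components.worker import VllmPDWorker

# Compose the aggregated graph: frontend -> processor -> encode worker -> PD worker,
# mirroring the flowchart above.
Frontend.link(Processor).link(VllmEncodeWorker).link(VllmPDWorker)
```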

### Client

In another terminal:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300,
    "temperature": 0.0,
    "stream": false
  }'
```

If serving the example Qwen model, replace `"llava-hf/llava-1.5-7b-hf"` in the `"model"` field with `"Qwen/Qwen2.5-VL-7B-Instruct"`. If serving the example Phi3V model, replace `"llava-hf/llava-1.5-7b-hf"` in the `"model"` field with `"microsoft/Phi-3.5-vision-instruct"`.

You should see a response similar to this:
```json
{"id": "c37b946e-9e58-4d54-88c8-2dbd92c47b0c", "object": "chat.completion", "created": 1747725277, "model": "llava-hf/llava-1.5-7b-hf", "choices": [{"index": 0, "message": {"role": "assistant", "content": " In the image, there is a city bus parked on a street, with a street sign nearby on the right side. The bus appears to be stopped out of service. The setting is in a foggy city, giving it a slightly moody atmosphere."}, "finish_reason": "stop"}]}
```
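
The same request can also be sent from Python with the `openai` package, pointed at the OpenAI-compatible endpoint used above (the `api_key` value is a placeholder, assuming the local frontend does not require authentication):

```python
# Python equivalent of the curl request above, using the openai client package.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
                    },
                },
            ],
        }
    ],
    max_tokens=300,
    temperature=0.0,
    stream=False,
)
print(response.choices[0].message.content)
```

As with the curl example, swap the `model` value if you are serving the Qwen or Phi3V configuration.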

## Multimodal Disaggregated Serving

### Components

- workers: For disaggregated serving, we have three workers, [VllmEncodeWorker](components/encode_worker.py) for encoding, [VllmDecodeWorker](components/worker.py) for decoding, and [VllmPDWorker](components/worker.py) for prefilling.
- processor: Tokenizes the prompt and passes it to the VllmEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.

### Graph

In this graph, we have three workers, [VllmEncodeWorker](components/encode_worker.py), [VllmDecodeWorker](components/worker.py), and [VllmPDWorker](components/worker.py).
For the LLaVA model, embeddings are only required during the prefill stage. As such, the VllmEncodeWorker is connected directly to the prefill worker.
The VllmEncodeWorker is responsible for encoding the image and passing the embeddings to the prefill worker via a combination of NATS and RDMA.
Its work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
The prefill worker performs the prefilling step and forwards the KV cache to the decode worker for decoding.
For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](../llm/README.md) example.

This figure shows the flow of the graph:
```mermaid
flowchart LR
  HTTP --> processor
  processor --> HTTP
  processor --image_url--> encode_worker
  encode_worker --> processor
  encode_worker --embeddings--> prefill_worker
  prefill_worker --> encode_worker
  prefill_worker --> decode_worker
  decode_worker --> prefill_worker
```

```bash
cd $DYNAMO_HOME/examples/multimodal_v1
dynamo serve graphs.disagg:Frontend -f configs/disagg.yaml
```
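
As in the aggregated example, `graphs.disagg:Frontend` refers to a graph module that wires the three workers together. A rough, hypothetical sketch, with the same caveats as the aggregated sketch above (import paths and chain order are assumptions):

```python
# Hypothetical sketch of a graphs/disagg.py-style module (assumed structure;
# see the real graphs/disagg.py for the authoritative wiring).
from components.frontend import Frontend
from components.processor import Processor
from components.encode_worker import VllmEncodeWorker
from components.worker import VllmDecodeWorker, VllmPDWorker

# Frontend -> processor -> encode worker -> prefill worker -> decode worker,
# mirroring the flowchart above.
Frontend.link(Processor).link(VllmEncodeWorker).link(VllmPDWorker).link(VllmDecodeWorker)
```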

### Client

In another terminal:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300,
    "temperature": 0.0,
    "stream": false
  }'
```

You should see a response similar to this:
```json
{"id": "c1774d61-3299-4aa3-bea1-a0af6c055ba8", "object": "chat.completion", "created": 1747725645, "model": "llava-hf/llava-1.5-7b-hf", "choices": [{"index": 0, "message": {"role": "assistant", "content": " This image shows a passenger bus traveling down the road near power lines and trees. The bus displays a sign that says \"OUT OF SERVICE\" on its front."}, "finish_reason": "stop"}]}
```

***Note***: Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL and Phi3V are not confirmed to be supported.

## Llama 4 Family Serving

The Llama 4 family of models is natively multimodal; however, unlike LLaVA, these models do not directly consume image embeddings as input
(see vLLM's [supported models table](https://docs.vllm.ai/en/latest/models/supported_models.html#text-generation_1)
for the types of multi-modal inputs each model supports).
Therefore, the encode worker is not used in the following example, and encoding is done alongside prefill.

`meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8` is used as the example model below, on an 8xH100 system that can hold one instance
of the model per node.

### Multimodal Aggregated Serving

#### Components

- workers: For aggregated serving, we have one worker, [VllmPDWorker](components/worker.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the VllmPDWorker.
- frontend: HTTP endpoint to handle incoming requests.

#### Graph

In this graph, we have [VllmPDWorker](components/worker.py), which encodes the image and then prefills and decodes the prompt, just like the [LLM aggregated serving](../llm/README.md) example.

This figure shows the flow of the graph:
```mermaid
flowchart LR
  HTTP --> processor
  processor --> HTTP
  processor --image_url--> pd_worker
  pd_worker --> processor
```

```bash
cd $DYNAMO_HOME/examples/multimodal_v1
export CONFIG_FILE=configs/llama.yaml
# Start the components individually: the model is large enough that additional
# nodes are needed to scale up the number of workers, and graph deployment
# doesn't work well in the multi-node case.
dynamo serve components.web:Frontend --service-name Frontend -f $CONFIG_FILE &
dynamo serve components.direct_processor:Processor --service-name Processor -f $CONFIG_FILE &
dynamo serve components.worker:VllmPDWorker --service-name VllmPDWorker -f $CONFIG_FILE &
```

#### Client

In another terminal:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300,
    "temperature": 0.0,
    "stream": false
  }'
```

You should see a response similar to this:
```json
{"id": "b8f060fa95584e34b9204eaba7b105cc", "object": "chat.completion", "created": 1752706281, "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", "choices": [{"index": 0, "message": {"role": "assistant", "content": "The image depicts a street scene with a trolley bus as the central focus. The trolley bus is positioned on the left side of the road, facing the camera, and features a white and yellow color scheme. A prominent sign on the front of the bus reads \"OUT OF SERVICE\" in orange letters.\n\n**Key Elements:**\n\n* **Trolley Bus:** The bus is the main subject of the image, showcasing its distinctive design and color.\n* **Sign:** The \"OUT OF SERVICE\" sign is clearly visible on the front of the bus, indicating its current status.\n* **Street Scene:** The surrounding environment includes trees, buildings, and power lines, creating a sense of context and atmosphere.\n* **Lighting:** The image is characterized by a misty or foggy quality, with soft lighting that adds to the overall ambiance.\n\n**Overall Impression:**\n\nThe image presents a serene and somewhat melancholic scene, with the out-of-service trolley bus serving as a focal point. The misty atmosphere and soft lighting contribute to a dreamy or nostalgic feel, inviting the viewer to reflect on the scene."}, "finish_reason": "stop"}]}
```

### Multimodal Disaggregated Serving

#### Components

- workers: For disaggregated serving, we have two workers, [VllmDecodeWorker](components/worker.py) for decoding and [VllmPDWorker](components/worker.py) for encoding and prefilling.
- processor: Tokenizes the prompt and passes it to the VllmPDWorker.
- frontend: HTTP endpoint to handle incoming requests.

#### Graph

In this graph, we have two workers, [VllmDecodeWorker](components/worker.py) and [VllmPDWorker](components/worker.py).
The prefill worker performs the encoding and prefilling steps and forwards the KV cache to the decode worker for decoding.
For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](../llm/README.md) example.

This figure shows the flow of the graph:
```mermaid
flowchart LR
  HTTP --> processor
  processor --> HTTP
  processor --image_url--> prefill_worker
  prefill_worker --> processor
  prefill_worker --> decode_worker
  decode_worker --> prefill_worker
```

```bash
cd $DYNAMO_HOME/examples/multimodal_v1
export CONFIG_FILE=configs/llama.yaml
# Start the components individually: the model is large enough that additional
# nodes are needed to scale up the number of workers, and graph deployment
# doesn't work well in the multi-node case.
dynamo serve components.web:Frontend --service-name Frontend -f $CONFIG_FILE &
dynamo serve components.direct_processor:Processor --service-name Processor -f $CONFIG_FILE &
dynamo serve components.worker:VllmPDWorker --service-name VllmPDWorker --VllmPDWorker.enable_disagg true -f $CONFIG_FILE &
# On a separate node with a standard dynamo setup
# (i.e. nats and etcd environment variables are set):
dynamo serve components.worker:VllmDecodeWorker --service-name VllmDecodeWorker -f $CONFIG_FILE &
```
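
A model of this size can take a while to load across nodes, so it can help to wait until the frontend starts answering before sending real traffic. Below is a small, hypothetical readiness probe using the `openai` package against the same endpoint as the client examples in this guide; it simply retries a tiny request until one succeeds.

```python
# Hypothetical readiness probe: retry a minimal request until the frontend responds.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

for attempt in range(120):
    try:
        client.chat.completions.create(
            model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,
        )
        print("frontend is ready")
        break
    except Exception as exc:  # connection refused or workers still loading
        print(f"not ready yet ({exc}); retrying in 10s")
        time.sleep(10)
```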

#### Client

In another terminal:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300,
    "temperature": 0.0,
    "stream": false
  }'
```

You should see a response similar to this:
```json
{"id": "6cc99123ad6948d685b8695428238d4b", "object": "chat.completion", "created": 1752708043, "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", "choices": [{"index": 0, "message": {"role": "assistant", "content": "The image depicts a street scene with a trolley bus as the central focus. The trolley bus is positioned on the left side of the road, facing the camera, and features a white and yellow color scheme. A prominent sign on the front of the bus reads \"OUT OF SERVICE\" in orange letters.\n\n**Key Elements:**\n\n* **Trolley Bus:** The bus is the main subject of the image, showcasing its distinctive design and color.\n* **Sign:** The \"OUT OF SERVICE\" sign is clearly visible on the front of the bus, indicating its current status.\n* **Street Scene:** The surrounding environment includes trees, buildings, and power lines, creating a sense of context and atmosphere.\n* **Lighting:** The image is characterized by a misty or foggy quality, with soft lighting that adds to the overall mood.\n\n**Overall Impression:**\n\nThe image presents a serene and somewhat melancholic scene, with the out-of-service trolley bus serving as a focal point. The misty atmosphere and soft lighting contribute to a contemplative ambiance, inviting the viewer to reflect on the situation."}, "finish_reason": "stop"}]}
```
