feat: Multimodal support for dynamo with trtllm backend #2195
Conversation
Walkthrough

The changes introduce multimodal support to the TensorRT-LLM backend, including new configuration files for multimodal engines, code updates to process multimodal requests, and documentation enhancements. Key updates add a multimodal processor, extend handler logic for multimodal input/output, provide a new command-line modality argument, and document usage, benchmarking, and experimental features.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Main
    participant Handler
    participant MultimodalProcessor
    participant Engine
    Client->>Main: Send multimodal request (with messages)
    Main->>Handler: Pass request, multimodal_processor, tokenizer
    Handler->>MultimodalProcessor: process_openai_request(request)
    MultimodalProcessor->>Handler: Return processed_inputs
    Handler->>Engine: generate(processed_inputs, sampling_params)
    Engine-->>Handler: Stream tokens
    Handler->>Client: Stream OpenAI-compatible response chunks
```
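The sketch below restates this flow in Python. It is illustrative only: the class shape and the `process_openai_request`/`generate` signatures are simplified assumptions drawn from the diagram, not the actual Dynamo handler interfaces.

```python
from typing import Any, AsyncIterator


class SketchHandler:
    """Illustrative handler mirroring the sequence diagram above."""

    def __init__(self, engine: Any, multimodal_processor: Any, tokenizer: Any):
        self.engine = engine
        self.multimodal_processor = multimodal_processor
        self.tokenizer = tokenizer

    async def generate(self, request: dict) -> AsyncIterator[dict]:
        # Delegate multimodal parsing (text + image content) to the processor.
        processed_inputs = await self.multimodal_processor.process_openai_request(
            request
        )
        sampling_params = request.get("sampling_options", {})
        # Stream tokens from the engine and re-wrap each one as an
        # OpenAI-compatible chat completion chunk.
        async for token in self.engine.generate(processed_inputs, sampling_params):
            yield {
                "object": "chat.completion.chunk",
                "choices": [{"index": 0, "delta": {"content": token}}],
            }
```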
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~40 minutes
Actionable comments posted: 8
🧹 Nitpick comments (6)
components/backends/trtllm/src/dynamo/trtllm/utils/trtllm_utils.py (1)
39-39: Add modality to Config string representation.

The new `modality` attribute should be included in the `__str__` method for better debugging and logging visibility.

```diff
     f"disaggregation_mode={self.disaggregation_mode}, "
     f"disaggregation_strategy={self.disaggregation_strategy}, "
-    f"next_endpoint={self.next_endpoint})"
+    f"next_endpoint={self.next_endpoint}, "
+    f"modality={self.modality})"
```

components/backends/trtllm/engine_configs/multimodal/agg.yaml (1)
33-33: Add newline at end of file

Following POSIX conventions and general best practices, text files should end with a newline character.

```diff
-use_cuda_graph: false
+use_cuda_graph: false
+
```

components/backends/trtllm/src/dynamo/trtllm/utils/multimodal_processor.py (1)
84-84: Make device configurable

The device is hardcoded to "cuda". Consider making it configurable or detecting available devices.

```diff
+    def __init__(self, model_type: str, model_dir: str, device: str = "cuda"):
         self.model_type = model_type
         self.model_dir = model_dir
         self.modality = ""
+        self.device = device
```

And update line 84:

```diff
-            device="cuda",
+            device=self.device,
```
218-220: Add language specification to code block

The response JSON code block should specify the language for proper syntax highlighting.

````diff
-```
+```json
 {"id":"unknown-id","choices":[{"index":0,"message":{"content":"The image depicts a serene landscape featuring a large rock formation, likely El Capitan in Yosemite National Park, California. The scene is characterized by a winding road that curves from the bottom-right corner towards the center-left of the image, with a few rocks and trees lining its edge.\n\n**Key Features:**\n\n* **Rock Formation:** A prominent, tall, and flat-topped rock formation dominates the center of the image.\n* **Road:** A paved road winds its way through the landscape, curving from the bottom-right corner towards the center-left.\n* **Trees and Rocks:** Trees are visible on both sides of the road, with rocks scattered along the left side.\n* **Sky:** The sky above is blue, dotted with white clouds.\n* **Atmosphere:** The overall atmosphere of the","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":"stop","logprobs":null}],"created":1753322607,"model":"meta-llama/Llama-4-Maverick-17B-128E-Instruct","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":null}
````

---

256-257: Maintain consistent list style

The markdown uses both asterisks and dashes for unordered lists. For consistency, use dashes throughout.

```diff
- * Patch 1: [`302b73b`](https://github.com/chang-l/TensorRT-LLM/commit/302b73be5108f58a6795075e5231a31872e42ddd)
- * Patch 2: [`5b7613b`](https://github.com/chang-l/TensorRT-LLM/commit/5b7613bbc78d830efb7c320a3090c3ef862aa0ab)
+ - Patch 1: [`302b73b`](https://github.com/chang-l/TensorRT-LLM/commit/302b73be5108f58a6795075e5231a31872e42ddd)
+ - Patch 2: [`5b7613b`](https://github.com/chang-l/TensorRT-LLM/commit/5b7613bbc78d830efb7c320a3090c3ef862aa0ab)
```
components/backends/trtllm/src/dynamo/trtllm/request_handlers/handler_base.py (1)
107-123: Extract request normalization logic

The OpenAI format normalization logic could be extracted to a separate method for better maintainability and reusability.

```diff
+def _normalize_openai_request(self, request: dict) -> dict:
+    """Normalize OpenAI format parameters to internal format."""
+    if "stop_conditions" not in request:
+        request["stop_conditions"] = {}
+    if "max_tokens" in request and "max_tokens" not in request["stop_conditions"]:
+        request["stop_conditions"]["max_tokens"] = request.pop("max_tokens")
+
+    if "sampling_options" not in request:
+        request["sampling_options"] = {}
+    if "temperature" in request and "temperature" not in request["sampling_options"]:
+        request["sampling_options"]["temperature"] = request.pop("temperature")
+
+    return request

         # Check for multimodal request and process it
         if self.multimodal_processor:
-            # Normalize the request to handle OpenAI format
-            if "stop_conditions" not in request:
-                request["stop_conditions"] = {}
-            if (
-                "max_tokens" in request
-                and "max_tokens" not in request["stop_conditions"]
-            ):
-                request["stop_conditions"]["max_tokens"] = request.pop("max_tokens")
-
-            if "sampling_options" not in request:
-                request["sampling_options"] = {}
-            if (
-                "temperature" in request
-                and "temperature" not in request["sampling_options"]
-            ):
-                request["sampling_options"]["temperature"] = request.pop("temperature")
+            request = self._normalize_openai_request(request)
```
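For illustration, a standalone adaptation of the suggested helper (with `self` dropped) behaves like this on an OpenAI-style payload; this is a sketch of the suggestion, not code from the PR:

```python
def normalize_openai_request(request: dict) -> dict:
    """Standalone version of the suggested _normalize_openai_request helper."""
    request.setdefault("stop_conditions", {})
    if "max_tokens" in request and "max_tokens" not in request["stop_conditions"]:
        request["stop_conditions"]["max_tokens"] = request.pop("max_tokens")

    request.setdefault("sampling_options", {})
    if "temperature" in request and "temperature" not in request["sampling_options"]:
        request["sampling_options"]["temperature"] = request.pop("temperature")
    return request


# Top-level OpenAI parameters are moved into the internal nested structure.
req = {"messages": [], "max_tokens": 256, "temperature": 0.7}
assert normalize_openai_request(req) == {
    "messages": [],
    "stop_conditions": {"max_tokens": 256},
    "sampling_options": {"temperature": 0.7},
}
```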
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- components/backends/trtllm/README.md (2 hunks)
- components/backends/trtllm/engine_configs/multimodal/agg.yaml (1 hunks)
- components/backends/trtllm/engine_configs/multimodal/decode.yaml (1 hunks)
- components/backends/trtllm/engine_configs/multimodal/prefill.yaml (1 hunks)
- components/backends/trtllm/launch/disagg.sh (1 hunks)
- components/backends/trtllm/src/dynamo/trtllm/main.py (5 hunks)
- components/backends/trtllm/src/dynamo/trtllm/request_handlers/handler_base.py (10 hunks)
- components/backends/trtllm/src/dynamo/trtllm/request_handlers/handlers.py (2 hunks)
- components/backends/trtllm/src/dynamo/trtllm/utils/multimodal_processor.py (1 hunks)
- components/backends/trtllm/src/dynamo/trtllm/utils/trtllm_utils.py (3 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: ptarasiewiczNV
PR: ai-dynamo/dynamo#2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The `--torch-backend=auto` flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.
components/backends/trtllm/engine_configs/multimodal/agg.yaml (2)
Learnt from: ptarasiewiczNV
PR: #2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The --torch-backend=auto flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.
Learnt from: tanmayv25
PR: #1391
File: examples/tensorrt_llm/common/base_engine.py:171-176
Timestamp: 2025-06-05T01:10:51.865Z
Learning: In examples/tensorrt_llm/common/base_engine.py, the _init_engine method is called only once during initialization, so direct mutation of the _default_sampling_params object during setup is safe and appropriate.
components/backends/trtllm/src/dynamo/trtllm/main.py (2)
Learnt from: ptarasiewiczNV
PR: #2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The --torch-backend=auto flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.
Learnt from: tanmayv25
PR: #1391
File: examples/tensorrt_llm/common/base_engine.py:171-176
Timestamp: 2025-06-05T01:10:51.865Z
Learning: In examples/tensorrt_llm/common/base_engine.py, the _init_engine method is called only once during initialization, so direct mutation of the _default_sampling_params object during setup is safe and appropriate.
components/backends/trtllm/src/dynamo/trtllm/request_handlers/handler_base.py (2)
Learnt from: tanmayv25
PR: #1391
File: examples/tensorrt_llm/common/base_engine.py:171-176
Timestamp: 2025-06-05T01:10:51.865Z
Learning: In examples/tensorrt_llm/common/base_engine.py, the _init_engine method is called only once during initialization, so direct mutation of the _default_sampling_params object during setup is safe and appropriate.
Learnt from: ishandhanani
PR: #1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the sampling_options call in the preprocess_request method is placed in the common section after the match statement on request.prompt_input_type(), meaning it applies to both PromptInput::Tokens and PromptInput::Text request types.
🪛 markdownlint-cli2 (0.17.2)
components/backends/trtllm/README.md
218-218: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
256-256: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
257-257: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
304-304: Multiple headings with the same content
(MD024, no-duplicate-heading)
313-313: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
314-314: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
357-357: Multiple headings with the same content
(MD024, no-duplicate-heading)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (6)
components/backends/trtllm/launch/disagg.sh (1)
13-14: LGTM! Clear documentation for enabling multimodal mode.

The commented lines provide helpful guidance for users to enable multimodal functionality without affecting the default behavior.
components/backends/trtllm/src/dynamo/trtllm/request_handlers/handlers.py (1)
127-141: LGTM! Improved prefill response handling.

The change to explicitly capture and extract prefill response state instead of yielding directly provides better control over the processing flow. This approach is beneficial for multimodal support where additional response processing may be needed.
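A minimal sketch of the capture-then-yield pattern being praised here; the function and stream names are illustrative, not the handler's real interface:

```python
from typing import Any, AsyncIterator, Optional


async def get_prefill_response(
    prefill_stream: AsyncIterator[Any],
) -> Optional[Any]:
    """Capture the prefill worker's response instead of yielding it directly.

    Holding the response lets the caller inspect or transform its state
    (e.g., apply multimodal post-processing) before anything is streamed on.
    """
    prefill_response = None
    async for response in prefill_stream:
        # Disaggregated prefill is expected to produce a single response.
        prefill_response = response
    return prefill_response
```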
components/backends/trtllm/engine_configs/multimodal/decode.yaml (1)
15-28: Verify multimodal-specific optimizations in decode.yaml

I ran searches across our YAML configs and didn't find any multimodal-specific tuning or documentation. Please confirm whether parameters like `max_batch_size`, `max_num_tokens`, and `kv_cache_config` are optimal for your target multimodal workloads, or update the config with any required image/vision-specific settings.

• File to review: `components/backends/trtllm/engine_configs/multimodal/decode.yaml`

components/backends/trtllm/engine_configs/multimodal/prefill.yaml (1)
15-30: LGTM! Appropriate prefill-specific optimizations.

The configuration correctly disables CUDA graphs and overlap scheduler for prefill workloads, with clear documentation explaining the rationale. The chunked prefill feature will be beneficial for handling large multimodal contexts.
components/backends/trtllm/src/dynamo/trtllm/utils/trtllm_utils.py (2)
151-157: LGTM! Clean implementation of modality configuration.

The command-line argument implementation follows established patterns with appropriate defaults, validation, and clear help text.
210-210: LGTM! Proper config assignment.

The modality value is correctly assigned to the config object, maintaining consistency with other command-line argument handling.
@coderabbitai resolve

✅ Actions performed

Comments resolved.
Overview:
Introduces multimodal support for Dynamo using the TensorRT-LLM backend. This allows models to process requests that include both text and images. Additionally, it adds experimental support for providing pre-computed embeddings directly in requests, which can improve performance by bypassing the model's own embedding generation.
Details:
Multimodal Request Handling: A new MultimodalRequestProcessor has been added to the TRT-LLM worker. This processor handles OpenAI-formatted requests that contain image URLs or paths to local embedding files. It processes this content and prepares it for the model.
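For reference, an OpenAI-formatted multimodal request of the kind the processor consumes looks roughly like this; the image URL is a placeholder, and the exact fields accepted by the processor may differ:

```python
# Hypothetical payload; the image URL is a placeholder.
request = {
    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image.png"},
                },
            ],
        }
    ],
    "max_tokens": 256,
}
```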
Configuration:
- A `--modality` command-line argument has been added, which can be set to `multimodal` to enable this functionality (sketched below).
- New engine configurations (`agg.yaml`, `decode.yaml`, `prefill.yaml`) have been added for multimodal scenarios.
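The flag can be wired up along these lines; this is a sketch consistent with the review note on lines 151-157, and the exact choices and default in `trtllm_utils.py` may differ:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--modality",
    type=str,
    default="text",
    choices=["text", "multimodal"],
    help="Modality of the served model; set to 'multimodal' to enable "
    "image inputs.",
)

# Passing the flag selects the multimodal request path.
args = parser.parse_args(["--modality", "multimodal"])
assert args.modality == "multimodal"
```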
Pre-computed Embeddings (Experimental): Users can now provide pre-computed embeddings in .pt, .pth, or .bin formats. The system detects these files, loads the tensors, and passes them directly to the model.
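A hedged sketch of the detection-and-load step for such files, assuming PyTorch; the actual handling in `multimodal_processor.py` may differ:

```python
from pathlib import Path
from typing import Optional

import torch

EMBEDDING_SUFFIXES = {".pt", ".pth", ".bin"}


def maybe_load_embeddings(path_str: str) -> Optional[torch.Tensor]:
    """Return a tensor if the path points at a pre-computed embedding file."""
    path = Path(path_str)
    if path.suffix.lower() in EMBEDDING_SUFFIXES and path.is_file():
        # Load on CPU; the caller can move the tensor to the target device.
        return torch.load(path, map_location="cpu")
    return None
```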
Documentation: The README.md has been updated with detailed instructions on how to use the new multimodal and pre-computed embedding features, including example curl commands.
Where should the reviewer start?
- `components/backends/trtllm/src/dynamo/trtllm/utils/multimodal_processor.py`: This new file contains the core logic for processing multimodal requests.
- `components/backends/trtllm/src/dynamo/trtllm/request_handlers/handler_base.py`: Review the changes for handling and streaming multimodal responses with the new `modelType = ModelType.Chat`.
- `components/backends/trtllm/src/dynamo/trtllm/main.py`: Note how the MultimodalRequestProcessor is initialized.

Summary by CodeRabbit
New Features
Documentation
Chores