feat: Initial Granite support #1271
Conversation
- Add Granite to our tokenizer
- Fix pre-processor to load context length correctly
- Add strftime_now Jinja function for prompt templates
- Update llama.cpp
- Handle trtllm errors when not using trtllm

Support depends on the engine:

- `mistral.rs`, our default engine, doesn't support Granite yet.
- `llama.cpp` does and works very well:
  ```
  dynamo-run out=llamacpp ~/llms/granite-3.3-2b-instruct-Q4_K_M.gguf --context-length 16384
  ```
- `vllm` also works very well:
  ```
  dynamo-run in=http out=vllm ~/llms/granite-3.3-2b-instruct --context-length 16384
  ```
- `sglang` mostly works, but it doesn't catch the stop token, so we catch it in the HTTP ingress and log an error. The Text ingress doesn't catch it because I disabled it to make the raw echo engine work. A bit of work to do here.
Walkthrough

This update introduces IBM Granite model support by adding a new enum variant, adjusts model context length extraction logic to prioritize `max_position_embeddings` from `config.json` and fall back to `model_max_length` from `tokenizer_config.json`, adds a `strftime_now` Jinja function for prompt templates, improves error handling for TensorRT-LLM imports, and updates the `llama-cpp-2` dependency.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant ModelLoader
    participant ConfigReader
    User->>ModelLoader: Load model (e.g., IBM Granite)
    ModelLoader->>ConfigReader: Read config.json
    alt max_position_embeddings found
        ConfigReader-->>ModelLoader: Return context length
    else not found
        ConfigReader->>ConfigReader: Read tokenizer_config.json
        alt model_max_length found
            ConfigReader-->>ModelLoader: Return context length
        else not found
            ConfigReader-->>ModelLoader: Return 0
        end
    end
    ModelLoader-->>User: Model loaded with context length
```
Actionable comments posted: 0
🧹 Nitpick comments (1)
lib/llm/src/preprocessor/prompt/template/tokcfg.rs (1)
230-235: Consider improving the function design and documentation.

The `strftime_now` function implementation is functionally correct, but there are a few considerations:

- Return type mismatch: The function signature suggests it can fail with `Result<Value, Error>`, but the implementation never returns an error. Consider whether this should return `Value` directly or if error handling for invalid format strings is needed.
- Local timezone assumption: Using `Local::now()` assumes the local timezone is appropriate for all use cases. Consider if UTC time might be more predictable in server environments.
- Missing documentation: The function lacks documentation explaining the expected format string syntax and behavior.
Consider this improvement:

```diff
+/// Formats the current local date and time according to the given format string.
+/// Uses chrono's format syntax (e.g., "%Y-%m-%d %H:%M:%S").
+/// Returns the formatted string as a safe template value.
 pub fn strftime_now(format_str: &str) -> Result<Value, Error> {
     let local: DateTime<Local> = Local::now();
-    Ok(Value::from_safe_string(
-        local.format(format_str).to_string(),
-    ))
+    match local.format(format_str).to_string() {
+        formatted => Ok(Value::from_safe_string(formatted)),
+    }
 }
```

Or if error handling isn't needed:

```diff
+/// Formats the current local date and time according to the given format string.
-pub fn strftime_now(format_str: &str) -> Result<Value, Error> {
+pub fn strftime_now(format_str: &str) -> Value {
     let local: DateTime<Local> = Local::now();
-    Ok(Value::from_safe_string(
-        local.format(format_str).to_string(),
-    ))
+    Value::from_safe_string(local.format(format_str).to_string())
 }
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (6)
- lib/bindings/python/src/dynamo/llm/__init__.py (2 hunks)
- lib/engines/llamacpp/Cargo.toml (1 hunks)
- lib/llm/src/gguf.rs (1 hunks)
- lib/llm/src/model_card/create.rs (1 hunks)
- lib/llm/src/preprocessor/prompt/template/formatters.rs (2 hunks)
- lib/llm/src/preprocessor/prompt/template/tokcfg.rs (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
lib/llm/src/preprocessor/prompt/template/formatters.rs (1)
lib/llm/src/preprocessor/prompt/template/tokcfg.rs (3)
- `raise_exception` (39-41)
- `strftime_now` (230-235)
- `tojson` (196-228)
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (8)
lib/engines/llamacpp/Cargo.toml (1)
41-41: ✅ Verification successful
Verify the dependency version upgrade is safe and beneficial.
The `llama-cpp-2` dependency has been updated from `0.1.103` to `0.1.107`. While this aligns with the PR's goal of improving llama.cpp support for Granite models, please ensure this version upgrade is stable and compatible.
🏁 Script executed:
```bash
#!/bin/bash
# Description: Check for security advisories and verify the version exists
# Expected: Confirm version 0.1.107 exists and check for any security issues

# Check if the version exists on crates.io
curl -s "https://crates.io/api/v1/crates/llama-cpp-2" | jq '.versions[] | select(.num == "0.1.107")'

# Check for security advisories
gh api graphql -f query='
{
  securityVulnerabilities(first: 10, ecosystem: RUST, package: "llama-cpp-2") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'
```

Length of output: 2553
Confirmed llama-cpp-2 v0.1.107 is Available and Secure
- Crate v0.1.107 exists on crates.io (not yanked)
- No Rust security advisories reported for this version
- License remains “MIT OR Apache-2.0”
This bump aligns with the PR’s goals and can be merged as-is.
lib/llm/src/preprocessor/prompt/template/tokcfg.rs (1)
20-20: Import addition looks good.

The `chrono::{DateTime, Local}` import is correctly added to support the new `strftime_now` function.

lib/llm/src/preprocessor/prompt/template/formatters.rs (2)
18-18: Import update is correct.

The import statement has been properly updated to include the new `strftime_now` function alongside the existing imports.
55-56: Function registration is properly implemented.

The `strftime_now` function is correctly registered in the Jinja environment, making it available for use in prompt templates. The reordering of function registrations (moving `raise_exception` after the `tojson` filter) doesn't affect functionality but maintains a consistent organization.

lib/bindings/python/src/dynamo/llm/__init__.py (2)
16-16: Logging import addition is appropriate.

Adding the `logging` import enables proper error reporting for TensorRT-LLM import failures.
50-53: Excellent improvement to error handling.

The enhanced error handling is a significant improvement over silently ignoring all import failures. This approach:

- Catches broader exceptions: Beyond just `ImportError`, this handles any exception that might occur during TensorRT-LLM imports
- Provides debugging information: Logs the full exception with stack trace using `logger.exception()`
- Maintains isolation: Prevents TensorRT-LLM issues from affecting other engine imports
- Uses appropriate logger: Creates a module-specific logger using `__name__`

This aligns perfectly with the PR's objective of enhancing error handling for `trtllm` when it's not in use.

lib/llm/src/model_card/create.rs (1)
118-131: LGTM! Improved context length extraction with proper fallback logic.

The cascading fallback approach is well-implemented:

- First attempts the standard HuggingFace `max_position_embeddings` from `config.json`
- Falls back to `model_max_length` from `tokenizer_config.json`
- Defaults to 0 to let the engine handle it
This should improve compatibility across different model configurations and follows best practices for configuration precedence.
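Editor's note: as a concrete illustration of that precedence, here is a minimal, hypothetical sketch of the cascading lookup. It is not the actual `create.rs` code; the function name is invented, and it assumes `serde_json` for parsing, but the field and file names follow the HuggingFace conventions described above.

```rust
use serde_json::Value;

/// Hypothetical helper mirroring the described fallback chain:
/// config.json `max_position_embeddings`, then tokenizer_config.json
/// `model_max_length`, then 0 so the engine picks its own default.
fn context_length(config: &Value, tokenizer_config: &Value) -> u64 {
    config
        .get("max_position_embeddings")
        .and_then(Value::as_u64)
        .or_else(|| {
            tokenizer_config
                .get("model_max_length")
                .and_then(Value::as_u64)
        })
        .unwrap_or(0)
}

fn main() {
    let config = serde_json::json!({ "max_position_embeddings": 131072 });
    let tok_config = serde_json::json!({ "model_max_length": 16384 });
    // config.json wins when present; otherwise fall back, otherwise 0.
    assert_eq!(context_length(&config, &tok_config), 131072);
    assert_eq!(context_length(&serde_json::json!({}), &tok_config), 16384);
    assert_eq!(context_length(&serde_json::json!({}), &serde_json::json!({})), 0);
}
```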
lib/llm/src/gguf.rs (1)
61-61: LGTM! Clean addition of Granite architecture support.

The new `Granite` variant is correctly added to the `GGUFArchitecture` enum, following the existing naming convention and placement pattern.
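Editor's note: for readers unfamiliar with that enum, here is a rough, hypothetical sketch of what such an architecture enum and its string mapping typically look like. The variant list and function name are assumptions, not the actual `gguf.rs` definitions; GGUF files identify the model family via the `general.architecture` metadata key.

```rust
/// Hypothetical stand-in for the architecture enum discussed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GGUFArchitecture {
    Llama,
    Granite,
    // ...other supported architectures elided
}

impl GGUFArchitecture {
    /// Maps the GGUF `general.architecture` metadata string to a variant.
    fn from_architecture_str(s: &str) -> Option<Self> {
        match s {
            "llama" => Some(Self::Llama),
            "granite" => Some(Self::Granite),
            _ => None,
        }
    }
}
```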
```rust
env.add_filter("tojson", tojson);

env.add_function("raise_exception", raise_exception);
env.add_function("strftime_now", strftime_now);
```
What is this hooking into? Do some chat/jinja templates have some "<strftime_now>"-like placeholders in them, and this is saying to populate them with our custom strftime_now function?
Exactly what you said.
The Python HF transformers library supports this; it's documented here:
https://huggingface.co/docs/transformers/main/en/chat_templating_writing#callable-functions
Granite's chat template has it: https://huggingface.co/ibm-granite/granite-3.3-2b-instruct/blob/main/tokenizer_config.json#L224 (you have to scroll a long long way to the right).
I'm surprised we got this far without it.
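Editor's note: to make the mechanism concrete, here is a small, self-contained sketch (an illustration under stated assumptions, not the project's actual code) of registering such a function with the `minijinja` crate and calling it from a toy chat template, the same way Granite's real template calls `strftime_now`. The one-line template is hypothetical; Granite's real template lives in its `tokenizer_config.json`.

```rust
use chrono::Local;
use minijinja::{context, value::Value, Environment, Error};

// Format the current local date/time with chrono's strftime syntax and
// return it as a safe template value.
fn strftime_now(format_str: &str) -> Result<Value, Error> {
    Ok(Value::from_safe_string(
        Local::now().format(format_str).to_string(),
    ))
}

fn main() -> Result<(), Error> {
    let mut env = Environment::new();
    // Expose the Rust function to templates under the name the template expects.
    env.add_function("strftime_now", strftime_now);
    // Hypothetical template: the {{ strftime_now(...) }} call is resolved at render time.
    env.add_template(
        "chat",
        "Today's date: {{ strftime_now('%B %d, %Y') }}. You are Granite.",
    )?;
    let rendered = env.get_template("chat")?.render(context! {})?;
    println!("{rendered}");
    Ok(())
}
```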
Support depends on the engine:

- `mistral.rs`, our default engine, doesn't support Granite yet.
- `llama.cpp` does and works very well (see the command above).
- `vllm` also works very well (see the command above).
- `sglang` mostly works, but it doesn't catch the stop token, so we catch it in the HTTP ingress and log an error. The Text ingress doesn't catch it because I disabled it to make the raw echo engine work. A bit of work to do here.

Closes: #1245
Summary by CodeRabbit

New Features
- Added support for IBM Granite models.
- Added a `strftime_now` function available in prompt templates.

Improvements
- Improved model context length detection, with a fallback to the tokenizer configuration.
- Improved error handling and logging for TensorRT-LLM imports.

Dependency Updates
- Updated the `llama-cpp-2` dependency to version 0.1.107 for improved compatibility and features.