refactor: Remove unwrap() usage and add explicit error handling in ModelManager #2678
Conversation
```diff
 loop {
-    if let Some(model_name) = self.manager.list_chat_completions_models().first() {
+    if let Some(model_name) = self.manager.list_chat_completions_models().unwrap().first() {
         return model_name.to_owned();
```
Introduces unwrap() here and a few places below but will fix in subsequent PR to ensure this PR stays small.
Walkthrough

Refactors ModelManager to a Result-based API with a new LockPoisoned error, adds engine accessors and KV router creation logic with etcd, and updates callers. Watcher now propagates errors. HTTP health and OpenAI list endpoints now unwrap Result values from manager methods.
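As a rough illustration of the Result-based pattern the walkthrough describes, here is a self-contained sketch. The names mirror the PR, but the struct layout and the helper method are simplified assumptions, not the actual dynamo code:

```rust
use std::collections::HashSet;
use std::sync::RwLock;

// Illustrative stand-in for the PR's new error variant.
#[derive(Debug)]
enum ModelManagerError {
    LockPoisoned(&'static str),
}

// Simplified: the real ModelManager holds several engine maps.
struct ModelManager {
    chat_completion_engines: RwLock<HashSet<String>>,
}

impl ModelManager {
    // Lock acquisition can fail (poisoning), so the method returns Result
    // instead of unwrap()-ing inside.
    fn list_chat_completions_models(&self) -> Result<Vec<String>, ModelManagerError> {
        let engines = self
            .chat_completion_engines
            .read()
            .map_err(|_| ModelManagerError::LockPoisoned("chat_completion_engines"))?;
        Ok(engines.iter().cloned().collect())
    }
}

fn main() {
    let manager = ModelManager {
        chat_completion_engines: RwLock::new(HashSet::from(["llama".to_string()])),
    };
    match manager.list_chat_completions_models() {
        Ok(models) => println!("models: {:?}", models),
        Err(ModelManagerError::LockPoisoned(which)) => eprintln!("lock poisoned: {which}"),
    }
}
```

The real code additionally derives the error with `thiserror` style conventions and wraps call sites with `anyhow::Context`; this sketch keeps to the standard library only.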
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Caller as Watcher/HTTP
    participant MM as ModelManager
    participant Locks as Mutex/RwLock
    participant KV as kv_choosers
    participant Etcd as etcd client
    participant Router as KvRouter
    Caller->>MM: kv_chooser_for(model, component, block_size, config)
    MM->>Locks: Acquire lock (kv_choosers)
    alt chooser exists
        MM->>KV: Get existing chooser
        alt size compatible
            MM-->>Caller: Ok(Arc<KvRouter>)
        else size mismatch
            note right of MM: Return error with context
            MM-->>Caller: Err(anyhow)
        end
    else not found
        MM->>Etcd: Init/router config (async)
        Etcd-->>MM: Ok(client/config)
        MM->>Router: Create KvRouter
        MM->>KV: Insert chooser
        MM-->>Caller: Ok(Arc<KvRouter>)
    end
    note over Locks: Lock errors mapped to ModelManagerError::LockPoisoned
```
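The control flow in the diagram can be sketched in simplified form. All types here, the in-memory map, and the string error are illustrative stand-ins; the real code consults etcd on the creation path and may only warn (rather than error) on a block-size mismatch:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Stand-in for the real KvRouter.
#[derive(Debug)]
struct KvRouter {
    block_size: u32,
}

struct Manager {
    kv_choosers: Mutex<HashMap<String, Arc<KvRouter>>>,
}

impl Manager {
    fn kv_chooser_for(&self, model: &str, block_size: u32) -> Result<Arc<KvRouter>, String> {
        // Lock errors are surfaced, not unwrap()-ed.
        let mut choosers = self
            .kv_choosers
            .lock()
            .map_err(|_| "LockPoisoned: kv_choosers".to_string())?;
        if let Some(existing) = choosers.get(model) {
            if existing.block_size != block_size {
                return Err(format!(
                    "block size mismatch for {model}: {} vs {block_size}",
                    existing.block_size
                ));
            }
            return Ok(Arc::clone(existing));
        }
        // In the real code this path talks to etcd and builds a KvRouter asynchronously.
        let chooser = Arc::new(KvRouter { block_size });
        choosers.insert(model.to_string(), Arc::clone(&chooser));
        Ok(chooser)
    }
}

fn main() {
    let m = Manager { kv_choosers: Mutex::new(HashMap::new()) };
    let first = m.kv_chooser_for("llama", 16).unwrap();
    let reused = m.kv_chooser_for("llama", 16).unwrap();
    assert!(Arc::ptr_eq(&first, &reused)); // existing chooser is reused
    assert!(m.kv_chooser_for("llama", 32).is_err()); // size mismatch path
    println!("ok");
}
```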
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/llm/src/discovery/watcher.rs (1)
128-132: Avoid unwrap() in has_model_any(); treat errors as “not present” and log

This prevents watcher crashes due to transient manager errors.

```diff
-        if self.manager.has_model_any(&model_entry.name).unwrap() {
+        if match self.manager.has_model_any(&model_entry.name) {
+            Ok(found) => found,
+            Err(e) => {
+                tracing::warn!(error = %e, model = %model_entry.name, "failed checking model presence; assuming missing");
+                false
+            }
+        } {
             tracing::trace!(name = model_entry.name, "New endpoint for existing model");
             self.notify_on_model.notify_waiters();
             continue;
         }
```
🧹 Nitpick comments (4)
lib/llm/src/discovery/model_manager.rs (3)
232-257: Preserve original error types instead of stringifying into anyhow

You’re converting ModelManagerError into anyhow::Error via `anyhow!(e.to_string())`, which loses the concrete type and hampers downcasting/diagnostics. Apply this diff to let `?` convert the error while preserving the source:

```diff
-        if let Some(kv_chooser) = self
-            .get_kv_chooser(model_name)
-            .map_err(|e| anyhow::anyhow!(e.to_string()))?
-        {
+        if let Some(kv_chooser) = self.get_kv_chooser(model_name)? {
             // Check if the existing router has a different block size
             if kv_chooser.block_size() != kv_cache_block_size {
                 tracing::warn!( /* ... unchanged ... */ );
             }
             return Ok(kv_chooser);
         }
```
284-306: Potential etcd key leak on router creation failures

You create the etcd router key before constructing the router. If `KvRouter::new(...).await` fails, the key remains. Consider deleting the key on failure or creating the key after successful router construction. Sketch:

```diff
-    etcd_client.kv_create(&router_key, serde_json::to_vec_pretty(&kv_router_config.unwrap_or_default())?, None).await?;
-    let selector = ...
-    let chooser = KvRouter::new(...).await?;
+    let selector = ...
+    let chooser = KvRouter::new(...).await
+        .map_err(|e| {
+            // best-effort cleanup
+            let _ = etcd_client.blocking_kv_delete(&router_key, None);
+            e
+        })?;
+    etcd_client.kv_create(&router_key, serde_json::to_vec_pretty(&kv_router_config.unwrap_or_default())?, None).await?;
```

If async cleanup is preferred, use a scope guard.
309-319: Silent lock poisoning in get_model_tool_call_parser

Returning `None` on a poisoned lock hides an operational error. Either log at least a warning or expose a fallible variant. Minimal improvement:

```diff
     match self.entries.lock() {
         Ok(entries) => { /* unchanged */ }
-        Err(_) => None,
+        Err(_) => {
+            tracing::warn!("Lock poisoned: entries while getting tool_call_parser");
+            None
+        },
     }
```

lib/llm/src/discovery/watcher.rs (1)
120-121: Don’t discard save_model_entry() errors; log them

Swallowing errors makes debugging harder; at least log.

```diff
-        let _ = self.manager.save_model_entry(key, model_entry.clone());
+        if let Err(e) = self.manager.save_model_entry(key, model_entry.clone()) {
+            tracing::warn!(error = %e, key, "Failed to save model entry");
+        }
```
📒 Files selected for processing (4)

- lib/llm/src/discovery/model_manager.rs (11 hunks)
- lib/llm/src/discovery/watcher.rs (5 hunks)
- lib/llm/src/http/service/health.rs (1 hunks)
- lib/llm/src/http/service/openai.rs (1 hunks)
🔇 Additional comments (5)
lib/llm/src/discovery/model_manager.rs (2)
29-31: New LockPoisoned variant is a good addition

Clear identification of which lock failed will help ops significantly.
292-299: Resolved — no double-move: KvRouterConfig is Copy

Short summary: KvRouterConfig is #[derive(..., Copy)] so passing it (or Option<KvRouterConfig>) into DefaultWorkerSelector::new(...) and then into KvRouter::new(...) is safe — no double-move compile error.
Locations:
- lib/llm/src/kv_router.rs — KvRouterConfig is defined with Copy. (see KvRouterConfig derive).
- lib/llm/src/kv_router/scheduler.rs — DefaultWorkerSelector::new(kv_router_config: Option<KvRouterConfig>) takes the config by value.
- lib/llm/src/discovery/model_manager.rs — call-site using DefaultWorkerSelector::new(kv_router_config) and KvRouter::new(..., kv_router_config).
- lib/llm/src/kv_router.rs — KvRouter::new(...) (constructor) is defined (pub async fn new...).
Recommendation: no change required. (Optional: if you prefer explicitness, call kv_router_config.clone() or pass a & reference — but not necessary because KvRouterConfig is Copy.)
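A tiny illustration of why Copy makes the double pass-by-value safe. The struct and field here are made up for the example; only the Copy semantics match the finding above:

```rust
// A Copy type can be passed by value multiple times without a move error,
// because each pass duplicates the bits instead of moving ownership.
#[derive(Clone, Copy, Debug)]
struct KvRouterConfig {
    overlap_score_weight: f64, // field name is illustrative
}

fn use_once(c: KvRouterConfig) -> f64 {
    c.overlap_score_weight
}

fn main() {
    let cfg = KvRouterConfig { overlap_score_weight: 1.0 };
    let a = use_once(cfg);
    let b = use_once(cfg); // no "use of moved value" error: KvRouterConfig is Copy
    println!("{} {}", a, b);
}
```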
lib/llm/src/discovery/watcher.rs (3)
231-243: OK to ignore remove_* errors given follow-up checks

You correctly gate notifications on post-remove list() emptiness, so ignoring remove errors here is acceptable.
293-349: End-to-end KV router creation shape looks good

The flow to build routed pipelines and register engines is coherent and matches RouterMode selection.
420-444: all_entries() parsing and error reporting look solid

Good defensive logging for malformed etcd data and fall-through behavior.
```rust
pub fn has_model_any(&self, model: &str) -> Result<bool, ModelManagerError> {
    let chat = self
        .chat_completion_engines
        .read()
        .map_err(|_| ModelManagerError::LockPoisoned("chat_completion_engines"))?;
    if chat.contains(model) {
        return Ok(true);
    }
    let comp = self
        .completion_engines
        .read()
        .map_err(|_| ModelManagerError::LockPoisoned("completion_engines"))?;
    Ok(comp.contains(model))
}
```
Bug: has_model_any() ignores Embeddings engines
The method’s name implies checking “any” model type, but it only considers chat and completions. This will return false negatives for embeddings-only models.
Apply this diff to include embeddings:
```diff
-    pub fn has_model_any(&self, model: &str) -> Result<bool, ModelManagerError> {
-        let chat = self
-            .chat_completion_engines
-            .read()
-            .map_err(|_| ModelManagerError::LockPoisoned("chat_completion_engines"))?;
-        if chat.contains(model) {
-            return Ok(true);
-        }
-        let comp = self
-            .completion_engines
-            .read()
-            .map_err(|_| ModelManagerError::LockPoisoned("completion_engines"))?;
-        Ok(comp.contains(model))
-    }
+    pub fn has_model_any(&self, model: &str) -> Result<bool, ModelManagerError> {
+        let chat = self
+            .chat_completion_engines
+            .read()
+            .map_err(|_| ModelManagerError::LockPoisoned("chat_completion_engines"))?;
+        if chat.contains(model) {
+            return Ok(true);
+        }
+        let comp = self
+            .completion_engines
+            .read()
+            .map_err(|_| ModelManagerError::LockPoisoned("completion_engines"))?;
+        if comp.contains(model) {
+            return Ok(true);
+        }
+        let embed = self
+            .embeddings_engines
+            .read()
+            .map_err(|_| ModelManagerError::LockPoisoned("embeddings_engines"))?;
+        Ok(embed.contains(model))
+    }
```
🤖 Prompt for AI Agents
In lib/llm/src/discovery/model_manager.rs around lines 70 to 83, has_model_any
currently only checks chat_completion_engines and completion_engines causing
embeddings-only models to be missed; update the function to also acquire a read
lock on embeddings_engines (map_err to
ModelManagerError::LockPoisoned("embeddings_engines")), check contains(model)
and return Ok(true) if found, otherwise continue to return Ok(false) after all
checks. Ensure the new embeddings read lock is checked in the same
style/ordering as the others and that error messages use the exact key
"embeddings_engines".
Ignore. Will handle in another PR.
@paulhendricks Understood! That's a reasonable approach to keep this PR focused on the error handling refactoring. The embeddings check can indeed be addressed in a separate PR.
```rust
if let Some(model_name) = self.manager.list_chat_completions_models().unwrap().first() {
    return model_name.to_owned();
}
```
🛠️ Refactor suggestion
Avoid unwrap() in wait_for_chat_model loop; retry instead of panicking
If listing models fails (e.g., lock poisoning during startup), the watcher will crash. Log and retry.
Apply this diff:
```diff
-    if let Some(model_name) = self.manager.list_chat_completions_models().unwrap().first() {
-        return model_name.to_owned();
-    }
+    match self.manager.list_chat_completions_models() {
+        Ok(models) => {
+            if let Some(model_name) = models.first() {
+                return model_name.to_owned();
+            }
+        }
+        Err(e) => {
+            tracing::warn!(error = %e, "failed to list chat models; retrying");
+        }
+    }
```
🤖 Prompt for AI Agents
In lib/llm/src/discovery/watcher.rs around lines 86 to 88, the code currently
calls unwrap() on the result of list_chat_completions_models() which will panic
on errors; change this to handle the Err case by logging the error and
continuing the wait loop instead of panicking — call
list_chat_completions_models(), match on the Result, on Ok inspect first() as
before, on Err log the error (with context) and simply continue (or sleep/retry)
so the watcher keeps retrying until a model appears rather than crashing.
Ignore. Will handle in another PR.
@paulhendricks Understood! That makes sense to handle the watcher error handling improvements in a separate PR to keep this one focused on the ModelManager changes.
🐰 *nods approvingly*

```
╭─────╮
│ ◕ ◕ │
│  ‿  │
╰─────╯
```
```diff
     axum::extract::State(state): axum::extract::State<Arc<service_v2::State>>,
 ) -> impl IntoResponse {
-    let model_entries = state.manager().get_model_entries();
+    let model_entries = state.manager().get_model_entries().unwrap();
```
Don’t unwrap() get_model_entries(); return 500 on failure
A poisoned lock or other manager error should not panic the health endpoint. Return a stable 500 and a useful message.
Apply this diff:
```diff
-    let model_entries = state.manager().get_model_entries().unwrap();
+    let model_entries = match state.manager().get_model_entries() {
+        Ok(entries) => entries,
+        Err(e) => return ErrorMessage::internal_server_error(&format!("Failed to get model entries: {e}")),
+    };
```
🤖 Prompt for AI Agents
In lib/llm/src/http/service/health.rs around line 55, replace the unwrap() call
on state.manager().get_model_entries() with proper error handling so the health
endpoint does not panic; instead match or use Result combinators to handle Err
and return an Internal Server Error (HTTP 500) response with a clear error
message (or JSON) indicating failure to retrieve model entries, logging the
underlying error as appropriate for diagnostics. Ensure the success path
continues to use the retrieved model_entries.
Ignore. Will handle in another PR.
@paulhendricks Understood! I'll skip this suggestion for the current PR. The temporary unwrap() here makes sense given the current function signature constraints.
Would you like me to create a follow-up issue to track the error handling improvement for the health endpoint, or will you handle that when you're ready for the next PR?
```rust
let models: HashSet<String> = state.manager().model_display_names().unwrap();
for model_name in models {
```
Don’t unwrap() model_display_names(); propagate as HTTP 500
Using unwrap() reintroduces panic paths from lock poisoning or other manager failures — this contradicts the PR’s goal.
Apply this diff:
```diff
-    let models: HashSet<String> = state.manager().model_display_names().unwrap();
+    let models: HashSet<String> = state
+        .manager()
+        .model_display_names()
+        .map_err(|e| ErrorMessage::internal_server_error(&format!("Failed to list models: {e}")))?;
```
🤖 Prompt for AI Agents
In lib/llm/src/http/service/openai.rs around lines 930-931, remove the unwrap()
on state.manager().model_display_names() and instead handle the error path by
returning an HTTP 500 on failure: call model_display_names(), match or use
?/map_err to convert the error into an internal-server-error HTTP response (and
optionally log the error), and only iterate over the returned HashSet on Ok; do
not allow a panic from unwrap().
Ignore. Will handle in another PR.
@paulhendricks Understood! Since you're planning to address this unwrap() in another PR, would you like me to create a follow-up issue to track removing the remaining unwrap() call in the list_models_openai function?
```
(\ /)
( ._. )
o_(")(")
```
Overview:

Remove unwrap() usage and add explicit error handling in ModelManager

Details:

This PR eliminates all unwrap() calls from ModelManager and replaces them with explicit error handling.

- Adds ModelManagerError::LockPoisoned to represent poisoned Mutex/RwLock states.
- Methods now return Result where lock acquisition can fail.
- Errors are given context via anyhow::Context.

These changes improve safety by ensuring lock poisoning and other runtime errors are surfaced instead of causing panics.
Additionally, this introduces unwrap() farther upstream where function signatures aren't fallible. That will be updated to propagate errors in a subsequent PR.

Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit

- New Features
- Stability Improvements
- Refactor