
Conversation

@paulhendricks paulhendricks (Member) commented Aug 23, 2025

Overview:

Remove unwrap() usage and add explicit error handling in ModelManager

Details:

This PR eliminates all unwrap() calls from ModelManager and replaces them with explicit error handling.

  • Introduced ModelManagerError::LockPoisoned to represent poisoned Mutex/RwLock states.
  • Updated methods to return Result where lock acquisition can fail.
  • Converted async methods to propagate lock errors using anyhow::Context.
  • Organized imports for clarity and consistency.

These changes improve safety by ensuring lock poisoning and other runtime errors are surfaced instead of causing panics.

Additionally, this PR introduces unwrap() calls farther upstream where the function signatures aren't fallible; those will be updated to propagate errors in a subsequent PR.
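A minimal sketch of the pattern, with illustrative names (has_chat_model is hypothetical; the LockPoisoned variant and lock names mirror the diff below, and thiserror is assumed for the derive):

use std::collections::HashSet;
use std::sync::RwLock;

#[derive(Debug, thiserror::Error)]
pub enum ModelManagerError {
    // Carries the name of the poisoned lock for diagnostics.
    #[error("Lock poisoned: {0}")]
    LockPoisoned(&'static str),
}

pub struct ModelManager {
    chat_completion_engines: RwLock<HashSet<String>>,
}

impl ModelManager {
    // Lock acquisition failures surface as a typed error instead of a panic.
    pub fn has_chat_model(&self, model: &str) -> Result<bool, ModelManagerError> {
        let engines = self
            .chat_completion_engines
            .read()
            .map_err(|_| ModelManagerError::LockPoisoned("chat_completion_engines"))?;
        Ok(engines.contains(model))
    }
}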

Where should the reviewer start?

  • lib/llm/src/discovery/model_manager.rs

Related Issues:

  • N/A

Summary by CodeRabbit

  • New Features

    • Enhanced model management with clearer feedback when adding, listing, or removing models.
    • Expanded routing support for key-value caching with component-aware selection and compatibility handling.
    • Improved access to model engines for chat, completions, and embeddings.
  • Stability Improvements

    • More consistent error handling across model operations, reducing unexpected failures and aiding troubleshooting.
  • Refactor

    • Internal APIs migrated to explicit result-based flows and safer synchronization, improving reliability without changing core behavior.
    • Broader error context added to lock and routing paths for easier diagnostics.

@paulhendricks paulhendricks requested a review from a team as a code owner August 23, 2025 02:29
@copy-pr-bot copy-pr-bot bot commented Aug 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment on lines 85 to 87
 loop {
-    if let Some(model_name) = self.manager.list_chat_completions_models().first() {
+    if let Some(model_name) = self.manager.list_chat_completions_models().unwrap().first() {
         return model_name.to_owned();
@paulhendricks paulhendricks (Member, Author) commented Aug 23, 2025

Introduces unwrap() here and in a few places below; will fix in a subsequent PR to keep this one small.

@coderabbitai coderabbitai bot (Contributor) commented Aug 23, 2025

Walkthrough

Refactors ModelManager to a Result-based API with a new LockPoisoned error, adds engine accessors and KV router creation logic with etcd, and updates callers. Watcher now propagates errors. HTTP health and OpenAI list endpoints now unwrap Result values from manager methods.

Changes

Cohort / File(s), with summaries:

  • ModelManager API & synchronization overhaul (lib/llm/src/discovery/model_manager.rs): Converted many methods to return Result; added ModelManagerError::LockPoisoned; introduced add_* and get_* engine APIs; reworked KV router access with get_kv_chooser/create_kv_chooser, a component param, and etcd-backed creation; replaced unwrap-based locks with error mapping and context.
  • Watcher migration to Result-based API (lib/llm/src/discovery/watcher.rs): Updated call sites to use ? with list/remove/add/save methods; adjusted handling for delete and put paths; now relies on Result-returning APIs for model presence checks and modifications.
  • HTTP services: direct unwrap of manager results (lib/llm/src/http/service/health.rs, lib/llm/src/http/service/openai.rs): Switched to unwrap() on get_model_entries and model_display_names respectively; logic otherwise unchanged, but introduces potential panic on error.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Caller as Watcher/HTTP
  participant MM as ModelManager
  participant Locks as Mutex/RwLock
  participant KV as kv_choosers
  participant Etcd as etcd client
  participant Router as KvRouter

  Caller->>MM: kv_chooser_for(model, component, block_size, config)
  MM->>Locks: Acquire lock (kv_choosers)
  alt chooser exists
    MM->>KV: Get existing chooser
    alt size compatible
      MM-->>Caller: Ok(Arc<KvRouter>)
    else size mismatch
      note right of MM: Return error with context
      MM-->>Caller: Err(anyhow)
    end
  else not found
    MM->>Etcd: Init/router config (async)
    Etcd-->>MM: Ok(client/config)
    MM->>Router: Create KvRouter
    MM->>KV: Insert chooser
    MM-->>Caller: Ok(Arc<KvRouter>)
  end
  note over Locks: Lock errors mapped to ModelManagerError::LockPoisoned

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I thump my paws at mutex doors,
No panics now—just tidy scores.
Results hop back, with errors tamed,
A KV trail, etcd named.
Models line up, engines queue—
I nibble logs and leap right through.
Hippity-sync, we’re brand-new!


@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/llm/src/discovery/watcher.rs (1)

128-132: Avoid unwrap() in has_model_any(); treat errors as “not present” and log

This prevents watcher crashes due to transient manager errors.

-                    if self.manager.has_model_any(&model_entry.name).unwrap() {
+                    if match self.manager.has_model_any(&model_entry.name) {
+                        Ok(found) => found,
+                        Err(e) => {
+                            tracing::warn!(error = %e, model = %model_entry.name, "failed checking model presence; assuming missing");
+                            false
+                        }
+                    } {
                         tracing::trace!(name = model_entry.name, "New endpoint for existing model");
                         self.notify_on_model.notify_waiters();
                         continue;
                     }
🧹 Nitpick comments (4)
lib/llm/src/discovery/model_manager.rs (3)

232-257: Preserve original error types instead of stringifying into anyhow

You’re converting ModelManagerError into anyhow::Error via anyhow!(e.to_string()), which loses the concrete type and hampers downcasting/diagnostics.

Apply this diff to let ? convert the error while preserving the source:

-        if let Some(kv_chooser) = self
-            .get_kv_chooser(model_name)
-            .map_err(|e| anyhow::anyhow!(e.to_string()))?
-        {
+        if let Some(kv_chooser) = self.get_kv_chooser(model_name)? {
             // Check if the existing router has a different block size
             if kv_chooser.block_size() != kv_cache_block_size {
                 tracing::warn!( /* ... unchanged ... */ );
             }
             return Ok(kv_chooser);
         }
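
For reference, ? performs this conversion because ModelManagerError implements std::error::Error via thiserror, which anyhow::Error accepts through From while keeping the source chain available for downcasting. A small standalone contrast, reusing the ModelManagerError sketch from the PR description (function names illustrative):

use anyhow::Result;

fn fallible() -> std::result::Result<(), ModelManagerError> {
    Err(ModelManagerError::LockPoisoned("kv_choosers"))
}

fn preserves_type() -> Result<()> {
    fallible()?; // ModelManagerError -> anyhow::Error, source preserved
    Ok(())
}

fn loses_type() -> Result<()> {
    // downcast_ref::<ModelManagerError>() on this error returns None,
    // because only the rendered message survives the conversion.
    fallible().map_err(|e| anyhow::anyhow!(e.to_string()))?;
    Ok(())
}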

284-306: Potential etcd key leak on router creation failures

You create the etcd router key before constructing the router. If KvRouter::new(...).await fails, the key remains. Consider deleting the key on failure or creating the key after successful router construction.

Sketch:

-        etcd_client.kv_create(&router_key, serde_json::to_vec_pretty(&kv_router_config.unwrap_or_default())?, None).await?;
-        let selector = ...
-        let chooser = KvRouter::new(...).await?;
+        let selector = ...
+        let chooser = KvRouter::new(...).await
+            .map_err(|e| {
+                // best-effort cleanup
+                let _ = etcd_client.blocking_kv_delete(&router_key, None);
+                e
+            })?;
+        etcd_client.kv_create(&router_key, serde_json::to_vec_pretty(&kv_router_config.unwrap_or_default())?, None).await?;

If async cleanup is preferred, use a scope guard.
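
A third option is explicit cleanup in the error branch, which stays async without a guard; a sketch only (kv_delete is an assumed method name on the etcd client, and the surrounding function is elided):

let chooser = match KvRouter::new(/* ... */).await {
    Ok(chooser) => chooser,
    Err(e) => {
        // Best-effort cleanup of the router key created above.
        if let Err(del_err) = etcd_client.kv_delete(&router_key, None).await {
            tracing::warn!(error = %del_err, "failed to clean up router key");
        }
        return Err(e);
    }
};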


309-319: Silent lock poisoning in get_model_tool_call_parser

Returning None on a poisoned lock hides an operational error. Either log at least a warning or expose a fallible variant.

Minimal improvement:

-        match self.entries.lock() {
+        match self.entries.lock() {
             Ok(entries) => { /* unchanged */ }
-            Err(_) => None,
+            Err(_) => {
+                tracing::warn!("Lock poisoned: entries while getting tool_call_parser");
+                None
+            },
         }
lib/llm/src/discovery/watcher.rs (1)

120-121: Don’t discard save_model_entry() errors; log them

Swallowing errors makes debugging harder; at least log.

-                    let _ = self.manager.save_model_entry(key, model_entry.clone());
+                    if let Err(e) = self.manager.save_model_entry(key, model_entry.clone()) {
+                        tracing::warn!(error = %e, key, "Failed to save model entry");
+                    }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 05913af and a319de6.

📒 Files selected for processing (4)
  • lib/llm/src/discovery/model_manager.rs (11 hunks)
  • lib/llm/src/discovery/watcher.rs (5 hunks)
  • lib/llm/src/http/service/health.rs (1 hunks)
  • lib/llm/src/http/service/openai.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
lib/llm/src/http/service/health.rs (1)
lib/llm/src/http/service/service_v2.rs (1)
  • state (186-188)
lib/llm/src/http/service/openai.rs (1)
lib/llm/src/http/service/service_v2.rs (1)
  • state (186-188)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (5)
lib/llm/src/discovery/model_manager.rs (2)

29-31: New LockPoisoned variant is a good addition

Clear identification of which lock failed will help ops significantly.


292-299: Resolved — no double-move: KvRouterConfig is Copy

Short summary: KvRouterConfig is #[derive(..., Copy)] so passing it (or Option<KvRouterConfig>) into DefaultWorkerSelector::new(...) and then into KvRouter::new(...) is safe — no double-move compile error.

Locations:

  • lib/llm/src/kv_router.rs — KvRouterConfig is defined with Copy. (see KvRouterConfig derive).
  • lib/llm/src/kv_router/scheduler.rs — DefaultWorkerSelector::new(kv_router_config: Option<KvRouterConfig>) takes the config by value.
  • lib/llm/src/discovery/model_manager.rs — call-site using DefaultWorkerSelector::new(kv_router_config) and KvRouter::new(..., kv_router_config).
  • lib/llm/src/kv_router.rs — KvRouter::new(...) (constructor) is defined (pub async fn new...).

Recommendation: no change required. (Optional: if you prefer explicitness, call kv_router_config.clone() or pass by reference — but not necessary because KvRouterConfig is Copy.)
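
A standalone illustration of why Copy rules out the double-move (fields elided):

#[derive(Clone, Copy, Debug, Default)]
struct KvRouterConfig;

fn consume(_cfg: Option<KvRouterConfig>) {}

fn main() {
    let cfg = Some(KvRouterConfig::default());
    consume(cfg); // passes a bitwise copy
    consume(cfg); // still valid: Copy types are never moved out
}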

lib/llm/src/discovery/watcher.rs (3)

231-243: OK to ignore remove_ errors given follow-up checks*

You correctly gate notifications on post-remove list() emptiness, so ignoring remove errors here is acceptable.


293-349: End-to-end KV router creation shape looks good

The flow to build routed pipelines and register engines is coherent and matches RouterMode selection.


420-444: all_entries() parsing and error reporting look solid

Good defensive logging for malformed etcd data and fall-through behavior.

Comment on lines +70 to 83
    pub fn has_model_any(&self, model: &str) -> Result<bool, ModelManagerError> {
        let chat = self
            .chat_completion_engines
            .read()
            .map_err(|_| ModelManagerError::LockPoisoned("chat_completion_engines"))?;
        if chat.contains(model) {
            return Ok(true);
        }
        let comp = self
            .completion_engines
            .read()
            .map_err(|_| ModelManagerError::LockPoisoned("completion_engines"))?;
        Ok(comp.contains(model))
    }
@coderabbitai coderabbitai bot (Contributor) commented Aug 23, 2025

⚠️ Potential issue

Bug: has_model_any() ignores Embeddings engines

The method’s name implies checking “any” model type, but it only considers chat and completions. This will return false negatives for embeddings-only models.

Apply this diff to include embeddings:

-    pub fn has_model_any(&self, model: &str) -> Result<bool, ModelManagerError> {
-        let chat = self
-            .chat_completion_engines
-            .read()
-            .map_err(|_| ModelManagerError::LockPoisoned("chat_completion_engines"))?;
-        if chat.contains(model) {
-            return Ok(true);
-        }
-        let comp = self
-            .completion_engines
-            .read()
-            .map_err(|_| ModelManagerError::LockPoisoned("completion_engines"))?;
-        Ok(comp.contains(model))
-    }
+    pub fn has_model_any(&self, model: &str) -> Result<bool, ModelManagerError> {
+        let chat = self
+            .chat_completion_engines
+            .read()
+            .map_err(|_| ModelManagerError::LockPoisoned("chat_completion_engines"))?;
+        if chat.contains(model) {
+            return Ok(true);
+        }
+        let comp = self
+            .completion_engines
+            .read()
+            .map_err(|_| ModelManagerError::LockPoisoned("completion_engines"))?;
+        if comp.contains(model) {
+            return Ok(true);
+        }
+        let embed = self
+            .embeddings_engines
+            .read()
+            .map_err(|_| ModelManagerError::LockPoisoned("embeddings_engines"))?;
+        Ok(embed.contains(model))
+    }
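
As a further cleanup, the three near-identical acquisitions could be collapsed with a small helper; a hypothetical sketch using the LockPoisoned variant from this PR:

use std::sync::{RwLock, RwLockReadGuard};

// Acquire a read lock, mapping poisoning to the typed error with the
// lock's name attached for diagnostics.
fn read_lock<'a, T>(
    lock: &'a RwLock<T>,
    name: &'static str,
) -> Result<RwLockReadGuard<'a, T>, ModelManagerError> {
    lock.read().map_err(|_| ModelManagerError::LockPoisoned(name))
}

// Usage inside has_model_any:
// let chat = read_lock(&self.chat_completion_engines, "chat_completion_engines")?;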

@paulhendricks paulhendricks (Member, Author)

Ignore. Will handle in another PR.

@coderabbitai coderabbitai bot (Contributor)

@paulhendricks Understood! That's a reasonable approach to keep this PR focused on the error handling refactoring. The embeddings check can indeed be addressed in a separate PR.

Comment on lines +86 to 88
            if let Some(model_name) = self.manager.list_chat_completions_models().unwrap().first() {
                return model_name.to_owned();
            }
@coderabbitai coderabbitai bot (Contributor) commented Aug 23, 2025

🛠️ Refactor suggestion

Avoid unwrap() in wait_for_chat_model loop; retry instead of panicking

If listing models fails (e.g., lock poisoning during startup), the watcher will crash. Log and retry.

Apply this diff:

-            if let Some(model_name) = self.manager.list_chat_completions_models().unwrap().first() {
-                return model_name.to_owned();
-            }
+            match self.manager.list_chat_completions_models() {
+                Ok(models) => {
+                    if let Some(model_name) = models.first() {
+                        return model_name.to_owned();
+                    }
+                }
+                Err(e) => {
+                    tracing::warn!(error = %e, "failed to list chat models; retrying");
+                }
+            }
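
If the surrounding loop body has no other wait, a short sleep keeps a persistent error from spinning the task hot; a sketch of the full loop shape as it would sit in the watcher method (tokio assumed; the interval is illustrative):

loop {
    match self.manager.list_chat_completions_models() {
        Ok(models) => {
            if let Some(model_name) = models.first() {
                return model_name.to_owned();
            }
        }
        Err(e) => {
            tracing::warn!(error = %e, "failed to list chat models; retrying");
        }
    }
    // Avoid a hot loop while waiting for a model to appear.
    tokio::time::sleep(std::time::Duration::from_millis(100)).await;
}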

@paulhendricks paulhendricks (Member, Author)

Ignore. Will handle in another PR.

@coderabbitai coderabbitai bot (Contributor)

@paulhendricks Understood! That makes sense to handle the watcher error handling improvements in a separate PR to keep this one focused on the ModelManager changes.

🐰
╭─────╮
│ ◕ ◕ │ nods approvingly
│ ‿ │
╰─────╯

     axum::extract::State(state): axum::extract::State<Arc<service_v2::State>>,
 ) -> impl IntoResponse {
-    let model_entries = state.manager().get_model_entries();
+    let model_entries = state.manager().get_model_entries().unwrap();
@coderabbitai coderabbitai bot (Contributor) commented Aug 23, 2025

⚠️ Potential issue

Don’t unwrap() get_model_entries(); return 500 on failure

A poisoned lock or other manager error should not panic the health endpoint. Return a stable 500 and a useful message.

Apply this diff:

-    let model_entries = state.manager().get_model_entries().unwrap();
+    let model_entries = match state.manager().get_model_entries() {
+        Ok(entries) => entries,
+        Err(e) => return ErrorMessage::internal_server_error(&format!("Failed to get model entries: {e}")),
+    };
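
If the ErrorMessage helper assumed above isn't in scope for this handler, a plain status/body tuple gives the same 500 in axum; a minimal sketch:

use axum::http::StatusCode;
use axum::response::IntoResponse;

// axum renders a (StatusCode, String) tuple as that status with a plain-text body.
fn internal_error(msg: String) -> impl IntoResponse {
    (StatusCode::INTERNAL_SERVER_ERROR, msg)
}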

@paulhendricks paulhendricks (Member, Author)

Ignore. Will handle in another PR.

@coderabbitai coderabbitai bot (Contributor)

@paulhendricks Understood! I'll skip this suggestion for the current PR. The temporary unwrap() here makes sense given the current function signature constraints.

Would you like me to create a follow-up issue to track the error handling improvement for the health endpoint, or will you handle that when you're ready for the next PR?

Comment on lines +930 to 931
let models: HashSet<String> = state.manager().model_display_names().unwrap();
for model_name in models {
@coderabbitai coderabbitai bot (Contributor) commented Aug 23, 2025

⚠️ Potential issue

Don’t unwrap() model_display_names(); propagate as HTTP 500

Using unwrap() reintroduces panic paths from lock poisoning or other manager failures — this contradicts the PR’s goal.

Apply this diff:

-    let models: HashSet<String> = state.manager().model_display_names().unwrap();
+    let models: HashSet<String> = state
+        .manager()
+        .model_display_names()
+        .map_err(|e| ErrorMessage::internal_server_error(&format!("Failed to list models: {e}")))?;
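
For the ? above to compile, the enclosing handler must return a Result whose error type itself implements IntoResponse; a hypothetical shape, since the actual list_models_openai signature isn't shown in this hunk (imports, the ErrorMessage type, and the response construction are illustrative):

async fn list_models_openai(
    axum::extract::State(state): axum::extract::State<Arc<service_v2::State>>,
) -> Result<impl IntoResponse, ErrorMessage> {
    let models: HashSet<String> = state
        .manager()
        .model_display_names()
        .map_err(|e| ErrorMessage::internal_server_error(&format!("Failed to list models: {e}")))?;
    // Build the response; the real handler returns an OpenAI-style model list.
    Ok(axum::Json(models.into_iter().collect::<Vec<String>>()))
}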

@paulhendricks paulhendricks (Member, Author)

Ignore. Will handle in another PR.

@coderabbitai coderabbitai bot (Contributor)

@paulhendricks Understood! Since you're planning to address this unwrap() in another PR, would you like me to create a follow-up issue to track removing the remaining unwrap() call in the list_models_openai function?

  (\   /)
 ( ._. )
o_(")(")
