
Conversation

Contributor

@grahamking grahamking commented Aug 25, 2025

Rebased #2496 by @qimcis after the Rust upgrade.

-- Original PR description

Overview:

Aligns OpenAI response IDs with distributed trace IDs

Details:

Replaces random UUID generation with consistent trace IDs from the request context, so that OpenAI API response IDs (chatcmpl-, cmpl-) match distributed tracing identifiers.
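
In code terms, the core of the change is that the delta generators now receive the request's context ID instead of minting a random UUID. The following is a minimal sketch of that shape, using a simplified stand-in type rather than the actual signatures in delta.rs:

    // Sketch only: illustrates the ID change, not the real DeltaGenerator type.
    // Before, each generator minted a random UUID per response; after, the caller
    // threads the request/trace ID through, so the streamed ID matches the trace ID.
    struct DeltaGenerator {
        id: String,
    }

    impl DeltaGenerator {
        // Deterministic: the ID is derived from the request context.
        fn new(request_id: String) -> Self {
            Self {
                id: format!("chatcmpl-{request_id}"),
            }
        }
    }

    fn main() {
        let generator = DeltaGenerator::new("trace-1234".to_string());
        assert_eq!(generator.id, "chatcmpl-trace-1234");
    }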

Where should the reviewer start?

  • lib/llm/src/protocols/openai/chat_completions/delta.rs: New request_id parameter in response_generator()
  • lib/llm/src/http/service/openai.rs: Removed UUID generation, using request.id()
  • lib/llm/src/engines.rs: Updated response generator calls with context IDs

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • New Features

    • Deterministic, per-request IDs for streaming responses (chatcmpl- and cmpl-) for easier tracing across streams, logs, and metrics.
    • The embeddings endpoint now supports propagating request IDs via HTTP headers, aligning its traceability with other endpoints.
  • Refactor

    • Unified request ID handling across chat, completions, embeddings, and warmup flows to ensure consistent logging and monitoring.
    • Stream generation now uses the request’s context ID end-to-end, improving observability without changing response formats or error codes.

Signed-off-by: Graham King <grahamk@nvidia.com>
Contributor

coderabbitai bot commented Aug 25, 2025

Walkthrough

Introduces per-request identifiers across HTTP handlers, engines, and protocol delta generators. Updates response_generator APIs to accept a context/request ID, propagates IDs into streaming flows, and switches delta IDs from random UUIDs to deterministic chatcmpl-/cmpl- prefixed values. Adjusts the embeddings handler signature to include headers and centralizes request ID derivation. Updates tests accordingly.

Changes

  • MistralRS request ID plumbing (lib/engines/mistralrs/src/lib.rs): Adds a per-request mistralrs_request_id, routes logs to it, uses ctx.id().to_string() for the streaming request_id, and passes the context ID into non-chat response generation.
  • Engine callers updated to pass context ID (lib/llm/src/engines.rs): Both the chat and completion streaming paths now call response_generator(ctx.id().to_string()).
  • HTTP service ID derivation changes (lib/llm/src/http/service/openai.rs): Completions uses request.id() as the request_id. The embeddings handler signature adds HeaderMap, derives the ID via get_or_create_request_id, attaches it with Context::with_id, and reuses the derived ID.
  • Preprocessor forwards context ID (lib/llm/src/preprocessor.rs): response_generator calls now include context.id().to_string() for chat and completion requests.
  • OpenAI chat delta generator API (lib/llm/src/protocols/openai/chat_completions/delta.rs): NvCreateChatCompletionRequest::response_generator(request_id: String) and DeltaGenerator::new(..., request_id: String). Delta IDs are now chatcmpl-{request_id}.
  • OpenAI completion delta generator API and options (lib/llm/src/protocols/openai/completions/delta.rs): Adds DeltaGeneratorOptions. response_generator(request_id: String). DeltaGenerator::new(..., request_id: String). Delta IDs are now cmpl-{request_id} instead of a UUID.
  • Tests updated for generator argument (lib/llm/tests/http-service.rs): The test invokes response_generator with Some(ctx.id().to_string()) instead of no args; removes the license header.
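
For the embeddings handler, the list above mentions deriving the request ID from headers via get_or_create_request_id. The sketch below illustrates one plausible shape for that derivation; the exact signature, the x-request-id header name, and the UUID fallback are assumptions for illustration, not the repository's actual implementation.

    use axum::http::HeaderMap;

    // Hypothetical helper: prefer an explicit primary ID, then a request header,
    // then fall back to a freshly generated UUID.
    // (Assumes the `uuid` crate with the v4 feature.)
    fn get_or_create_request_id(primary: Option<&str>, headers: &HeaderMap) -> String {
        if let Some(id) = primary.filter(|s| !s.is_empty()) {
            return id.to_string();
        }
        if let Some(id) = headers
            .get("x-request-id")
            .and_then(|v| v.to_str().ok())
            .filter(|s| !s.is_empty())
        {
            return id.to_string();
        }
        uuid::Uuid::new_v4().to_string()
    }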

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant HTTP as HTTP Service (openai.rs)
  participant Ctx as Context
  participant Engine as LLM Engine
  participant Proto as Protocol Delta Gen
  participant M as MistralRS

  Client->>HTTP: POST /v1/chat/completions or /v1/completions
  HTTP->>HTTP: Derive request_id (request.id or headers via get_or_create_request_id)
  HTTP->>Ctx: Context::with_id(request_id)
  HTTP->>Engine: invoke(stream, ctx)

  Engine->>Proto: response_generator(ctx.id())
  Proto->>Proto: Build DeltaGenerator with id chatcmpl-{id} / cmpl-{id}

  alt Mistral-backed
    Engine->>M: start stream with mistralrs_request_id
    M-->>Engine: token deltas
  else Other backend
    Proto-->>Engine: token deltas
  end

  Engine-->>HTTP: streamed deltas (id includes {request_id})
  HTTP-->>Client: SSE/stream response
sequenceDiagram
  autonumber
  actor Client
  participant HTTP as HTTP Service (Embeddings)
  participant Ctx as Context
  participant Engine as LLM Engine

  Client->>HTTP: POST /v1/embeddings (with headers)
  HTTP->>HTTP: request_id = get_or_create_request_id(primary, headers)
  HTTP->>Ctx: Context::with_id(request_id)
  HTTP->>Engine: run embeddings with ctx
  Engine-->>HTTP: result
  HTTP-->>Client: JSON response

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Poem

A whisk of logs, an ID in tow,
chatcmpl dances where UUIDs go.
cmpl echoes, tidy and bright,
headers whisper the name of the night.
I thump my foot—request in stream—
carrots aligned, IDs supreme. 🥕✨

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
lib/llm/src/protocols/openai/chat_completions/delta.rs (1)

218-247: Role field emitted on every chunk due to never-incremented msg_counter

role is intended to appear only on the first delta; since msg_counter never changes, every chunk will include role: Assistant. Increment the counter after emitting the first delta.

Apply this diff:

     pub fn create_choice(
         &mut self,
         index: u32,
         text: Option<String>,
         reasoning_content: Option<String>,
         finish_reason: Option<dynamo_async_openai::types::FinishReason>,
         logprobs: Option<dynamo_async_openai::types::ChatChoiceLogprobs>,
     ) -> NvCreateChatCompletionStreamResponse {
         let delta = dynamo_async_openai::types::ChatCompletionStreamResponseDelta {
             content: text,
             function_call: None,
             tool_calls: None,
             role: if self.msg_counter == 0 {
                 Some(dynamo_async_openai::types::Role::Assistant)
             } else {
                 None
             },
             refusal: None,
             reasoning_content,
         };
+        // Ensure subsequent chunks omit the role, matching OpenAI streaming semantics
+        self.msg_counter = self.msg_counter.saturating_add(1);
lib/engines/mistralrs/src/lib.rs (2)

543-559: Respect client logprobs request for completions

return_logprobs is hard-coded to false. This ignores request.inner.logprobs, so clients asking for logprobs won’t get them.

Apply this diff:

-            return_logprobs: false,
+            // return logprobs if the client requested a positive count
+            return_logprobs: request.inner.logprobs.map(|v| v > 0).unwrap_or(false),

579-589: Pass finish_reason into the completion choice

You compute finish_reason but don't pass it to create_choice, resulting in None in the response.

Apply this diff:

-                        let inner = response_generator.create_choice(0, Some(from_assistant), None, None);
+                        let inner = response_generator.create_choice(0, Some(from_assistant), finish_reason, None);

Also applies to: 593-593

🧹 Nitpick comments (7)
lib/llm/src/protocols/openai/chat_completions/delta.rs (1)

103-106: Deterministic ID: chatcmpl-{request_id}

The ID format matches OpenAI conventions. Consider guarding against whitespace or overly long IDs if upstream contexts are user-controlled, but not blocking.

Optional defensive tweak (upstream input sanitize):

-        let chatcmpl_id = format!("chatcmpl-{request_id}");
+        let sanitized = request_id.trim().replace(char::is_whitespace, "");
+        let chatcmpl_id = format!("chatcmpl-{}", sanitized);
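
If overly long IDs are also a concern, a fuller guard could look like the sketch below. This is illustrative only; the 128-character cap is an arbitrary example value, not something taken from the codebase.

    // Sketch: drop whitespace and cap the length before formatting the ID.
    fn sanitize_request_id(raw: &str) -> String {
        raw.chars()
            .filter(|c| !c.is_whitespace())
            .take(128) // arbitrary example cap
            .collect()
    }
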
lib/llm/src/protocols/openai/completions/delta.rs (1)

75-97: Minor perf/readability nit: avoid extra allocations in create_logprobs

If this becomes hot, consider accepting token_ids: &[TokenIdType] (and borrowing tokens when possible) to reduce copies. Not blocking.

Example shape:

-    pub fn create_logprobs(
-        &self,
-        tokens: Vec<common::llm_backend::TokenType>,
-        token_ids: Vec<TokenIdType>,
+    pub fn create_logprobs(
+        &self,
+        tokens: Vec<common::llm_backend::TokenType>,
+        token_ids: &[TokenIdType],
         logprobs: Option<common::llm_backend::LogProbs>,
         top_logprobs: Option<common::llm_backend::TopLogprobs>,
     ) -> Option<dynamo_async_openai::types::Logprobs> {

And at call site:

-        let logprobs = self.create_logprobs(
-            delta.tokens,
-            delta.token_ids,
+        let logprobs = self.create_logprobs(
+            delta.tokens,
+            &delta.token_ids,
             delta.log_probs,
             delta.top_logprobs,
         );

Also applies to: 192-205

lib/engines/mistralrs/src/lib.rs (3)

399-421: Set object to the OpenAI constant "chat.completion.chunk"

We currently pass through c.object.clone(). To ensure spec consistency across backends, prefer the canonical OpenAI value.

Apply this diff:

-                            object: c.object.clone(),
+                            object: "chat.completion.chunk".to_string(),

391-393: Typo: "Unknow stop reason" → "Unknown stop reason"

Minor log message polish.

Apply this diff:

-                                tracing::warn!(mistralrs_request_id, stop_reason = s, "Unknow stop reason");
+                                tracing::warn!(mistralrs_request_id, stop_reason = s, "Unknown stop reason");

571-572: Typo: "Unknow stop reason" → "Unknown stop reason"

Same log polish for the completions path.

Apply this diff:

-                                tracing::warn!(mistralrs_request_id, stop_reason = s, "Unknow stop reason");
+                                tracing::warn!(mistralrs_request_id, stop_reason = s, "Unknown stop reason");
lib/llm/src/engines.rs (1)

187-187: Nit: drop mut unless create_choice requires &mut self.

If DeltaGenerator::create_choice for chat completions no longer needs &mut self, remove mut to avoid an unused_mut lint in stricter builds.

-        let mut deltas = request.response_generator(ctx.id().to_string());
+        let deltas = request.response_generator(ctx.id().to_string());

If it still needs &mut self, keep as-is.

lib/llm/src/http/service/openai.rs (1)

248-266: Optional: add structured tracing to completions() with request_id field.

For parity with responses() (which already uses #[tracing::instrument(..., fields(request_id = %request.id()))]), consider instrumenting completions() similarly. This centralizes request_id on all logs from this function.

-async fn completions(
+#[tracing::instrument(level = "debug", skip_all, fields(request_id = %request.id()))]
+async fn completions(
     state: Arc<service_v2::State>,
     request: Context<NvCreateCompletionRequest>,
     stream_handle: ConnectionHandle,
 ) -> Result<Response, ErrorResponse> {
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a24221d and dd84aca.

📒 Files selected for processing (7)
  • lib/engines/mistralrs/src/lib.rs (13 hunks)
  • lib/llm/src/engines.rs (2 hunks)
  • lib/llm/src/http/service/openai.rs (2 hunks)
  • lib/llm/src/preprocessor.rs (2 hunks)
  • lib/llm/src/protocols/openai/chat_completions/delta.rs (3 hunks)
  • lib/llm/src/protocols/openai/completions/delta.rs (3 hunks)
  • lib/llm/tests/http-service.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-22T19:55:41.592Z
Learnt from: nachiketb-nvidia
PR: ai-dynamo/dynamo#2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.592Z
Learning: There are two separate DeltaGenerator classes in the codebase: one for chat completions (lib/llm/src/protocols/openai/chat_completions/delta.rs with object "chat.completion.chunk") and one for text completions (lib/llm/src/protocols/openai/completions/delta.rs with object "text_completion"). They have different create_choice method signatures and serve different OpenAI API endpoints. The reasoning parsing functionality is only relevant to the chat completions DeltaGenerator.

Applied to files:

  • lib/llm/tests/http-service.rs
  • lib/llm/src/protocols/openai/chat_completions/delta.rs
  • lib/llm/src/preprocessor.rs
  • lib/engines/mistralrs/src/lib.rs
  • lib/llm/src/engines.rs
  • lib/llm/src/protocols/openai/completions/delta.rs
📚 Learning: 2025-08-22T19:55:41.592Z
Learnt from: nachiketb-nvidia
PR: ai-dynamo/dynamo#2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.592Z
Learning: The create_choice method exists on multiple different objects in the codebase. The DeltaGenerator::create_choice in lib/llm/src/protocols/openai/chat_completions/delta.rs has its own signature that was updated to include reasoning_content, but other objects in lib/llm/src/engines.rs have their own separate create_choice methods with different signatures that are not related to chat completions.

Applied to files:

  • lib/llm/src/engines.rs
🧬 Code graph analysis (5)
lib/llm/src/protocols/openai/chat_completions/delta.rs (1)
lib/llm/src/protocols/openai/completions/delta.rs (2)
  • response_generator (10-17)
  • new (38-69)
lib/llm/src/preprocessor.rs (2)
lib/llm/src/protocols/openai/chat_completions/delta.rs (1)
  • response_generator (20-28)
lib/llm/src/protocols/openai/completions/delta.rs (1)
  • response_generator (10-17)
lib/llm/src/http/service/openai.rs (3)
lib/llm/src/http/service/service_v2.rs (1)
  • state (186-188)
lib/llm/src/http/client.rs (1)
  • with_id (81-88)
lib/runtime/src/pipeline/context.rs (1)
  • with_id (69-76)
lib/engines/mistralrs/src/lib.rs (3)
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
  • new (76-116)
  • response_generator (20-28)
lib/llm/src/protocols/openai/completions/delta.rs (2)
  • new (38-69)
  • response_generator (10-17)
lib/bindings/python/rust/lib.rs (3)
  • new (268-293)
  • new (907-911)
  • id (929-931)
lib/llm/src/protocols/openai/completions/delta.rs (1)
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
  • response_generator (20-28)
  • new (76-116)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Mirror Repository to GitLab
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (.)
🔇 Additional comments (13)
lib/llm/src/preprocessor.rs (2)

497-498: Good: propagating deterministic request_id into the generator

Passing context.id().to_string() into request.response_generator(...) aligns with the PR objective of making streamed IDs deterministic and trace-aligned. No issues spotted.


551-552: Good: completions path also threads request_id

Same as chat completions, this ensures cmpl-<request_id> IDs are stable across the pipeline.

lib/llm/src/protocols/openai/chat_completions/delta.rs (2)

15-21: API change is clear and documented

Nice docstring update explaining the new request_id parameter; signature change is minimal and aligns with the PR goal.


27-28: Constructor and call chain now accept request_id

DeltaGenerator::new(..., request_id) and its usage ensure chatcmpl-<request_id> determinism. Looks good.

Also applies to: 72-77

lib/llm/src/protocols/openai/completions/delta.rs (2)

10-17: API change: request_id threaded into completions as well

response_generator(&self, request_id: String) is consistent with the chat path and meets the objectives.


38-39: Deterministic ID: cmpl-{request_id}

cmpl-<request_id> is correct and consistent. Constructor changes are coherent.

Also applies to: 58-62

lib/engines/mistralrs/src/lib.rs (3)

216-218: Warmup request: logging context updated

Using mistralrs_request_id in logs makes tracing clearer. LGTM.

Also applies to: 249-253


275-276: Good: request_id sourced from context

Threading ctx.id().to_string() will align streamed IDs with distributed traces.


489-490: Good: completions path uses response_generator(ctx.id())

This ensures the cmpl-<request_id> ID is trace-aligned.

lib/llm/src/engines.rs (2)

235-235: Completions: request_id propagation into delta generator looks correct.

Using the context ID for cmpl-<id> is consistent with the chat path and supports downstream trace alignment.


187-187: No remaining zero-arg response_generator calls — LGTM

  • Ran rg -nP --type=rust '\bresponse_generator\s*\(\s*\)' and confirmed there are no zero-argument invocations of response_generator.
  • Verified that both delta modules now use the incoming request_id for ID formatting:
    • lib/llm/src/protocols/openai/completions/delta.rs formats as cmpl-{request_id}
    • lib/llm/src/protocols/openai/chat_completions/delta.rs formats as chatcmpl-{request_id}

All looks good—approving the change.

lib/llm/src/http/service/openai.rs (2)

256-256: Expose request_id early in completions() for logging/annotations — LGTM.

Grabbing let request_id = request.id().to_string(); up-front is clean and is correctly used later for error logs and optional annotations.


356-365: Embeddings handler integration verified

  • The embeddings route in lib/llm/src/http/service/openai.rs:1061 correctly points to the embeddings handler via .route(&path, post(embeddings)).
  • All three OpenAI handlers consistently accept a HeaderMap parameter, ensuring uniform request_id derivation and propagation:
    • handler_completions at line 213
    • embeddings at line 354
    • handler_chat_completions at line 408

LGTM—approving these changes.

@grahamking grahamking marked this pull request as draft August 25, 2025 20:13
@grahamking
Contributor Author

The main PR landed.

@grahamking grahamking closed this Aug 26, 2025