Conversation
@ryanolson ryanolson commented Sep 13, 2025

  • Only include usage when stream_options.include_usage is true
  • Send usage in a final chunk with empty choices array
  • All content chunks now have usage: null per OpenAI spec
  • Add comprehensive tests for streaming usage behavior

Per the OpenAI spec, usage statistics should only be sent in a dedicated final chunk before data: [DONE], maintaining compatibility with clients that expect this behavior.
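Concretely, a compliant stream with stream_options.include_usage set to true looks like the sketch below (content and token counts are illustrative, not taken from this PR):

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}],"usage":null}

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":null}

data: {"object":"chat.completion.chunk","choices":[],"usage":{"prompt_tokens":9,"completion_tokens":12,"total_tokens":21}}

data: [DONE]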

Summary by CodeRabbit

  • New Features

    • Streaming now aligns with OpenAI’s include_usage option: intermediate chunks omit usage; when requested, a final usage-only chunk is sent after completion (chat and completions). Provides accurate token counts.
  • Tests

    • Added tests covering no-usage, include_usage=true (final usage chunk), and include_usage=false scenarios for streaming.
  • Chores

    • Updated ignore rules to exclude generated configuration and documentation files.

@ryanolson ryanolson marked this pull request as ready for review September 15, 2025 07:12
@ryanolson ryanolson requested a review from a team as a code owner September 15, 2025 07:12
coderabbitai bot commented Sep 15, 2025

Walkthrough

Implements OpenAI-compliant streaming usage reporting: adds trait methods to generate/query usage, adjusts delta generators to omit usage in intermediate chunks and create a final usage-only chunk, updates the postprocessor to detect stream completion and emit the final usage chunk when enabled, adds targeted tests, and updates .gitignore.
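A minimal sketch of that trait surface, inferred from the file summaries below (the exact generics and bounds in lib/llm/src/protocols/openai.rs may differ):

// Sketch only: DeltaGeneratorExt gains two methods. `Resp` stands in for the
// concrete stream-response type (chat and text completions use different ones).
pub trait DeltaGeneratorExt<Resp> {
    // ...existing delta-generation methods...

    /// Build the final usage-only chunk (empty `choices`, populated `usage`).
    fn create_usage_chunk(&self) -> Resp;

    /// True when the request set `stream_options.include_usage`.
    fn is_usage_enabled(&self) -> bool;
}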

Changes

Cohort / File(s) — Summary of changes

  • Repo hygiene (/.gitignore): Added ignore patterns for Ruler-generated files: /.cursor/instructions.md, /.cursor/instructions.md.bak, /CLAUDE.md, /CLAUDE.md.bak.
  • OpenAI protocol usage API (lib/llm/src/protocols/openai.rs): Extended DeltaGeneratorExt with fn create_usage_chunk(...) and fn is_usage_enabled(...) (public API).
  • OpenAI delta generators (lib/llm/src/protocols/openai/chat_completions/delta.rs, lib/llm/src/protocols/openai/completions/delta.rs): Usage now respects stream_options.include_usage; intermediate chunks set usage: None; added create_usage_chunk() to build a final usage-only chunk; added is_usage_enabled(); implemented trait methods to expose both.
  • Postprocessor stream handling (lib/llm/src/preprocessor.rs): Tracks finish_reason_sent and usage_chunk_sent; on end-of-stream, when usage is enabled and a finish reason was observed, emits a final llm_metrics-annotated usage chunk via response_generator.create_usage_chunk(). A toy sketch of this flow follows the list.
  • Tests (lib/llm/tests/test_streaming_usage.rs): Added tests covering no-usage, include_usage=true (final usage chunk emitted), and include_usage=false (no usage) for streaming chat completions.
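As promised above, a self-contained toy model of the postprocessor flow (the flag names mirror the summary; the chunk shape and token counts are stand-ins, not the crate's real types):

#[derive(Debug)]
struct Chunk {
    content: Option<&'static str>,
    finish_reason: Option<&'static str>,
    usage: Option<(u32, u32, u32)>, // (prompt, completion, total)
}

fn main() {
    let include_usage = true; // derived from stream_options.include_usage
    let backend = vec![
        Chunk { content: Some("Hel"), finish_reason: None, usage: None },
        Chunk { content: Some("lo"), finish_reason: Some("stop"), usage: None },
    ];

    let mut finish_reason_sent = false;
    let mut usage_chunk_sent = false;

    for chunk in backend {
        if chunk.finish_reason.is_some() {
            finish_reason_sent = true;
        }
        // Intermediate chunks always stream with usage = None.
        println!("{chunk:?}");
    }

    // End of stream: emit the usage-only chunk exactly once, when enabled
    // and a finish reason was observed.
    if include_usage && finish_reason_sent && !usage_chunk_sent {
        usage_chunk_sent = true;
        println!("{:?}", Chunk { content: None, finish_reason: None, usage: Some((9, 2, 11)) });
    }
    let _ = usage_chunk_sent;
}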

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Client
  participant P as OpenAIPreprocessor
  participant B as Backend Stream
  participant G as DeltaGenerator

  rect rgba(230,245,255,0.6)
  Note over C,P: Streamed chat/completion request (stream=true)
  C->>P: Request (include_usage: [true|false])
  P->>B: Subscribe to backend stream
  end

  loop For each backend chunk
    B-->>P: BackendOutput
    P->>G: Build delta (content/logprobs)
    alt Intermediate chunk
      G-->>P: Chunk with choices, usage=None
      P-->>C: Stream chunk (usage=None)
      P->>P: Track finish_reason if present
    end
  end

  alt include_usage=true AND finish_reason observed
    rect rgba(240,255,230,0.6)
    Note over P,G: End-of-stream finalization
    P->>G: create_usage_chunk()
    G-->>P: Final usage-only chunk (choices=[])
    P-->>C: Emit final usage chunk (annotated llm_metrics)
    end
  else include_usage=false OR no finish_reason
    Note over P: Do not emit usage chunk
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

A nibble of tokens, a stream on the breeze,
I twitch my ears, count words with ease.
Three little hops—then usage I send,
A final crumb at the rabbit-hole’s end.
Choices now quiet, metrics alight,
Compliance achieved—good night! 🐇✨

Pre-merge checks

❌ Failed checks (1 warning)

  • Description Check — ⚠️ Warning
    Explanation: The description clearly states the core intent and behavioral changes (only include usage when stream_options.include_usage, emit a final usage-only chunk, set content chunks' usage to null, and add tests) and aligns with the PR objectives, but it does not follow the repository's required template: it lacks the "Overview" and "Details" headings, does not provide "Where should the reviewer start?" guidance, and omits a Related Issues entry.
    Resolution: Please update the PR description to match the repository template by adding an Overview and Details section, a "Where should the reviewer start?" note that lists the key files to inspect (for example: lib/llm/src/preprocessor.rs, lib/llm/src/protocols/openai/chat_completions/delta.rs, lib/llm/src/protocols/openai/completions/delta.rs, lib/llm/tests/test_streaming_usage.rs, and .gitignore), and a Related Issues line linking the issue number or stating "none", plus any brief testing or compatibility notes to aid reviewers.

✅ Passed checks (2 passed)

  • Title Check — ✅ Passed: The title succinctly and accurately summarizes the primary change (implementing OpenAI-compliant streaming usage statistics) and matches the code and test changes in the PR; it is a single clear sentence focused on the main behavioral fix rather than low-level details.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (4)
lib/llm/tests/test_streaming_usage.rs (1)

194-262: Add a counterpart test for text completions streaming.

We only cover chat completions. Please mirror this suite for NvCreateCompletionRequest to prevent regressions in the text-completions path.

lib/llm/src/preprocessor.rs (1)

520-545: Final usage chunk emission looks correct; optional metrics annotation parity.

Logic for creating and emitting the usage-only chunk is solid. To keep annotations consistent with earlier chunks, consider attaching an LLMMetricAnnotation comment payload (not just the event) to this final usage chunk.

lib/llm/src/protocols/openai/chat_completions/delta.rs (2)

277-296: Final usage-only chunk aligns with OpenAI spec.

Empty choices and populated usage with total_tokens = prompt + completion is correct. Minor safety nit: consider saturating add to avoid theoretical overflow.

Apply if desired:

-        usage.total_tokens = usage.prompt_tokens + usage.completion_tokens;
+        usage.total_tokens = usage.prompt_tokens.saturating_add(usage.completion_tokens);

388-394: Avoid potential method-resolution ambiguity by calling inherent methods explicitly.

Current calls are fine in practice, but using a fully qualified path makes intent obvious and prevents accidental recursion if signatures change.

Apply:

-    fn create_usage_chunk(&self) -> NvCreateChatCompletionStreamResponse {
-        self.create_usage_chunk()
-    }
+    fn create_usage_chunk(&self) -> NvCreateChatCompletionStreamResponse {
+        DeltaGenerator::create_usage_chunk(self)
+    }
@@
-    fn is_usage_enabled(&self) -> bool {
-        self.is_usage_enabled()
-    }
+    fn is_usage_enabled(&self) -> bool {
+        DeltaGenerator::is_usage_enabled(self)
+    }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c8ecc40 and 7624f93.

📒 Files selected for processing (6)
  • .gitignore (1 hunks)
  • lib/llm/src/preprocessor.rs (5 hunks)
  • lib/llm/src/protocols/openai.rs (1 hunks)
  • lib/llm/src/protocols/openai/chat_completions/delta.rs (3 hunks)
  • lib/llm/src/protocols/openai/completions/delta.rs (4 hunks)
  • lib/llm/tests/test_streaming_usage.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
PR: ai-dynamo/dynamo#2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: There are two separate DeltaGenerator classes in the codebase: one for chat completions (lib/llm/src/protocols/openai/chat_completions/delta.rs with object "chat.completion.chunk") and one for text completions (lib/llm/src/protocols/openai/completions/delta.rs with object "text_completion"). They have different create_choice method signatures and serve different OpenAI API endpoints. The reasoning parsing functionality is only relevant to the chat completions DeltaGenerator.

Applied to files:

  • lib/llm/src/protocols/openai.rs
  • lib/llm/src/protocols/openai/completions/delta.rs
  • lib/llm/src/protocols/openai/chat_completions/delta.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
PR: ai-dynamo/dynamo#2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: The create_choice method exists on multiple different objects in the codebase. The DeltaGenerator::create_choice in lib/llm/src/protocols/openai/chat_completions/delta.rs has its own signature that was updated to include reasoning_content, but other objects in lib/llm/src/engines.rs have their own separate create_choice methods with different signatures that are not related to chat completions.

Applied to files:

  • lib/llm/src/protocols/openai/completions/delta.rs
  • lib/llm/src/protocols/openai/chat_completions/delta.rs
🧬 Code graph analysis (4)
lib/llm/tests/test_streaming_usage.rs (2)
lib/llm/src/preprocessor.rs (2)
  • new (101-107)
  • transform_postprocessor_stream (411-551)
lib/llm/src/protocols/openai/chat_completions/delta.rs (3)
  • new (85-130)
  • tokens (151-155)
  • response_generator (21-35)
lib/llm/src/protocols/openai.rs (2)
lib/llm/src/protocols/openai/chat_completions/delta.rs (4)
  • create_usage_chunk (282-296)
  • create_usage_chunk (388-390)
  • is_usage_enabled (299-301)
  • is_usage_enabled (392-394)
lib/llm/src/protocols/openai/completions/delta.rs (4)
  • create_usage_chunk (177-192)
  • create_usage_chunk (237-239)
  • is_usage_enabled (195-197)
  • is_usage_enabled (241-243)
lib/llm/src/protocols/openai/completions/delta.rs (2)
lib/llm/src/protocols/openai/chat_completions/delta.rs (4)
  • create_usage_chunk (282-296)
  • create_usage_chunk (388-390)
  • is_usage_enabled (299-301)
  • is_usage_enabled (392-394)
lib/llm/src/protocols/openai.rs (2)
  • create_usage_chunk (211-211)
  • is_usage_enabled (214-214)
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
lib/llm/src/protocols/openai/completions/delta.rs (4)
  • create_usage_chunk (177-192)
  • create_usage_chunk (237-239)
  • is_usage_enabled (195-197)
  • is_usage_enabled (241-243)
lib/llm/src/protocols/openai.rs (2)
  • create_usage_chunk (211-211)
  • is_usage_enabled (214-214)
🪛 GitHub Actions: Rust pre-merge checks
lib/llm/tests/test_streaming_usage.rs

[error] 121-121: Clippy: redundant closure. Replace '.map(|output| Annotated::from_data(output))' with '.map(Annotated::from_data)'. Step: cargo clippy --no-deps --all-targets -- -D warnings.


[error] 220-220: Clippy: needless-range-loop. Use enumerate() over the iterator instead of 'for i in 0..3'. Step: cargo clippy --no-deps --all-targets -- -D warnings.

🪛 GitHub Actions: NVIDIA Dynamo Github Validation
lib/llm/tests/test_streaming_usage.rs

[error] 121-121: Clippy: redundant closure at lib/llm/tests/test_streaming_usage.rs:121. Replace with function: Annotated::from_data. Command: cargo clippy --features block-manager --no-deps --all-targets -- -D warnings.


[error] 220-220: Clippy: needless-range-loop at lib/llm/tests/test_streaming_usage.rs:220. Use enumerate() instead of 'for i in 0..3'. Command: cargo clippy --features block-manager --no-deps --all-targets -- -D warnings.
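Both findings are mechanical rewrites; a self-contained illustration of the two lint fixes (names and data here are made up, while the real code maps Annotated::from_data over stream outputs):

fn wrap(x: i32) -> String {
    format!("chunk-{x}")
}

fn main() {
    let outputs = vec![1, 2, 3];

    // redundant-closure: `.map(|output| wrap(output))` becomes `.map(wrap)`.
    let wrapped: Vec<String> = outputs.iter().copied().map(wrap).collect();

    // needless-range-loop: instead of indexing with `for i in 0..3`,
    // let the iterator supply both the index and the element.
    for (i, w) in wrapped.iter().enumerate() {
        println!("{i}: {w}");
    }
}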

🔇 Additional comments (8)
.gitignore (1)

96-100: LGTM: ignore AI-generated instruction files.

Keeping these out of VCS is good hygiene and aligns with the repo’s policy for generated artifacts.

lib/llm/src/preprocessor.rs (1)

456-462: Finish-reason gating: confirm intended behavior on truncated streams.

You gate the final usage emission on finish_reason_sent. If the backend ends the stream without a finish_reason (timeout/cancel), no usage will be sent. If that’s intentional per spec, fine; otherwise consider emitting usage on graceful stream end even without a finish_reason.

Also applies to: 507-510

lib/llm/src/protocols/openai/completions/delta.rs (3)

12-18: Enable usage via stream_options.include_usage: LGTM.

Deriving enable_usage from stream_options.include_usage aligns with the spec.


151-167: Intermediate chunks with usage: None: LGTM.

Setting usage: None on content chunks matches OpenAI streaming semantics.


172-192: Final usage-only chunk builder: LGTM.

Empty choices with populated usage is the correct shape; computing total_tokens here is appropriate.

lib/llm/src/protocols/openai/chat_completions/delta.rs (3)

23-28: Enable usage flag derivation looks correct.

Deriving enable_usage from stream_options.include_usage with a safe default (false) matches the PR goal and OpenAI streaming semantics.


298-301: Accessor is concise and clear.


262-275: Confirmed: usage: None will serialize as "usage": null

CreateChatCompletionStreamResponse in lib/async-openai/src/types/chat.rs derives Serialize and its pub usage: Option<CompletionUsage> field has no #[serde(skip_serializing_if = "Option::is_none")], so None is serialized as JSON null (intermediate chunks will include "usage": null).
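That behavior is easy to verify in isolation; a minimal sketch with a stubbed usage type (the real CompletionUsage lives in async-openai; serde and serde_json are assumed as dependencies):

use serde::Serialize;

#[derive(Serialize)]
struct CompletionUsage {
    prompt_tokens: u32,
    completion_tokens: u32,
    total_tokens: u32,
}

#[derive(Serialize)]
struct StreamChunk {
    // No `#[serde(skip_serializing_if = "Option::is_none")]` here,
    // so `None` serializes as an explicit JSON `null`.
    usage: Option<CompletionUsage>,
}

fn main() {
    let json = serde_json::to_string(&StreamChunk { usage: None }).unwrap();
    assert_eq!(json, r#"{"usage":null}"#);
}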

Follow-up commits addressed the review feedback:

  • Remove redundant closure in map() call; replace needless range loop with iterator enumerate().
  • Replace self-recursive trait method calls with fully-qualified calls to the inherent methods on the DeltaGenerator structs; this prevents stack overflow from infinite recursion when calling create_usage_chunk() or is_usage_enabled() through the trait.
  • Replace direct addition with saturating_add to prevent potential overflow when calculating total_tokens from prompt_tokens + completion_tokens.
@paulhendricks paulhendricks (Member) left a comment

Looks good 👍

Resolved conflicts in preprocessor.rs by combining:
- Main's 'finished' flag to prevent duplicate stream processing
- Our 'finish_reason_sent' and 'usage_chunk_sent' flags for OpenAI-compliant usage
- Both features now coexist properly with tool calling jail functionality

Signed-off-by: Ryan Olson <rolson@nvidia.com>
@ryanolson ryanolson merged commit aa80ac4 into main Sep 16, 2025
15 of 16 checks passed
@ryanolson ryanolson deleted the ryan/oai-compliance-usage branch September 16, 2025 07:18
@coderabbitai coderabbitai bot mentioned this pull request Sep 16, 2025
kmkelle-nv pushed a commit that referenced this pull request Sep 17, 2025
…3022)

Signed-off-by: Ryan Olson <rolson@nvidia.com>
Signed-off-by: Kristen Kelleher <kkelleher@nvidia.com>