Conversation
@ryanolson ryanolson commented Sep 13, 2025

  • Only include usage when stream_options.include_usage is true
  • Send usage in a final chunk with empty choices array
  • All content chunks now have usage: null per OpenAI spec
  • Add comprehensive tests for streaming usage behavior

Per the OpenAI spec, usage statistics should only be sent in a dedicated final chunk before data: [DONE], maintaining compatibility with clients that expect this behavior.
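Concretely, a compliant stream with stream_options.include_usage set to true looks like the sketch below (content and token counts are illustrative, not taken from this PR):

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}],"usage":null}

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":null}

data: {"object":"chat.completion.chunk","choices":[],"usage":{"prompt_tokens":9,"completion_tokens":12,"total_tokens":21}}

data: [DONE]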

Summary by CodeRabbit

  • New Features

    • Streaming now aligns with OpenAI’s include_usage option: intermediate chunks omit usage; when requested, a final usage-only chunk is sent after completion (chat and completions). Provides accurate token counts.
  • Tests

    • Added tests covering no-usage, include_usage=true (final usage chunk), and include_usage=false scenarios for streaming.
  • Chores

    • Updated ignore rules to exclude generated configuration and documentation files.

@ryanolson ryanolson marked this pull request as ready for review September 15, 2025 07:12
@ryanolson ryanolson requested a review from a team as a code owner September 15, 2025 07:12
coderabbitai bot commented Sep 15, 2025

Walkthrough

Implements OpenAI-compliant streaming usage reporting: adds trait methods to generate/query usage, adjusts delta generators to omit usage in intermediate chunks and create a final usage-only chunk, updates the postprocessor to detect stream completion and emit the final usage chunk when enabled, adds targeted tests, and updates .gitignore.
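A minimal sketch of that trait surface, inferred from the file summaries below (the exact generics and bounds in lib/llm/src/protocols/openai.rs may differ):

// Sketch only: DeltaGeneratorExt gains two methods. `Resp` stands in for the
// concrete stream-response type (chat and text completions use different ones).
pub trait DeltaGeneratorExt<Resp> {
    // ...existing delta-generation methods...

    /// Build the final usage-only chunk (empty `choices`, populated `usage`).
    fn create_usage_chunk(&self) -> Resp;

    /// True when the request set `stream_options.include_usage`.
    fn is_usage_enabled(&self) -> bool;
}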

Changes

Cohort / File(s) — Summary of changes

  • Repo hygiene (/.gitignore): Added ignore patterns for Ruler-generated files: /.cursor/instructions.md, /.cursor/instructions.md.bak, /CLAUDE.md, /CLAUDE.md.bak.
  • OpenAI protocol usage API (lib/llm/src/protocols/openai.rs): Extended DeltaGeneratorExt with fn create_usage_chunk(...) and fn is_usage_enabled(...) (public API).
  • OpenAI delta generators (lib/llm/src/protocols/openai/chat_completions/delta.rs, lib/llm/src/protocols/openai/completions/delta.rs): Usage now respects stream_options.include_usage; intermediate chunks set usage: None; added create_usage_chunk() to build a final usage-only chunk; added is_usage_enabled(); implemented trait methods to expose both.
  • Postprocessor stream handling (lib/llm/src/preprocessor.rs): Tracks finish_reason_sent and usage_chunk_sent; on end-of-stream, when usage is enabled and a finish reason was observed, emits a final llm_metrics-annotated usage chunk via response_generator.create_usage_chunk(). A toy sketch of this flow follows the list.
  • Tests (lib/llm/tests/test_streaming_usage.rs): Added tests covering no-usage, include_usage=true (final usage chunk emitted), and include_usage=false (no usage) for streaming chat completions.
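As promised above, a self-contained toy model of the postprocessor flow (the flag names mirror the summary; the chunk shape and token counts are stand-ins, not the crate's real types):

#[derive(Debug)]
struct Chunk {
    content: Option<&'static str>,
    finish_reason: Option<&'static str>,
    usage: Option<(u32, u32, u32)>, // (prompt, completion, total)
}

fn main() {
    let include_usage = true; // derived from stream_options.include_usage
    let backend = vec![
        Chunk { content: Some("Hel"), finish_reason: None, usage: None },
        Chunk { content: Some("lo"), finish_reason: Some("stop"), usage: None },
    ];

    let mut finish_reason_sent = false;
    let mut usage_chunk_sent = false;

    for chunk in backend {
        if chunk.finish_reason.is_some() {
            finish_reason_sent = true;
        }
        // Intermediate chunks always stream with usage = None.
        println!("{chunk:?}");
    }

    // End of stream: emit the usage-only chunk exactly once, when enabled
    // and a finish reason was observed.
    if include_usage && finish_reason_sent && !usage_chunk_sent {
        usage_chunk_sent = true;
        println!("{:?}", Chunk { content: None, finish_reason: None, usage: Some((9, 2, 11)) });
    }
    let _ = usage_chunk_sent;
}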

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Client
  participant P as OpenAIPreprocessor
  participant B as Backend Stream
  participant G as DeltaGenerator

  rect rgba(230,245,255,0.6)
  Note over C,P: Streamed chat/completion request (stream=true)
  C->>P: Request (include_usage: [true|false])
  P->>B: Subscribe to backend stream
  end

  loop For each backend chunk
    B-->>P: BackendOutput
    P->>G: Build delta (content/logprobs)
    alt Intermediate chunk
      G-->>P: Chunk with choices, usage=None
      P-->>C: Stream chunk (usage=None)
      P->>P: Track finish_reason if present
    end
  end

  alt include_usage=true AND finish_reason observed
    rect rgba(240,255,230,0.6)
    Note over P,G: End-of-stream finalization
    P->>G: create_usage_chunk()
    G-->>P: Final usage-only chunk (choices=[])
    P-->>C: Emit final usage chunk (annotated llm_metrics)
    end
  else include_usage=false OR no finish_reason
    Note over P: Do not emit usage chunk
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

A nibble of tokens, a stream on the breeze,
I twitch my ears, count words with ease.
Three little hops—then usage I send,
A final crumb at the rabbit-hole’s end.
Choices now quiet, metrics alight,
Compliance achieved—good night! 🐇✨

Pre-merge checks

❌ Failed checks (1 warning)

  • Description Check — ⚠️ Warning
    Explanation: The description clearly states the core intent and behavioral changes (only include usage when stream_options.include_usage, emit a final usage-only chunk, set content chunks' usage to null, and add tests) and aligns with the PR objectives, but it does not follow the repository's required template: it lacks the "Overview" and "Details" headings, does not provide "Where should the reviewer start?" guidance, and omits a Related Issues entry.
    Resolution: Please update the PR description to match the repository template by adding an Overview and Details section, a "Where should the reviewer start?" note that lists the key files to inspect (for example: lib/llm/src/preprocessor.rs, lib/llm/src/protocols/openai/chat_completions/delta.rs, lib/llm/src/protocols/openai/completions/delta.rs, lib/llm/tests/test_streaming_usage.rs, and .gitignore), and a Related Issues line linking the issue number or stating "none", plus any brief testing or compatibility notes to aid reviewers.

✅ Passed checks (2 passed)

  • Title Check — ✅ Passed: The title succinctly and accurately summarizes the primary change (implementing OpenAI-compliant streaming usage statistics) and matches the code and test changes in the PR; it is a single clear sentence focused on the main behavioral fix rather than low-level details.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (4)
lib/llm/tests/test_streaming_usage.rs (1)

194-262: Add a counterpart test for text completions streaming.

We only cover chat completions. Please mirror this suite for NvCreateCompletionRequest to prevent regressions in the text-completions path.

lib/llm/src/preprocessor.rs (1)

520-545: Final usage chunk emission looks correct; optional metrics annotation parity.

Logic for creating and emitting the usage-only chunk is solid. To keep annotations consistent with earlier chunks, consider attaching an LLMMetricAnnotation comment payload (not just the event) to this final usage chunk.

lib/llm/src/protocols/openai/chat_completions/delta.rs (2)

277-296: Final usage-only chunk aligns with OpenAI spec.

Empty choices and populated usage with total_tokens = prompt + completion is correct. Minor safety nit: consider saturating add to avoid theoretical overflow.

Apply if desired:

-        usage.total_tokens = usage.prompt_tokens + usage.completion_tokens;
+        usage.total_tokens = usage.prompt_tokens.saturating_add(usage.completion_tokens);

388-394: Avoid potential method-resolution ambiguity by calling inherent methods explicitly.

Current calls are fine in practice, but using a fully qualified path makes intent obvious and prevents accidental recursion if signatures change.

Apply:

-    fn create_usage_chunk(&self) -> NvCreateChatCompletionStreamResponse {
-        self.create_usage_chunk()
-    }
+    fn create_usage_chunk(&self) -> NvCreateChatCompletionStreamResponse {
+        DeltaGenerator::create_usage_chunk(self)
+    }
@@
-    fn is_usage_enabled(&self) -> bool {
-        self.is_usage_enabled()
-    }
+    fn is_usage_enabled(&self) -> bool {
+        DeltaGenerator::is_usage_enabled(self)
+    }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c8ecc40 and 7624f93.

📒 Files selected for processing (6)
  • .gitignore (1 hunks)
  • lib/llm/src/preprocessor.rs (5 hunks)
  • lib/llm/src/protocols/openai.rs (1 hunks)
  • lib/llm/src/protocols/openai/chat_completions/delta.rs (3 hunks)
  • lib/llm/src/protocols/openai/completions/delta.rs (4 hunks)
  • lib/llm/tests/test_streaming_usage.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
PR: ai-dynamo/dynamo#2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: There are two separate DeltaGenerator classes in the codebase: one for chat completions (lib/llm/src/protocols/openai/chat_completions/delta.rs with object "chat.completion.chunk") and one for text completions (lib/llm/src/protocols/openai/completions/delta.rs with object "text_completion"). They have different create_choice method signatures and serve different OpenAI API endpoints. The reasoning parsing functionality is only relevant to the chat completions DeltaGenerator.

Applied to files:

  • lib/llm/src/protocols/openai.rs
  • lib/llm/src/protocols/openai/completions/delta.rs
  • lib/llm/src/protocols/openai/chat_completions/delta.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
PR: ai-dynamo/dynamo#2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: The create_choice method exists on multiple different objects in the codebase. The DeltaGenerator::create_choice in lib/llm/src/protocols/openai/chat_completions/delta.rs has its own signature that was updated to include reasoning_content, but other objects in lib/llm/src/engines.rs have their own separate create_choice methods with different signatures that are not related to chat completions.

Applied to files:

  • lib/llm/src/protocols/openai/completions/delta.rs
  • lib/llm/src/protocols/openai/chat_completions/delta.rs
🧬 Code graph analysis (4)
lib/llm/tests/test_streaming_usage.rs (2)
lib/llm/src/preprocessor.rs (2)
  • new (101-107)
  • transform_postprocessor_stream (411-551)
lib/llm/src/protocols/openai/chat_completions/delta.rs (3)
  • new (85-130)
  • tokens (151-155)
  • response_generator (21-35)
lib/llm/src/protocols/openai.rs (2)
lib/llm/src/protocols/openai/chat_completions/delta.rs (4)
  • create_usage_chunk (282-296)
  • create_usage_chunk (388-390)
  • is_usage_enabled (299-301)
  • is_usage_enabled (392-394)
lib/llm/src/protocols/openai/completions/delta.rs (4)
  • create_usage_chunk (177-192)
  • create_usage_chunk (237-239)
  • is_usage_enabled (195-197)
  • is_usage_enabled (241-243)
lib/llm/src/protocols/openai/completions/delta.rs (2)
lib/llm/src/protocols/openai/chat_completions/delta.rs (4)
  • create_usage_chunk (282-296)
  • create_usage_chunk (388-390)
  • is_usage_enabled (299-301)
  • is_usage_enabled (392-394)
lib/llm/src/protocols/openai.rs (2)
  • create_usage_chunk (211-211)
  • is_usage_enabled (214-214)
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
lib/llm/src/protocols/openai/completions/delta.rs (4)
  • create_usage_chunk (177-192)
  • create_usage_chunk (237-239)
  • is_usage_enabled (195-197)
  • is_usage_enabled (241-243)
lib/llm/src/protocols/openai.rs (2)
  • create_usage_chunk (211-211)
  • is_usage_enabled (214-214)
🪛 GitHub Actions: Rust pre-merge checks
lib/llm/tests/test_streaming_usage.rs

[error] 121-121: Clippy: redundant closure. Replace '.map(|output| Annotated::from_data(output))' with '.map(Annotated::from_data)'. Step: cargo clippy --no-deps --all-targets -- -D warnings.


[error] 220-220: Clippy: needless-range-loop. Use enumerate() over the iterator instead of 'for i in 0..3'. Step: cargo clippy --no-deps --all-targets -- -D warnings.

🪛 GitHub Actions: NVIDIA Dynamo Github Validation
lib/llm/tests/test_streaming_usage.rs

[error] 121-121: Clippy: redundant closure at lib/llm/tests/test_streaming_usage.rs:121. Replace with function: Annotated::from_data. Command: cargo clippy --features block-manager --no-deps --all-targets -- -D warnings.


[error] 220-220: Clippy: needless-range-loop at lib/llm/tests/test_streaming_usage.rs:220. Use enumerate() instead of 'for i in 0..3'. Command: cargo clippy --features block-manager --no-deps --all-targets -- -D warnings.
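Both findings are mechanical rewrites; a self-contained illustration of the two lint fixes (names and data here are made up, while the real code maps Annotated::from_data over stream outputs):

fn wrap(x: i32) -> String {
    format!("chunk-{x}")
}

fn main() {
    let outputs = vec![1, 2, 3];

    // redundant-closure: `.map(|output| wrap(output))` becomes `.map(wrap)`.
    let wrapped: Vec<String> = outputs.iter().copied().map(wrap).collect();

    // needless-range-loop: instead of indexing with `for i in 0..3`,
    // let the iterator supply both the index and the element.
    for (i, w) in wrapped.iter().enumerate() {
        println!("{i}: {w}");
    }
}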

🔇 Additional comments (8)
.gitignore (1)

96-100: LGTM: ignore AI-generated instruction files.

Keeping these out of VCS is good hygiene and aligns with the repo’s policy for generated artifacts.

lib/llm/src/preprocessor.rs (1)

456-462: Finish-reason gating: confirm intended behavior on truncated streams.

You gate the final usage emission on finish_reason_sent. If the backend ends the stream without a finish_reason (timeout/cancel), no usage will be sent. If that’s intentional per spec, fine; otherwise consider emitting usage on graceful stream end even without a finish_reason.

Also applies to: 507-510

lib/llm/src/protocols/openai/completions/delta.rs (3)

12-18: Enable usage via stream_options.include_usage: LGTM.

Deriving enable_usage from stream_options.include_usage aligns with the spec.


151-167: Intermediate chunks with usage: None: LGTM.

Setting usage: None on content chunks matches OpenAI streaming semantics.


172-192: Final usage-only chunk builder: LGTM.

Empty choices with populated usage is the correct shape; computing total_tokens here is appropriate.

lib/llm/src/protocols/openai/chat_completions/delta.rs (3)

23-28: Enable usage flag derivation looks correct.

Deriving enable_usage from stream_options.include_usage with a safe default (false) matches the PR goal and OpenAI streaming semantics.


298-301: Accessor is concise and clear.


262-275: Confirmed: usage: None will serialize as "usage": null

CreateChatCompletionStreamResponse in lib/async-openai/src/types/chat.rs derives Serialize and its pub usage: Option<CompletionUsage> field has no #[serde(skip_serializing_if = "Option::is_none")], so None is serialized as JSON null (intermediate chunks will include "usage": null).
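That behavior is easy to verify in isolation; a minimal sketch with a stubbed usage type (the real CompletionUsage lives in async-openai; serde and serde_json are assumed as dependencies):

use serde::Serialize;

#[derive(Serialize)]
struct CompletionUsage {
    prompt_tokens: u32,
    completion_tokens: u32,
    total_tokens: u32,
}

#[derive(Serialize)]
struct StreamChunk {
    // No `#[serde(skip_serializing_if = "Option::is_none")]` here,
    // so `None` serializes as an explicit JSON `null`.
    usage: Option<CompletionUsage>,
}

fn main() {
    let json = serde_json::to_string(&StreamChunk { usage: None }).unwrap();
    assert_eq!(json, r#"{"usage":null}"#);
}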

Follow-up commits addressed the review feedback:

  • Remove redundant closure in map() call; replace needless range loop with iterator enumerate().
  • Replace self-recursive trait method calls with fully-qualified calls to the inherent methods on the DeltaGenerator structs; this prevents stack overflow from infinite recursion when calling create_usage_chunk() or is_usage_enabled() through the trait.
  • Replace direct addition with saturating_add to prevent potential overflow when calculating total_tokens from prompt_tokens + completion_tokens.
@paulhendricks paulhendricks (Member) left a comment

Looks good 👍

Resolved conflicts in preprocessor.rs by combining:
- Main's 'finished' flag to prevent duplicate stream processing
- Our 'finish_reason_sent' and 'usage_chunk_sent' flags for OpenAI-compliant usage
- Both features now coexist properly with tool calling jail functionality

Signed-off-by: Ryan Olson <rolson@nvidia.com>
@ryanolson ryanolson merged commit aa80ac4 into main Sep 16, 2025
15 of 16 checks passed
@ryanolson ryanolson deleted the ryan/oai-compliance-usage branch September 16, 2025 07:18
@coderabbitai coderabbitai bot mentioned this pull request Sep 16, 2025
kmkelle-nv pushed a commit that referenced this pull request Sep 17, 2025
…3022)

Signed-off-by: Ryan Olson <rolson@nvidia.com>
Signed-off-by: Kristen Kelleher <kkelleher@nvidia.com>