fix: implement OpenAI-compliant usage stats for streaming responses #3022
Conversation
- Only include usage when stream_options.include_usage is true
- Send usage in a final chunk with an empty choices array
- All content chunks now have usage: null per OpenAI spec
- Add comprehensive tests for streaming usage behavior

Per OpenAI spec, usage statistics should only be sent in a dedicated final chunk before data: [DONE], maintaining compatibility with clients expecting this behavior.
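For concreteness, here is a sketch of the two chunk shapes this implies on the wire. The field subset is illustrative only, built with serde_json rather than the crate's own response types:

```rust
use serde_json::json;

fn main() {
    // Intermediate content chunk: choices carry the delta, usage is null.
    let content_chunk = json!({
        "object": "chat.completion.chunk",
        "choices": [{ "index": 0, "delta": { "content": "Hello" } }],
        "usage": null
    });

    // Dedicated final chunk before `data: [DONE]`: empty choices, usage set.
    let usage_chunk = json!({
        "object": "chat.completion.chunk",
        "choices": [],
        "usage": { "prompt_tokens": 5, "completion_tokens": 7, "total_tokens": 12 }
    });

    println!("{content_chunk}\n{usage_chunk}");
}
```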
Walkthrough

Implements OpenAI-compliant streaming usage reporting: adds trait methods to generate/query usage, adjusts delta generators to omit usage in intermediate chunks and create a final usage-only chunk, updates the postprocessor to detect stream completion and emit the final usage chunk when enabled, adds targeted tests, and updates .gitignore.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant C as Client
    participant P as OpenAIPreprocessor
    participant B as Backend Stream
    participant G as DeltaGenerator
    rect rgba(230,245,255,0.6)
        Note over C,P: Streamed chat/completion request (stream=true)
        C->>P: Request (include_usage: [true|false])
        P->>B: Subscribe to backend stream
    end
    loop For each backend chunk
        B-->>P: BackendOutput
        P->>G: Build delta (content/logprobs)
        alt Intermediate chunk
            G-->>P: Chunk with choices, usage=None
            P-->>C: Stream chunk (usage=None)
            P->>P: Track finish_reason if present
        end
    end
    alt include_usage=true AND finish_reason observed
        rect rgba(240,255,230,0.6)
            Note over P,G: End-of-stream finalization
            P->>G: create_usage_chunk()
            G-->>P: Final usage-only chunk (choices=[])
            P-->>C: Emit final usage chunk (annotated llm_metrics)
        end
    else include_usage=false OR no finish_reason
        Note over P: Do not emit usage chunk
    end
```

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🧹 Nitpick comments (4)
lib/llm/tests/test_streaming_usage.rs (1)
194-262: Add a counterpart test for text completions streaming.

We only cover chat completions. Please mirror this suite for NvCreateCompletionRequest to prevent regressions in the text-completions path.

lib/llm/src/preprocessor.rs (1)

520-545: Final usage chunk emission looks correct; optional metrics annotation parity.

Logic for creating and emitting the usage-only chunk is solid. To keep annotations consistent with earlier chunks, consider attaching an LLMMetricAnnotation comment payload (not just the event) to this final usage chunk.
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
277-296: Final usage-only chunk aligns with OpenAI spec.

Empty `choices` and populated `usage` with `total_tokens = prompt + completion` is correct. Minor safety nit: consider a saturating add to avoid theoretical overflow. Apply if desired:

```diff
- usage.total_tokens = usage.prompt_tokens + usage.completion_tokens;
+ usage.total_tokens = usage.prompt_tokens.saturating_add(usage.completion_tokens);
```
388-394: Avoid potential method-resolution ambiguity by calling inherent methods explicitly.

Current calls are fine in practice, but using a fully qualified path makes the intent obvious and prevents accidental recursion if signatures change.

Apply:

```diff
- fn create_usage_chunk(&self) -> NvCreateChatCompletionStreamResponse {
-     self.create_usage_chunk()
- }
+ fn create_usage_chunk(&self) -> NvCreateChatCompletionStreamResponse {
+     DeltaGenerator::create_usage_chunk(self)
+ }
@@
- fn is_usage_enabled(&self) -> bool {
-     self.is_usage_enabled()
- }
+ fn is_usage_enabled(&self) -> bool {
+     DeltaGenerator::is_usage_enabled(self)
+ }
```
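A standalone illustration of the resolution rule at play, with hypothetical `Gen`/`Answering` names: when an inherent method shares a name with a trait method, `self.answer()` resolves to the inherent one, but the fully qualified form pins the target even if the inherent method is later renamed:

```rust
struct Gen;

impl Gen {
    fn answer(&self) -> u32 { 42 } // inherent method
}

trait Answering {
    fn answer(&self) -> u32;
}

impl Answering for Gen {
    fn answer(&self) -> u32 {
        // `self.answer()` would resolve to the inherent method today, but if
        // that method were renamed, the same call would resolve to this trait
        // method and recurse. Qualifying removes the ambiguity:
        Gen::answer(self)
    }
}

fn main() {
    assert_eq!(Answering::answer(&Gen), 42);
}
```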
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- .gitignore (1 hunks)
- lib/llm/src/preprocessor.rs (5 hunks)
- lib/llm/src/protocols/openai.rs (1 hunks)
- lib/llm/src/protocols/openai/chat_completions/delta.rs (3 hunks)
- lib/llm/src/protocols/openai/completions/delta.rs (4 hunks)
- lib/llm/tests/test_streaming_usage.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
PR: ai-dynamo/dynamo#2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: There are two separate DeltaGenerator classes in the codebase: one for chat completions (lib/llm/src/protocols/openai/chat_completions/delta.rs with object "chat.completion.chunk") and one for text completions (lib/llm/src/protocols/openai/completions/delta.rs with object "text_completion"). They have different create_choice method signatures and serve different OpenAI API endpoints. The reasoning parsing functionality is only relevant to the chat completions DeltaGenerator.
Applied to files:
- lib/llm/src/protocols/openai.rs
- lib/llm/src/protocols/openai/completions/delta.rs
- lib/llm/src/protocols/openai/chat_completions/delta.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
PR: ai-dynamo/dynamo#2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: The create_choice method exists on multiple different objects in the codebase. The DeltaGenerator::create_choice in lib/llm/src/protocols/openai/chat_completions/delta.rs has its own signature that was updated to include reasoning_content, but other objects in lib/llm/src/engines.rs have their own separate create_choice methods with different signatures that are not related to chat completions.
Applied to files:
- lib/llm/src/protocols/openai/completions/delta.rs
- lib/llm/src/protocols/openai/chat_completions/delta.rs
🧬 Code graph analysis (4)
lib/llm/tests/test_streaming_usage.rs (2)
lib/llm/src/preprocessor.rs (2)
- new (101-107)
- transform_postprocessor_stream (411-551)

lib/llm/src/protocols/openai/chat_completions/delta.rs (3)
- new (85-130)
- tokens (151-155)
- response_generator (21-35)
lib/llm/src/protocols/openai.rs (2)
lib/llm/src/protocols/openai/chat_completions/delta.rs (4)
- create_usage_chunk (282-296)
- create_usage_chunk (388-390)
- is_usage_enabled (299-301)
- is_usage_enabled (392-394)

lib/llm/src/protocols/openai/completions/delta.rs (4)
- create_usage_chunk (177-192)
- create_usage_chunk (237-239)
- is_usage_enabled (195-197)
- is_usage_enabled (241-243)
lib/llm/src/protocols/openai/completions/delta.rs (2)
lib/llm/src/protocols/openai/chat_completions/delta.rs (4)
- create_usage_chunk (282-296)
- create_usage_chunk (388-390)
- is_usage_enabled (299-301)
- is_usage_enabled (392-394)

lib/llm/src/protocols/openai.rs (2)
- create_usage_chunk (211-211)
- is_usage_enabled (214-214)
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
lib/llm/src/protocols/openai/completions/delta.rs (4)
- create_usage_chunk (177-192)
- create_usage_chunk (237-239)
- is_usage_enabled (195-197)
- is_usage_enabled (241-243)

lib/llm/src/protocols/openai.rs (2)
- create_usage_chunk (211-211)
- is_usage_enabled (214-214)
🪛 GitHub Actions: Rust pre-merge checks
lib/llm/tests/test_streaming_usage.rs
[error] 121-121: Clippy: redundant closure. Replace '.map(|output| Annotated::from_data(output))' with '.map(Annotated::from_data)'. Step: cargo clippy --no-deps --all-targets -- -D warnings.
[error] 220-220: Clippy: needless-range-loop. Use enumerate() over the iterator instead of 'for i in 0..3'. Step: cargo clippy --no-deps --all-targets -- -D warnings.
🪛 GitHub Actions: NVIDIA Dynamo Github Validation
lib/llm/tests/test_streaming_usage.rs
[error] 121-121: Clippy: redundant closure at lib/llm/tests/test_streaming_usage.rs:121. Replace with function: Annotated::from_data. Command: cargo clippy --features block-manager --no-deps --all-targets -- -D warnings.
[error] 220-220: Clippy: needless-range-loop at lib/llm/tests/test_streaming_usage.rs:220. Use enumerate() instead of 'for i in 0..3'. Command: cargo clippy --features block-manager --no-deps --all-targets -- -D warnings.
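Both lints are mechanical to fix. A standalone illustration of the two suggested forms, using generic data rather than the test file's actual code:

```rust
fn main() {
    let outputs = vec![1u32, 2, 3];

    // Redundant closure: pass the function item directly instead of
    // writing `.map(|n| u64::from(n))`.
    let widened: Vec<u64> = outputs.iter().copied().map(u64::from).collect();

    // Needless range loop: iterate with enumerate() instead of
    // `for i in 0..widened.len() { ... widened[i] ... }`.
    for (i, value) in widened.iter().enumerate() {
        println!("chunk {i}: {value}");
    }
}
```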
🔇 Additional comments (8)
.gitignore (1)
96-100: LGTM: ignore AI-generated instruction files.

Keeping these out of VCS is good hygiene and aligns with the repo's policy for generated artifacts.
lib/llm/src/preprocessor.rs (1)
456-462: Finish-reason gating: confirm intended behavior on truncated streams.

You gate the final usage emission on `finish_reason_sent`. If the backend ends the stream without a finish_reason (timeout/cancel), no usage will be sent. If that's intentional per spec, fine; otherwise consider emitting usage on graceful stream end even without a finish_reason.

Also applies to: 507-510
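A minimal sketch of the gating described here; the flag and method names are taken from the review discussion, not the actual code:

```rust
struct StreamState {
    enable_usage: bool,
    finish_reason_sent: bool,
    usage_chunk_sent: bool,
}

impl StreamState {
    /// Decide whether the final usage-only chunk should be emitted
    /// once the backend stream completes.
    fn should_emit_usage(&self) -> bool {
        self.enable_usage && self.finish_reason_sent && !self.usage_chunk_sent
    }
}

fn main() {
    let finished = StreamState {
        enable_usage: true,
        finish_reason_sent: true,
        usage_chunk_sent: false,
    };
    assert!(finished.should_emit_usage());

    // A truncated stream (no finish_reason observed) emits no usage chunk.
    let truncated = StreamState { finish_reason_sent: false, ..finished };
    assert!(!truncated.should_emit_usage());
}
```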
lib/llm/src/protocols/openai/completions/delta.rs (3)
12-18: Enable usage via stream_options.include_usage: LGTM.

Deriving `enable_usage` from `stream_options.include_usage` aligns with the spec.
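A stand-in sketch of the derivation; the real request and stream-options types live in the crate's protocol modules, and these structs only model the shape:

```rust
struct StreamOptions {
    include_usage: bool,
}

struct Request {
    stream_options: Option<StreamOptions>,
}

fn enable_usage(req: &Request) -> bool {
    // Default to false when stream_options is absent, per the PR.
    req.stream_options
        .as_ref()
        .map(|opts| opts.include_usage)
        .unwrap_or(false)
}

fn main() {
    let req = Request { stream_options: None };
    assert!(!enable_usage(&req));
}
```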
151-167: Intermediate chunks with usage: None: LGTM.

Setting `usage: None` on content chunks matches OpenAI streaming semantics.
172-192: Final usage-only chunk builder: LGTM.

Empty choices with populated usage is the correct shape; computing total_tokens here is appropriate.
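A minimal stand-in sketch of the builder's shape; the real response and usage types come from the crate's protocol modules, not these structs:

```rust
struct Usage {
    prompt_tokens: u32,
    completion_tokens: u32,
    total_tokens: u32,
}

struct StreamChunk {
    choices: Vec<String>, // stands in for the real choice type
    usage: Option<Usage>,
}

fn create_usage_chunk(prompt_tokens: u32, completion_tokens: u32) -> StreamChunk {
    StreamChunk {
        // Empty choices: the final chunk carries usage only, per OpenAI spec.
        choices: Vec::new(),
        usage: Some(Usage {
            prompt_tokens,
            completion_tokens,
            // Saturating add, per the overflow nit above.
            total_tokens: prompt_tokens.saturating_add(completion_tokens),
        }),
    }
}

fn main() {
    let chunk = create_usage_chunk(5, 7);
    assert!(chunk.choices.is_empty());
    assert_eq!(chunk.usage.unwrap().total_tokens, 12);
}
```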
lib/llm/src/protocols/openai/chat_completions/delta.rs (3)
23-28: Enable usage flag derivation looks correct.

Deriving `enable_usage` from `stream_options.include_usage` with a safe default (false) matches the PR goal and OpenAI streaming semantics.
298-301: Accessor is concise and clear.
262-275: Confirmed: `usage: None` will serialize as `"usage": null`.

CreateChatCompletionStreamResponse in lib/async-openai/src/types/chat.rs derives Serialize, and its `pub usage: Option<CompletionUsage>` field has no `#[serde(skip_serializing_if = "Option::is_none")]`, so None is serialized as JSON null (intermediate chunks will include `"usage": null`).
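This serde default is easy to confirm in isolation with a toy struct (a hypothetical `Chunk`, not the crate's type):

```rust
use serde::Serialize;

#[derive(Serialize)]
struct Chunk {
    // No `skip_serializing_if`, so a None still appears in the output.
    usage: Option<u32>,
}

fn main() {
    let json = serde_json::to_string(&Chunk { usage: None }).unwrap();
    assert_eq!(json, r#"{"usage":null}"#);
}
```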
- Remove redundant closure in map() call
- Replace needless range loop with iterator enumerate()
Replace self-recursive trait method calls with fully-qualified calls to the inherent methods on DeltaGenerator structs. This prevents stack overflow from infinite recursion when calling create_usage_chunk() or is_usage_enabled() through the trait.
Replace direct addition with saturating_add to prevent potential overflow when calculating total_tokens from prompt_tokens + completion_tokens.
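For reference, saturating_add clamps at the type's maximum instead of overflowing; a plain `+` would panic in debug builds and wrap in release builds:

```rust
fn main() {
    let prompt_tokens: u32 = u32::MAX;
    let completion_tokens: u32 = 1;
    // Clamps to u32::MAX rather than panicking or wrapping to 0.
    assert_eq!(prompt_tokens.saturating_add(completion_tokens), u32::MAX);
}
```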
paulhendricks left a comment
Looks good 👍
Resolved conflicts in preprocessor.rs by combining:
- Main's 'finished' flag to prevent duplicate stream processing
- Our 'finish_reason_sent' and 'usage_chunk_sent' flags for OpenAI-compliant usage
- Both features now coexist properly with tool calling jail functionality

Signed-off-by: Ryan Olson <rolson@nvidia.com>
…3022) Signed-off-by: Ryan Olson <rolson@nvidia.com> Signed-off-by: Kristen Kelleher <kkelleher@nvidia.com>
Per OpenAI spec, usage statistics should only be sent in a dedicated final chunk before data: [DONE], maintaining compatibility with clients expecting this behavior.
Summary by CodeRabbit
New Features
Tests
Chores