chore: jail stream optimizations (v1) #3195
Conversation
Signed-off-by: ayushag <ayushag@nvidia.com>

Force-pushed from 9abacc2 to ee4ef1e
Force-pushed from 7336d41 to 29e30f5
Walkthrough

Refactors chat jail streaming to use a new helper for constructing ChatChoiceStream and delegates tool-call end detection to a shared parser utility. Adds per-parser end-position functions (JSON, Harmony, Pythonic), a dispatcher, and re-exports the end-position API. Removes an internal jail method and updates imports accordingly.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant ChatJail as Chat Jail Stream
    participant Parsers as tool_calling::find_tool_call_end_position
    participant JSON as json::find_tool_call_end_position_json
    participant Harmony as harmony::find_tool_call_end_position_harmony
    participant Pythonic as pythonic::find_tool_call_end_position_pythonic
    Client->>ChatJail: Stream chunk
    ChatJail->>Parsers: find_tool_call_end_position(chunk, parser_str)
    alt parser == "json/*"
        Parsers->>JSON: compute end position (tokens/brackets/config)
        JSON-->>Parsers: end_pos
    else parser == "harmony"
        Parsers->>Harmony: end by length
        Harmony-->>Parsers: end_pos
    else parser == "pythonic"
        Parsers->>Pythonic: end by length
        Pythonic-->>Parsers: end_pos
    else
        Parsers-->>Parsers: default to chunk.len()
    end
    Parsers-->>ChatJail: end_pos
    ChatJail->>ChatJail: create_choice_stream(index, role, ...)
    ChatJail-->>Client: Emit ChatChoiceStream delta
    note over ChatJail: Centralized construction via create_choice_stream
```
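The dispatch flow above can be sketched in Rust. This is a hypothetical, simplified reconstruction based only on the behavior described in the diagram: the function names mirror the PR, but the signatures, the hermes end token, and the parser-name routing are assumptions, not the actual implementation.

```rust
// Simplified sketch of the end-position dispatcher; names follow the PR,
// signatures and token values are illustrative assumptions.

fn find_tool_call_end_position_json(chunk: &str, parser: &str) -> usize {
    // JSON-style parsers: look for a configured end token or the last ']'.
    match parser {
        "hermes" => chunk
            .find("</tool_call>")
            .map(|p| p + "</tool_call>".len())
            .unwrap_or(chunk.len()),
        "mistral" | "phi4" => chunk.rfind(']').map(|p| p + 1).unwrap_or(chunk.len()),
        _ => chunk.len(),
    }
}

fn find_tool_call_end_position_harmony(chunk: &str) -> usize {
    chunk.len() // placeholder: end by length
}

fn find_tool_call_end_position_pythonic(chunk: &str) -> usize {
    chunk.len() // placeholder: end by length
}

/// Dispatcher: route to the per-parser helper, defaulting to chunk.len().
fn find_tool_call_end_position(chunk: &str, parser: Option<&str>) -> usize {
    match parser {
        Some("harmony") => find_tool_call_end_position_harmony(chunk),
        Some("pythonic") => find_tool_call_end_position_pythonic(chunk),
        Some(p) => find_tool_call_end_position_json(chunk, p),
        None => chunk.len(),
    }
}

fn main() {
    let s = r#"<tool_call>{"name":"f"}</tool_call> trailing text"#;
    let end = find_tool_call_end_position(s, Some("hermes"));
    // The split position lands just past the closing tag, leaving the
    // trailing text for normal content emission.
    assert_eq!(&s[..end], r#"<tool_call>{"name":"f"}</tool_call>"#);
}
```

The key property is the safe default: any unknown parser name degrades to `chunk.len()`, so the jail never splits mid-stream on a parser it does not understand.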
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/llm/src/protocols/openai/chat_completions/jail.rs (1)
485-538: Do not reorder emissions by type; preserve original streaming order

Grouping emissions into tool/content, trailing, and pass-through and then emitting by category reorders tokens within the same incoming chunk and breaks strict streaming semantics.
- Emit all_emissions in their original order (choice-order within the incoming response).
- For each emission pick the metadata tuple to attach (switch metadata per-emission: preserved_metadata for tool/content/trailing when required; current_metadata for PassThrough) while preserving order.
- For Packed mode, aggregate consecutive emissions that share identical metadata into a single packed response — do not group by type.
Location: lib/llm/src/protocols/openai/chat_completions/jail.rs (around lines 485–538)
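The order-preserving packing this comment recommends could look like the following sketch. The types (`Emission`, `Kind`, `Metadata`) and the metadata-selection rule are illustrative stand-ins for the PR's actual structures; only the algorithm — keep original order, switch metadata per emission, merge consecutive runs with identical metadata — follows the review text.

```rust
// Illustrative types; the real jail uses ChatChoiceStream emissions.
#[derive(Clone, Debug, PartialEq)]
enum Kind { ToolCall, Content, Trailing, PassThrough }

#[derive(Clone, Debug)]
struct Emission { kind: Kind, text: String }

#[derive(Clone, Debug, PartialEq)]
struct Metadata { id: String }

/// Per-emission metadata switch: pass-through keeps the current chunk's
/// metadata; tool/content/trailing attach the preserved metadata.
fn metadata_for(e: &Emission, preserved: &Metadata, current: &Metadata) -> Metadata {
    match e.kind {
        Kind::PassThrough => current.clone(),
        _ => preserved.clone(),
    }
}

/// Packed mode: aggregate only *consecutive* emissions sharing identical
/// metadata, never regrouping by type, so original order is preserved.
fn pack(emissions: &[Emission], preserved: &Metadata, current: &Metadata)
    -> Vec<(Metadata, Vec<Emission>)>
{
    let mut out: Vec<(Metadata, Vec<Emission>)> = Vec::new();
    for e in emissions {
        let meta = metadata_for(e, preserved, current);
        match out.last_mut() {
            Some((m, group)) if *m == meta => group.push(e.clone()),
            _ => out.push((meta, vec![e.clone()])),
        }
    }
    out
}

fn main() {
    let preserved = Metadata { id: "p".into() };
    let current = Metadata { id: "c".into() };
    let ems = vec![
        Emission { kind: Kind::ToolCall, text: "t1".into() },
        Emission { kind: Kind::Content, text: "c1".into() },
        Emission { kind: Kind::PassThrough, text: "x".into() },
        Emission { kind: Kind::Trailing, text: "tr".into() },
    ];
    // Three packed responses: [tool, content] (preserved), [pass-through]
    // (current), [trailing] (preserved) — order within the chunk intact.
    assert_eq!(pack(&ems, &preserved, &current).len(), 3);
}
```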
🧹 Nitpick comments (6)
lib/llm/src/protocols/openai/chat_completions/jail.rs (3)
651-668: Avoid double parsing in should_end_jail()

You parse accumulated_content twice (early_exit already parsed, then parse again before calling find_tool_call_end_position). This adds latency and risks inconsistent results under non-deterministic parsers. Use the dispatcher directly once early_exit is true.
Apply:
```diff
-        } else if early_exit {
-            // For early exit, find where the complete tool call ends
-            if let Some(parser) = &self.tool_call_parser {
-                if let Ok((_, _)) =
-                    try_tool_call_parse_aggregate(accumulated_content, Some(parser)).await
-                {
-                    let split_pos = find_tool_call_end_position(accumulated_content, Some(parser));
-                    (true, split_pos)
-                } else {
-                    (false, accumulated_content.len())
-                }
-            } else {
-                (false, accumulated_content.len())
-            }
+        } else if early_exit {
+            // For early exit, find where the complete tool call ends
+            if let Some(parser) = &self.tool_call_parser {
+                let split_pos = find_tool_call_end_position(accumulated_content, Some(parser));
+                (true, split_pos)
+            } else {
+                (false, accumulated_content.len())
+            }
```
505-528: Comment mismatch: trailing emissions "always as individual chunks" is not guaranteed

emit_choice_emissions still respects emission_mode. In Packed mode, trailing emissions will be bundled, contradicting the comment.
- Either update the comment to reflect behavior.
- Or force SingleChoicePerChunk for trailing emissions.
Apply one:
```diff
-        // Emit trailing content separately (always as individual chunks)
+        // Emit trailing content separately (mode-dependent)
```

Or:
```diff
-        let responses = self.emit_choice_emissions(trailing_emissions, chat_response, preserved_metadata);
+        let saved_mode = self.emission_mode;
+        let mut responses = Vec::new();
+        // Force single-choice emission for trailing
+        let tmp = JailedStream { emission_mode: EmissionMode::SingleChoicePerChunk, ..self.clone() };
+        responses = tmp.emit_choice_emissions(trailing_emissions, chat_response, preserved_metadata);
+        // saved_mode unused; shown for intent if you choose to refactor emit API
```
456-472: Multi-choice tool-call policy may diverge from prior "emit first tool-call choice only" design

Retrieved learning indicates prior jail logic intentionally emits only the first choice that contains tool calls when unjailing, dropping other accumulated choices. Current logic collects and emits tool calls for all choices in a chunk.
- If that policy should apply here, filter tool_content_emissions to only the lowest index (or first encountered) containing tool calls, and drop the rest for that unjail event.
I can wire a minimal filter preserving pass-through and trailing emissions untouched.
lib/parsers/src/tool_calling/harmony/mod.rs (1)
11-13: Naive end-position (chunk.len()) is acceptable as a placeholder

Given Harmony uses JSON-like patterns, returning chunk.len() is fine for now. Consider documenting why end detection is not needed, or add a TODO if future heuristics become necessary.
lib/parsers/src/tool_calling/pythonic/mod.rs (1)
9-11: End-position default to chunk.len()

Same note as Harmony: acceptable baseline; consider a comment/TODO for future Pythonic-specific end detection if needed.
lib/parsers/src/tool_calling/json/mod.rs (1)
45-71: Cover missing parser keys and edge cases in JSON end-position finder
- Include "llama3_json" and "deepseek_v3_1" in JSON cases. Both are present in get_tool_parser_map and likely terminate with a closing ']' for tool call arrays.
- Consider supporting multiple end tokens (not just .first()) for hermes/nemotron_deci.
- rfind(']') can match inside strings; acceptable heuristic, but note limitation.
Apply:
```diff
     match parser {
         "hermes" | "nemotron_deci" => {
             if let Some(end_token) = config.tool_call_end_tokens.first() {
                 if let Some(pos) = chunk.find(end_token) {
                     pos + end_token.len()
                 } else {
                     chunk.len()
                 }
             } else {
                 chunk.len()
             }
         }
-        "mistral" | "phi4" => {
+        "mistral" | "phi4" | "llama3_json" | "deepseek_v3_1" => {
             if let Some(pos) = chunk.rfind(']') {
                 pos + 1
             } else {
                 chunk.len()
             }
         }
         _ => chunk.len(),
     }
```

Optionally, support multiple end tokens:
- "hermes" | "nemotron_deci" => { - if let Some(end_token) = config.tool_call_end_tokens.first() { - if let Some(pos) = chunk.find(end_token) { - pos + end_token.len() - } else { - chunk.len() - } - } else { - chunk.len() - } - } + "hermes" | "nemotron_deci" => { + for end_token in &config.tool_call_end_tokens { + if !end_token.is_empty() { + if let Some(pos) = chunk.find(end_token) { + return pos + end_token.len(); + } + } + } + chunk.len() + }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- lib/llm/src/protocols/openai/chat_completions/jail.rs (10 hunks)
- lib/parsers/src/tool_calling/harmony/mod.rs (1 hunk)
- lib/parsers/src/tool_calling/json/mod.rs (1 hunk)
- lib/parsers/src/tool_calling/mod.rs (1 hunk)
- lib/parsers/src/tool_calling/parsers.rs (2 hunks)
- lib/parsers/src/tool_calling/pythonic/mod.rs (1 hunk)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-09-10T22:32:12.978Z
Learnt from: zhongdaor-nv
PR: ai-dynamo/dynamo#2999
File: lib/parsers/src/tool_calling/harmony/harmony_parser.rs:250-256
Timestamp: 2025-09-10T22:32:12.978Z
Learning: In lib/parsers/src/tool_calling/harmony/harmony_parser.rs, the team prefers to maintain identical code patterns between parse_tool_calls_harmony and parse_tool_calls_harmony_complete functions, including message.content[0] indexing, to ensure consistency between streaming and complete parser implementations.
Applied to files:
- lib/parsers/src/tool_calling/harmony/mod.rs
- lib/llm/src/protocols/openai/chat_completions/jail.rs
- lib/parsers/src/tool_calling/mod.rs
- lib/parsers/src/tool_calling/parsers.rs
📚 Learning: 2025-09-10T15:27:42.511Z
Learnt from: ayushag-nv
PR: ai-dynamo/dynamo#2932
File: lib/llm/src/preprocessor.rs:768-844
Timestamp: 2025-09-10T15:27:42.511Z
Learning: In the tool calling jail implementation in lib/llm/src/preprocessor.rs, the design intentionally emits only the first accumulated choice that contains tool calls during unjailing, dropping other accumulated choices. This is a deliberate design decision, not a bug.
Applied to files:
lib/llm/src/protocols/openai/chat_completions/jail.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
PR: ai-dynamo/dynamo#2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: The create_choice method exists on multiple different objects in the codebase. The DeltaGenerator::create_choice in lib/llm/src/protocols/openai/chat_completions/delta.rs has its own signature that was updated to include reasoning_content, but other objects in lib/llm/src/engines.rs have their own separate create_choice methods with different signatures that are not related to chat completions.
Applied to files:
lib/llm/src/protocols/openai/chat_completions/jail.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
PR: ai-dynamo/dynamo#2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: There are two separate DeltaGenerator classes in the codebase: one for chat completions (lib/llm/src/protocols/openai/chat_completions/delta.rs with object "chat.completion.chunk") and one for text completions (lib/llm/src/protocols/openai/completions/delta.rs with object "text_completion"). They have different create_choice method signatures and serve different OpenAI API endpoints. The reasoning parsing functionality is only relevant to the chat completions DeltaGenerator.
Applied to files:
lib/llm/src/protocols/openai/chat_completions/jail.rs
📚 Learning: 2025-09-10T05:04:58.417Z
Learnt from: ayushag-nv
PR: ai-dynamo/dynamo#2932
File: lib/llm/src/protocols/openai/chat_completions/aggregator.rs:66-86
Timestamp: 2025-09-10T05:04:58.417Z
Learning: In the dynamo codebase, tool call chunks from streaming responses always contain complete tool calls (one chunk = one tool call), unlike standard OpenAI streaming where tool calls can be fragmented across multiple chunks. The convert_tool_chunk_to_message_tool_call function correctly assumes complete tool call data in each chunk.
Applied to files:
lib/llm/src/protocols/openai/chat_completions/jail.rs
🧬 Code graph analysis (3)
lib/llm/src/protocols/openai/chat_completions/jail.rs (2)
lib/parsers/src/tool_calling/parsers.rs (3)
- get_tool_parser_map (23-37)
- detect_tool_call_start (98-125)
- find_tool_call_end_position (127-155)

lib/parsers/src/tool_calling/tools.rs (1)
- try_tool_call_parse_aggregate (10-42)
lib/parsers/src/tool_calling/mod.rs (1)
lib/parsers/src/tool_calling/parsers.rs (4)
- detect_and_parse_tool_call (72-96)
- detect_tool_call_start (98-125)
- find_tool_call_end_position (127-155)
- try_tool_call_parse (43-69)
lib/parsers/src/tool_calling/parsers.rs (4)
lib/parsers/src/tool_calling/config.rs (2)
- harmony (145-154)
- pythonic (138-143)

lib/parsers/src/tool_calling/harmony/mod.rs (1)
- find_tool_call_end_position_harmony (11-13)

lib/parsers/src/tool_calling/json/mod.rs (1)
- find_tool_call_end_position_json (45-71)

lib/parsers/src/tool_calling/pythonic/mod.rs (1)
- find_tool_call_end_position_pythonic (9-11)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (5)
lib/llm/src/protocols/openai/chat_completions/jail.rs (3)
77-99: Good abstraction: centralized ChatChoiceStream construction

The helper cleanly standardizes choice creation and removes duplication. No issues.
149-156: Consistent use of create_choice_stream across emission sites

Using the helper for prefix/trailing/pass-through/tool-call/finalization is consistent and improves maintainability.
Also applies to: 189-196, 221-228, 264-271, 309-316, 339-346, 703-710, 714-721
703-710: Role hard-coded to Assistant for tool-call emissions

Setting role to Some(Role::Assistant) is likely correct for OpenAI-compatible tool-call deltas; however, some providers stream tool calls with role None except on the very first delta. Confirm downstream consumers expect the role to be present here.
lib/parsers/src/tool_calling/mod.rs (1)
16-19: Publicly re-exporting find_tool_call_end_position is good API hygiene

Centralizes end-position helpers behind a single dispatcher. Looks good.
lib/parsers/src/tool_calling/parsers.rs (1)
5-15: LGTM: imports aligned with new end-position helpers

The import additions look correct and match the new dispatcher usage below.
Signed-off-by: ayushag <ayushag@nvidia.com>
GuanLuo left a comment
Looks like find_tool_call_end_position wasn't handling harmony; thanks for fixing it along the way.
/ok to test abfec30
/ok to test 0343a26
Signed-off-by: ayushag <ayushag@nvidia.com> Signed-off-by: Jason Zhou <jasonzho@nvidia.com>
Signed-off-by: ayushag <ayushag@nvidia.com> Signed-off-by: Jason Zhou <jasonzho@nvidia.com>
Signed-off-by: ayushag <ayushag@nvidia.com> Signed-off-by: Kyle H <kylhuang@nvidia.com>
Overview:
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)