chore: JailStream Optimizations #3172
Signed-off-by: ayushag <ayushag@nvidia.com>
Signed-off-by: Biswa Panda <biswa.panda@gmail.com> Signed-off-by: ayushag <ayushag@nvidia.com>
…tion

- Implement JailedStream with builder pattern for clean API
- Use async-stream's stream! macro for cleaner async implementation
- Support configurable jail start/end sequences
- Integrate tool call detection and parsing
- Add comprehensive tests with shared test utilities
- Create documentation for usage patterns

The JailedStream provides a standalone solution for accumulating tokens when certain sequences are detected (jailing) and releasing them as a single chunk when the jail ends, enabling proper tool call handling in streaming responses.

Signed-off-by: Ryan Olson <rolson@nvidia.com>
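The accumulate-and-release behavior described in this commit can be sketched as a plain state machine. This is a minimal illustration with hypothetical simplified types; the actual JailedStream is an async stream combinator built with the stream! macro, not this synchronous shape.

```rust
/// Hypothetical sketch of the jail idea: pass chunks through until a start
/// sequence appears, then accumulate silently until the end sequence arrives.
#[derive(Default)]
struct Jail {
    jailed: bool,
    buffer: String,
}

impl Jail {
    /// Feed one chunk; returns text to emit downstream, if any.
    fn feed(&mut self, chunk: &str, start: &str, end: &str) -> Option<String> {
        if !self.jailed {
            if chunk.contains(start) {
                // Enter jail: start accumulating instead of emitting.
                self.jailed = true;
                self.buffer.push_str(chunk);
                None // jailed silently; no empty chunks are yielded
            } else {
                Some(chunk.to_string()) // normal content passes through untouched
            }
        } else {
            self.buffer.push_str(chunk);
            if self.buffer.contains(end) {
                // Exit jail: release everything as a single chunk.
                self.jailed = false;
                Some(std::mem::take(&mut self.buffer))
            } else {
                None
            }
        }
    }
}
```

In the real implementation the released buffer is parsed for tool calls before emission rather than forwarded verbatim.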
- Remove unnecessary annotation code that was never used
- Fix efficiency issue by avoiding cloning when not jailing
- Only clone response data when actually entering jail state
- Simplify jailed behavior to not yield empty chunks
- Accumulate content silently and only yield final result
- Update tests to match new behavior

This makes the stream processing more efficient by:
1. Eliminating unnecessary clones during normal streaming
2. Removing pointless annotation objects
3. Not sending empty chunks while accumulating
4. Only yielding the final aggregated result when jail ends

Signed-off-by: Ryan Olson <rolson@nvidia.com>
…rhead

- Remove ManyOut type alias and AsyncEngineContextProvider dependency
- Change apply() to return impl Stream instead of Pin<Box<ResponseStream>>
- Remove unnecessary context extraction and re-wrapping
- Use tokio::pin!() for stack pinning instead of Box::pin() (more efficient)
- Simplify tests by removing mock context infrastructure
- Update documentation with correct usage examples

This makes JailedStream a pure stream combinator that:
1. Doesn't handle context (caller's responsibility)
2. Avoids unnecessary boxing in the library
3. Can be composed with other stream transformers
4. Lets callers decide when to add boxing/context

Performance improvements:
- No heap allocation for pinning (stack pinning instead)
- No context passing overhead
- Direct stream transformation without wrapper types

Signed-off-by: Ryan Olson <rolson@nvidia.com>
- Remove old apply_tool_calling_jail_internal function
- Remove PossibleToolCallAnnotation struct and implementations
- Update apply_tool_calling_jail_with_parser to conditionally use JailedStream
- Only apply jail logic when tool_call_parser is configured
- Comment out old jail tests that are no longer applicable
- Clean up unused imports and run cargo fmt/clippy

- Change jail entry logic from if/else to evaluate both sequence and tool call conditions
- Add early exit detection via should_exit_jail_early() method
- Support two paths to enter jail: explicit sequences OR tool call detection
- Support two paths to exit jail: end sequences OR complete tool call parsing
- Add tests for dual entry paths and early exit behavior
- Improve debug logging to show which condition triggered jail start/end

- Separate context passing from stream transformations
- Change transform_embedding_postprocessor_stream to return impl Stream
- Update transform_postprocessor_stream to accept context as parameter
- Defer boxing until the final ResponseStream::new() call
- Extract context once at the beginning of generate() methods
- Reduce boxing from 5-6 locations to 1 per request
- Improves performance by eliminating intermediate heap allocations
- Only apply jail when tools are present AND tool_choice is not None
- Error if tool_choice is 'required', 'auto', or named but no parser configured
- Skip jail entirely when tool_choice is explicitly None
- Add should_apply_tool_jail() to determine jail application logic
- Refactor to separate decision logic from stream processing
- Avoid lifetime issues by evaluating conditions before stream transformation
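The decision logic in this commit can be sketched as a small pure function. The function name matches the commit's should_apply_tool_jail(), but the ToolChoice enum shape and error type here are illustrative assumptions, not the crate's actual types.

```rust
/// Illustrative ToolChoice shape (the real request type lives in the
/// OpenAI protocol module and differs from this).
enum ToolChoice {
    None,
    Auto,
    Required,
    Named(String),
}

/// Sketch of the decision: jail only when tools are present and
/// tool_choice is not explicitly None; error when a tool_choice is set
/// but no tool_call_parser is configured.
fn should_apply_tool_jail(
    has_tools: bool,
    tool_choice: Option<&ToolChoice>,
    parser_configured: bool,
) -> Result<bool, String> {
    if !has_tools {
        return Ok(false); // no tools in the request: nothing to jail
    }
    match tool_choice {
        Some(ToolChoice::None) => Ok(false), // explicitly disabled: skip jail entirely
        _ if !parser_configured => {
            Err("tool_choice requires a configured tool_call_parser".to_string())
        }
        _ => Ok(true),
    }
}
```

Evaluating this before building the stream avoids the lifetime issues mentioned above, since no request fields need to be borrowed inside the stream transformation.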
- Removed old jail implementation tests that are no longer compatible
- Kept test_detect_tool_call_start_different_parsers as it tests parser behavior
- Old tests were testing annotation behavior that no longer exists in new JailedStream
- New tests in jail.rs provide equivalent coverage

Resolved merge conflicts in .gitignore and preprocessor.rs:
- Fixed gitignore format for Ruler generated files
- Removed duplicate jail implementation from main
- Kept optimized JailedStream with conditional application
- Fixed transform_postprocessor_stream signature usage
- All tests passing

- Remove unused imports and dead code warnings in test module
- Fix test_streaming_usage.rs function signature for transform_postprocessor_stream
- Remove test_preprocessor.rs that tested old jail implementation
- Add missing context parameter to transform calls
- All tests now pass and clippy/fmt are clean
Add 10 new test cases for JailedStream covering all parser types and edge cases:
- Hermes parser with <tool_call> markers
- Mistral parser (both variants with and without [TOOL_CALLS] marker)
- Phi4 parser with <|tool_call|> markers
- llama3_json parser with <|python_tag|> markers
- False positive detection for non-tool JSON
- Malformed JSON handling
- Partial tool call scenarios
- Empty stream behavior
- Multiple consecutive tool calls
- Fragmented tool calls across many small chunks

All tests verify silent jailing behavior (no empty chunks) and proper tool call accumulation between start/end markers.

Fixes issue where create_unjailed_response was discarding upstream correlation metadata (id, event, comment) from the Annotated wrapper, breaking observability.

Changes:
- Capture id/event/comment fields when jail is triggered
- Pass preserved metadata to create_unjailed_response
- Attach original metadata to synthesized responses
- Add comprehensive test coverage for metadata preservation

Tests added:
- test_jailed_stream_preserves_metadata: Normal jail processing
- test_jailed_stream_preserves_metadata_on_stream_end: Stream termination
- test_jailed_stream_metadata_edge_cases: Partial/empty metadata

Addresses GitHub issue discussion r2352198367
Fixes issue where trailing content in the same chunk as jail end markers was being lost. When a chunk contains both an end marker and additional content (e.g., "</tool_call>trailing text"), the trailing text was discarded instead of being emitted as a separate chunk.

Changes:
- Find exact position of end markers for precise content splitting
- Split accumulated content at marker boundary into jailed/trailing parts
- Emit jailed content as tool call response, trailing as pass-through
- Handle both explicit end sequences and early exit scenarios
- Add parser-specific end position detection for all supported formats
- Preserve metadata in trailing content chunks

Tests added:
- test_jailed_stream_trailing_content_same_chunk: End marker + trailing
- test_jailed_stream_early_exit_with_trailing: Early exit + trailing

Fixes GitHub issue discussion r2352198360
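The marker-boundary split described in this commit reduces to one string operation. This is a sketch only; the hypothetical helper name below is not from the PR, which performs the split per parser inside JailedStream.

```rust
/// Split accumulated content at the end-marker boundary.
/// The jailed part keeps the marker (it belongs to the tool call);
/// everything after it is trailing pass-through text.
fn split_at_end_marker<'a>(accumulated: &'a str, end_marker: &str) -> (&'a str, &'a str) {
    match accumulated.find(end_marker) {
        Some(pos) => accumulated.split_at(pos + end_marker.len()),
        None => (accumulated, ""), // no marker: the whole buffer is jailed content
    }
}
```

The jailed half is then parsed into a tool-call response while the trailing half is emitted as a separate content chunk, preserving its metadata.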
Replace HashMap-based jail state with deterministic Vec-based collection to enable independent choice jailing and ensure consistent processing order.

Key improvements:
- Add ChoiceJailState and ChoiceJailStateCollection for deterministic ordering
- Implement independent per-choice jail/unjail logic
- Add EmissionMode configuration (Packed vs SingleChoicePerChunk)
- Support configurable emission patterns for OpenAI compatibility
- Fix double emission issue during choice unjailing
- Maintain backward compatibility for single-choice scenarios

This addresses GitHub issue r2352198339 (determinism) and enables proper multi-choice (n>1) tool call handling with independent state management per choice index.

Tests: 22/24 pass (2 expected multi-choice test failures for future work)

This major refactoring enables true independent choice processing where each choice (when n>1) can jail and unjail independently with deterministic ordering.

Core changes:
- Add ChoiceEmission enum for clean emission abstraction
- Add ChoiceJailState with encapsulated jail logic per choice
- Replace HashMap with deterministic Vec-based ChoiceJailStateCollection
- Implement independent process_content() method per choice
- Add configurable EmissionMode (Packed vs SingleChoicePerChunk)
- Clean separation between state management and emission strategy

Key achievements:
- Multi-choice independent jailing tests now pass
- Deterministic choice ordering (always 0, 1, 2...)
- Metadata preservation through jail processing
- Maintains backward compatibility for single-choice scenarios

Tests: 22/24 pass (2 trailing content edge cases remaining)

Addresses GitHub issues:
- r2352198339: HashMap determinism and independent choice jailing
- Enables proper multi-choice (n>1) tool call handling

Modify emission grouping logic to separate trailing emissions from tool/content emissions. This ensures trailing content after jail end markers is always emitted as separate chunks, fixing the failing tests:
- test_jailed_stream_trailing_content_same_chunk
- test_jailed_stream_early_exit_with_trailing

Also remove unused methods to satisfy clippy warnings. All 24 jail tests now pass with the independent multi-choice jailing implementation.

Signed-off-by: Ryan Olson <rolson@nvidia.com>
Add efficient prefix matching utility to detect tool call markers split across chunk boundaries. Solves the issue where markers like "<TOOLCALL>" were missed when characters arrived in separate chunks.

Key improvements:
- Created utils/prefix_matcher.rs with MarkerMatcher using Aho-Corasick
- Detects partial suffixes (e.g., "n<TO" finds "<TO" as partial match)
- Prevents false positives (e.g., "n < 5" doesn't trigger jailing)
- Preserves chunk boundaries to avoid unnecessary yield consumption
- Handles complete tool calls within single combined chunks
- Maintains backward compatibility with all existing functionality

Tests added:
- test_partial_matching_suffix_detection: Verifies cross-chunk detection
- test_partial_matching_false_positive_prevention: Prevents math expressions

All 26 existing jail tests continue to pass.

Signed-off-by: Ryan Olson <rolson@nvidia.com>
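The boundary case this commit handles can be shown with a stdlib-only sketch. The PR's MarkerMatcher uses the aho-corasick crate for efficient multi-pattern matching; this hypothetical function only illustrates the partial-suffix check itself.

```rust
/// Returns true when some suffix of `tail` is a proper prefix of `marker`,
/// i.e. the marker may be split across a chunk boundary and we should
/// hold the tail back rather than emit it.
fn ends_with_marker_prefix(tail: &str, marker: &str) -> bool {
    // Try every proper prefix of the marker (respecting UTF-8 boundaries)
    // and check whether the buffered tail ends with it.
    (1..marker.len())
        .filter(|i| marker.is_char_boundary(*i))
        .any(|i| tail.ends_with(&marker[..i]))
}
```

Because a bare "<" mid-chunk (as in "n < 5") is not a suffix of the buffered tail, it does not trigger jailing, while a chunk ending in "n<TO" does.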
Walkthrough

Refactors OpenAI chat_completions jail streaming to centralize choice emission and tool-call handling, introduces parser end-position utilities, and converts multiple tool-calling parser APIs from async to sync. Adds dispatch for end-position detection across parsers and updates builders to derive jail markers from parser config.

Changes
Sequence Diagram(s)

sequenceDiagram
autonumber
actor Client
participant Stream as OpenAI Stream
participant Jail as JailedStream
participant Parser as ToolCall Parsers
participant Emit as Choice Emission
Client->>Stream: receive chunk
Stream->>Jail: process_content(chunk)
Jail->>Parser: find_tool_call_end_position(chunk, parser_str)
Parser-->>Jail: end_index
alt Tool-call markers detected
Jail->>Jail: accumulate tool_calls / trailing
Jail->>Emit: emit_tool_or_content(...)
else No markers
Jail->>Emit: emit_pass_through or content
end
Emit-->>Client: ChatChoiceStream chunk(s)
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Pre-merge checks

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
lib/parsers/src/tool_calling/tools.rs (1)
10-16: Remove `.await` from sync API call sites; lower parser log level to debug.

`try_tool_call_parse_aggregate` and `try_tool_call_parse_stream` are synchronous; awaiting them will fail to compile.

- Remove `.await` at: lib/llm/src/protocols/openai/chat_completions/jail.rs:641, 664-665, 723.
- Change tracing::info! → tracing::debug! in lib/parsers/src/tool_calling/tools.rs (around lines 19-21) to avoid noisy logs on hot paths.
lib/llm/src/protocols/openai/chat_completions/jail.rs (2)
639-645: Call the free function `find_tool_call_end_position` correctly

You're calling a non-existent method on `self` and passing `&String` where `Option<&str>` is expected.

Apply this diff:

-            if let Ok((_, _)) =
-                try_tool_call_parse_aggregate(accumulated_content, Some(parser)).await
+            if let Ok((_, _)) = try_tool_call_parse_aggregate(
+                accumulated_content,
+                Some(parser.as_str()),
+            )
+            .await
             {
-                let split_pos = self.find_tool_call_end_position(accumulated_content, parser);
+                let split_pos = find_tool_call_end_position(
+                    accumulated_content,
+                    Some(parser.as_str()),
+                );
                 (true, split_pos)
             } else {
663-699: Avoid unstable let-chains in `if let`; restructure for stable Rust

`if let PATTERN = EXPR && guard` is not stable. Use a nested `if`.

Apply this diff:

-        if let Ok((tool_calls, normal_text)) =
-            try_tool_call_parse_aggregate(accumulated_content, self.tool_call_parser.as_deref())
-                .await
-            && !tool_calls.is_empty()
-        {
-            // Convert to streaming format
-            let tool_call_chunks: Vec<ChatCompletionMessageToolCallChunk> = tool_calls
-                .into_iter()
-                .enumerate()
-                .map(|(idx, tool_call)| ChatCompletionMessageToolCallChunk {
-                    index: idx as u32,
-                    id: Some(tool_call.id),
-                    r#type: Some(tool_call.r#type),
-                    function: Some(FunctionCallStream {
-                        name: Some(tool_call.function.name),
-                        arguments: Some(tool_call.function.arguments),
-                    }),
-                })
-                .collect();
-
-            // Create choice with tool calls
-            #[allow(deprecated)]
-            return ChatChoiceStream {
-                index: choice_index,
-                delta: ChatCompletionStreamResponseDelta {
-                    role: Some(Role::Assistant),
-                    content: normal_text.filter(|t| !t.is_empty()),
-                    tool_calls: Some(tool_call_chunks),
-                    function_call: None,
-                    refusal: None,
-                    reasoning_content: None,
-                },
-                finish_reason: Some(FinishReason::ToolCalls),
-                logprobs: None,
-            };
-        }
+        if let Ok((tool_calls, normal_text)) =
+            try_tool_call_parse_aggregate(accumulated_content, self.tool_call_parser.as_deref())
+                .await
+        {
+            if !tool_calls.is_empty() {
+                // Convert to streaming format
+                let tool_call_chunks: Vec<ChatCompletionMessageToolCallChunk> = tool_calls
+                    .into_iter()
+                    .enumerate()
+                    .map(|(idx, tool_call)| ChatCompletionMessageToolCallChunk {
+                        index: idx as u32,
+                        id: Some(tool_call.id),
+                        r#type: Some(tool_call.r#type),
+                        function: Some(FunctionCallStream {
+                            name: Some(tool_call.function.name),
+                            arguments: Some(tool_call.function.arguments),
+                        }),
+                    })
+                    .collect();
+
+                // Create choice with tool calls
+                #[allow(deprecated)]
+                return ChatChoiceStream {
+                    index: choice_index,
+                    delta: ChatCompletionStreamResponseDelta {
+                        role: Some(Role::Assistant),
+                        content: normal_text.filter(|t| !t.is_empty()),
+                        tool_calls: Some(tool_call_chunks),
+                        function_call: None,
+                        refusal: None,
+                        reasoning_content: None,
+                    },
+                    finish_reason: Some(FinishReason::ToolCalls),
+                    logprobs: None,
+                };
+            }
+        }
🧹 Nitpick comments (16)
lib/parsers/src/tool_calling/json/mod.rs (1)
45-71: Guard against empty end tokens; consider scanning all configured end tokens.

If the first configured end token is an empty string, `chunk.find(end_token)` returns 0, causing a premature end position. Also, when multiple non-empty end tokens exist, picking the earliest match is safer.

Apply:

-        "hermes" | "nemotron_deci" => {
-            if let Some(end_token) = config.tool_call_end_tokens.first() {
-                if let Some(pos) = chunk.find(end_token) {
-                    pos + end_token.len()
-                } else {
-                    chunk.len()
-                }
-            } else {
-                chunk.len()
-            }
-        }
+        "hermes" | "nemotron_deci" => {
+            let end_pos = config
+                .tool_call_end_tokens
+                .iter()
+                .filter(|t| !t.is_empty())
+                .filter_map(|t| chunk.find(t).map(|pos| pos + t.len()))
+                .min()
+                .unwrap_or(chunk.len());
+            end_pos
+        }

Please confirm `ToolCallConfig::hermes()` and `ToolCallConfig::nemotron_deci()` set a non-empty end token first (e.g., "</tool_call>", "</TOOLCALL>"), or accept the above defensive change.

lib/parsers/src/tool_calling/pythonic/mod.rs (1)
9-11: Return end of closing bracket if present.

Mirrors Mistral/Phi4 handling and avoids over-reading beyond a completed call list.

 pub fn find_tool_call_end_position_pythonic(chunk: &str) -> usize {
-    chunk.len()
+    if let Some(pos) = chunk.rfind(']') {
+        pos + 1
+    } else {
+        chunk.len()
+    }
 }

lib/parsers/src/tool_calling/harmony/mod.rs (1)
11-13: LGTM; parity with dispatcher achieved.

Returning `chunk.len()` is acceptable for Harmony given complete-parse behavior. Consider future extension to accept config if Harmony ever exposes end tokens.

lib/parsers/src/tool_calling/parsers.rs (2)
127-155: End-position dispatcher: sensible defaults; minor hardening optional.
- Unknown or unimplemented parsers return `chunk.len()` — sane.
- JSON path forwards `parser_key` to JSON-specific logic — good.

Add targeted tests for `find_tool_call_end_position` across: hermes, nemotron_deci, mistral, phi4 (including edge with trailing normal text), to lock behavior.
189-206: Tests now call sync APIs — fine; consider dropping tokio where unnecessary.

Async test harness remains but calls are sync; optional cleanup to `#[test]` for purely sync cases.

Also applies to: 202-219, 230-234, 245-250, 283-293, 299-306, 315-319, 336-337, 367-368, 386-387, 401-413, 431-433, 484-485, 511-512, 525-526, 539-540, 558-559, 572-573, 590-591, 616-617, 634-635, 668-669, 682-683, 701-702, 729-730, 746-747, 759-760, 775-776, 788-789, 801-802, 817-818, 835-836, 853-854, 864-871, 899-904, 939-947, 970-982, 987-995, 1001-1009, 1022-1030, 1040-1048, 1053-1061, 1081-1087, 1092-1098, 1106-1117, 1125-1136, 1143-1151, 1158-1166, 1172-1179, 1185-1192, 1199-1206, 1213-1218, 1224-1228, 1236-1266, 1274-1286, 1293-1309, 1325-1338, 1353-1366, 1372-1383, 1389-1401, 1407-1444
lib/parsers/src/tool_calling/harmony/harmony_parser.rs (10)
11-19: OnceLock switch LGTM; confirm encoding choice/name.

Good move to std::sync::OnceLock. Please confirm HarmonyEncodingName::HarmonyGptOss is correct for this generic harmony tool-calling parser. If this parser is used beyond GPT-OSS, consider parameterizing the encoding or renaming the static to a model-agnostic identifier to avoid confusion.
137-144: Avoid needless clones and unwrap when building ToolCallResponse.

Use the owned fname and Value::to_string(); serde_json::to_string on a Value shouldn't fail, and this trims one clone and an unwrap.
Apply this diff:
-            res.push(ToolCallResponse {
-                id: format!("call-{}", call_idx),
-                tp: ToolCallType::Function,
-                function: CalledFunction {
-                    name: fname.to_string(),
-                    // Safety: `Value::Object` is always valid JSON, so serialization cannot fail
-                    arguments: serde_json::to_string(&args).unwrap(),
-                },
-            });
+            res.push(ToolCallResponse {
+                id: format!("call-{}", call_idx),
+                tp: ToolCallType::Function,
+                function: CalledFunction {
+                    name: fname,
+                    arguments: args.to_string(),
+                },
+            });
147-153: Potential panic: indexing message.content[0].

Consistency with prior parsers is noted, but this can panic if content is empty. A safer pattern keeps behavior identical when present.
Apply these diffs:
-        if message.author.role == Role::Assistant && message.channel.as_deref() == Some("analysis")
-        {
-            normal_text.push_str(match &message.content[0] {
-                Text(t) => &t.text,
-                _ => "",
-            });
-        }
+        if message.author.role == Role::Assistant && message.channel.as_deref() == Some("analysis") {
+            if let Some(Text(t)) = message.content.first() {
+                normal_text.push_str(&t.text);
+            }
+        }

-            } else if channel == Some("analysis") {
-                normal_text.push_str(match &message.content[0] {
-                    Text(t) => &t.text,
-                    _ => "",
-                });
-            }
+            } else if channel == Some("analysis") {
+                if let Some(Text(t)) = message.content.first() {
+                    normal_text.push_str(&t.text);
+                }
+            }

Also applies to: 251-255
155-155: Avoid unnecessary String clone on return.

Return the owned String directly.
Apply these diffs:
-        Ok((res, Some(normal_text.to_string())))
+        Ok((res, Some(normal_text)))

-        Ok((res, Some(normal_text.to_string())))
+        Ok((res, Some(normal_text)))

Also applies to: 257-257
225-248: Duplicate clone/unwrap in complete parser.

Same cleanup as the streaming variant.
Apply this diff:
-            res.push(ToolCallResponse {
-                id: format!("call-{}", call_idx),
-                tp: ToolCallType::Function,
-                function: CalledFunction {
-                    name: fname.to_string(),
-                    // Safety: `Value::Object` is always valid JSON, so serialization cannot fail
-                    arguments: serde_json::to_string(&args).unwrap(),
-                },
-            });
+            res.push(ToolCallResponse {
+                id: format!("call-{}", call_idx),
+                tp: ToolCallType::Function,
+                function: CalledFunction {
+                    name: fname,
+                    arguments: args.to_string(),
+                },
+            });
283-299: Tighten partial-token detection loop (remove redundant get/len slicing).

Current code allocates and then calls .get(..) only to slice the entire string again. Simplify without changing logic.
Apply these diffs:
-    config.tool_call_start_tokens.iter().any(|token| {
+    config.tool_call_start_tokens.iter().any(|token| {
         if token.is_empty() {
             return false;
         }
         // Check if the chunk could be a prefix of this start token
-        // Handle Unicode character boundaries properly
-        for i in 1..=token.chars().count() {
-            if let Some(prefix) = token.chars().take(i).collect::<String>().get(..) {
-                let prefix_str = &prefix[..prefix.len()];
-                if trimmed == prefix_str || trimmed.ends_with(prefix_str) {
-                    return true;
-                }
-            }
-        }
+        for i in 1..=token.chars().count() {
+            let prefix: String = token.chars().take(i).collect();
+            if trimmed == prefix || trimmed.ends_with(&prefix) {
+                return true;
+            }
+        }
         false
     })

-    let has_partial_token = config.tool_call_start_tokens.iter().any(|token| {
+    let has_partial_token = config.tool_call_start_tokens.iter().any(|token| {
         if token.is_empty() {
             return false;
         }
         // Check if the chunk could be a prefix of this start token
-        // Handle Unicode character boundaries properly
-        for i in 1..=token.chars().count() {
-            if let Some(prefix) = token.chars().take(i).collect::<String>().get(..) {
-                let prefix_str = &prefix[..prefix.len()];
-                if trimmed == prefix_str || trimmed.ends_with(prefix_str) {
-                    return true;
-                }
-            }
-        }
+        for i in 1..=token.chars().count() {
+            let prefix: String = token.chars().take(i).collect();
+            if trimmed == prefix || trimmed.ends_with(&prefix) {
+                return true;
+            }
+        }
         false
     });

Also applies to: 311-329
188-190: Nit: stray comment markers.

Remove the duplicate slashes.
Apply this diff:
-    // // Encode the text into tokens using harmony encoding
+    // Encode the text into tokens using harmony encoding
86-86: Nit: typo in comment.

"Iteratate" → "Iterate".
Apply this diff:
-    // Iteratate through messages and extract tool calls if there
+    // Iterate through messages and extract tool calls if present
364-375: Tests: make sync to match new API (drop tokio/async).

These tests don't await anything. Convert to plain #[test] for consistency.
Apply these diffs:
-    #[tokio::test]
-    async fn test_parse_tool_calls_harmony_complete_basic() {
+    #[test]
+    fn test_parse_tool_calls_harmony_complete_basic() {
         let text = r#"<|channel|>commentary to=functions.get_current_weather <|constrain|>json<|message|>{"format":"celsius","location":"San Francisco"}"#;
-        let (tool_calls, normal_content) =
-            parse_tool_calls_harmony_complete(text, &Default::default()).unwrap();
+        let (tool_calls, normal_content) =
+            parse_tool_calls_harmony_complete(text, &Default::default()).unwrap();

-    #[tokio::test]
-    async fn test_parse_tools_harmony_without_start_token() {
+    #[test]
+    fn test_parse_tools_harmony_without_start_token() {
         let text = r#"
<|channel|>analysis<|message|>Need to use function get_current_weather.<|end|>
<|message|>{"location":"San Francisco"}<|call|>
"#;

Also applies to: 376-390
475-477: Nit: typo in test comment.

"warkaround" → "workaround".
Apply this diff:
-    // This is a warkaround for now. Right now everything is treated as tool call start token.
+    // This is a workaround for now. Right now everything is treated as a tool call start token.
831-877: Dedup marker patterns before constructing the matcher

Prevents redundant work and ambiguous matches if the same marker is added from multiple sources.
Apply this diff:
         // Create the marker matcher (fallback to empty patterns if none configured)
-        let marker_matcher = if all_patterns.is_empty() {
+        // Dedup to avoid redundant comparisons
+        all_patterns.sort();
+        all_patterns.dedup();
+        let marker_matcher = if all_patterns.is_empty() {
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- lib/llm/src/protocols/openai/chat_completions/jail.rs (8 hunks)
- lib/parsers/src/tool_calling/harmony/harmony_parser.rs (11 hunks)
- lib/parsers/src/tool_calling/harmony/mod.rs (1 hunks)
- lib/parsers/src/tool_calling/json/mod.rs (1 hunks)
- lib/parsers/src/tool_calling/mod.rs (1 hunks)
- lib/parsers/src/tool_calling/parsers.rs (77 hunks)
- lib/parsers/src/tool_calling/pythonic/mod.rs (1 hunks)
- lib/parsers/src/tool_calling/tools.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-09-10T22:32:12.978Z
Learnt from: zhongdaor-nv
PR: ai-dynamo/dynamo#2999
File: lib/parsers/src/tool_calling/harmony/harmony_parser.rs:250-256
Timestamp: 2025-09-10T22:32:12.978Z
Learning: In lib/parsers/src/tool_calling/harmony/harmony_parser.rs, the team prefers to maintain identical code patterns between parse_tool_calls_harmony and parse_tool_calls_harmony_complete functions, including message.content[0] indexing, to ensure consistency between streaming and complete parser implementations.
Applied to files:
- lib/parsers/src/tool_calling/mod.rs
- lib/parsers/src/tool_calling/harmony/mod.rs
- lib/parsers/src/tool_calling/tools.rs
- lib/llm/src/protocols/openai/chat_completions/jail.rs
- lib/parsers/src/tool_calling/parsers.rs
- lib/parsers/src/tool_calling/harmony/harmony_parser.rs
📚 Learning: 2025-09-10T05:04:58.417Z
Learnt from: ayushag-nv
PR: ai-dynamo/dynamo#2932
File: lib/llm/src/protocols/openai/chat_completions/aggregator.rs:66-86
Timestamp: 2025-09-10T05:04:58.417Z
Learning: In the dynamo codebase, tool call chunks from streaming responses always contain complete tool calls (one chunk = one tool call), unlike standard OpenAI streaming where tool calls can be fragmented across multiple chunks. The convert_tool_chunk_to_message_tool_call function correctly assumes complete tool call data in each chunk.
Applied to files:
- lib/parsers/src/tool_calling/tools.rs
- lib/llm/src/protocols/openai/chat_completions/jail.rs
📚 Learning: 2025-09-10T15:27:42.511Z
Learnt from: ayushag-nv
PR: ai-dynamo/dynamo#2932
File: lib/llm/src/preprocessor.rs:768-844
Timestamp: 2025-09-10T15:27:42.511Z
Learning: In the tool calling jail implementation in lib/llm/src/preprocessor.rs, the design intentionally emits only the first accumulated choice that contains tool calls during unjailing, dropping other accumulated choices. This is a deliberate design decision, not a bug.
Applied to files:
lib/llm/src/protocols/openai/chat_completions/jail.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
PR: ai-dynamo/dynamo#2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: The create_choice method exists on multiple different objects in the codebase. The DeltaGenerator::create_choice in lib/llm/src/protocols/openai/chat_completions/delta.rs has its own signature that was updated to include reasoning_content, but other objects in lib/llm/src/engines.rs have their own separate create_choice methods with different signatures that are not related to chat completions.
Applied to files:
lib/llm/src/protocols/openai/chat_completions/jail.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
PR: ai-dynamo/dynamo#2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: There are two separate DeltaGenerator classes in the codebase: one for chat completions (lib/llm/src/protocols/openai/chat_completions/delta.rs with object "chat.completion.chunk") and one for text completions (lib/llm/src/protocols/openai/completions/delta.rs with object "text_completion"). They have different create_choice method signatures and serve different OpenAI API endpoints. The reasoning parsing functionality is only relevant to the chat completions DeltaGenerator.
Applied to files:
lib/llm/src/protocols/openai/chat_completions/jail.rs
🧬 Code graph analysis (5)
lib/parsers/src/tool_calling/mod.rs (1)
lib/parsers/src/tool_calling/parsers.rs (4)
- detect_and_parse_tool_call(72-96)
- detect_tool_call_start(98-125)
- find_tool_call_end_position(127-155)
- try_tool_call_parse(43-69)
lib/parsers/src/tool_calling/tools.rs (1)
lib/parsers/src/tool_calling/parsers.rs (1)
detect_and_parse_tool_call(72-96)
lib/llm/src/protocols/openai/chat_completions/jail.rs (2)
lib/parsers/src/tool_calling/parsers.rs (3)
- get_tool_parser_map(23-37)
- detect_tool_call_start(98-125)
- find_tool_call_end_position(127-155)

lib/parsers/src/tool_calling/tools.rs (1)
try_tool_call_parse_aggregate(10-42)
lib/parsers/src/tool_calling/parsers.rs (6)
lib/parsers/src/tool_calling/config.rs (5)
- harmony(145-154)
- pythonic(138-143)
- default(45-55)
- default(68-73)
- mistral(116-125)

lib/parsers/src/tool_calling/harmony/harmony_parser.rs (2)
- detect_tool_call_start_harmony(260-330)
- parse_tool_calls_harmony_complete(176-258)

lib/parsers/src/tool_calling/harmony/mod.rs (1)
- find_tool_call_end_position_harmony(11-13)

lib/parsers/src/tool_calling/json/mod.rs (4)
- detect_tool_call_start_json(38-43)
- find_tool_call_end_position_json(45-71)
- try_tool_call_parse_json(28-36)
- default(23-25)

lib/parsers/src/tool_calling/pythonic/pythonic_parser.rs (2)
- detect_tool_call_start_pythonic(190-198)
- try_tool_call_parse_pythonic(162-188)

lib/parsers/src/tool_calling/pythonic/mod.rs (1)
find_tool_call_end_position_pythonic(9-11)
lib/parsers/src/tool_calling/harmony/harmony_parser.rs (1)
lib/parsers/src/reasoning/gpt_oss_parser.rs (2)
- new(39-56)
- get_harmony_encoding(20-23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (.)
🔇 Additional comments (5)
lib/parsers/src/tool_calling/mod.rs (1)
16-19: Re-export of end-position dispatcher looks good.

Public surface now includes `find_tool_call_end_position`.

lib/parsers/src/tool_calling/parsers.rs (3)
5-15: Consolidated per-parser imports (incl. end-position helpers) — good.

Keeps dispatch localized and explicit.

43-69: Synchronous `try_tool_call_parse` conversion is consistent.

Harmony path uses `parse_tool_calls_harmony_complete`; JSON/Pythonic paths unchanged. Looks correct. Confirm external crates no longer expect an async API for this function.

72-96: Sync `detect_and_parse_tool_call` — OK.

Defaulting None/empty to the "default" parser is clear; the error message enumerates available parsers.

lib/llm/src/protocols/openai/chat_completions/jail.rs (1)

78-99: Helper factory looks good

Centralizing ChatChoiceStream construction reduces duplication and error-prone field copying.
let pass_through_choice = create_choice_stream(
    content,
    choice.delta.role,
    choice.index,
    choice.finish_reason,
    choice.logprobs.clone(),
);
Don't move out of a borrowed choice; clone role and finish_reason
This will otherwise try to move fields from an immutably borrowed choice.
Apply this diff:
- let pass_through_choice = create_choice_stream(
- content,
- choice.delta.role,
- choice.index,
- choice.finish_reason,
- choice.logprobs.clone(),
- );
+ let pass_through_choice = create_choice_stream(
+ content,
+ choice.delta.role.clone(),
+ choice.index,
+ choice.finish_reason.clone(),
+ choice.logprobs.clone(),
+ );
🤖 Prompt for AI Agents
In lib/llm/src/protocols/openai/chat_completions/jail.rs around lines 134 to
140, the code is attempting to move fields out of an immutably borrowed
`choice`; change the arguments passed into create_choice_stream so you pass
clones for non-Copy fields (specifically clone `choice.delta.role` and
`choice.finish_reason`) instead of moving them, e.g. call
create_choice_stream(content, choice.delta.role.clone(), choice.index,
choice.finish_reason.clone(), choice.logprobs.clone()) so nothing is moved out
of the borrowed `choice`.
let trailing_choice = create_choice_stream(
    trailing,
    choice.delta.role,
    choice.index,
    choice.finish_reason,
    choice.logprobs.clone(),
);
Clone role and finish_reason here as well
Same issue as above in trailing emission.
Apply this diff:
- let trailing_choice = create_choice_stream(
- trailing,
- choice.delta.role,
- choice.index,
- choice.finish_reason,
- choice.logprobs.clone(),
- );
+ let trailing_choice = create_choice_stream(
+ trailing,
+ choice.delta.role.clone(),
+ choice.index,
+ choice.finish_reason.clone(),
+ choice.logprobs.clone(),
+ );
🤖 Prompt for AI Agents
In lib/llm/src/protocols/openai/chat_completions/jail.rs around lines 154 to
160, the trailing_choice creation passes choice.delta.role and
choice.finish_reason by reference (or moves) and does not clone them like
earlier choices do; update the call to clone the role and finish_reason (e.g.,
use choice.delta.role.clone() and choice.finish_reason.clone()) so the trailing
emitted choice owns its data and matches the handling used for other choice
emissions.
- let (should_end, split_pos) =
-     jail_stream.should_end_jail(&self.accumulated_content).await;
+ let (should_end, split_pos) = jail_stream.should_end_jail(&self.accumulated_content);
Await the async end‑detection (jailed path)
Missing .await here too.
Apply this diff:
- let (should_end, split_pos) = jail_stream.should_end_jail(&self.accumulated_content);
+ let (should_end, split_pos) = jail_stream
+ .should_end_jail(&self.accumulated_content)
+ .await;
🤖 Prompt for AI Agents
In lib/llm/src/protocols/openai/chat_completions/jail.rs around line 282, the
call to jail_stream.should_end_jail(&self.accumulated_content) is missing an
await; change it to await the async function (i.e., call .await on the returned
future) and update surrounding code if necessary to be inside an async context
so the result (should_end, split_pos) is obtained correctly.
let unjailed_choice =
    jail_stream.create_tool_call_choice(choice.index, jailed_part, choice);
Self::emit_tool_or_content(&mut emissions, unjailed_choice);
|
|
Await the async tool‑choice creation (jailed path)
Add .await.
Apply this diff:
- let unjailed_choice =
- jail_stream.create_tool_call_choice(choice.index, jailed_part, choice);
+ let unjailed_choice = jail_stream
+ .create_tool_call_choice(choice.index, jailed_part, choice)
+ .await;
🤖 Prompt for AI Agents
In lib/llm/src/protocols/openai/chat_completions/jail.rs around lines 293 to
296, the jailed path calls jail_stream.create_tool_call_choice(...) which is
async but is not awaited; change the call to await the future (add .await) so
you assign the resolved value to unjailed_choice before passing it to
Self::emit_tool_or_content; ensure the enclosing function is async or otherwise
able to await (and adjust error/Result handling if needed).
let final_choice = jail_stream.create_tool_call_choice(
    self.index,
    &self.accumulated_content,
    &dummy_choice,
);
Await during finalize
create_tool_call_choice is async; missing .await.
Apply this diff:
- let final_choice = jail_stream.create_tool_call_choice(
- self.index,
- &self.accumulated_content,
- &dummy_choice,
- );
+ let final_choice = jail_stream
+ .create_tool_call_choice(self.index, &self.accumulated_content, &dummy_choice)
+ .await;
🤖 Prompt for AI Agents
In lib/llm/src/protocols/openai/chat_completions/jail.rs around lines 327 to
331, the call to create_tool_call_choice is asynchronous but is invoked without
awaiting; update the call to await it (add .await) and ensure the enclosing
function is async (or otherwise able to await) and that any returned
Result/Value is handled or propagated accordingly so compilation and semantics
remain correct.
Overview:
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Bug Fixes
Refactor