
Conversation

nachiketb-nvidia (Contributor) commented Aug 25, 2025

…one deployment

Overview:

Enable the --dyn-reasoning-parser flag to set the reasoning parser for any dynamo.vllm deployment.

Changes the signature of one function and updates ModelDeploymentCard so that the setting is carried over etcd.

See the propagation of ModelRuntimeConfig.

Summary by CodeRabbit

  • New Features

    • Added runtime configuration for chat completion responses, enabling selection of reasoning parsers at runtime (basic, deepseek_r1, gpt_oss).
    • Reasoning parser automatically falls back to a default if an unknown option is provided.
    • Consistent runtime settings are now carried through model cards and preprocessing, ensuring predictable behavior across deployments.
  • Documentation

    • Improved CLI help for --dyn-reasoning-parser to clearly list available options (basic, deepseek_r1, gpt_oss).

copy-pr-bot bot commented Aug 25, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot (Contributor) commented Aug 25, 2025

Walkthrough

Introduces runtime-config-driven behavior for chat completion delta generation. Adds ModelRuntimeConfig to ModelDeploymentCard and OpenAIPreprocessor, updates response_generator to accept runtime_config, and selects the reasoning parser by name at runtime. Adjusts engine/test call sites. Also expands CLI help text for --dyn-reasoning-parser.

Changes

  • CLI help text: components/backends/vllm/src/dynamo/vllm/args.py
    Expanded help string for --dyn-reasoning-parser to list the options 'basic', 'deepseek_r1', 'gpt_oss'. No logic changes.
  • Runtime config propagation: lib/llm/src/model_card.rs, lib/llm/src/local_model.rs, lib/llm/src/preprocessor.rs
    Added a ModelRuntimeConfig field to ModelDeploymentCard (serde default) and initialized it in constructors; propagated and cloned into LocalModel and OpenAIPreprocessor; stored in the preprocessor for downstream use.
  • Delta generator API + callers: lib/llm/src/protocols/openai/chat_completions/delta.rs, lib/llm/src/engines.rs, lib/llm/tests/http-service.rs
    NvCreateChatCompletionRequest::response_generator now takes ModelRuntimeConfig; DeltaGeneratorOptions gains runtime_config; callers updated to pass the default or runtime config; engines/tests adjusted accordingly.
  • Reasoning parser selection: lib/parsers/src/reasoning/mod.rs
    Added ReasoningParserType::get_reasoning_parser_from_name(&str), mapping names to parsers with logging and a fallback to Basic.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Client
    participant Preprocessor as OpenAIPreprocessor
    participant Card as ModelDeploymentCard
    participant Request as NvCreateChatCompletionRequest
    participant DeltaGen as DeltaGenerator
    participant Parsers as ReasoningParserType

    Client->>Preprocessor: init with ModelDeploymentCard
    Preprocessor->>Card: read runtime_config
    Card-->>Preprocessor: runtime_config (includes reasoning_parser)
    Client->>Preprocessor: create chat completion request
    Preprocessor->>Request: build request
    Preprocessor->>Request: response_generator(runtime_config)
    Request->>DeltaGen: new(options{ runtime_config, ... })
    DeltaGen->>Parsers: get_reasoning_parser_from_name(runtime_config.reasoning_parser)
    Parsers-->>DeltaGen: ReasoningParserWrapper (fallback=Basic if unknown)
    DeltaGen-->>Client: stream deltas
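
In code, the flow in the diagram looks roughly like the following. This is a minimal sketch: the type and method names (ModelRuntimeConfig, response_generator, get_reasoning_parser_from_name) come from this PR, but the import paths and the surrounding function are assumptions.

// Sketch only: the crate paths below are assumptions, not the real module layout.
use dynamo_llm::local_model::runtime_config::ModelRuntimeConfig;
use dynamo_llm::protocols::openai::chat_completions::NvCreateChatCompletionRequest;

fn build_stream(card_runtime_config: &ModelRuntimeConfig, request: NvCreateChatCompletionRequest) {
    // The preprocessor clones the card's runtime config into the delta generator.
    let _generator = request.response_generator(card_runtime_config.clone());

    // Inside DeltaGenerator::new, the parser is resolved by name, roughly:
    //   ReasoningParserType::get_reasoning_parser_from_name(
    //       options.runtime_config.reasoning_parser.as_deref().unwrap_or("basic"))
    // Unknown names log a warning and fall back to the Basic parser.
}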

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

In burrows of bytes I twitch my nose,
A runtime breeze through parser rows—
Basic? Deepseek? GPT-oss?
I pick by name, no carrot loss.
With configs packed, I hop and stream,
Deltas dance—a reasoning dream. 🥕🐇


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/llm/src/protocols/openai/chat_completions/delta.rs (1)

237-244: Role emission gating may be off (msg_counter is never incremented).

create_choice sets role = Some(Assistant) when self.msg_counter == 0, but self.msg_counter is never incremented in this file. That means every chunk could include role, which diverges from OpenAI’s streaming shape (role only on the first delta). Suggest incrementing a counter on first emission.

One option (local to this method):

-            role: if self.msg_counter == 0 {
+            role: if self.msg_counter == 0 {
                 Some(dynamo_async_openai::types::Role::Assistant)
             } else {
                 None
             },

and then after building the response:

-        dynamo_async_openai::types::CreateChatCompletionStreamResponse {
+        let resp = dynamo_async_openai::types::CreateChatCompletionStreamResponse {
             /* ...existing fields... */
-        }
+        };
+        // Increment message counter after emitting the first chunk.
+        self.msg_counter = self.msg_counter.saturating_add(1);
+        resp

If the counter is intentionally incremented elsewhere, please point me to it; otherwise, this change will align with the expected stream schema.

🧹 Nitpick comments (10)
components/backends/vllm/src/dynamo/vllm/args.py (1)

117-121: Add argparse choices to guard against typos

Verified that in lib/parsers/src/reasoning/mod.rs the only supported parser names in get_reasoning_parser_from_name are "deepseek_r1", "basic", and "gpt_oss" (lines 121–124). To fail fast on invalid input and keep the CLI aligned with Rust, add the choices parameter:

 parser.add_argument(
     "--dyn-reasoning-parser",
     type=str,
     default=None,
+    choices=["basic", "deepseek_r1", "gpt_oss"],
     help="Reasoning parser name for the model. Available options: 'basic', 'deepseek_r1', 'gpt_oss'.",
 )
lib/parsers/src/reasoning/mod.rs (1)

119-133: Normalize parser name (trim, case, hyphen/underscore) before matching

Accept minor variants like "DeepSeek-R1" or " basic " to reduce footguns. Keep the same fallback behavior.

Apply this diff:

-    pub fn get_reasoning_parser_from_name(name: &str) -> ReasoningParserWrapper {
-        tracing::debug!("Selected reasoning parser: {}", name);
-        match name.to_lowercase().as_str() {
+    /// Return a ReasoningParser by string name.
+    ///
+    /// Accepted values (case/whitespace-insensitive; '-' and '_' equivalent):
+    /// - "basic"
+    /// - "deepseek_r1" (or "deepseek-r1")
+    /// - "gpt_oss"     (or "gpt-oss")
+    pub fn get_reasoning_parser_from_name(name: &str) -> ReasoningParserWrapper {
+        tracing::debug!("Selected reasoning parser: {}", name);
+        let key = name.trim().to_ascii_lowercase().replace('-', "_");
+        match key.as_str() {
             "deepseek_r1" => Self::DeepseekR1.get_reasoning_parser(),
             "basic" => Self::Basic.get_reasoning_parser(),
             "gpt_oss" => Self::GptOss.get_reasoning_parser(),
             _ => {
                 tracing::warn!(
                     "Unknown reasoning parser type '{}', falling back to Basic Reasoning Parser",
-                    name
+                    name
                 );
                 Self::Basic.get_reasoning_parser()
             }
         }
     }

I can add unit tests covering the normalization and fallback path if you want.
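
For instance, a smoke test along these lines would pin down the accepted names and the fallback path (a sketch; asserting exactly which parser was selected would need ReasoningParserWrapper to expose that, which may not be the case today):

#[cfg(test)]
mod parser_name_tests {
    use super::*;

    #[test]
    fn known_names_and_fallback_resolve() {
        // Each supported name should resolve to its dedicated parser.
        for name in ["basic", "deepseek_r1", "gpt_oss"] {
            let _ = ReasoningParserType::get_reasoning_parser_from_name(name);
        }
        // Unknown names should warn and fall back to the Basic parser
        // rather than panic or return an error.
        let _ = ReasoningParserType::get_reasoning_parser_from_name("not_a_parser");
    }
}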

lib/llm/src/model_card.rs (2)

142-144: New MDC field runtime_config with serde(default) — good for backwards compat

This ensures older, serialized MDCs deserialize cleanly. Consider skipping serialization when it’s default to reduce etcd/NATS payload size if you expect most deployments to use defaults.

If ModelRuntimeConfig can expose an is_default(&self) -> bool, you could annotate as:

  • #[serde(default, skip_serializing_if = "ModelRuntimeConfig::is_default")]

I can wire up an is_default helper in runtime_config.rs and add a serde round-trip test.
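
A minimal sketch of that helper, assuming ModelRuntimeConfig can derive PartialEq (if it can't, comparing individual fields works too):

impl ModelRuntimeConfig {
    /// True when every field still holds its default value.
    /// Assumes `ModelRuntimeConfig: Default + PartialEq`.
    pub fn is_default(&self) -> bool {
        *self == Self::default()
    }
}

// The card field could then skip serialization for all-default configs:
// #[serde(default, skip_serializing_if = "ModelRuntimeConfig::is_default")]
// pub runtime_config: ModelRuntimeConfig,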


448-449: Explicit default init is fine; consider a helper to centralize defaults

Both constructors set runtime_config: ModelRuntimeConfig::default(). That’s clear and explicit. If you find more call sites doing the same, consider a small constructor/helper (e.g., ModelDeploymentCard::with_runtime_defaults(...)) to reduce duplication.

Also applies to: 490-491
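
If that duplication grows, the helper could be as small as the following sketch (the name with_runtime_defaults is the suggestion above, not something in this PR; a full constructor taking the usual arguments would work equally well):

impl ModelDeploymentCard {
    /// Hypothetical convenience method: reset the runtime config to its defaults
    /// so constructors don't each repeat `ModelRuntimeConfig::default()`.
    fn with_runtime_defaults(mut self) -> Self {
        self.runtime_config = ModelRuntimeConfig::default();
        self
    }
}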

lib/llm/src/local_model.rs (2)

205-218: Avoid double clone of runtime_config in echo_full path

Minor nit: you clone into both the MDC and LocalModel. Since the builder is consumed, you can move once and clone only once.

Apply this diff:

-            card.migration_limit = self.migration_limit;
-            card.user_data = self.user_data.take();
-            card.runtime_config = self.runtime_config.clone();
+            card.migration_limit = self.migration_limit;
+            card.user_data = self.user_data.take();
+            let runtime_config = std::mem::take(&mut self.runtime_config);
+            card.runtime_config = runtime_config.clone();
@@
-            return Ok(LocalModel {
+            return Ok(LocalModel {
                 card,
                 full_path: PathBuf::new(),
                 endpoint_id,
                 template,
                 http_host: self.http_host.take(),
                 http_port: self.http_port,
                 tls_cert_path: self.tls_cert_path.take(),
                 tls_key_path: self.tls_key_path.take(),
                 router_config: self.router_config.take().unwrap_or_default(),
-                runtime_config: self.runtime_config.clone(),
+                runtime_config,
             });

278-293: Same nit in main path: move once, reuse

Mirror the optimization for the “main” (non-echo) path.

Apply this diff:

-        card.migration_limit = self.migration_limit;
-        card.user_data = self.user_data.take();
-        card.runtime_config = self.runtime_config.clone();
+        card.migration_limit = self.migration_limit;
+        card.user_data = self.user_data.take();
+        let runtime_config = std::mem::take(&mut self.runtime_config);
+        card.runtime_config = runtime_config.clone();
@@
-        Ok(LocalModel {
+        Ok(LocalModel {
             card,
             full_path,
             endpoint_id,
             template,
             http_host: self.http_host.take(),
             http_port: self.http_port,
             tls_cert_path: self.tls_cert_path.take(),
             tls_key_path: self.tls_key_path.take(),
             router_config: self.router_config.take().unwrap_or_default(),
-            runtime_config: self.runtime_config.clone(),
+            runtime_config,
         })

If you want to ensure we don’t drift between the two copies, we can also remove runtime_config from LocalModel and read from self.card.runtime_config at use sites. I can open a follow-up refactor PR if that direction works for you.

lib/llm/src/engines.rs (1)

187-188: Echo chat path uses default runtime_config; consider threading actual config when available

For a pure echo engine this is okay. If you ever use EchoEngineFull to validate reasoning parser selection end-to-end, prefer passing the model’s runtime_config instead of a default so misconfigurations don’t get masked.

Potential follow-ups:

  • Thread ModelRuntimeConfig via PreprocessedRequest/ExecutionContext, and use it here if present; otherwise fall back to default.
  • Add a simple unit test that asserts the passed reasoning_parser name influences the delta selection path.

Want me to draft a small patch to plumb runtime_config through the context for the echo path?

lib/llm/tests/http-service.rs (1)

102-104: Updated generator call matches the new API; consider exercising a non-default parser in at least one test.

The switch to request.response_generator(runtime_config::ModelRuntimeConfig::default()) is correct. To guard against regressions in runtime parser selection, add a test that passes a non-default reasoning_parser (e.g., "gpt_oss" or "deepseek_r1") and asserts the stream is produced successfully (even a smoke test is fine).

You could do this inline as a one-off in CounterEngine’s generate for a dedicated test:

-        let mut generator =
-            request.response_generator(runtime_config::ModelRuntimeConfig::default());
+        let mut generator = request.response_generator(runtime_config::ModelRuntimeConfig {
+            reasoning_parser: Some("gpt_oss".to_string()),
+            ..Default::default()
+        });
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)

103-110: Use runtime-configured parser; update stale comments and centralize the default.

The logic is correct. The surrounding comments still say “hardcoded” and point to a TODO that is no longer accurate. Recommend refreshing the comment and extracting the "basic" fallback into a module-level constant for clarity.

Apply within this block:

-        // Reasoning parser type
-        // This is hardcoded for now, but can be made configurable later.
-        // TODO: Make parser type configurable once front-end integration is determined
-        // Change to GptOss to test GptOSS parser
-        // Reasoning parser wrapper
-        let reasoning_parser = ReasoningParserType::get_reasoning_parser_from_name(
+        // Select reasoning parser from runtime config; fall back to the Basic parser on missing/unknown values.
+        let reasoning_parser = ReasoningParserType::get_reasoning_parser_from_name(
             options
                 .runtime_config
                 .reasoning_parser
                 .as_deref()
                 .unwrap_or("basic"),
         );

Additionally, consider adding at the top of this module (outside the changed block):

const DEFAULT_REASONING_PARSER: &str = "basic";

and use unwrap_or(DEFAULT_REASONING_PARSER).


335-352: Reasoning parse invoked only when text/token_ids present — good check; consider explicit disable knob.

The presence check prevents unnecessary parser work. A future nicety: support disabling reasoning parsing via a sentinel value (e.g., reasoning_parser="none") to fully bypass parsing when desired.

Happy to wire the "none" option through get_reasoning_parser_from_name and runtime_config if you want it in this PR or a follow-up.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 8e4d81f and 5d2ee7b.

⛔ Files ignored due to path filters (1)
  • lib/bindings/python/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • components/backends/vllm/src/dynamo/vllm/args.py (1 hunks)
  • lib/llm/src/engines.rs (2 hunks)
  • lib/llm/src/local_model.rs (2 hunks)
  • lib/llm/src/model_card.rs (4 hunks)
  • lib/llm/src/preprocessor.rs (4 hunks)
  • lib/llm/src/protocols/openai/chat_completions/delta.rs (4 hunks)
  • lib/llm/tests/http-service.rs (2 hunks)
  • lib/parsers/src/reasoning/mod.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-22T19:55:41.592Z
Learnt from: nachiketb-nvidia
PR: ai-dynamo/dynamo#2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.592Z
Learning: There are two separate DeltaGenerator classes in the codebase: one for chat completions (lib/llm/src/protocols/openai/chat_completions/delta.rs with object "chat.completion.chunk") and one for text completions (lib/llm/src/protocols/openai/completions/delta.rs with object "text_completion"). They have different create_choice method signatures and serve different OpenAI API endpoints. The reasoning parsing functionality is only relevant to the chat completions DeltaGenerator.

Applied to files:

  • lib/llm/src/engines.rs
  • lib/llm/src/protocols/openai/chat_completions/delta.rs
📚 Learning: 2025-08-22T19:55:41.592Z
Learnt from: nachiketb-nvidia
PR: ai-dynamo/dynamo#2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.592Z
Learning: The create_choice method exists on multiple different objects in the codebase. The DeltaGenerator::create_choice in lib/llm/src/protocols/openai/chat_completions/delta.rs has its own signature that was updated to include reasoning_content, but other objects in lib/llm/src/engines.rs have their own separate create_choice methods with different signatures that are not related to chat completions.

Applied to files:

  • lib/llm/src/protocols/openai/chat_completions/delta.rs
🧬 Code graph analysis (3)
lib/llm/src/protocols/openai/chat_completions/delta.rs (4)
lib/llm/src/protocols/openai/completions/delta.rs (1)
  • response_generator (22-29)
lib/bindings/python/src/dynamo/_core.pyi (1)
  • ModelRuntimeConfig (463-467)
lib/bindings/python/rust/llm/local_model.rs (1)
  • reasoning_parser (76-78)
lib/parsers/src/reasoning/mod.rs (1)
  • get_reasoning_parser_from_name (119-133)
lib/llm/src/preprocessor.rs (1)
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
  • new (80-123)
  • response_generator (19-31)
lib/llm/tests/http-service.rs (1)
lib/bindings/python/src/dynamo/_core.pyi (1)
  • ModelRuntimeConfig (463-467)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (.)
🔇 Additional comments (10)
lib/llm/src/model_card.rs (1)

22-22: Import looks correct

Brings ModelRuntimeConfig into scope for the new MDC field. No issues.

lib/llm/src/engines.rs (1)

17-18: Import of runtime_config module: LGTM

Keeps the call site explicit and avoids naming collisions.

lib/llm/tests/http-service.rs (1)

28-42: Importing runtime_config for tests is appropriate and minimal.

Bringing in local_model::runtime_config here is the right move to construct a default ModelRuntimeConfig in tests without reaching into production wiring.

lib/llm/src/preprocessor.rs (5)

25-25: Plumbing ModelRuntimeConfig into the preprocessor: good addition.

Importing the type here keeps ownership localized and avoids leaking runtime concerns into callers.


98-99: Stateful preprocessor now carries runtime_config — correct scope.

Storing ModelRuntimeConfig on OpenAIPreprocessor is appropriate since it’s part of deployment-time behavior rather than per-request user input.


133-134: Constructor wiring looks good.

Passing runtime_config into self ensures downstream operators can use it without additional lookups.


502-503: Chat path correctly forwards runtime_config to the generator.

This aligns with the intent of runtime-config-driven parser selection. Leaving the Completions path (non-chat) unchanged is also correct, since reasoning parsing is only relevant to chat completions (per prior learnings).

I’m referencing the retrieved learnings that reasoning parsing is only wired for chat completions and not for text completions; your changes match that design.


126-127: Review comment resolved: ModelDeploymentCard.runtime_config correctly defaults

Verified that in lib/llm/src/model_card.rs the ModelDeploymentCard struct is annotated with #[serde(default)] on the runtime_config field (around line 142) and its Default implementations initialize runtime_config via ModelRuntimeConfig::default() (around lines 448 and 490). Cloning mdc.runtime_config is therefore safe and older cards will deserialize cleanly.

lib/llm/src/protocols/openai/chat_completions/delta.rs (2)

8-11: New import of runtime_config is appropriate.

Keeps the type fully qualified without polluting the broader namespace.


41-43: ModelRuntimeConfig implements Default – no action required

The ModelRuntimeConfig type already derives Default (see lib/bindings/python/rust/llm/local_model.rs where it’s annotated with #[derive(Clone, Default)]), so the DeltaGeneratorOptions::default() implementation remains sound with the new non-Option runtime_config field. Everything compiles and behaves as expected.
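
For reference, because both types implement Default, selecting a non-default parser stays a one-liner with struct-update syntax (a sketch; any other DeltaGeneratorOptions fields are left at their defaults):

let opts = DeltaGeneratorOptions {
    runtime_config: ModelRuntimeConfig {
        reasoning_parser: Some("gpt_oss".to_string()),
        ..Default::default()
    },
    ..Default::default()
};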

paulhendricks (Member) left a comment

looks good!

@nachiketb-nvidia nachiketb-nvidia merged commit f5a4100 into main Aug 25, 2025
13 checks passed
@nachiketb-nvidia nachiketb-nvidia deleted the nachiketb/enable-runtime-flags-for-reasoning-parser branch August 25, 2025 22:57
hhzhang16 pushed a commit that referenced this pull request Aug 27, 2025
…vllm deployments (#2700)

Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
nv-anants pushed a commit that referenced this pull request Aug 28, 2025
jasonqinzhou pushed a commit that referenced this pull request Aug 30, 2025
…vllm deployments (#2700)

Signed-off-by: Jason Zhou <jasonzho@jasonzho-mlt.client.nvidia.com>
KrishnanPrash pushed a commit that referenced this pull request Sep 2, 2025
…vllm deployments (#2700)

Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
nnshah1 pushed a commit that referenced this pull request Sep 8, 2025
…vllm deployments (#2700)

Signed-off-by: nnshah1 <neelays@nvidia.com>
