
Conversation

@zhongxuanwang-nv (Member) commented Dec 1, 2025

Overview:

This PR introduces end-to-end timing metrics through the entire Dynamo → vLLM → frontend stack.

Sample response:

  "nvext": {
    "timing_metrics": {
      "request_received_seconds": 1764561049.9306614,
      "prefill_start_seconds": 1764561049.9381237,
      "prefill_end_seconds": 1764561049.9602249,
      "decode_start_seconds": 1764561049.9671068,
      "decode_first_token_seconds": 1764561050.898138,
      "decode_end_seconds": 1764561050.8981397,
      "request_completed_seconds": 1764561050.8997853
    }
  }
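
To exercise this end to end, here is a minimal client sketch. The endpoint URL and model name are placeholders, and the request-side opt-in shape is an assumption based on the extra_fields gate described below ("timing_metrics" is the newly allowed extra_fields value):

  # Hypothetical client sketch; URL, model, and opt-in shape are assumptions.
  import requests

  resp = requests.post(
      "http://localhost:8000/v1/chat/completions",
      json={
          "model": "my-model",
          "messages": [{"role": "user", "content": "Hello"}],
          # Opt in to timing metrics via nvext extra_fields.
          "nvext": {"extra_fields": ["timing_metrics"]},
      },
  )
  timing = resp.json().get("nvext", {}).get("timing_metrics", {})
  print(timing.get("request_received_seconds"), timing.get("request_completed_seconds"))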

Details:

Added:

  • End-to-end timing collection in vLLM worker handler generate / prefill_generate, including:
    • _should_include_timing_metrics gate on extra_fields and _get_current_time_seconds helper in components/src/dynamo/vllm/handlers.py (a minimal sketch follows this list).
    • timing_metrics map with prefill_start_seconds, prefill_end_seconds, decode_start_seconds, decode_first_token_seconds, decode_end_seconds.
  • Frontend timing capture in OpenAI HTTP service:
    • request_received_seconds recorded at HTTP ingress and stored into request.nvext.request_received_seconds in lib/llm/src/http/service/openai.rs for both completions and chat completions.
  • Request completion timestamp injection:
    • inject_request_completed_seconds helper that adds request_completed_seconds under timing_metrics in lib/llm/src/http/service/openai.rs, plus unit tests.
  • Preprocessor plumbing for timing:
    • New request_received_seconds: Option<f64> on PreprocessedRequest in lib/llm/src/protocols/common/preprocessor.rs.
    • Builder now sets request_received_seconds from nvext in lib/llm/src/preprocessor.rs.
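
As a rough illustration, the gate and helper could look like this (the names match the PR; the bodies are assumptions, not the actual implementation):

  import time

  def _should_include_timing_metrics(extra_fields) -> bool:
      # Opt-in gate: only emit timing data when the request's nvext
      # extra_fields explicitly asks for "timing_metrics".
      return bool(extra_fields) and "timing_metrics" in extra_fields

  def _get_current_time_seconds() -> float:
      # Wall-clock UNIX timestamp in seconds, matching the float epoch
      # values in the sample response above.
      return time.time()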

Changed:

  • vLLM worker generate paths in components/src/dynamo/vllm/handlers.py:
    • On aggregated requests (no prefill worker), treat decode start as both prefill_start_seconds and decode_start_seconds, and use the first-token time as prefill_end_seconds (see the sketch after this list).
    • When a result chunk finishes (has finish_reason), attach the full timing_metrics map into tok["disaggregated_params"]["timing_metrics"] so downstream layers can read it.
    • Refactored completion_usage construction into a single clean block so the timing augmentation can extend it without restructuring.
  • OpenAI streaming responses in lib/llm/src/http/service/openai.rs:
    • Wrap streaming completions_single and chat_completions with a map that injects request_completed_seconds only on the chunk whose choices have a finish_reason (not the usage-only chunk), matching when backend timing metrics are finalized.
  • Preprocessing pipeline:
    • OpenAIPreprocessor now forwards request_received_seconds from request.nvext() into the PreprocessedRequest builder so workers can see frontend ingress time.
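
A simplified sketch of the aggregated-mode flow (the wrapper function and token-dict shape are assumptions for illustration; only the timing_metrics keys and the disaggregated_params attachment mirror the PR):

  import time

  async def generate_with_timing(engine_stream):
      # Hypothetical wrapper; time.time() stands in for _get_current_time_seconds.
      decode_start = time.time()
      timing_metrics = {
          # No separate prefill worker: decode start doubles as prefill start.
          "prefill_start_seconds": decode_start,
          "decode_start_seconds": decode_start,
      }
      first_token = True
      async for tok in engine_stream:
          now = time.time()
          if first_token:
              first_token = False
              timing_metrics["decode_first_token_seconds"] = now
              # Aggregated mode: prefill counts as done at the first token.
              timing_metrics["prefill_end_seconds"] = now
          if tok.get("finish_reason") is not None:
              timing_metrics["decode_end_seconds"] = now
              # Attach the full map so downstream layers can read it.
              tok.setdefault("disaggregated_params", {})["timing_metrics"] = timing_metrics
          yield tok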

Removed:

  • None

Why it matters:

  • Attaching timing_metrics into disaggregated_params on the token stream makes these numbers available to observability / analytics stacks and any downstream Dynamo components without changing the public OpenAI surface.
  • Injecting completion timestamps only on the final content chunk avoids double-counting and keeps the usage-only chunk semantics clean.

Where should the reviewer start?

  • Start with the vLLM worker wiring:
    • components/src/dynamo/vllm/handlers.py
      • Focus on _should_include_timing_metrics, the new timing_metrics lifecycle in
        BaseWorkerHandler.generate, and how prefill vs aggregated decode modes are handled.
  • Then look at frontend ingress/egress timing:
    • lib/llm/src/http/service/openai.rs
      • handler_completions / handler_chat_completions for request_received_seconds.
      • inject_request_completed_seconds and its use in completions_single / chat_completions
        where the stream is mapped to stamp the final content chunk.
  • Finally, verify the plumbing and types:
    • lib/llm/src/protocols/common/preprocessor.rs (PreprocessedRequest new field).
    • lib/llm/src/preprocessor.rs (OpenAIPreprocessor copying request_received_seconds).
    • The tests at the bottom of lib/llm/src/http/service/openai.rs to validate the egress stamping.

Related Issues:

  • Relates to Dynamo requirements from NAT team

Summary by CodeRabbit

  • New Features
    • Added comprehensive timing metrics collection throughout the request lifecycle, capturing key timestamps such as request arrival, decode/prefill processing stages, and response completion for improved visibility into system behavior.


@copy-pr-bot bot commented Dec 1, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions bot commented Dec 1, 2025

👋 Hi zhongxuanwang-nv! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: the NVIDIA Test GitHub Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions bot added the external-contribution (Pull request is from an external contributor) label Dec 1, 2025
Signed-off-by: Zhongxuan Wang <daniewang@nvidia.com>
@zhongxuanwang-nv changed the title from "Initial push" to "Optional timing metrics reporting through nvext for vLLM backend" Dec 1, 2025
@zhongxuanwang-nv added the backend::vllm (Relates to the vllm backend) label and removed the external-contribution (Pull request is from an external contributor) label Dec 1, 2025
@zhongxuanwang-nv changed the title from "Optional timing metrics reporting through nvext for vLLM backend" to "feat: timing metrics reporting through nvext for vLLM backend" Dec 1, 2025
@github-actions bot added the feat label Dec 1, 2025
@zhongxuanwang-nv (Member Author) commented:

@coderabbitai review

@coderabbitai bot commented Dec 1, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai bot commented Dec 1, 2025

Walkthrough

End-to-end timing metrics support is added across Python and Rust layers. The vllm handlers collect timing data (request received, prefill/decode start/end), the HTTP service injects boundary timestamps, the preprocessor extracts timing fields, and delta response builders include timing metrics in nvext responses.

Changes

  • Timing collection in vLLM handlers (components/src/dynamo/vllm/handlers.py):
    Added helper functions to detect and capture timing metrics; instrumented decode/prefill/generate paths to populate timing fields (request_received_seconds, decode/prefill start/end times) and merge timing data across worker boundaries; extended disaggregated_params construction to include a timing_metrics dict.
  • HTTP service timing injection (lib/llm/src/http/service/openai.rs):
    Added private helper inject_request_completed_seconds() to populate nvext.timing_metrics; captures request_received_seconds at request arrival in handlers (completions, chat_completions, embeddings); injects request_completed_seconds on final streaming chunks via per-stream transformation; added unit tests.
  • Preprocessor and protocol schema extensions (lib/llm/src/preprocessor.rs, lib/llm/src/protocols/common/preprocessor.rs, lib/llm/src/protocols/openai/nvext.rs):
    Extended PreprocessedRequest with an optional request_received_seconds field; updated OpenAIPreprocessor::builder to extract and set this field from nvext; expanded the NvExt structure with an optional request_received_seconds field and updated allowed extra_fields to include "timing_metrics"; updated tests.
  • Delta response builders (lib/llm/src/protocols/openai/chat_completions/delta.rs, lib/llm/src/protocols/openai/completions/delta.rs):
    Refactored disaggregated_params extraction to collect both worker_id and timing_metrics; inject both fields into nvext as a map only when present; added unit tests validating timing_metrics propagation through delta processing.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

  • Areas requiring extra attention:
    • Timing metric propagation logic across decode/prefill/generate paths in handlers.py—verify correct initialization and merging across worker boundaries
    • Streaming response timing injection in openai.rs—confirm final chunk detection and timestamp accuracy
    • Field mapping consistency between nvext, preprocessor, and protocol structures—ensure no field loss during transformations
    • Delta/nvext builders—validate conditional logic for when fields are present/absent

Poem

🐰 Hop, hop, the clocks now tick,
Each request leaves a timestamp's quick!
From start to end, the metrics flow,
Through handlers, streams, and responses' glow.
Timing's tale is finally sung—
Infrastructure's golden thread now strung! 🕐✨

Pre-merge checks

✅ Passed checks (3 passed)

  • Title check: ✅ Passed. The title clearly summarizes the main feature: adding timing metrics reporting via nvext for the vLLM backend, which aligns with the primary objective of the changeset.
  • Description check: ✅ Passed. The description follows the template with all required sections completed: Overview explains end-to-end timing metrics; Details covers Added/Changed/Removed; Where should the reviewer start provides specific file guidance; and Related Issues references Dynamo requirements.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient (required threshold: 80.00%).


@coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/src/dynamo/vllm/handlers.py (1)

1-1: Fix Black formatting before merge.

The pipeline failure indicates that Black formatting failed. Please run black on this file to resolve the formatting issue.

Based on the pipeline failure log, run:

black components/src/dynamo/vllm/handlers.py
🧹 Nitpick comments (1)
lib/llm/src/http/service/openai.rs (1)

59-90: Consider returning a success indicator.

The function silently handles errors (line 64 .ok()) and structural mismatches (lines 69-88 nested as_object_mut()). While timing metrics are non-critical and silent failures may be acceptable, consider returning a bool to indicate success for better observability.

If you want to track silent failures, you could refactor to:

-fn inject_request_completed_seconds(nvext: &mut Option<serde_json::Value>) {
+fn inject_request_completed_seconds(nvext: &mut Option<serde_json::Value>) -> bool {
     let ts = SystemTime::now()
         .duration_since(UNIX_EPOCH)
         .map(|d| d.as_secs_f64())
         .ok();

-    if let Some(ts) = ts {
+    let Some(ts) = ts else {
+        tracing::trace!("Failed to get current timestamp for request_completed_seconds");
+        return false;
+    };
+    
         let nvext = nvext.get_or_insert_with(|| serde_json::Value::Object(serde_json::Map::new()));
         if let Some(obj) = nvext.as_object_mut() {
             if let Some(timing) = obj.get_mut("timing_metrics") {
                 if let Some(timing_obj) = timing.as_object_mut() {
                     timing_obj.insert(
                         "request_completed_seconds".to_string(),
                         serde_json::Value::from(ts),
                     );
+                    return true;
                 }
             } else {
                 let mut timing_obj = serde_json::Map::new();
                 timing_obj.insert(
                     "request_completed_seconds".to_string(),
                     serde_json::Value::from(ts),
                 );
                 obj.insert(
                     "timing_metrics".to_string(),
                     serde_json::Value::Object(timing_obj),
                 );
+                return true;
             }
         }
-    }
+    false
 }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9d7d95 and 08d432b.

📒 Files selected for processing (7)
  • components/src/dynamo/vllm/handlers.py (8 hunks)
  • lib/llm/src/http/service/openai.rs (6 hunks)
  • lib/llm/src/preprocessor.rs (1 hunks)
  • lib/llm/src/protocols/common/preprocessor.rs (1 hunks)
  • lib/llm/src/protocols/openai/chat_completions/delta.rs (2 hunks)
  • lib/llm/src/protocols/openai/completions/delta.rs (2 hunks)
  • lib/llm/src/protocols/openai/nvext.rs (4 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.

Applied to files:

  • lib/llm/src/protocols/openai/nvext.rs
📚 Learning: 2025-09-08T21:18:43.478Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2936
File: lib/parsers/src/reasoning/granite_parser.rs:42-46
Timestamp: 2025-09-08T21:18:43.478Z
Learning: In GraniteReasoningParser in lib/parsers/src/reasoning/granite_parser.rs, the think_start_tokens and think_end_tokens are hardcoded in the constructor with fixed values, so unwrap() calls on these vectors are safe and won't panic.

Applied to files:

  • lib/llm/src/protocols/openai/nvext.rs
📚 Learning: 2025-06-24T20:59:35.725Z
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the `sampling_options` call in the `preprocess_request` method is placed in the common section after the match statement on `request.prompt_input_type()`, meaning it applies to both `PromptInput::Tokens` and `PromptInput::Text` request types.

Applied to files:

  • lib/llm/src/protocols/common/preprocessor.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: There are two separate DeltaGenerator classes in the codebase: one for chat completions (lib/llm/src/protocols/openai/chat_completions/delta.rs with object "chat.completion.chunk") and one for text completions (lib/llm/src/protocols/openai/completions/delta.rs with object "text_completion"). They have different create_choice method signatures and serve different OpenAI API endpoints. The reasoning parsing functionality is only relevant to the chat completions DeltaGenerator.

Applied to files:

  • lib/llm/src/protocols/openai/chat_completions/delta.rs
  • lib/llm/src/protocols/openai/completions/delta.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: The create_choice method exists on multiple different objects in the codebase. The DeltaGenerator::create_choice in lib/llm/src/protocols/openai/chat_completions/delta.rs has its own signature that was updated to include reasoning_content, but other objects in lib/llm/src/engines.rs have their own separate create_choice methods with different signatures that are not related to chat completions.

Applied to files:

  • lib/llm/src/protocols/openai/chat_completions/delta.rs
  • lib/llm/src/protocols/openai/completions/delta.rs
📚 Learning: 2025-08-25T22:04:45.205Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2700
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:19-28
Timestamp: 2025-08-25T22:04:45.205Z
Learning: The response_generator() method exists on multiple request types in the codebase: NvCreateChatCompletionRequest (for chat completions) and NvCreateCompletionRequest (for text completions). When making signature changes, it's important to distinguish between these different object types as they have separate implementations and call sites.

Applied to files:

  • lib/llm/src/http/service/openai.rs
🧬 Code graph analysis (6)
lib/llm/src/preprocessor.rs (4)
lib/llm/src/protocols/openai/nvext.rs (2)
  • nvext (9-9)
  • builder (99-101)
lib/llm/src/protocols/openai/responses.rs (3)
  • nvext (44-46)
  • nvext (104-106)
  • nvext (151-153)
lib/llm/src/protocols/openai/chat_completions.rs (3)
  • nvext (74-76)
  • nvext (134-136)
  • nvext (238-240)
lib/llm/src/protocols/openai.rs (2)
  • nvext (50-50)
  • nvext (60-60)
lib/llm/src/protocols/openai/nvext.rs (1)
lib/llm/src/preprocessor.rs (1)
  • builder (202-248)
lib/llm/src/protocols/common/preprocessor.rs (1)
lib/llm/src/preprocessor.rs (1)
  • builder (202-248)
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
lib/llm/src/protocols/openai/chat_completions/aggregator.rs (2)
  • new (92-104)
  • default (63-65)
lib/llm/src/protocols/openai/nvext.rs (2)
  • default (93-95)
  • nvext (9-9)
lib/llm/src/protocols/openai/completions/delta.rs (2)
lib/llm/src/protocols/openai/nvext.rs (2)
  • default (93-95)
  • nvext (9-9)
lib/llm/src/protocols/openai/responses.rs (3)
  • nvext (44-46)
  • nvext (104-106)
  • nvext (151-153)
components/src/dynamo/vllm/handlers.py (1)
components/src/dynamo/sglang/request_handlers/handler_base.py (1)
  • BaseWorkerHandler (20-222)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4661/merge) by zhongxuanwang-nv.
components/src/dynamo/vllm/handlers.py

[error] 1-1: Black formatting failed. 1 file reformatted by the hook; please re-run the commit or CI to pass: 'pre-commit run --show-diff-on-failure --color=always --all-files'.

🔇 Additional comments (19)
lib/llm/src/preprocessor.rs (1)

240-244: LGTM!

The extraction of request_received_seconds from nvext follows the established pattern for other nvext fields and maintains consistency with the timing metrics feature.

lib/llm/src/protocols/openai/nvext.rs (3)

84-89: LGTM!

The new request_received_seconds field is properly integrated with appropriate serde attributes and clear documentation.


79-79: LGTM!

The documentation correctly lists the expanded set of supported extra_fields.


136-160: LGTM!

Tests properly validate the new field and the expanded extra_fields list.

lib/llm/src/protocols/common/preprocessor.rs (1)

105-109: LGTM!

The request_received_seconds field is well-documented and follows the established patterns for optional fields in the struct.

components/src/dynamo/vllm/handlers.py (5)

77-87: LGTM!

The helper functions provide clean abstractions for timing metrics detection and timestamp retrieval.


313-393: LGTM!

The timing metrics initialization and decode tracking logic is sound:

  • Frontend timestamp is preserved when merging with prefill timing (lines 349-352)
  • Aggregated mode (no prefill_result) correctly treats decode_start as prefill_start (line 380)
  • First token time is used as prefill_end in aggregated mode (line 392), which aligns with the semantic that prefill completes when the first token is generated

402-413: LGTM!

The conditional injection of timing_metrics is correct:

  • Only when finish_reason is present (the final chunk)
  • Only when timing is requested
  • Safely initializes disaggregated_params if needed

267-271: LGTM!

The refactored completion_usage construction is cleaner and preserves the existing structure while preparing for timing metrics augmentation.


447-524: LGTM!

The prefill worker timing metrics logic correctly:

  • Captures prefill_start_seconds at the start of processing (line 460)
  • Records prefill_end_seconds when the result is available (lines 521-523)
  • Builds disaggregated_params with both kv_transfer_params and timing_metrics (lines 515-524)
lib/llm/src/protocols/openai/completions/delta.rs (2)

268-286: LGTM!

The generic extraction of worker_id and timing_metrics from disaggregated_params into nvext is clean and extensible. The conditional check ensures nvext is only set when there's actual data to include.


304-350: LGTM!

The test thoroughly validates that both timing_metrics and worker_id are correctly extracted from disaggregated_params and placed into nvext.

lib/llm/src/protocols/openai/chat_completions/delta.rs (2)

365-383: LGTM!

The nvext construction from disaggregated_params follows the same clean pattern as the completions path, maintaining consistency across the codebase.


464-506: LGTM!

The test provides thorough coverage of timing_metrics and worker_id extraction into nvext.

lib/llm/src/http/service/openai.rs (5)

317-329: LGTM!

Capturing request_received_seconds immediately at handler entry is the right approach for accurate timing. The safe initialization with get_or_insert_with(Default::default) is appropriate.


453-468: LGTM!

The injection on the final content chunk (with finish_reason) aligns with the backend's timing_metrics population strategy, avoiding injection on the separate usage-only chunk.


751-763: LGTM!

Consistent timing capture across both completions and chat completions endpoints.


898-912: LGTM!

The stream mapping for chat completions follows the same correct pattern as completions.


2149-2184: LGTM!

The tests validate the two key scenarios:

  1. Preserving existing fields when injecting into existing timing_metrics
  2. Creating the structure from scratch when nvext is None

Both scenarios are well-covered.

Signed-off-by: Zhongxuan Wang <daniewang@nvidia.com>
@pull-request-size bot added size/XL and removed size/L labels Dec 1, 2025
Signed-off-by: Zhongxuan Wang <daniewang@nvidia.com>
@zhongxuanwang-nv (Member Author) commented:

/ok to test 8588689

Signed-off-by: Zhongxuan Wang <daniewang@nvidia.com>
Signed-off-by: Zhongxuan Wang <daniewang@nvidia.com>
@zhongxuanwang-nv marked this pull request as ready for review December 1, 2025 20:49
@zhongxuanwang-nv requested review from a team as code owners December 1, 2025 20:49
Signed-off-by: Zhongxuan Wang <daniewang@nvidia.com>
Signed-off-by: Zhongxuan Wang <daniewang@nvidia.com>
@zhongxuanwang-nv (Member Author) commented:

/ok to test 0f1d019


let nvext_response = NvExtResponse {
    worker_id: Some(worker_id_info),
};
@grahamking (Contributor) commented Dec 1, 2025:
Is worker_id gone? I think AI Gateway uses it.

@zhongxuanwang-nv (Member Author) replied Dec 1, 2025:

It's definitely not gone. I just removed the NvExtResponse and WorkerIdInfo structs and instead export the worker_id JSON directly, because I figured it would look less redundant :)

Do you think I should bring those two structs back, even though no Dynamo code is referencing them?

Thank you Graham!

Comment on lines 522 to 526
if include_timing and timing_metrics:
    timing_metrics["prefill_end_seconds"] = _get_current_time_seconds()
    disaggregated_params["timing_metrics"] = timing_metrics
A contributor commented:

What happens if the prefill and decode nodes are on different machines and have slightly different clocks due to time drift? Will the final values be wrong or not synchronized?

@zhongxuanwang-nv (Member Author) replied:

That's true. This is unfortunately not handled, so the final values would be skewed; I think we can instead add a note asking users to keep all their clocks synchronized if they choose to use the timing.

But the good thing is, we can also see the worker_id that handled the request, so we can tell how much time drift that worker had :) Thank you Anthony!
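
For reference, here's a sketch of the durations a consumer might derive, with the drift-sensitive ones called out (kv_transfer_gap_seconds is a made-up name for the prefill-to-decode handoff gap, not something this PR emits):

  def derive_durations(t: dict) -> dict:
      # t is the "timing_metrics" dict from the response's nvext.
      return {
          # Both stamped by the frontend, so immune to cross-machine drift.
          "e2e_seconds": t["request_completed_seconds"] - t["request_received_seconds"],
          # Both stamped by the same worker, so also drift-free.
          "prefill_seconds": t["prefill_end_seconds"] - t["prefill_start_seconds"],
          "decode_seconds": t["decode_end_seconds"] - t["decode_start_seconds"],
          # Frontend clock vs. worker clock: subject to drift.
          "ttft_seconds": t["decode_first_token_seconds"] - t["request_received_seconds"],
          # Prefill worker vs. decode worker: subject to drift when disaggregated.
          "kv_transfer_gap_seconds": t["decode_start_seconds"] - t["prefill_end_seconds"],
      }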

Signed-off-by: Zhongxuan Wang <daniewang@nvidia.com>

Labels

backend::vllm (Relates to the vllm backend), feat, size/XL
