feat: timing metrics reporting through nvext for vLLM backend #4661
base: main
Conversation
👋 Hi zhongxuanwang-nv! Thank you for contributing to ai-dynamo/dynamo.
Signed-off-by: Zhongxuan Wang <daniewang@nvidia.com>
Force-pushed from 9f7ee4e to 08d432b
feat: timing metrics reporting through nvext for vLLM backend
@coderabbitai review

✅ Actions performed: Review triggered.
Walkthrough

End-to-end timing metrics support is added across Python and Rust layers. The vLLM handlers collect timing data (request received, prefill/decode start/end), the HTTP service injects boundary timestamps, the preprocessor extracts timing fields, and delta response builders include timing metrics in nvext responses.
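For orientation, the dict below sketches which layer stamps each field, based on the walkthrough and the review comments that follow; it is illustrative, not code from the PR:

```python
# Illustrative mapping (not PR code): timing field -> the layer that stamps it.
TIMING_FIELD_WRITERS = {
    "request_received_seconds": "HTTP frontend (lib/llm/src/http/service/openai.rs), at ingress",
    "prefill_start_seconds": "vLLM worker (components/src/dynamo/vllm/handlers.py)",
    "prefill_end_seconds": "prefill worker; first-token time in aggregated mode",
    "decode_start_seconds": "vLLM worker, when decode begins",
    "decode_first_token_seconds": "vLLM worker, at the first generated token",
    "decode_end_seconds": "vLLM worker, when decode finishes",
    "request_completed_seconds": "HTTP frontend, stamped on the final content chunk",
}
```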
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~28 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/src/dynamo/vllm/handlers.py (1)
1-1: Fix Black formatting before merge. The pipeline failure indicates that Black formatting failed. Please run `black` on this file to resolve the formatting issue. Based on the pipeline failure log, run:
```sh
black components/src/dynamo/vllm/handlers.py
```
🧹 Nitpick comments (1)
lib/llm/src/http/service/openai.rs (1)
59-90: Consider returning a success indicator. The function silently handles errors (line 64, `.ok()`) and structural mismatches (lines 69-88, nested `as_object_mut()`). While timing metrics are non-critical and silent failures may be acceptable, consider returning a `bool` to indicate success for better observability. If you want to track silent failures, you could refactor to:
```diff
-fn inject_request_completed_seconds(nvext: &mut Option<serde_json::Value>) {
+fn inject_request_completed_seconds(nvext: &mut Option<serde_json::Value>) -> bool {
     let ts = SystemTime::now()
         .duration_since(UNIX_EPOCH)
         .map(|d| d.as_secs_f64())
         .ok();
-    if let Some(ts) = ts {
+    let Some(ts) = ts else {
+        tracing::trace!("Failed to get current timestamp for request_completed_seconds");
+        return false;
+    };
+
     let nvext = nvext.get_or_insert_with(|| serde_json::Value::Object(serde_json::Map::new()));
     if let Some(obj) = nvext.as_object_mut() {
         if let Some(timing) = obj.get_mut("timing_metrics") {
             if let Some(timing_obj) = timing.as_object_mut() {
                 timing_obj.insert(
                     "request_completed_seconds".to_string(),
                     serde_json::Value::from(ts),
                 );
+                return true;
             }
         } else {
             let mut timing_obj = serde_json::Map::new();
             timing_obj.insert(
                 "request_completed_seconds".to_string(),
                 serde_json::Value::from(ts),
             );
             obj.insert(
                 "timing_metrics".to_string(),
                 serde_json::Value::Object(timing_obj),
             );
+            return true;
         }
     }
-    }
+    false
 }
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- components/src/dynamo/vllm/handlers.py (8 hunks)
- lib/llm/src/http/service/openai.rs (6 hunks)
- lib/llm/src/preprocessor.rs (1 hunk)
- lib/llm/src/protocols/common/preprocessor.rs (1 hunk)
- lib/llm/src/protocols/openai/chat_completions/delta.rs (2 hunks)
- lib/llm/src/protocols/openai/completions/delta.rs (2 hunks)
- lib/llm/src/protocols/openai/nvext.rs (4 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.
Applied to files:
lib/llm/src/protocols/openai/nvext.rs
📚 Learning: 2025-09-08T21:18:43.478Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2936
File: lib/parsers/src/reasoning/granite_parser.rs:42-46
Timestamp: 2025-09-08T21:18:43.478Z
Learning: In GraniteReasoningParser in lib/parsers/src/reasoning/granite_parser.rs, the think_start_tokens and think_end_tokens are hardcoded in the constructor with fixed values, so unwrap() calls on these vectors are safe and won't panic.
Applied to files:
lib/llm/src/protocols/openai/nvext.rs
📚 Learning: 2025-06-24T20:59:35.725Z
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the `sampling_options` call in the `preprocess_request` method is placed in the common section after the match statement on `request.prompt_input_type()`, meaning it applies to both `PromptInput::Tokens` and `PromptInput::Text` request types.
Applied to files:
lib/llm/src/protocols/common/preprocessor.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: There are two separate DeltaGenerator classes in the codebase: one for chat completions (lib/llm/src/protocols/openai/chat_completions/delta.rs with object "chat.completion.chunk") and one for text completions (lib/llm/src/protocols/openai/completions/delta.rs with object "text_completion"). They have different create_choice method signatures and serve different OpenAI API endpoints. The reasoning parsing functionality is only relevant to the chat completions DeltaGenerator.
Applied to files:
lib/llm/src/protocols/openai/chat_completions/delta.rs
lib/llm/src/protocols/openai/completions/delta.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: The create_choice method exists on multiple different objects in the codebase. The DeltaGenerator::create_choice in lib/llm/src/protocols/openai/chat_completions/delta.rs has its own signature that was updated to include reasoning_content, but other objects in lib/llm/src/engines.rs have their own separate create_choice methods with different signatures that are not related to chat completions.
Applied to files:
lib/llm/src/protocols/openai/chat_completions/delta.rs
lib/llm/src/protocols/openai/completions/delta.rs
📚 Learning: 2025-08-25T22:04:45.205Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2700
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:19-28
Timestamp: 2025-08-25T22:04:45.205Z
Learning: The response_generator() method exists on multiple request types in the codebase: NvCreateChatCompletionRequest (for chat completions) and NvCreateCompletionRequest (for text completions). When making signature changes, it's important to distinguish between these different object types as they have separate implementations and call sites.
Applied to files:
lib/llm/src/http/service/openai.rs
🧬 Code graph analysis (6)
lib/llm/src/preprocessor.rs (4)

- lib/llm/src/protocols/openai/nvext.rs (2): nvext (9-9), builder (99-101)
- lib/llm/src/protocols/openai/responses.rs (3): nvext (44-46), nvext (104-106), nvext (151-153)
- lib/llm/src/protocols/openai/chat_completions.rs (3): nvext (74-76), nvext (134-136), nvext (238-240)
- lib/llm/src/protocols/openai.rs (2): nvext (50-50), nvext (60-60)

lib/llm/src/protocols/openai/nvext.rs (1)

- lib/llm/src/preprocessor.rs (1): builder (202-248)

lib/llm/src/protocols/common/preprocessor.rs (1)

- lib/llm/src/preprocessor.rs (1): builder (202-248)

lib/llm/src/protocols/openai/chat_completions/delta.rs (2)

- lib/llm/src/protocols/openai/chat_completions/aggregator.rs (2): new (92-104), default (63-65)
- lib/llm/src/protocols/openai/nvext.rs (2): default (93-95), nvext (9-9)

lib/llm/src/protocols/openai/completions/delta.rs (2)

- lib/llm/src/protocols/openai/nvext.rs (2): default (93-95), nvext (9-9)
- lib/llm/src/protocols/openai/responses.rs (3): nvext (44-46), nvext (104-106), nvext (151-153)

components/src/dynamo/vllm/handlers.py (1)

- components/src/dynamo/sglang/request_handlers/handler_base.py (1): BaseWorkerHandler (20-222)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4661/merge) by zhongxuanwang-nv.
components/src/dynamo/vllm/handlers.py
[error] 1-1: Black formatting failed. 1 file reformatted by the hook; please re-run the commit or CI to pass: 'pre-commit run --show-diff-on-failure --color=always --all-files'.
🔇 Additional comments (19)
lib/llm/src/preprocessor.rs (1)
240-244: LGTM! The extraction of `request_received_seconds` from nvext follows the established pattern for other nvext fields and maintains consistency with the timing metrics feature.

lib/llm/src/protocols/openai/nvext.rs (3)
84-89: LGTM! The new `request_received_seconds` field is properly integrated with appropriate serde attributes and clear documentation.
79-79: LGTM! The documentation correctly lists the expanded set of supported `extra_fields`.

136-160: LGTM! Tests properly validate the new field and the expanded `extra_fields` list.
lib/llm/src/protocols/common/preprocessor.rs (1)
105-109: LGTM! The `request_received_seconds` field is well-documented and follows the established patterns for optional fields in the struct.

components/src/dynamo/vllm/handlers.py (5)
77-87: LGTM! The helper functions provide clean abstractions for timing metrics detection and timestamp retrieval (a sketch of plausible shapes follows).
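The helper bodies are not shown in this review. A minimal sketch of what such helpers could look like, assuming wall-clock UNIX timestamps and an opt-in key inside `extra_fields` (both assumptions):

```python
import time
from typing import Optional


def _get_current_time_seconds() -> float:
    # Wall-clock UNIX timestamp in seconds, matching the frontend's
    # request_received_seconds units (assumed).
    return time.time()


def _should_include_timing_metrics(extra_fields: Optional[dict]) -> bool:
    # Hypothetical gate: emit timing metrics only when the client opted in
    # via nvext extra_fields; the exact key and shape are assumptions.
    return bool(extra_fields and extra_fields.get("timing_metrics"))
```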
313-393: LGTM! The timing metrics initialization and decode tracking logic is sound (a condensed sketch follows the list):
- Frontend timestamp is preserved when merging with prefill timing (lines 349-352)
- Aggregated mode (no prefill_result) correctly treats decode_start as prefill_start (line 380)
- First token time is used as prefill_end in aggregated mode (line 392), which aligns with the semantic that prefill completes when the first token is generated
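A condensed sketch of that precedence, with hypothetical names; only the field names and merge rules come from the bullets above:

```python
from typing import Optional


def merge_timing_metrics(
    frontend_received: Optional[float],
    prefill_timing: Optional[dict],
    now: float,
) -> dict:
    """Hypothetical condensation of the merge logic reviewed above."""
    timing: dict = {}
    if prefill_timing is not None:
        # Disaggregated mode: adopt the prefill worker's timestamps, but keep
        # the frontend's request_received_seconds rather than overwriting it.
        timing.update(prefill_timing)
    else:
        # Aggregated mode: decode start doubles as prefill start; prefill_end
        # is stamped later, at first-token time.
        timing["prefill_start_seconds"] = now
    if frontend_received is not None:
        timing["request_received_seconds"] = frontend_received
    timing["decode_start_seconds"] = now
    return timing
```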
402-413: LGTM! The conditional injection of timing_metrics is correct (sketched after the list):
- Only when finish_reason is present (the final chunk)
- Only when timing is requested
- Safely initializes disaggregated_params if needed
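In sketch form (the `tok` dict shape follows the excerpt quoted later in this thread; the helper name is hypothetical):

```python
def attach_timing_on_final_chunk(
    tok: dict, timing_metrics: dict, include_timing: bool
) -> None:
    # Only the final chunk (the one carrying finish_reason) receives the full
    # timing_metrics map, and only when the client requested timing; the
    # disaggregated_params container is created if missing.
    if include_timing and tok.get("finish_reason") is not None:
        tok.setdefault("disaggregated_params", {})["timing_metrics"] = timing_metrics
```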
267-271: LGTM! The refactored `completion_usage` construction is cleaner and preserves the existing structure while preparing for timing metrics augmentation.
447-524: LGTM! The prefill worker timing metrics logic correctly (see the outline after the list):
- Captures prefill_start_seconds at the start of processing (line 460)
- Records prefill_end_seconds when the result is available (lines 521-523)
- Builds disaggregated_params with both kv_transfer_params and timing_metrics (lines 515-524)
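An outline of that flow under the same assumptions (the engine call is a stand-in, not the real API):

```python
import time
from typing import Callable


def prefill_with_timing(
    run_prefill: Callable[[dict], dict],  # stand-in for the real engine call
    request: dict,
    include_timing: bool,
) -> dict:
    """Hypothetical outline of the prefill-side stamping reviewed above."""
    timing: dict = {}
    if include_timing:
        timing["prefill_start_seconds"] = time.time()
    kv_transfer_params = run_prefill(request)
    disaggregated_params: dict = {"kv_transfer_params": kv_transfer_params}
    if include_timing:
        timing["prefill_end_seconds"] = time.time()
        disaggregated_params["timing_metrics"] = timing
    return disaggregated_params
```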
lib/llm/src/protocols/openai/completions/delta.rs (2)
268-286: LGTM! The generic extraction of worker_id and timing_metrics from disaggregated_params into nvext is clean and extensible. The conditional check ensures nvext is only set when there's actual data to include.

304-350: LGTM! The test thoroughly validates that both timing_metrics and worker_id are correctly extracted from disaggregated_params and placed into nvext.
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
365-383: LGTM! The nvext construction from disaggregated_params follows the same clean pattern as the completions path, maintaining consistency across the codebase.

464-506: LGTM! The test provides thorough coverage of timing_metrics and worker_id extraction into nvext.
lib/llm/src/http/service/openai.rs (5)
317-329: LGTM! Capturing `request_received_seconds` immediately at handler entry is the right approach for accurate timing. The safe initialization with `get_or_insert_with(Default::default)` is appropriate.
453-468: LGTM! The injection on the final content chunk (with `finish_reason`) aligns with the backend's timing_metrics population strategy, avoiding injection on the separate usage-only chunk.
751-763: LGTM! Consistent timing capture across both completions and chat completions endpoints.

898-912: LGTM! The stream mapping for chat completions follows the same correct pattern as completions.

2149-2184: LGTM! The tests validate the two key scenarios:

- Preserving existing fields when injecting into existing timing_metrics
- Creating the structure from scratch when nvext is None

Both scenarios are well-covered.
Signed-off-by: Zhongxuan Wang <daniewang@nvidia.com>
/ok to test 8588689

Signed-off-by: Zhongxuan Wang <daniewang@nvidia.com>

/ok to test 0f1d019
```rust
let nvext_response = NvExtResponse {
    worker_id: Some(worker_id_info),
};
```
Is worker_id gone? I think AI Gateway uses it.
It's definitely not gone; I just removed the NvExtResponse and WorkerIdInfo structs and instead export the worker_id JSON directly, because I figured it would look less redundant :)
Do you think I should bring these two structs back, even though no Dynamo code is referencing them?
Thank you Graham!
```python
if include_timing and timing_metrics:
    timing_metrics[
        "prefill_end_seconds"
    ] = _get_current_time_seconds()
    disaggregated_params["timing_metrics"] = timing_metrics
```
What happens if the prefill and decode nodes are on different machines and have slightly different clocks due to time drift? Will the final values be wrong or not synchronized?
That's true, this is unfortunately not handled, so the final values would be wrong. I think we can instead add a note asking users to keep all the clocks synchronized if they choose to use the timing?
The good news is that we can also see the worker_id that handled the request, so we can tell how much time drift that worker had :) Thank you Anthony!
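To make the caveat concrete, here is an illustrative breakdown (not PR code) of which derived latencies stay within one clock domain and which cross machines:

```python
def derived_latencies(t: dict) -> dict:
    # Spans computed from a single machine's clock are immune to cross-node
    # drift; spans subtracting timestamps taken on different machines absorb
    # any clock offset between them.
    return {
        # Same clock (prefill worker): safe.
        "prefill_seconds": t["prefill_end_seconds"] - t["prefill_start_seconds"],
        # Same clock (frontend): safe.
        "e2e_seconds": t["request_completed_seconds"] - t["request_received_seconds"],
        # Frontend clock vs prefill-worker clock: includes any drift.
        "time_to_prefill_seconds": t["prefill_start_seconds"] - t["request_received_seconds"],
    }
```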
Overview:
This PR introduces end-to-end timing metrics through the entire Dynamo → vLLM → frontend stack.
Sample response:
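A hypothetical nvext payload consistent with the fields this PR adds might look as follows (every value is invented, and the exact shape of `worker_id` is an assumption):

```python
# Hypothetical nvext block on the final streamed chunk; all values invented.
sample_nvext = {
    "worker_id": 42,  # shape assumed; exported directly per the discussion above
    "timing_metrics": {
        "request_received_seconds": 1700000000.001,
        "prefill_start_seconds": 1700000000.010,
        "prefill_end_seconds": 1700000000.090,
        "decode_start_seconds": 1700000000.095,
        "decode_first_token_seconds": 1700000000.120,
        "decode_end_seconds": 1700000000.900,
        "request_completed_seconds": 1700000000.905,
    },
}
```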
Details:
Added:
- Timing instrumentation in `generate` / `prefill_generate`, including:
  - `_should_include_timing_metrics` gate on `extra_fields` and `_get_current_time_seconds` helper in `components/src/dynamo/vllm/handlers.py`.
  - `timing_metrics` map with `prefill_start_seconds`, `prefill_end_seconds`, `decode_start_seconds`, `decode_first_token_seconds`, `decode_end_seconds`.
- `request_received_seconds` recorded at HTTP ingress and stored into `request.nvext.request_received_seconds` in `lib/llm/src/http/service/openai.rs` for both completions and chat completions.
- `inject_request_completed_seconds` helper that adds `request_completed_seconds` under `timing_metrics` in `lib/llm/src/http/service/openai.rs`, plus unit tests.
- `request_received_seconds: Option<f64>` on `PreprocessedRequest` in `lib/llm/src/protocols/common/preprocessor.rs`.
- Extraction of `request_received_seconds` from `nvext` in `lib/llm/src/preprocessor.rs`.

Changed:
- `components/src/dynamo/vllm/handlers.py`:
  - In aggregated mode, stamp `prefill_start_seconds` and `decode_start_seconds` together, and use first-token time as `prefill_end_seconds`.
  - On the final chunk (the one with `finish_reason`), attach the full `timing_metrics` map into `tok["disaggregated_params"]["timing_metrics"]` so downstream layers can read it.
  - `completion_usage` construction is wrapped cleanly before further augmentation.
- `lib/llm/src/http/service/openai.rs`: wrapped the streams in `completions_single` and `chat_completions` with a `map` that injects `request_completed_seconds` only on the chunk whose choices have a `finish_reason` (not the usage-only chunk), matching when backend timing metrics are finalized.
- `OpenAIPreprocessor` now forwards `request_received_seconds` from `request.nvext()` into the `PreprocessedRequest` builder so workers can see frontend ingress time.

Removed:

- The `NvExtResponse` and `WorkerIdInfo` structs; the `worker_id` JSON is exported directly instead (see the review discussion above).
Why it matters:
- Propagating `timing_metrics` into `disaggregated_params` on the token stream makes these numbers available to observability / analytics stacks and any downstream Dynamo components without changing the public OpenAI surface.

Where should the reviewer start?
- `components/src/dynamo/vllm/handlers.py`: `_should_include_timing_metrics`, the new `timing_metrics` lifecycle in `BaseWorkerHandler.generate`, and how prefill vs aggregated decode modes are handled.
- `lib/llm/src/http/service/openai.rs`: `handler_completions` / `handler_chat_completions` for `request_received_seconds`; `inject_request_completed_seconds` and its use in `completions_single` / `chat_completions`, where the stream is mapped to stamp the final content chunk.
- `lib/llm/src/protocols/common/preprocessor.rs` (new `request_received_seconds` field on `PreprocessedRequest`).
- `lib/llm/src/preprocessor.rs` (`OpenAIPreprocessor` copying `request_received_seconds`).
- Tests in `lib/llm/src/http/service/openai.rs` to validate the egress stamping.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)