feat(fault-injection): Add latency percentile metrics and per-phase statistics tracking #4692

nv-oviya · 2025-12-02T04:13:37Z

Overview:

Enhances InferenceLoadTester to capture latency percentiles (p50, p95, p99) and per-phase statistics, enabling quantitative analysis of fault tolerance impact. Previously the tester tracked only cumulative success/failure counts and average latency. That made it impossible to see how faults affect the latency distribution or to compare performance across test phases (baseline → fault → recovery), especially when all requests succeeded due to pod-node distribution. This PR adds percentile tracking and phase checkpointing to support that analysis.

Details:

Added:

Checkpointing mechanism:

- Added checkpoint_index to track phase boundaries
- checkpoint() method marks start of new test phase
- get_stats(since_checkpoint=bool) returns per-phase or cumulative stats
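
The checkpointing idea can be sketched as follows. This is a minimal illustration, not the PR's code: `PhaseTracker`, `record()`, and the result-dict layout are hypothetical stand-ins, and only `checkpoint_index`, `checkpoint()`, and `get_stats(since_checkpoint=...)` come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class PhaseTracker:
    """Hypothetical stand-in for the result bookkeeping in InferenceLoadTester."""

    results: list = field(default_factory=list)
    checkpoint_index: int = 0  # index of the first result in the current phase

    def record(self, success: bool, latency: float) -> None:
        # Hypothetical helper: append one request outcome.
        self.results.append({"success": success, "latency": latency})

    def checkpoint(self) -> None:
        # Mark a phase boundary: results appended later belong to the new phase.
        self.checkpoint_index = len(self.results)

    def get_stats(self, since_checkpoint: bool = False) -> dict:
        # Slice from the last checkpoint for per-phase stats, else use everything.
        window = self.results[self.checkpoint_index:] if since_checkpoint else self.results
        succeeded = sum(1 for r in window if r["success"])
        return {"total": len(window), "success": succeeded, "failed": len(window) - succeeded}


tracker = PhaseTracker()
tracker.record(True, 0.10)   # baseline phase
tracker.record(True, 0.12)
tracker.checkpoint()         # fault phase starts here
tracker.record(False, 1.50)
tracker.record(True, 0.90)

phase_stats = tracker.get_stats(since_checkpoint=True)
cumulative_stats = tracker.get_stats()
```

Storing only the list length keeps `checkpoint()` cheap, and the cumulative view stays available because no results are discarded.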

Latency percentile calculation:

- Enhanced get_stats() to calculate p50, p95, p99 using linear interpolation
- Added min/max latency tracking
- Percentiles computed only for successful requests
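
For reference, one common form of linear-interpolation percentile looks like the sketch below. The description does not say which interpolation convention the PR uses; this version assumes the fractional rank `(n - 1) * pct / 100`, which is the convention behind NumPy's default `linear` method. The `percentile()` helper, the result-dict layout, and the sample values are illustrative assumptions.

```python
import math


def percentile(sorted_values, pct):
    """Return the pct-th percentile (0-100) of an ascending list via linear interpolation."""
    if not sorted_values:
        # Assumption: an empty phase is treated as an error in this sketch.
        raise ValueError("no values to compute a percentile from")
    rank = (len(sorted_values) - 1) * pct / 100.0  # fractional index into the list
    lower = math.floor(rank)
    upper = math.ceil(rank)
    if lower == upper:
        return sorted_values[lower]
    fraction = rank - lower
    # Interpolate between the two neighbouring samples.
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


# Illustrative results: the failed request is excluded from latency statistics.
results = [
    {"success": True, "latency": 3.0},
    {"success": True, "latency": 1.0},
    {"success": False, "latency": 9.0},
    {"success": True, "latency": 4.0},
    {"success": True, "latency": 2.0},
]
latencies = sorted(r["latency"] for r in results if r["success"])

p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
min_latency, max_latency = latencies[0], latencies[-1]
```

With four successful samples the median falls halfway between the second and third values, so `p50` here is 2.5. The failed request's 9.0 never enters the calculation.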

Updated return schema:

- Added 5 new fields: p50_latency, p95_latency, p99_latency, min_latency, max_latency
- Enables quantitative fault impact analysis and SLA validation
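
As a rough illustration of SLA validation against the new fields, a test could compare a stats dictionary to per-field limits. The five field names come from the list above. The `sla_violations()` helper, the limits, the assumption that latencies are in seconds, and the numbers themselves are hypothetical and are not measurements from this PR.

```python
def sla_violations(stats, limits):
    """Return {field: (observed, limit)} for every field whose value exceeds its limit."""
    violations = {}
    for name, limit in limits.items():
        observed = stats.get(name)
        # Skip fields that are absent, e.g. when a phase had no successful requests.
        if observed is not None and observed > limit:
            violations[name] = (observed, limit)
    return violations


# Made-up per-phase stats, shaped like the new return fields (seconds assumed).
fault_phase_stats = {
    "p50_latency": 0.21,
    "p95_latency": 0.80,
    "p99_latency": 1.90,
    "min_latency": 0.09,
    "max_latency": 2.40,
}
limits = {"p95_latency": 1.0, "p99_latency": 1.5}  # hypothetical SLA thresholds

violations = sla_violations(fault_phase_stats, limits)
```

In this example only `p99_latency` breaches its limit, which is the kind of tail-latency regression that an average alone would hide.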

Where should the reviewer start?

Start with the checkpoint() method (lines 156-159):
- Simple implementation that just stores current results list length
- Called by test orchestrator before each phase transition
Then review the enhanced get_stats() method (lines 161-239):
- New since_checkpoint parameter controls per-phase vs. cumulative stats
- Lines 175-190: Slicing logic to get phase-specific results
- Lines 200-213: New latency calculations (sort, percentile, min/max)
- Lines 214-223: Percentile calculation using linear interpolation
- Lines 226-239: Updated return dictionary with new fields

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Used by #4690

copy-pr-bot · 2025-12-02T04:13:41Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Oviya Seeniraj <oseeniraj@nvidia.com>

pull-request-size bot added the size/M label Dec 2, 2025

github-actions bot added the feat label Dec 2, 2025

This was referenced Dec 2, 2025

refactor(feat-injection): create PyTest fixtures to replace boilerplate in E2E HW FT tests #4690

Draft

test(feat-injection): add 200x minimized XID 79 PyTest harnessing fixtures #4694

Draft

pull-request-size bot added size/L and removed size/M labels Dec 3, 2025

removed - file

db10853

Signed-off-by: Oviya Seeniraj <oseeniraj@nvidia.com>

nv-oviya force-pushed the oviya/fault-injection/metrics-config branch from f2f391b to db10853 Compare December 3, 2025 22:59

pull-request-size bot added size/M and removed size/L labels Dec 3, 2025

2 participants
