@nv-oviya nv-oviya commented Dec 2, 2025

Overview:

Enhances InferenceLoadTester to capture latency percentiles (p50, p95, p99) and per-phase statistics, enabling quantitative analysis of fault-tolerance impact. Previously the tester tracked only cumulative success/failure counts and average latency, so it could not show how faults affect the latency distribution or compare performance across test phases (baseline → fault → recovery). The gap was most visible when all requests succeeded due to pod-node distribution, because success counts alone then revealed no difference between phases. This PR adds percentile tracking and phase checkpointing to enable that analysis.


Details:

Added:

  1. Checkpointing mechanism:
  • Added checkpoint_index to track phase boundaries
  • checkpoint() method marks start of new test phase
  • get_stats(since_checkpoint=bool) returns per-phase or cumulative stats
  2. Latency percentile calculation:
  • Enhanced get_stats() to calculate p50, p95, p99 using linear interpolation
  • Added min/max latency tracking
  • Percentiles computed only for successful requests
  3. Updated return schema:
  • Added 5 new fields: p50_latency, p95_latency, p99_latency, min_latency, max_latency
  • Enables quantitative fault impact analysis and SLA validation
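
For reviewers unfamiliar with percentile interpolation, here is a minimal sketch of one common linear-interpolation scheme (rank = pct/100 × (n − 1) over the sorted values). The function name `percentile` and the exact rank formula are assumptions for illustration; the PR's own code may differ in detail.

```python
def percentile(sorted_values, pct):
    """Return the pct-th percentile of an ascending list via linear interpolation.

    Hypothetical helper for illustration; not taken from the PR's code.
    """
    if not sorted_values:
        raise ValueError("percentile() requires at least one value")
    # Fractional position of the requested percentile within the sorted list.
    rank = (pct / 100.0) * (len(sorted_values) - 1)
    lower = int(rank)                               # index at or below the rank
    upper = min(lower + 1, len(sorted_values) - 1)  # neighbour above, clamped
    fraction = rank - lower
    # Interpolate between the two neighbouring samples.
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


# Latencies (in ms) of successful requests only, sorted before use.
latencies_ms = sorted([50, 10, 40, 20, 30])
p50 = percentile(latencies_ms, 50)  # rank 2.0, lands exactly on the middle sample
p95 = percentile(latencies_ms, 95)  # rank 3.8, interpolated between 40 and 50
```

Interpolating rather than picking the nearest sample keeps p95 and p99 meaningful when a phase contains only a handful of requests.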

Where should the reviewer start?

  • Start with the checkpoint() method (lines 156-159):
    • Simple implementation that just stores current results list length
    • Called by test orchestrator before each phase transition
  • Review enhanced get_stats() method (lines 161-239):
    • New since_checkpoint parameter controls per-phase vs. cumulative stats
    • Lines 175-190: Slicing logic to get phase-specific results
    • Lines 200-213: New latency calculations (sort, percentile, min/max)
    • Lines 214-223: Percentile calculation using linear interpolation
    • Lines 226-239: Updated return dictionary with new fields
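
The checkpoint-and-slice flow described above can be sketched as follows. This is a simplified stand-in, not the PR's code: the class name `PhaseTracker`, the `record()` method, the result dict layout, and the `total`/`successful`/`failed`/`avg_latency` keys are assumptions; only `checkpoint_index`, `checkpoint()`, `get_stats(since_checkpoint=...)`, and the five new latency fields come from the description.

```python
from statistics import mean


def _percentile(sorted_values, pct):
    # Linear interpolation between the two samples around the fractional rank.
    rank = (pct / 100.0) * (len(sorted_values) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    return sorted_values[lower] + (rank - lower) * (sorted_values[upper] - sorted_values[lower])


class PhaseTracker:
    """Hypothetical, simplified stand-in for the stats side of InferenceLoadTester."""

    def __init__(self):
        self.results = []          # one dict per completed request
        self.checkpoint_index = 0  # index where the current phase starts

    def record(self, success, latency):
        self.results.append({"success": success, "latency": latency})

    def checkpoint(self):
        # Mark a phase boundary by remembering the current results length.
        self.checkpoint_index = len(self.results)

    def get_stats(self, since_checkpoint=False):
        # Per-phase stats slice from the last checkpoint; otherwise use everything.
        results = self.results[self.checkpoint_index:] if since_checkpoint else self.results
        # Percentiles are computed over successful requests only.
        latencies = sorted(r["latency"] for r in results if r["success"])
        stats = {
            "total": len(results),
            "successful": len(latencies),
            "failed": len(results) - len(latencies),
        }
        if latencies:
            stats.update(
                avg_latency=mean(latencies),
                p50_latency=_percentile(latencies, 50),
                p95_latency=_percentile(latencies, 95),
                p99_latency=_percentile(latencies, 99),
                min_latency=latencies[0],
                max_latency=latencies[-1],
            )
        return stats


tracker = PhaseTracker()
for latency in (1.0, 2.0, 3.0):       # baseline phase
    tracker.record(True, latency)
tracker.checkpoint()                   # fault phase starts here
tracker.record(True, 4.0)
tracker.record(True, 8.0)
tracker.record(False, 0.0)             # a failed request, excluded from percentiles
fault_stats = tracker.get_stats(since_checkpoint=True)
overall_stats = tracker.get_stats()
```

Storing only an index keeps `checkpoint()` cheap and leaves the cumulative results intact, so the same list serves both per-phase and cumulative queries.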

Related Issues:

Used by #4690

copy-pr-bot bot commented Dec 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Oviya Seeniraj <oseeniraj@nvidia.com>
@nv-oviya nv-oviya force-pushed the oviya/fault-injection/metrics-config branch from f2f391b to db10853 Compare December 3, 2025 22:59
@pull-request-size pull-request-size bot added size/M and removed size/L labels Dec 3, 2025