
[Feature]: Add num_corrupted_request metric to V1 metrics system. #27301

@atalhens

Description


Currently, vLLM internally tracks a corrupted_requests_counter metric whenever a request produces invalid outputs (NaNs) due to model, engine, or hardware issues. However, this metric is not directly exposed to users in logs or Prometheus metrics.

Exposing this metric would allow users to:

  • Detect model instability or misbehaving custom models.
  • Monitor runtime/engine health in production clusters.
  • Quickly identify hardware or distributed inference issues affecting outputs.

Motivation & Problem

While NaN outputs are rare with well-tested models, they become critical for custom models in early development and can also arise from engine/runtime issues:

  • Models may have numerical instability.
  • Hardware issues are more likely to surface.

The codebase already detects corrupted requests (Request.is_output_corrupted) when the VLLM_COMPUTE_NANS_IN_LOGITS environment variable is enabled, but this diagnostic information is completely hidden from users: there are no metrics, no logging, and no monitoring.
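To make the proposal concrete, here is a minimal sketch (not existing vLLM code) of what exposing the counter through prometheus_client could look like. The metric name mirrors the issue title, and record_corrupted_request is a hypothetical placeholder for wherever the scheduler/engine observes Request.is_output_corrupted:

```python
# Minimal sketch (not existing vLLM code): a Prometheus counter for corrupted
# requests, incremented by a hypothetical hook wherever the engine observes
# Request.is_output_corrupted.
from prometheus_client import Counter

# Metric name mirrors the issue title; the final name is open for discussion.
num_corrupted_request = Counter(
    "vllm:num_corrupted_request",
    "Number of requests whose outputs contained NaNs.",
    labelnames=["model_name"],
)

def record_corrupted_request(model_name: str) -> None:
    """Hypothetical hook, called once per request detected as corrupted."""
    num_corrupted_request.labels(model_name=model_name).inc()
```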

Proposed Idea

The two approaches for corrupted-request metrics that I am considering are:

  1. Approach 1: CLI Config-Based (Current Implementation)
  • CLI flag: --include-corrupted-requests
  • Config: SchedulerConfig.include_corrupted_requests
  • Usage: vllm serve model --include-corrupted-requests

Pros: User-friendly, explicit control, follows existing vLLM CLI patterns.
Cons: Adds a new CLI argument and requires config changes.
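As a rough illustration of Approach 1 (again not existing vLLM code; argparse stands in for vLLM's own argument utilities), the wiring could look roughly like this:

```python
# Sketch of Approach 1: a new CLI flag feeding a SchedulerConfig field that
# gates whether the corrupted-request counter is exposed. The names mirror
# the proposal above and are not existing vLLM code.
import argparse
from dataclasses import dataclass

@dataclass
class SchedulerConfig:
    include_corrupted_requests: bool = False  # hypothetical new field

def add_cli_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    parser.add_argument(
        "--include-corrupted-requests",
        action="store_true",
        help="Expose the corrupted-request counter in logs and Prometheus.",
    )
    return parser

# Example: `vllm serve model --include-corrupted-requests` would end up with
# include_corrupted_requests=True in the scheduler config.
args = add_cli_args(argparse.ArgumentParser()).parse_args(
    ["--include-corrupted-requests"]
)
config = SchedulerConfig(include_corrupted_requests=args.include_corrupted_requests)
```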

I welcome suggestions and thoughts on this, and would love to contribute the implementation.

Alternatives

  1. Approach 2: Environment Variable-Based (Proposed Alternative)
  • Reuses the existing VLLM_COMPUTE_NANS_IN_LOGITS environment variable
  • Logic: When NaN detection is enabled, automatically expose corrupted metrics
  • Usage: VLLM_COMPUTE_NANS_IN_LOGITS=1 vllm serve model

Pros: Reuses existing infrastructure; no new CLI args.
Cons: Couples metrics exposure to NaN detection and gives less granular control.
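A minimal sketch of Approach 2 (not existing vLLM code; os.environ is used only to keep the snippet self-contained, whereas vLLM reads such flags through its envs module): register the counter only when NaN detection is already enabled.

```python
# Sketch of Approach 2: only register the corrupted-request counter when
# VLLM_COMPUTE_NANS_IN_LOGITS is enabled, so no new CLI flag is needed.
import os
from typing import Optional
from prometheus_client import Counter

def maybe_create_corrupted_counter() -> Optional[Counter]:
    if os.environ.get("VLLM_COMPUTE_NANS_IN_LOGITS", "0") != "1":
        return None  # NaN detection off -> metric not exposed
    return Counter(
        "vllm:num_corrupted_request",
        "Number of requests whose outputs contained NaNs.",
        labelnames=["model_name"],
    )

counter = maybe_create_corrupted_counter()
if counter is not None:
    # Called whenever a finished request is flagged as corrupted.
    counter.labels(model_name="my-model").inc()
```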

Thanks
Snehlata

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
