Description
Currently, vLLM internally tracks a corrupted_requests_counter metric whenever a request produces invalid outputs (NaNs) due to model, engine, or hardware issues. However, this metric is not directly exposed to users in logs or Prometheus metrics.
Exposing this metric would allow users to:
- Detect model instability or misbehaving custom models.
- Monitor runtime/engine health in production clusters.
- Quickly identify hardware or distributed-inference issues affecting outputs.
Motivation & Problem
While NaN outputs are rare with well-tested models, they are a real concern for custom models in early development, and they can also arise from engine/runtime issues:
- Custom models may have numerical instability.
- Engine or distributed-runtime bugs can corrupt outputs.
- Hardware faults are more likely to surface as NaNs in logits.
The codebase already detects corrupted requests (Request.is_output_corrupted) when the environment variable VLLM_COMPUTE_NANS_IN_LOGITS=1 is set, but this diagnostic information is completely hidden from users: there are no metrics, no logging, and no monitoring.
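To make the detection path concrete, here is a minimal sketch of the idea: scan a request's logits for NaNs and bump a process-wide counter when any are found. The names `CorruptedRequestTracker` and `record_output` are illustrative, not vLLM's actual internals.

```python
import math


class CorruptedRequestTracker:
    """Illustrative stand-in for vLLM's internal corrupted-request counting."""

    def __init__(self):
        self.corrupted_requests_total = 0

    def record_output(self, request_id: str, logits: list) -> bool:
        """Return True (and count it) if the request's logits contain NaNs."""
        corrupted = any(math.isnan(x) for x in logits)
        if corrupted:
            self.corrupted_requests_total += 1
        return corrupted


tracker = CorruptedRequestTracker()
tracker.record_output("req-0", [0.1, 0.9])          # healthy output
tracker.record_output("req-1", [0.2, float("nan")])  # corrupted output
print(tracker.corrupted_requests_total)  # 1
```

Exposing `corrupted_requests_total` as a Prometheus gauge/counter (alongside vLLM's existing `vllm:`-prefixed metrics) is the part this issue proposes.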
Proposed Idea
Two approaches for exposing corrupted-request metrics that I am considering:
- Approach 1: CLI config-based (current implementation)
  - CLI flag: --include-corrupted-requests
  - Config: SchedulerConfig.include_corrupted_requests
  - Usage: vllm serve model --include-corrupted-requests
  - Pros: user-friendly, explicit control, follows existing vLLM patterns.
  - Cons: adds a new CLI argument and requires config changes.
I welcome suggestions and thoughts on this, and would love to contribute the implementation.
Alternatives
- Approach 2: Environment variable-based (proposed alternative)
  - Reuses the existing VLLM_COMPUTE_NANS_IN_LOGITS environment variable.
  - Logic: when NaN detection is enabled, corrupted-request metrics are exposed automatically.
  - Usage: VLLM_COMPUTE_NANS_IN_LOGITS=1 vllm serve model
  - Pros: reuses existing infrastructure, no new CLI args.
  - Cons: couples metrics exposure to NaN detection; less granular control.
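The gating logic for Approach 2 would be small; a hedged sketch (the helper name `should_expose_corrupted_metric` is hypothetical, and the accepted truthy values are an assumption):

```python
import os


def should_expose_corrupted_metric(env=None):
    """Return True when NaN detection, and thus metric exposure, is enabled.

    Accepts an explicit env mapping for testing; defaults to os.environ.
    """
    env = os.environ if env is None else env
    return env.get("VLLM_COMPUTE_NANS_IN_LOGITS", "0").lower() in ("1", "true")


print(should_expose_corrupted_metric({"VLLM_COMPUTE_NANS_IN_LOGITS": "1"}))  # True
print(should_expose_corrupted_metric({}))  # False
```

This keeps the change footprint minimal, at the cost of the coupling noted above.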
Thanks
Snehlata
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.