Closed
Labels: P1, bug, community-backlog, core, observability, regression, stability
Description
What happened + What you expected to happen
For a Ray cluster started with the ray:2.49.2-py312-gpu image, the node count returns zero.
Querying the metrics endpoint directly reports the metric as zero, even though two Ray nodes are set up using KubeRay:
curl 10.50.188.221:44217 | grep autoscaler_active_nodes
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 15703 100 15703 0 0 5346k 0 --:--:-- --:--:-- --:--:-- 7667k
# HELP autoscaler_active_nodes Number of nodes in the cluster.
# TYPE autoscaler_active_nodes gauge
autoscaler_active_nodes{NodeType="ray.head.default",SessionName="session_2025-10-27_13-47-39_696182_1"} 0.0
For a Ray cluster started with the ray:2.48.0-py312-gpu image, the worker nodes are reported correctly:
curl 10.50.188.211:44217 | grep autoscaler_active_nodes
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 16825 100 16825 0 0 8670k 0 --:--:-- --:--:-- --:--:-- 16.0M
# HELP autoscaler_active_nodes Number of nodes in the cluster.
# TYPE autoscaler_active_nodes gauge
autoscaler_active_nodes{NodeType="ray.head.default",SessionName="session_2025-10-27_14-26-15_601476_1"} 0.0
autoscaler_active_nodes{NodeType="node_8bc1b8cbe00436768e0f24379d1adec5d1d7b43fcc4420d92596f96f",SessionName="session_2025-10-27_14-26-15_601476_1"} 1.0
autoscaler_active_nodes{NodeType="node_28bcb3dbdca5f896ddfc704f57f50afa73c1a22b2fbbd6a0cedc49cc",SessionName="session_2025-10-27_14-26-15_601476_1"} 1.0
Versions / Dependencies
The clusters are started using KubeRay with
image: ray:2.49.2-py312-gpu
and image: ray:2.48.0-py312-gpu
Reproduction script
Start a Ray cluster with each image using KubeRay.
Once started, log in to the head node pod and run the command below to find the autoscaler metrics port (the endpoint on port 44217 above):
cat /tmp/ray/prom_metrics_service_discovery.json
Then query that port and grep for autoscaler_active_nodes.
Issue Severity
None