[core] Node count metric is zero on 2.49.2 but has data on 2.48 when the autoscaler is not enabled #58227

@photoszzt

Description

What happened + What you expected to happen

For a Ray cluster started with the ray:2.49.2-py312-gpu image, the node count metric returns zero.
Querying the metrics endpoint directly shows only a zero-valued series for the head node type, even though two Ray nodes are set up using KubeRay:

curl 10.50.188.221:44217 | grep autoscaler_active_nodes 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15703  100 15703    0     0  5346k      0 --:--:-- --:--:-- --:--:-- 7667k
# HELP autoscaler_active_nodes Number of nodes in the cluster.
# TYPE autoscaler_active_nodes gauge
autoscaler_active_nodes{NodeType="ray.head.default",SessionName="session_2025-10-27_13-47-39_696182_1"} 0.0

For a Ray cluster started with the ray:2.48.0-py312-gpu image, the same query returns one series per node:

curl 10.50.188.211:44217 | grep autoscaler_active_nodes
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16825  100 16825    0     0  8670k      0 --:--:-- --:--:-- --:--:-- 16.0M
# HELP autoscaler_active_nodes Number of nodes in the cluster.
# TYPE autoscaler_active_nodes gauge
autoscaler_active_nodes{NodeType="ray.head.default",SessionName="session_2025-10-27_14-26-15_601476_1"} 0.0
autoscaler_active_nodes{NodeType="node_8bc1b8cbe00436768e0f24379d1adec5d1d7b43fcc4420d92596f96f",SessionName="session_2025-10-27_14-26-15_601476_1"} 1.0
autoscaler_active_nodes{NodeType="node_28bcb3dbdca5f896ddfc704f57f50afa73c1a22b2fbbd6a0cedc49cc",SessionName="session_2025-10-27_14-26-15_601476_1"} 1.0
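To compare the two outputs without reading them by eye, the scraped text can be parsed and summed per NodeType. The following is a minimal sketch, assuming the endpoint returns Prometheus text exposition format as shown above. The function names `active_nodes_by_type` and `total_active_nodes` are hypothetical helpers written for this illustration, not part of Ray, and the regex only handles simple label values like the ones in this issue.

```python
import re

# Matches one sample line such as: name{k="v",k2="v2"} 1.0
# Assumption: label values contain no '}' or escaped quotes, as in the output above.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def active_nodes_by_type(metrics_text, metric="autoscaler_active_nodes"):
    """Return {NodeType: value} for every sample of `metric` in the text."""
    counts = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        counts[labels.get("NodeType", "")] = float(match.group("value"))
    return counts


def total_active_nodes(metrics_text):
    """Sum the gauge over all NodeType series."""
    return sum(active_nodes_by_type(metrics_text).values())
```

Applied to the two outputs above, the 2.49.2 text yields a single `ray.head.default` entry with value 0.0, while the 2.48.0 text yields three entries whose values sum to 2.0.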

Versions / Dependencies

The clusters are started using KubeRay with
image: ray:2.49.2-py312-gpu
and image: ray:2.48.0-py312-gpu

Reproduction script

Start a Ray cluster with each image using KubeRay.
Once a cluster is running, log in to the head node pod and run the command below to find the autoscaler metrics port (the target that uses port 44217):

cat /tmp/ray/prom_metrics_service_discovery.json 

Then query that port and grep for autoscaler_active_nodes.
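The same steps can be scripted from inside the head pod. This is a rough sketch under two assumptions that the issue does not confirm: that the discovery file follows the Prometheus file-based service discovery layout (a JSON list of objects, each with a `targets` list of `host:port` strings), and that the metrics are served at the root path, as in the curl commands above. The helper names are hypothetical.

```python
import json
import urllib.request


def load_discovery(path="/tmp/ray/prom_metrics_service_discovery.json"):
    """Load the service discovery file written on the head node."""
    with open(path) as handle:
        return json.load(handle)


def targets_with_port(discovery, port):
    """Return every "host:port" target in the discovery data that uses `port`.

    Assumption: `discovery` is a list of {"targets": [...], "labels": {...}} objects.
    """
    found = []
    for group in discovery:
        for target in group.get("targets", []):
            if target.rsplit(":", 1)[-1] == str(port):
                found.append(target)
    return found


def fetch_metric_lines(target, metric="autoscaler_active_nodes", timeout=5.0):
    """Fetch http://<target> and keep the lines that mention `metric` (like grep)."""
    with urllib.request.urlopen(f"http://{target}", timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return [line for line in body.splitlines() if metric in line]
```

A typical call would be `targets_with_port(load_discovery(), 44217)` followed by `fetch_metric_lines(...)` on each returned target. Port 44217 is taken from the issue and may differ on other clusters.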

Issue Severity

None

Labels

P1: Issue that should be fixed within a few weeks
bug: Something that is supposed to be working; but isn't
community-backlog
core: Issues that should be addressed in Ray Core
observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
regression
stability
