[core] Placement Group bundle label selectors are not respected #55590

@andrewsykim

What happened + What you expected to happen

I am testing the new JaxTrainer, which uses bundle label selectors for its placement group bundles, and I am hitting a bug where the bundle label selectors are not respected. This breaks multi-host training on my RayCluster, which has a 4x4 slice (4 nodes) and a couple of single-host nodes.

Here are the label selectors in the placement group table from the Ray Dashboard, which look correct:

Image

The node labels look correct as well. Here are the node labels from a node in the 4x4 topology:

Image

And here are the labels of the single-host node:

Image

Both the placement group label selectors and the Raylet labels look correct. However, when I run my job, the logs show that 2 train workers run on the single-host node and the other 2 run on the 4x4 slice. Based on the label selectors, I expect all 4 train workers to run on the 4x4 slice.
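
To make the expectation concrete, here is a minimal pure-Python sketch of the matching behavior I would expect. It is not Ray's scheduler code: the label key `example.com/tpu-topology`, its values, and the node names are hypothetical stand-ins for the labels shown in the screenshots, and it only models equality selectors.

```python
from typing import Dict, List

Labels = Dict[str, str]


def node_matches(selector: Labels, node_labels: Labels) -> bool:
    """Return True when every key/value in the selector equals the node's label."""
    return all(node_labels.get(key) == value for key, value in selector.items())


def eligible_nodes(selector: Labels, nodes: Dict[str, Labels]) -> List[str]:
    """List the names of nodes whose labels satisfy the selector."""
    return [name for name, labels in nodes.items() if node_matches(selector, labels)]


# Hypothetical cluster: four nodes in the 4x4 slice plus two single-host nodes.
# The label key and values are assumptions made for illustration only.
nodes: Dict[str, Labels] = {
    "slice-node-0": {"example.com/tpu-topology": "4x4"},
    "slice-node-1": {"example.com/tpu-topology": "4x4"},
    "slice-node-2": {"example.com/tpu-topology": "4x4"},
    "slice-node-3": {"example.com/tpu-topology": "4x4"},
    "single-host-0": {"example.com/tpu-topology": "2x2"},
    "single-host-1": {"example.com/tpu-topology": "2x2"},
}

# One selector per placement group bundle (one bundle per train worker).
bundle_label_selectors: List[Labels] = [
    {"example.com/tpu-topology": "4x4"} for _ in range(4)
]

# For each bundle, compute which nodes it may legally be placed on.
candidates = [eligible_nodes(selector, nodes) for selector in bundle_label_selectors]

for index, names in enumerate(candidates):
    print(f"bundle {index}: {names}")
```

Under these assumptions no bundle should ever list a single-host node as a candidate, which is the opposite of what I observe in the job logs.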

I checked the job task table and also confirmed a bunch of tasks running on the single-host workers:

Image

Versions / Dependencies

Nightly version of Ray, running on v6e TPUs on GKE

Reproduction script

Deploy a RayCluster with 2 worker groups (4x4 slice and single-host pool). Deploy a training job to the 4x4 slice.

Issue Severity

None

Metadata

Labels

P1 (issue that should be fixed within a few weeks), bug (something that is supposed to be working, but isn't), community-backlog, core (issues that should be addressed in Ray Core), stability
