Description
What happened + What you expected to happen
I am testing the new JaxTrainer, which uses bundle label selectors for its placement group bundles, and I am hitting a bug where the bundle label selector is not respected. This breaks multi-host training on my RayCluster, which has one 4x4 slice (4 nodes) and a couple of single-host nodes.
Here are the label selectors from the placement group table in the Ray Dashboard, which look correct:
The node labels look correct as well. Here are the node labels from a node in the 4x4 topology:
And here are the labels of the single-host node:
Both the placement group label selector and the Raylet labels look correct. However, when I run my job, the logs show that 2 train workers run on the single-host node and the other 2 run on the 4x4 slice. Based on the label selectors, I expect all 4 train workers to run on the 4x4 slice.
I checked the job task table and also confirmed a bunch of tasks running on the single-host workers:
Versions / Dependencies
Ray nightly build, running on v6e TPUs on GKE
Reproduction script
Deploy a RayCluster with 2 worker groups (a 4x4 slice and a single-host pool), then submit a training job that targets the 4x4 slice.
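For anyone reproducing this, the expected behavior can be stated as a small check: every worker should land on a node whose labels satisfy the bundle's selector. The sketch below is plain Python and does not call Ray. It models only exact-match selectors, and the label key `example/topology`, the node names, and the worker names are hypothetical placeholders rather than the values from this cluster.

```python
from typing import Dict, List

Labels = Dict[str, str]


def matches(selector: Labels, node_labels: Labels) -> bool:
    """Return True if the node carries every key/value pair in the selector.

    Simplified model: exact equality only, not Ray's full selector syntax.
    """
    return all(node_labels.get(key) == value for key, value in selector.items())


def eligible_nodes(selector: Labels, nodes: Dict[str, Labels]) -> List[str]:
    """Names of nodes whose labels satisfy the selector, in insertion order."""
    return [name for name, labels in nodes.items() if matches(selector, labels)]


def misplaced_workers(
    selector: Labels, nodes: Dict[str, Labels], placements: Dict[str, str]
) -> Dict[str, str]:
    """Map each worker that runs on a non-matching node to that node's name."""
    allowed = set(eligible_nodes(selector, nodes))
    return {w: node for w, node in placements.items() if node not in allowed}


# Hypothetical cluster: four nodes in a 4x4 slice plus one single-host node.
nodes = {
    "slice-node-0": {"example/topology": "4x4"},
    "slice-node-1": {"example/topology": "4x4"},
    "slice-node-2": {"example/topology": "4x4"},
    "slice-node-3": {"example/topology": "4x4"},
    "single-host-0": {"example/topology": "1x1"},
}
selector = {"example/topology": "4x4"}

# Hypothetical placement mirroring the report: 2 workers on the slice,
# 2 workers on the single-host node.
placements = {
    "worker-0": "slice-node-0",
    "worker-1": "slice-node-1",
    "worker-2": "single-host-0",
    "worker-3": "single-host-0",
}

print(eligible_nodes(selector, nodes))
print(misplaced_workers(selector, nodes, placements))
```

Filling `nodes` and `placements` with the real node labels and the worker-to-node mapping from the job logs would list exactly which workers violate the selector.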
Issue Severity
None