
REP: RayCluster status improvement #54

Open · kevin85421 wants to merge 1 commit into main

Conversation

@kevin85421 (Member) commented Jul 1, 2024

Signed-off-by: kaihsun <kaihsun@anyscale.com>
@kevin85421 (Member Author):
cc @andrewsykim @rueian would you mind taking a look?

type RayClusterConditionType string

const (
RayClusterSuspending RayClusterConditionType = "Suspending"
Contributor:

Can we make the addition of the new suspending / suspended conditions in step 1 and keep step 7 only about deprecating the old state?

@kevin85421 (Member Author):

I'd prefer to add the suspending and suspended conditions when we actually start working on making the suspend operation atomic. I expect Step 1 to land in v1.2.0 and Step 7 in v1.3.0.

@kevin85421 (Member Author):

If we add the conditions in Step 1, these two conditions may be unused in v1.2.0.
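
For illustration, a minimal sketch of how a `Suspending` condition could be written once that work starts. The helper name, reason, and message below are assumptions, not part of the REP; `meta.SetStatusCondition` is the standard apimachinery helper.

```go
// Sketch only: assumes RayClusterStatus gains Conditions []metav1.Condition and
// that a RayClusterSuspending constant is added to RayClusterConditionType.
package controllers

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setSuspending records whether the cluster is currently being suspended.
func setSuspending(conditions *[]metav1.Condition, suspending bool, reason, msg string) {
	status := metav1.ConditionFalse
	if suspending {
		status = metav1.ConditionTrue
	}
	// SetStatusCondition only bumps LastTransitionTime when the status value changes.
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "Suspending", // string(rayv1.RayClusterSuspending) once the constant exists
		Status:  status,
		Reason:  reason,
		Message: msg,
	})
}
```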

* Reference: [DeploymentAvailable](https://github.com/kubernetes/api/blob/857a946a225f212b64d42c68a7da0dc44837636f/apps/v1/types.go#L532-L542), [LeaderWorkerSetAvailable](https://github.com/kubernetes-sigs/lws/blob/557dfd8b14b8f94633309f6d7633a4929dcc10c3/api/leaderworkerset/v1/leaderworkerset_types.go#L272)
* Add `RayClusterK8sFailure` to surface the Kubernetes resource failures that are not Pods.

### Step 2: Remove `rayv1.Failed` from `Status.State`.
Contributor:

Remove or deprecate?

@kevin85421 (Member Author):

We can leave `Failed ClusterState = "failed"` in raycluster_types.go. For `Status.State`, I prefer that we simply stop assigning `rayv1.Failed` to it.

@kevin85421 (Member Author):

I have updated the Google doc.

### Step 2: Remove `rayv1.Failed` from `Status.State`.
* Add the information about Pod failures to `RayClusterReplicaFailure` instead.

### Step 3: Make sure every reconciliation that changes the status goes through `inconsistentRayClusterStatus`, and only call `r.Status().Update(...)` when `inconsistentRayClusterStatus` returns true.
Contributor:

I don't think this needs to be its own step; it would have to happen when we introduce the new conditions in step 1, right?

@kevin85421 (Member Author):

Would you mind adding more details? In my plan, step 1 just updates the CRD.
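
For context, a rough sketch of the gating Step 3 describes. `inconsistentRayClusterStatus` is the helper named in the REP, but its signature here, along with `calculateStatus` and the reconciler wiring, are assumptions.

```go
// Sketch only: the shape of the Step 3 flow, not the actual controller code.
func (r *RayClusterReconciler) updateStatusIfNeeded(ctx context.Context, original, instance *rayv1.RayCluster) error {
	newStatus := r.calculateStatus(instance) // hypothetical helper that derives the desired status
	// Skip the API call when nothing changed, so reconciles that produce an
	// identical status do not generate extra UPDATE requests and requeues.
	if !r.inconsistentRayClusterStatus(original.Status, newStatus) {
		return nil
	}
	instance.Status = newStatus
	return r.Status().Update(ctx, instance)
}
```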

* Future works
* Add `RayClusterAvailable` or `RayClusterReady` to `RayClusterConditionType` in the future for the RayCluster readiness.
* Reference: [DeploymentAvailable](https://github.com/kubernetes/api/blob/857a946a225f212b64d42c68a7da0dc44837636f/apps/v1/types.go#L532-L542), [LeaderWorkerSetAvailable](https://github.com/kubernetes-sigs/lws/blob/557dfd8b14b8f94633309f6d7633a4929dcc10c3/api/leaderworkerset/v1/leaderworkerset_types.go#L272)
* Add `RayClusterK8sFailure` to surface the Kubernetes resource failures that are not Pods.
Contributor:

It might actually be worthwhile to introduce a generic condition type like `RayClusterInitializing` that encapsulates all the other resource dependencies that aren't Pods.

Comment:

Probably not? Additional conditions for other resource dependencies are good, but `RayClusterInitializing` may be too vague and could easily be misused. It smells like another `Ready`.

Contributor:

`RayClusterK8sFailure` is pretty vague too. Maybe `RayClusterResourceInitialization` is more specific?

Comment:

`RayClusterK8sFailure` is indeed vague. I think conditions named `RayClusterXXXFailure`, where `XXX` is specific, are better.

@kevin85421 (Member Author):

I am still considering whether to add the field or not. Maybe creating K8s events for the non-Pod failures is enough?
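
For comparison, a minimal sketch of the event-based alternative mentioned here, using the standard `record.EventRecorder` from client-go; the recorder field, reason string, and surrounding reconcile code are assumptions.

```go
// Sketch only: surface a non-Pod resource failure as a Kubernetes event
// instead of (or in addition to) a dedicated condition.
if err := r.Create(ctx, headService); err != nil {
	// r.Recorder is assumed to be a record.EventRecorder wired up in SetupWithManager.
	r.Recorder.Eventf(instance, corev1.EventTypeWarning, "FailedToCreateHeadService",
		"failed to create head Service %s/%s: %v", headService.Namespace, headService.Name, err)
	return ctrl.Result{}, err
}
```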

HeadReady RayClusterConditionType = "HeadReady"
// Pod failures. See DeploymentReplicaFailure and
// ReplicaSetReplicaFailure for more details.
RayClusterReplicaFailure RayClusterConditionType = "ReplicaFailure"
Contributor:

In a follow-up iteration of the conditions, I think a condition like `AllWorkersReady`, indicating that all worker replicas are ready, will be useful too. The example that comes to mind is the RayJob controller needing to know when a job should start (assuming we deprecate the `status.state` field). Relying on head readiness alone won't be enough.

@kevin85421 (Member Author):

How about using `AllRayPodsReady`, which indicates that both the head and the workers are ready? If we use `AllWorkersReady`, we need to check both `AllWorkersReady` and `HeadReady` to determine whether we can submit the job. If we have `AllRayPodsReady`, we only need to check `AllRayPodsReady`.
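
If an aggregate condition is added, a consumer such as the RayJob controller could gate job submission on a single lookup. The condition names below are the ones being discussed and are not final; `meta.IsStatusConditionTrue` is the standard apimachinery helper.

```go
// Sketch only: assumes Status.Conditions is []metav1.Condition; condition names are not final.
func clusterReadyForSubmission(cluster *rayv1.RayCluster) bool {
	// With a single aggregate condition, one check is enough.
	if meta.IsStatusConditionTrue(cluster.Status.Conditions, "AllRayPodsReady") {
		return true
	}
	// Otherwise the RayJob controller has to combine head and worker readiness itself.
	return meta.IsStatusConditionTrue(cluster.Status.Conditions, "HeadReady") &&
		meta.IsStatusConditionTrue(cluster.Status.Conditions, "AllWorkersReady")
}
```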

@kevin85421 (Member Author):
Hey folks, I will address comments by updating the Google doc. I will sync this PR with the Google doc periodically.

* I am considering making `ready` a condition in `Status.Conditions` and removing `rayv1.Ready` from `Status.State`. Kubernetes Pods also expose readiness through a [PodCondition](https://pkg.go.dev/k8s.io/api/core/v1#PodConditionType), but this may be a bit aggressive.
* In the future, we can expose a way for users to define their own notion of `ready` ([#1631](https://github.com/ray-project/kuberay/issues/1631)).

## Implementation
Contributor:

There are some breaking changes here; could you talk about why you think it's OK to have them?

Contributor:

@kevin85421 I don't think we need to make any breaking changes, right? We can deprecate the existing `.status.state` field and point users to the new conditions without deleting `.status.state`.
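
A small sketch of the non-breaking shape this suggests for the status struct; the comments and field ordering are illustrative, not the final API.

```go
// Sketch only: keep the old field, mark it deprecated, and add Conditions alongside it.
type RayClusterStatus struct {
	// Deprecated: use Conditions instead. Kept so existing clients do not break.
	State ClusterState `json:"state,omitempty"`

	// Conditions is the richer replacement for State
	// (e.g. HeadReady, ReplicaFailure, Suspending).
	// +optional
	Conditions []metav1.Condition `json:"conditions,omitempty"`

	// ...other existing fields unchanged...
}
```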
