REP: RayCluster status improvement #54
Conversation
cc @andrewsykim @rueian would you mind taking a look?
type RayClusterConditionType string

const (
    RayClusterSuspending RayClusterConditionType = "Suspending"
Can we make the addition of the new suspending / suspended conditions in step 1 and keep step 7 only about deprecating the old state?
I would prefer to add the suspending and suspended conditions when we decide to start working on making the suspend operation atomic. In my expectation, Step 1 will be in v1.2.0, but Step 7 will be in v1.3.0.
If we add the conditions in Step 1, these two conditions may be unused in v1.2.0.
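For illustration only, here is a minimal sketch of what the two conditions under discussion could look like once the atomic-suspend work starts. The `RayClusterSuspended` constant name and the comments are assumptions, not part of the current diff:

```go
package rayv1

// RayClusterConditionType matches the type quoted in the diff above.
type RayClusterConditionType string

const (
	// RayClusterSuspending: the controller has started deleting Pods for an
	// in-progress suspend operation.
	RayClusterSuspending RayClusterConditionType = "Suspending"
	// RayClusterSuspended (assumed name): all Pods are gone and the suspend
	// operation has completed.
	RayClusterSuspended RayClusterConditionType = "Suspended"
)
```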
* Reference: [DeploymentAvailable](https://github.com/kubernetes/api/blob/857a946a225f212b64d42c68a7da0dc44837636f/apps/v1/types.go#L532-L542), [LeaderWorkerSetAvailable](https://github.com/kubernetes-sigs/lws/blob/557dfd8b14b8f94633309f6d7633a4929dcc10c3/api/leaderworkerset/v1/leaderworkerset_types.go#L272)
* Add `RayClusterK8sFailure` to surface the Kubernetes resource failures that are not Pods.
### Step 2: Remove `rayv1.Failed` from `Status.State`.
Remove or deprecate?
We can leave `Failed ClusterState = "failed"` in `raycluster_types.go`. For `Status.State`, I prefer not to assign `Status.State` to `rayv1.Failed` anymore.
I have updated it in the Google doc.
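A rough sketch of the "keep the constant, stop assigning it" approach described above; the deprecation comment wording and the surrounding constants are assumptions:

```go
package rayv1

// ClusterState mirrors the existing string enum in raycluster_types.go.
type ClusterState string

const (
	Ready ClusterState = "ready"
	// Deprecated: the controller no longer assigns Failed to Status.State.
	// Pod failures are surfaced through the ReplicaFailure condition instead.
	Failed ClusterState = "failed"
)
```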
### Step 2: Remove `rayv1.Failed` from `Status.State`.
* Add the information about Pod failures to `RayClusterReplicaFailure` instead.
### Step 3: Make sure that every reconciliation which changes the status goes through `inconsistentRayClusterStatus`, and that we only call `r.Status().Update(...)` when `inconsistentRayClusterStatus` returns true.
I don't think this needs to be its own step; it would have to happen when we introduce the new conditions in step 1, right?
Would you mind adding more details? In my plan, step 1 just updates the CRD.
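To make the intent of Step 3 concrete, here is a simplified sketch of the update-gating pattern. The helper signatures, import paths, and the `reflect.DeepEqual` comparison are assumptions, not the actual KubeRay implementation:

```go
package controllers

import (
	"context"
	"reflect"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// RayClusterReconciler is a stand-in for the real reconciler type.
type RayClusterReconciler struct {
	client.Client
}

// inconsistentRayClusterStatus reports whether the newly calculated status
// differs from the stored one. The real helper compares selected fields;
// reflect.DeepEqual is a simplification for this sketch.
func inconsistentRayClusterStatus(oldStatus, newStatus rayv1.RayClusterStatus) bool {
	return !reflect.DeepEqual(oldStatus, newStatus)
}

// updateStatusIfNeeded shows the gating pattern: every status change flows
// through inconsistentRayClusterStatus, and r.Status().Update(...) is called
// only when it returns true, so no-op reconciliations skip the API call.
func (r *RayClusterReconciler) updateStatusIfNeeded(ctx context.Context, instance *rayv1.RayCluster, newStatus rayv1.RayClusterStatus) error {
	if !inconsistentRayClusterStatus(instance.Status, newStatus) {
		return nil
	}
	instance.Status = newStatus
	return r.Status().Update(ctx, instance)
}
```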
* Future work
* Add `RayClusterAvailable` or `RayClusterReady` to `RayClusterConditionType` in the future for the RayCluster readiness.
* Reference: [DeploymentAvailable](https://github.com/kubernetes/api/blob/857a946a225f212b64d42c68a7da0dc44837636f/apps/v1/types.go#L532-L542), [LeaderWorkerSetAvailable](https://github.com/kubernetes-sigs/lws/blob/557dfd8b14b8f94633309f6d7633a4929dcc10c3/api/leaderworkerset/v1/leaderworkerset_types.go#L272)
* Add `RayClusterK8sFailure` to surface the Kubernetes resource failures that are not Pods.
It might actually be worthwhile to introduce a generic condition type like `RayClusterInitializing` that encapsulates all the other resource dependencies that aren't Pods.
Probably not? Additional conditions for other resource dependencies are good, but a `RayClusterInitializing` may be too vague and could be easily misused. It smells like another `Ready`.
`RayClusterK8sFailure` is pretty vague too. Maybe `RayClusterResourceInitialization` is more specific?
`RayClusterK8sFailure` is indeed vague. I think conditions like `RayClusterXXXFailure`, where `XXX` is specific, are better.
I am still considering whether to add the field or not. Maybe creating K8s events for the non-Pod failures is enough?
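A sketch of the event-based alternative mentioned above, assuming the reconciler already has access to an `EventRecorder`; the helper name and the reason/message strings are illustrative, not an existing KubeRay API:

```go
package controllers

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// emitNonPodFailureEvent records a Warning event on the RayCluster when a
// dependent Kubernetes resource that is not a Pod (Service, Ingress, ...)
// fails to reconcile, instead of introducing a RayClusterK8sFailure condition.
func emitNonPodFailureEvent(recorder record.EventRecorder, cluster *rayv1.RayCluster, resourceKind string, err error) {
	recorder.Event(cluster, corev1.EventTypeWarning,
		fmt.Sprintf("FailedToReconcile%s", resourceKind),
		fmt.Sprintf("Failed to reconcile %s: %v", resourceKind, err))
}
```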
    HeadReady RayClusterConditionType = "HeadReady"
    // Pod failures. See DeploymentReplicaFailure and
    // ReplicaSetReplicaFailure for more details.
    RayClusterReplicaFailure RayClusterConditionType = "ReplicaFailure"
In a follow-up iteration of the conditions, I think a condition like `AllWorkersReady`, indicating that all worker replicas are ready, will be useful too. The example that comes to mind is the RayJob controller needing to know when a job should start (assuming we deprecate the `status.state` field). Relying on head readiness won't be enough.
How about using `AllRayPodsReady`, which indicates that both the head and workers are ready? If we use `AllWorkersReady`, we need to check both `AllWorkersReady` and `HeadReady` to determine whether we can submit the job. If we have `AllRayPodsReady`, we only need to check `AllRayPodsReady`.
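As an illustration of the consumer side, assuming `Status.Conditions` ends up as a `[]metav1.Condition` and a single `AllRayPodsReady` condition exists (both are assumptions at this point), the RayJob controller check would reduce to one lookup:

```go
package controllers

import (
	"k8s.io/apimachinery/pkg/api/meta"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// shouldSubmitJob gates job submission on the hypothetical AllRayPodsReady
// condition, instead of checking HeadReady and AllWorkersReady separately.
func shouldSubmitJob(cluster *rayv1.RayCluster) bool {
	return meta.IsStatusConditionTrue(cluster.Status.Conditions, "AllRayPodsReady")
}
```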
Hey folks, I will address comments by updating the Google doc. I will sync this PR with the Google doc periodically.
* I am considering making `ready` a condition in `Status.Conditions` and removing `rayv1.Ready` from `Status.State`. Kubernetes Pods also define a [PodCondition](https://pkg.go.dev/k8s.io/api/core/v1#PodConditionType), but this may be a bit aggressive.
* In the future, we can expose a way for users to self-define the definition of `ready` ([#1631](https://github.com/ray-project/kuberay/issues/1631)).

## Implementation
There are some breaking changes. Could you talk about why you think it's OK to have those breaking changes?
@kevin85421 I don't think we need to make any breaking changes, right? We can deprecate the existing `.status.state` field and point users to the new conditions without deleting `.status.state`.
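A sketch of that non-breaking path: keep `.status.state` and mark it deprecated while `Conditions` becomes the recommended surface. The field comments and the exact deprecation wording are assumptions, not the final API:

```go
package rayv1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ClusterState is the existing string enum used by Status.State.
type ClusterState string

// RayClusterStatus (abridged) keeps the old field alongside the new one.
type RayClusterStatus struct {
	// Deprecated: use Conditions instead. Kept so existing clients keep
	// working; no existing value is removed from the API.
	State ClusterState `json:"state,omitempty"`

	// Conditions is the extensible representation of RayCluster status
	// proposed in this REP.
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```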
This is just a copy of https://docs.google.com/document/d/1bRL0cZa87eCX6SI7gqthN68CgmHaB6l3-vJuIse-BrY/edit.