
REP: RayCluster status improvement #54

Open · kevin85421 wants to merge 1 commit into main

Conversation

@kevin85421 (Member) commented Jul 1, 2024

Signed-off-by: kaihsun <kaihsun@anyscale.com>
@kevin85421 (Member Author):
cc @andrewsykim @rueian would you mind taking a look?

type RayClusterConditionType string

const (
RayClusterSuspending RayClusterConditionType = "Suspending"
Contributor:

Can we make the addition of the new suspending / suspended conditions in step 1 and keep step 7 only about deprecating the old state?

@kevin85421 (Member Author):

I'd prefer to add the suspending and suspended conditions when we actually start working on making the suspend operation atomic. I expect Step 1 to land in v1.2.0 and Step 7 in v1.3.0.

@kevin85421 (Member Author):

If we add the conditions in Step 1, these two conditions may be unused in v1.2.0.
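
For illustration, a minimal sketch of how a `Suspending` condition could be written once that work starts. The helper name, reason, and message below are assumptions, not part of the REP; `meta.SetStatusCondition` is the standard apimachinery helper.

```go
// Sketch only: assumes RayClusterStatus gains Conditions []metav1.Condition and
// that a RayClusterSuspending constant is added to RayClusterConditionType.
package controllers

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setSuspending records whether the cluster is currently being suspended.
func setSuspending(conditions *[]metav1.Condition, suspending bool, reason, msg string) {
	status := metav1.ConditionFalse
	if suspending {
		status = metav1.ConditionTrue
	}
	// SetStatusCondition only bumps LastTransitionTime when the status value changes.
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "Suspending", // string(rayv1.RayClusterSuspending) once the constant exists
		Status:  status,
		Reason:  reason,
		Message: msg,
	})
}
```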

* Reference: [DeploymentAvailable](https://github.com/kubernetes/api/blob/857a946a225f212b64d42c68a7da0dc44837636f/apps/v1/types.go#L532-L542), [LeaderWorkerSetAvailable](https://github.com/kubernetes-sigs/lws/blob/557dfd8b14b8f94633309f6d7633a4929dcc10c3/api/leaderworkerset/v1/leaderworkerset_types.go#L272)
* Add `RayClusterK8sFailure` to surface the Kubernetes resource failures that are not Pods.

### Step 2: Remove `rayv1.Failed` from `Status.State`.
Contributor:

Remove or deprecate?

@kevin85421 (Member Author):

We can leave `Failed ClusterState = "failed"` in raycluster_types.go. For `Status.State`, I prefer that we simply stop assigning `rayv1.Failed` to it.

@kevin85421 (Member Author):

I have updated the Google doc.

### Step 2: Remove `rayv1.Failed` from `Status.State`.
* Add the information about Pod failures to `RayClusterReplicaFailure` instead.

### Step 3: Make sure every reconciliation that changes the status goes through `inconsistentRayClusterStatus`, and only call `r.Status().Update(...)` when `inconsistentRayClusterStatus` returns true.
Contributor:

I don't think this needs to be its own step; it would have to happen when we introduce the new conditions in step 1, right?

@kevin85421 (Member Author):

Would you mind adding more details? In my plan, step 1 just updates the CRD.
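
For context, a rough sketch of the gating Step 3 describes. `inconsistentRayClusterStatus` is the helper named in the REP, but its signature here, along with `calculateStatus` and the reconciler wiring, are assumptions.

```go
// Sketch only: the shape of the Step 3 flow, not the actual controller code.
func (r *RayClusterReconciler) updateStatusIfNeeded(ctx context.Context, original, instance *rayv1.RayCluster) error {
	newStatus := r.calculateStatus(instance) // hypothetical helper that derives the desired status
	// Skip the API call when nothing changed, so reconciles that produce an
	// identical status do not generate extra UPDATE requests and requeues.
	if !r.inconsistentRayClusterStatus(original.Status, newStatus) {
		return nil
	}
	instance.Status = newStatus
	return r.Status().Update(ctx, instance)
}
```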

* Future works
* Add `RayClusterAvailable` or `RayClusterReady` to `RayClusterConditionType` in the future for the RayCluster readiness.
* Reference: [DeploymentAvailable](https://github.com/kubernetes/api/blob/857a946a225f212b64d42c68a7da0dc44837636f/apps/v1/types.go#L532-L542), [LeaderWorkerSetAvailable](https://github.com/kubernetes-sigs/lws/blob/557dfd8b14b8f94633309f6d7633a4929dcc10c3/api/leaderworkerset/v1/leaderworkerset_types.go#L272)
* Add `RayClusterK8sFailure` to surface the Kubernetes resource failures that are not Pods.
Contributor:

It might actually be worthwhile to introduce a generic condition type like `RayClusterInitializing` that encapsulates all the other resource dependencies that aren't Pods.

Comment:

Probably not? Additional conditions for other resource dependencies are good, but `RayClusterInitializing` may be too vague and could easily be misused. It smells like another `Ready`.

Contributor:

`RayClusterK8sFailure` is pretty vague too. Maybe `RayClusterResourceInitialization` is more specific?

Comment:

`RayClusterK8sFailure` is indeed vague. I think conditions named `RayClusterXXXFailure`, where `XXX` is specific, are better.

@kevin85421 (Member Author):

I am still considering whether to add the field or not. Maybe creating K8s events for the non-Pod failures is enough?
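
For comparison, a minimal sketch of the event-based alternative mentioned here, using the standard `record.EventRecorder` from client-go; the recorder field, reason string, and surrounding reconcile code are assumptions.

```go
// Sketch only: surface a non-Pod resource failure as a Kubernetes event
// instead of (or in addition to) a dedicated condition.
if err := r.Create(ctx, headService); err != nil {
	// r.Recorder is assumed to be a record.EventRecorder wired up in SetupWithManager.
	r.Recorder.Eventf(instance, corev1.EventTypeWarning, "FailedToCreateHeadService",
		"failed to create head Service %s/%s: %v", headService.Namespace, headService.Name, err)
	return ctrl.Result{}, err
}
```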

HeadReady RayClusterConditionType = "HeadReady"
// Pod failures. See DeploymentReplicaFailure and
// ReplicaSetReplicaFailure for more details.
RayClusterReplicaFailure RayClusterConditionType = "ReplicaFailure"
Contributor:

In a follow-up iteration of the conditions, I think a condition like `AllWorkersReady`, indicating that all worker replicas are ready, will be useful too. The example that comes to mind is the RayJob controller needing to know when a job should start (assuming we deprecate the `status.state` field). Relying on head readiness alone won't be enough.

@kevin85421 (Member Author):

How about using `AllRayPodsReady`, which indicates that both the head and the workers are ready? If we use `AllWorkersReady`, we need to check both `AllWorkersReady` and `HeadReady` to determine whether we can submit the job. If we have `AllRayPodsReady`, we only need to check `AllRayPodsReady`.
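
If an aggregate condition is added, a consumer such as the RayJob controller could gate job submission on a single lookup. The condition names below are the ones being discussed and are not final; `meta.IsStatusConditionTrue` is the standard apimachinery helper.

```go
// Sketch only: assumes Status.Conditions is []metav1.Condition; condition names are not final.
func clusterReadyForSubmission(cluster *rayv1.RayCluster) bool {
	// With a single aggregate condition, one check is enough.
	if meta.IsStatusConditionTrue(cluster.Status.Conditions, "AllRayPodsReady") {
		return true
	}
	// Otherwise the RayJob controller has to combine head and worker readiness itself.
	return meta.IsStatusConditionTrue(cluster.Status.Conditions, "HeadReady") &&
		meta.IsStatusConditionTrue(cluster.Status.Conditions, "AllWorkersReady")
}
```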

@kevin85421 (Member Author):
Hey folks, I will address comments by updating the Google doc. I will sync this PR with the Google doc periodically.

* I am considering making `ready` a condition in `Status.Conditions` and removing `rayv1.Ready` from `Status.State`. Kubernetes Pods also expose readiness through a [PodCondition](https://pkg.go.dev/k8s.io/api/core/v1#PodConditionType), but this may be a bit aggressive.
* In the future, we can expose a way for users to define their own notion of `ready` ([#1631](https://github.com/ray-project/kuberay/issues/1631)).

## Implementation
Contributor:

There are some breaking changes here; could you talk about why you think it's OK to have them?

Contributor:

@kevin85421 I don't think we need to make any breaking changes, right? We can deprecate the existing `.status.state` field and point users to the new conditions without deleting `.status.state`.
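
A small sketch of the non-breaking shape this suggests for the status struct; the comments and field ordering are illustrative, not the final API.

```go
// Sketch only: keep the old field, mark it deprecated, and add Conditions alongside it.
type RayClusterStatus struct {
	// Deprecated: use Conditions instead. Kept so existing clients do not break.
	State ClusterState `json:"state,omitempty"`

	// Conditions is the richer replacement for State
	// (e.g. HeadReady, ReplicaFailure, Suspending).
	// +optional
	Conditions []metav1.Condition `json:"conditions,omitempty"`

	// ...other existing fields unchanged...
}
```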
