
[Feature] Disable zero downtime upgrade for a RayService using RayServiceSpec #2468

Open · chiayi wants to merge 1 commit into master from rayservice-upgrade
Conversation

@chiayi chiayi commented Oct 23, 2024

Why are these changes needed?

To allow disabling zero-downtime upgrades for a specific RayService.

Related issue number

Part of #2397

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

type UpgradeStrategy string

const (
	BlueGreenUpgrade UpgradeStrategy = "BlueGreenUpgrade"
)
Collaborator:

Thinking of alternatives out loud:

  1. RollingUpdate / Recreate
  2. CreateThenDelete / DeleteThenCreate
  3. CreateFirst / DeleteFirst

Collaborator:

Or perhaps we can follow upgrade strategy types similar to StatefulSet: RollingUpdate / OnDelete https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies

Collaborator:

I can see "RollingUpdate" being misleading because we only ever update 1 new RayCluster at a time, so it's not really a rolling update

Contributor Author:

I think we can use "OnDelete", since it fits both "delete first" and "delete then create". For zero downtime, maybe we can name it something like "PrepareUpdateThenSwap" to describe that the updated cluster gets ready before the swap?

Collaborator:

OnDelete implies the user has to manually delete the cluster first. That's a new behavior, which is maybe okay? If we are still automatically deleting the cluster, then it should just be something like Delete, DeleteFirst, or Replace.

Contributor Author:

Ahh I see, so like upgrade OnDelete. If that's the case, my vote would be for "Replace" or "DeleteFirst".

Collaborator:

Thinking about this more, I think having separate strategies for OnDelete and DeleteFirst would be valuable. So in total we could support 3 upgrade strategies:

  1. spec.upgradeStrategy=RollingUpdate -- same behavior as ENABLE_ZERO_DOWNTIME=true
  2. spec.upgradeStrategy=DeleteFirst -- same behavior as ENABLE_ZERO_DOWNTIME=false
  3. spec.upgradeStrategy=OnDelete -- new behavior, only upgrade if user manually deletes the RayCluster

Member:

> OnDelete implies the users has to manually delete the cluster first. This is a new behavior which is maybe okay?

I think the behavior of OnDelete is fine, but it may require additional code changes, such as updating the CR status. This is beyond the scope of this PR. Maybe we can open an issue to track the progress, and we can revisit whether we need this behavior after the refactoring of RayService.

Member:

For the ENABLE_ZERO_DOWNTIME=false case, how about naming it None, Disabled, or NoUpgrade?

@MortalHappiness (Member) commented Oct 25, 2024:

The current behavior of ENABLE_ZERO_DOWNTIME=false is to ignore all spec changes and do nothing.

@chiayi chiayi force-pushed the rayservice-upgrade branch 2 times, most recently from b1c2096 to 179207c Compare October 29, 2024 21:02
ray-operator/apis/ray/v1/rayservice_types.go (resolved thread)
if zeroDowntimeEnvVar != "" {
	enableZeroDowntime = strings.ToLower(zeroDowntimeEnvVar) != "false"
} else {
	enableZeroDowntime = rayServiceSpecUpgradeStrategy == rayv1.NewCluster
}
Member:

I think this may cause issues when upgrading the CRD. Can you test the case that:

  1. Install KubeRay v1.2.2
  2. Create a RayService
  3. Upgrade KubeRay to this PR (without upgrading CRD)
  4. Try to trigger zero-downtime upgrade.

I expect that the zero-downtime upgrade will not be triggered. The reason is that the KubeRay v1.2.2's CRD doesn't have the field rayServiceInstance.Spec.RayServiceUpgradeStrategy, so the zero value of string will be used, so rayServiceSpecUpgradeStrategy == rayv1.NewCluster will be false.

Member:

If my above statement is correct (I'm not 100% sure), can you:

  1. Add a check in validateRayServiceSpec to make sure the value of rayServiceInstance.Spec.RayServiceUpgradeStrategy is valid (for now, NewCluster, None, and the zero value of string are valid):

func validateRayServiceSpec(rayService *rayv1.RayService) error {

  2. Handle the case where RayServiceUpgradeStrategy is an empty string.

Member:

In addition, would you mind adding some comments to summarize the mechanism to control zero-downtime upgrade?

Contributor Author:

Your statement is correct: with the current logic, the zero-downtime upgrade was not triggered.

I changed the logic within the RayService controller to default zero-downtime to true; let me know if this is sufficient.

@kevin85421 kevin85421 self-assigned this Oct 31, 2024
@chiayi chiayi force-pushed the rayservice-upgrade branch 2 times, most recently from c2567a1 to 818e12b Compare November 4, 2024 19:36
@chiayi chiayi marked this pull request as ready for review November 4, 2024 19:36
@andrewsykim (Collaborator) left a comment:

LGTM @kevin85421 can you take another look?

ray-operator/controllers/ray/rayservice_controller.go (two outdated threads, resolved)
@@ -815,6 +823,14 @@ func TestReconcileRayCluster(t *testing.T) {
updateKubeRayVersion: true,
kubeRayVersion: "new-version",
},
// Test 7: Zero downtime upgrade is enabled, but is enabled through the RayServiceSpec
Member:

Can we add more tests? We have 4 combinations of enableZeroDowntime (true, false) and rayServiceUpgradeStrategy (NewCluster, None).

@chiayi (Contributor Author) commented Nov 7, 2024:

Added the following combinations:

  • env var: false, spec: true
  • env var: true, spec: false
  • env var: false, spec: unset
  • env var: false, spec: false

Member:

On second thought, I believe there are 6 cases based on enableZeroDowntime (true, false) and rayServiceUpgradeStrategy (NewCluster, None, "").

Contributor Author:

I believe all 6 of the combinations are covered. The "" should be covered by the original tests and the ("NewCluster", "None") x (true, false) are covered by the new tests that are added.

@chiayi chiayi force-pushed the rayservice-upgrade branch 4 times, most recently from 932941b to a5cdab5 Compare November 7, 2024 21:14
@chiayi chiayi force-pushed the rayservice-upgrade branch 3 times, most recently from 48d17a6 to 8a44e43 Compare November 7, 2024 22:02
@@ -57,6 +66,9 @@ type RayServiceSpec struct {
DeploymentUnhealthySecondThreshold *int32 `json:"deploymentUnhealthySecondThreshold,omitempty"`
// ServeService is the Kubernetes service for head node and worker nodes who have healthy http proxy to serve traffics.
ServeService *corev1.Service `json:"serveService,omitempty"`
// UpgradeStrategy represents the strategy used when upgrading the RayService. Currently supports `NewCluster` and `None`
// +kubebuilder:default:=NewCluster
Member:

Defaulting upgradeStrategy to NewCluster seems to be a breaking change. For example, if users set ENABLE_ZERO_DOWNTIME to false and don't specify upgradeStrategy, zero-downtime upgrades were disabled in KubeRay v1.2.2 but are enabled with this PR.

Contributor Author:

That is what I'm seeing as well. Should we not have a default value for UpgradeStrategy then?
cc @andrewsykim What do you think?

Member:

> Should we not have a default value for UpgradeStrategy then?

Having no default value sounds reasonable if there isn’t a better solution.

Contributor Author:

Removed the defaulting.

Collaborator:

There needs to be defaulting to define the default upgrade strategy though? Why not have ENABLE_ZERO_DOWNTIME take precedence to preserve backwards compatibility?

Member:

ENABLE_ZERO_DOWNTIME is an operator-level configuration, and UpgradeStrategy is a CR-level configuration. My current reasoning for why the operator-level config should take precedence is as follows:

  1. KubeRay operator is managed by a ML infra team.
  2. Custom resources are managed by application teams (e.g., recommendation systems, LLM, etc.).

The ML infra team will set a default configuration, i.e., ENABLE_ZERO_DOWNTIME in this case, for KubeRay that can work for most use cases.

However, in certain scenarios, some application teams may have rare use cases that require customizing their CRs. In these cases, they might want to set a different upgrade strategy for their RayService.

If the operator-level configuration takes precedence over the CR-level configuration, application teams would need to request changes to the KubeRay operator's configuration from the infra team. This could prevent these teams from sharing the KubeRay operator with other teams.

Does this make sense?

@andrewsykim (Collaborator) commented Nov 8, 2024:

> My current reasoning for why the operator-level config should take precedence is as follows:

Agreed that the operator-level config should take precedence for backwards compatibility, but UpgradeStrategy should still have a default when the operator-level env var is not set, right?

@kevin85421 (Member) commented Nov 8, 2024:

> UpgradeStrategy should still have a default when operator-level env var is not set right?

How about setting the default value in the KubeRay operator instead of OpenAPI, so that we can determine whether users have specified the field or not?

If users don't explicitly specify the CR-level config, we use the operator-level config.

Collaborator:

> How about setting the default value in the KubeRay operator instead of OpenAPI, so that we can determine whether users have specified the field or not?

Do you mean defaulting the behavior only in the controller but not in the API? If we do this, we probably want the field to be a pointer.

But personally, I think defaulting in the API is a better user experience. I'm assuming that most users have not set ENABLE_ZERO_DOWNTIME=false. If they have, we can log a warning in KubeRay saying that the upgrade strategy API will be ignored until the environment variable is unset.

@chiayi chiayi force-pushed the rayservice-upgrade branch 3 times, most recently from 623360e to b8a01fd Compare November 8, 2024 01:00
@kevin85421 (Member) left a comment:

Overall LGTM. We can merge it after we reach a consensus on: #2468 (comment)

const (
// During upgrade, NewCluster strategy will create new upgraded cluster and switch to it when it becomes ready
NewCluster RayServiceUpgradeStrategy = "NewCluster"
// Having None as the strategy will mean down-time while it wait for new cluster to be ready when performing upgrade
@kevin85421 (Member) commented Nov 8, 2024:

IIUC, if the upgrade strategy is set to None, no new RayCluster will be created. Can you double check it and update the comments?

Contributor Author:

Yes, my fault; I deleted the wrong thing while playing around with a few things.

@kevin85421 (Member):

Could you rebase with the master branch? I just fixed a CI issue.
