Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support elastic jobset #622

Closed
wants to merge 2 commits into from

Conversation

kannon92
Copy link
Contributor

Fixes #463

Add elastic jobset support (downscale/upscaling replicas of a Replicated Job).

Before this change, we treated ReplicateJobs (in JobSet.Spec) as immutable. This relaxes that restriction.

I choose to validate that names cannot be changed and the job templates must not change during updates. I don't see a reason for names to change and we also don't allow order change of the replicated job list so we can detect differences easily.

I am reopening this to resolve the rebase labels..

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 22, 2024
Copy link

netlify bot commented Jul 22, 2024

Deploy Preview for kubernetes-sigs-jobset ready!

Name Link
🔨 Latest commit bf458c6
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/67000c7d0434c600085e7192
😎 Deploy Preview https://deploy-preview-622--kubernetes-sigs-jobset.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@kannon92
Copy link
Contributor Author

/assign @danielvegamyhre

@mimowo wouls you be open for a review also?

@kannon92
Copy link
Contributor Author

/cc @tenzen-y

@k8s-ci-robot k8s-ci-robot requested a review from tenzen-y July 25, 2024 13:02
@kannon92
Copy link
Contributor Author

/hold

Missed the conflicts..

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 27, 2024
@mimowo
Copy link
Contributor

mimowo commented Jul 29, 2024

@mimowo wouls you be open for a review also?

Ack, I will try this week, but don't wait for my review unnecessarily if you get enough input from others.

@kannon92 kannon92 force-pushed the elastic-jobset-reopen branch from 086a389 to 3f3a2ac Compare July 29, 2024 13:43
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kannon92
Once this PR has been reviewed and has the lgtm label, please ask for approval from danielvegamyhre. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kannon92
Copy link
Contributor Author

kannon92 commented Jul 29, 2024

@danielvegamyhre do you know how to remove the rebase label? I did resolve conflicts in my branch and pushes those up. I think the prow removal of it is not working.

ref: https://kubernetes.slack.com/archives/C09QZ4DQB/p1708018595891739

Looks like rebase label can have issues. So I think removing is fine. It also doesn't block the merge so this is ready for review.

@kannon92
Copy link
Contributor Author

/remove needs-rebase

@danielvegamyhre danielvegamyhre removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 29, 2024
@danielvegamyhre
Copy link
Contributor

@danielvegamyhre do you know how to remove the rebase label? I did resolve conflicts in my branch and pushes those up. I think the prow removal of it is not working.

ref: https://kubernetes.slack.com/archives/C09QZ4DQB/p1708018595891739

Looks like rebase label can have issues. So I think removing is fine. It also doesn't block the merge so this is ready for review.

I removed it via the github UI, the "labels" button on the top right.

@kannon92
Copy link
Contributor Author

kannon92 commented Jul 29, 2024

I think since you have admin rights to the repo you have access to that.

Thank you though!

@kannon92
Copy link
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 30, 2024
// Comparing job templates can be slow
// Only do it if we detect a difference.
if !equality.Semantic.DeepEqual(mungedSpec.ReplicatedJobs, oldJS.Spec.ReplicatedJobs) {
if err := validateReplicatedJobsUpdate(mungedSpec.ReplicatedJobs, oldJS.Spec.ReplicatedJobs); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this allows to update the number of replicas for an unsuspended JobSet. Is this desired?
I this would not work if there is no mechanism to recreate the Jobs. Maybe it is better to only support mutability of replicas when the JobSet is suspended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that one would want to add more replicated jobs as a major use case for Elastic JobSet.

We debated about whether or not we should modify the indexed job or the JobSet. I think we went with scaling up/down on the replicas of a ReplicatedJob.

Are you thinking that elasticity should only occur on a suspended object?

Copy link
Contributor

@mimowo mimowo Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is ok to support elasticity for running Jobs, but you may test e2e if this really works in this PR (I doubt).

IIUC this requires an update to the underlying Job object no? IIUC you only update the JobSet object, but this does not impact the underlying Jobs which are running, but I might be missing something.

Also, I think changing paralellism only works for IndexedJobs where completions = parallelism.

If my thinking is correct, then your PR works ok, by only if combined with #625, and only if the JobSet is updated while suspended.

Overall it would be good to demonstrate this feature works with an e2e test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I have probably mixed a single Job mutability and the number of Job replicas.

In this PR you only let to update the Job replicas count, right?

Still, it is not clear to me if we already have code to wcale up and down the number of replicas.

Copy link
Contributor Author

@kannon92 kannon92 Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a e2e test and scaling up works. I create a JobSet and then I edit the replicated jobs of a worker jobset and I can confirm that new jobs are created.

I change the replicated jobs. I should check on scaling down actually.

Copy link
Contributor

@mimowo mimowo Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I create a JobSet and then I edit the replicated jobs of a worker jobset and I can confirm that new jobs are created.

Great!

I change the replicated jobs. I should check on scaling down actually.

Please share the results of the testing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Downscaling doesn’t seem to work.

I think I’ll add some e2e tests for this as it “works” for integration tests so there may be some subtleties

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update!

@@ -1599,7 +1599,7 @@ func TestValidateUpdate(t *testing.T) {
}.ToAggregate(),
},
{
name: "replicated job pod template can be updated for suspended jobset",
name: "replicated jobs updates can change replicas",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure there are two versions of the test: for suspended JobSet and unsuspended.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that be relevant once #625 merges?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I see what you mean.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushed up a change.

},
{
checkJobCreation: func(js *jobset.JobSet) {
expectedStarts := 7
Copy link
Contributor

@mimowo mimowo Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be good to add a comment why 7 is expected. It is a bit magical number here. Maybe also "numJobs" is a better name? Not sure why this is called expectedStarts while it is checked against NumJobs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

},
{
checkJobCreation: func(js *jobset.JobSet) {
expectedStarts := 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here. Please add a comment why 2 is expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@kannon92
Copy link
Contributor Author

kannon92 commented Aug 4, 2024

Pushed up some e2e tests for this. I need to work on the downscale case.

What should be the behavior for elastic jobs on a downscale?

If more jobs exists, do we delete the extra jobs only? Or should we delete and recreate all the jobs?

On upscale, the extra jobs are created and I think the original ones are left alone.

On downscale, I guess we could delete the extra replicas.

@danielvegamyhre
Copy link
Contributor

@kannon92 Looks like e2e tests are failing, can you take a look?

@kannon92
Copy link
Contributor Author

kannon92 commented Aug 6, 2024

Yes, my hope is to get some clarity on the design/code.

Downscaling doesn't work for sure. What should we do on a downscale of replicas in ReplicatedJobs? Do we delete all the jobs and recreate? Or do we terminate the jobs that are not in the list of replicas?

@danielvegamyhre
Copy link
Contributor

Yes, my hope is to get some clarity on the design/code.

Downscaling doesn't work for sure. What should we do on a downscale of replicas in ReplicatedJobs? Do we delete all the jobs and recreate? Or do we terminate the jobs that are not in the list of replicas?

Hmm, I think if replicatedJob.replicas is downscaled from N to N-2, we should just delete 2 jobs so the state matches the spec.

@kannon92
Copy link
Contributor Author

kannon92 commented Aug 8, 2024

Yes, my hope is to get some clarity on the design/code.
Downscaling doesn't work for sure. What should we do on a downscale of replicas in ReplicatedJobs? Do we delete all the jobs and recreate? Or do we terminate the jobs that are not in the list of replicas?

Hmm, I think if replicatedJob.replicas is downscaled from N to N-2, we should just delete 2 jobs so the state matches the spec.

Maybe I’m overthinking it.

Say I have 4 jobs and I downscale to 2. Should I only delete the 3rd and 4th replicated job via index? Say I want to keep the 1st and 2nd job running and I remove the excess ones?

@danielvegamyhre
Copy link
Contributor

danielvegamyhre commented Aug 8, 2024

Yes, my hope is to get some clarity on the design/code.
Downscaling doesn't work for sure. What should we do on a downscale of replicas in ReplicatedJobs? Do we delete all the jobs and recreate? Or do we terminate the jobs that are not in the list of replicas?

Hmm, I think if replicatedJob.replicas is downscaled from N to N-2, we should just delete 2 jobs so the state matches the spec.

Maybe I’m overthinking it.

Say I have 4 jobs and I downscale to 2. Should I only delete the 3rd and 4th replicated job via index? Say I want to keep the 1st and 2nd job running and I remove the excess ones?

If I recall correctly the Job controller works this way for Elastic Indexed Jobs, removing indexes from largest to smallest until there are no more excess pods, so it would make sense to follow that pattern here for consistency I think.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 10, 2024
@kannon92 kannon92 force-pushed the elastic-jobset-reopen branch 2 times, most recently from f4e6034 to b266034 Compare August 13, 2024 00:18
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 13, 2024
@kannon92 kannon92 force-pushed the elastic-jobset-reopen branch 2 times, most recently from fcd1bd1 to 58ca914 Compare August 17, 2024 13:59
@kannon92
Copy link
Contributor Author

@danielvegamyhre @mimowo

I got the downscale to work so I think this PR is ready for review.

@danielvegamyhre danielvegamyhre added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Sep 8, 2024
want: fmt.Errorf("updates can not change job names or reorder the jobs"),
},
{
name: "updates on replicated job templates are not allowed",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want to disallow job template updates? This would prevent us from using Elastic Indexed Jobs, which I'm hoping could be used in a more efficient implementation of exclusive job placement (see #482)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See this comment: #463 (comment).

I wasn’t sure if there was a use case for this so I omitted support for it.

@ahg-g would this be a strong enough usecase to add support for ElasticIndexedJobs.

pkg/controllers/elastic_jobset.go Outdated Show resolved Hide resolved
pkg/controllers/elastic_jobset.go Outdated Show resolved Hide resolved
@@ -520,6 +527,22 @@ func (r *JobSetReconciler) reconcileReplicatedJobs(ctx context.Context, js *jobs
return nil
}

// We need to check if the replicas of a replicated job are mismatching
// If they are, we should delete the extra jobs
func (r *JobSetReconciler) downscaleElasticJobs(ctx context.Context, js *jobset.JobSet, replicatedJobStatus []jobset.ReplicatedJobStatus) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also be moved to the elastic_jobset.go file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it can. We keep anything related to the JobSetReconciler in jobset_controller.

ElasticJob and other utilites seem to be pure functions that can easily be tested with units. This does rely on the reconcile class to talk to the client.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what we've done so far but I think it makes more sense for all functions and struct methods that are feature-specific to be colocated in the same feature file, rather than group them based on whether or not the function has a pointer receiver. I think this is a useful pattern for organizing the methods of large structs with a growing number of methods.

For now we can leave this as is though, since this refactor I'm describing isn't really in the scope of this PR. I'll work on this refactor in a separate PR later and we can discuss the pros/cons of it in the PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, in theory that sounds like a good idea. I tend to like separating code from controller logic as unit testing controllers is quite painful.

I will eagerly await your PR.

if err != nil {
return nil, fmt.Errorf("unable get integer from job index key")
}
if jobIndex >= int(countOfJobsToDelete) {
Copy link
Contributor

@danielvegamyhre danielvegamyhre Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we have 4 jobs (indexes 0,1,2,3) and we want to delete 1 job, then countOfJobsToDelete=1 and we may end up deleting any of job indexes 1,2,3 - so it is non-deterministic and may leave us with an non-contiguous range of job indexes (remaining jobs indexes 0,2,3).

Rather than this, I think we should delete jobs from largest job index to smallest, so the remaining jobs still occupy a contiguous index range. This follows the same pattern as how the Job controller behaves for elastic indexed jobs, deleting pods from largest to smallest completion indexes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I went ahead and added this change.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielvegamyhre , keeping the indices unchanged could be desirable in cases where we don't want to restart the entire jobset right?

Copy link
Contributor

@danielvegamyhre danielvegamyhre Oct 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielvegamyhre , keeping the indices unchanged could be desirable in cases where we don't want to restart the entire jobset right?

Can you clarify what you mean? In this thread we are discussing which jobs to delete if the JobSet is scaled down (e.g., number of replicas in a replicatedJob reduced from 4 to 3) - not restarting the JobSet.

countOfJobsToDelete := status.Ready - replicatedJob.Replicas
if countOfJobsToDelete > 0 {
jobsWeDeleted := 0
for _, val := range jobItems {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ii think it would be more optimal (although perhaps not provide a big perf improvement in practice) to store the jobs in a map of "replicatedJobName -> [slice of jobs for that replicatedJob]". That way we don't need to iterate over every job in the jobset, for every replicatedJob we are downscaling.

The O(jobs) cost of constructing the map is preferable to the O(replicatedJobs * jobs) cost of this approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up doing this to help with making sure I delete the right jobs.

jobsToDelete = append(jobsToDelete, &val)
}
if jobsWeDeleted == int(countOfJobsToDelete) {
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can break here rather than continue right, since there's no more jobs to delete in this inner loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. done.

@kannon92
Copy link
Contributor Author

@ahg-g id thinking we should clear use cases with this in mind. I am not a ML engineer so I’m not really able to give clear concrete cases for this. Let me ask around and see. Otherwise I may close this and wait until we have a customer ask for this.

@kannon92
Copy link
Contributor Author

/close

We need more design/use cases on this. Going to close for now.

@k8s-ci-robot
Copy link
Contributor

@kannon92: Closed this PR.

In response to this:

/close

We need more design/use cases on this. Going to close for now.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Elastic JobSets
5 participants