Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Call Status.Update once in each reconcile attempt #494

Merged
merged 6 commits into from
Apr 13, 2024

Conversation

danielvegamyhre
Copy link
Contributor

@danielvegamyhre danielvegamyhre commented Apr 4, 2024

Fixes #405

This approach removes the JobSet status update API calls from the controller, and simply modifies the status in memory, and does a single status update call at the end of each reconciliation attempt.

These changes also include:

  • Refactoring to improve readability and organization
  • Fixing bug causing unnecessary status update calls (here the replicated job statuses must be sorted before doing the semantic equality test)
  • Refactor "in order startup policy" condition into 2 mutually exclusive conditions: one indicating the startup policy condition is in progress, and one indicating it has completed.
  • Remove forceFalseUpdate flag from conditionOpts since we are no longer using a condition status of false/true on the in order startup policy condition to indicate if the startup policy is in-progress/completed (see prior bullet point).

Note: the operation conflict errors are still in the integration tests. One idea I had is that we are still doing Job status updates in the middle of reconcile logic. These will trigger Jobset reconciles which then update the jobSet.status.replicatedJobStatuses as it looks at those child jobs. The controller runtime work queue should be handling these sequentially though since MaxConcurrency defaults to 1 and we don't override that, so I'm not sure of the root cause yet. Kueue has these errors in the integration test logs as well though so they may be an expected side effect of the integration test framework or someting.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 4, 2024
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and kannon92 April 4, 2024 18:55
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 4, 2024
Copy link

netlify bot commented Apr 4, 2024

Deploy Preview for kubernetes-sigs-jobset canceled.

Name Link
🔨 Latest commit f199865
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/661ac4816ae25e00080d39a8

@kannon92
Copy link
Contributor

kannon92 commented Apr 4, 2024

There are two TODOs in this PR. Do you want to address them in the PR? or as part of the review?

@danielvegamyhre danielvegamyhre force-pushed the refactor-status branch 2 times, most recently from 21583a3 to bde9e1b Compare April 4, 2024 19:50
@danielvegamyhre
Copy link
Contributor Author

danielvegamyhre commented Apr 4, 2024

There are two TODOs in this PR. Do you want to address them in the PR? or as part of the review?

Let's discuss the TODO related to the event emission in this PR.

I removed the other TODO (related to updating child Job statuses), that was just a note for myself while investigating the integration test "resource has been modified, try again" errors, which I'm no longer attempting to address in this PR. I think you're right that if these errors also appear in Kueue integration test logs then it's likely expected, and instead of going down that rabbit hole we can look into ways to filter those log lines from the integration test logs if we want to clean that up.

pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
@danielvegamyhre danielvegamyhre force-pushed the refactor-status branch 2 times, most recently from 7bf9e1f to a8d9a0b Compare April 8, 2024 22:44
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
@danielvegamyhre
Copy link
Contributor Author

Need to refactor unit tests for this, will comment when it's ready for another look.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 10, 2024
@danielvegamyhre danielvegamyhre force-pushed the refactor-status branch 3 times, most recently from a5af01e to e6940bf Compare April 10, 2024 00:41
@danielvegamyhre
Copy link
Contributor Author

/hold

I'll continue work on this after the v0.5.0 release

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 10, 2024
@danielvegamyhre
Copy link
Contributor Author

@ahg-g @kannon92 this is ready for another look. Please take a look at the updated PR description for a summary of the changes.

pkg/controllers/jobset_controller.go Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
@ahg-g
Copy link
Contributor

ahg-g commented Apr 13, 2024

/label tide/merge-method-squash
/lgtm
/approve
/hold cancel

Thanks!

@k8s-ci-robot k8s-ci-robot added tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Apr 13, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 13, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, danielvegamyhre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [ahg-g,danielvegamyhre]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 5143e0b into kubernetes-sigs:main Apr 13, 2024
12 checks passed
@danielvegamyhre danielvegamyhre mentioned this pull request Apr 15, 2024
20 tasks
@danielvegamyhre danielvegamyhre deleted the refactor-status branch May 2, 2024 18:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Call Status.Update once in reconcile
5 participants