Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce RetryStrategy API to ProvisioningRequestConfig #3375

Merged
merged 25 commits into from
Nov 5, 2024

Conversation

PBundyra
Copy link
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #1353

Special notes for your reviewer:

I am yet to add integration tests, but the rest of the code is ready for the first pass

Does this PR introduce a user-facing change?

ProvisioningRequestConfig API has now `RetryStrategy` field that allows users to configure retries per ProvisioningRequest class. By default retry releases allocated quota in Kueue. 

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 30, 2024
@PBundyra
Copy link
Contributor Author

/assign @mimowo @tenzen-y

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 30, 2024
Copy link

netlify bot commented Oct 30, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit d19a4cc
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/6729f6a09798ed0008c41ed3

@PBundyra PBundyra force-pushed the ProvReq-config-retry branch from 198179c to 9312d0a Compare October 30, 2024 09:06
@PBundyra PBundyra force-pushed the ProvReq-config-retry branch from 9312d0a to 5d6baf8 Compare October 30, 2024 09:18
Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initial review pass

@PBundyra PBundyra force-pushed the ProvReq-config-retry branch from 2995cb1 to f9fa268 Compare October 30, 2024 13:46
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: b399f08207b36f7091fda86281ce35873d8f7e43

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 4, 2024
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, is this still needed?

https://github.com/PBundyra/kueue/blob/29cd2a6605ca380cf7cb0ff8ca3841e119d6fc7e/pkg/controller/core/workload_controller.go#L370-L371

I guess that answer is yes since this is not only for ProvReq, this is used by other AdmissionChecks.

Comment on lines +518 to 529
if attempt := getAttempt(log, pr, wl.Name, check); attempt <= backoffLimitCount {
// it is going to be retried
message := fmt.Sprintf("Retrying after failure: %s", apimeta.FindStatusCondition(pr.Status.Conditions, autoscaling.Failed).Message)
updated = updateCheckMessage(&checkState, message) || updated
if features.Enabled(features.KeepQuotaForProvReqRetry) {
updated = updateCheckState(&checkState, kueue.CheckStatePending) || updated
updated = updateCheckMessage(&checkState, message) || updated
} else {
} else if wl.Status.RequeueState == nil || getAttempt(log, pr, wl.Name, check) > ptr.Deref(wl.Status.RequeueState.Count, 0) {
// We don't want to Retry on old ProvisioningRequests
updated = true
checkState.State = kueue.CheckStateRejected
checkState.Message = apimeta.FindStatusCondition(pr.Status.Conditions, autoscaling.Failed).Message
updateCheckState(&checkState, kueue.CheckStateRetry)
workload.UpdateRequeueState(wlPatch, backoffBaseSeconds, backoffMaxSeconds, c.clock)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed that this and BookingExpired case requeue mechanism will be implemented in the workload-controller.
I guess that we should follow the steps:

  1. ProvReqController: Update Workload AdmissionCheckState with Retry
  2. WorkloadController: Do Requeue mechanism same as the WaitForPodsReady based on the Workload AdmissionCheckState.

In the current implementation, the field owner conflict obviously occured between WorkloadController and ProvisioningRequestController.
Then the requeueState update will fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Especially, the WaitForPodsReady condition and ProvReqRequeue conditions are met at the requeue cycle (This often could happen), I guess that conflict often happen.

Copy link
Contributor Author

@PBundyra PBundyra Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that we should follow the steps:

  1. ProvReqController: Update Workload AdmissionCheckState with Retry
  2. WorkloadController: Do Requeue mechanism same as the WaitForPodsReady based on the Workload > AdmissionCheckState.

The con of this solution is that Workload Controller doesn't have information about the ProvReqConfig. The controller would need to do two additional requests:

  1. GET on AdmissionCheck that is in Retry state;
  2. LIST on all ProvReqConfigs and match it with the AdmissionCheck

It's seems to me like we're just moving complexity from one controller to the another without any clear benefit.

However, I am aware of this problem and we discussed it with @mimowo. The third important controller in this mechanism is jobframework reconciler which is responsible for setting Requeue condition, and unsetting Quota. So currently evicting workload and unsetting quota is not atomic which may also cause problems. We're on designing a solution to that but it seems like this need to be well-thought and is out of the scope for this PR.

Additionally, I implemented the PR with those problems in mind so it's resilient to any interleavings.

In the current implementation, the field owner conflict obviously occured between WorkloadController and ProvisioningRequestController.
Then the requeueState update will fail.

We send PATCH with FieldOwnership and during my e2e tests it worked properly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The con of this solution is that Workload Controller doesn't have information about the ProvReqConfig. The controller would need to do two additional requests:

GET on AdmissionCheck that is in Retry state;
LIST on all ProvReqConfigs and match it with the AdmissionCheck

Which ProvReq information do you need in the workload-controller?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backoffBaseSeconds and backoffMaxSeconds from ``ProvReqConfig.RetryStrategy`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y I know where you are coming from, for me this line also was "eyebrow raising" and we discussed this at length with Patryk.
Let me summarize the more important aspects:

  1. actually conflict is unlikely because WaitForPodsReady owns the process when the workload is already admitted, while ProvReq owns the workload before admission (during admission checks phase), (I don't know how to trigger the conflict)
  2. there is no issue about field ownership since both workload_controller and ProvReq controller use ForceOwnership.
  3. the fact that this line sets requeueAt while still Requeued: True is a bit bothering me, but in fact the same happens for WaitForPodsReady, just the time window is a bit shorter. I think we could actually "fix" this by setting "Requeued: False" at the moment of reserving quota. I see no downsides, we could open a follow up issue to investigate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for summarizing that.
My root motivation was centralizing managing the Requeue mechanism in the workload-controller to avoid logic fragmentation.

In the current implementation, even if both features (WaitForPodsReady and ProvisioningRequest retryStrategy) relies on the same API, the mechanism are fragmented, then those fragmentations could bring us uncontrollable regresssions.
Alternatively, we can consider preparing the dedicated field separate from the WaitForPodsReady, but I think that is not a good idea rather than the fragmented mechanism since that indicates we will push debt to API and end users.

So, I'm wondering if we can store the ProvReq Retry parameters like backoffBaseSeconds in our cache.
Indeed, we already have relationship between AdmissionCheck and ProvisioningRequest in our cache. We only extend the cache, a slightly.

@mimowo Any objections?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I am aware of this problem and we discussed it with @mimowo. The third important controller in this mechanism is jobframework reconciler which is responsible for setting Requeue condition, and unsetting Quota.

Here key problem is even if they rely on the same API, both feature are managed by different mechanism.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 5, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Nov 5, 2024
Copy link

linux-foundation-easycla bot commented Nov 5, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 5, 2024
@PBundyra PBundyra force-pushed the ProvReq-config-retry branch from 247c14b to d19a4cc Compare November 5, 2024 10:42
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Nov 5, 2024
@PBundyra
Copy link
Contributor Author

PBundyra commented Nov 5, 2024

Additionally, is this still needed?

https://github.com/PBundyra/kueue/blob/29cd2a6605ca380cf7cb0ff8ca3841e119d6fc7e/pkg/controller/core/workload_controller.go#L370-L371

I guess that answer is yes since this is not only for ProvReq, this is used by other AdmissionChecks.

I believe so, even for ProvReq. Why do you think this is redundant? ProvReq check can also be in Rejected state, and then we want to deactivate the Workload

@mimowo
Copy link
Contributor

mimowo commented Nov 5, 2024

I believe the last remaining comment is more exploratory in nature, and I don't see any "quick fix" there. I summarized my understanding of the issue in the last message. Based on that understanding and the e2e tests performed by @PBundyra I believe it is safe to move the PR forward. I believe we can address remaining ideas in follow up PRs / issues, for example as I proposed to explore setting Requeued: False when reserving the quota.

@mimowo
Copy link
Contributor

mimowo commented Nov 5, 2024

/lgtm
/approve
Thank you @PBundyra for the code and @tenzen-y for careful review!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 5, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 37c6cd40321ca21fb58caa4420010eb6715efc24

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, PBundyra

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mimowo
Copy link
Contributor

mimowo commented Nov 5, 2024

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 5, 2024
@k8s-ci-robot k8s-ci-robot merged commit da0e8e1 into kubernetes-sigs:main Nov 5, 2024
16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.9 milestone Nov 5, 2024
@tenzen-y
Copy link
Member

tenzen-y commented Nov 5, 2024

Additionally, is this still needed?
https://github.com/PBundyra/kueue/blob/29cd2a6605ca380cf7cb0ff8ca3841e119d6fc7e/pkg/controller/core/workload_controller.go#L370-L371
I guess that answer is yes since this is not only for ProvReq, this is used by other AdmissionChecks.

I believe so, even for ProvReq. Why do you think this is redundant? ProvReq check can also be in Rejected state, and then we want to deactivate the Workload

That was just a question. So, I did not indicate the refining that part :)

kannon92 pushed a commit to kannon92/kueue that referenced this pull request Dec 5, 2024
…sigs#3375)

* Introduce RetryStrategy API to ProvisioningRequestConfig

* Address comments

* Improve API, address comments about defaults and clock in tests

* Address comments

* Address comments

* wip integration tests

* Add integration tests

* Add generated API file

* Fix the holy empty line

* Change API fields into required

* all provreq integration tests working, before cleanup

* Fix ProvReq integration tests, delete stalePR variable, use clock in ProvReq controller

* Fix linter

* Update installation table, update examples

* Rever redundant ifs

* Change api fields to be optional

* Generate fresh charts

* Fix integration test

* Make fields required to enable user set 0 value

* Make backoffBaseSeconds, and backoffMaxSeconds poitners

* Delete redundant ptr.Deref as we already default those values with kubebuilder annotation

* Requeue workloads based on eviction in case of eviction by admission check

* Default backoffLimitCount to 3

* Address nits

* Address comments, use kubebuilder to default RetryStrategy
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for configurable ProvisioningRequest retries
5 participants