-
Notifications
You must be signed in to change notification settings - Fork 299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce RetryStrategy API to ProvisioningRequestConfig #3375
Introduce RetryStrategy API to ProvisioningRequestConfig #3375
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
198179c
to
9312d0a
Compare
9312d0a
to
5d6baf8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
initial review pass
2995cb1
to
f9fa268
Compare
LGTM label has been added. Git tree hash: b399f08207b36f7091fda86281ce35873d8f7e43
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, is this still needed?
I guess that answer is yes since this is not only for ProvReq, this is used by other AdmissionChecks.
if attempt := getAttempt(log, pr, wl.Name, check); attempt <= backoffLimitCount { | ||
// it is going to be retried | ||
message := fmt.Sprintf("Retrying after failure: %s", apimeta.FindStatusCondition(pr.Status.Conditions, autoscaling.Failed).Message) | ||
updated = updateCheckMessage(&checkState, message) || updated | ||
if features.Enabled(features.KeepQuotaForProvReqRetry) { | ||
updated = updateCheckState(&checkState, kueue.CheckStatePending) || updated | ||
updated = updateCheckMessage(&checkState, message) || updated | ||
} else { | ||
} else if wl.Status.RequeueState == nil || getAttempt(log, pr, wl.Name, check) > ptr.Deref(wl.Status.RequeueState.Count, 0) { | ||
// We don't want to Retry on old ProvisioningRequests | ||
updated = true | ||
checkState.State = kueue.CheckStateRejected | ||
checkState.Message = apimeta.FindStatusCondition(pr.Status.Conditions, autoscaling.Failed).Message | ||
updateCheckState(&checkState, kueue.CheckStateRetry) | ||
workload.UpdateRequeueState(wlPatch, backoffBaseSeconds, backoffMaxSeconds, c.clock) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed that this and BookingExpired case requeue mechanism will be implemented in the workload-controller.
I guess that we should follow the steps:
- ProvReqController: Update Workload AdmissionCheckState with Retry
- WorkloadController: Do Requeue mechanism same as the WaitForPodsReady based on the Workload AdmissionCheckState.
In the current implementation, the field owner conflict obviously occured between WorkloadController and ProvisioningRequestController.
Then the requeueState update will fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Especially, the WaitForPodsReady condition and ProvReqRequeue conditions are met at the requeue cycle (This often could happen), I guess that conflict often happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that we should follow the steps:
- ProvReqController: Update Workload AdmissionCheckState with Retry
- WorkloadController: Do Requeue mechanism same as the WaitForPodsReady based on the Workload > AdmissionCheckState.
The con of this solution is that Workload Controller doesn't have information about the ProvReqConfig. The controller would need to do two additional requests:
- GET on AdmissionCheck that is in
Retry
state; - LIST on all ProvReqConfigs and match it with the AdmissionCheck
It's seems to me like we're just moving complexity from one controller to the another without any clear benefit.
However, I am aware of this problem and we discussed it with @mimowo. The third important controller in this mechanism is jobframework reconciler which is responsible for setting Requeue
condition, and unsetting Quota. So currently evicting workload and unsetting quota is not atomic which may also cause problems. We're on designing a solution to that but it seems like this need to be well-thought and is out of the scope for this PR.
Additionally, I implemented the PR with those problems in mind so it's resilient to any interleavings.
In the current implementation, the field owner conflict obviously occured between WorkloadController and ProvisioningRequestController.
Then the requeueState update will fail.
We send PATCH with FieldOwnership and during my e2e tests it worked properly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The con of this solution is that Workload Controller doesn't have information about the ProvReqConfig. The controller would need to do two additional requests:
GET on AdmissionCheck that is in Retry state;
LIST on all ProvReqConfigs and match it with the AdmissionCheck
Which ProvReq information do you need in the workload-controller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
backoffBaseSeconds
and backoffMaxSeconds
from ``ProvReqConfig.RetryStrategy`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y I know where you are coming from, for me this line also was "eyebrow raising" and we discussed this at length with Patryk.
Let me summarize the more important aspects:
- actually conflict is unlikely because WaitForPodsReady owns the process when the workload is already admitted, while ProvReq owns the workload before admission (during admission checks phase), (I don't know how to trigger the conflict)
- there is no issue about field ownership since both workload_controller and ProvReq controller use ForceOwnership.
- the fact that this line sets
requeueAt
while stillRequeued: True
is a bit bothering me, but in fact the same happens for WaitForPodsReady, just the time window is a bit shorter. I think we could actually "fix" this by setting "Requeued: False" at the moment of reserving quota. I see no downsides, we could open a follow up issue to investigate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for summarizing that.
My root motivation was centralizing managing the Requeue mechanism in the workload-controller to avoid logic fragmentation.
In the current implementation, even if both features (WaitForPodsReady and ProvisioningRequest retryStrategy) relies on the same API, the mechanism are fragmented, then those fragmentations could bring us uncontrollable regresssions.
Alternatively, we can consider preparing the dedicated field separate from the WaitForPodsReady, but I think that is not a good idea rather than the fragmented mechanism since that indicates we will push debt to API and end users.
So, I'm wondering if we can store the ProvReq Retry parameters like backoffBaseSeconds
in our cache.
Indeed, we already have relationship between AdmissionCheck and ProvisioningRequest in our cache. We only extend the cache, a slightly.
@mimowo Any objections?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I am aware of this problem and we discussed it with @mimowo. The third important controller in this mechanism is jobframework reconciler which is responsible for setting Requeue condition, and unsetting Quota.
Here key problem is even if they rely on the same API, both feature are managed by different mechanism.
test/integration/controller/admissionchecks/provisioning/provisioning_test.go
Outdated
Show resolved
Hide resolved
test/integration/controller/admissionchecks/provisioning/provisioning_test.go
Outdated
Show resolved
Hide resolved
test/integration/controller/admissionchecks/provisioning/provisioning_test.go
Outdated
Show resolved
Hide resolved
247c14b
to
d19a4cc
Compare
I believe so, even for ProvReq. Why do you think this is redundant? ProvReq check can also be in |
I believe the last remaining comment is more exploratory in nature, and I don't see any "quick fix" there. I summarized my understanding of the issue in the last message. Based on that understanding and the e2e tests performed by @PBundyra I believe it is safe to move the PR forward. I believe we can address remaining ideas in follow up PRs / issues, for example as I proposed to explore setting |
LGTM label has been added. Git tree hash: 37c6cd40321ca21fb58caa4420010eb6715efc24
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mimowo, PBundyra The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold cancel |
That was just a question. So, I did not indicate the refining that part :) |
…sigs#3375) * Introduce RetryStrategy API to ProvisioningRequestConfig * Address comments * Improve API, address comments about defaults and clock in tests * Address comments * Address comments * wip integration tests * Add integration tests * Add generated API file * Fix the holy empty line * Change API fields into required * all provreq integration tests working, before cleanup * Fix ProvReq integration tests, delete stalePR variable, use clock in ProvReq controller * Fix linter * Update installation table, update examples * Rever redundant ifs * Change api fields to be optional * Generate fresh charts * Fix integration test * Make fields required to enable user set 0 value * Make backoffBaseSeconds, and backoffMaxSeconds poitners * Delete redundant ptr.Deref as we already default those values with kubebuilder annotation * Requeue workloads based on eviction in case of eviction by admission check * Default backoffLimitCount to 3 * Address nits * Address comments, use kubebuilder to default RetryStrategy
What type of PR is this?
/kind feature
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #1353
Special notes for your reviewer:
I am yet to add integration tests, but the rest of the code is ready for the first pass
Does this PR introduce a user-facing change?