Fix: defective revision can lead to pods never being removed #14573
Conversation
Skipping CI for Draft Pull Request.
Codecov Report
Attention:
Additional details and impacted files

@@            Coverage Diff             @@
##             main   #14573      +/-   ##
==========================================
- Coverage   85.99%   85.74%   -0.26%
==========================================
  Files         197      198       +1
  Lines       14916    15143     +227
==========================================
+ Hits        12827    12984     +157
- Misses       1777     1833      +56
- Partials      312      326      +14

☔ View full report in Codecov by Sentry.
Force-pushed from 623d48b to 16007ba
…adline. Also added a condition to the Scale function: in case we are not receiving data, check whether we panicked earlier and the panic stable window has passed, and if so continue to evaluate the remaining conditions.
Force-pushed from 16007ba to 905e516
Version: "v1", | ||
Resource: "revisions", | ||
} | ||
uRevision, err := ks.dynamicClient.Resource(gvr).Namespace(pa.Namespace).Get(ctx, pa.Name, metav1.GetOptions{}) |
We don't want to be fetching revisions when we're making scaling decisions - that'll overload the API server at scale and possibly trigger client-side rate limiting.
Good point.
In case this proposal is viable, where is a good place in the code to fetch the revisionTimeout?
Just to note that this is called only when we are scaling to 0.
In case this proposal is viable, where is a good place in the code to fetch the revisionTimeout?
Two options:
- we add a field to the Knative PodAutoscaler type
- we propagate this using an annotation on the PodAutoscaler type (a rough sketch of the annotation route follows)
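A minimal sketch of the annotation route, purely for illustration - the annotation key, helper name, and package here are assumptions, not the code in this PR:

// Sketch only: propagate the revision timeout onto the PodAutoscaler so the
// autoscaler can read it later without calling the API server.
package resources

import (
	"fmt"

	autoscalingv1alpha1 "knative.dev/serving/pkg/apis/autoscaling/v1alpha1"
	v1 "knative.dev/serving/pkg/apis/serving/v1"
)

// hypothetical annotation key; the real one may differ
const revisionTimeoutAnnotationKey = "serving.knative.dev/revision-timeout"

// propagateRevisionTimeout copies the Revision's TimeoutSeconds onto the
// PodAutoscaler annotations when the PA is built from the Revision.
func propagateRevisionTimeout(pa *autoscalingv1alpha1.PodAutoscaler, rev *v1.Revision) {
	if rev.Spec.TimeoutSeconds == nil {
		return
	}
	if pa.Annotations == nil {
		pa.Annotations = map[string]string{}
	}
	pa.Annotations[revisionTimeoutAnnotationKey] = fmt.Sprintf("%ds", *rev.Spec.TimeoutSeconds)
}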
Looks like we already have an annotation:
serving/pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go
Lines 171 to 174 in f939498

func (pa *PodAutoscaler) ProgressDeadline() (time.Duration, bool) {
	// the value is validated in the webhook
	return pa.annotationDuration(serving.ProgressDeadlineAnnotation)
}
Oddly, it seems like it's a user knob that we propagate to the deployment - but we aren't setting it on all PodAutoscalers (e.g. when the user doesn't set it).
serving/pkg/apis/serving/register.go
Line 146 in f939498

ProgressDeadlineAnnotationKey = GroupName + "/progress-deadline"
Details here: #12743
I created a new annotation, serving.knative.dev/revision-timeout, for the PodAutoscaler; it carries the RevisionTimeout field from the RevisionSpec. With this change I am no longer fetching the revision from the API server.
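For reference, the read side mirrors the ProgressDeadline() accessor quoted earlier; a rough sketch (the constant name and exact shape are assumptions, not necessarily what this PR ends up with):

// Sketch: mirrors ProgressDeadline() in pa_lifecycle.go; the annotation
// constant name is an assumption.
func (pa *PodAutoscaler) RevisionTimeout() (time.Duration, bool) {
	// the value is validated in the webhook
	return pa.annotationDuration(serving.RevisionTimeoutAnnotation)
}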
// We are not receiving data, but we panicked before and the panic stable window has passed, so we want to continue.
// We want to un-panic and scale down to 0 if the rest of the conditions allow it.
I'm not familiar with when and how things panic - can you elaborate on this?
Why do we only do this when the scaler panics - why not do something like this when it doesn't panic?
The scenario is something like this:
- we have a failing revision (which is what was reported in the bug)
- this revision got a burst of requests and entered panic mode
- the stable window is 60 secs by default
- the revision has a custom revisionTimeout lower than 60 secs (by default the revisionTimeout is 300 secs). This means requests in the activator will time out after revisionTimeout.
- when the requests time out, metrics are reported. As there are no requests pending, the autoscaler would want to scale down to 0, but because we are still in panic mode this is ignored and desiredPanicPodCount is applied instead.
- after that, no metrics are reported anymore, so the err != nil / errors.Is(err, metrics.ErrNoData) condition is always true: invalidSR is returned every time and no change to decider.Status.DesiredScale is ever made.

This causes the failing revision to never be scaled down to 0.
when requests timed out, metrics will be reported. As there are not requests pending, the autoscaler would want to scale down to 0, but because we are still in panic mode, this will be ignored and the desiredPanicPodCount will be applied instead.
What is the value of desiredPanicPodCount? I'm wondering if in this scenario we should opt out of panicking.
Hard to say, it depends on the calculations done in the function. On line 205 you can see this:

// We want to keep desired pod count in the [maxScaleDown, maxScaleUp] range.
desiredStablePodCount := int32(math.Min(math.Max(dspc, maxScaleDown), maxScaleUp))
desiredPanicPodCount := int32(math.Min(math.Max(dppc, maxScaleDown), maxScaleUp))

dspc and dppc are calculated like this (line 194):

dspc := math.Ceil(observedStableValue / spec.TargetValue)
dppc := math.Ceil(observedPanicValue / spec.TargetValue)

and these values come from the metricClient (line 163):

observedStableValue, observedPanicValue, err = a.metricClient.StableAndPanicConcurrency(metricKey, now)
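To put rough numbers on it: in the manual test below the revision has containerConcurrency: 2, and assuming the default container-concurrency-target-percentage of 70% (my reading of the defaults), spec.TargetValue is about 2 * 0.7 = 1.4. With 4 concurrent requests observed in the panic window, dppc = ceil(4 / 1.4) = 3, which matches the 3 pods the autoscaler asked for.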
I ran a manual test again.
I set up a ksvc revision with containerConcurrency: 2. This is a failing revision; it will not start up.
This revision also has timeoutSeconds: 30, which means that requests at the activator will time out after 30 seconds.
I created 4 concurrent requests, which made the autoscaler determine that 3 pods were needed.
For some seconds these were the values: desiredStablePodCount = 3, desiredPanicPodCount = 3. I also saw the message Operating in panic mode. during this period of time.
This block of code (sketched below) makes sure that a.maxPanicPods is the greater of desiredStablePodCount and desiredPanicPodCount, which at this point is 3.
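For context, and from memory, the block I am referring to in pkg/autoscaler/scaling/autoscaler.go looks roughly like this (field names and log wording may differ slightly from the actual source):

desiredPodCount := desiredStablePodCount
if !a.panicTime.IsZero() {
	// While panicking, provision for the larger of the stable and panic counts.
	if desiredPodCount < desiredPanicPodCount {
		desiredPodCount = desiredPanicPodCount
	}
	logger.Debug("Operating in panic mode.")
	// We do not scale down while in panic mode. Only increases will be applied.
	if desiredPodCount > a.maxPanicPods {
		logger.Infof("Increasing pods count from %d to %d.", originalReadyPodsCount, desiredPodCount)
		a.maxPanicPods = desiredPodCount
	} else if desiredPodCount < a.maxPanicPods {
		logger.Infof("Skipping pod count decrease from %d to %d.", a.maxPanicPods, desiredPodCount)
	}
	desiredPodCount = a.maxPanicPods
} else {
	logger.Debug("Operating in stable mode.")
}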
At some point the values changed to desiredStablePodCount = 3, desiredPanicPodCount = 2, and finally to desiredStablePodCount = 3, desiredPanicPodCount = 0.
Then it moved to desiredStablePodCount = 2, desiredPanicPodCount = 0, but this decision is skipped with Skipping pod count decrease from 3 to 2. (see the condition in the code above).
Then, even though the values are desiredStablePodCount = 0, desiredPanicPodCount = 0, I still see Skipping pod count decrease from 3 to 0., because we are in panic mode and we do not decrease values while in panic mode.
We are getting to desiredStablePodCount = 0 because there are no more requests at the activator (they timed out).
Without the changes in this PR, no more metrics will be received from now on for this revision, and this block of code will always evaluate to true:
if err != nil {
if errors.Is(err, metrics.ErrNoData) {
logger.Debug("No data to scale on yet")
} else {
logger.Errorw("Failed to obtain metrics", zap.Error(err))
}
return invalidSR
}
And the revision will remain in panic mode from now on.
The new condition aims to allow another execution of the function when the revision panicked and the panic stable window has passed, so that the revision can scale down.
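Concretely, the intent is roughly the following (a sketch of the proposed condition, not necessarily the exact diff in this PR; it reuses the a.panicTime and spec.StableWindow fields from the snippets above):

if err != nil {
	if errors.Is(err, metrics.ErrNoData) {
		logger.Debug("No data to scale on yet")
		// Sketch of the proposed change: if we panicked earlier and the panic
		// stable window has already passed, do not bail out here; fall through
		// so the remaining conditions can un-panic and scale the revision to 0.
		panickedAndWindowPassed := !a.panicTime.IsZero() && a.panicTime.Add(spec.StableWindow).Before(now)
		if !panickedAndWindowPassed {
			return invalidSR
		}
	} else {
		logger.Errorw("Failed to obtain metrics", zap.Error(err))
		return invalidSR
	}
}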
I still think this change is needed. Let me know your thoughts about the above explanation.
if revTimeout, err := ks.revisionTimeout(ctx, pa); err == nil && revTimeout > 0 {
	activationTimeout = time.Duration(revTimeout) * time.Second
} else if progressDeadline, ok := pa.ProgressDeadline(); ok {
Is this change necessary to fix the issue? I'd imagine failed revisions would scale down after ProgressDeadline is hit?
In our deployment config map we say that we wait ProgressDeadline until we consider a revision failed. These changes would break that existing behaviour.
These changes would break that existing behaviour.
This could be why the upgrade tests are failing
I'd imagine failed revisions would scale down after ProgressDeadline is hit?
Yeah. The failed revision will scale down eventually, after 10 min (the progress deadline). This is what the issue is reporting: why pods of the failed revision are still around.
After revisionTimeout has passed, there should not be any requests sitting at the activator, hence no reason to keep pods around.
I think the current condition comparing progressDeadline should have been changed when the revisionTimeout field was introduced.
There are new changes in this code. I check the revision-timeout when the revision is Unreachable. Could you review this again?
Scale down to zero if the revision timeout has passed and the revision is unreachable
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: jsanin-vmw
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest
In addition to my PR comments - I'm having a hard time reproducing the original bug (see: #13677 (comment)). When the activator times out the request, I'm seeing bad revisions scale down after the progress deadline.
I don't think it's necessary to continue this PR until we circle back with the original reporter for more information.
} else if revTimeout, ok := pa.RevisionTimeout(); ok && pa.CanFailActivationOnUnreachableRevision(now, revTimeout) {
	logger.Info("Activation has timed out after revision-timeout ", revTimeout)
	return desiredScale, true
After spending time thinking about this, I don't think we should be failing the revision based on the desired request timeout. The user should set the progressdeadline annotation directly if they want to lower this value.
Also, there are valid use cases where spinning up a new 'unreachable' revision is the norm, e.g. I have all my traffic pointing to an older revision and I'm applying changes to my service spec.template.spec to roll out new changes by creating new revisions. In my manual testing, the changes in this PR cause the newer unreachable revisions to fail, when normally they should spin up and then scale to zero. The change here short-circuits that.
Going to close this out - thanks for the patience and help 🙏 this is an area of the code I'm still becoming more familiar with.
Fixes #13677
Proposed Changes
Release Note