Fix defective revision can lead to pods never being removed #14573
```diff
@@ -201,6 +201,9 @@ func (ks *scaler) handleScaleToZero(ctx context.Context, pa *autoscalingv1alpha1
 	if pa.Status.CanFailActivation(now, activationTimeout) {
 		logger.Info("Activation has timed out after ", activationTimeout)
 		return desiredScale, true
+	} else if revTimeout, ok := pa.RevisionTimeout(); ok && pa.CanFailActivationOnUnreachableRevision(now, revTimeout) {
+		logger.Info("Activation has timed out after revision-timeout ", revTimeout)
+		return desiredScale, true
```
Comment on lines +204 to +206:

After spending time thinking about this I don't think we should be failing the revision based on the desired request timeout. The user should be setting the progress-deadline annotation directly if they want to lower this value. Also, there are valid use cases where spinning up a new "unreachable" revision is the norm, e.g. I have all my traffic pointing to an older revision and I'm applying changes to my service spec.template.spec to roll out new changes by creating new revisions. In my manual testing the changes in this PR cause the newer unreachable revisions to fail, when normally they should spin up and then scale to zero. The change here short-circuits that.
```diff
 	}
 	ks.enqueueCB(pa, activationTimeout)
 	return scaleUnknown, false
```
I'm not familiar with when and how things panic - can you elaborate on this?
Why do we only do this when the scaler panics - why not do something like this when it doesn't panic?
The scenario is something like this: the autoscaler goes into panic mode, so desiredPanicPodCount is applied instead of desiredStablePodCount. Once the failing revision stops reporting metrics, invalidSR is returned on every call and no change to decider.Status.DesiredScale is ever made. This causes the failing revision to never scale down to 0.
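For reference, the code path being described is roughly the following; this is a simplified sketch from memory of the autoscaler's Scale function, not the exact source:

```go
// Simplified sketch (not verbatim): once the failing revision stops reporting
// metrics, the metric client keeps returning ErrNoData, so Scale bails out
// with invalidSR on every call and decider.Status.DesiredScale is never
// updated again.
observedStableValue, observedPanicValue, err := a.metricClient.StableAndPanicConcurrency(metricKey, now)
if err != nil {
	if errors.Is(err, metrics.ErrNoData) {
		logger.Debug("No data to scale on yet")
	} else {
		logger.Errorw("Failed to obtain metrics", zap.Error(err))
	}
	return invalidSR // ScaleResult{ScaleValid: false}: the previous desired scale stays in place
}
```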
What is the value of desiredPanicPodCount? I'm wondering, in this scenario, if we should opt out of panicking.
Hard to say, it depends on the calculations done in the function. You can see this in line 205; dppc is calculated as shown in line 194, and those values come from the metricClient (line 163).
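From memory, the calculation is roughly the following; a sketch, not the exact source:

```go
// Sketch (not verbatim) of how the desired pod counts are derived from the
// stable and panic metric values returned by the metricClient.
dspc := math.Ceil(observedStableValue / spec.TargetValue) // desiredStablePodCount
dppc := math.Ceil(observedPanicValue / spec.TargetValue)  // desiredPanicPodCount
```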
I ran a manual test again.

I set up a ksvc revision with containerConcurrency: 2. This is a failing revision; it will not start up. The revision also has timeoutSeconds: 30, which means that requests at the activator will time out after 30 seconds.

I created 4 concurrent requests, which made the autoscaler determine that 3 pods were needed.
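For what it's worth, the 3-pod figure matches what I'd expect from the target utilization; here is a small self-contained check (the 70% container-concurrency-target-percentage is an assumption based on the default, it is not set explicitly in this test):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const containerConcurrency = 2.0 // from the revision spec above
	const targetUtilization = 0.7    // assumed default container-concurrency-target-percentage
	const observedConcurrency = 4.0  // 4 concurrent requests

	// Effective per-pod target is 2 * 0.7 = 1.4, so 4 concurrent requests
	// need ceil(4 / 1.4) = 3 pods.
	target := containerConcurrency * targetUtilization
	fmt.Println(math.Ceil(observedConcurrency / target)) // prints 3
}
```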
The autoscaler held these values for some seconds, and I also saw the message "Operating in panic mode." during this period of time.

This block of code makes sure that a.maxPanicPods is the greater of desiredStablePodCount and desiredPanicPodCount, which at this point is 3.
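The clamp I'm referring to looks roughly like this (a sketch from memory, not the exact source); it is also where the "Skipping pod count decrease" log lines below come from:

```go
// Sketch (not verbatim): while in panic mode, never let the pod count drop
// below the maximum reached during the panic.
if desiredPodCount > a.maxPanicPods {
	logger.Infof("Increasing pods count from %d to %d.", originalReadyPodsCount, desiredPodCount)
	a.maxPanicPods = desiredPodCount
} else if desiredPodCount < a.maxPanicPods {
	logger.Infof("Skipping pod count decrease from %d to %d.", a.maxPanicPods, desiredPodCount)
}
desiredPodCount = a.maxPanicPods
```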
At some point the values changed, and eventually the desired pod count dropped to 2, but that decision is skipped due to "Skipping pod count decrease from 3 to 2." (see the condition in the code above).

Then, even though the values keep dropping, I still see the skipping message "Skipping pod count decrease from 3 to 0.", because we are in panic mode and we do not decrease values while in panic mode. We get to desiredStablePodCount = 0 because there are no more requests at the activator (they timed out).

Without the changes in this PR, no more metrics will be received from now on for this revision, and this block of code will always be true.
And the revision will remain in panic mode from now on.

The new condition aims to allow another execution of the function once the revision has panicked and the stable window has elapsed, so the revision will scale down.
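To make the intent concrete, the new condition is along these lines; this is an illustrative sketch of the idea only, not the PR's exact diff, and the field names are assumptions:

```go
// Illustrative sketch only (field names are assumptions): when metrics are
// gone, still let the scale calculation proceed if the revision had panicked
// and a full stable window has already elapsed, so the desired scale can
// finally drop and the revision can leave panic mode.
panicWindowOver := !a.panicTime.IsZero() && a.panicTime.Add(spec.StableWindow).Before(now)
if err != nil && !(errors.Is(err, metrics.ErrNoData) && panicWindowOver) {
	return invalidSR
}
```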
I still think this change is needed. Let me know your thoughts about the above explanation.