Fix: defective revision can lead to pods never being removed #14573
Conversation
Skipping CI for Draft Pull Request.
Codecov Report
Attention:
Additional details and impacted files

@@            Coverage Diff             @@
##             main   #14573      +/-   ##
==========================================
- Coverage   85.99%   85.74%   -0.26%
==========================================
  Files         197      198       +1
  Lines       14916    15143     +227
==========================================
+ Hits        12827    12984     +157
- Misses       1777     1833      +56
- Partials      312      326      +14

☔ View full report in Codecov by Sentry.
Force-pushed from 623d48b to 16007ba
…adline. Also added a condition to the Scale function: in case we are not receiving data, check whether we panicked earlier and the panic stable window has passed, and if so continue to evaluate the remaining conditions.
Force-pushed from 16007ba to 905e516
Version: "v1", | ||
Resource: "revisions", | ||
} | ||
uRevision, err := ks.dynamicClient.Resource(gvr).Namespace(pa.Namespace).Get(ctx, pa.Name, metav1.GetOptions{}) |
We don't want to be fetching revisions when we're making scaling decisions - that'll overload the API server at scale and possibly trigger client-side rate limiting.
Good point.
In case this proposal is viable, where is a good place in the code to fetch the revisionTimeout?
Just to note that this is called only when we are scaling to 0.
In case this proposal is viable, where is a good place in the code to fetch the revisionTimeout?
Two options:
- we add a field to the Knative PodAutoscaler type
- we propagate this using an annotation on the PodAutoscaler type (a rough sketch of the annotation route follows)
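A minimal sketch of the annotation route, purely for illustration - the annotation key, helper name, and package here are assumptions, not the code in this PR:

// Sketch only: propagate the revision timeout onto the PodAutoscaler so the
// autoscaler can read it later without calling the API server.
package resources

import (
	"fmt"

	autoscalingv1alpha1 "knative.dev/serving/pkg/apis/autoscaling/v1alpha1"
	v1 "knative.dev/serving/pkg/apis/serving/v1"
)

// hypothetical annotation key; the real one may differ
const revisionTimeoutAnnotationKey = "serving.knative.dev/revision-timeout"

// propagateRevisionTimeout copies the Revision's TimeoutSeconds onto the
// PodAutoscaler annotations when the PA is built from the Revision.
func propagateRevisionTimeout(pa *autoscalingv1alpha1.PodAutoscaler, rev *v1.Revision) {
	if rev.Spec.TimeoutSeconds == nil {
		return
	}
	if pa.Annotations == nil {
		pa.Annotations = map[string]string{}
	}
	pa.Annotations[revisionTimeoutAnnotationKey] = fmt.Sprintf("%ds", *rev.Spec.TimeoutSeconds)
}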
Looks like we already have an annotation:
serving/pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go
Lines 171 to 174 in f939498

func (pa *PodAutoscaler) ProgressDeadline() (time.Duration, bool) {
	// the value is validated in the webhook
	return pa.annotationDuration(serving.ProgressDeadlineAnnotation)
}
Oddly, it seems like it's a user knob that we propagate to the deployment - but we aren't setting it on all PodAutoscalers (e.g. when the user doesn't set it).
serving/pkg/apis/serving/register.go
Line 146 in f939498

ProgressDeadlineAnnotationKey = GroupName + "/progress-deadline"
Details here: #12743
I created a new annotation, serving.knative.dev/revision-timeout, for the PodAutoscaler; it carries the RevisionTimeout field from the RevisionSpec. With this change I am no longer fetching the revision from the API server.
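For reference, the read side mirrors the ProgressDeadline() accessor quoted earlier; a rough sketch (the constant name and exact shape are assumptions, not necessarily what this PR ends up with):

// Sketch: mirrors ProgressDeadline() in pa_lifecycle.go; the annotation
// constant name is an assumption.
func (pa *PodAutoscaler) RevisionTimeout() (time.Duration, bool) {
	// the value is validated in the webhook
	return pa.annotationDuration(serving.RevisionTimeoutAnnotation)
}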
// We are not receiving data, but we panicked before and the panic stable window has passed, so we want to continue.
// We want to un-panic and scale down to 0 if the rest of the conditions allow it.
I'm not familiar with when and how things panic - can you elaborate on this?
Why do we only do this when the scaler panics - why not do something like this when it doesn't panic?
The scenario is something like this:
- we have a failing revision (which is what was reported in the bug)
- this revision got a burst of requests and entered panic mode
- the stable window is 60 secs by default
- the revision has a custom revisionTimeout lower than 60 secs (by default the revisionTimeout is 300 secs). This means requests in the activator will time out after revisionTimeout.
- when the requests time out, metrics are reported. As there are no requests pending, the autoscaler would want to scale down to 0, but because we are still in panic mode this is ignored and desiredPanicPodCount is applied instead.
- after that, no metrics are reported anymore, so the err != nil / errors.Is(err, metrics.ErrNoData) condition is always true: invalidSR is returned every time and no change to decider.Status.DesiredScale is ever made.

This causes the failing revision to never be scaled down to 0.
when requests timed out, metrics will be reported. As there are not requests pending, the autoscaler would want to scale down to 0, but because we are still in panic mode, this will be ignored and the desiredPanicPodCount will be applied instead.
What is the value of desiredPanicPodCount? I'm wondering if in this scenario we should opt out of panicking.
Hard to say, it depends on the calculations done in the function. On line 205 you can see this:

// We want to keep desired pod count in the [maxScaleDown, maxScaleUp] range.
desiredStablePodCount := int32(math.Min(math.Max(dspc, maxScaleDown), maxScaleUp))
desiredPanicPodCount := int32(math.Min(math.Max(dppc, maxScaleDown), maxScaleUp))

dspc and dppc are calculated like this (line 194):

dspc := math.Ceil(observedStableValue / spec.TargetValue)
dppc := math.Ceil(observedPanicValue / spec.TargetValue)

and these values come from the metricClient (line 163):

observedStableValue, observedPanicValue, err = a.metricClient.StableAndPanicConcurrency(metricKey, now)
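To put rough numbers on it: in the manual test below the revision has containerConcurrency: 2, and assuming the default container-concurrency-target-percentage of 70% (my reading of the defaults), spec.TargetValue is about 2 * 0.7 = 1.4. With 4 concurrent requests observed in the panic window, dppc = ceil(4 / 1.4) = 3, which matches the 3 pods the autoscaler asked for.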
I ran a manual test again.
I set up a ksvc revision with containerConcurrency: 2. This is a failing revision; it will not start up.
This revision also has timeoutSeconds: 30, which means that requests at the activator will time out after 30 seconds.
I created 4 concurrent requests, which made the autoscaler determine that 3 pods were needed.
For some seconds these were the values: desiredStablePodCount = 3, desiredPanicPodCount = 3. I also saw the message Operating in panic mode. during this period of time.
This block of code (sketched below) makes sure that a.maxPanicPods is the greater of desiredStablePodCount and desiredPanicPodCount, which at this point is 3.
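For context, and from memory, the block I am referring to in pkg/autoscaler/scaling/autoscaler.go looks roughly like this (field names and log wording may differ slightly from the actual source):

desiredPodCount := desiredStablePodCount
if !a.panicTime.IsZero() {
	// While panicking, provision for the larger of the stable and panic counts.
	if desiredPodCount < desiredPanicPodCount {
		desiredPodCount = desiredPanicPodCount
	}
	logger.Debug("Operating in panic mode.")
	// We do not scale down while in panic mode. Only increases will be applied.
	if desiredPodCount > a.maxPanicPods {
		logger.Infof("Increasing pods count from %d to %d.", originalReadyPodsCount, desiredPodCount)
		a.maxPanicPods = desiredPodCount
	} else if desiredPodCount < a.maxPanicPods {
		logger.Infof("Skipping pod count decrease from %d to %d.", a.maxPanicPods, desiredPodCount)
	}
	desiredPodCount = a.maxPanicPods
} else {
	logger.Debug("Operating in stable mode.")
}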
At some point the values changed to desiredStablePodCount = 3, desiredPanicPodCount = 2, and finally to desiredStablePodCount = 3, desiredPanicPodCount = 0.
Then it moved to desiredStablePodCount = 2, desiredPanicPodCount = 0, but this decision is skipped with Skipping pod count decrease from 3 to 2. (see the condition in the code above).
Then, even though the values are desiredStablePodCount = 0, desiredPanicPodCount = 0, I still see Skipping pod count decrease from 3 to 0., because we are in panic mode and we do not decrease values while in panic mode.
We are getting to desiredStablePodCount = 0 because there are no more requests at the activator (they timed out).
Without the changes in this PR, no more metrics will be received from now on for this revision, and this block of code will always evaluate to true:
if err != nil {
if errors.Is(err, metrics.ErrNoData) {
logger.Debug("No data to scale on yet")
} else {
logger.Errorw("Failed to obtain metrics", zap.Error(err))
}
return invalidSR
}
And the revision will remain in panic mode from now on.
The new condition aims to allow another execution of the function when the revision panicked and the panic stable window has passed, so that the revision can scale down.
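Concretely, the intent is roughly the following (a sketch of the proposed condition, not necessarily the exact diff in this PR; it reuses the a.panicTime and spec.StableWindow fields from the snippets above):

if err != nil {
	if errors.Is(err, metrics.ErrNoData) {
		logger.Debug("No data to scale on yet")
		// Sketch of the proposed change: if we panicked earlier and the panic
		// stable window has already passed, do not bail out here; fall through
		// so the remaining conditions can un-panic and scale the revision to 0.
		panickedAndWindowPassed := !a.panicTime.IsZero() && a.panicTime.Add(spec.StableWindow).Before(now)
		if !panickedAndWindowPassed {
			return invalidSR
		}
	} else {
		logger.Errorw("Failed to obtain metrics", zap.Error(err))
		return invalidSR
	}
}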
I still think this change is needed. Let me know your thoughts about the above explanation.
if revTimeout, err := ks.revisionTimeout(ctx, pa); err == nil && revTimeout > 0 {
	activationTimeout = time.Duration(revTimeout) * time.Second
} else if progressDeadline, ok := pa.ProgressDeadline(); ok {
Is this change necessary to fix the issue? I'd imagine failed revisions would scale down after ProgressDeadline is hit?
In our deployment config map we say that we wait ProgressDeadline until we consider a revision failed. These changes would break that existing behaviour.
These changes would break that existing behaviour.
This could be why the upgrade tests are failing
I'd imagine failed revisions would scale down after ProgressDeadline is hit?
Yeah. The failed revision will scale down eventually, after 10 min (the progress deadline). This is what the issue is reporting: why pods of the failed revision are still around.
After revisionTimeout has passed, there should not be any requests sitting at the activator, hence no reason to keep pods around.
I think the current condition comparing progressDeadline should have been changed when the revisionTimeout field was introduced.
There are new changes in this code. I check the revision-timeout when the revision is Unreachable. Could you review this again?
Scale down to zero if the revision timeout has passed and the revision is unreachable
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: jsanin-vmw
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest
In addition to my PR comments - I'm having a hard time reproducing the original bug (see: #13677 (comment)). When the activator times out the request, I'm seeing bad revisions scale down after the progress deadline.
I don't think it's necessary to continue this PR until we circle back with the original reporter for more information.
} else if revTimeout, ok := pa.RevisionTimeout(); ok && pa.CanFailActivationOnUnreachableRevision(now, revTimeout) {
	logger.Info("Activation has timed out after revision-timeout ", revTimeout)
	return desiredScale, true
After spending time thinking about this, I don't think we should be failing the revision based on the desired request timeout. The user should set the progressdeadline annotation directly if they want to lower this value.
Also, there are valid use cases where spinning up a new 'unreachable' revision is the norm, e.g. I have all my traffic pointing to an older revision and I'm applying changes to my service spec.template.spec to roll out new changes by creating new revisions. In my manual testing, the changes in this PR cause the newer unreachable revisions to fail, when normally they should spin up and then scale to zero. The change here short-circuits that.
Going to close this out - thanks for the patience and help 🙏 this is an area of the code I'm still becoming more familiar with.
Fixes #13677
Proposed Changes
Release Note