
[Experimental] DeliverySpec RetryAfter #5811

Open
travis-minke-sap opened this issue Oct 13, 2021 · 16 comments
Labels
kind/feature-request roadmap Issues for linking from the roadmap triage/accepted Issues which should be fixed (post-triage)

Comments

@travis-minke-sap
Contributor

travis-minke-sap commented Oct 13, 2021

Description
The Retry-After header is a standard part of the HTTP spec which can be returned with 429 and 503 responses in order to specify a duration, or timestamp, that the sender should wait before attempting to resend. It provides downstream event recipients with a mechanism to apply back-pressure when requests are arriving too frequently. The event retry mechanism in Knative eventing currently does not respect the Retry-After header. The intent of this new experimental-feature is to let users opt in to respecting the Retry-After header.
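
For reference, RFC 7231 allows the Retry-After value to be either a number of seconds or an HTTP-date, e.g.:

HTTP/1.1 429 Too Many Requests
Retry-After: 120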

Design Overview
Following the pattern established in the experimental-features process, and closely mirroring the implementation in the similar Delivery Timeout experimental-feature, the plan is to enhance the DeliverySpec to include a new optional retryAfter component. Use of this new component will be gated by the delivery-retryafter experimental feature flag in the config-features ConfigMap and enforced via WebHook validation.

Example DeliverySpec with new retryAfter component...

delivery:
  backoffDelay: PT0.5S
  backoffPolicy: exponential
  retry: 3
  retryAfter:
    enabled: true
    maxDuration: "PT30S"

The new retryAfter component will only take effect if the retry value is specified and is at least 1. The optional maxDuration field provides an override to prevent excessive backoff durations as might be desirable in certain use cases.
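
For illustration, the corresponding Go API addition might look roughly like the following (a sketch mirroring the YAML fields above, not the final API)...

type RetryAfter struct {
	// Enabled controls whether Retry-After headers are respected.
	Enabled bool `json:"enabled"`
	// MaxDuration optionally caps the backoff taken from a Retry-After header (ISO 8601 duration string).
	MaxDuration *string `json:"maxDuration,omitempty"`
}

type DeliverySpec struct {
	// ... existing fields (Retry, BackoffPolicy, BackoffDelay, Timeout) ...
	RetryAfter *RetryAfter `json:"retryAfter,omitempty"`
}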

Exit Criteria
DeliverySpec allows optional configuration of retry behavior for 429 and 503 Retry-After headers.

Experimental Feature Flag Name
delivery-retryafter

Experimental Feature Stages
The following is the proposed plan for moving through the stages of the experimental feature process...

Affected WG

  • Eventing WG
  • Eventing-Kafka WG (previously Event-Delivery)

Additional Context

@travis-minke-sap
Contributor Author

/assign

@evankanderson
Member

Do we need the enabled bool, or could we require a max duration (even if it's something silly like "1 week")?

Other questions:

  1. Any guidance to implementers about handling this in-process vs in the storage layer?

  2. Should a Retry-After apply to all deliveries to that host, or only that specific delivery?

  3. Should there be an administrator-level ConfigMap parameter to specify a max retry-after duration?

@evankanderson
Member

(I'm 👍 on this feature, by the way)

@travis-minke-sap
Contributor Author

Thanks for the 👀 's on this Evan - much appreciated! I'll do my best to respond, forgive me if I miss the intent - I don't have as clear an understanding of the larger eventing ecosystem as you do (too narrowly focused on the KafkaChannel ; )

Also, I was holding off on creating the PR (I probably got ahead of myself by doing the work before the issue was reviewed) until the e2e tests were done... but I've gone ahead and created it so that folks can see the intended changes. I am still happy to refactor as the community deems necessary ; )

Do we need the enabled bool, or could we require a max duration (even if it's something silly like "1 week")?

I would vote to keep the separation, but am certainly open to that approach if the community prefers it. I only added maxDuration as an afterthought safety gauge, and wasn't sure what the reaction or need would be. It feels a bit wrong to combine the two and require a maxDuration in all cases, and more importantly for it to "infer" the enabled state. Just my $0.02 though - we can see if there are any other opinions ; )

  1. Any guidance to implementers about handling this in-process vs in the storage layer?

I'm not sure what you mean by "handling in the storage layer"? My intent (and what I've implemented thus far) is to handle this exclusively "in-process" on a per-event basis. Meaning, when a dispatcher sends an event and receives back a 429/Retry-After, it would apply that backoff before resending that event only.

  2. Should a Retry-After apply to all deliveries to that host, or only that specific delivery?

As I mentioned in 1. above, I see this as being used to inform the determination of the backoff duration for a single event request only. This is what I've implemented in the message_sender.go SendWithRetries() implementation. There is nothing about this new experimental-feature that would prevent a custom implementation from taking a broader view and throttling other parallel requests.

  3. Should there be an administrator-level ConfigMap parameter to specify a max retry-after duration?

I haven't seen any precedent for similar DeliverySpec changes (e.g. the Timeout field) and wouldn't know where such a thing would live. I would rather wait and see if there is demand for a global config and add it later if necessary.
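
To make that concrete, here's a rough sketch of the kind of per-event retry loop I mean (names are illustrative, not the actual SendWithRetries code; it assumes the request body is replayable and only handles the delta-seconds form of Retry-After)...

package example

import (
	"net/http"
	"strconv"
	"time"
)

// sendWithRetries is a sketch of a per-event retry loop: when the target answers
// 429/503 with a Retry-After header, that value replaces the normal backoff for
// the next attempt of this event only.
func sendWithRetries(client *http.Client, req *http.Request, retries int, backoffDelay time.Duration) (*http.Response, error) {
	var resp *http.Response
	var err error
	for attempt := 0; ; attempt++ {
		resp, err = client.Do(req)
		if err == nil && resp.StatusCode < 300 {
			return resp, nil // delivered successfully
		}
		if attempt >= retries {
			break // retries exhausted
		}
		// Normal exponential backoff derived from the DeliverySpec's backoffDelay.
		backoff := backoffDelay * time.Duration(1<<attempt)
		if resp != nil {
			if resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode == http.StatusServiceUnavailable {
				// Honor the Retry-After header for this event's next attempt.
				if secs, perr := strconv.Atoi(resp.Header.Get("Retry-After")); perr == nil {
					backoff = time.Duration(secs) * time.Second
				}
			}
			resp.Body.Close() // close the failed response before retrying
		}
		time.Sleep(backoff)
	}
	return resp, err
}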

@matzew
Member

matzew commented Oct 14, 2021

If I set retry: 3 (or anything larger than 0) I do indicate that I am interested in "retrying".

So, for me I'd think that I am also interested in doing Retry-After recognition, for a good reason.

Therefore I'm not really sure if we need to tweak it and add extra knobs, like enabling/disabling it.

However, I can see that if the server indicates 600 seconds, this might be too long for me...

@matzew
Member

matzew commented Oct 14, 2021

(I'm 👍 on this feature too, by the way)

@travis-minke-sap
Contributor Author

If I set retry: 3 (or anything larger than 0) I do indicate that I am interested in "retrying".

So, for me I'd think that I am also interested in doing Retry-After recognition, for a good reason.

Therefore I'm not really sure if we need to tweak it and add extra knobs, like enabling/disabling it.

However, I can see that if the server indicates 600 seconds, this might be too long for me...

Good point @matzew - I think the question you're really asking is... "If/when this feature graduates to Stable/GA, would we want to allow users of the retry capability an option to respect/ignore Retry-After headers?" If we are comfortable answering "No, if you use retry then Retry-After headers WILL be respected", then the DeliverySpec configuration can be reduced to a single optional field as follows...

delivery:
  backoffDelay: PT0.5S
  backoffPolicy: exponential
  retry: 3
  retryAfterMax: PT120S

...and during the experimental-features Alpha/Beta stages the feature flag would simply control whether or not to respect the Retry-After headers, instead of gating the DeliverySpec.RetryAfter configuration during validation.

Personally, I'm fine with this approach as it seems to better adhere to the intentions of the HTTP Spec, but I thought this was undesirable from a "don't change the existing behavior" perspective. It's unclear how many users would actually try out the experimental-feature between Alpha/Stable, so it's possible the switchover might catch them out?

@matzew
Member

matzew commented Oct 14, 2021 via email

@travis-minke-sap
Contributor Author

I’d use it: if I can honor the retry-after header I’d like to do so.
IMO an oversight when designing retry

I like it! I can re-tool to that implementation if folks agree (or no one disagrees after a week and a summary at next week's WG)?

@lionelvillard
Member

+1 for retryAfterMax. The default behavior should be to always respect retry-after.

Setting retryAfterMax: PT0S provides an effective way to disable this feature. We could provide a post-install job to set this field, as a way to make sure that existing applications won't break when upgrading.
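
For example, reusing the spec shown earlier, opting out of Retry-After handling would then look like...

delivery:
  backoffDelay: PT0.5S
  backoffPolicy: exponential
  retry: 3
  retryAfterMax: PT0S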

@pierDipi
Member

+1 for retryAfterMax. The default behavior should be to always respect retry-after.

+1.

@pierDipi
Member

The default behavior should be to always respect retry-after.

You mean when this goes GA, right?

@lionelvillard
Member

You mean when this goes GA, right?

When this feature is enabled.

@travis-minke-sap
Contributor Author

Hey everyone - thanks for the feedback - great discussion and suggestions! I'm going to summarize what I heard as the preferred approach, along with a related complication and a proposal for how to resolve it. Sorry it got a bit long-winded - trying to be clear / precise...

New approach...

  • There will be no Enabled flag in the DeliverySpec and instead Retry-After headers will always be respected (when the experimental-feature flag is "enabled", and if/when the feature becomes "Stable / GA").
  • The DeliverySpec change will include an optional retryAfterMax field which can be set to PT0S to allow users to opt-out of Retry-After header support as follows...
    delivery:
      backoffDelay: PT0.5S
      backoffPolicy: exponential
      retry: 3
      retryAfterMax: PT120S
    

Complication...

The nice thing about having the Enabled field in the DeliverySpec is that it lets us enforce the experimental-feature flag in the eventing WebHook. This is a single touch-point that will cover all of Knative eventing and it is already watching the config-features ConfigMap.

If we eliminate the Enabled field from the DeliverySpec, then we will instead have to track the experimental-feature state in the lower-level RetryConfig struct which is passed to DispatchMessageWithRetries() in channel/message_dispatcher.go and possibly in RetryConfigFromDeliverySpec() in kncloudevents/retries.go. This would require ALL sandbox implementations to start watching the config-features ConfigMap and track it in their context so that it can be passed to those functions above for enforcement. I'm assuming this is a deal-breaker and not how the experimental-features were intended to be used for such "common" behavior changes?

Proposed Solution...

We could REQUIRE the use of the retryAfterMax field in the DeliverySpec ONLY while this is an experimental-feature. Meaning, in order to use it as an experimental feature you would have to enable the feature in the ConfigMap AND set the retryAfterMax to some value greater than 0. Then we could still enforce the experimental-feature flag in the eventing WebHook.

If / when the experimental-feature graduates to "Stable / GA", we will remove the requirement for the retryAfterMax field to be populated in order for Retry-After enforcement to occur. It will stop being "dual-use" and will become an optional field used only to specify a max override duration or to opt out of Retry-After headers. The downside is that there will be a small code/behavior change at the switch to "Stable / GA", but it seems reasonable to me - thoughts? I think this is actually close to what @evankanderson was suggesting originally (apologies if I missed the point there ; )

The following table attempts to summarize the operation for each stage/use-case...

| EF Stage | Feature Flag | retryAfterMax Field Absent | retryAfterMax Field Present |
| --- | --- | --- | --- |
| Alpha / Beta | Disabled | Accepted by Webhook Validation & no Retry-After header enforced | Rejected by Webhook Validation |
| Alpha / Beta | Enabled | Accepted by Webhook Validation & no Retry-After header enforced | Accepted by Webhook Validation & Retry-After header enforced if max > 0 |
| Stable / GA | n/a | Retry-After header enforced without max override | Retry-After header enforced with max override if max > 0 |

Thanks again for the review - let me know your thoughts, otherwise I'll proceed with this approach (will give it a few days to become lazily accepted ; )

@pierDipi
Member

pierDipi commented Oct 20, 2021

I'm ok with the proposed solution.

Or if we're worried about making changes in non-webhook code when this goes GA, we can make this field always required but defaulted by the webhook when the feature is enabled (or GA).

type DeliverySpec struct {
  // ...
  RetryAfterMax *string // JSON stuff
}

Defaulting webhook in alpha and beta phase:

if featureEnabled && ds.RetryAfterMax == nil {
  // Take globalDefault from a CM
  ds.RetryAfterMax = globalDefault
}

Validation webhook in alpha and beta phase:

if featureDisabled && ds.RetryAfterMax != nil {
  // fail webhook
}

Defaulting webhook in GA phase:

if ds.RetryAfterMax == nil {
  // Take globalDefault from a CM
  ds.RetryAfterMax = globalDefault
}

Validation webhook in GA:

// Only the validation of the `ds.RetryAfterMax` format remains

In the dispatcher code, checking whether ds.RetryAfterMax == nil tells us if the feature is disabled (or enabled).

The problem is that we need to choose a global default that is good for most use cases (PT60S?).

@travis-minke-sap
Contributor Author

There's been a good discussion here and in a Slack thread followed by a quiet period so I thought I'd summarize the current approach for another review... (thanks to everyone who participated!)

Essentially the proposal above is the plan. The TLDR is...

  • The Retry-After header will only be supported for 429 and 503 error responses.
  • Retry-After headers are only considered when normal DeliverySpec retry has been enabled/configured, and are only used to determine the backoff of an otherwise planned retry attempt.
  • The DeliverySpec will be modified to include a new optional field called retryAfterMax.
    • The Retry-After value in a 503 response could be rather large (hours), which is why retryAfterMax allows an optional override to prevent excessive blocking (see the sketch after this list).
    • In Alpha / Beta states retryAfterMax must be set in order to opt-in to respecting Retry-After headers.
    • In Stable / GA state retryAfterMax can be set ("PT0S" or similar) in order to opt-out of respecting Retry-After headers.
  • The common/shared code in the eventing repo will provide this capability to implementations who use it.
    • Implementations with custom event sending and retry logic are free to respect/ignore this new optional retryAfterMax field according to their own abilities/timelines. (This is how the recently added DeliverySpec.Timeout experimental-feature seems to have been handled.)
    • Documentation of the feature should make the implementation-specific nature of the feature clear.
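
As a rough illustration of the intended header handling (illustrative names, not the final dispatcher code), a Retry-After value can arrive as delta-seconds or as an HTTP-date and would be clamped by retryAfterMax...

package example

import (
	"net/http"
	"strconv"
	"time"
)

// retryAfterBackoff converts a Retry-After header value into a backoff duration,
// clamped to max (the duration parsed from the DeliverySpec's retryAfterMax).
func retryAfterBackoff(headerValue string, max time.Duration) (time.Duration, bool) {
	if headerValue == "" {
		return 0, false
	}
	var backoff time.Duration
	if secs, err := strconv.Atoi(headerValue); err == nil {
		backoff = time.Duration(secs) * time.Second // delta-seconds form, e.g. "120"
	} else if t, err := http.ParseTime(headerValue); err == nil {
		backoff = time.Until(t) // HTTP-date form, e.g. "Fri, 31 Dec 2021 23:59:59 GMT"
	} else {
		return 0, false // unparseable header: fall back to normal backoff
	}
	if backoff < 0 {
		backoff = 0
	}
	if max > 0 && backoff > max {
		backoff = max // retryAfterMax caps excessive server-requested delays
	}
	return backoff, true
}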

Discussion Highlights

  • The flip-flop of behavior (opt-in to opt-out) between Beta/GA is not ideal, but it allows for a single, consistent validation enforcement of the experimental-feature without requiring an "enabled" field. The alternative is to require implementations using the common/shared eventing logic to individually track the experimental-feature state.
  • The Retry-After header is meant to be respected only when calculating the backoff time of the next, already planned, retry attempt for an in-process event transmission. There is no expectation of out-of-band async retry at a later time.
  • There is a disconnect with regard to defining a common DeliverySpec API, and accompanying shared libs, while still allowing sandbox implementations (rabbitmq, google pub/sub, etc.) to create their own event sending/retry logic. This complicates the ability to change the DeliverySpec, as it requires understanding the capabilities of any/all custom implementations. The alternative chosen here, and as precedent with the similar DeliverySpec.Timeout experimental-feature, is to leave that decision to those implementations and document the possible inconsistency for end users to understand.
  • The concern that respecting the Retry-After header when calculating backoff durations might block an event-sending thread-pool was noted to be no different from the same possibility with a normal retry using large backoff durations, and can be mitigated by setting retryAfterMax appropriately.
  • It was suggested that this could be implemented in the CloudEvents sdk-go library, with the common/shared eventing retry logic refactored to use it. This is a much larger architectural change that would have to be evaluated separately.

@devguyio devguyio moved this to In Progress in Eventing WG Roadmap Dec 1, 2021
@devguyio devguyio added the roadmap Issues for linking from the roadmap label Dec 2, 2021
@devguyio devguyio added the triage/accepted Issues which should be fixed (post-triage) label Dec 16, 2021
@devguyio devguyio moved this from In Progress to Experimental Features in Eventing WG Roadmap Mar 3, 2022
@pierDipi pierDipi moved this from Alpha Features to In Progress in Eventing WG Roadmap Aug 17, 2022
@pierDipi pierDipi moved this from In Progress to Ready To Work in Eventing WG Roadmap Feb 22, 2024