
[Experimental-Feature] DeliverySpec.RetryAfter #5813

Conversation

travis-minke-sap
Contributor

@travis-minke-sap travis-minke-sap commented Oct 14, 2021

This PR addresses parts of #5811 by creating the experimental-feature in the Alpha stage.

TL;DR: A new experimental feature allowing users to opt in to respecting the Retry-After header when calculating the appropriate backoff duration for 429 and 503 responses.

Proposed Changes

  • 🎁 Add new delivery-retryafter feature flag.
  • 🎁 Enhance DeliverySpec with new optional RetryAfterMax component
  • 🎁 Enforce DeliverySpec.RetryAfterMax usage against feature flag in WebHook validation
  • 🎁 Add new DeliverySpec.RetryAfterMax to Subscription Controller/Reconciler updating of Channel.Spec.Subscribers
  • 🎁 Enhance RetryConfig struct to include RetryAfterMax configuration.
  • 🎁 Enhance RetryConfigFromDeliverySpec() to parse DeliverySpec.RetryAfterMax into RetryConfig.
  • 🎁 Enhance SendWithRetries() to consider Retry-After headers and RetryConfig settings when calculating backoff durations.
  • 🎁 Update codegen (local docs)
  • 🎁 Update experimental-features documentation

Open Questions

  1. I modified pkg/reconciler/subscription/subscription_test.go to cover the changes in the Subscription controller. This introduced the need for the test to be aware of experimental-features. I think this is a win but wanted to confirm it was acceptable (i.e., no need to keep the controller test "pure").
  2. This implementation mirrors that of the Timeout experimental-feature, which also updated the v1beta1 version of the DeliverySpec without any experimental-feature gates. I wasn't sure of the rationale for doing so and have NOT done so in this PR. If someone could explain why we'd want to do this, I'll be happy to include it.

Pre-review Checklist

  • At least 80% unit test coverage
  • E2E tests for any new behavior
  • Docs PR for any user-facing impact
  • Spec PR for any new API feature (Will be created if experimental-feature graduates to Stable state.)
  • Conformance test for any change to the spec. (If needed will be added in experimental-feature Beta release.)

Release Note

The new experimental-feature flag "delivery-retryafter" allows use of "DeliverySpec.retryAfterMax" to configure handling of Retry-After headers in 429 / 503 responses. See https://github.com/knative/docs/blob/main/docs/eventing/experimental-features.md

Docs

knative/docs#4361

Holding for the experimental-features review process and review of the refactoring approach
/hold

@knative-prow-robot knative-prow-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 14, 2021
@google-cla google-cla bot added the cla: yes Indicates the PR's author has signed the CLA. label Oct 14, 2021
@codecov

codecov bot commented Oct 14, 2021

Codecov Report

Merging #5813 (5be0e20) into main (2cda8f4) will increase coverage by 0.21%.
The diff coverage is 96.00%.


@@            Coverage Diff             @@
##             main    #5813      +/-   ##
==========================================
+ Coverage   82.02%   82.23%   +0.21%     
==========================================
  Files         220      220              
  Lines        7527     7572      +45     
==========================================
+ Hits         6174     6227      +53     
+ Misses        918      911       -7     
+ Partials      435      434       -1     
Impacted Files Coverage Δ
pkg/kncloudevents/message_sender.go 90.19% <92.00%> (+0.54%) ⬆️
pkg/apis/duck/v1/delivery_types.go 93.54% <100.00%> (+1.88%) ⬆️
pkg/kncloudevents/retries.go 100.00% <100.00%> (+13.63%) ⬆️
pkg/reconciler/subscription/subscription.go 82.46% <100.00%> (+2.23%) ⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2cda8f4...5be0e20.

@knative-prow-robot knative-prow-robot added the area/test-and-release Test infrastructure, tests or release label Oct 21, 2021
// header value will be used when calculating the next backoff duration. This will
// only be considered when a 429 (Too Many Requests) or 503 (Service Unavailable)
// response code is received and Retry is greater than 0.
Enabled bool `json:"enabled"`
Member

shouldn't we always respect the Retry-After?

I am not really sure I like the toggling here.

Contributor Author

This PR is stale - based on the original design, which has evolved (and is evolving in the associated Issue). Yes, everyone was against the "enabled" flag and in the latest design it is gone. I'm waiting to update this PR until the new design settles - I need to summarize and push for agreement again...

// - https://www.iso.org/iso-8601-date-and-time-format.html
// - https://en.wikipedia.org/wiki/ISO_8601
// +optional
MaxDuration *string `json:"maxDuration,omitempty"`
Member

Assuming we do not toggle (just thinking out loud) and have respecting Retry-After as the default:

Setting MaxDuration: -1 could disable it.
If nothing (or a value greater than 0) is provided, we do as stated in the comments.

🤷

Contributor Author

Yes the idea is that at least in GA it would be always respected and you could opt out by specifying "PT0S" or something similar.

@matzew
Member

matzew commented Nov 9, 2021

/assign @matzew

@travis-minke-sap
Contributor Author

PR has been refactored to align with current plan and is ready for review!

Contributor

@odacremolbap odacremolbap left a comment


Awesome piece of code.
I added minor comments and learned some stuff about testing in eventing, thanks for that.

Could we add to the docs PR (not sure if also to the API) that when the backoff and Retry-After durations compete to set the duration, the larger is honored?

// Return The Larger Of The Two Backoff Durations
if retryAfterDuration > backoffDuration {
	return retryAfterDuration
} else {
	return backoffDuration
}

pkg/apis/duck/v1/delivery_types.go Outdated Show resolved Hide resolved
pkg/apis/duck/v1/delivery_types.go Outdated Show resolved Hide resolved
@@ -112,6 +119,15 @@ func RetryConfigFromDeliverySpec(spec v1.DeliverySpec) (RetryConfig, error) {
retryConfig.RequestTimeout, _ = timeout.Duration()
}

if spec.RetryAfterMax != nil {
Contributor

Not sure if this is too picky, but I think it would be a nice policy to re-check the feature enablement here. If an admin enables a feature, then disables it and restarts the controller, I think it would be expected that this value is not used.

This would be important for security related features, probably not for this one. Leaving the comment here for your consideration.

Contributor Author

While this might be nice to do it is a bit problematic. The goal was to do all validation of the experimental-feature in the Webhook in order to avoid requiring downstream implementations from having to load/watch the config-features ConfigMap into their context and then plumb that context through. I suppose we could build a separate ConfigMap lookup inline here, but that feels heavyweight and non-standard with other ConfigMap tracking logic. I'm inclined to not recheck the feature flags here but we can let others weigh in?

Contributor

I think you are right, and I hadn't thought of other implementations.
On my side, this can be resolved.

Contributor

@devguyio devguyio Jan 25, 2022

While this might be nice to do it is a bit problematic. The goal was to do all validation of the experimental-feature in the Webhook in order to avoid requiring downstream implementations from having to load/watch the config-features ConfigMap into their context and then plumb that context through. I suppose we could build a separate ConfigMap lookup inline here, but that feels heavyweight and non-standard with other ConfigMap tracking logic. I'm inclined to not recheck the feature flags here but we can let others weigh in?

I think we actually must check if the feature is enabled. New experimental API fields will be stored as unknown fields if the feature is disabled, and the webhook will ignore them. This means that in an "enable-disable" scenario, this code will find a valid value for RetryAfterMax and will honor it, even though the feature is disabled.

I believe experimental features are designed to be part of eventing core as:

  1. API changes that are optionally implemented in alternative implementations.
  2. A change in the reference implementation. This is to validate the feature, as well as provide a reference implementation for other alternative implementations.

That means that, by design, if an alternative implementation wants to adopt an experimental feature, it needs to watch the features ConfigMap and provide its own implementation. Otherwise, it can wait until the feature graduates and becomes "on by default".

pkg/kncloudevents/message_sender.go Outdated Show resolved Hide resolved
pkg/kncloudevents/message_sender.go Outdated Show resolved Hide resolved
@knative-prow-robot knative-prow-robot added lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 10, 2021
@pierDipi
Member

pierDipi commented Nov 18, 2021 via email

@travis-minke-sap
Contributor Author

@lionelvillard - are you still planning on reviewing? I realize it's a holiday week so it can wait till next week if you're out - just checking ; )

docs/eventing-api.md Outdated Show resolved Hide resolved
@lionelvillard
Member

Can you replace "fixes #5811" with "parts of #5811", to show this PR is one step towards fully implementing #5811?

/unhold

@knative-prow-robot knative-prow-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 23, 2021
@lionelvillard
Member

/lgtm
/approve

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 23, 2021
@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lionelvillard, odacremolbap, travis-minke-sap

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 23, 2021
@knative-prow-robot knative-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label Nov 23, 2021
@travis-minke-sap
Contributor Author

Can you replace "fixes #5811" with "parts of #5811", to show this PR is one step towards fully implementing #5811?

yep - good call - thanks!

@knative-metrics-robot

The following is the coverage report on the affected files.
Say /test pull-knative-eventing-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/duck/v1/delivery_types.go 95.2% 96.3% 1.1
pkg/kncloudevents/message_sender.go 88.2% 92.5% 4.3
pkg/kncloudevents/retries.go 94.7% 100.0% 5.3
pkg/reconciler/subscription/subscription.go 86.1% 87.2% 1.0

@travis-minke-sap
Contributor Author

/retest

@lionelvillard
Member

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 23, 2021
@knative-prow-robot knative-prow-robot merged commit 820db20 into knative:main Nov 23, 2021
Member

@pierDipi pierDipi left a comment


This doesn't work for a Channel-based Broker since the filter handler doesn't propagate response headers back to the channel.

func (h *Handler) send(ctx context.Context, writer http.ResponseWriter, headers http.Header, target string, reportArgs *ReportArgs, event *cloudevents.Event, ttl int32) {
	// send the event to trigger's subscriber
	response, err := h.sendEvent(ctx, headers, target, event, reportArgs)
	if err != nil {
		h.logger.Error("failed to send event", zap.Error(err))
		writer.WriteHeader(http.StatusInternalServerError)
		_ = h.reporter.ReportEventCount(reportArgs, http.StatusInternalServerError)
		return
	}
	h.logger.Debug("Successfully dispatched message", zap.Any("target", target))

	// If there is an event in the response write it to the response
	statusCode, err := h.writeResponse(ctx, writer, response, ttl, target)
	if err != nil {
		h.logger.Error("failed to write response", zap.Error(err))
	}
	_ = h.reporter.ReportEventCount(reportArgs, statusCode)
}

func (h *Handler) sendEvent(ctx context.Context, headers http.Header, target string, event *cloudevents.Event, reporterArgs *ReportArgs) (*http.Response, error) {
	// Send the event to the subscriber
	req, err := h.sender.NewCloudEventRequestWithTarget(ctx, target)
	if err != nil {
		return nil, fmt.Errorf("failed to create the request: %w", err)
	}

	message := binding.ToMessage(event)
	defer message.Finish(nil)

	additionalHeaders := utils.PassThroughHeaders(headers)
	// Following the spec https://github.com/knative/specs/blob/main/specs/eventing/data-plane.md#derived-reply-events
	additionalHeaders.Set("prefer", "reply")

	err = kncloudevents.WriteHTTPRequestWithAdditionalHeaders(ctx, message, req, additionalHeaders)
	if err != nil {
		return nil, fmt.Errorf("failed to write request: %w", err)
	}

	start := time.Now()
	resp, err := h.sender.Send(req)
	dispatchTime := time.Since(start)
	if err != nil {
		err = fmt.Errorf("failed to dispatch message: %w", err)
	}

	sc := 0
	if resp != nil {
		sc = resp.StatusCode
	}

	_ = h.reporter.ReportEventDispatchTime(reporterArgs, sc, dispatchTime)
	return resp, err
}

// The return values are the status
func (h *Handler) writeResponse(ctx context.Context, writer http.ResponseWriter, resp *http.Response, ttl int32, target string) (int, error) {
	response := cehttp.NewMessageFromHttpResponse(resp)
	defer response.Finish(nil)

	if response.ReadEncoding() == binding.EncodingUnknown {
		// Response doesn't have a ce-specversion header nor a content-type matching a cloudevent event format
		// Just read a byte out of the reader to see if it's non-empty, we don't care what it is,
		// just that it is not empty. This means there was a response and it's not valid, so treat
		// as delivery failure.
		body := make([]byte, 1)
		n, _ := response.BodyReader.Read(body)
		response.BodyReader.Close()
		if n != 0 {
			// Note that we could just use StatusInternalServerError, but to distinguish
			// between the failure cases, we use a different code here.
			writer.WriteHeader(http.StatusBadGateway)
			return http.StatusBadGateway, errors.New("received a non-empty response not recognized as CloudEvent. The response MUST be either empty or a valid CloudEvent")
		}
		h.logger.Debug("Response doesn't contain a CloudEvent, replying with an empty response", zap.Any("target", target))
		writer.WriteHeader(resp.StatusCode)
		return resp.StatusCode, nil
	}

	event, err := binding.ToEvent(ctx, response)
	if err != nil {
		// Like in the above case, we could just use StatusInternalServerError, but to distinguish
		// between the failure cases, we use a different code here.
		writer.WriteHeader(http.StatusBadGateway)
		// Malformed event, reply with err
		return http.StatusBadGateway, err
	}

	// Reattach the TTL (with the same value) to the response event before sending it to the Broker.
	if err := broker.SetTTL(event.Context, ttl); err != nil {
		writer.WriteHeader(http.StatusInternalServerError)
		return http.StatusInternalServerError, fmt.Errorf("failed to reset TTL: %w", err)
	}

	eventResponse := binding.ToMessage(event)
	defer eventResponse.Finish(nil)

	if err := cehttp.WriteResponseWriter(ctx, eventResponse, resp.StatusCode, writer); err != nil {
		return http.StatusInternalServerError, fmt.Errorf("failed to write response event: %w", err)
	}

	h.logger.Debug("Replied with a CloudEvent response", zap.Any("target", target))
	return resp.StatusCode, nil
}

@lionelvillard
Member

@pierDipi can you open an issue?

@pierDipi
Member

Done

@travis-minke-sap travis-minke-sap deleted the retry-after-experimental-feature branch November 24, 2021 15:21
Contributor

@devguyio devguyio left a comment


I believe we need a follow-up to check if the feature is enabled. The current behavior will be problematic in the case of enabling and then disabling the feature.


Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test-and-release Test infrastructure, tests or release cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.