Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial attempt at 429 Retry-After header support #5285

Conversation

travis-minke-sap
Copy link
Contributor

@travis-minke-sap travis-minke-sap commented Apr 20, 2021

The common logic for sending CloudEvents (see kncloudevents/message_sender.go) does not support the Retry-After header for 429 StatusCodes as specified in the CloudEvent Spec.

See Discussion #5011 for full details.

🎁 Creating RetryConfig from DeliverySpec now supports ability to indicate preference for respecting the 429 Retry-After header in message sending responses as specified in the CloudEvents specification.

Proposed Changes

  • Enhance the message_sender implementation to allow the ability to respect the Retry-After header when retrying 429 responses.
  • Avoid changing the API shape (V1 awareness Scott!) while still providing a path towards potential exposure in DeliverySpec.
  • Wrap and extend the current implementation to allow opt-in usage without impacting implementations who might not want this capability.

Release Notes

Optional extension function RetryConfigFromDeliverySpecWith429() allows users to indicate their preference for respecting Retry-After headers in 429 responses.

@google-cla google-cla bot added the cla: yes Indicates the PR's author has signed the CLA. label Apr 20, 2021
@knative-prow-robot knative-prow-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 20, 2021
@knative-prow-robot
Copy link
Contributor

Hi @travis-minke-sap. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot knative-prow-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 20, 2021
@codecov
Copy link

codecov bot commented Apr 20, 2021

Codecov Report

Merging #5285 (81399c7) into main (ab3978c) will increase coverage by 1.05%.
The diff coverage is 100.00%.

❗ Current head 81399c7 differs from pull request most recent head 258fba3. Consider uploading reports for the commit 258fba3 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##             main    #5285      +/-   ##
==========================================
+ Coverage   82.78%   83.83%   +1.05%     
==========================================
  Files         197      243      +46     
  Lines        6093     6955     +862     
==========================================
+ Hits         5044     5831     +787     
- Misses        724      782      +58     
- Partials      325      342      +17     
Impacted Files Coverage Δ
pkg/kncloudevents/message_sender.go 88.04% <100.00%> (+5.78%) ⬆️
pkg/duck/listable.go 63.38% <0.00%> (-19.72%) ⬇️
pkg/apis/sources/v1/apiserver_validation.go 85.71% <0.00%> (-14.29%) ⬇️
pkg/apis/messaging/config/channel_defaults.go 65.00% <0.00%> (-10.00%) ⬇️
pkg/duck/channel.go 54.54% <0.00%> (-9.10%) ⬇️
pkg/adapter/v2/cloudevents.go 75.00% <0.00%> (-9.05%) ⬇️
pkg/apis/config/defaults.go 85.71% <0.00%> (-7.15%) ⬇️
pkg/apis/messaging/v1/subscription_validation.go 80.00% <0.00%> (-5.00%) ⬇️
pkg/channel/message_dispatcher.go 75.92% <0.00%> (-2.59%) ⬇️
pkg/apis/eventing/v1/trigger_validation.go 90.41% <0.00%> (-1.37%) ⬇️
... and 76 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 92d85bf...258fba3. Read the comment docs.

@slinkydeveloper
Copy link
Contributor

I think producing an opt-in header is one of those features that I would categorize between Experimental features, because is still a behavioural change. WDYT @travis-minke-sap?

@n3wscott
Copy link
Contributor

+1 to @slinkydeveloper 's comment.

This seems like a reasonable feature but it also changes behaviour and we are trying to tread lightly on changes to core until we can get this 1.0 out the door.

@travis-minke-sap
Copy link
Contributor Author

Hey @slinkydeveloper and @newscott - thanks for the feedback!

I wouldn't necessarily characterize this change as a "new feature", as it is an incremental step towards addressing a deficiency in adhering to the CloudEvents Specification. The implementation attempts to add the ability to respect the 429 Retry-After header if a consumer of this functionality desires it, while otherwise providing the same functionality that currently exists. I was hoping that would allow for a slow-rollout and some soak-time before it was widely adopted.

I do acknowledge, though, that the underlying implementation is changing, and there is a possibility for unintended bugs / regressions to be introduced. If the risk of such a possibility is too great for the current timeframe (pre-V1) that is understandable. Would you be open to this approach (and possibly exposing a flag in the DeliverySpec) if this was postponed until after the V1 release?

If you feel strongly that this needs to go through the "experimental-features" track, regardless of timeframes, then I have the following questions / concerns...

  • What is the state of the Experimental Features document? Is it still being debated or has it been accepted?
  • If the process has been agreed upon, has work begun on all the supporting infrastructure required to make it a standard part of Eventing/Sandbox implementations? Items such as...
    • Duplicating or refactoring the Knative Serving config-features logic for loading/exposing the new ConfigMap
    • Plumbing the ConfigMap into the controllers for use at runtime (how would the RetryConfigFromDeliverySpec() or a Controller access it? what about WebHooks?)
    • What are the rules around dynamically "watching" the new ConfigMap?
    • For this particular Issue, where we would likely expose the opt-in flag in the DeliverySpec, I would also have to add logic to the WebHook of every CRD that includes the DeliverySpec (Subscription, Triggers, etc).

If the infrastructure mentioned above was already in-place and available, it might not be too large an effort to go through the experimental-feature process, but if I have to build all that out just to have the privilege of being the first use case... It's a hard sell ; ) I do feel bad not offering to take that on, as it would be a win for the project, but I just don't have the time available (internal management, deadlines, etc.) I'm also not 100% certain that we will need this capability internally yet, and was trying to provide a low-touch effort at improving things.

Maybe waiting till post-V1 is a possibility? I assume that post-V1 we're not going to require every new change to inherently be an "experimental feature" and that issues like this (opt-in adherence to the CloudEvent Spec) can just be added via the normal process?

/hold

@knative-prow-robot knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 21, 2021
@slinkydeveloper
Copy link
Contributor

I wouldn't necessarily characterize this change as a "new feature", as it is an incremental step towards addressing a deficiency in adhering to the CloudEvents Specification

So first of all, here we're talking about the Webhook spec, which is an "optional" spec. I feel this particular issue is part of the larger topic whether we want to support this spec or not. There is an open issue, opened a while ago, about it #3092.

What is the state of the Experimental Features document? Is it still being debated or has it been accepted?
If the process has been agreed upon, has work begun on all the supporting infrastructure required to make it a standard part of Eventing/Sandbox implementations?

The process is still being debated, feel free to chime in, but we already have part of the WG leads agreeing. The infra is being worked here: #5214, so you won't need to do anything more than opening an issue with the plan, send a mail, and do little changes to this PR.

Maybe waiting till post-V1 is a possibility?

I don't think we need to wait post-V1. As soon as we start with experimental features, I'll definitely be happy to endorse this effort through the experimental features process, and hopefully the larger one of supporting cloudevents webhook spec.

@travis-minke-sap
Copy link
Contributor Author

So first of all, here we're talking about the Webhook spec, which is an "optional" spec. I feel this particular issue is part of the larger topic whether we want to support this spec or not. There is an open issue, opened a while ago, about it #3092.

Ah, great point about the Webhook Spec being optional - I lost sight of that in the interim from initiating the Discussion and resurrecting this effort. Also, I was completely unaware of those prior efforts - no one ever mentioned them or added to the Discussion (I'll add a link). This makes me think I might have over-simplified the approach?

The process is still being debated, feel free to chime in, but we already have part of the WG leads agreeing. The infra is being worked here: #5214, so you won't need to do anything more than opening an issue with the plan, send a mail, and do little changes to this PR.

Great to hear that the experimental-features framework is in progress!

I don't think we need to wait post-V1. As soon as we start with experimental features, I'll definitely be happy to endorse this effort through the experimental features process, and hopefully the larger one of supporting cloudevents webhook spec.

OK, sounds good. Once the experimental-features stuff is ready I can re-evaluate our need for supporting the Retry-After header and start again down that path.

Thanks for your patience and for providing the background/status in order to level set me ; )
I'll close this PR for now and will update the Discussion with some of this detail.

@travis-minke-sap
Copy link
Contributor Author

Closing for now - will re-asses later ; )

@travis-minke-sap
Copy link
Contributor Author

Re-opening to keep it on the radar ; )

@slinkydeveloper
Copy link
Contributor

@travis-minke-sap FYI the experimental feature process is now merged https://github.com/knative/eventing/blob/main/docs/experimental-features.md

@lionelvillard
Copy link
Member

/ok-to-test

@knative-prow-robot knative-prow-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 8, 2021
@lionelvillard
Copy link
Member

We might need this for v1.0 depending on whether the spec precisely defines how sender must behave when receiving 429.

Adding this PR to the v1 project so we don't forget about it. We can always remove it if the spec stays vague.

@knative-prow-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: travis-minke-sap
To complete the pull request process, please assign akashrv after the PR has been reviewed.
You can assign the PR to them by writing /assign @akashrv in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@travis-minke-sap
Copy link
Contributor Author

We might need this for v1.0 depending on whether the spec precisely defines how sender must behave when receiving 429.

Adding this PR to the v1 project so we don't forget about it. We can always remove it if the spec stays vague.

Depending on if/how this is added to the V1 spec... we might want to consider removing the "configurability" or opt-in nature of the PR? Just something to consider as when it was originally written it was not part of the spec and I was trying to "tread lightly" ; )

@knative-metrics-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-knative-eventing-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/kncloudevents/message_sender.go 87.2% 92.0% 4.8


var retryAfterDuration time.Duration

if resp != nil && resp.StatusCode == nethttp.StatusTooManyRequests && resp.Header != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whatever we decide, make sure to add a comment to the spec as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also honor the 503? https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/503

Yeah I agree, good catch - thanks!

@matzew
Copy link
Member

matzew commented Jun 10, 2021

Depending on if/how this is added to the V1 spec... we might want to consider removing the "configurability" or opt-in nature of the PR? Just something to consider as when it was originally written it was not part of the spec and I was trying to "tread lightly" ; )

yeah, I agree making this the default!

Perhaps deprecating/retire the other Retry func?

@knative-prow-robot
Copy link
Contributor

@travis-minke-sap: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 15, 2021
@lionelvillard
Copy link
Member

@travis-minke-sap still working on this?

@travis-minke-sap
Copy link
Contributor Author

@travis-minke-sap still working on this?

I am planning on coming back to it in the near-ish future. Probably a rewrite without the "opt-in" plumbing. Feel free to take it if you're interested in running with it ; ). Or are you just checking to see whether it can be closed?

@lionelvillard
Copy link
Member

just checking :-)

@benmoss
Copy link
Member

benmoss commented Oct 6, 2021

cc @embano1

@travis-minke-sap
Copy link
Contributor Author

HEADS UP... This PR is old/state and I will be creating a new/separate PR in the near future with an similar approach based on enhancements to the DeliverySpec wrapped in experimental-feature protection - stay tuned ; )

@matzew
Copy link
Member

matzew commented Oct 6, 2021 via email

@lionelvillard
Copy link
Member

@travis-minke-sap can we also add support for 503 retry-after?

Knative Service replies with a 503 when the queue is overloaded. AFAIK, it currently does not set retry-after but it might (though I wonder if there is a way to know what the delay should be). @julz @markusthoemmes any insights?

@pierDipi
Copy link
Member

This is the plan we agreed in the last call

Only if DeliverySpec Retry is configured and only for 429 / 503

@travis-minke-sap
Copy link
Contributor Author

@travis-minke-sap can we also add support for 503 retry-after?

Knative Service replies with a 503 when the queue is overloaded. AFAIK, it currently does not set retry-after but it might (though I wonder if there is a way to know what the delay should be). @julz @markusthoemmes any insights?

Yes, Matthias had previously requested this and I have included 503 in the implementation I'll be creating a new Issue/PR for soon (waiting on the recociler-test changes so I can finish the e2e - it's almost ready ; )

Also... Yes - what Pier stated is how it is implemented. Here's a sneak preview of the new retryAfter DeliverySpec experimental-feature...

  delivery:
    backoffDelay: PT0.5S
    backoffPolicy: exponential
    retry: 3
    retryAfter:
      enabled: true
      maxDuration: "PT30S"

@matzew
Copy link
Member

matzew commented Nov 10, 2021

@travis-minke-sap Can we close this, since we have #5813 ?

@travis-minke-sap
Copy link
Contributor Author

@travis-minke-sap Can we close this, since we have #5813 ?

Yep - was on my list to do so - was just waiting out of paranoia to see how the reviews for #5813 might go ; ). I'll close it now though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes Indicates the PR's author has signed the CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants