recommend 503 status code for a service with no healthy endpoints #3121
Thanks @dprotaso!
@@ -263,6 +263,9 @@ type HTTPRouteRule struct {
// invalid, 50 percent of traffic must receive a 500. Implementations may
@youngnick do you remember if the intent here was "exactly 500" or "5xx"?
It was definitely "exactly 500", we logged #1200 to do that.
I've been trying to remember why we moved this to "exactly 500", and I think it was to do with partial validity rules.
There's a bunch of discussion in #1112, and even more in #1211 about it. There's also some discussion on #1511, with @mikemorris' comment #1151 (comment) being a good summary.
I seem to recall not being confident at the time that we didn't want to overcomplicate the spec. It's already pretty complicated, because we were discussing if "zero endpoints" means "not valid" or not.
Looking back, I think the answer we've landed on is that we treat the references between objects differently to possibly-transient conditions on the proxy anyway. ResolvedRefs is for references.
I don't think we should do this until we've gone back through those discussions and checked that we're not breaking any of the assumptions that we made then - or if we are, then we update other documentation as well to make it clearer.
However, if we can all agree that "zero endpoints" should be considered a transient state that does not impact the validity of the HTTPRoute, then returning a 503 in that case is okay.
Like I said, we need to clarify what happens here in the other listed cases for 500 errors.
- What happens when there are multiple BackendRefs and one has no endpoints? As it stands, this update leaves that unclear.
- What happens when all the BackendRefs have no endpoints? (Note that this covers the case where there's only one backend that has no endpoints).
I think the answer should be something like:
- Having no endpoints does not make an HTTPBackendRef invalid in configuration terms
- However, a backend with no endpoints MAY (tbh this might need to be SHOULD or even MUST) be treated as invalid for traffic management purposes and return a 503 error code. This means that, if there are multiple backendRefs:
  - each backendRef must get the correct proportion of traffic, even if that means the proportion of traffic bound for that backendRef all gets a 503. This is to ensure that weighted load balancing failures don't happen silently. (There's a case where you're doing a gradual failover, one of the services gets 503, and you don't notice until you flip the weight to 100 percent on the faulty one that we have to avoid.)
  - if all backendRefs have no endpoints, then all traffic that matches that rule will get a 503.
These are basically the same rules as above for 500s, we're basically making a class of traffic that's "invalid at a traffic level, but not at a config level" by doing this.
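The rules proposed above can be sketched in Go. This is only an illustration of the suggested behaviour, not code from the Gateway API or from any implementation: the `backend` type and the `pick`, `statusFor`, and `totalWeight` helpers are hypothetical names, and the sketch assumes an invalid reference yields a 500 while a valid backend with zero endpoints yields a 503.

```go
package main

import "fmt"

// backend is a hypothetical, simplified stand-in for an HTTPBackendRef.
type backend struct {
	name      string
	weight    int
	validRef  bool // false models an invalid reference, e.g. a nonexistent Service
	endpoints int  // number of ready endpoints behind the backend
}

// totalWeight sums the weights of all backends in a rule.
func totalWeight(backends []backend) int {
	total := 0
	for _, b := range backends {
		total += b.weight
	}
	return total
}

// pick selects a backend by weight. n must be in [0, totalWeight(backends)).
// Backends with no endpoints keep their share of traffic, so failures in a
// weighted split are not hidden by silently shifting traffic elsewhere.
func pick(backends []backend, n int) (backend, bool) {
	for _, b := range backends {
		if n < b.weight {
			return b, true
		}
		n -= b.weight
	}
	return backend{}, false
}

// statusFor returns the status the gateway itself would generate for the
// selected backend, or 0 when the request would be forwarded upstream.
func statusFor(b backend) int {
	switch {
	case !b.validRef:
		return 500 // invalid at the config level
	case b.endpoints == 0:
		return 503 // valid config, but invalid at the traffic level
	default:
		return 0
	}
}

func main() {
	backends := []backend{
		{name: "v1", weight: 90, validRef: true, endpoints: 3},
		{name: "v2", weight: 10, validRef: true, endpoints: 0},
	}
	counts := map[int]int{}
	total := totalWeight(backends)
	// Walk every weight slot once to show the exact proportions.
	for n := 0; n < total; n++ {
		b, _ := pick(backends, n)
		counts[statusFor(b)]++
	}
	fmt.Println(counts[0], counts[503])
}
```

With the weights above, 90 of the 100 slots are forwarded and 10 receive a 503, so the broken backend is visible while it still carries only 10 percent of traffic rather than after its weight is raised to 100.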
> These are basically the same rules as above for 500s, we're basically making a class of traffic that's "invalid at a traffic level, but not at a config level" by doing this.
Yeah - that all sounds good - what further edits do you think this PR requires?
> There's a case where you're doing a gradual failover, one of the services gets 503, and you don't notice until you flip the weight to 100 percent on the faulty one that we have to avoid.
Can you elaborate on this a bit more?
Agreed with @youngnick's summary - went back to read some of my old comments and this seems to align with my thinking at that time.
Do you want me to codify parts of your comment into the godoc, @youngnick?
Yes, I added a suggestion to that effect. Once that's done, this LGTM.
I think this makes sense, thanks @dprotaso! Would like a LGTM from @mikemorris or @youngnick though. /approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dprotaso, robscott The full list of commands accepted by this bot can be found here. The pull request process is described here
Co-authored-by: Nick Young <inocuo@gmail.com>
Stand by my position a while back that 503 is appropriate for this case. /lgtm
…bernetes-sigs#3121) * recommend 503 status code for a service with no healthy endpoints * MAY=>SHOULD * Update apis/v1/httproute_types.go Co-authored-by: Nick Young <inocuo@gmail.com> * run codegen * Update apis/v1/httproute_types.go Co-authored-by: Nick Young <inocuo@gmail.com> * run codegen --------- Co-authored-by: Nick Young <inocuo@gmail.com>
What type of PR is this?
/kind cleanup
/kind documentation
What this PR does / why we need it:
A specific error case for the 503 status code is a Kubernetes service with no healthy endpoints.
It was recommended that we keep this case here - #1210 (comment)
It was subsequently removed in this PR - #1243 (comment)
I think having this differentiator is important because it allows consumers (e.g. Knative) to know whether the 5xx is being returned by the user's pod or the gateway.
This behaviour is already present in numerous Gateway implementations (Istio, Contour, Linkerd).
Which issue(s) this PR fixes:
Fixes #
Does this PR introduce a user-facing change?: