recommend 503 status code for a service with no healthy endpoints #3121

dprotaso · 2024-05-30T17:02:01Z

What type of PR is this?

/kind cleanup
/kind documentation

What this PR does / why we need it:

A specific error case for the 503 status code is a Kubernetes Service with no healthy endpoints.

It was recommended that we keep this case here - #1210 (comment)

It was subsequently removed in this PR - #1243 (comment)

I think having this differentiator is important because it allows consumers (e.g. Knative) to know whether the 5xx is being returned by the user's pod or by the gateway.

This behaviour is already present in numerous Gateway implementations (Istio, Contour, Linkerd).

Which issue(s) this PR fixes:

Fixes #

Does this PR introduce a user-facing change?:

HTTPRoute - 503 Status Code MAY be returned for Kubernetes Services that don't have any healthy endpoints

robscott

Thanks @dprotaso!

robscott · 2024-06-03T18:05:34Z

apis/v1/httproute_types.go

@@ -263,6 +263,9 @@ type HTTPRouteRule struct {
 	// invalid, 50 percent of traffic must receive a 500. Implementations may


@youngnick do you remember if the intent here was "exactly 500" or "5xx"?

It was definitely "exactly 500", we logged #1200 to do that.

I've been trying to remember why we moved this to "exactly 500", and I think it was to do with partial validity rules.
There's a bunch of discussion in #1112, and even more in #1211 about it. There's also some discussion on #1511, with @mikemorris' comment #1151 (comment) being a good summary.

I seem to recall not being confident at the time that we didn't want to overcomplicate the spec. It's already pretty complicated, because we were discussing if "zero endpoints" means "not valid" or not.

Looking back, I think the answer we've landed on is that we treat the references between objects differently to possibly-transient conditions on the proxy anyway. ResolvedRefs is for references.

I don't think we should do this until we've gone back through those discussions and checked that we're not breaking any of the assumptions that we made then - or if we are, then we update other documentation as well to make it clearer.

However, if we can all agree that "zero endpoints" should be considered a transient state that does not impact the validity of the HTTPRoute, then returning a 503 in that case is okay.
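To make the "config-level validity" half of this concrete, here is a minimal sketch, not code from any implementation. The `backendState` type and `resolvedRefs` function are hypothetical names invented for illustration, and the status and reason strings are simplified stand-ins for a ResolvedRefs-style condition. The assumption it encodes is the one discussed above: only the reference itself decides validity, and the endpoint count is deliberately ignored.

```go
package main

import "fmt"

// backendState is a hypothetical summary of what a controller knows about
// one backendRef. It is not a Gateway API type.
type backendState struct {
	serviceExists    bool // does the referenced Service resolve?
	healthyEndpoints int  // transient state; intentionally unused below
}

// resolvedRefs returns a simplified ResolvedRefs-style status and reason.
// Assumption: a missing reference makes the route's refs unresolved, while
// a Service with zero healthy endpoints does not.
func resolvedRefs(backends []backendState) (status, reason string) {
	for _, b := range backends {
		if !b.serviceExists {
			return "False", "BackendNotFound"
		}
	}
	return "True", "ResolvedRefs"
}

func main() {
	// The Service exists but currently has no healthy endpoints.
	status, reason := resolvedRefs([]backendState{
		{serviceExists: true, healthyEndpoints: 0},
	})
	fmt.Println(status, reason) // prints: True ResolvedRefs
}
```

Under this assumption the HTTPRoute status stays unchanged while endpoints come and go, and only the data path reacts to the empty backend.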

Like I said, we need to clarify what happens here in the other listed cases for 500 errors.

What happens when there are multiple BackendRefs and one has no endpoints? As it stands, this update leaves that unclear.

What happens when all the BackendRefs have no endpoints? (Note that this covers the case where there's only one backend that has no endpoints).

I think the answer should be something like:

Having no endpoints does not make an HTTPBackendRef invalid in configuration terms.

However, a backend with no endpoints MAY (tbh this might need to be SHOULD or even MUST) be treated as invalid for traffic management purposes and return a 503 error code. This means that, if there are multiple backendRefs:

each backendRef must get the correct proportion of traffic, even if that means the proportion of traffic bound for that backendRef all gets a 503. This is to ensure that weighted load balancing failures don't happen silently. (There's a case where you're doing a gradual failover, one of the services gets 503, and you don't notice until you flip the weight to 100 percent on the faulty one that we have to avoid.)

if all backendRefs have no endpoints, then all traffic that matches that rule will get a 503.

These are basically the same rules as above for 500s, we're basically making a class of traffic that's "invalid at a traffic level, but not at a config level" by doing this.
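The proportional rule above can be sketched as follows. This is an illustration under stated assumptions, not the behaviour of any particular implementation: the `backend` type, `statusFor`, and `trafficShares` are hypothetical names, and the sketch computes expected traffic shares rather than routing real requests. It assumes an invalid reference maps to 500 and a valid reference with zero healthy endpoints maps to 503.

```go
package main

import "fmt"

// backend is a hypothetical, simplified stand-in for one backendRef.
type backend struct {
	name      string
	weight    int  // relative weight within the rule
	validRef  bool // config-level validity of the reference
	endpoints int  // healthy endpoints currently known
}

// statusFor returns the status code the gateway itself would generate for
// requests sent to b, or 0 when the request is forwarded to the backend.
func statusFor(b backend) int {
	switch {
	case !b.validRef:
		return 500 // invalid at the config level
	case b.endpoints == 0:
		return 503 // valid config, but nothing to send traffic to
	default:
		return 0
	}
}

// trafficShares returns the fraction of matching traffic that ends up with
// each gateway-generated status (key 0 means "forwarded"). Each backend
// keeps its weighted share even when that whole share becomes an error.
func trafficShares(backends []backend) map[int]float64 {
	total := 0
	for _, b := range backends {
		total += b.weight
	}
	shares := make(map[int]float64)
	if total == 0 {
		return shares
	}
	for _, b := range backends {
		shares[statusFor(b)] += float64(b.weight) / float64(total)
	}
	return shares
}

func main() {
	shares := trafficShares([]backend{
		{name: "stable", weight: 90, validRef: true, endpoints: 3},
		{name: "canary", weight: 10, validRef: true, endpoints: 0},
	})
	// prints: forwarded=0.90 503=0.10 500=0.00
	fmt.Printf("forwarded=%.2f 503=%.2f 500=%.2f\n", shares[0], shares[503], shares[500])
}
```

In this sketch the 10 percent bound for the empty backend shows up as 503s right away instead of being silently redistributed, which is what makes a faulty failover target visible before its weight is raised to 100.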

These are basically the same rules as above for 500s, we're basically making a class of traffic that's "invalid at a traffic level, but not at a config level" by doing this.

Yeah - that all sounds good - what further edits do you think this PR requires?

There's a case where you're doing a gradual failover, one of the services gets 503, and you don't notice until you flip the weight to 100 percent on the faulty one that we have to avoid.

Can you elaborate on this a bit more?

Agreed with @youngnick's summary - went back to read some of my old comments and this seems to align with my thinking at that time.

Do you want me to codify parts of your comment into the godoc @youngnick ?

Yes, I added a suggestion to that effect. Once that's done, this LGTM.

apis/v1/httproute_types.go

robscott · 2024-06-28T00:17:23Z

I think this makes sense, thanks @dprotaso! Would like a LGTM from @mikemorris or @youngnick though.

/approve

k8s-ci-robot · 2024-06-28T00:17:40Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, robscott

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [robscott]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

apis/v1/httproute_types.go

Co-authored-by: Nick Young <inocuo@gmail.com>

apis/v1/httproute_types.go

Co-authored-by: Nick Young <inocuo@gmail.com>

mikemorris · 2024-07-02T16:03:10Z

Stand by my position a while back that 503 is appropriate for this case.

/lgtm

…bernetes-sigs#3121)

* recommend 503 status code for a service with no healthy endpoints
* MAY=>SHOULD
* Update apis/v1/httproute_types.go

Co-authored-by: Nick Young <inocuo@gmail.com>

* run codegen
* Update apis/v1/httproute_types.go

Co-authored-by: Nick Young <inocuo@gmail.com>

* run codegen

---------

Co-authored-by: Nick Young <inocuo@gmail.com>

recommend 503 status code for a service with no healthy endpoints

33ebeed

k8s-ci-robot requested review from robscott and youngnick May 30, 2024 17:02

k8s-ci-robot added the cncf-cla: yes and size/XS labels May 30, 2024

robscott reviewed Jun 3, 2024


MAY=>SHOULD

5680588

k8s-ci-robot added the approved label Jun 28, 2024

youngnick reviewed Jun 28, 2024

apis/v1/httproute_types.go

dprotaso and others added 2 commits June 28, 2024 11:10

Update apis/v1/httproute_types.go

4d34a66

Co-authored-by: Nick Young <inocuo@gmail.com>

run codegen

d181c11

robscott reviewed Jun 28, 2024

apis/v1/httproute_types.go (outdated)

dprotaso and others added 2 commits July 2, 2024 10:07

Update apis/v1/httproute_types.go

7ca8889

Co-authored-by: Nick Young <inocuo@gmail.com>

run codegen

839c3af

k8s-ci-robot assigned mikemorris Jul 2, 2024

k8s-ci-robot added the lgtm label Jul 2, 2024

k8s-ci-robot merged commit e971a8d into kubernetes-sigs:main Jul 2, 2024
8 checks passed

dprotaso deleted the 503-no-healthy-upstream branch July 3, 2024 15:47
