recommend 503 status code for a service with no healthy endpoints #3121

Merged: 6 commits, merged Jul 2, 2024
5 changes: 5 additions & 0 deletions apis/v1/httproute_types.go
@@ -263,6 +263,11 @@ type HTTPRouteRule struct {
// invalid, 50 percent of traffic must receive a 500. Implementations may

Member

@youngnick do you remember if the intent here was "exactly 500" or "5xx"?

Contributor

It was definitely "exactly 500"; we logged #1200 to do that.

I've been trying to remember why we moved this to "exactly 500", and I think it was to do with partial validity rules.
There's a bunch of discussion in #1112, and even more in #1211 about it. There's also some discussion on #1511, with @mikemorris' comment #1151 (comment) being a good summary.

I seem to recall that we weren't confident at the time, and we didn't want to overcomplicate the spec. It was already pretty complicated, because we were discussing whether "zero endpoints" means "not valid" or not.

Looking back, I think the answer we've landed on is that we treat references between objects differently from possibly transient conditions on the proxy anyway. ResolvedRefs is for references.
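
To make the "references vs. transient conditions" distinction concrete, here is a minimal Go sketch. The `BackendRef` type, its fields, and the `resolvedRefs` helper are hypothetical simplifications, not the Gateway API Go types or any controller's actual code; the only point is that a ResolvedRefs-style check looks at whether references resolve and never at endpoint counts.

```go
package main

import "fmt"

// BackendRef is a hypothetical, simplified view of an HTTPBackendRef after
// a controller has looked it up. It is NOT the Gateway API Go type.
type BackendRef struct {
	Name           string
	Found          bool // the referenced Service exists
	Permitted      bool // the reference is allowed (e.g. by a ReferenceGrant)
	ReadyEndpoints int  // transient, proxy-side state
}

// resolvedRefs reports whether every reference resolves. It ignores
// ReadyEndpoints on purpose: zero endpoints is treated as a transient
// traffic-level condition, not as a broken reference.
func resolvedRefs(refs []BackendRef) bool {
	for _, r := range refs {
		if !r.Found || !r.Permitted {
			return false
		}
	}
	return true
}

func main() {
	refs := []BackendRef{
		{Name: "v1", Found: true, Permitted: true, ReadyEndpoints: 3},
		{Name: "v2", Found: true, Permitted: true, ReadyEndpoints: 0},
	}
	// v2 has no ready endpoints but still counts as resolved.
	fmt.Println(resolvedRefs(refs)) // prints "true"
}
```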

I don't think we should do this until we've gone back through those discussions and checked that we're not breaking any of the assumptions we made then, or, if we are, updated other documentation as well to make it clearer.

However, if we can all agree that "zero endpoints" should be considered a transient state that does not impact the validity of the HTTPRoute, then returning a 503 in that case is okay.

Like I said, we need to clarify what happens here in the other listed cases for 500 errors.

  • What happens when there are multiple BackendRefs and one has no endpoints? As it stands, this update leaves that unclear.
  • What happens when all the BackendRefs have no endpoints? (Note that this also covers the case where there's only one backend and it has no endpoints.)

I think the answer should be something like:

  • Having no endpoints does not make a HTTPBackendRef invalid in configuration terms
  • However, a backend with no endpoints MAY (tbh this might need to be SHOULD or even MUST) be treated as invalid for traffic management purposes and return a 503 error code. This means that, if there are multiple backendRefs:
    • each backendRef must get the correct proportion of traffic, even if that means the proportion of traffic bound for that backendRef all gets a 503. This is to ensure that weighted load balancing failures don't happen silently. (There's a case where you're doing a gradual failover, one of the services gets 503, and you don't notice until you flip the weight to 100 percent on the faulty one that we have to avoid.)
    • if all backendRefs have no endpoints, then all traffic that matches that rule will get a 503.

These are basically the same rules as above for 500s, we're basically making a class of traffic that's "invalid at a traffic level, but not at a config level" by doing this.

Contributor Author

@dprotaso, Jun 6, 2024

These are basically the same rules as above for 500s, we're basically making a class of traffic that's "invalid at a traffic level, but not at a config level" by doing this.

Yeah, that all sounds good. What further edits do you think this PR requires?

Contributor Author

There's a case where you're doing a gradual failover, one of the services gets 503, and you don't notice until you flip the weight to 100 percent on the faulty one that we have to avoid.

Can you elaborate on this a bit more?

Contributor

Agreed with @youngnick's summary. I went back and reread some of my old comments, and this seems to align with my thinking at that time.

Contributor Author

Do you want me to codify parts of your comment into the godoc, @youngnick?

Contributor

Yes, I added a suggestion to that effect. Once that's done, this LGTM.

// choose how that 50 percent is determined.
//
// When a HTTPBackendRef refers to a Service that has no ready endpoints,
// implementations SHOULD return a 503 for requests to that backend instead.
robscott marked this conversation as resolved.
// If an implementation chooses to do this, all of the above rules for 500 responses
// MUST also apply for responses that return a 503.
//
// Support: Core for Kubernetes Service
//
// Support: Extended for Kubernetes ServiceImport
12 changes: 12 additions & 0 deletions config/crd/experimental/gateway.networking.k8s.io_httproutes.yaml

Some generated files are not rendered by default.

12 changes: 12 additions & 0 deletions config/crd/standard/gateway.networking.k8s.io_httproutes.yaml

2 changes: 1 addition & 1 deletion pkg/generated/openapi/zz_generated.openapi.go
