
Adding KEP 2433: Topology Aware Hints #2434

Merged - 1 commit merged into kubernetes:master on Feb 9, 2021

Conversation

@robscott (Member) commented Feb 5, 2021:

This is a follow-up to #2355 and #2354, and represents what I hope will be a simpler and better approach for topology aware routing.

Enhancement issue: #2433

/cc @andrewsykim @bowei @dcbw
/assign @thockin @wojtek-t
/sig network

@k8s-ci-robot added the sig/network label (Categorizes an issue or PR as relevant to SIG Network) on Feb 5, 2021
@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA), kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory), and size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files) labels on Feb 5, 2021
@wojtek-t (Member) left a comment:

A couple of small comments for the PRR - please address.
And I will wait with approval for SIG approval.

keps/prod-readiness/sig-network/2433.yaml (review thread outdated; resolved)
keps/sig-network/2433-topology-aware-hints/README.md (2 review threads outdated; resolved)
- The `topology.kubernetes.io/topology-aware-routing` annotation is set to
`Auto` for the Service.
- At least one endpoint for the Service has a hint pointing to the zone
Kube-Proxy is running within.
Member:

At least one with a hint to our zone, or without a hint?

@robscott (Member Author):

I think the safest approach here might actually be to ignore all hints if some are missing. I think the most likely cause for hints missing would be a transition between states, and that transition exposes us to risk if we don't fall back to all endpoints immediately. For example, if we have a 500-endpoint Service, updates are going to come to kube-proxy for each EndpointSlice individually.

It's possible that at one point only one EndpointSlice will have hints for each endpoint. Without an immediate fallback, we'd risk significantly overloading these endpoints.

The other alternative is also not great. Let's say we always send traffic to endpoints with matching hints + endpoints without any hints. In the above scenario, we'd overload endpoints without hints if we were weighting all endpoints equally.

If kube-proxy could reliably determine how many zones were in a cluster, we could do a variation of that second approach that gave endpoints without a zone hint a 1/n weight, where n == the number of zones in the cluster. I think that could still result in imbalanced traffic during a transition depending on the composition of slices with/without hints.

Member:

Let's be safe.

Easy: if all endpoints have the hint, then respect the hint

More nuanced: note how many unique values we see and only use the hint if ~ at least 1/N of all endpoints are for this zone (with some fudge factor).

I'm not sure the more nuanced answer is useful.
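
A minimal Go sketch of the "easy" rule described above, using hypothetical types; this is illustrative only, not the actual kube-proxy implementation:

```go
// Endpoint is a hypothetical, simplified view of an EndpointSlice endpoint:
// a set of addresses plus an optional zone hint ("" means no hint was set).
type Endpoint struct {
	Addresses []string
	ZoneHint  string
}

// filterByZoneHints keeps only the endpoints hinted for localZone, but falls
// back to the full list if any endpoint is missing a hint (the "ignore all
// hints if some are missing" behavior discussed above) or if no endpoint is
// hinted for this zone.
func filterByZoneHints(endpoints []Endpoint, localZone string) []Endpoint {
	for _, ep := range endpoints {
		if ep.ZoneHint == "" {
			return endpoints // some hints missing: use all endpoints
		}
	}
	var filtered []Endpoint
	for _, ep := range endpoints {
		if ep.ZoneHint == localZone {
			filtered = append(filtered, ep)
		}
	}
	if len(filtered) == 0 {
		return endpoints // no endpoint hinted for this zone: fall back
	}
	return filtered
}
```

With this shape, a partially updated Service (only some slices carrying hints) immediately degrades to routing across all endpoints rather than overloading the hinted subset.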

@robscott (Member Author):

Yeah, every nuanced approach here feels like it could lead to more trouble than it's worth. I'd lean towards the safest approach, at least for alpha.

Member:

I think we should explore the more nuanced approach, but I agree not for Alpha. Maybe we can explicitly add having an answer for it as a beta graduation criterion.

keps/sig-network/2433-topology-aware-hints/README.md (4 review threads outdated; resolved)
@robscott force-pushed the topology-aware-hints branch 2 times, most recently from f169818 to 6d777cc, on February 5, 2021 at 07:51

* **Can the feature be disabled once it has been enabled (i.e. can we roll back
the enablement)?**
Yes. It can easily be disabled universally by turning off the feature gate or
Member:

When it goes GA, will it always be enabled?

@robscott (Member Author):

Good question. Although the feature gate would be enabled by default, it would still require per-Service opt-in. I think we could potentially change the default value of this annotation with a follow-up KEP. This is hidden a bit further up in the KEP:

> A future KEP will explore changing the default behavior from Disabled to Auto with a new feature gate.

Member:

👍


### Kube-Proxy

When the `TopologyAwareHints` feature gate is enabled, Kube-Proxy will be
Member:

I don't remember why, but for slices I think you ended up having two independent feature gates for kube-proxy and the slice controller. Can this happen again?

@robscott (Member Author):

Yeah I almost did that here. In that case I actually split the features up into different KEPs entirely because they felt sufficiently different. This feels more like a single feature, but I'm open to a separate feature gate here. Any naming recommendations?

Member:

I was just asking because I don't remember the cause. If we don't know, I don't think it's a big deal to amend this later if we find out it's really needed.

keps/sig-network/2433-topology-aware-hints/README.md (4 review threads outdated; resolved)

#### Future API Expansion
Member:

So your future safety here is "add a new field and treat them as mutually exclusive"? It seems to me that we'll need multiple zones (and probably weight) pretty quickly in order to handle small services.

e.g. 2 replicas (in zones A and B) and 3 node zones (A, B, C). You'll need to set up zone A to send to replica A 100%, Zone B to replica B 100% and zone C to replicas A and B at 50%. That's a simple case where weight doesn't matter but you simply can't represent an endpoint in 2 zones with the above proposal.

e.g. 4 replicas (2 in zone A, 1 each in B and C) and 3 node-zones. You'll want something like:

zone A: 50% to replica a1; 50% to replica a2
zone B: 75% to replica b1; 25% to replica a1
zone C: 75% to replica c1; 25% to replica a2

I think it will be important quickly - am I wrong? To that end, I'd suggest we start with zones []struct now and as the impl builds up consider weight.

Note this is different than general per-endpoint weight (which we should still consider on its own), because this is a hint to consumers how to split for capacity (based on the set of endpoints), not an indicator of capacity (which is more intrinsic to the endpoint). I'm not sure that holds water yet, ergo let's not add weight yet :)

@robscott (Member Author):

The answer for the current proposal is that we simply would fall back to slices without hints if we could not reach a reasonable balance otherwise. When we tested weighting with the previous proposals, it looked like it would significantly increase the number of EndpointSlice updates.

Weight is most useful when there is a small number of endpoints. That is also when topology aware routing carries the most risk, since it would be very easy for an individual endpoint to be overloaded. So even if we had weight and could accomplish topology aware routing with only 2 endpoints, the result might not be great.

Agree that it's worth expanding the API to make weight easy to add in the future, and also agree that we shouldn't add weight yet.

Member:

My argument is to start now with zones []struct with no weight, and when (inevitably, IMO) we get complaints that small-N services don't work, we can revisit weight. It might be, as you say, that weight only matters at the very low end (N < 12), but that's a lot of Services. All your harm-reduction heuristics will have to consider it, so it's definitely a different discussion.
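
For illustration only, a hypothetical Go sketch of the `zones []struct` shape being argued for here; the field names (and the commented-out weight) are assumptions for discussion, not the API proposed in this KEP:

```go
// ForZone is a hypothetical hint entry associating an endpoint with a zone.
type ForZone struct {
	Name string `json:"name"`
	// Weight uint32 `json:"weight,omitempty"` // possible future addition, deliberately omitted for now
}

// EndpointHints is a hypothetical hints block for an EndpointSlice endpoint.
// Using a slice rather than a single zone string lets one endpoint be
// "donated" to multiple zones, e.g. zone C consuming replicas from both A
// and B in the 2-replica / 3-zone example above.
type EndpointHints struct {
	ForZones []ForZone `json:"forZones,omitempty"`
}
```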


@thockin (Member) commented Feb 5, 2021:

Let's pin down the basics and finish the API argument in the code PR(s).


#### Overload

Overload is a key concept for this proposal. This occurs when there are less
Member:

This seems to be a property of a zone only, or will it be extended to the other things you mention, like regions?

@robscott (Member Author):

I think the same concepts could apply to region; it's just less common in my experience to see individual clusters spanning regions. This may be something that becomes more relevant with multicluster Services.

Member:

👍

Kube-Proxy is running within.
- All endpoints for the Service have zone hints.

When the above conditions are true, kube-proxy will only route traffic to
Member:

Based on the example you've added

- addresses: ["10.1.2.1"]
  zone: "zone-a"
  hints:
    zone: "zone-a"
- addresses: ["10.1.2.2"]
  zone: "zone-b"
  hints:
    zone: "zone-b"
- addresses: ["10.1.2.3"]
  zone: "zone-a"
  hints:
    zone: "zone-c"

kube-proxies from zone A will never use

- addresses: ["10.1.2.3"]
  zone: "zone-a"
  hints:
    zone: "zone-c"

despite it being in the same zone. Is that correct?

Member:

yeah, in this case C didn't have any so it was "donated" from A, and therefore all zones are ~equally provisioned.

@aojea (Member) commented Feb 6, 2021:

Then it is possible that pods on the same node as the endpoint will not use the local endpoint, because it was donated to another zone. I don't know how much you are into soccer/football, but there is a legendary sentence from a Brazilian coach nicknamed Tim: "Playing soccer is like covering yourself with a short blanket." That means either you cover your head, or you wrap your feet 😄

@robscott (Member Author):

Haha yep, that is definitely an unfortunate downside of this approach. The previous subsetting approach was incompatible with ExternalTrafficPolicy=Local for the same reason.

@aojea (Member) commented Feb 6, 2021:

/lgtm
Looking forward to seeing your numbers with this implementation 🥇

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Feb 6, 2021
@wojtek-t (Member) left a comment:

Some super minor comments.

const SubSystem = "endpoint_slice_controller"

// This metric observes overload per zone for endpoints in EndpointSlices
EPSOverloadByZone = metrics.NewHistogramVec(
Member:

How are we going to report this metric? Whenever you process a particular service, do you report it for each of the zones?

I'm definitely not against this metric, but I'm having some trouble intuitively explaining to myself what it means.

@robscott (Member Author):

Yeah, I'm really not sure. I was hoping that reporting every value when a Service syncs would give us somewhat helpful data on the distribution. I think some visibility into overload would be really helpful, but I still haven't found a great way to communicate that with a metric.

Member:

I see two possible use cases:

  • help users understand how well (or badly) distributed their deployments are, i.e. they are overloading zone A, ...
  • understand the distribution algorithm's convergence; I think that in a dynamic environment this is going to fluctuate, and if you see it changing constantly, that means you are constantly redistributing endpoints, which will impact workloads

Member:

Sure - I fully buy the use cases (those are the same ones I had in mind).

My concern was more: "would this metric allow us to answer those questions?"
And I'm not convinced (at least not intuitively...)

@robscott (Member Author):

Yeah, I'm not sure it would either. I've been trying to think of alternatives but haven't found any so far. To avoid overcommitting to anything, I've removed this from the KEP for now and will keep thinking about potential metrics we could add here. Any ideas would be very appreciated.

This feature is not yet implemented but per-Service enablement/disablement is
covered in depth as part of the test plan.

### Rollout, Upgrade and Rollback Planning
Member:

Technically neither of the below is needed for Alpha, but I'm really happy to see that already filled in :)

keps/sig-network/2433-topology-aware-hints/README.md (review thread outdated; resolved)

### Configuration

A new `service.kubernetes.io/topology-aware-routing` annotation will be
Member:

Instead of an annotation, I wonder if TopologyAware (or just Topology) can be one of the possible values for internalTrafficPolicy. That way we can validate and reject it when externalTrafficPolicy: Local or when internalTrafficPolicy: Local|PreferLocal.

Member:

I guess we can also validate against the annotation, but I'm not sure what the precedent for that is.

@robscott (Member Author):

Good call. Based on the suggestion here and in that KEP (#2441 (comment)), plus some follow-up discussion in Slack, I think it makes sense to integrate this with the TrafficPolicy field instead of adding a new annotation. I've updated this KEP to reflect that. I think you provided a good summary in Slack of what we're thinking here:

  • if externalTrafficPolicy=Cluster, fall back to trafficPolicy for external sources
  • if externalTrafficPolicy=Local -- this rule only takes precedence for external sources, with internal traffic following trafficPolicy
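
A minimal sketch of that precedence in Go, with hypothetical field names and string values (not the actual kube-proxy code):

```go
// Service is a hypothetical, simplified view of the fields discussed above.
type Service struct {
	ExternalTrafficPolicy string // "Cluster" or "Local"
	TrafficPolicy         string // e.g. "Default", "Local", "PreferLocal", or a topology-aware value
}

// effectivePolicy returns which policy governs routing for a given source:
// externalTrafficPolicy=Local only takes precedence for external sources,
// and everything else falls through to trafficPolicy.
func effectivePolicy(svc Service, externalSource bool) string {
	if externalSource && svc.ExternalTrafficPolicy == "Local" {
		return "Local"
	}
	return svc.TrafficPolicy
}
```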

@k8s-ci-robot removed the lgtm label on Feb 8, 2021
@robscott (Member Author) commented Feb 8, 2021:

@thockin @andrewsykim I've updated this KEP to use the TrafficPolicy field, PTAL.

@thockin added the tide/merge-method-squash label (Denotes a PR that should be squashed by tide when it merges) on Feb 9, 2021
@thockin (Member) left a comment:

/lgtm
/approve

This functionality will be guarded by the `TopologyAwareHints` feature gate.
This gate will be dependent on the `ServiceInternalTrafficPolicy` feature gate
since it uses the `TrafficPolicy` guarded by that gate. This gate should not be
enabled if the deprecated `ServiceTopology` gate is enabled.
Member:

LOL - can we get a table or something?

Member:

:)

@robscott (Member Author):

Haha, cleaned this up so it's hopefully a bit easier to understand.

for a Service, the EndpointSlice controller will add hints to EndpointSlices.
These hints will indicate where an endpoint should be consumed by proxy
implementations to enable topology aware routing.
When the `TopologyAwareHints` feature gate is enabled and the `trafficPolicy`
Member:

Intersectional issue - for all of the trafficPolicy values that involve a fallback - what do they fall back to? I'm a bit anxious about this (commented on both KEPs).

@robscott (Member Author):

Good question - wouldn't they fall back to Cluster to match the current default behavior?

Member:

Yeah, but is that what we really want?

@k8s-ci-robot added the lgtm label on Feb 9, 2021

keps/sig-network/2433-topology-aware-hints/kep.yaml (review thread outdated; resolved)
keps/sig-network/2433-topology-aware-hints/README.md (2 review threads outdated; resolved)
@wojtek-t (Member) commented Feb 9, 2021:

/approve

/hold
[for addressing those small nits]

@k8s-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Feb 9, 2021
@k8s-ci-robot (Contributor):
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: robscott, thockin, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Feb 9, 2021
@k8s-ci-robot removed the lgtm label on Feb 9, 2021
@wojtek-t (Member) commented Feb 9, 2021:

/lgtm
/hold cancel

@k8s-ci-robot added the lgtm label and removed the do-not-merge/hold label on Feb 9, 2021
@robscott (Member Author) commented Feb 9, 2021:

/hold

@k8s-ci-robot added the do-not-merge/hold label on Feb 9, 2021
@k8s-ci-robot merged commit 43cc9a7 into kubernetes:master on Feb 9, 2021
@k8s-ci-robot added this to the v1.21 milestone on Feb 9, 2021