
Adding KEP 2433: Topology Aware Hints #2434

Merged - 1 commit merged into kubernetes:master on Feb 9, 2021

Conversation

@robscott (Member) commented Feb 5, 2021:

This is a follow-up to #2355 and #2354, and represents what I hope will be a simpler and better approach for topology aware routing.

Enhancement issue: #2433

/cc @andrewsykim @bowei @dcbw
/assign @thockin @wojtek-t
/sig network

@k8s-ci-robot added the sig/network label (Categorizes an issue or PR as relevant to SIG Network) on Feb 5, 2021
@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA), kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory), and size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files) labels on Feb 5, 2021
@wojtek-t (Member) left a comment:

A couple of small comments for the PRR - please address.
And I will wait with approval for SIG approval.

keps/prod-readiness/sig-network/2433.yaml (review thread outdated; resolved)
keps/sig-network/2433-topology-aware-hints/README.md (2 review threads outdated; resolved)
- The `topology.kubernetes.io/topology-aware-routing` annotation is set to
`Auto` for the Service.
- At least one endpoint for the Service has a hint pointing to the zone
Kube-Proxy is running within.
Member:

At least one with a hint to our zone, or without a hint?

@robscott (Member Author):

I think the safest approach here might actually be to ignore all hints if some are missing. I think the most likely cause for hints missing would be a transition between states, and that transition exposes us to risk if we don't fall back to all endpoints immediately. For example, if we have a 500-endpoint Service, updates are going to come to kube-proxy for each EndpointSlice individually.

It's possible that at one point only one EndpointSlice will have hints for each endpoint. Without an immediate fallback, we'd risk significantly overloading these endpoints.

The other alternative is also not great. Let's say we always send traffic to endpoints with matching hints + endpoints without any hints. In the above scenario, we'd overload endpoints without hints if we were weighting all endpoints equally.

If kube-proxy could reliably determine how many zones were in a cluster, we could do a variation of that second approach that gave endpoints without a zone hint a 1/n weight, where n == the number of zones in the cluster. I think that could still result in imbalanced traffic during a transition depending on the composition of slices with/without hints.

Member:

Let's be safe.

Easy: if all endpoints have the hint, then respect the hint

More nuanced: note how many unique values we see and only use the hint if ~ at least 1/N of all endpoints are for this zone (with some fudge factor).

I'm not sure the more nuanced answer is useful.
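
A minimal Go sketch of the "easy" rule described above, using hypothetical types; this is illustrative only, not the actual kube-proxy implementation:

```go
// Endpoint is a hypothetical, simplified view of an EndpointSlice endpoint:
// a set of addresses plus an optional zone hint ("" means no hint was set).
type Endpoint struct {
	Addresses []string
	ZoneHint  string
}

// filterByZoneHints keeps only the endpoints hinted for localZone, but falls
// back to the full list if any endpoint is missing a hint (the "ignore all
// hints if some are missing" behavior discussed above) or if no endpoint is
// hinted for this zone.
func filterByZoneHints(endpoints []Endpoint, localZone string) []Endpoint {
	for _, ep := range endpoints {
		if ep.ZoneHint == "" {
			return endpoints // some hints missing: use all endpoints
		}
	}
	var filtered []Endpoint
	for _, ep := range endpoints {
		if ep.ZoneHint == localZone {
			filtered = append(filtered, ep)
		}
	}
	if len(filtered) == 0 {
		return endpoints // no endpoint hinted for this zone: fall back
	}
	return filtered
}
```

With this shape, a partially updated Service (only some slices carrying hints) immediately degrades to routing across all endpoints rather than overloading the hinted subset.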

@robscott (Member Author):

Yeah, every nuanced approach here feels like it could lead to more trouble than it's worth. I'd lean towards the safest approach, at least for alpha.

Member:

I think we should explore the more nuanced approach, but I agree not for Alpha. Maybe we can explicitly add having an answer for it as a beta graduation criterion.

keps/sig-network/2433-topology-aware-hints/README.md (4 review threads outdated; resolved)
@robscott force-pushed the topology-aware-hints branch 2 times, most recently from f169818 to 6d777cc, on February 5, 2021 at 07:51

* **Can the feature be disabled once it has been enabled (i.e. can we roll back
the enablement)?**
Yes. It can easily be disabled universally by turning off the feature gate or
Member:

When it goes GA, will it always be enabled?

@robscott (Member Author):

Good question. Although the feature gate would be enabled by default, it would still require per-Service opt-in. I think we could potentially change the default value of this annotation with a follow-up KEP. This is hidden a bit further up in the KEP:

> A future KEP will explore changing the default behavior from Disabled to Auto with a new feature gate.

Member:

👍


### Kube-Proxy

When the `TopologyAwareHints` feature gate is enabled, Kube-Proxy will be
Member:

I don't remember why, but for slices I think you ended up having two independent feature gates for kube-proxy and the slice controller. Can this happen again?

@robscott (Member Author):

Yeah I almost did that here. In that case I actually split the features up into different KEPs entirely because they felt sufficiently different. This feels more like a single feature, but I'm open to a separate feature gate here. Any naming recommendations?

Member:

I was just asking because I don't remember the cause. If we don't know, I don't think it's a big deal to amend this later if we find out it's really needed.

keps/sig-network/2433-topology-aware-hints/README.md (4 review threads outdated; resolved)

#### Future API Expansion
Member:

So your future safety here is "add a new field and treat them as mutually exclusive"? It seems to me that we'll need multiple zones (and probably weight) pretty quickly in order to handle small services.

e.g. 2 replicas (in zones A and B) and 3 node zones (A, B, C). You'll need to set up zone A to send to replica A 100%, Zone B to replica B 100% and zone C to replicas A and B at 50%. That's a simple case where weight doesn't matter but you simply can't represent an endpoint in 2 zones with the above proposal.

e.g. 4 replicas (2 in zone A, 1 each in B and C) and 3 node-zones. You'll want something like:

zone A: 50% to replica a1; 50% to replica a2
zone B: 75% to replica b1; 25% to replica a1
zone C: 75% to replica c1; 25% to replica a2

I think it will be important quickly - am I wrong? To that end, I'd suggest we start with zones []struct now and as the impl builds up consider weight.

Note this is different than general per-endpoint weight (which we should still consider on its own), because this is a hint to consumers how to split for capacity (based on the set of endpoints), not an indicator of capacity (which is more intrinsic to the endpoint). I'm not sure that holds water yet, ergo let's not add weight yet :)

@robscott (Member Author):

The answer for the current proposal is that we simply would fall back to slices without hints if we could not reach a reasonable balance otherwise. When we tested weighting with the previous proposals, it looked like it would significantly increase the number of EndpointSlice updates.

Weight is most useful when there is a small number of endpoints. That is also when topology aware routing carries the most risk, since it would be very easy for an individual endpoint to be overloaded. So even if we had weight and could accomplish topology aware routing with only 2 endpoints, the result might not be great.

Agree that it's worth expanding the API to make weight easy to add in the future, and also agree that we shouldn't add weight yet.

Member:

My argument is to start now with zones []struct with no weight, and when (inevitably, IMO) we get complaints that small-N services don't work, we can revisit weight. It might be, as you say, that weight only matters at the very low end (N < 12), but that's a lot of Services. All your harm-reduction heuristics will have to consider it, so it's definitely a different discussion.
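
For illustration only, a hypothetical Go sketch of the `zones []struct` shape being argued for here; the field names (and the commented-out weight) are assumptions for discussion, not the API proposed in this KEP:

```go
// ForZone is a hypothetical hint entry associating an endpoint with a zone.
type ForZone struct {
	Name string `json:"name"`
	// Weight uint32 `json:"weight,omitempty"` // possible future addition, deliberately omitted for now
}

// EndpointHints is a hypothetical hints block for an EndpointSlice endpoint.
// Using a slice rather than a single zone string lets one endpoint be
// "donated" to multiple zones, e.g. zone C consuming replicas from both A
// and B in the 2-replica / 3-zone example above.
type EndpointHints struct {
	ForZones []ForZone `json:"forZones,omitempty"`
}
```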


@thockin (Member) commented Feb 5, 2021:

Let's pin down the basics and finish the API argument in the code PR(s).


#### Overload

Overload is a key concept for this proposal. This occurs when there are less
Member:

This seems to be a property of a zone only, or will it be extended to the other things you mention, like regions?

@robscott (Member Author):

I think the same concepts could apply to region; it's just less common in my experience to see individual clusters spanning regions. This may be something that becomes more relevant with multicluster Services.

Member:

👍

Kube-Proxy is running within.
- All endpoints for the Service have zone hints.

When the above conditions are true, kube-proxy will only route traffic to
Member:

Based on the example you've added

- addresses: ["10.1.2.1"]
  zone: "zone-a"
  hints:
    zone: "zone-a"
- addresses: ["10.1.2.2"]
  zone: "zone-b"
  hints:
    zone: "zone-b"
- addresses: ["10.1.2.3"]
  zone: "zone-a"
  hints:
    zone: "zone-c"

kube-proxies from zone A will never use

- addresses: ["10.1.2.3"]
  zone: "zone-a"
  hints:
    zone: "zone-c"

despite it being in the same zone. Is that correct?

Member:

yeah, in this case C didn't have any so it was "donated" from A, and therefore all zones are ~equally provisioned.

@aojea (Member) commented Feb 6, 2021:

Then it is possible that pods on the same node as the endpoint will not use the local endpoint, because it was donated to another zone. I don't know how much you are into soccer/football, but there is a legendary sentence from a Brazilian coach nicknamed Tim: "Playing soccer is like covering yourself with a short blanket." That means either you cover your head, or you wrap your feet 😄

@robscott (Member Author):

Haha yep, that is definitely an unfortunate downside of this approach. The previous subsetting approach was incompatible with ExternalTrafficPolicy=Local for the same reason.

@aojea (Member) commented Feb 6, 2021:

/lgtm
Looking forward to seeing your numbers with this implementation 🥇

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Feb 6, 2021
@wojtek-t (Member) left a comment:

Some super minor comments.

const SubSystem = "endpoint_slice_controller"

// This metric observes overload per zone for endpoints in EndpointSlices
EPSOverloadByZone = metrics.NewHistogramVec(
Member:

How are we going to report this metric? Whenever you process a particular service, do you report it for each of the zones?

I'm definitely not against this metric, but I'm having some trouble intuitively explaining to myself what it means.

@robscott (Member Author):

Yeah, I'm really not sure. I was hoping that reporting every value when a Service syncs would give us somewhat helpful data on the distribution. I think some visibility into overload would be really helpful, but I still haven't found a great way to communicate that with a metric.

Member:

I see two possible use cases:

  • help users understand how well (or badly) distributed their deployments are, i.e. they are overloading zone A, ...
  • understand the distribution algorithm's convergence; I think that in a dynamic environment this is going to fluctuate, and if you see it changing constantly, that means you are constantly redistributing endpoints, which will impact workloads

Member:

Sure - I fully buy the use cases (those are the same ones I had in mind).

My concern was more: "would this metric allow us to answer those questions?"
And I'm not convinced (at least not intuitively...)

@robscott (Member Author):

Yeah, I'm not sure it would either. I've been trying to think of alternatives but haven't found any so far. To avoid overcommitting to anything, I've removed this from the KEP for now and will keep thinking about potential metrics we could add here. Any ideas would be very appreciated.

This feature is not yet implemented but per-Service enablement/disablement is
covered in depth as part of the test plan.

### Rollout, Upgrade and Rollback Planning
Member:

Technically neither of the below is needed for Alpha, but I'm really happy to see that already filled in :)

keps/sig-network/2433-topology-aware-hints/README.md (review thread outdated; resolved)

### Configuration

A new `service.kubernetes.io/topology-aware-routing` annotation will be
Member:

Instead of an annotation, I wonder if TopologyAware (or just Topology) can be one of the possible values for internalTrafficPolicy. That way we can validate and reject it when externalTrafficPolicy: Local or when internalTrafficPolicy: Local|PreferLocal.

Member:

I guess we can also validate against the annotation, but I'm not sure what the precedent for that is.

@robscott (Member Author):

Good call. Based on the suggestion here and in that KEP (#2441 (comment)), plus some follow-up discussion in Slack, I think it makes sense to integrate this with the TrafficPolicy field instead of adding a new annotation. I've updated this KEP to reflect that. I think you provided a good summary in Slack of what we're thinking here:

  • if externalTrafficPolicy=Cluster, fall back to trafficPolicy for external sources
  • if externalTrafficPolicy=Local -- this rule only takes precedence for external sources, with internal traffic following trafficPolicy
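
A minimal sketch of that precedence in Go, with hypothetical field names and string values (not the actual kube-proxy code):

```go
// Service is a hypothetical, simplified view of the fields discussed above.
type Service struct {
	ExternalTrafficPolicy string // "Cluster" or "Local"
	TrafficPolicy         string // e.g. "Default", "Local", "PreferLocal", or a topology-aware value
}

// effectivePolicy returns which policy governs routing for a given source:
// externalTrafficPolicy=Local only takes precedence for external sources,
// and everything else falls through to trafficPolicy.
func effectivePolicy(svc Service, externalSource bool) string {
	if externalSource && svc.ExternalTrafficPolicy == "Local" {
		return "Local"
	}
	return svc.TrafficPolicy
}
```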

@k8s-ci-robot removed the lgtm label on Feb 8, 2021
@robscott (Member Author) commented Feb 8, 2021:

@thockin @andrewsykim I've updated this KEP to use the TrafficPolicy field, PTAL.

@thockin added the tide/merge-method-squash label (Denotes a PR that should be squashed by tide when it merges) on Feb 9, 2021
@thockin (Member) left a comment:

/lgtm
/approve

This functionality will be guarded by the `TopologyAwareHints` feature gate.
This gate will be dependent on the `ServiceInternalTrafficPolicy` feature gate
since it uses the `TrafficPolicy` guarded by that gate. This gate should not be
enabled if the deprecated `ServiceTopology` gate is enabled.
Member:

LOL - can we get a table or something?

Member:

:)

@robscott (Member Author):

Haha, cleaned this up so it's hopefully a bit easier to understand.

for a Service, the EndpointSlice controller will add hints to EndpointSlices.
These hints will indicate where an endpoint should be consumed by proxy
implementations to enable topology aware routing.
When the `TopologyAwareHints` feature gate is enabled and the `trafficPolicy`
Member:

Intersectional issue - for all of the trafficPolicy values that involve a fallback - what do they fall back to? I'm a bit anxious about this (commented on both KEPs).

@robscott (Member Author):

Good question - wouldn't they fall back to Cluster to match the current default behavior?

Member:

Yeah, but is that what we really want?

@k8s-ci-robot added the lgtm label on Feb 9, 2021

keps/sig-network/2433-topology-aware-hints/kep.yaml (review thread outdated; resolved)
keps/sig-network/2433-topology-aware-hints/README.md (2 review threads outdated; resolved)
@wojtek-t (Member) commented Feb 9, 2021:

/approve

/hold
[for addressing those small nits]

@k8s-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Feb 9, 2021
@k8s-ci-robot (Contributor):
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: robscott, thockin, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Feb 9, 2021
@k8s-ci-robot removed the lgtm label on Feb 9, 2021
@wojtek-t (Member) commented Feb 9, 2021:

/lgtm
/hold cancel

@k8s-ci-robot added the lgtm label and removed the do-not-merge/hold label on Feb 9, 2021
@robscott (Member Author) commented Feb 9, 2021:

/hold

@k8s-ci-robot added the do-not-merge/hold label on Feb 9, 2021
@k8s-ci-robot merged commit 43cc9a7 into kubernetes:master on Feb 9, 2021
@k8s-ci-robot added this to the v1.21 milestone on Feb 9, 2021