Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-1672: updates for v1.24 #3168

Merged
merged 1 commit into from
Jun 7, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-network/1672.yaml
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
kep-number: 1672
alpha:
approver: "@wojtek-t"
beta:
approver: "@wojtek-t"
52 changes: 40 additions & 12 deletions keps/sig-network/1672-tracking-terminating-endpoints/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Beta](#beta)
- [GA](#ga)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
Expand All @@ -32,13 +33,13 @@
## Release Signoff Checklist

- [X] Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] KEP approvers have approved the KEP status as `implementable`
- [ ] Design details are appropriately documented
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] Graduation criteria is in place
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
- [X] KEP approvers have approved the KEP status as `implementable`
- [X] Design details are appropriately documented
- [X] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [X] Graduation criteria is in place
- [X] "Implementation History" section is up-to-date for milestone
- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [X] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
Expand Down Expand Up @@ -164,6 +165,13 @@ E2E tests:
* `EndpointSliceTerminatingCondition` is enabled by default.
* Consensus on scalability implications resulting from additional EndpointSlice writes with approval from sig-scalability.

#### GA
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding what I think are the GA graduation criteria for this work right now so we have it in mind, but the feature is going to stay beta for v1.24


* E2E tests validating that terminating pods are properly reflected in EndpointSlice API.
* Ensure there are no performance/scalability regressions when enabling additional endpointslice writes for terminating endpoints.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well - technically we know the regression exists. But we say that it's acceptable one. Please adjust the wording.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I guess we can tweak it when graduating the feature to GA - I don't want to block this PR on this simple comment now.

* This will be validated by running the existing scalability test suites where pods handle SIGTERM from kubelet before terminating.
* All necessary metrics are in place to provide adequate observability and monitoring for this feature.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple comments (since I can't comment below);

L208: Were the tests added? If so, can you maybe link them?

L219: IIRC, we were discussing adding some metrics. Has that happened? If not, what is the plan?
[Relying on application-level metrics is not really where we would like to be for such important features like load serving.]

L223: Were the tests run? Can you summarize the findings?

L238: Were metrics added? If so, can you link their names/component exposing them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L208: Were the tests added? If so, can you maybe link them?

We have integration tests here as well as rest strategy unit tests here for validating the behavior of the feature gate. I'll link these in the KEP as well.

L219: IIRC, we were discussing adding some metrics. Has that happened? If not, what is the plan?
[Relying on application-level metrics is not really where we would like to be for such important features like load serving.]

I think we can add metrics for total # of endpoints by condition at the very least, let me update the KEP to include this.

L223: Were the tests run? Can you summarize the findings?

There was some manual testing done, will dig through some history to find the results.

L238: Were metrics added? If so, can you link their names/component exposing them?

(see my comment above re: metrics by condition)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw -- we do have endpointslice metrics now that would be useful here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/endpointslice/metrics/metrics.go#L30-L53

So users would be able to see the amount of endpoint churn that happens due to this change during upgrade. I'll link these in the KEP too.

### Upgrade / Downgrade Strategy

Since this is an addition to the EndpointSlice API, the upgrade/downgrade strategy will follow that
Expand Down Expand Up @@ -200,6 +208,10 @@ EndpointSlice will continue to have the `terminating` and `serving` condition se

Yes, there will be strategy API unit tests validating if the new API field is allowed based on the feature gate.

These tests can be found here:
- https://github.com/kubernetes/kubernetes/blob/master/test/integration/endpointslice/endpointsliceterminating_test.go#L44
- https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/discovery/endpointslice/strategy_test.go#L42-L137

### Rollout, Upgrade and Rollback Planning

###### How can a rollout fail? Can it impact already running workloads?
Expand All @@ -209,7 +221,15 @@ It is assumed that almost all consumers of EndpointSlice check the `ready` condi

###### What specific metrics should inform a rollback?

Application-level traffic indicating packet-loss or error rates.
EndpointSlice controller supports the following metrics that would be relevant for this feature:
- endpoint_slice_controller_endpoints_added_per_sync
- endpoint_slice_controller_endpoints_removed_per_sync
- endpoint_slice_controller_changes
- endpoint_slice_controller_endpointslices_changed_per_sync
- endpoint_slice_controller_syncs

The following metrics can be used to see if the introduction of this change resulted in a significantly
large number of traffic to the apiserver.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Expand All @@ -228,21 +248,29 @@ on how the new conditions are being used.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics will be added for total endpoints with the `serving` and `terminating` condition set.
The existing SLI can be used to determine the health of this feature:

```
Latency of programming in-cluster load balancing mechanism (e.g. iptables), measured from when service spec or list of its Ready pods change to when it is reflected in load balancing mechanism, measured as 99th percentile over last 5 minutes aggregated across all programmers
```

###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

N/A
It's hard to gauge an exact number here, because the existing SLI does not have a target SLO yet.
However, we should assume that the addition of the `serving` and `terminating` conditions do not
significantly impact the latency of kube-proxy syncing load balancer rules.

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

N/A
Adapting the existing endpoint slice controller metrics to also include endpoint conditions
as a label could be useful since a user can distinguish if the endpoint churn is happening due
to the addition of terminating endpoints or for another reason.

### Dependencies

###### Does this feature depend on any specific services running in the cluster?

N/A
None aside from the existing core Kubernetes components, specifically kube-apiserver and kube-controller-manager.

### Scalability

Expand Down
5 changes: 3 additions & 2 deletions keps/sig-network/1672-tracking-terminating-endpoints/kep.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -20,16 +20,17 @@ see-also:
replaces: []

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.22"
latest-milestone: "v1.24"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.20"
beta: "v1.22"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my apologies -- seems like I never got around to coming back to update this KEP after some of the last minute changes that happened in v1.22


# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
Expand Down