kep-1669: update alpha milestones for v1.22
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
andrewsykim committed May 12, 2021
1 parent 90424fd commit a6909c2
Showing 3 changed files with 298 additions and 8 deletions.
3 changes: 3 additions & 0 deletions keps/prod-readiness/sig-network/1669.yaml
@@ -0,0 +1,3 @@
kep-number: 1669
alpha:
  approver: "@wojtek-t"
@@ -20,6 +20,13 @@
- [Alpha](#alpha)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
- [Monitoring Requirements](#monitoring-requirements)
- [Dependencies](#dependencies)
- [Scalability](#scalability)
- [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
@@ -28,10 +35,10 @@
## Release Signoff Checklist

- [X] Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] KEP approvers have approved the KEP status as `implementable`
- [ ] Design details are appropriately documented
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] Graduation criteria is in place
- [X] KEP approvers have approved the KEP status as `implementable`
- [X] Design details are appropriately documented
- [X] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [X] Graduation criteria is in place
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Expand Down Expand Up @@ -125,8 +132,9 @@ All existing E2E tests for Services should continue to pass.

#### Alpha

* kube-proxy internally tracks the terminating condition of an endpoint.
* feature is only enabled if the feature gate `EndpointSliceTerminatingCondition` is on.
* kube-proxy internally tracks the `terminating` and `serving` conditions from EndpointSlice.
* kube-proxy falls back to terminating endpoints if and only if they are the only available endpoints.
* feature is only enabled if the feature gate `ProxyTerminatingEndpoints` is on.
* unit tests in kube-proxy.
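
The fallback rule in the list above can be sketched as follows. This is a minimal illustration, not kube-proxy's actual code: the `Endpoint` class, the `select_endpoints` function, and the `gate_enabled` parameter are hypothetical names, and the sketch assumes that a terminating endpoint is eligible for fallback only while its `serving` condition is still true.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical, simplified view of one EndpointSlice endpoint."""
    ip: str
    ready: bool
    serving: bool
    terminating: bool


def select_endpoints(endpoints: List[Endpoint], gate_enabled: bool) -> List[Endpoint]:
    """Pick the endpoints that should receive traffic.

    Ready endpoints always win. Terminating endpoints are used only when
    no ready endpoint exists AND the (assumed) feature gate flag is on.
    """
    ready = [ep for ep in endpoints if ep.ready]
    if ready or not gate_enabled:
        # Either there is a ready endpoint, or the gate is off and the
        # existing behavior applies (an empty list means traffic is dropped).
        return ready
    # Assumption: only endpoints that are terminating but still serving
    # are eligible for fallback.
    return [ep for ep in endpoints if ep.serving and ep.terminating]
```

With the gate off, a Service whose endpoints are all terminating yields an empty list, which corresponds to the existing behavior of dropping traffic.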

### Upgrade / Downgrade Strategy
@@ -141,6 +149,266 @@ This would either happen if a version of the control plane was not aware of the

There's not much risk involved, as the worst-case scenario is falling back to existing behavior.

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

###### How can this feature be enabled / disabled in a live cluster?

- [X] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: ProxyTerminatingEndpoints
  - Components depending on the feature gate: kube-proxy

###### Does enabling the feature change any default behavior?

Yes. When externalTrafficPolicy=Local and only terminating endpoints are available,
kube-proxy will route traffic to those endpoints. Before this change, kube-proxy
dropped this traffic instead.

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes.

###### What happens if we reenable the feature if it was previously rolled back?

kube-proxy will again route traffic to terminating endpoints if they are the only available endpoints.

###### Are there any tests for feature enablement/disablement?

Yes, there will be unit tests in kube-proxy with the feature gate enabled and disabled.

### Rollout, Upgrade and Rollback Planning

<!--
This section must be completed when targeting beta to a release.
-->

###### How can a rollout fail? Can it impact already running workloads?

<!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
-->

TBD for beta.

###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

TBD for beta.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

<!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

TBD for beta.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

<!--
Even if applying deprecation policies, they may still surprise some users.
-->

TBD for beta.

### Monitoring Requirements

<!--
This section must be completed when targeting beta to a release.
-->

###### How can an operator determine if the feature is in use by workloads?

<!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->

TBD for beta.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

<!--
Pick one or more of these and delete the rest.
-->

TBD for beta.

- [ ] Metrics
  - Metric name:
  - [Optional] Aggregation method:
  - Components exposing the metric:
- [ ] Other (treat as last resort)
  - Details:

###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

<!--
At a high level, this usually will be in the form of "high percentile of SLI
per day <= X". It's impossible to provide comprehensive guidance, but at the very
high level (needs more precise definitions) those may be things like:
- per-day percentage of API calls finishing with 5XX errors <= 1%
- 99th percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job <= 10%
- 99.9% of /health requests per day finish with 200 code
-->

TBD for beta.

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

<!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->

TBD for beta.

### Dependencies

<!--
This section must be completed when targeting beta to a release.
-->

###### Does this feature depend on any specific services running in the cluster?

<!--
Think about both cluster-level services (e.g. metrics-server) as well
as node-level agents (e.g. specific version of CRI). Focus on external or
optional services that are needed. For example, if this feature depends on
a cloud provider API, or upon an external software-defined storage or network
control plane.
For each of these, fill in the following—thinking about running existing user workloads
and creating new ones, as well as about cluster-level services (e.g. DNS):
- [Dependency name]
- Usage description:
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the feature:
-->

TBD for beta.

### Scalability

<!--
For alpha, this section is encouraged: reviewers should consider these questions
and attempt to answer them.
For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->

TBD for beta.

###### Will enabling / using this feature result in any new API calls?

<!--
Describe them, providing:
- API call type (e.g. PATCH pods)
- estimated throughput
- originating component(s) (e.g. Kubelet, Feature-X-controller)
Focusing mostly on:
- components listing and/or watching resources they didn't before
- API calls that may be triggered by changes of some Kubernetes resources
(e.g. update of object X triggers new updates of object Y)
- periodic API calls to reconcile state (e.g. periodic fetching state,
heartbeats, leader election, etc.)
-->

TBD for beta.

###### Will enabling / using this feature result in introducing new API types?

<!--
Describe them, providing:
- API type
- Supported number of objects per cluster
- Supported number of objects per namespace (for namespace-scoped objects)
-->

TBD for beta.

###### Will enabling / using this feature result in any new calls to the cloud provider?

<!--
Describe them, providing:
- Which API(s):
- Estimated increase:
-->

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

<!--
Describe them, providing:
- API type(s):
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
-->

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

<!--
Look at the [existing SLIs/SLOs].
Think about adding additional work or introducing new steps in between
(e.g. need to do X to start a container), etc. Please describe the details.
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
-->

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

<!--
Things to keep in mind include: additional in-memory state, additional
non-trivial computations, excessive access to disks (including increased log
volume), significant amount of data sent and/or received over network, etc.
Think through this both in small and large cases, again with respect to the
[supported limits].
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
-->

### Troubleshooting

<!--
This section must be completed when targeting beta to a release.
The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->

###### How does this feature react if the API server and/or etcd is unavailable?

###### What are other known failure modes?

<!--
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->

###### What steps should be taken if SLOs are not being met to determine the problem?

## Implementation History

- [x] 2020-04-23: KEP accepted as implementable for v1.19
@@ -11,12 +11,31 @@ reviewers:
  - "@smarterclayton"
approvers:
  - "@thockin"
prr-approvers:
  - "@johnbelamaric"
creation-date: 2020-04-07
last-updated: 2020-04-07
status: implementable
see-also:
  - "/keps/sig-network/1672-tracking-terminating-endpoints/README.md"
  - https://github.com/kubernetes/kubernetes/issues/85643

latest-milestone: "0.0"
stage: "alpha"
# The target maturity stage in the current dev cycle for this KEP.
stage: alpha

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.22"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.22"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: ProxyTerminatingEndpoints
    components:
      - kube-proxy
disable-supported: true
