kep-1672: update beta milestones for v1.22
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
andrewsykim committed May 13, 2021
1 parent 5e1d8ec commit 7454fd1
Showing 3 changed files with 176 additions and 10 deletions.
5 changes: 5 additions & 0 deletions keps/prod-readiness/sig-network/1672.yaml
@@ -0,0 +1,5 @@
kep-number: 1672
alpha:
  approver: "@wojtek-t"
beta:
  approver: "@wojtek-t"
158 changes: 150 additions & 8 deletions keps/sig-network/1672-tracking-terminating-endpoints/README.md
@@ -15,8 +15,16 @@
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Beta](#beta)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
- [Monitoring Requirements](#monitoring-requirements)
- [Dependencies](#dependencies)
- [Scalability](#scalability)
- [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
<!-- /toc -->
@@ -91,7 +99,7 @@ and possibly more depending on how many times readiness changes during terminati

## Design Details

To track whether an endpoint is terminating, a `terminating` field would be added as part of
To track whether an endpoint is terminating, `terminating` and `serving` fields would be added as part of
the `EndpointConditions` type in the EndpointSlice API.

```go
@@ -100,14 +108,25 @@ type EndpointConditions struct {
// ready indicates that this endpoint is prepared to receive traffic,
// according to whatever system is managing the endpoint. A nil value
// indicates an unknown state. In most cases consumers should interpret this
// unknown state as ready.
// unknown state as ready. For compatibility reasons, ready should never be
// "true" for terminating endpoints.
// +optional
Ready *bool `json:"ready,omitempty" protobuf:"bytes,1,name=ready"`

// terminating indicates if this endpoint is terminating. Consumers should assume a
// nil value indicates the endpoint is not terminating.
// serving is identical to ready except that it is set regardless of the
// terminating state of endpoints. This condition should be set to true for
// a ready endpoint that is terminating. If nil, consumers should defer to
// the ready condition. This field can be enabled with the
// EndpointSliceTerminatingCondition feature gate.
// +optional
Terminating *bool `json:"terminating,omitempty" protobuf:"bytes,2,name=terminating"`
Serving *bool `json:"serving,omitempty" protobuf:"bytes,2,name=serving"`

// terminating indicates that this endpoint is terminating. A nil value
// indicates an unknown state. Consumers should interpret this unknown state
// to mean that the endpoint is not terminating. This field can be enabled
// with the EndpointSliceTerminatingCondition feature gate.
// +optional
Terminating *bool `json:"terminating,omitempty" protobuf:"bytes,3,name=terminating"`
}
```

@@ -116,7 +135,8 @@ NOTE: A nil value for `Terminating` indicates that the endpoint is not terminati
Updates to endpointslice controller:
* include pods with a deletion timestamp in endpointslice
* any pod with a deletion timestamp will have condition.terminating = true
* allow endpoint ready condition to change during termination
* any terminating pod must have condition.ready = false.
* the new `serving` condition is set based on pod readiness regardless of terminating state.
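
The controller rules listed above can be sketched as a small function. This is an illustration only, not the endpointslice controller's actual code: the `Pod` type, `boolPtr`, and `conditionsFor` are hypothetical names, and the real controller works with `v1.Pod` and the EndpointSlice API types.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified stand-in for the pod state that matters
// here. It is not the real v1.Pod type.
type Pod struct {
	Ready       bool // outcome of the pod's readiness checks
	Terminating bool // true when the pod has a deletion timestamp
}

// EndpointConditions is a local copy of the three conditions described in
// this KEP, declared here so the sketch compiles on its own.
type EndpointConditions struct {
	Ready       *bool
	Serving     *bool
	Terminating *bool
}

// boolPtr is a hypothetical helper that returns a pointer to b.
func boolPtr(b bool) *bool { return &b }

// conditionsFor derives endpoint conditions from a pod:
//   - terminating is true for any pod with a deletion timestamp
//   - ready is never true for a terminating pod
//   - serving follows pod readiness regardless of terminating state
func conditionsFor(p Pod) EndpointConditions {
	return EndpointConditions{
		Ready:       boolPtr(p.Ready && !p.Terminating),
		Serving:     boolPtr(p.Ready),
		Terminating: boolPtr(p.Terminating),
	}
}

func main() {
	// A pod that still passes readiness checks while it is being deleted.
	c := conditionsFor(Pod{Ready: true, Terminating: true})
	fmt.Println(*c.Ready, *c.Serving, *c.Terminating) // false true true
}
```

Keeping `ready` false for terminating pods means consumers that only look at `ready` see the same behavior as before, while newer consumers can read `serving` and `terminating` together.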

### Test Plan

@@ -134,10 +154,16 @@ E2E tests:

#### Alpha

* EndpointSlice API includes `Terminating` condition.
* `Terminating` condition can only be set if feature gate `EndpointSliceTerminatingCondition` is enabled.
* EndpointSlice API includes `Terminating` and `Serving` conditions.
* `Terminating` and `Serving` conditions can only be set if feature gate `EndpointSliceTerminatingCondition` is enabled.
* Unit tests in endpointslice controller and API validation/strategy.

#### Beta

* Integration API tests exercising the `terminating` and `serving` conditions.
* `EndpointSliceTerminatingCondition` is enabled by default.
* Consensus on scalability implications resulting from additional EndpointSlice writes with approval from sig-scalability.

### Upgrade / Downgrade Strategy

Since this is an addition to the EndpointSlice API, the upgrade/downgrade strategy will follow that
@@ -148,9 +174,125 @@ of the [EndpointSlice API work](/keps/sig-network/20190603-endpointslices/README
Since this is an addition to the EndpointSlice API, the version skew strategy will follow that
of the [EndpointSlice API work](/keps/sig-network/20190603-endpointslices/README.md).

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

###### How can this feature be enabled / disabled in a live cluster?

- [X] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name: EndpointSliceTerminatingCondition
- Components depending on the feature gate: kube-apiserver and kube-controller-manager

###### Does enabling the feature change any default behavior?

Yes, terminating endpoints are now included in the EndpointSlice API. The `ready` condition of a terminating endpoint will always be `false`, so consumers do not send traffic to terminating endpoints unless they explicitly check the new `serving` and `terminating` conditions.
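
A consumer-side sketch of how the three conditions might be interpreted, including the nil handling described in the API comments. The helper names `isReady`, `isServing`, and `isTerminating` are hypothetical and not part of any Kubernetes package; the struct is a local copy so the example compiles on its own.

```go
package main

import "fmt"

// EndpointConditions is a local copy of the condition fields from the
// EndpointSlice API change in this KEP.
type EndpointConditions struct {
	Ready       *bool
	Serving     *bool
	Terminating *bool
}

// isReady treats a nil ready condition as ready, as the API comment suggests
// for most consumers.
func isReady(c EndpointConditions) bool {
	return c.Ready == nil || *c.Ready
}

// isServing defers to the ready condition when serving is nil.
func isServing(c EndpointConditions) bool {
	if c.Serving == nil {
		return isReady(c)
	}
	return *c.Serving
}

// isTerminating treats a nil terminating condition as not terminating.
func isTerminating(c EndpointConditions) bool {
	return c.Terminating != nil && *c.Terminating
}

func main() {
	t, f := true, false

	// A terminating endpoint whose pod still passes readiness checks.
	draining := EndpointConditions{Ready: &f, Serving: &t, Terminating: &t}
	fmt.Println(isReady(draining))                              // false
	fmt.Println(isServing(draining) && isTerminating(draining)) // true

	// An endpoint written with the feature gate disabled: only ready is set.
	legacy := EndpointConditions{Ready: &t}
	fmt.Println(isServing(legacy), isTerminating(legacy)) // true false
}
```

A consumer that only calls `isReady` skips the draining endpoint, which matches the behavior before this change. One that also checks `isServing` and `isTerminating` can choose to keep using it.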

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. On rollback, terminating endpoints will no longer be included in EndpointSlice and the `terminating` and `serving` conditions will not be set.

###### What happens if we reenable the feature if it was previously rolled back?

EndpointSlice will continue to have the `terminating` and `serving` conditions set, and terminating endpoints will be added to the EndpointSlice on its next sync.

###### Are there any tests for feature enablement/disablement?

Yes, there will be strategy API unit tests validating whether the new API fields are allowed based on the feature gate.

### Rollout, Upgrade and Rollback Planning

###### How can a rollout fail? Can it impact already running workloads?

If there are consumers of EndpointSlice that do not check the `ready` condition, then they may unexpectedly start sending traffic to terminating endpoints.
It is assumed that almost all consumers of EndpointSlice check the `ready` condition prior to allowing traffic to a pod.

###### What specific metrics should inform a rollback?

Application-level traffic metrics indicating increased packet loss or error rates.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Not yet, but manual upgrade and rollback testing will be done prior to graduating the feature to Beta.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

### Monitoring Requirements

###### How can an operator determine if the feature is in use by workloads?

The conditions will always be set for terminating pods, but consumers may choose to ignore them. It is up to consumers of the API to provide metrics
on how the new conditions are being used.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics will be added for the total number of endpoints with the `serving` and `terminating` conditions set.

###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

N/A

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

N/A

### Dependencies

###### Does this feature depend on any specific services running in the cluster?

N/A

### Scalability

###### Will enabling / using this feature result in any new API calls?

Yes, there will be more writes to EndpointSlice when:
* a pod starts termination
* a pod's readiness changes during termination

###### Will enabling / using this feature result in introducing new API types?

No.

###### Will enabling / using this feature result in any new calls to the cloud provider?

No.

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

Yes, it will increase the size of EndpointSlice by adding two boolean fields for each endpoint.

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

The network programming latency SLO might be impacted due to additional writes to EndpointSlice.

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

More writes to EndpointSlice could result in more resource usage from etcd disk IO and network bandwidth for all watchers.

### Troubleshooting

###### How does this feature react if the API server and/or etcd is unavailable?

EndpointSlice conditions will get stale.

###### What are other known failure modes?

* Consumers of EndpointSlice that do not check the `ready` condition may unexpectedly use terminating endpoints.

###### What steps should be taken if SLOs are not being met to determine the problem?

* Disable the feature gate
* Check if consumers of EndpointSlice are using the `serving` or `terminating` condition
* Check etcd disk usage

## Implementation History

- [x] 2020-04-23: KEP accepted as implementable for v1.19
- [x] 2020-07-01: initial PR with alpha implementation merged for v1.20
- [x] 2020-05-12: KEP accepted as implementable for v1.22

## Drawbacks

23 changes: 21 additions & 2 deletions keps/sig-network/1672-tracking-terminating-endpoints/kep.yaml
@@ -19,5 +19,24 @@ see-also:
- /kep/sig-network/20190603-EndpointSlice-API.md
replaces: []

latest-milestone: "0.0"
stage: "alpha"
# The target maturity stage in the current dev cycle for this KEP.
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.22"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.20"
  beta: "v1.22"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: EndpointSliceTerminatingCondition
    components:
      - kube-apiserver
      - kube-controller-manager
disable-supported: true
