diff --git a/keps/prod-readiness/sig-node/2238.yaml b/keps/prod-readiness/sig-node/2238.yaml new file mode 100644 index 00000000000..810d76ca634 --- /dev/null +++ b/keps/prod-readiness/sig-node/2238.yaml @@ -0,0 +1,3 @@ +kep-number: 2238 +alpha: + approver: "@johnbelamaric" diff --git a/keps/sig-node/2238-liveness-probe-grace-period/README.md b/keps/sig-node/2238-liveness-probe-grace-period/README.md new file mode 100644 index 00000000000..2abd177a1f5 --- /dev/null +++ b/keps/sig-node/2238-liveness-probe-grace-period/README.md @@ -0,0 +1,542 @@ +# KEP-2238: Liveness Probe Grace Periods + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [Configuration example](#configuration-example) + - [User Stories (Optional)](#user-stories-optional) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [Graduation](#graduation) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [X] (R) KEP approvers have approved the KEP status as `implementable` +- [X] (R) Design details are appropriately documented +- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input +- [X] (R) Graduation criteria is in place +- [X] (R) Production readiness review completed +- [X] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +Liveness probes currently use the `terminationGracePeriodSeconds` on both +normal shutdown and when probes fail. Hence, if a long termination period is +set, and a liveness probe fails, a workload will not be promptly restarted +because it will wait for the full termination period. + +I propose adding a new field to probes, `probe.terminationGracePeriodSeconds`. 
+When set, it will override the pod-level `terminationGracePeriodSeconds` for
+terminations triggered by a failed liveness or startup probe, and will be
+ignored for readiness probes. This preserves the current behaviour when the
+field is unset, while providing configuration to address the unintended
+coupling described above.
+
+## Motivation
+
+This change gives users the option to decouple the probe-failure grace period
+from the pod-level `terminationGracePeriodSeconds`; the current behaviour is
+counter-intuitive and provides no way to decouple the two.
+
+[Other implementation options](#alternatives) might avoid an API change, but
+would end up breaking backwards compatibility that some users may have come to
+[expect/rely on](https://github.com/kubernetes/kubernetes/issues/64715#issuecomment-756100727),
+even if we take a phased approach to introducing them.
+
+An example of a scenario where this bug was apparent is a long outage reported
+by an ingress controller:
+
+> the controller uses a 3600s grace period so that it can drain long requests
+> from end user app traffic (it listens on endpoints that can remain available
+> for hours like host network, but also incoming tcp load balancers which have
+> open connections), but when the controller wedges it takes an hour to
+> resolve, which is the worst possible outcome (the process wedges immediately,
+> but kube has to wait the full hour to kill it, defeating the goal of
+> liveness). [comment](https://github.com/kubernetes/kubernetes/issues/64715#issuecomment-756201502)
+
+### Goals
+
+- Liveness probes can have a configurable grace period on failure.
+- Existing behaviour is not changed.
+
+### Non-Goals
+
+Any changes to probe behaviour beyond the scope of the
+[bug](https://github.com/kubernetes/kubernetes/issues/64715) this KEP is
+intended to address.
+
+## Proposal
+
+We can add a `terminationGracePeriodSeconds` integer field to the Probe API.
+This field would specify how long the kubelet should wait before forcibly
+terminating the container when a probe fails, overriding the pod-level
+`terminationGracePeriodSeconds` value.
+
+This change would apply to [liveness and startup probes][probes1], but **not**
+readiness probes, which [use a different code path][probes2].
+
+This change would be [API-compatible][compatible]. If the field is not
+specified, we will default to the current behaviour, which uses the pod-level
+`terminationGracePeriodSeconds`.
+
+[compatible]: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md#on-compatibility
+[probes1]: https://github.com/kubernetes/kubernetes/blob/e414d4e5c2ab85eada7eb8e80a363658e199968b/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L631-L655
+[probes2]: https://github.com/kubernetes/kubernetes/blob/e414d4e5c2ab85eada7eb8e80a363658e199968b/pkg/kubelet/prober/prober_manager.go#L55
+
+### Configuration example
+
+```yaml
+spec:
+  terminationGracePeriodSeconds: 3600
+  containers:
+  - name: test
+    image: ...
+
+    ports:
+    - name: liveness-port
+      containerPort: 8080
+      hostPort: 8080
+
+    livenessProbe:
+      httpGet:
+        path: /healthz
+        port: liveness-port
+      failureThreshold: 1
+      periodSeconds: 60
+      # New field #
+      terminationGracePeriodSeconds: 60
+```
+
+### User Stories (Optional)
+
+N/A - bugfix
+
+### Notes/Constraints/Caveats (Optional)
+
+### Risks and Mitigations
+
+This should be a low-risk API change as it is backwards-compatible. If the
+field is not set, the current behaviour will be maintained.
+
+## Design Details
+
+This field is not valid for readiness probes. A rough sketch of the intended
+kubelet behaviour is shown below.
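+
+The sketch is illustrative only: the `Probe`/`PodSpec` types and the
+`effectiveGracePeriod` helper are simplified stand-ins invented for this
+example rather than the actual kubelet code, and the real change would live in
+the container kill path referenced in the [Proposal](#proposal). It shows the
+intended selection logic: the probe-level override is honoured only when it is
+set and the `ProbeTerminationGracePeriod` feature gate is enabled; otherwise
+the pod-level value (or the API default) applies.
+
+```go
+package main
+
+import "fmt"
+
+// Probe and PodSpec are simplified stand-ins for the relevant parts of the
+// core/v1 API types; only the fields needed for this sketch are included.
+type Probe struct {
+	// TerminationGracePeriodSeconds is the proposed new optional field.
+	TerminationGracePeriodSeconds *int64
+}
+
+type PodSpec struct {
+	TerminationGracePeriodSeconds *int64
+	LivenessProbe                 *Probe
+}
+
+// effectiveGracePeriod returns the grace period (in seconds) that would be
+// used to kill a container after a failed liveness probe. The probe-level
+// override is only honoured when the ProbeTerminationGracePeriod feature
+// gate is enabled; otherwise, or when the field is unset, the pod-level
+// value is used, which preserves today's behaviour.
+func effectiveGracePeriod(spec PodSpec, gateEnabled bool) int64 {
+	const defaultGracePeriodSeconds = 30 // API default when nothing is set
+
+	if gateEnabled && spec.LivenessProbe != nil &&
+		spec.LivenessProbe.TerminationGracePeriodSeconds != nil {
+		return *spec.LivenessProbe.TerminationGracePeriodSeconds
+	}
+	if spec.TerminationGracePeriodSeconds != nil {
+		return *spec.TerminationGracePeriodSeconds
+	}
+	return defaultGracePeriodSeconds
+}
+
+func main() {
+	podLevel, probeLevel := int64(3600), int64(60)
+	spec := PodSpec{
+		TerminationGracePeriodSeconds: &podLevel,
+		LivenessProbe:                 &Probe{TerminationGracePeriodSeconds: &probeLevel},
+	}
+
+	fmt.Println(effectiveGracePeriod(spec, true))  // 60: override honoured
+	fmt.Println(effectiveGracePeriod(spec, false)) // 3600: current behaviour
+}
+```
+
+Keeping the pod-level value as the fallback is what makes the change
+backwards-compatible: clusters and workloads that never set the new field see
+no change in behaviour.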
+
+Because the field is not meaningful for readiness probes, we will add
+validation to ensure `readinessProbe.terminationGracePeriodSeconds` remains
+unset.
+
+We expect that the `livenessProbe.terminationGracePeriodSeconds` (or
+`startupProbe.*`) will not be greater than the pod-level
+`terminationGracePeriodSeconds`, but we will not explicitly validate this.
+
+`initialDelaySeconds` is not relevant to this value; it indicates how long we
+should wait before probing the container, but has no bearing on how long a
+probe should take or how long it will take to shut down the container upon a
+failed probe.
+
+`periodSeconds` and `timeoutSeconds` indicate how long we think a probe should
+take, but not how long it will take to shut down the container; hence, they
+are also not relevant to this value. It is suggested, but not required, that
+`livenessProbe.terminationGracePeriodSeconds` be less than these values.
+
+Finally, `failureThreshold` and `successThreshold` are not relevant to this
+value; they indicate how many consecutive probe results are required to mark a
+probe as failed or succeeded, but not how long to give a container to
+gracefully shut down upon failure.
+
+### Test Plan
+
+This change will be unit tested for the API changes and
+backwards-compatibility.
+
+If added to the Probe API, this will also need to be tested in the readiness
+and startup probe cases. Readiness probes will not support the field. Startup
+probes should be tested for correct behaviour.
+
+E2E integration tests will verify the correct behaviour for the currently
+broken use case: with the pod-level `terminationGracePeriodSeconds` set to a
+large value and `livenessProbe.terminationGracePeriodSeconds` set to a small
+one, the container terminates quickly when the probe fails.
+
+### Graduation Criteria
+
+#### Alpha
+
+- New probe field, `terminationGracePeriodSeconds`, is implemented and
+  available behind a feature flag.
+- Appropriate tests are written.
+
+_Below graduation criteria are tentative._
+
+#### Beta
+
+- Feature flag is defaulted on.
+
+#### Graduation
+
+- Feature flag is removed, feature is graduated.
+
+### Upgrade / Downgrade Strategy
+
+On upgrade: the feature flag and new field will become available for use.
+
+On downgrade: they will no longer be available. Any workloads with the Probe
+`terminationGracePeriodSeconds` set will need to unset it.
+
+### Version Skew Strategy
+
+An n-2 kubelet without this feature will default to the old behaviour, using
+the pod-level `terminationGracePeriodSeconds`.
+
+The new field will only be honoured once the feature gate is enabled on all
+relevant components.
+
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+_This section must be completed when targeting alpha to a release._
+
+* **How can this feature be enabled / disabled in a live cluster?**
+  - [X] Feature gate (also fill in values in `kep.yaml`)
+    - Feature gate name: `ProbeTerminationGracePeriod`
+    - Components depending on the feature gate: Kubelet, API Server
+  - [ ] Other
+    - Describe the mechanism:
+    - Will enabling / disabling the feature require downtime of the control
+      plane?
+    - Will enabling / disabling the feature require downtime or reprovisioning
+      of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
+
+* **Does enabling the feature change any default behavior?**
+  Any change of default behavior may be surprising to users or break existing
+  automations, so be extremely careful here.
+
+  No.
+
+* **Can the feature be disabled once it has been enabled (i.e. can we roll back
+  the enablement)?**
+  Also set `disable-supported` to `true` or `false` in `kep.yaml`.
+  Describe the consequences on existing workloads (e.g., if this is a runtime
+  feature, can it break the existing applications?).
+
+  Yes: disable the feature flag.
+
+  While the feature flag is enabled, the feature can also be disabled by
+  unsetting the field on the Probe specification, which will restore the
+  default behaviour.
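+
+  This KEP does not prescribe the exact disablement mechanism, but one common
+  pattern for alpha-gated fields, sketched below with purely illustrative
+  names, is for the API server to clear the field on incoming writes while the
+  gate is disabled, while preserving values that are already persisted so that
+  re-enabling the gate restores the configured behaviour (see the next
+  answer). At runtime, a component with the gate off would simply ignore any
+  value that is still present, as in the Design Details sketch above.
+
+  ```go
+  package main
+
+  import "fmt"
+
+  // Probe is a simplified stand-in for the core/v1 Probe type, carrying only
+  // the proposed field.
+  type Probe struct {
+  	TerminationGracePeriodSeconds *int64
+  }
+
+  // dropDisabledProbeFields clears the new field on incoming writes while the
+  // ProbeTerminationGracePeriod feature gate is disabled, unless the old
+  // object already had it set. The function name and signature are
+  // illustrative, loosely modelled on how alpha fields are commonly gated in
+  // the API server; they are not part of this KEP.
+  func dropDisabledProbeFields(newProbe, oldProbe *Probe, gateEnabled bool) {
+  	if gateEnabled || newProbe == nil {
+  		return
+  	}
+  	if oldProbe != nil && oldProbe.TerminationGracePeriodSeconds != nil {
+  		// The field is already in use; keep it so that re-enabling the gate
+  		// restores the configured behaviour.
+  		return
+  	}
+  	newProbe.TerminationGracePeriodSeconds = nil
+  }
+
+  func main() {
+  	grace := int64(60)
+  	created := &Probe{TerminationGracePeriodSeconds: &grace}
+
+  	// Gate disabled and no prior value: the field is cleared on create.
+  	dropDisabledProbeFields(created, nil, false)
+  	fmt.Println(created.TerminationGracePeriodSeconds == nil) // true
+  }
+  ```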
+
+* **What happens if we reenable the feature if it was previously rolled back?**
+
+  The field will be used instead of the pod-level
+  `terminationGracePeriodSeconds` for shutting down a container upon a failed
+  liveness probe.
+
+* **Are there any tests for feature enablement/disablement?**
+  The e2e framework does not currently support enabling or disabling feature
+  gates. However, unit tests in each component dealing with managing data, created
+  with and without the feature, are necessary. At the very least, think about
+  conversion tests if API types are being modified.
+
+  We will manually test that if a resource is created with the new field and
+  the feature flag is later disabled, the field will no longer be used even
+  though it is already filled in. We will then confirm that, when the flag is
+  reenabled, the new value is used as expected.
+
+### Rollout, Upgrade and Rollback Planning
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How can a rollout fail? Can it impact already running workloads?**
+  Try to be as paranoid as possible - e.g., what if some components will restart
+  mid-rollout?
+
+* **What specific metrics should inform a rollback?**
+
+* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
+  Describe manual testing that was done and the outcomes.
+  Longer term, we may want to require automated upgrade/rollback tests, but we
+  are missing a bunch of machinery and tooling and can't do that now.
+
+* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
+fields of API types, flags, etc.?**
+  Even if applying deprecation policies, they may still surprise some users.
+
+### Monitoring Requirements
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How can an operator determine if the feature is in use by workloads?**
+  Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
+  checking if there are objects with field X set) may be a last resort. Avoid
+  logs or events for this purpose.
+
+* **What are the SLIs (Service Level Indicators) an operator can use to determine
+the health of the service?**
+  - [ ] Metrics
+    - Metric name:
+    - [Optional] Aggregation method:
+    - Components exposing the metric:
+  - [ ] Other (treat as last resort)
+    - Details:
+
+* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+  At a high level, this usually will be in the form of "high percentile of SLI
+  per day <= X".
It's impossible to provide comprehensive guidance, but at the very + high level (needs more precise definitions) those may be things like: + - per-day percentage of API calls finishing with 5XX errors <= 1% + - 99% percentile over day of absolute value from (job creation time minus expected + job creation time) for cron job <= 10% + - 99,9% of /health requests per day finish with 200 code + +* **Are there any missing metrics that would be useful to have to improve observability +of this feature?** + Describe the metrics themselves and the reasons why they weren't added (e.g., cost, + implementation difficulties, etc.). + +### Dependencies + +_This section must be completed when targeting beta graduation to a release._ + +* **Does this feature depend on any specific services running in the cluster?** + Think about both cluster-level services (e.g. metrics-server) as well + as node-level agents (e.g. specific version of CRI). Focus on external or + optional services that are needed. For example, if this feature depends on + a cloud provider API, or upon an external software-defined storage or network + control plane. + + For each of these, fill in the following—thinking about running existing user workloads + and creating new ones, as well as about cluster-level services (e.g. DNS): + - [Dependency name] + - Usage description: + - Impact of its outage on the feature: + - Impact of its degraded performance or high-error rates on the feature: + + +### Scalability + +_For alpha, this section is encouraged: reviewers should consider these questions +and attempt to answer them._ + +_For beta, this section is required: reviewers must answer these questions._ + +_For GA, this section is required: approvers should be able to confirm the +previous answers based on experience in the field._ + +* **Will enabling / using this feature result in any new API calls?** + Describe them, providing: + - API call type (e.g. PATCH pods) + - estimated throughput + - originating component(s) (e.g. Kubelet, Feature-X-controller) + focusing mostly on: + - components listing and/or watching resources they didn't before + - API calls that may be triggered by changes of some Kubernetes resources + (e.g. update of object X triggers new updates of object Y) + - periodic API calls to reconcile state (e.g. periodic fetching state, + heartbeats, leader election, etc.) + + No. + +* **Will enabling / using this feature result in introducing new API types?** + Describe them, providing: + - API type + - Supported number of objects per cluster + - Supported number of objects per namespace (for namespace-scoped objects) + + No. + +* **Will enabling / using this feature result in any new calls to the cloud +provider?** + + No. + +* **Will enabling / using this feature result in increasing size or count of +the existing API objects?** + Describe them, providing: + - API type(s): + - Estimated increase in size: (e.g., new annotation of size 32B) + - Estimated amount of new objects: (e.g., new Object X for every existing Pod) + + One new integer field on the Probe type. + +* **Will enabling / using this feature result in increasing time taken by any +operations covered by [existing SLIs/SLOs]?** + Think about adding additional work or introducing new steps in between + (e.g. need to do X to start a container), etc. Please describe the details. + + No, this should not affect that. + +* **Will enabling / using this feature result in non-negligible increase of +resource usage (CPU, RAM, disk, IO, ...) 
in any components?**
+  Things to keep in mind include: additional in-memory state, additional
+  non-trivial computations, excessive access to disks (including increased log
+  volume), significant amount of data sent and/or received over network, etc.
+  Think through this both in small and large cases, again with respect to the
+  [supported limits].
+
+  No.
+
+### Troubleshooting
+
+The Troubleshooting section currently serves the `Playbook` role. We may consider
+splitting it into a dedicated `Playbook` document (potentially with some monitoring
+details). For now, we leave it here.
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How does this feature react if the API server and/or etcd is unavailable?**
+
+* **What are other known failure modes?**
+  For each of them, fill in the following information by copying the below template:
+  - [Failure mode brief description]
+    - Detection: How can it be detected via metrics? Stated another way:
+      how can an operator troubleshoot without logging into a master or worker node?
+    - Mitigations: What can be done to stop the bleeding, especially for already
+      running user workloads?
+    - Diagnostics: What are the useful log messages and their required logging
+      levels that could help debug the issue?
+      Not required until feature graduated to beta.
+    - Testing: Are there any tests for failure mode? If not, describe why.
+
+* **What steps should be taken if SLOs are not being met to determine the problem?**
+
+[supported limits]: https://git.k8s.io/community/sig-scalability/configs-and-limits/thresholds.md
+[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
+
+## Implementation History
+
+- 2021-01-07: Initial draft KEP
+
+## Drawbacks
+
+The main drawback of this proposal is that it requires a (compatible) API
+change.
+
+## Alternatives
+
+From [#64715](https://github.com/kubernetes/kubernetes/issues/64715#issuecomment-756173979):
+
+Infer the maximum grace period for liveness from `num failures * probe period`
+(e.g. `failureThreshold: 3` with `periodSeconds: 10` would imply a 30s cap),
+on the assumption that a user who configures liveness to react quickly also
+wants their pod to be killed quickly (@smarterclayton)
+
+- PRO: Any user unaware of the coupling reacts to liveness failures faster,
+  normal pods (including reason 2 above) are less likely to be impacted
+- CON: If we later want to change the API to indicate this field, the default
+  behavior makes less sense (i.e.
complicates choosing the option above) +- CON: A user with a long grace period who wants liveness to take more than + tens of seconds (not sure why they would) suddenly sees that behavior + impacted +- CON: This sort of implicit calculation feels pretty arbitrary (we generally + make defaults explicit vs calculated) + +Introduce a new toleration, similar to execute toleration for node not ready +(how long the pod should be left on the node while unready before being +evicted), that controls how long liveness tolerates (@smarterclayton) + +- PRO: avoids API change +- CON: basically a hack to avoid an API change, tolerations for execution don't + really make sense for conditional stuff + +Since liveness probe failures indicate that a container is unhealthy and should +be terminated, update the behaviour of liveness probes to terminate the pod +immediately, or with some default grace period (@sjenning) + +- PRO: avoids API change +- CON: may break existing applications that rely on the grace period for crash + safety +- CON: thresholds not user-configurable +- CON: even if we use a feature flag for this, it will eventually become the + required behaviour + +Introduce a similar field, but at the pod-level, next to the existing field. + +- PRO: similar values are grouped together +- CON: logically, liveness probes operate on a per-container basis, and thus we + may need different thresholds for different containers + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-node/2238-liveness-probe-grace-period/kep.yaml b/keps/sig-node/2238-liveness-probe-grace-period/kep.yaml new file mode 100644 index 00000000000..1d630ea627c --- /dev/null +++ b/keps/sig-node/2238-liveness-probe-grace-period/kep.yaml @@ -0,0 +1,41 @@ +title: Liveness Probe Grace Period +kep-number: 2238 +authors: + - "@ehashman" +owning-sig: sig-node +participating-sigs: + - sig-node +status: implementable +creation-date: 2021-01-07 +reviewers: + - "@matthyx" + - "@dims" + - "@derekwaynecarr" + - "@smarterclayton" +approvers: + - "@derekwaynecarr" +prr-approvers: + - "@johnbelamaric" + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.21" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.21" + beta: "v1.22" + stable: "v1.23" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: ProbeTerminationGracePeriod + components: + - kube-apiserver + - kubelet +disable-supported: true