- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
The current downscaling algorithm for ReplicaSets prefers to delete the Pods that have been running for the shortest time. This heuristic attempts to minimize disruption, on the premise that newer Pods are likely to be serving fewer clients. However, the heuristic can be detrimental to high-availability requirements: once a ReplicaSet lands in an imbalanced state across failure domains, the heuristic tends to preserve that imbalance across repeated upscales and downscales.
We propose a randomized approach to the downscale Pod victim selection algorithm of the ReplicaSet controller to mitigate ReplicaSet imbalance across failure domains.
There are scenarios where a ReplicaSet can reach an imbalanced state across failure domains. See the user stories for one such scenario and how a randomized approach solves it.
- A randomized algorithm for Pod selection when downscaling ReplicaSets.
- Softly honor the heuristic that prefers to downscale newer Pods first.
- Validate that the approach is able to get a ReplicaSet out of an imbalanced state.
- Provide guarantees of preserving balance during scale down. Doing so would violate the separation of concerns between the ReplicaSet controller and kube-scheduler.
- Preserve the existing behavior of always downscaling the newest Pods first. This order was never guaranteed in the API or user documentation.
This story shows an imbalance cycle after a failure domain fails or gets upgraded.
- Assume a ReplicaSet has 2N Pods evenly distributed across 2 failure domains, so each domain holds N Pods.
- An upgrade adds a new availability domain, and the ReplicaSet is upscaled to 3N. The new domain now holds all the youngest Pods due to scheduler spreading.
- The ReplicaSet is downscaled to 2N again. Due to the downscaling preference, all the Pods from one domain are removed, leading to imbalance. The situation doesn't improve with repeated upscale and downscale steps. A randomized approach instead leaves about 2/3*N Pods in each failure domain, because the N removed Pods are drawn roughly uniformly from all 3N; the sketch below illustrates this.
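The arithmetic in the last step can be checked with a quick Monte Carlo sketch. This is illustrative only and not part of the proposal: with youngest-first deletion the new domain is emptied entirely, while uniformly random deletion leaves roughly 2/3*N Pods per domain.

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const n = 100 // N = 100 Pods per domain
	// After the upscale there are 3N Pods; each entry records its domain,
	// and domain 2 (the new one) holds all the youngest Pods.
	var pods []int
	for domain := 0; domain < 3; domain++ {
		for i := 0; i < n; i++ {
			pods = append(pods, domain)
		}
	}

	// Youngest-first downscale to 2N: every Pod in domain 2 is removed.
	fmt.Println("youngest-first survivors per domain:", [3]int{n, n, 0})

	// Random downscale to 2N: remove N Pods uniformly at random.
	rand.Shuffle(len(pods), func(i, j int) { pods[i], pods[j] = pods[j], pods[i] })
	var survivors [3]int
	for _, domain := range pods[:2*n] {
		survivors[domain]++
	}
	fmt.Println("random survivors per domain:        ", survivors) // ~2N/3 each
}
```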
The original heuristic, downscaling the youngest Pods first, has its benefits. Newer Pods might not have finished starting up (or warming up) and are likely to have fewer active connections than older Pods. However, this distinction generally stops applying once Pods have been running steadily for some time.

A purely randomized approach would break those assumptions, potentially leading to service disruption. Choosing a heuristic could be left to the user. On the other hand, certain workloads take a long time to warm up and, at the same time, require high availability.
Certain users might be relying on the existing downscaling heuristic. However, there are a number of reasons why we don't need to preserve this behavior as is:
- The behavior is not documented in the API or user docs, only in code.
- The heuristic is applied last, after other criteria. In particular, it doesn't apply when multiple Pods run on the same Node or when a particular Pod has had several container restarts. While users can enforce one Pod per Node with certain scheduling features, there is no workaround for the other criteria.
- We are not proposing to remove the heuristic entirely, just to make it more lax.
- We are introducing a related feature that provides stronger guarantees on downscaling order, which users can migrate to.
We propose a randomized approach to the algorithm for Pod victim selection during ReplicaSet downscale:
- Sort ReplicaSet Pods by Pod UUID. The purpose of this is to obtain a pseudo-random shuffle of the Pods (this does not necessarily have to be the first step; it is just another comparison criterion).
- Obtain the wall time, and add it to `ActivePodsWithRanks`.
- Call the sorting algorithm with a modified time comparison for the start and creation timestamps.
Instead of directly comparing timestamps, the algorithm compares the elapsed time from each Pod's creation and ready timestamps until the current time, on a logarithmic scale with floor rounding; these bucketed values serve as the sorting criteria. The effect is that elapsed times of the same order of magnitude are treated as equal: Pods that have been running for a few nanoseconds all compare equal to each other, but differ from Pods that have been running for a few seconds or a few days. A code sketch of this comparison follows the table below.
For example, assuming base 10 is used, different durations map to scales as follows:

| Duration | Scale |
|---|---|
| 5ns | 0 |
| 23ns | 1 |
| 71ns | 1 |
| 1ms | 6 |
| 8ms | 6 |
| 50ms | 7 |
| 2m | 11 |
| 11m | 11 |
An alternative interpretation for base 10 is that, if one Pod has been running for more than 10 times as long as another Pod, the younger Pod is always deleted first.
While base 10 is quite intuitive, it might bucket timestamps too aggressively. A base of 2 could be similarly intuitive while providing finer-grained buckets. And if documentation is not a concern, the natural base is a good choice as well.
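To make the comparison concrete, here is a minimal Go sketch of the bucketing technique described above. The names (`pod`, `ageBucket`, `byDownscalePriority`) are hypothetical; this is not the actual kube-controller-manager code, only an illustration of floor-rounded logarithmic buckets with a UID tie-break.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// pod is a toy stand-in for the controller's view of a Pod.
type pod struct {
	uid     string
	created time.Time
}

// ageBucket maps an elapsed duration to floor(log_base(elapsed in ns)), so
// durations of the same order of magnitude land in the same bucket.
func ageBucket(elapsed time.Duration, base float64) int {
	ns := float64(elapsed.Nanoseconds())
	if ns < 1 {
		return 0
	}
	return int(math.Floor(math.Log(ns) / math.Log(base)))
}

// byDownscalePriority orders Pods so the preferred victim comes first: Pods in
// a younger age bucket sort earlier, and within a bucket the UID order acts as
// the pseudo-random tie-break described in the proposal.
func byDownscalePriority(pods []pod, now time.Time, base float64) {
	sort.Slice(pods, func(i, j int) bool {
		bi := ageBucket(now.Sub(pods[i].created), base)
		bj := ageBucket(now.Sub(pods[j].created), base)
		if bi != bj {
			return bi < bj // younger bucket first
		}
		return pods[i].uid < pods[j].uid // pseudo-random within a bucket
	})
}

func main() {
	now := time.Now()
	pods := []pod{
		{"f4e2", now.Add(-3 * time.Second)}, // same base-10 bucket as the next Pod
		{"09ab", now.Add(-5 * time.Second)},
		{"7c31", now.Add(-2 * time.Hour)}, // much older: deleted last
	}
	byDownscalePriority(pods, now, 10)
	for _, p := range pods {
		fmt.Println(p.uid, now.Sub(p.created).Round(time.Second))
	}
	// Output order: 09ab, f4e2 (tie broken by UID), then 7c31.
}
```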
Unit and e2e tests will be helpful to ensure the intended behavior keeps working. However, because downscale selection is random within each rounded bucket, tests must not expect a specific Pod to be deleted when multiple Pods are still valid candidates. Keeping this in mind while writing tests will significantly reduce flakes.
Specific test cases could include something similar to what is described in the user stories above: create a balanced cluster state, downscale, upscale (to rebalance), and downscale again. The expectation is that after the final downscale the nodes should still be relatively balanced. A flake-resistant unit-test sketch follows below.
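Following that guidance, a unit test could assert membership in the set of valid candidates instead of pinning a specific victim. This is a hedged sketch reusing the hypothetical `pod`, `ageBucket`, and `byDownscalePriority` helpers from the earlier sketch:

```go
package main

import (
	"testing"
	"time"
)

// TestDownscaleVictimInYoungestBucket avoids flakes by not pinning a specific
// victim: any Pod whose age falls in the youngest bucket is acceptable.
func TestDownscaleVictimInYoungestBucket(t *testing.T) {
	now := time.Now()
	pods := []pod{
		{"a", now.Add(-2 * time.Second)}, // same bucket as "b": both valid victims
		{"b", now.Add(-4 * time.Second)},
		{"c", now.Add(-3 * time.Hour)}, // much older: must survive
	}

	// Compute the youngest (smallest) age bucket among all candidates.
	minBucket := ageBucket(now.Sub(pods[0].created), 10)
	for _, p := range pods[1:] {
		if b := ageBucket(now.Sub(p.created), 10); b < minBucket {
			minBucket = b
		}
	}

	byDownscalePriority(pods, now, 10)
	victim := pods[0] // the sort places the preferred victim first

	if got := ageBucket(now.Sub(victim.created), 10); got != minBucket {
		t.Errorf("victim bucket = %d, want youngest bucket %d", got, minBucket)
	}
}
```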
Alpha (v1.21):
- Add LogarithmicScaleDown feature gate to kube-controller-manager (disabled by default).
- Unit and e2e tests
Beta (v1.22):
- Enable LogarithmicScaleDown feature gate by default
- Enable `sorting_deletion_age_ratio` metric
Stable (v1.31):
- Lock LogarithmicScaleDown feature gate to true
- Make this behavior standard
There should be no issues during upgrades and downgrades, since this change does not affect any APIs or user-exposed behavior. If any cluster components currently assume or depend on the existing behavior, this change should be clearly communicated so that an acceptable solution can be found during development.
Version skew should have minimal effect with this feature for similar reasons to the upgrade/downgrade strategy. The lack of exposure or documentation around the current behavior reduces the risk that it is an expectation from other components.
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: LogarithmicScaleDown
  - Components depending on the feature gate: kube-controller-manager
Does enabling the feature change any default behavior?
Yes: this changes the default assumption that the youngest Pod in a ReplicaSet will always be the one evicted. However, it still groups Pods by their age and picks from the youngest group.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes. Existing workloads should see no change when disabling this feature.
What happens if we reenable the feature if it was previously rolled back?
Assumptions that the newest Pod will be deleted first may break.
Are there any tests for feature enablement/disablement?
Tests for feature disablement shouldn't be necessary, as the disabled path is the already assumed (but not documented) controller behavior.
How can a rollout or rollback fail? Can it impact already running workloads?
This should not affect running workloads, though there is the possibility that the logic panics, which would crash kube-controller-manager.
What specific metrics should inform a rollback?
Increased Pod deletions could indicate runaway or hot-loop failures in the scaledown logic. Availability of applications may also be affected: although the intent of this feature is to provide better availability through more distributed victim selection, in cases where binpacking is desired, Pods may remain running on undesired Nodes.
Were upgrade and downgrade tested?
This is a purely in-memory change for the controller, so upgrade/downgrade doesn't really change anything.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No
How can an operator determine if the feature is in use by workloads?
The feature is global, so it is always used on any downscale.
How can someone using this feature know that it is working for their instance?
- Other (treat as last resort)
  - Details: for a ReplicaSet with two ready Pods whose Pod Cost annotation is not set, if the logarithmic values of the Pods' ready times are identical, the Pod with the smaller UID will be downscaled first, rather than the most recently ready one.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: sorting_deletion_age_ratio
  - Components exposing the metric: kube-controller-manager
The metric `sorting_deletion_age_ratio` provides a histogram of the ratio between the chosen (deleted) Pod's age and the current youngest Pod's age, for deletions where the sorting algorithm falls back to age. (Pod age is the final criterion in the sorting algorithm, so we don't want to measure this ratio for deletions that don't use this feature, as those may validly fall outside the desired range.)
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
There should be no values >2 in the above metric when the Pod Cost annotation is unset (see https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2255-pod-cost) and the Pod's deletion was based on a timestamp comparison (rather than, for example, Pod state).
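As an illustration of this SLO, here is a hypothetical sketch of how such a histogram observation could be recorded with the Prometheus Go client. The helper name and wiring are assumptions, not the actual kube-controller-manager code:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var sortingDeletionAgeRatio = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "sorting_deletion_age_ratio",
	Help: "Ratio of the chosen victim's age to the youngest pod's age when the sort fell back to age.",
})

// observeDeletionAgeRatio is a hypothetical helper. It records the ratio only
// when the victim was chosen by the timestamp comparison (the final criterion),
// since deletions decided by earlier criteria may validly fall outside the range.
func observeDeletionAgeRatio(victimAge, youngestAge time.Duration, fellBackToAge bool) {
	if !fellBackToAge || youngestAge <= 0 {
		return
	}
	// Two Pods in the same base-2 bucket differ by less than 2x in age,
	// which is consistent with the "no values >2" expectation above.
	sortingDeletionAgeRatio.Observe(victimAge.Seconds() / youngestAge.Seconds())
}

func main() {
	prometheus.MustRegister(sortingDeletionAgeRatio)
	observeDeletionAgeRatio(15*time.Second, 10*time.Second, true) // observes 1.5
}
```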
Are there any missing metrics that would be useful to have to improve observability of this feature?
No, we didn't find any other gaps that could be covered by metrics.
Does this feature depend on any specific services running in the cluster?
No, it is part of the kube-controller-manager.
Will enabling / using this feature result in any new API calls?
No
Will enabling / using this feature result in introducing new API types?
No
Will enabling / using this feature result in any new calls to the cloud provider?
No
Will enabling / using this feature result in increasing size or count of the existing API objects?
No
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No; perhaps a minimal increase from calculating the buckets for Pod age.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No
How does this feature react if the API server and/or etcd is unavailable?
N/A; this is not a feature of running workloads. The controller will not work, and will be unable to scale up or down, if the API server or etcd is unavailable.
What are other known failure modes?
N/A
What steps should be taken if SLOs are not being met to determine the problem?
N/A
- 2021-01-06: Initial KEP submitted
- 2021-05-07: Updated KEP for graduation to beta
- 2024-05-21: Updated KEP for graduation to GA
The first drawback is that assumptions that the newest Pod will always be the first deleted may break. However, the number of users affected should be small, and this is acceptable because the current behavior is undocumented.

This may also introduce slightly more work for the controller manager, as it requires additional calculations before selecting which Pod to downscale.
Choosing between a random or newest-first downscale heuristic could be left to the user, but this has two problems:
- The heuristics optimize for different things, and they might be useful together.
- Leaving the decision to the user hurts usability. Given the various comparison criteria in the downscale algorithm, it might be hard to describe the heuristic in a way that lets users make an informed decision.
Pods can express spreading constraints at scheduling time via `.spec.topologySpreadConstraints`. The constraints include the failure domains to be used and the tolerated skew.
This API could be used to calculate the spreading skew and inform the downscaling algorithm to preserve a minimum skew. However, this has two problems:
- Calculating the skew might be expensive: the controller needs to track Nodes to obtain their topology information.
- It violates separation of concerns: the replication controller would need to implement or reuse scheduling algorithms, and it opens the question of whether other scheduling features should also be respected during downscale.