KEP-2400: Update swap KEP for 1.23 beta #2858
Conversation
(force-pushed 2553298 to 3de9495)
/assign @deads2k
/assign @derekwaynecarr @mrunalp
Because graduation to beta changes the implied level of support for running with swap, I recommend that we carefully review Kubernetes documentation as part of this graduation work and look for areas that might need fixing.
One likely area is the reliance on tmpfs for information at rest. When swap was unsupported / alpha, cluster operators and app developers could assume that memory-backed `emptyDir` volumes were never written to durable storage, and that the same would be true for Secret data used in connection with ServiceAccounts etc.

If tmpfs might be written to swap (which Linux readily does), then cluster operators may need to be warned not to enable swap. We didn't need (as much) to warn them when swap was unsupported, because it was documented at a sort of “enable this and expect undefined behavior” level.
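For context, a memory-backed `emptyDir` is exactly the tmpfs case in question; a Pod declares one like this (illustrative manifest, pod and volume names are arbitrary):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir:
      medium: Memory   # mounted as tmpfs; with swap enabled, these pages can reach disk
```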
Another similar concern comes with the use of `mlock()` vs. not bothering. Apps that haven't worried about calling `mlock()` might need to think again.
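To illustrate, here is a minimal sketch of pinning a sensitive buffer in RAM via `golang.org/x/sys/unix` (note that `mlock(2)` can fail without `CAP_IPC_LOCK` or a sufficient `RLIMIT_MEMLOCK`; error handling is abbreviated):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// A buffer that will hold sensitive material (e.g. a private key).
	secret := make([]byte, 4096)

	// Pin the backing pages so the kernel will not write them to swap.
	// This can fail with ENOMEM/EPERM if RLIMIT_MEMLOCK is too low and
	// the process lacks CAP_IPC_LOCK.
	if err := unix.Mlock(secret); err != nil {
		panic(fmt.Errorf("mlock failed: %w", err))
	}
	defer func() {
		// Zero the buffer before releasing the pin.
		for i := range secret {
			secret[i] = 0
		}
		_ = unix.Munlock(secret)
	}()

	fmt.Println("secret buffer pinned in RAM; safe from swap")
}
```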
> If a new node with swap memory fails to come online, it will not impact any
> running components.
If someone does an in-place upgrade on a node (stopping kubelet, starting a new kubelet on the same server), can that fail? How?
If it could fail then an upgrade might for example take out the nodes where the control plane ought to be running as static pods.
The in-place upgrade would not fail unless swap access was added while the node was still online. Turning swap on and off at runtime isn't normally considered best practice for a production environment; I'd expect a node to be reimaged and rebooted. But I can mention it.
> @@ -622,12 +637,21 @@ checking if there are objects with field X set) may be a last resort. Avoid
> logs or events for this purpose.
> -->
>
> KubeletConfiguration has set `failSwapOn: false`.
Can I tell two nodes apart via the Kubernetes API:

- one has `failSwapOn: false` and `memorySwap` set to `swapBehavior: LimitedSwap`, with the `NodeSwap` feature gate enabled
- another has `failSwapOn: false` and the `NodeSwap` feature gate disabled

If so, I'd mention how to distinguish them.
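For reference, the first node's configuration would look roughly like this (a sketch of a v1beta1 KubeletConfiguration; the second node would omit `memorySwap` and set `NodeSwap: false`):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false
featureGates:
  NodeSwap: true
memorySwap:
  swapBehavior: LimitedSwap
```

Since this is kubelet configuration rather than API-server state, one place it does surface is the kubelet's configz endpoint (`/api/v1/nodes/<node>/proxy/configz`), which could be used to tell the two apart.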
I'm not sure if that is bubbled up to the API server. The purpose of this question is for beta, when the feature gate is defaulted on, so you can't rely on it being turned on as a sign that the feature is in use. We might be able to check if `swapBehavior` is explicitly set, but the empty string is equivalent to `LimitedSwap`.

Realistically, this KEP iterates on the existing unsupported configuration with `failSwapOn: false`. Because it was previously unsupported, I am assuming here that a production environment would not have it set if it were not using this feature.
Once this is beta we should assume that people have this feature gate set to a value of their choice. For alpha it was different: you needed to be a little braver to try it, and most clusters ran with the default, i.e. feature disabled.

The switch from unsupported to “mostly supported, but it's still beta” is why I'm asking about observability.
Fill out remaining beta PRR questions, add test plans (force-pushed 3de9495 to 20a8885)
I think this is fair feedback. Note that (at least for CRI-O, I think) the CRI changes from 1.22 still haven't landed, so we can't test alpha swap support in a meaningful way in CI until that's available. This is an ambitious beta, so it might slip to 1.24, but I want to ensure we're at least making forward progress this release, e.g. with adding pod-level cgroup limits and gathering data from real-world testing.
Once testing with an updated container runtime is possible, we could graduate it to beta-but-disabled-by-default. Even that needs the docs review, though. Improving the feature but leaving it alpha is fine of course and doesn't need the docs review I mentioned.
> ###### What specific metrics should inform a rollback?
>
> <!--
> What signals should users be paying attention to when the feature is young
> that might indicate a serious problem?
> -->
>
> Workload churn or performance degradations on nodes. The metrics will be
> application/use-case specific, but we can provide some suggestions.
This seems like the spot to provide those suggestions.
Yes, but we don't have them yet (comment below).
> ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
>
> <!--
> Pick one more of these and delete the rest.
> -->
>
> TBD. We will determine a set of metrics as part of beta graduation. We will
The beta criteria (including this metric) are listed as tentative. Can you commit to the beta criteria about metrics?
Figuring out what these metrics are is blocking for beta graduation, but we need end-user data/a release cycle of testing to determine them, which will happen during the dev cycle. The graduation of swap is a little different from a regular Kubernetes feature in that we need to be able to look at metrics node-wide.
This is why it might be a stretch to graduate this for beta this release, but I don't want to block the other required dev work for beta just because we don't have this list yet.
The PRR updates look good.

/approve

Remember to follow up with sig-node to approve the graduation to beta as a whole.
/label tide/merge-method-squash
One minor request: add `Swap` to the `MemoryStats` API to track usage.
> @@ -426,6 +430,7 @@ _(Tentative.)_
>   detects on the host.
> - Consider introducing new configuration modes for swap, such as a node-wide
>   swap limit for workloads.
> - Add swap memory to the Kubelet stats api.
That makes sense to me. I don't remember what swap metrics are available for cgroups v1 vs v2, but I'd want to at least be able to tell if swap is causing problems at the node level. It may also be nice to have swap metrics at the pod/container level if the available metrics can tell me if swapping is hurting my application's performance in some way.
@dashpole cgroup v2 has `memory.swap.current`, which exists on non-root cgroups and reports the total amount of swap currently in use by the cgroup and its descendants.

It appears to already be supported in cAdvisor; we just did not include it in the k8s API for `MemoryStats` because running with swap on was not yet supported.
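To make the request concrete, the summary-API change might look roughly like the following sketch (the field name `SwapUsageBytes` and its placement are assumptions, not the merged design; existing fields are abbreviated):

```go
// Illustrative sketch of extending the kubelet summary API's MemoryStats
// (k8s.io/kubelet/pkg/apis/stats/v1alpha1) with swap accounting.
package stats

// MemoryStats contains memory usage data for a node, pod, or container.
type MemoryStats struct {
	// UsageBytes is total memory in use, including all memory regardless
	// of when it was accessed.
	UsageBytes *uint64 `json:"usageBytes,omitempty"`
	// WorkingSetBytes is memory in use minus inactive file-backed memory.
	WorkingSetBytes *uint64 `json:"workingSetBytes,omitempty"`
	// SwapUsageBytes (hypothetical) would report swap in use by the cgroup
	// and its descendants, sourced from memory.swap.current on cgroup v2,
	// which cAdvisor already collects.
	SwapUsageBytes *uint64 `json:"swapUsageBytes,omitempty"`
}
```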
Looks like cgroups v1 has an equivalent, so we should be all set in that regard
/approve
/lgtm
Thanks for making the updates, and I am happy with this plan. I agree with @mrunalp that we will learn a lot in real world workload testing, and so this should be fun!
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: deads2k, derekwaynecarr, ehashman.
* Update swap KEP for 1.23 beta: fill out remaining beta PRR questions, add test plans
* Address PRR feedback
* Add test plan note for eviction manager/MemoryPressure
* Add swap memory to Kubelet stats API
/sig node