KEP-5507: Container-level OOM kill mode configuration #5496
Conversation
Signed-off-by: utam0k <k0ma@utam0k.jp>
### Goals

- Add a per-container `oomKillMode` field to allow container-level OOM behavior configuration
Currently, we have four types of containers: regular, init, sidecar, and ephemeral. Will the `oomKillMode` field be supported for all of them? If so, it might be better to state this explicitly. Also, since ephemeral containers use a separate struct, the API change would likely need to be applied there as well.
https://github.com/kubernetes/kubernetes/blob/v1.33.4/pkg/apis/core/types.go#L4240
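For illustration, the kind of change being discussed might look roughly like the sketch below; the type name, constants, and field placement are assumptions based on the KEP text rather than the final API, and the point is only that `EphemeralContainerCommon` would need the field duplicated since it does not embed `Container`.

```go
// Rough sketch only: assumed names, mirroring the field into both structs.
package core

// OOMKillMode describes how the OOM killer treats a container's processes.
type OOMKillMode string

const (
	OOMKillModeSingle OOMKillMode = "Single" // kill only the offending process
	OOMKillModeGroup  OOMKillMode = "Group"  // kill every process in the container's cgroup
)

type Container struct {
	// ... existing fields ...

	// OOMKillMode controls OOM kill behavior for this container.
	// +optional
	OOMKillMode *OOMKillMode
}

// EphemeralContainerCommon does not embed Container, so the field would
// have to be added here as well for ephemeral containers to support it.
type EphemeralContainerCommon struct {
	// ... existing fields ...

	// +optional
	OOMKillMode *OOMKillMode
}
```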
These reports led to [PR #122813](https://github.com/kubernetes/kubernetes/pull/122813), which attempted to add a kubelet flag but was closed after community discussion concluded that container-level configuration was the proper solution[^3]. The consensus was that node-level configuration cannot adequately address the needs of heterogeneous workloads.

This KEP also deprecates the `singleProcessOOMKill` kubelet flag for removal in v1.38 (GA). Container-level configuration provides better granularity and eliminates the complexity of maintaining both node-level and container-level settings.
If you’re only running multi-process applications (e.g., using a managed database service in the cloud), it seems more convenient from a UX perspective to configure this once at the kubelet level. If configuration is only allowed at the container level, you’d likely need to rely on a Mutating Admission Policy instead.
Since the logic that allows the kubelet to determine a default value for `memory.oom.group` based on the OS and cgroup settings remains when the `oomKillMode` field is unset, I don’t think keeping the kubelet flag to override that default will add much complexity to the implementation. I think it will be more difficult for users who currently rely on the `singleProcessOOMKill` kubelet flag to migrate away from it within just three releases.
So personally, I don’t think the kubelet flag needs to be removed. What do you think?
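For what it's worth, keeping the flag as a node-level default underneath the new field could stay quite small. A minimal sketch of the resolution order, assuming the proposed `v1.OOMKillMode` type and the existing `SingleProcessOOMKill` kubelet configuration field (the helper name is made up):

```go
// Sketch of a possible precedence: container field > kubelet flag > platform default.
// resolveGroupOOMKill and its plumbing are illustrative, not the KEP's implementation.
func resolveGroupOOMKill(mode *v1.OOMKillMode, singleProcessOOMKill *bool, cgroupV2 bool) bool {
	if !cgroupV2 {
		// memory.oom.group only exists on cgroup v2.
		return false
	}
	if mode != nil {
		// An explicit container-level setting wins.
		return *mode == v1.OOMKillModeGroup
	}
	if singleProcessOOMKill != nil {
		// Otherwise fall back to the node-level kubelet configuration.
		return !*singleProcessOOMKill
	}
	// Current default on cgroup v2: group kill.
	return true
}
```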
### Validation Rules

API validation will be added in `pkg/apis/core/validation/validation.go`:
If the Pod spec's `OS` field is set to `windows`, it might be better to prevent users from specifying `OOMKillMode`.
https://github.com/kubernetes/kubernetes/blob/v1.33.4/pkg/apis/core/types.go#L3680
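Something along these lines might work for the validation; this is a sketch only, assuming a `validateOOMKillMode` helper and the usual `field.ErrorList` plumbing (names and error wording are placeholders, not the KEP's code):

```go
// Illustrative validation sketch, including the suggested Windows restriction.
var supportedOOMKillModes = sets.New(string(core.OOMKillModeSingle), string(core.OOMKillModeGroup))

func validateOOMKillMode(mode *core.OOMKillMode, podOS *core.PodOS, fldPath *field.Path) field.ErrorList {
	allErrs := field.ErrorList{}
	if mode == nil {
		return allErrs
	}
	if !supportedOOMKillModes.Has(string(*mode)) {
		allErrs = append(allErrs, field.NotSupported(fldPath, *mode, sets.List(supportedOOMKillModes)))
	}
	// Reject the field outright for Windows pods, where it has no effect.
	if podOS != nil && podOS.Name == core.Windows {
		allErrs = append(allErrs, field.Forbidden(fldPath, "may not be set when spec.os.name is windows"))
	}
	return allErrs
}
```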
- cgroups are managed at the container level

3. **Consistency with Existing APIs**: Follows the pattern of other container-level resource configurations:
   - Resources (limits/requests) are container-level
nit: suggest changing "Resources (limits/requests) are container-level" to "Resources (limits/requests) can be set at both pod and container level".
		useGroupKill = false
	case v1.OOMKillModeGroup:
		if !isCgroup2UnifiedMode() {
			klog.Warningf("Container %s/%s requests Group mode but system uses cgroup v1, falling back to single process",
End users probably don’t look at kubelet logs much, so it would be nice to notify them via Kubernetes Events. However, with the current kubelet implementation, that might be tricky.
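If it does turn out to be feasible, the kubelet's existing event recorder could surface the fallback on the pod, roughly like this (the reason string is made up for illustration):

```go
// Sketch: emit a Warning event instead of (or in addition to) the log line.
recorder.Eventf(pod, v1.EventTypeWarning, "OOMKillModeFallback",
	"container %s requested oomKillMode=Group but the node uses cgroup v1; falling back to single-process OOM kill",
	container.Name)
```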
// OOMKillMode specifies how the OOM killer behaves for this container.
// - "Single": only the process that triggered OOM is killed
// - "Group": all processes in the container are killed (cgroup v2 default)
// If not specified, the behavior is determined by the kubelet configuration.
If `OOMKillMode` is not specified, the kubelet will determine the default value. It might be helpful to write the determined value into the `ContainerStatus` so that users can check it.
https://github.com/kubernetes/kubernetes/blob/v1.33.4/staging/src/k8s.io/api/core/v1/types.go#L3152
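If that suggestion is adopted, the status change might look something like the following; this is purely hypothetical and not part of the current proposal:

```go
// Hypothetical: surface the effective, post-defaulting mode in ContainerStatus.
type ContainerStatus struct {
	// ... existing fields ...

	// OOMKillMode reports the OOM kill behavior in effect for this container
	// ("Single" or "Group") after the kubelet applies its defaulting.
	// +optional
	OOMKillMode *OOMKillMode `json:"oomKillMode,omitempty"`
}
```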
   - SecurityContext can be set at both pod and container level
   - The existing `singleProcessOOMKill` implementation operates per container

### API Design
This API is heavily influenced by the Linux configuration that controls this behavior. Some things we may consider here by expanding the scope of the feature and integrating it into a wider scope:
- Will we ever want the same or a similar setting for eviction logic? Will the kubelet ever be extended to understand that a single container, or even a single process in a container, can be "evicted"? Especially with PSI metrics support.
- Do we need a "WholePod" option? Especially in conjunction with the restart policies (Container restart rules to customize the pod restart policy #5307), mostly for use cases where an init container will need to be re-run if an OOM kill happened.
There may be more things this could align with, but those two are the first that came to mind.
I agree we should consider how the OOM kill could work at the container level as well as the pod level. In general, I feel this should be handled by the kubelet as an action after the container has died for whatever reason (e.g. a KillPod action on container exit). Container OOM would be one of those cases, so defining WholePod here is probably not the cleanest solution.
On the other hand, is this `OOMKillMode` something that's better specified at the container level or the pod level?
- Platform-specific limitations (cgroup v1, Windows)
- Another field for users to understand and configure

## Alternatives
Some alternatives include implementing this as an NRI plugin or as a privileged init container that configures this for a cgroup. You may want to list these alternatives with the rejection reasons.
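For context on the privileged init container variant: it essentially boils down to writing `memory.oom.group` in the target cgroup directly, as in the sketch below. Cgroup path discovery is hand-waved; this assumes cgroup v2, a writable cgroup filesystem, and the environment variable is made up for illustration.

```go
// Minimal sketch of the "privileged helper flips memory.oom.group" approach.
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	// Assumption: the pod/container cgroup v2 directory is known or mounted in.
	cgroupDir := os.Getenv("TARGET_CGROUP_DIR") // e.g. /sys/fs/cgroup/kubepods.slice/...
	target := filepath.Join(cgroupDir, "memory.oom.group")

	// "0" keeps per-process OOM kills; "1" kills the whole cgroup on OOM.
	if err := os.WriteFile(target, []byte("0"), 0o644); err != nil {
		log.Fatalf("writing %s: %v", target, err)
	}
}
```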