
Conversation

@utam0k (Member) commented Aug 24, 2025

  • One-line PR description: Adding KEP for container-level OOM kill mode configuration
  • Other comments:

Signed-off-by: utam0k <k0ma@utam0k.jp>
@k8s-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 24, 2025
@k8s-ci-robot k8s-ci-robot requested a review from dchen1107 August 24, 2025 08:56
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: utam0k
Once this PR has been reviewed and has the lgtm label, please assign mrunalp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Aug 24, 2025
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 24, 2025

### Goals

- Add a per-container `oomKillMode` field to allow container-level OOM behavior configuration


Currently, we have four types of containers: regular, init, sidecar, and ephemeral. Will the oomKillMode field be supported for all of them? If so, it might be better to state this explicitly. Also, since ephemeral containers use a separate struct, the API change would likely need to be applied there as well.
https://github.com/kubernetes/kubernetes/blob/v1.33.4/pkg/apis/core/types.go#L4240
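
As a rough sketch of the reviewer's point, the field would presumably need to be added to both `Container` and `EphemeralContainerCommon`; the pointer shape and JSON tag below are assumptions, not the KEP's final API:

```go
// Sketch only; see pkg/apis/core/types.go for the real structs.
type Container struct {
	// ... existing fields ...

	// OOMKillMode controls how the OOM killer treats this container.
	// +optional
	OOMKillMode *OOMKillMode `json:"oomKillMode,omitempty"`
}

// EphemeralContainerCommon mirrors Container's fields rather than embedding
// the struct, so the new field would have to be duplicated here as well.
type EphemeralContainerCommon struct {
	// ... existing fields ...

	// +optional
	OOMKillMode *OOMKillMode `json:"oomKillMode,omitempty"`
}
```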


These reports led to [PR #122813](https://github.com/kubernetes/kubernetes/pull/122813), which attempted to add a kubelet flag but was closed after community discussion concluded that container-level configuration was the proper solution[^3]. The consensus was that node-level configuration cannot adequately address the needs of heterogeneous workloads.

This KEP also deprecates the `singleProcessOOMKill` kubelet flag for removal in v1.38 (GA). Container-level configuration provides better granularity and eliminates the complexity of maintaining both node-level and container-level settings.


If you’re only running multi-process applications (e.g., using a managed database service in the cloud), it seems more convenient from a UX perspective to configure this once at the kubelet level. If configuration is only allowed at the container level, you’d likely need to rely on a Mutating Admission Policy instead.

Since the logic that allows the kubelet to determine a default value for memory.oom.group based on the OS and cgroup settings remains when the oomKillMode field is unset, I don’t think keeping the kubelet flag to override that default will add much complexity to the implementation. I think it will be more difficult for users who currently rely on the singleProcessOOMKill kubelet flag to migrate away from it within just three releases.

So personally, I don’t think the kubelet flag needs to be removed. What do you think?


### Validation Rules

API validation will be added in `pkg/apis/core/validation/validation.go`:
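
The snippet itself is elided in this view; as a minimal sketch of what such a rule could look like (every helper and constant name below is an assumption):

```go
import (
	"k8s.io/apimachinery/pkg/util/sets"
	"k8s.io/apimachinery/pkg/util/validation/field"
	core "k8s.io/kubernetes/pkg/apis/core"
)

// Assumed names throughout; shown only to illustrate the shape of the rule.
var supportedOOMKillModes = sets.New(
	string(core.OOMKillModeSingle),
	string(core.OOMKillModeGroup),
)

func validateOOMKillMode(mode *core.OOMKillMode, fldPath *field.Path) field.ErrorList {
	var allErrs field.ErrorList
	if mode == nil {
		return allErrs // unset: the kubelet picks the default
	}
	if !supportedOOMKillModes.Has(string(*mode)) {
		allErrs = append(allErrs, field.NotSupported(fldPath, *mode, sets.List(supportedOOMKillModes)))
	}
	return allErrs
}
```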


If the Pod spec's OS field is set to windows, it might be better to prevent users from specifying OOMKillMode.
https://github.com/kubernetes/kubernetes/blob/v1.33.4/pkg/apis/core/types.go#L3680
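
Building on that suggestion, a hedged sketch of an OS-aware rule (again, all names assumed):

```go
// Sketch: reject oomKillMode on Windows pods, since memory.oom.group is a
// Linux cgroup v2 concept with no Windows equivalent.
func validateOOMKillModeForOS(pod *core.Pod, specPath *field.Path) field.ErrorList {
	var allErrs field.ErrorList
	if pod.Spec.OS == nil || pod.Spec.OS.Name != core.Windows {
		return allErrs
	}
	for i, ctr := range pod.Spec.Containers {
		if ctr.OOMKillMode != nil {
			allErrs = append(allErrs, field.Forbidden(
				specPath.Child("containers").Index(i).Child("oomKillMode"),
				"may not be set for windows pods"))
		}
	}
	return allErrs
}
```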

- cgroups are managed at the container level

3. **Consistency with Existing APIs**: Follows the pattern of other container-level resource configurations:
- Resources (limits/requests) are container-level


nit:

Suggested change:

```diff
- - Resources (limits/requests) are container-level
+ - Resources (limits/requests) can be set at both pod and container level
```

```go
	useGroupKill = false
case v1.OOMKillModeGroup:
	if !isCgroup2UnifiedMode() {
		klog.Warningf("Container %s/%s requests Group mode but system uses cgroup v1, falling back to single process",
```


End users probably don’t look at kubelet logs much, so it would be nice to notify them via Kubernetes Events. However, with the current kubelet implementation, that might be tricky.
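
If the kubelet were to surface the fallback as an Event rather than (or in addition to) a log line, it might look roughly like this; the recorder plumbing and the reason string are assumptions:

```go
// recorder is a record.EventRecorder (k8s.io/client-go/tools/record) and
// pod is the *v1.Pod that owns the container.
recorder.Eventf(pod, v1.EventTypeWarning, "OOMKillModeFallback",
	"Container %s requested Group OOM kill mode but the node uses cgroup v1; falling back to single-process",
	container.Name)
```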

```go
// OOMKillMode specifies how the OOM killer behaves for this container.
// - "Single": only the process that triggered OOM is killed
// - "Group": all processes in the container are killed (cgroup v2 default)
// If not specified, the behavior is determined by the kubelet configuration.
```
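
For context, the enum backing this doc comment would presumably be a string type with two values; the constant names below match the `v1.OOMKillModeGroup` reference quoted earlier, but the exact shape is an assumption:

```go
// OOMKillMode selects how the kernel OOM killer treats a container's cgroup.
type OOMKillMode string

const (
	// OOMKillModeSingle kills only the process that triggered the OOM
	// (memory.oom.group = 0 on cgroup v2).
	OOMKillModeSingle OOMKillMode = "Single"
	// OOMKillModeGroup kills every process in the container's cgroup
	// (memory.oom.group = 1, the cgroup v2 default in recent kubelets).
	OOMKillModeGroup OOMKillMode = "Group"
)
```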


If OOMKillMode is not specified, the kubelet will determine the default value. It might be helpful to write the determined value into the ContainerStatus so that users can check it.
https://github.com/kubernetes/kubernetes/blob/v1.33.4/staging/src/k8s.io/api/core/v1/types.go#L3152
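
One way to act on this suggestion would be a status field along these lines; the field name here is invented purely for illustration:

```go
type ContainerStatus struct {
	// ... existing fields ...

	// EffectiveOOMKillMode reports the OOM kill mode the kubelet actually
	// applied, whether it came from the container spec or from the
	// kubelet's own default. Hypothetical field, not part of the KEP.
	// +optional
	EffectiveOOMKillMode *OOMKillMode `json:"effectiveOOMKillMode,omitempty"`
}
```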

@utam0k utam0k changed the title [WIP] Add container OOM kill mode configuration KEP KEP-5507: Add container OOM kill mode configuration KEP Aug 28, 2025
@utam0k utam0k marked this pull request as ready for review August 28, 2025 22:29
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 28, 2025
@k8s-ci-robot k8s-ci-robot requested a review from mrunalp August 28, 2025 22:29
@k8s-ci-robot

@utam0k: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-enhancements-verify | d8fe9cd | link | true | `/test pull-enhancements-verify` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@utam0k utam0k changed the title KEP-5507: Add container OOM kill mode configuration KEP KEP-5507: Container-level OOM kill mode configuration Aug 28, 2025
- SecurityContext can be set at both pod and container level
- The existing `singleProcessOOMKill` implementation operates per container

### API Design
Member

This API is heavily influenced by the Linux configuration that controls this behavior. Some things we may consider here by expanding the scope of the feature and integrating it into a wider scope:

  1. Will we ever want the same or a similar setting for eviction logic? Will the kubelet ever be extended to understand that a single container, or even a single process in a container, can be "evicted"? Especially with PSI metrics support.
  2. Do we need a "WholePod" option? Especially in conjunction with the restart policies (Container restart rules to customize the pod restart policy, #5307), mostly for use cases where an init container would need to be re-run if an OOM kill happened.

There may be more things this could align with, but those two are the first that came to mind.

cc @haircommander @ndixita @yuanwang04

Contributor

I agree we should consider how the OOM kill could work at the container level as well as the pod level. In general, I feel this should be handled by the kubelet as an action after the container dies for whatever reason (e.g. a KillPod action on container exit). Container OOM would be one of those cases, so defining WholePod here is probably not the cleanest solution.

Contributor

On the other hand, is this OOMKillMode something that is better specified at the container level or at the pod level?

- Platform-specific limitations (cgroup v1, Windows)
- Another field for users to understand and configure

## Alternatives
Member

Some alternatives include implementing this as an NRI plugin or as a privileged init container that configures this for a cgroup. You may want to list these alternatives along with the reasons for rejecting them.
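
For a sense of scale, the privileged init container alternative boils down to writing `memory.oom.group` by hand. A sketch, assuming the pod's cgroup v2 subtree is mounted into the init container and that a hypothetical TARGET_CGROUP env var points at the right directory:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Hypothetical layout: the host cgroup2 hierarchy mounted at /sys/fs/cgroup.
	target := filepath.Join("/sys/fs/cgroup", os.Getenv("TARGET_CGROUP"), "memory.oom.group")
	// "0" = kill only the offending process; "1" = kill the whole cgroup.
	if err := os.WriteFile(target, []byte("0"), 0o644); err != nil {
		fmt.Fprintf(os.Stderr, "failed to set memory.oom.group: %v\n", err)
		os.Exit(1)
	}
}
```

An NRI plugin would make the same adjustment from a node-level hook instead of from inside the pod, at the cost of running extra privileged machinery on every node.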
