
[OCPNODE-725] Control Group v2 Enablement on New Clusters #939

Merged 1 commit into openshift:master on Dec 2, 2021

Conversation

@rphillips (Contributor) commented Oct 19, 2021

OCPNODE-725

This enhancement describes enabling cgroup v2 within OpenShift.

@openshift-ci bot requested review from dhellmann and sdodson October 19, 2021 18:35
@kikisdeliveryservice (Contributor) left a comment

some initial comments

@rphillips changed the title from [OCPNODE-725] CgroupsV2 MCP Enablement to [OCPNODE-725] Control Group v2 Enablement Oct 19, 2021
@rphillips (Contributor Author)

Thank you everyone for the great comments. I've updated the doc to address your comments and to use the Infrastructure API. It feels like a good spot for the cgroup setting.


Migrating to cgroup v2 will bring in many new features and fixes not found in
cgroup v1. cgroup v1 is considered 'legacy' and migrating to cgroup v2 is
considered necessary since RHEL ships with cgroup v2 on by default. (OpenShift
Member:

s/RHEL/RHEL9/

(Also of note related to this, current Fedora defaults to v2, and I believe OKD is still reverting that - hopefully we can also validate this flag on OKD)

When we're thinking about RHEL9 - would one expect that upgrading to it automatically inherits the OS default of cgroup v2, or would the admin need to flip on this flag in addition? I would hope the former.

Contributor Author:

The intent is to enable this by flag at install time.


## Proposal

The option to enable cgroup v2 will have to reside in a centralized location.
@cgwalters (Member) commented Oct 20, 2021

I totally see your desire here to make the cgroup setting a cluster wide thing.

But so far this enhancement is ignoring the BYO-RHEL 7|8 case where the cgroup setting is not managed by the cluster. Or conversely, it seems to mostly be assuming RHCOS8.

It also seems to me that we really do want the ability to have a custom worker pool with cgroup v2 even in older releases - so that admins of existing clusters can test out workloads in cgroup v2 before making a cluster wide switch. Right?

Also to flip this around, even ignoring BYO-RHEL, upgrades of existing clusters that are purely RHCOS8 will have at least a period of time during the upgrade where the nodes are mixed.

What are the major problems you see with a mixed environment?

Member:

Somewhat tangential to this, but I want to emphasize: the incoming RHEL9 will mean that the bare term "RHCOS" becomes a weak identifier, and we will also commonly need to say e.g. RHCOS8 or RHCOS9 or so.

Member:

I don't see a huge need for mixed mode outside of maybe HyperShift, assuming cgroup v2 works fine, since we would have to invest in testing it versus testing pure cgroup v2 more thoroughly and addressing edge cases.

Contributor Author:

I clarified in the doc that this targets a pure-mode cgroup v2 environment. Metric reporting may differ between cgroup v1 and cgroup v2 environments. We will need more testing around HPA, VPA, and other controller workloads in mixed-mode environments to make sure they understand the metrics. The goal of this enhancement is to enable a pure-mode cluster.

@cgwalters (Member) commented Oct 20, 2021

I'm not advocating specifically to emphasize this API at a granular level.

I'm more saying three things:

  • The hard reality is that clusters will be mixed during upgrades, at least for some period of time, so it needs to at least not actively fail
  • We know there are people out there that are e.g. pausing the worker pool for long periods of time, so again those situations will be mixed. Unless we try to actively block updates based on this (in the same way there's some MCO work to block upgrades on having too-old workers)
  • Internally in the MCO, pools roll out machineconfig. Expressing this as a machineconfig feels natural.

That said, the more we add things like this, the more need I see for making it easy for a MachineConfig object to apply to all node types.

But today higher level flags in the install config like hyperThreading and fips are represented as a generated MachineConfig for master and worker.
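
For concreteness, a minimal sketch of the kind of generated per-pool MachineConfig being described here, assuming the MCO's standard kernelArguments field and the usual systemd switch for cgroup v2; the enhancement does not prescribe this exact object, and the name is illustrative:

```yaml
# Minimal sketch, not from the enhancement: a per-pool MachineConfig
# that moves worker nodes to cgroup v2 via a kernel argument.
# The object name "99-worker-cgroup-v2" is illustrative.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-cgroup-v2
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - systemd.unified_cgroup_hierarchy=1
```

Something along these lines is also how an admin of an existing cluster could trial cgroup v2 on a custom worker pool before any cluster-wide switch.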

Contributor:

I think @cgwalters brings up really good points, but it seems like this enhancement is targeting new clusters with cgroup v2 enabled on all pools as a first step to bringing this into OCP. I'd imagine we'd need to hammer out these details via another mixed/heterogeneous deployment/day-2 enhancement?

Are there any restrictions/specific callouts we'd want to bring in from Colin's points above on the scope of which new clusters would be eligible for using this at install?

Contributor Author:

So far, we do not have any restrictions on which clusters cgroup v2 can be enabled on. We will need to gather bug reports and feedback on what is working (and not working).

@yuqi-zhang (Contributor)

A follow-up question on the previous (now resolved) comment: is this mutable after installation? If so (assuming we are allowing migration from v1 to v2), are you able to also go from v2 back to v1, or is the migration unidirectional?


### User Stories

Contributor:

These aren't quite user stories; they're more to-dos.


### Non-Goals

- Support mixed cgroups modes
Member:

Thinking about this more, HyperShift may want to run control planes and worker nodes differently.

@rphillips changed the title from [OCPNODE-725] Control Group v2 Enablement to [OCPNODE-725] Control Group v2 Enablement on New Clusters Oct 20, 2021
@rphillips (Contributor Author)

Updated the enhancement to clarify this is for new clusters and added a new API.


#### Tech Preview -> GA

With sufficient internal testing and customer feedback the feature will graduate
Member:

jfyi, new OKD 4.9 clusters will have cgroup v2 enabled by default

@rphillips (Contributor Author)

I believe this enhancement is ready for a final review.

@kikisdeliveryservice (Contributor) left a comment

Looks good, just a few last comments :)

@sinnykumari (Contributor)

Since this enhancement impacts the MCO as well, I wanted to know in which OpenShift version we are expecting this to be implemented (dev preview)?

@rphillips (Contributor Author)

@sinnykumari @kikisdeliveryservice thank you. Updated the PR and mentioned we can target 4.10 for the changes.

@sinnykumari (Contributor)

> @sinnykumari @kikisdeliveryservice thank you. Updated the PR and mentioned we can target 4.10 for the changes.

umm, not sure, but shouldn't an enhancement be merged before the release planning of a particular release?

@rphillips (Contributor Author)

@sinnykumari that does make sense to me... removed it from the doc.

@rphillips (Contributor Author)

@sttts could you take a pass through this?

to Tech Preview.

Upon graduation to GA the feature will still be turned off by default, but may
be enabled within OpenShift at a later time.
Contributor:

do you expect a period where it is on by default, but can be disabled?

Contributor Author:

I updated this sentence in f69f9c7 to address that another enhancement will be created to specify how the feature will be turned on by default. The original sentence here was not clear with my other modifications to the document.

}

type NodeSpec struct {
	CgroupMode CgroupMode `json:"cgroupMode,omitempty"`
Contributor:

This field makes sense to me. In the actual API PR, you'll need to fill in godoc for users.

}

type NodeSpec struct {
	CgroupMode CgroupMode `json:"cgroupMode,omitempty"`
Contributor:

are we sure it stays at a string? Or should we make it a struct from the beginning?

cgroup:
  mode: v2

Contributor Author:

Other parts of the API do the same thing... Making it a struct seems overkill for something that is likely not going to change again for a long time.
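
As a sketch of how the flat string would read in a manifest, assuming the cluster-scoped object proposed here lands in the config.openshift.io group as a singleton named cluster (both assumptions, not confirmed in this thread):

```yaml
# Sketch under assumptions: the config.openshift.io/v1 group/version
# and the conventional singleton name "cluster" are assumed here.
apiVersion: config.openshift.io/v1
kind: Node
metadata:
  name: cluster
spec:
  cgroupMode: "v2"
```

Supplied as an extra manifest before cluster creation, something like this would also line up with the install-time enablement described earlier in the thread.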

	CgroupMode_Default = CgroupMode_v1
)

type Node struct {
Contributor:

@sttts you devoted some thought to the top level object name, does this match your preference? I have not devoted enough think-time to have a clear opinion.

Contributor Author:

I started with NodeOptions... someone on this thread mentioned that Node makes more sense. I agree with simplifying it to Node.
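
Assembling the fragments quoted in this review, a hedged sketch of what the Node type with user-facing godoc might look like; names follow the excerpts above, while the exact godoc wording, validation markers, and the v2 constant are assumptions that the eventual API PR may change:

```go
// Sketch assembled from the fragments quoted in this review; the
// merged API PR may differ in naming, godoc wording, and markers.
package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// CgroupMode selects which cgroup hierarchy the hosts run with.
type CgroupMode string

const (
	CgroupMode_v1      CgroupMode = "v1"
	CgroupMode_v2      CgroupMode = "v2" // assumed; only v1/Default appear in the excerpts
	CgroupMode_Default            = CgroupMode_v1
)

// Node holds cluster-wide configuration for node-level behavior;
// per the naming discussion above, a cluster-scoped singleton.
type Node struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec NodeSpec `json:"spec"`
}

// NodeSpec carries the user-settable fields.
type NodeSpec struct {
	// cgroupMode determines the cgroup version on the hosts.
	// When omitted, it defaults to CgroupMode_Default (currently v1).
	CgroupMode CgroupMode `json:"cgroupMode,omitempty"`
}
```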

Contributor:

will the latency settings (in some other enhancement) go here too?

Contributor Author:

Latency is a multi-component API addition. I do not believe it goes in this object.

Contributor:

It's only about the kubelet, with others coping with the fallout. Seems to be very Node'ish.

@rphillips (Contributor Author)

@deads2k I added two blurbs in 54031d4: adding a blocking cgroup v2 upgrade job, and a comment about how the pass percentage should be the same or better with the cgroup v2 job.

@rphillips (Contributor Author)

  • non-blocking for this PR: @deads2k to provide a list of OpenShift jobs for GA

@rphillips (Contributor Author)

periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade
periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.10-e2e-aws-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade
periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade
periodic-ci-openshift-release-master-nightly-4.10-upgrade-from-stable-4.9-e2e-metal-ipi-upgrade-ovn-ipv6
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-vsphere-upgrade


### Version Skew Strategy

A cluster installed with cgroup v2 would not be impacted by the version skew
Contributor:

You will be when "Add blocking cgroup v2 upgrade jobs" is done for your GA criteria. Do you plan to revisit?

Contributor Author:

fixed.


### Non-Goals

Mixed cgroup modes are not 100% compatible with each other. We need data
Contributor:

This precludes migrating from cgroups v1 to cgroups v2. Can you explode that into a top-level "non-goal" so we know this enhancement will need revisiting to remove cgroups v1?

@deads2k (Contributor) commented Dec 2, 2021

The flow and GA criteria lgtm. Needs a squash

/approve
/lgtm
/hold

holding for squash. feel free to release after.

@openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Dec 2, 2021
@openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) Dec 2, 2021
@openshift-ci bot commented Dec 2, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Dec 2, 2021
@rphillips (Contributor Author)

/hold cancel

@openshift-ci bot removed the do-not-merge/hold label Dec 2, 2021
@openshift-merge-robot merged commit 6a6d8bf into openshift:master on Dec 2, 2021
Labels: approved, lgtm

10 participants