
[OCPNODE-725] Control Group v2 Enablement on New Clusters #939

Merged 1 commit into openshift:master on Dec 2, 2021

Conversation

@rphillips (Contributor) commented Oct 19, 2021

OCPNODE-725

This enhancement describes enabling cgroup v2 within OpenShift.

@openshift-ci bot requested review from dhellmann and sdodson October 19, 2021 18:35
@kikisdeliveryservice (Contributor) left a comment

some initial comments

@rphillips changed the title from [OCPNODE-725] CgroupsV2 MCP Enablement to [OCPNODE-725] Control Group v2 Enablement Oct 19, 2021
@rphillips (Contributor Author)

Thank you everyone for the great comments. I've updated the doc to address your comments and to use the Infrastructure API. It feels like a good spot for the cgroup setting.


Migrating to cgroup v2 will bring in many new features and fixes not found in
cgroup v1. cgroup v1 is considered 'legacy' and migrating to cgroup v2 is
considered necessary since RHEL ships with cgroup v2 on by default. (OpenShift
Member:

s/RHEL/RHEL9/

(Also of note related to this, current Fedora defaults to v2, and I believe OKD is still reverting that - hopefully we can also validate this flag on OKD)

When we're thinking about RHEL9 - would one expect that upgrading to it automatically inherits the OS default of cgroup v2, or would the admin need to flip on this flag in addition? I would hope the former.

Contributor Author:

The intent is to enable this by flag at install time.


## Proposal

The option to enable cgroup v2 will have to reside in a centralized location.
@cgwalters (Member) commented Oct 20, 2021

I totally see your desire here to make the cgroup setting a cluster wide thing.

But so far this enhancement is ignoring the BYO-RHEL 7|8 case where the cgroup setting is not managed by the cluster. Or conversely, it seems to mostly be assuming RHCOS8.

It also seems to me that we really do want the ability to have a custom worker pool with cgroup v2 even in older releases - so that admins of existing clusters can test out workloads in cgroup v2 before making a cluster wide switch. Right?

Also to flip this around, even ignoring BYO-RHEL, upgrades of existing clusters that are purely RHCOS8 will have at least a period of time during the upgrade where the nodes are mixed.

What are the major problems you see with a mixed environment?

Member:

Somewhat tangential to this, but I want to emphasize: the incoming RHEL9 will mean that the bare term "RHCOS" becomes a weak identifier, and we will also commonly need to say e.g. RHCOS8 or RHCOS9 or so.

Member:

I don't see a huge need for mixed mode outside of maybe HyperShift, assuming cgroup v2 works fine, since we would have to invest in testing it versus testing pure cgroup v2 more thoroughly and addressing edge cases.

Contributor Author:

I clarified in the doc that this targets a pure-mode cgroup v2 environment. Metric reporting may differ between cgroup v1 and cgroup v2 environments. We will need more testing around HPA, VPA, and other controller workloads in mixed-mode environments to make sure they understand the metrics. The goal of this enhancement is to enable a pure-mode cluster.

@cgwalters (Member) commented Oct 20, 2021

I'm not advocating specifically to emphasize this API at a granular level.

I'm more saying three things:

  • The hard reality is that clusters will be mixed during upgrades, at least for some period of time, so it needs to at least not actively fail
  • We know there are people out there that are e.g. pausing the worker pool for long periods of time, so again those situations will be mixed. Unless we try to actively block updates based on this (in the same way there's some MCO work to block upgrades on having too-old workers)
  • Internally in the MCO, pools roll out machineconfig. Expressing this as a machineconfig feels natural.

That said, the more we add things like this, the more need I see for making it easy for a MachineConfig object to apply to all node types.

But today higher level flags in the install config like hyperThreading and fips are represented as a generated MachineConfig for master and worker.
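
For concreteness, a minimal sketch of the kind of generated per-pool MachineConfig being described here, assuming the MCO's standard kernelArguments field and the usual systemd switch for cgroup v2; the enhancement does not prescribe this exact object, and the name is illustrative:

```yaml
# Minimal sketch, not from the enhancement: a per-pool MachineConfig
# that moves worker nodes to cgroup v2 via a kernel argument.
# The object name "99-worker-cgroup-v2" is illustrative.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-cgroup-v2
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - systemd.unified_cgroup_hierarchy=1
```

Something along these lines is also how an admin of an existing cluster could trial cgroup v2 on a custom worker pool before any cluster-wide switch.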

Contributor:

I think @cgwalters brings up really good points, but it seems like this enhancement is targeting new clusters with cgroup v2 enabled on all pools as a first step to bringing this into OCP. I'd imagine we'd need to hammer out these details via another mixed/heterogeneous deployment/day-2 enhancement?

Are there any restrictions/specific callouts we'd want to bring in from Colin's points above on the scope of which new clusters would be eligible for using this at install?

Contributor Author:

So far, we do not have any restrictions on which clusters cgroup v2 can be enabled on. We will need to gather bug reports and feedback on what is working (and not working).

@yuqi-zhang (Contributor)

A follow-up question on the previous (now resolved) comment: is this mutable after installation? If so (assuming we are allowing migration from v1 to v2), are you able to also go from v2 back to v1, or is the migration unidirectional?


### User Stories

Contributor:

These aren't quite user stories; they're more to-dos.


### Non-Goals

- Support mixed cgroups modes
Member:

Thinking about this more, HyperShift may want to run control planes and worker nodes differently.

@rphillips changed the title from [OCPNODE-725] Control Group v2 Enablement to [OCPNODE-725] Control Group v2 Enablement on New Clusters Oct 20, 2021
@rphillips (Contributor Author)

Updated the enhancement to clarify this is for new clusters and added a new API.


#### Tech Preview -> GA

With sufficient internal testing and customer feedback the feature will graduate
Member:

jfyi, new OKD 4.9 clusters will have cgroup v2 enabled by default

@rphillips (Contributor Author)

I believe this enhancement is ready for a final review.

@kikisdeliveryservice (Contributor) left a comment

Looks good, just a few last comments :)

@sinnykumari (Contributor)

Since this enhancement impacts the MCO as well, I wanted to know in which OpenShift version we are expecting this to be implemented (dev preview)?

@rphillips (Contributor Author)

@sinnykumari @kikisdeliveryservice thank you. Updated the PR and mentioned we can target 4.10 for the changes.

@sinnykumari (Contributor)

> @sinnykumari @kikisdeliveryservice thank you. Updated the PR and mentioned we can target 4.10 for the changes.

umm, not sure, but shouldn't an enhancement be merged before the release planning of a particular release?

@rphillips (Contributor Author)

@sinnykumari that does make sense to me... removed it from the doc.

@rphillips (Contributor Author)

@sttts could you take a pass through this?

to Tech Preview.

Upon graduation to GA the feature will still be turned off by default, but may
be enabled within OpenShift at a later time.
Contributor:

do you expect a period where it is on by default, but can be disabled?

Contributor Author:

I updated this sentence in f69f9c7 to address that another enhancement will be created to specify how the feature will be turned on by default. The original sentence here was not clear with my other modifications to the document.

}

type NodeSpec struct {
	CgroupMode CgroupMode `json:"cgroupMode,omitempty"`
Contributor:

This field makes sense to me. In the actual API PR, you'll need to fill in godoc for users.

}

type NodeSpec struct {
	CgroupMode CgroupMode `json:"cgroupMode,omitempty"`
Contributor:

are we sure it stays at a string? Or should we make it a struct from the beginning?

cgroup:
  mode: v2

Contributor Author:

Other parts of the API do the same thing... Making it a struct seems overkill for something that is likely not going to change again for a long time.
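
As a sketch of how the flat string would read in a manifest, assuming the cluster-scoped object proposed here lands in the config.openshift.io group as a singleton named cluster (both assumptions, not confirmed in this thread):

```yaml
# Sketch under assumptions: the config.openshift.io/v1 group/version
# and the conventional singleton name "cluster" are assumed here.
apiVersion: config.openshift.io/v1
kind: Node
metadata:
  name: cluster
spec:
  cgroupMode: "v2"
```

Supplied as an extra manifest before cluster creation, something like this would also line up with the install-time enablement described earlier in the thread.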

	CgroupMode_Default = CgroupMode_v1
)

type Node struct {
Contributor:

@sttts you devoted some thought to the top level object name, does this match your preference? I have not devoted enough think-time to have a clear opinion.

Contributor Author:

I started with NodeOptions... someone on this thread mentioned that Node makes more sense. I agree with simplifying it to Node.
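
Assembling the fragments quoted in this review, a hedged sketch of what the Node type with user-facing godoc might look like; names follow the excerpts above, while the exact godoc wording, validation markers, and the v2 constant are assumptions that the eventual API PR may change:

```go
// Sketch assembled from the fragments quoted in this review; the
// merged API PR may differ in naming, godoc wording, and markers.
package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// CgroupMode selects which cgroup hierarchy the hosts run with.
type CgroupMode string

const (
	CgroupMode_v1      CgroupMode = "v1"
	CgroupMode_v2      CgroupMode = "v2" // assumed; only v1/Default appear in the excerpts
	CgroupMode_Default            = CgroupMode_v1
)

// Node holds cluster-wide configuration for node-level behavior;
// per the naming discussion above, a cluster-scoped singleton.
type Node struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec NodeSpec `json:"spec"`
}

// NodeSpec carries the user-settable fields.
type NodeSpec struct {
	// cgroupMode determines the cgroup version on the hosts.
	// When omitted, it defaults to CgroupMode_Default (currently v1).
	CgroupMode CgroupMode `json:"cgroupMode,omitempty"`
}
```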

Contributor:

will the latency settings (in some other enhancement) go here too?

Contributor Author:

Latency is a multi-component API addition. I do not believe it goes in this object.

Contributor:

It's only about the kubelet, with others coping with the fallout. Seems to be very Node'ish.

@rphillips (Contributor Author)

@deads2k I added two blurbs in 54031d4: adding a blocking cgroup v2 upgrade job, and a comment about how the pass percentage should be the same or better with the cgroup v2 job.

@rphillips (Contributor Author)

  • non-blocking for this PR: @deads2k to provide a list of OpenShift jobs for GA

@rphillips (Contributor Author)

periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade
periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.10-e2e-aws-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade
periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade
periodic-ci-openshift-release-master-nightly-4.10-upgrade-from-stable-4.9-e2e-metal-ipi-upgrade-ovn-ipv6
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-vsphere-upgrade


### Version Skew Strategy

A cluster installed with cgroup v2 would not be impacted by the version skew
Contributor:

You will be when "Add blocking cgroup v2 upgrade jobs" is done for your GA criteria. Do you plan to revisit?

Contributor Author:

fixed.


### Non-Goals

Mixed cgroup modes are not 100% compatible with each other. We need data
Contributor:

This precludes migrating from cgroups v1 to cgroups v2. Can you explode that into a top-level "non-goal" so we know this enhancement will need revisiting to remove cgroups v1?

@deads2k (Contributor) commented Dec 2, 2021

The flow and GA criteria lgtm. Needs a squash

/approve
/lgtm
/hold

holding for squash. feel free to release after.

@openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Dec 2, 2021
@openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) Dec 2, 2021
@openshift-ci bot commented Dec 2, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Dec 2, 2021
@rphillips (Contributor Author)

/hold cancel

@openshift-ci bot removed the do-not-merge/hold label Dec 2, 2021
@openshift-merge-robot merged commit 6a6d8bf into openshift:master on Dec 2, 2021
Labels: approved, lgtm

10 participants