
Auto Node Sizing #642

Merged: 1 commit into openshift:master on Jun 18, 2021
Conversation

@harche (Contributor) commented Feb 11, 2021

This enhancement attempts to add a capability to dynamically select node sizing values for the kubelet.

Signed-off-by: Harshal Patil harpatil@redhat.com

@openshift-ci-robot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Feb 11, 2021
@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign adambkaplan after the PR has been reviewed.
You can assign the PR to them by writing /assign @adambkaplan in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


## Summary

Kubelet should have an automatic sizing calculation mechanism, which could give kubelet an ability to dynamically set sizing values for memory and cpu reservations.
Contributor:

memory and cpu system reserved*

@harche
Copy link
Contributor Author

harche commented Feb 11, 2021

/cc @rphillips @mrunalp


We have observed that varying the value of `system reserved` and `kube reserved` with respect to the installed capacity of the node helps to deduce optimal values.

Currently, the only way to customize the `system reserved` and `kube reserved` limits is to pre-calculate the values manullay prior to Kubelet start.
Contributor:

nit: manually
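
For context on the quoted paragraph: the status quo it describes amounts to pre-computing fixed numbers and baking them into the kubelet invocation. A minimal sketch of that follows; the `--system-reserved`/`--kube-reserved` flag syntax is the kubelet's, but the specific values are only illustrative, not recommendations:

```bash
# A fixed, manually chosen reservation passed on the kubelet command line
# (values are illustrative only):
kubelet --system-reserved=cpu=500m,memory=1Gi,ephemeral-storage=1Gi \
        --kube-reserved=cpu=500m,memory=500Mi
```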

```yaml
metadata:
  creationTimestamp: null
  name: testcluster1
  autoNodeSizing: true
```
Contributor:

+1 nice


Node sizing values will be calculated by the script `/etc/kubernetes/node-sizing`. This script will set the environment variables `CPU_RESERVED`, `MEMORY_RESERVED` and `EPHEMERAL_RESERVED`, which will be used for setting the kubelet parameter `system-reserved`.

The script `/etc/kubernetes/node-sizing` will be executed using `ExecStartPre`.
Contributor:

I did a quick audit of MCO... Let's write the script to /usr/local/bin. _base/files/configure-ovs-network.yaml is an existing example that writes to /usr/local/bin.

Contributor Author:

Thanks, updated.

@harche force-pushed the dynamic-node-sizing branch 3 times, most recently from 5643f43 to cef149b on February 11, 2021 14:42

### Installer Changes

If the user wishes to enable `auto node sizing` during the installation of the cluster itself, they will have to indicate that using the field `metadata.autoNodeSizing` in the installation configuration, e.g.
Member:

What percentage of clusters would you expect to enable this versus disable it? Does it require in depth analysis to decide? If we have 80% one way or the other I don't think we want to promote this into install-config.
CC @staebler

Contributor:

We will eventually want this in the default install, because cluster memory reservations can change the workload capacity within the system. It will be defaulted to off (for now), but on some clouds, like Azure, it may be enabled by default.


### Installer Changes

If the user wishes to enable `auto node sizing` during the installation of the cluster itself, they will have to indicate that using the field `metadata.autoNodeSizing` in the installation configuration, e.g.
Contributor:

The field should not be under metadata.

Contributor:

My first thought is that this should be in the machine pool.

Contributor Author:

We want to have a cluster-wide option that's applicable to all nodes. Either all nodes (workers and masters) have this enabled or none do. We didn't want users to specify this option for compute or controlPlane individually.

Contributor:

What is the reasoning for that? Is it required that it be applied cluster-wide for some technical reason? Or is it just for user convenience?

Contributor Author:

This enhancement will set the optimal values of system-reserved for the kubelet on the node. Having the optimal value of system-reserved makes a huge difference when a node is under a severe resource crunch, such as a pod running with extremely high memory requirements.

Without the right value of system-reserved, the risk of the node locking up and going NotReady increases. You could set the system-reserved value only on specific nodes, since it's just a kubelet parameter, but that would not be ideal from the customer support point of view: customers would see non-uniform performance across their cluster nodes, because nodes with an optimal system-reserved will perform substantially better than those without.

We want to avoid the ambiguity that may arise if customers see their workload run fine on some nodes but not others. That is why we want to make this a cluster-wide option: if we receive a customer support ticket related to node lock-ups in the future, it's easy to tell whether system-reserved was set properly across the cluster.


## Alternatives

1. Enhance the kubelet itself to be smarter about calculating node sizing values. We have an actively debated [KEP](https://github.com/kubernetes/enhancements/pull/2370) in sig-node around this idea.
2. Modify the way MCO handles kubeletconfig. Instead of passing the `--system-reserved` argument to the kubelet, there may be a possibility to make MCO more tolerant of changes to the kubelet config file. That way we would modify the config file to add the system-reserved values instead of passing them as `--system-reserved`.
Contributor:

This is how I would have expected node sizing to be handled. I would have expected a KubeletConfig resource added by the user to either set explicit values or set it to use auto-sizing.

Contributor:

👍 Rather than laying down a script to calculate this on the node, the MCO should already know what the node capacity is; could we add logic there to dynamically set the system reserved for nodes in each pool based on some formula? Otherwise the node will potentially be fighting changes in the MCO.

@harche (Contributor Author) commented Feb 19, 2021

For day-1 operation, until the kubelet is deployed and the node has joined the cluster, the MCO controller does not know the node size.

```
WantedBy=multi-user.target
```

Node sizing values will be calculated by the script `/usr/local/bin/node-sizing.sh`. This script will set the environment variables `CPU_RESERVED`, `MEMORY_RESERVED` and `EPHEMERAL_RESERVED` in `/etc/kubernetes/node-sizing-env`, which will be used for setting the kubelet parameter `system-reserved`.
Contributor:

Can you provide a bit more detail about how the actual values used will be decided in the script?

Contributor Author:

Thanks, I have added a subsection, Node sizing script, that talks more about how the script will use the guidance values to determine the optimal system reserved.
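
That subsection is not included in this excerpt. Purely as an illustration of the shape such a script could take, here is a sketch; the flat-plus-fraction formula and the specific numbers are assumptions made for the example, not the enhancement's actual guidance values:

```bash
#!/usr/bin/env bash
# Illustrative sketch of a node-sizing script. It inspects node capacity and
# writes the three reservation variables to an environment file that the
# kubelet service can consume. The formula and constants are made up.
set -euo pipefail

NODE_SIZING_ENV="${1:-/etc/kubernetes/node-sizing-env}"

total_cpu=$(nproc)                                         # online CPUs
total_mem_kib=$(awk '/MemTotal/ {print $2}' /proc/meminfo) # total memory in KiB
total_mem_gib=$(( total_mem_kib / 1024 / 1024 ))

# Reserve a flat base plus a small amount per unit of capacity (illustrative).
cpu_reserved_millicores=$(( 60 + total_cpu * 10 ))
mem_reserved_mib=$(( 500 + total_mem_gib * 50 ))

cat > "${NODE_SIZING_ENV}" <<EOF
CPU_RESERVED=${cpu_reserved_millicores}m
MEMORY_RESERVED=${mem_reserved_mib}Mi
EPHEMERAL_RESERVED=1Gi
EOF
```

The kubelet service would then read `/etc/kubernetes/node-sizing-env` and pass the three variables through `--system-reserved`, as the quoted paragraph above describes.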


Irrespective of whether `auto node sizing` is enabled or not, the MCO will always place the script file `/usr/local/bin/node-sizing.sh` on the node.

However, the content of the script will vary depending on whether `auto node sizing` is enabled or disabled. When it is disabled, `/usr/local/bin/node-sizing.sh` will set the variables `CPU_RESERVED`, `MEMORY_RESERVED` and `EPHEMERAL_RESERVED` to their existing default values; when it is enabled, the script will dynamically select optimal values based on the resources available on the node.
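
As a quick way to see which values a given node ended up with once the script has run, one could inspect the generated environment file and the running kubelet's flags from a debug shell on the node. This is a hedged illustration, not a documented procedure, and it assumes a standard RHCOS host:

```bash
# From a debug shell on the node (e.g. `oc debug node/<node>` then `chroot /host`):
cat /etc/kubernetes/node-sizing-env          # CPU_RESERVED / MEMORY_RESERVED / EPHEMERAL_RESERVED

# Confirm what the running kubelet was actually started with:
ps -o args= -C kubelet | tr ' ' '\n' | grep -- '--system-reserved'
```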
Contributor:

What is the plan for how the MCO will know whether auto node sizing is enabled? Is this something that we will allow the user to change as a day-2 operation?


## Summary

Kubelet should have an automatic sizing calculation mechanism, which could give kubelet an ability to dynamically set sizing values for memory and cpu system reserved.
Contributor:

Suggested change:
- Kubelet should have an automatic sizing calculation mechanism, which could give kubelet an ability to dynamically set sizing values for memory and cpu system reserved.
+ Nodes should have an automatic sizing calculation mechanism, which could give kubelet an ability to scale values for memory and cpu system reserved based on machine size.

I don't think the mechanism necessarily needs to live inside the kubelet; some other component of a cluster could be responsible for doing the calculation.

enhancements/kubelet/kubelet-node-sizing.md (outdated thread, resolved)

### Installer Changes

If the user wishes to enable `auto node sizing` during the installation of the cluster itself, they will have to indicate that using the field `metadata.autoNodeSizing` in the installation configuration, e.g.
Contributor:

Because this name is end-user facing, I think it's going to be extremely confusing; I would expect this to mean "the node will automatically scale in size with respect to my workloads", and not "the system reserved will scale with the size of the node". I would suggest another name, like autoSystemReserved.

Contributor Author:

Thanks, I will update the name to autoSystemReserved.

## Alternatives

1. Enhance the kubelet itself to be smarter about calculating node sizing values. We have an actively debated [KEP](https://github.com/kubernetes/enhancements/pull/2370) in sig-node around this idea.
2. Modify the way MCO handles kubeletconfig. Instead of passing the `--system-reserved` argument to the kubelet, there may be a possibility to make MCO more tolerant of changes to the kubelet config file. That way we would modify the config file to add the system-reserved values instead of passing them as `--system-reserved`.
Contributor:

👍 Rather than laying down a script to calculate this on the node, the MCO should already know what the node capacity is; could we add logic there to dynamically set the system reserved for nodes in each pool based on some formula? Otherwise the node will potentially be fighting changes in the MCO.

@harche force-pushed the dynamic-node-sizing branch 2 times, most recently from 7c84577 to 8357ab6 on February 19, 2021 11:04

This solution utilizes kubelet command line flags. Kubelet command line flags have been deprecated in favour of the config file, so there is a risk to this solution if those flags are actually removed. Having said that, those flags are quite widely used today, so there has not been much traction on actually removing them even though they have been marked deprecated.

## Alternatives


Doesn't IPI have this knowledge in advance? The installer is the one that spawns the VMs in the cloud, so it knows the size. The same goes for bare metal with the assisted installer, where it knows the values in advance. Not sure about Metal3/Ironic, though.

Contributor:

We need a solution that works across all the clouds and install flavors. Metal would be problematic to figure out up front, and sometimes machines have different hardware layouts. We plan on having this script run on each node, which should alleviate cloud-to-bare-metal discrepancies.

@harche changed the title from "WIP: Dynamic node sizing" to "Auto Node Sizing" on Mar 26, 2021
@openshift-ci-robot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Mar 26, 2021
@rphillips (Contributor):

/lgtm

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Mar 26, 2021
@openshift-ci bot removed the lgtm label (indicates that a PR is ready to be merged) on Jun 18, 2021
@harche (Contributor Author) commented Jun 18, 2021

Not sure why the job ci/prow/markdownlint is failing; when I run it locally it doesn't show any errors.

```
$ hack/markdownlint.sh
+ '[' -z '' ']'
+ case "$-" in
+ __lmod_vx=x
+ '[' -n x ']'
+ set +x
Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for this output (/usr/share/lmod/lmod/init/bash)
Shell debugging restarted
+ unset __lmod_vx
+ markdownlint-cli2 '**/*.md'
markdownlint-cli2 v0.0.15 (markdownlint v0.23.1)
Finding: **/*.md
Linting: 218 file(s)
Summary: 0 error(s)
```

@harche (Contributor Author) commented Jun 18, 2021

Oh got it,

```
enhancements/kubelet/kubelet-node-sizing.md missing "### User Stories"
enhancements/kubelet/kubelet-node-sizing.md missing "### Risks and Mitigations"
enhancements/kubelet/kubelet-node-sizing.md missing "## Design Details"
enhancements/kubelet/kubelet-node-sizing.md missing "### Graduation Criteria"
enhancements/kubelet/kubelet-node-sizing.md missing "#### Dev Preview -> Tech Preview"
enhancements/kubelet/kubelet-node-sizing.md missing "#### Tech Preview -> GA"
enhancements/kubelet/kubelet-node-sizing.md missing "#### Removing a deprecated feature"
enhancements/kubelet/kubelet-node-sizing.md missing "### Upgrade / Downgrade Strategy"
enhancements/kubelet/kubelet-node-sizing.md missing "## Implementation History"
```

@harche force-pushed the dynamic-node-sizing branch 4 times, most recently from 1f7f6bb to 302b8d8 on June 18, 2021 11:28
Signed-off-by: Harshal Patil <harpatil@redhat.com>
@rphillips (Contributor):

/lgtm

@openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) on Jun 18, 2021
@openshift-ci bot commented Jun 18, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jun 18, 2021
@openshift-merge-robot merged commit 46d7be3 into openshift:master on Jun 18, 2021