Auto Node Sizing #642
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
> ## Summary
>
> Kubelet should have an automatic sizing calculation mechanism, which could give kubelet an ability to dynamically set sizing values for memory and cpu reservations.
memory and cpu system reserved*
/cc @rphillips @mrunalp
> We have observed that varying the value of `system reserved` and `kube reserved` with respect to the installed capacity of the node helps to deduce optimal values.
>
> Currently, the only way to customize the `system reserved` and `kube reserved` limits is to pre-calculate the values manullay prior to Kubelet start.
nit: manually
> ```yaml
> metadata:
>   creationTimestamp: null
>   name: testcluster1
>   autoNodeSizing: true
> ```
+1 nice
> Node sizing values will be calculated by the script `/etc/kubernetes/node-sizing`. This script will set the environment variables `CPU_RESERVED`, `MEMORY_RESERVED` and `EPHEMERAL_RESERVED`, which will be used for setting the kubelet parameter `system-reserved`.
>
> The script `/etc/kubernetes/node-sizing` will be executed using `ExecStartPre`.
I did a quick audit of MCO... Let's write the script to /usr/local/bin. `_base/files/configure-ovs-network.yaml` is one example that already writes to /usr/local/bin.
Thanks, updated.
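As an illustration of the mechanism discussed in this thread, here is a minimal sketch of how the `ExecStartPre` wiring could look with the script under `/usr/local/bin`. The drop-in file name, the env-file path `/etc/kubernetes/node-sizing-env`, and the flag assembly shown in the comment are assumptions for this sketch; the actual MCO templates may differ.

```bash
#!/usr/bin/env bash
# Hypothetical wiring sketch: run the sizing script before the kubelet starts and
# expose its output to the kubelet unit. The drop-in name is made up for this example.
set -euo pipefail

mkdir -p /etc/systemd/system/kubelet.service.d

cat > /etc/systemd/system/kubelet.service.d/05-node-sizing.conf <<'EOF'
[Service]
# Compute the reservations on every boot, before the kubelet comes up.
ExecStartPre=/usr/local/bin/node-sizing.sh
# Makes CPU_RESERVED, MEMORY_RESERVED and EPHEMERAL_RESERVED available to ExecStart,
# e.g. --system-reserved=cpu=${CPU_RESERVED},memory=${MEMORY_RESERVED},ephemeral-storage=${EPHEMERAL_RESERVED}
EnvironmentFile=/etc/kubernetes/node-sizing-env
EOF

systemctl daemon-reload
```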
> ### Installer Changes
>
> If the user wishes to enable `auto node sizing` during the installation of the cluster itself, they will have to indicate that using a field `metadata.autoNodeSizing` in installation configuration. e.g.
What percentage of clusters would you expect to enable this versus disable it? Does it require in depth analysis to decide? If we have 80% one way or the other I don't think we want to promote this into install-config.
CC @staebler
We will eventually want this in the default install, because cluster memory reservations can change the workload capacity within the system. It'll default to off (for now), but on some clouds, such as Azure, it may be enabled by default.
> ### Installer Changes
>
> If the user wishes to enable `auto node sizing` during the installation of the cluster itself, they will have to indicate that using a field `metadata.autoNodeSizing` in installation configuration. e.g.
The field should not be under `metadata`.
My first thought is that this should be in the machine pool.
We want to have a cluster-wide option that's applicable to all nodes. Either all nodes (workers and masters) have this enabled, or none do. We didn't want the user to specify this option for `compute` or `controlPlane` individually.
What is the reasoning for that? Is it required that it be applied cluster-wide for some technical reason? Or is it just for user convenience?
This enhancement will set the optimal values of `system-reserved` for the kubelet on the node. Having the optimal value of `system-reserved` makes a huge difference when a node is under a severe resource crunch, such as a pod running with extremely high memory requirements. Without the right value of `system-reserved`, the risk of the node locking up with a `NotReady` status increases.

You can set the `system-reserved` value only on specific nodes, since it's just a parameter to the kubelet, but that would not be ideal from the customer-support point of view. Customers would see non-uniform performance across their cluster nodes: nodes with an optimal `system-reserved` will perform substantially better than those without optimal values. We want to avoid any ambiguity that may arise if customers see their workload run fine on some nodes but not on others. This is the reason we want to make this a cluster-wide option. That way, should we receive a customer support ticket related to node lock-ups in the future, it is easy for us to understand whether `system-reserved` was set properly across the cluster.
/cc @rphillips
> ## Alternatives
>
> 1. Enhance kubelet itself to be smarter about calculating node sizing values. We have an actively debated [KEP](https://github.com/kubernetes/enhancements/pull/2370) in sig-node around this idea.
> 2. Modify the way MCO handles kubeletconfig. Instead of passing the `--system-reserved` argument to the kubelet, maybe there is a possibility to make MCO more tolerant of changes to the kubelet config file. This way we would modify the config file to add system reserved values instead of passing them as `--system-reserved`.
This is how I would have expected node sizing to be handled. I would have expected a KubeletConfig resource added by the user to either set explicit values or set it to use auto-sizing.
👍 Rather than laying down a script to calculate this on the node, the MCO should already know what the node capacity is; could we add logic there to dynamically set the system reserved for nodes in each pool based on some formula? Otherwise the node will potentially be fighting changes in the MCO.
For the day-1 operation, until the kubelet is deployed and the node has joined the cluster, the MCO controller does not know the node size.
> ```
> WantedBy=multi-user.target
> ```
>
> Node sizing values will be calculated by the script `/usr/local/bin/node-sizing.sh`. This script will set the environment variables `CPU_RESERVED`, `MEMORY_RESERVED` and `EPHEMERAL_RESERVED` in `/etc/kubernetes/node-sizing-env`, which will be used for setting the kubelet parameter `system-reserved`.
Can you provide a bit more detail about how the actual values used will be decided in the script?
Thanks, I have added a subsection "Node sizing script" that talks more about how the script will use the guidance values to determine the optimal system reserved.
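To make this concrete, here is a minimal, hypothetical sketch of what an auto-mode node-sizing script could look like. The tiered formula is an illustrative heuristic invented for this sketch, not the guidance values defined in the enhancement's node sizing subsection.

```bash
#!/usr/bin/env bash
# Illustrative sketch of a node-sizing script (auto mode). The formula below is an
# assumed heuristic, not the guidance values from this enhancement.
set -euo pipefail

NODE_SIZING_ENV=/etc/kubernetes/node-sizing-env

# Installed capacity as seen by the OS.
total_memory_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
total_memory_mib=$(( total_memory_kib / 1024 ))
total_cpus=$(nproc)

# Reserve a fraction of memory that tapers off as the node gets larger.
if [ "${total_memory_mib}" -le 4096 ]; then
    memory_reserved_mib=1024
elif [ "${total_memory_mib}" -le 16384 ]; then
    memory_reserved_mib=$(( total_memory_mib * 10 / 100 ))
else
    memory_reserved_mib=$(( 2048 + (total_memory_mib - 16384) * 5 / 100 ))
fi

# Reserve a small, slowly growing slice of CPU, expressed in millicores.
cpu_reserved_millicores=$(( 60 + total_cpus * 10 ))

# Keep ephemeral storage at a flat default in this sketch.
ephemeral_reserved="1Gi"

cat > "${NODE_SIZING_ENV}" <<EOF
CPU_RESERVED=${cpu_reserved_millicores}m
MEMORY_RESERVED=${memory_reserved_mib}Mi
EPHEMERAL_RESERVED=${ephemeral_reserved}
EOF
```

In practice the MCO would template this file and the real guidance values would come from the enhancement's node sizing subsection rather than the numbers shown here.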
> Irrespective of whether `auto node sizing` is enabled or not, MCO will always place the script file `/usr/local/bin/node-sizing.sh` on the node.
>
> However, the content of the script file will vary depending upon whether `auto node sizing` is enabled or disabled. When `auto node sizing` is disabled, `/usr/local/bin/node-sizing.sh` will set the variables `CPU_RESERVED`, `MEMORY_RESERVED` and `EPHEMERAL_RESERVED` to their existing default values, but when `auto node sizing` is enabled the script will dynamically select the optimal values based on the resources available on the node.
What is the plan for how the MCO will know whether auto node sizing is enabled? Is this something that we will allow the user to change as a day-2 operation?
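For contrast with the auto mode, a sketch of the disabled (static) variant of the script is below: it writes fixed defaults so the kubelet wiring stays identical whether `auto node sizing` is on or off. The values shown are placeholders for this sketch, not the documented defaults.

```bash
#!/usr/bin/env bash
# Illustrative "disabled" variant of node-sizing.sh: write static defaults so the
# kubelet unit can consume the same env file regardless of the auto-sizing setting.
# The values below are placeholders.
set -euo pipefail

cat > /etc/kubernetes/node-sizing-env <<EOF
CPU_RESERVED=500m
MEMORY_RESERVED=1Gi
EPHEMERAL_RESERVED=1Gi
EOF
```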
> ## Summary
>
> Kubelet should have an automatic sizing calculation mechanism, which could give kubelet an ability to dynamically set sizing values for memory and cpu system reserved.
Suggested change: replace "Kubelet should have an automatic sizing calculation mechanism, which could give kubelet an ability to dynamically set sizing values for memory and cpu system reserved." with "Nodes should have an automatic sizing calculation mechanism, which could give kubelet an ability to scale values for memory and cpu system reserved based on machine size."
I don't think the mechanism necessarily needs to live inside the kubelet; some other component of a cluster could be responsible for doing the calculation.
> ### Installer Changes
>
> If the user wishes to enable `auto node sizing` during the installation of the cluster itself, they will have to indicate that using a field `metadata.autoNodeSizing` in installation configuration. e.g.
Because this name is end user facing, it's going to be extremely confusing I think; I would expect this to mean "the node automatically will scale in size with respect to my workloads", and not "the system reserved will scale to the size of the node". I would suggest another name, like `autoSystemReserved`.
Thanks, I will update the name to `autoSystemReserved`.
> This solution utilizes kubelet command line flags. Kubelet command line flags have been deprecated in favour of the config file, so there is a risk for this solution if those flags are actually purged. Having said that, those flags are quite widely used today, so there has not been much traction on actually removing them even though they have been marked deprecated.
>
> ## Alternatives
Doesn't IPI have this knowledge in advance? The installer is the one that spawns the VMs in the cloud, so it knows the size. The same goes for bare metal with the assisted installer, where it knows the values in advance. Not sure about Metal3/Ironic, though.
We need a solution that works across all the clouds and install flavors. Metal would be problematic here to figure out up front, and sometimes machines have different hardware layouts. We plan on having this script run on each node which should alleviate cloud to bare metal discrepancies.
/lgtm
Not sure why the job:

```console
$ hack/markdownlint.sh
+ '[' -z '' ']'
+ case "$-" in
+ __lmod_vx=x
+ '[' -n x ']'
+ set +x
Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for this output (/usr/share/lmod/lmod/init/bash)
Shell debugging restarted
+ unset __lmod_vx
+ markdownlint-cli2 '**/*.md'
markdownlint-cli2 v0.0.15 (markdownlint v0.23.1)
Finding: **/*.md
Linting: 218 file(s)
Summary: 0 error(s)
```
Oh got it.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrunalp
The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
This enhancement attempts to add a capability to dynamically select node sizing values for the kubelet.

Signed-off-by: Harshal Patil <harpatil@redhat.com>