|
| 1 | +--- |
| 2 | +layout: blog |
| 3 | +title: 'New conversion from cgroup v1 CPU shares to v2 CPU weight' |
| 4 | +date: 2025-10-25T05:00:00-08:00 |
| 5 | +draft: true |
| 6 | +math: true |
| 7 | +slug: new-cgroup-v1-to-v2-cpu-conversion-formula |
| 8 | +author: > |
| 9 | + [Itamar Holder](https://github.com/iholder101) (Red Hat) |
| 10 | +--- |
| 11 | + |
| 12 | +I'm excited to announce the implementation of an improved conversion formula |
| 13 | +from cgroup v1 CPU shares to cgroup v2 CPU weight. This enhancement addresses |
| 14 | +critical issues with CPU priority allocation for Kubernetes workloads when |
| 15 | +running on systems with cgroup v2. |
| 16 | + |
| 17 | +## Background |
| 18 | + |
| 19 | +Kubernetes was originally designed with cgroup v1 in mind, where CPU shares |
| 20 | +were derived directly from the container's CPU request, scaled so that one |
| 21 | +full CPU (1000m) maps to 1024 shares. |
| 22 | + |
| 23 | +For example, a container requesting 1 CPU (1000m) would get \(cpu.shares = 1024\). |
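
The request-to-shares scaling can be sketched as follows. This is a minimal Python illustration, assuming 1024 shares per full CPU and the cgroup v1 bounds of 2 and 262144; the function and constant names are hypothetical and are not taken from the Kubernetes source.

```python
# Hypothetical helper: names are illustrative, not Kubernetes identifiers.
SHARES_PER_CPU = 1024      # cgroup v1 shares assigned to one full CPU
MILLI_CPU_PER_CPU = 1000   # 1 CPU == 1000m
MIN_SHARES = 2             # lower bound of cpu.shares
MAX_SHARES = 262144        # upper bound of cpu.shares


def milli_cpu_to_shares(milli_cpu: int) -> int:
    """Scale a CPU request in millicores to cgroup v1 cpu.shares."""
    shares = milli_cpu * SHARES_PER_CPU // MILLI_CPU_PER_CPU
    # Clamp to the valid cpu.shares range.
    return max(MIN_SHARES, min(MAX_SHARES, shares))


print(milli_cpu_to_shares(1000))  # 1024
print(milli_cpu_to_shares(100))   # 102
```

Under these assumptions, a 100m request lands on 102 shares, which is the value used in the granularity example later in this post.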
| 24 | + |
| 25 | +Over time, cgroup v1 started being replaced by its successor, |
| 26 | +cgroup v2. In cgroup v2, the concept of CPU shares (which range from 2 to |
| 27 | +262144, or from \(2^1\) to \(2^{18}\)) was replaced with CPU weight (which ranges |
| 28 | +from 1 to 10000, or from \(10^0\) to \(10^4\)). |
| 29 | + |
| 30 | +With the transition to cgroup v2, |
| 31 | +[KEP-2254](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2254-cgroup-v2) |
| 32 | +introduced a conversion formula to map cgroup v1 CPU shares to cgroup v2 CPU |
| 33 | +weight. The conversion formula was defined as: |
| 34 | + |
| 35 | +```math |
| 36 | +cpu.weight = (1 + ((cpu.shares - 2) * 9999) / 262142) |
| 37 | +``` |
| 38 | + |
| 39 | +This formula linearly maps values from \([2^1, 2^{18}]\) to \([10^0, 10^4]\). |
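
To make the linear mapping concrete, here is a minimal Python sketch of the formula above. The function name is hypothetical, and the use of integer (floor) division is an assumption of this sketch rather than something stated in the formula itself.

```python
def shares_to_weight_linear(shares: int) -> int:
    """Linear mapping from cpu.shares [2, 262144] to cpu.weight [1, 10000]."""
    # Assumption: the division truncates, as integer arithmetic would.
    return 1 + ((shares - 2) * 9999) // 262142


for shares in (2, 102, 1024, 262144):
    print(shares, shares_to_weight_linear(shares))
# 2 1
# 102 4
# 1024 39
# 262144 10000
```

The two endpoints map exactly, but everything in the commonly used low end of the range is squeezed into a handful of weight values.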
| 40 | + |
| 41 | + |
| 42 | +While this approach is simple, the linear mapping introduces a few significant |
| 43 | +problems, affecting both performance and configuration granularity. |
| 44 | + |
| 45 | +## Problems with the Previous Conversion Formula |
| 46 | + |
| 47 | +The previous conversion formula creates two major issues: |
| 48 | + |
| 49 | +### 1. Reduced Priority Against Non-Kubernetes Workloads |
| 50 | + |
| 51 | +In cgroup v1, the default value of CPU shares is `1024`, meaning a container |
| 52 | +requesting 1 CPU has equal priority with system processes that live outside |
| 53 | +of Kubernetes' scope. |
| 54 | +However, in cgroup v2, the default CPU weight is `100`, but the previous |
| 55 | +formula converts 1 CPU (1000m, or 1024 shares) to a weight of only `~39`, |
| 56 | +which is less than 40% of the default. |
| 57 | + |
| 58 | +**Example:** |
| 59 | +- Container requesting 1 CPU (1000m) |
| 60 | +- cgroup v1: `cpu.shares = 1024` (equal to default) |
| 61 | +- cgroup v2 (previous formula): `cpu.weight = 39` (much lower than default 100) |
| 62 | + |
| 63 | +This means that after moving to cgroup v2, Kubernetes (or, more generally, OCI) workloads would |
| 64 | +de facto have lower CPU priority relative to non-Kubernetes processes. The |
| 65 | +problem can be severe for setups with many system daemons that run |
| 66 | +outside of Kubernetes' scope and expect Kubernetes workloads to have |
| 67 | +priority, especially in situations of resource starvation. |
| 68 | + |
| 69 | +### 2. Unmanageable Granularity |
| 70 | + |
| 71 | +The previous formula produces very low values for small CPU requests, |
| 72 | +limiting the ability to create sub-cgroups within containers for |
| 73 | +fine-grained resource distribution (which may become much easier moving |
| 74 | +forward, see [KEP #5474](https://github.com/kubernetes/enhancements/issues/5474) for more info). |
| 75 | + |
| 76 | +**Example:** |
| 77 | +- Container requesting 100m CPU |
| 78 | +- cgroup v1: `cpu.shares = 102` |
| 79 | +- cgroup v2 (previous formula): `cpu.weight = 4` (too low for sub-cgroup |
| 80 | + configuration) |
| 81 | + |
| 82 | +With cgroup v1, requesting 100m CPU, which led to 102 CPU shares, was manageable |
| 83 | +in the sense that sub-cgroups could be created inside the main |
| 84 | +container, assigning fine-grained CPU priorities to different groups of |
| 85 | +processes. With cgroup v2, however, a weight of 4 is very hard to |
| 86 | +distribute between sub-cgroups since it's not granular enough. |
| 87 | + |
| 88 | +With plans to allow [writable cgroups for unprivileged containers](https://github.com/kubernetes/enhancements/issues/5474), |
| 89 | +this becomes even |
| 90 | +more relevant. |
| 91 | + |
| 92 | +## New Conversion Formula |
| 93 | + |
| 94 | +### Description |
| 95 | +The new formula is more complicated, but does a much better job mapping |
| 96 | +between cgroup v1 CPU shares and cgroup v2 CPU weight: |
| 97 | + |
| 98 | +```math |
| 99 | +cpu.weight = \lceil 10^{(L^2/612 + 125L/612 - 7/34)} \rceil, \quad \text{where } L = \log_2(cpu.shares) |
| 100 | +``` |
| 101 | + |
| 102 | +The idea is that the exponent is a quadratic function of \(\log_2(cpu.shares)\), chosen so that the formula passes through the following points: |
| 103 | +- (2, 1): The minimum values for both ranges. |
| 104 | +- (1024, 100): The default values for both ranges. |
| 105 | +- (262144, 10000): The maximum values for both ranges. |
| 106 | + |
| 107 | +Visually, the new function looks as follows: |
| 108 | + |
| 109 | + |
| 110 | +And if we zoom in to the important part: |
| 111 | + |
| 112 | + |
| 113 | +The new formula is "close to linear", yet it is carefully designed to |
| 114 | +map the ranges so that it passes through the three important points |
| 115 | +above. |
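
The new formula can be checked numerically with a short Python sketch. The function name is hypothetical, and clamping the inputs at the ends of the range is an assumption of this sketch; it is not a copy of any OCI runtime's implementation.

```python
import math


def shares_to_weight_quadratic(shares: int) -> int:
    """Map cpu.shares to cpu.weight using the quadratic-in-log2 formula."""
    # Assumption: clamp at the range ends so the endpoints are exact.
    if shares <= 2:
        return 1
    if shares >= 262144:
        return 10000
    l = math.log2(shares)
    # Same exponent as in the formula: L^2/612 + 125L/612 - 7/34.
    exponent = (l * l + 125 * l) / 612 - 7 / 34
    return math.ceil(10 ** exponent)


for shares in (2, 102, 1024, 262144):
    print(shares, shares_to_weight_quadratic(shares))
# 2 1
# 102 17
# 1024 100
# 262144 10000
```

At `L = 1`, `L = 10`, and `L = 18` the exponent evaluates to 0, 2, and 4, which is why the minimum, default, and maximum values line up.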
| 116 | + |
| 117 | +### How It Solves the Problems |
| 118 | + |
| 119 | +1. **Better Priority Alignment:** |
| 120 | +- A container requesting 1 CPU (1000m, or 1024 shares) will now get `cpu.weight = 100`, since |
| 121 | + 1024 shares map exactly to \(10^2\). This value matches cgroup v2's default of 100. |
| 122 | +- This restores the intended priority relationship between Kubernetes |
| 123 | + workloads and system processes. |
| 124 | + |
| 125 | +2. **Improved Granularity:** |
| 126 | +- A container requesting 100m CPU will get `cpu.weight = 17` (see |
| 127 | + [here](https://go.dev/play/p/sLlAfCg54Eg)). |
| 128 | +- Enables better fine-grained resource distribution within containers. |
| 129 | + |
| 130 | +## Adoption and integration |
| 131 | + |
| 132 | +This change was implemented at the OCI layer. |
| 133 | +In other words, it is not implemented in Kubernetes itself, so the |
| 134 | +adoption of the new conversion formula depends solely on adoption by |
| 135 | +OCI runtimes. |
| 136 | + |
| 137 | +For example: |
| 138 | +* runc: The new formula is enabled from version [1.3.2](https://github.com/opencontainers/runc/releases/tag/v1.3.2). |
| 139 | +* crun: The new formula is enabled from version [1.23](https://github.com/containers/crun/releases/tag/1.23). |
| 140 | + |
| 141 | +### Impact on Existing Deployments |
| 142 | + |
| 143 | +**Important:** Some consumers may be affected if they assume the older linear conversion formula. |
| 144 | +Applications or monitoring tools that directly calculate expected CPU weight values based on the |
| 145 | +previous formula may need updates to account for the new quadratic conversion. |
| 146 | +This is particularly relevant for: |
| 147 | + |
| 148 | +- Custom resource management tools that predict CPU weight values. |
| 149 | +- Monitoring systems that validate or expect specific weight values. |
| 150 | +- Applications that programmatically set or verify CPU weight values. |
| 151 | + |
| 152 | +We recommend testing the new conversion formula in non-production environments before |
| 153 | +upgrading OCI runtimes to ensure compatibility with existing tooling. |
| 154 | + |
| 155 | +## Where Can I Learn More? |
| 156 | + |
| 157 | +For those interested in this enhancement: |
| 158 | + |
| 159 | +- [Kubernetes GitHub Issue #131216](https://github.com/kubernetes/kubernetes/issues/131216) - Detailed technical |
| 160 | +analysis and examples, including discussions and reasoning for choosing the |
| 161 | +above formula. |
| 162 | +- [KEP-2254: cgroup v2](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2254-cgroup-v2) - |
| 163 | +Original cgroup v2 implementation in Kubernetes. |
| 164 | +- [Kubernetes cgroup documentation](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/) - |
| 165 | +Current resource management guidance. |
| 166 | + |
| 167 | +## How Do I Get Involved? |
| 168 | + |
| 169 | +For those interested in getting involved with Kubernetes node-level |
| 170 | +features, join the [Kubernetes Node Special Interest Group](https://github.com/kubernetes/community/tree/master/sig-node). |
| 171 | +We always welcome new contributors and diverse perspectives on resource management |
| 172 | +challenges. |