---
layout: blog
title: 'New conversion from cgroup v1 CPU shares to v2 CPU weight'
date: 2025-10-25T05:00:00-08:00
slug: new-cgroup-v1-to-v2-cpu-conversion-formula
author: >
  [Itamar Holder](https://github.com/iholder101) (Red Hat)
---

We're excited to announce the implementation of an improved conversion formula
from cgroup v1 CPU shares to cgroup v2 CPU weight. This enhancement addresses
critical issues with CPU priority allocation for Kubernetes workloads when
running on systems with cgroup v2.

## Background

Kubernetes was originally designed with cgroup v1 in mind, where CPU shares
were derived directly from the container's CPU request: the request in
milliCPU, scaled by 1024/1000.

For example, a container requesting 1 CPU (1000m) would get
`cpu.shares = 1024`.

After a while, cgroup v1 started being replaced by its successor,
cgroup v2. In cgroup v2, the concept of CPU shares (which range from 2 to
262144, or from 2^1 to 2^18) was replaced by CPU weight (which ranges from
1 to 10000, or from 10^0 to 10^4).

With the transition to cgroup v2,
[KEP-2254](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2254-cgroup-v2)
introduced a conversion formula to map cgroup v1 CPU shares to cgroup v2 CPU
weight. The conversion formula is defined as:

`cpu.weight = (1 + ((cpu.shares - 2) * 9999) / 262142) // convert from [2-262144] to [1-10000]`

This formula linearly maps the range `[2^1, 2^18]` onto `[10^0, 10^4]`.
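
As a quick illustration, the linear conversion can be sketched in Go. This is an illustrative sketch that mirrors the formula above with integer arithmetic; the function name `linearWeight` is my own, not taken from any runtime's code:

```go
package main

import "fmt"

// linearWeight applies the KEP-2254 linear mapping from cgroup v1
// CPU shares [2, 262144] to cgroup v2 CPU weight [1, 10000].
// Integer division truncates, just like the published formula.
func linearWeight(shares uint64) uint64 {
	return 1 + (shares-2)*9999/262142
}

func main() {
	fmt.Println(linearWeight(2))      // minimum shares -> 1
	fmt.Println(linearWeight(1024))   // 1 CPU request -> 39
	fmt.Println(linearWeight(262144)) // maximum shares -> 10000
}
```

Note how 1024 shares, equal priority to the system default under cgroup v1, land at a weight of only 39.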

While this approach is simple, the linear mapping introduces significant
problems, impacting both performance and configuration granularity.

## Problems with the Current Conversion Formula

The current conversion formula creates two major issues:

### 1. Reduced Priority Against Non-Kubernetes Workloads

In cgroup v1, the default CPU shares value is `1024`, meaning a container
requesting 1 CPU has equal priority to system processes that live outside
of Kubernetes' scope.
However, in cgroup v2, the default CPU weight is `100`, and the current
formula converts the 1024 shares of a 1 CPU request to a weight of only
`~39`, less than 40% of the default.

**Example:**
- Container requesting 1 CPU (1000m)
- cgroup v1: `cpu.shares = 1024` (equal to the default)
- cgroup v2 (current): `cpu.weight = 39` (much lower than the default of 100)

This means that after moving to cgroup v2, Kubernetes workloads de facto
have their CPU priority reduced relative to non-Kubernetes processes. The
problem can be severe for setups that run many system daemons outside of
Kubernetes' scope and expect Kubernetes workloads to have priority,
especially in situations of resource starvation.

### 2. Unmanageable Granularity

The current formula produces very low values for small CPU requests,
limiting the ability to create sub-cgroups within containers for
fine-grained resource distribution.

**Example:**
- Container requesting 100m CPU
- cgroup v1: `cpu.shares = 102`
- cgroup v2 (current): `cpu.weight = 4` (too low for sub-cgroup
  configuration)

With cgroup v1, requesting 100m CPU, which led to 102 CPU shares, was
manageable in the sense that sub-cgroups could be created inside the main
container, assigning fine-grained CPU priorities to different groups of
processes. With cgroup v2, however, a weight of 4 is very hard to
distribute between sub-cgroups, since it is not granular enough.

With plans to allow [writable cgroups for unprivileged containers](https://github.com/kubernetes/enhancements/issues/5474),
this becomes even more relevant.

## New Conversion Formula

### Description

The new formula is more complex, but does a much better job of mapping
cgroup v1 CPU shares to cgroup v2 CPU weight:

`cpu.weight = ⌈10^(L²/612 + 125L/612 - 7/34)⌉` where `L = log₂(cpu.shares)`.

The exponent is a quadratic function in `L`, chosen so that the curve passes
through the following points:
- (2, 1): The minimum values for both ranges.
- (1024, 100): The default values for both ranges.
- (262144, 10000): The maximum values for both ranges.
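
A direct Go evaluation of the formula confirms these anchor points. Again, this is a sketch rather than any runtime's actual code, and `quadraticWeight` is a name of my choosing:

```go
package main

import (
	"fmt"
	"math"
)

// quadraticWeight maps cgroup v1 CPU shares to cgroup v2 CPU weight
// using the new formula. The exponent below equals
// L²/612 + 125L/612 - 7/34; since 7/34 = 126/612, it is rewritten as
// (L² + 125L - 126)/612 so the anchor points evaluate exactly in
// floating point.
func quadraticWeight(shares uint64) uint64 {
	l := math.Log2(float64(shares))
	return uint64(math.Ceil(math.Pow(10, (l*l+125*l-126)/612)))
}

func main() {
	fmt.Println(quadraticWeight(2))      // minimum -> 1
	fmt.Println(quadraticWeight(1024))   // default -> 100
	fmt.Println(quadraticWeight(262144)) // maximum -> 10000
	fmt.Println(quadraticWeight(102))    // 100m request -> 17
}
```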

Visually, the new function looks as follows:

*(figure: plot of the new conversion curve over the full shares range)*

And if we zoom in to the important part:

*(figure: zoomed-in view of the curve around common share values)*

The new formula is "close to linear", yet it is carefully designed to map
the ranges in a clever way, so that the curve passes through the three
important points above.

### How It Solves the Problems

**1. Better Priority Alignment:**
- A container requesting 1 CPU (1024 shares) will now get
  `cpu.weight = 100`, matching cgroup v2's default.
- This restores the intended priority relationship between Kubernetes
  workloads and system processes.

**2. Improved Granularity:**
- A container requesting 100m CPU will get `cpu.weight = 17` (see
  [here](https://go.dev/play/p/sLlAfCg54Eg)).
- This enables better fine-grained resource distribution within containers.
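
Putting the two conversions side by side makes the difference concrete. Both functions below are illustrative sketches with hypothetical names, and the shares derivation follows the kubelet's `milliCPU * 1024 / 1000` scaling:

```go
package main

import (
	"fmt"
	"math"
)

// Old linear KEP-2254 conversion (illustrative sketch).
func linearWeight(shares uint64) uint64 {
	return 1 + (shares-2)*9999/262142
}

// New quadratic-in-log-space conversion (illustrative sketch);
// the exponent equals L²/612 + 125L/612 - 7/34 with L = log2(shares).
func quadraticWeight(shares uint64) uint64 {
	l := math.Log2(float64(shares))
	return uint64(math.Ceil(math.Pow(10, (l*l+125*l-126)/612)))
}

func main() {
	fmt.Printf("%10s %12s %8s %8s\n", "request", "v1 shares", "old", "new")
	for _, m := range []uint64{100, 250, 500, 1000, 2000} {
		shares := m * 1024 / 1000 // kubelet's milliCPU-to-shares scaling
		fmt.Printf("%9dm %12d %8d %8d\n",
			m, shares, linearWeight(shares), quadraticWeight(shares))
	}
}
```

For every request size, the new weight sits much closer to cgroup v2's default of 100 and leaves more room for sub-cgroup subdivision.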

## Adoption and integration

This change was implemented at the OCI runtime level.
In other words, it is not implemented in Kubernetes itself; therefore,
adoption of the new conversion formula depends solely on its adoption by
OCI runtimes.

For example:
- runc: The new formula is enabled from [version 1.4.0-rc.1](https://github.com/opencontainers/runc/releases/tag/v1.4.0-rc.1).
- crun: The new formula is enabled from [version 1.23](https://github.com/containers/crun/releases/tag/1.23).

## Where Can I Learn More?

For those interested in this enhancement:

- [Kubernetes GitHub Issue #131216](https://github.com/kubernetes/kubernetes/issues/131216) - Detailed technical
  analysis and examples, including discussions and the reasoning for
  choosing the above formula.
- [KEP-2254: cgroup v2](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2254-cgroup-v2) -
  The original cgroup v2 implementation in Kubernetes.
- [Kubernetes cgroup documentation](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/) -
  Current resource management guidance.

## How Do I Get Involved?

For those interested in getting involved with Kubernetes node-level
features, join the [Kubernetes Node Special Interest Group](https://github.com/kubernetes/community/tree/master/sig-node).
We always welcome new contributors and diverse perspectives on resource
management challenges.