node: cpumgr: add metrics information

During the review of the KEP, it emerged that there are metrics we should add to track admission and errors. CPU allocation is done at admission time, and extracting these metrics is expected to be both cheap and useful for monitoring.

Signed-off-by: Francesco Romani <fromani@redhat.com>
kubernetes · Oct 5, 2022 · 07270bc
1 parent b57f64b
Showing 2 changed files with 15 additions and 5 deletions.
diff --git a/keps/sig-node/3570-cpumanager/README.md b/keps/sig-node/3570-cpumanager/README.md
@@ -530,7 +530,11 @@ Already running workload will not be affected if the node state is steady
 
 ###### What specific metrics should inform a rollback?
 
-Pod creation errors on a node-by-node basis.
+"cpu_manager_pinning_errors_total". Note that even in a fully healthy system there are known benign conditions
+that can cause CPU allocation failures. A few selected examples are:
+
+- requesting an odd number of cores (not a full physical core) when the cpumanager is configured with the `full-pcpus-only` option
+- requesting NUMA-aligned cores, with Topology Manager enabled.
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
@@ -573,9 +577,14 @@ or accessing the podresources API of the kubelet.
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
-No, because all the metrics we were aware of leaked hardware details.
-All of the metrics experimented by consumers of the feature so far require to expose hardware details of the
-worker nodes, and are dependent on the worker node hardware configuration (e.g. processor core layout).
+"cpu_manager_pinning_requests_total" and "cpu_manager_pinning_errors_total"
+We need to find a careful balance here because we don't want to leak hardware details or, more generally, information
+that depends on the worker node hardware configuration (an example, even if arguably extreme, is the processor core layout).
+
+It is possible to infer which pods would trigger CPU pinning from the
+[pod resources request](https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy),
+but adding these two metrics is both very cheap and helpful for the observability of the system.
+
 
 ### Dependencies
 

diff --git a/keps/sig-node/3570-cpumanager/kep.yaml b/keps/sig-node/3570-cpumanager/kep.yaml
@@ -47,4 +47,5 @@ disable-supported: true
 
 # The following PRR answers are required at beta release
 metrics:
-  - N/A
+  - cpu_manager_pinning_requests_total
+  - cpu_manager_pinning_errors_total
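
The KEP text in this diff names `cpu_manager_pinning_errors_total` as the rollback signal but warns that some allocation failures are benign. One way a consumer might account for that is to compare the growth of the two counters between scrapes instead of alerting on any nonzero error count. The Go sketch below illustrates that idea only: the `pinningSample` type, the helper names, and the 10% threshold are hypothetical and are not part of the kubelet or of this commit; only the two metric names come from the diff.

```go
package main

import "fmt"

// pinningSample holds the values of the two counters at one scrape.
// The type and field names are illustrative assumptions; only the
// metric names in the comments come from the KEP.
type pinningSample struct {
	Requests float64 // cpu_manager_pinning_requests_total
	Errors   float64 // cpu_manager_pinning_errors_total
}

// errorRatio returns the fraction of pinning requests that failed between
// two scrapes. ok is false when there were no new requests or when a
// counter went backwards (for example after a kubelet restart).
func errorRatio(prev, cur pinningSample) (ratio float64, ok bool) {
	dReq := cur.Requests - prev.Requests
	dErr := cur.Errors - prev.Errors
	if dReq <= 0 || dErr < 0 {
		return 0, false
	}
	return dErr / dReq, true
}

// shouldInvestigate reports whether the error ratio exceeds a threshold
// chosen by the operator. Benign failures are expected, so the threshold
// should be above zero; the right value is deployment-specific.
func shouldInvestigate(prev, cur pinningSample, threshold float64) bool {
	ratio, ok := errorRatio(prev, cur)
	return ok && ratio > threshold
}

func main() {
	prev := pinningSample{Requests: 100, Errors: 2}
	cur := pinningSample{Requests: 140, Errors: 12}

	ratio, ok := errorRatio(prev, cur)
	// 10 new errors over 40 new requests, checked against a hypothetical 10% threshold.
	fmt.Printf("ratio=%.2f ok=%v investigate=%v\n",
		ratio, ok, shouldInvestigate(prev, cur, 0.10))
}
```

Working with deltas rather than raw totals keeps the signal meaningful on long-running nodes, where the absolute counter values mostly reflect uptime.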