Commit

node: cpumgr: add metrics information
During the review of the KEP, it emerged that there are metrics
we should add to track admission and errors.
CPU allocation is done at admission time, and extracting
these metrics is expected to be both cheap and useful for monitoring.

Signed-off-by: Francesco Romani <fromani@redhat.com>
ffromani committed Oct 5, 2022
1 parent b57f64b commit 07270bc
Showing 2 changed files with 15 additions and 5 deletions.
17 changes: 13 additions & 4 deletions keps/sig-node/3570-cpumanager/README.md
@@ -530,7 +530,11 @@ Already running workload will not be affected if the node state is steady

###### What specific metrics should inform a rollback?

-Pod creation errors on a node-by-node basis.
+"cpu_manager_pinning_errors_total". Note that even in a fully healthy system there are known benign conditions
+that can cause CPU allocation failures. A few selected examples are:
+
+- requesting an odd number of cores (not a full physical core) when the cpumanager is configured with the `full-pcpus-only` option
+- requesting NUMA-aligned cores, with Topology Manager enabled.
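
Editor's sketch (not part of the commit): the first benign condition above can be illustrated with a small check. It assumes that CPU requests are counted in logical CPUs and that a node exposes a fixed number of hardware threads per physical core; `fitsFullPhysicalCores` and its parameters are hypothetical names, not kubelet code.

```go
package main

import "fmt"

// fitsFullPhysicalCores reports whether a request for exclusive CPUs can be
// served with whole physical cores only. threadsPerCore is the assumed SMT
// level of the node (for example 2 on a typical hyperthreaded machine).
// Hypothetical helper for illustration; not the kubelet implementation.
func fitsFullPhysicalCores(requestedCPUs, threadsPerCore int) bool {
	if requestedCPUs <= 0 || threadsPerCore <= 0 {
		return false
	}
	return requestedCPUs%threadsPerCore == 0
}

func main() {
	// With 2 threads per core, an odd request cannot be met with full cores,
	// so it would surface as a pinning error even on a healthy node.
	for _, cpus := range []int{2, 3, 4} {
		fmt.Printf("request=%d fits=%v\n", cpus, fitsFullPhysicalCores(cpus, 2))
	}
}
```

A rejection of this kind reflects a workload request that does not match the node configuration, not a fault in the feature, which is why the raw error count alone may not justify a rollback.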

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

@@ -573,9 +577,14 @@ or accessing the podresources API of the kubelet.

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

-No, because all the metrics we were aware of leaked hardware details.
-All of the metrics experimented by consumers of the feature so far require to expose hardware details of the
-worker nodes, and are dependent on the worker node hardware configuration (e.g. processor core layout).
+"cpu_manager_pinning_requests_total" and "cpu_manager_pinning_errors_total".
+We need to strike a careful balance here because we don't want to leak hardware details or, more generally, information
+that depends on the worker node hardware configuration (an example, even if arguably extreme, is the processor core layout).
+
+It is possible to infer which pods would trigger CPU pinning from the
+[pod resources request](https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy),
+but adding these two metrics is both very cheap and helpful for the observability of the system.


### Dependencies

3 changes: 2 additions & 1 deletion keps/sig-node/3570-cpumanager/kep.yaml
@@ -47,4 +47,5 @@ disable-supported: true

# The following PRR answers are required at beta release
metrics:
-- N/A
+- cpu_manager_pinning_requests_total
+- cpu_manager_pinning_errors_total