Update Topology Manager docs
* Added information on how device plugins can take advantage of the Topology Manager
* Updated the Topology Manager documentation to include additional information and update some out-of-date sections
lmdaly committed Nov 22, 2019
1 parent b57e73a commit 23981fc
Showing 2 changed files with 65 additions and 5 deletions.
@@ -184,6 +184,32 @@ DaemonSet, `/var/lib/kubelet/pod-resources` must be mounted as a

Support for the "PodResources service" requires `KubeletPodResources` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/) to be enabled. It is enabled by default starting with Kubernetes 1.15.
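
For illustration, here is a minimal sketch of a monitoring agent querying the PodResources service over the kubelet's unix socket; the `v1alpha1` import path and the socket location are assumptions that may vary across Kubernetes versions.

```go
// Minimal sketch of a PodResources client. Assumptions: the v1alpha1
// podresources import path below (it varies across Kubernetes versions)
// and the default kubelet socket location.
package main

import (
	"context"
	"fmt"
	"net"
	"time"

	"google.golang.org/grpc"
	podresourcesapi "k8s.io/kubernetes/pkg/kubelet/apis/podresources/v1alpha1"
)

func main() {
	const socket = "/var/lib/kubelet/pod-resources/kubelet.sock"
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Dial the kubelet's unix socket with a custom dialer.
	conn, err := grpc.DialContext(ctx, socket,
		grpc.WithInsecure(), grpc.WithBlock(),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", addr)
		}))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// List the resources currently assigned to pods on this node.
	client := podresourcesapi.NewPodResourcesListerClient(conn)
	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}
	for _, pod := range resp.PodResources {
		fmt.Println(pod.Namespace, pod.Name)
	}
}
```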

## Device Plugin integration with the Topology Manager

{{< feature-state for_k8s_version="v1.17" state="alpha" >}}

The Topology Manager is a Kubelet component that allows resources to be coordinated in a topology-aligned manner. In order to do this, the Device Plugin API was extended to include a `TopologyInfo` struct.


```gRPC
message TopologyInfo {
	repeated NUMANode nodes = 1;
}

message NUMANode {
	int64 ID = 1;
}
```
Device Plugins that wish to leverage the Topology Manager can send back a populated `TopologyInfo` struct as part of the device registration, along with the device IDs and the health of the device. The device manager will then use this information to consult with the Topology Manager and make resource assignment decisions.

`TopologyInfo` supports a `nodes` field that is either `nil` (the default) or a list of NUMA nodes. This lets the Device Plugin publish a device that can span multiple NUMA nodes.

An example `TopologyInfo` struct populated for a device by a Device Plugin:

```go
pluginapi.Device{
	ID:     "25102017",
	Health: pluginapi.Healthy,
	Topology: &pluginapi.TopologyInfo{
		Nodes: []*pluginapi.NUMANode{
			{ID: 0},
		},
	},
}
```
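
For additional context, here is a minimal sketch of how a Device Plugin's `ListAndWatch` stream might advertise devices carrying this topology information; the `myDevicePluginServer` type and device IDs are hypothetical, and the import path for the `v1beta1` device plugin API may differ by Kubernetes version.

```go
// Minimal sketch; myDevicePluginServer and the device IDs are hypothetical,
// and the v1beta1 import path may differ by Kubernetes version.
package main

import (
	pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1"
)

type myDevicePluginServer struct{}

func (s *myDevicePluginServer) ListAndWatch(
	_ *pluginapi.Empty, stream pluginapi.DevicePlugin_ListAndWatchServer) error {
	devices := []*pluginapi.Device{
		{
			ID:     "dev-0",
			Health: pluginapi.Healthy,
			// This device sits on NUMA node 0.
			Topology: &pluginapi.TopologyInfo{
				Nodes: []*pluginapi.NUMANode{{ID: 0}},
			},
		},
		{
			ID:     "dev-1",
			Health: pluginapi.Healthy,
			// nil Topology: no NUMA affinity is reported for this device.
			Topology: nil,
		},
	}
	// Send the initial inventory; a real plugin re-sends the list whenever
	// device membership or health changes.
	return stream.Send(&pluginapi.ListAndWatchResponse{Devices: devices})
}
```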

## Device plugin examples {#examples}

Here are some examples of device plugin implementations:
@@ -205,5 +231,6 @@
* Learn about [scheduling GPU resources](/docs/tasks/manage-gpus/scheduling-gpus/) using device plugins
* Learn about [advertising extended resources](/docs/tasks/administer-cluster/extended-resource-node/) on a node
* Read about using [hardware acceleration for TLS ingress](https://kubernetes.io/blog/2019/04/24/hardware-accelerated-ssl/tls-termination-in-ingress-controllers-using-kubernetes-device-plugins-and-runtimeclass/) with Kubernetes
* Learn about the [Topology Manager](/docs/tasks/administer-cluster/topology-manager/)

{{% /capture %}}
43 changes: 38 additions & 5 deletions content/en/docs/tasks/administer-cluster/topology-manager.md
@@ -1,5 +1,6 @@
---
title: Control Topology Management Policies on a node
reviewers:
- ConnorDoyle
- klueska
@@ -48,9 +49,9 @@ The hint is then stored in the Topology Manager for use by the *Hint Providers*
The Topology Manager currently:

- Works on Nodes with the `static` CPU Manager Policy enabled. See [control CPU Management Policies](/docs/tasks/administer-cluster/cpu-management-policies/)
- Works on Pods in the `Guaranteed` {{< glossary_tooltip text="QoS class" term_id="qos-class" >}}
- Works on Pods making CPU requests or Device requests via extended resources

If these conditions are met, Topology Manager will align the requested resources.

Topology Manager supports four allocation policies, which you can set via the Kubelet flag `--topology-manager-policy`:
@@ -83,6 +84,9 @@ Using this information, the Topology Manager stores the
preferred NUMA Node affinity for that container. If the affinity is not preferred,
Topology Manager will reject this pod from the node. This will result in a pod in a `Terminated` state with a pod admission failure.

Once the pod is in a `Terminated` state, the Kubernetes scheduler will **not** attempt to reschedule the pod. It is recommended to use a ReplicaSet or Deployment to trigger a redeploy of the pod.
An external control loop could also be implemented to trigger a redeployment of pods that have the `Topology Affinity` error, as sketched below.
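
As a rough sketch of such a control loop (assuming a recent client-go, in-cluster credentials, and that the kubelet reports the admission failure with the status reason `TopologyAffinityError`, which you should verify against your kubelet version):

```go
// Rough sketch of an external redeploy loop; not an official controller.
// Assumptions: a recent client-go, in-cluster credentials, and the status
// reason "TopologyAffinityError" (verify against your kubelet version).
package main

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	for {
		pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(
			context.TODO(), metav1.ListOptions{})
		if err == nil {
			for _, pod := range pods.Items {
				// Pods rejected at admission end up Failed with a reason
				// set by the kubelet; deleting them lets the owning
				// Deployment or ReplicaSet create a replacement.
				if pod.Status.Phase == corev1.PodFailed &&
					pod.Status.Reason == "TopologyAffinityError" {
					_ = client.CoreV1().Pods(pod.Namespace).Delete(
						context.TODO(), pod.Name, metav1.DeleteOptions{})
				}
			}
		}
		time.Sleep(30 * time.Second)
	}
}
```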

If the pod is admitted, the *Hint Providers* can then use this information when making the
resource allocation decision.

@@ -95,6 +99,8 @@ If it is, Topology Manager will store this and the *Hint Providers* can then use
resource allocation decision.
If, however, this is not possible then the Topology Manager will reject the pod from the node. This will result in a pod in a `Terminated` state with a pod admission failure.

Once the pod is in a `Terminated` state, the Kubernetes scheduler will **not** attempt to reschedule the pod. It is recommended to use a Deployment with replicas to trigger a redeploy of the pod.
An external control loop could also be implemented to trigger a redeployment of pods that have the `Topology Affinity` error.

### Pod Interactions with Topology Manager Policies

@@ -146,9 +152,36 @@ spec:

This pod runs in the `Guaranteed` QoS class because `requests` are equal to `limits`.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        example.com/deviceA: "1"
        example.com/deviceB: "1"
      requests:
        example.com/deviceA: "1"
        example.com/deviceB: "1"
```
This pod runs in the `BestEffort` QoS class because there are no CPU and memory requests.

The Topology Manager would consider both of the above pods. It would consult the Hint Providers, the CPU Manager and the Device Manager, to get topology hints for the pods.
In the case of the `Guaranteed` pod, the `static` CPU Manager policy would return hints relating to the CPU request, and the Device Manager would send back hints for the requested device.

In the case of the `BestEffort` pod, the CPU Manager would send back the default hint, as there is no CPU request, and the Device Manager would send back the hints for each of the requested devices.

Using this information, the Topology Manager calculates the optimal hint for the pod and stores it, and the Hint Providers then use this stored information when making their resource assignments.
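
To make the merging step concrete, here is a simplified sketch, not the kubelet's actual implementation, of how per-provider hints expressed as NUMA-node bitmasks can be combined with a bitwise AND:

```go
// Simplified illustration of hint merging; the real kubelet logic differs.
// A hint is a bitmask of acceptable NUMA nodes plus a "preferred" flag.
package main

import "fmt"

type TopologyHint struct {
	NUMANodeAffinity uint64 // bit i set means NUMA node i is acceptable
	Preferred        bool
}

// mergeHints ANDs one candidate hint from each Hint Provider: the result
// is acceptable only on NUMA nodes every provider can satisfy.
func mergeHints(hints []TopologyHint) TopologyHint {
	merged := TopologyHint{NUMANodeAffinity: ^uint64(0), Preferred: true}
	for _, h := range hints {
		merged.NUMANodeAffinity &= h.NUMANodeAffinity
		merged.Preferred = merged.Preferred && h.Preferred
	}
	// An empty intersection cannot be a preferred placement.
	if merged.NUMANodeAffinity == 0 {
		merged.Preferred = false
	}
	return merged
}

func main() {
	cpuHint := TopologyHint{NUMANodeAffinity: 0b01, Preferred: true}     // NUMA 0 only
	deviceHint := TopologyHint{NUMANodeAffinity: 0b11, Preferred: false} // NUMA 0 or 1
	fmt.Printf("%+v\n", mergeHints([]TopologyHint{cpuHint, deviceHint}))
}
```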

### Known Limitations

1. As of Kubernetes 1.16, the Topology Manager is only guaranteed to work if a *single* container in the pod spec requires aligned resources. This is because hint generation is based on current resource allocations, yet all containers in a pod generate hints before any resource allocation has been made, resulting in unreliable hints for all but the first container in a pod. Due to this limitation, if multiple pods or containers are considered by the Kubelet in quick succession, they may not respect the Topology Manager policy.

2. The maximum number of NUMA nodes that the Topology Manager allows is 8. Beyond this there would be a state explosion when trying to enumerate the possible NUMA affinities and generate their hints, since with *n* NUMA nodes there are 2^*n* − 1 possible non-empty affinity masks.

3. The scheduler is not topology-aware, so a pod may be scheduled on a node and then fail on that node due to the Topology Manager.


{{% /capture %}}
