Proposal: CPU Affinity and NUMA Topology Awareness #171

**Author**: Derek Carr

**Last Updated**: 12/12/2016

**Status**: Pending Review

# CPU Affinity and NUMA Topology Awareness

This proposal describes enhancements to the Kubernetes API to improve the
utilization of compute resources for containers that need to avoid
cross NUMA node memory access.

This proposal describes the set of changes recommended to support NUMA
awareness in the `Node` and `Pod` API. It does not yet prescribe the specific
actor that performs NUMA aware scheduling decisions; instead, it focuses purely
on the underlying primitives required to support building more advanced
scheduling capabilities.

## What is NUMA?

Non-uniform memory access (NUMA) describes multi-socket machines that
subdivide memory into nodes, where each node is associated with a list of
CPU cores. This architecture is the norm for modern machines.

An interconnect bus provides connections between nodes so each CPU can
access all memory. Accessing memory on a remote node always carries a latency
penalty proportional to the distance between nodes, and concurrent cross-node
traffic can overwhelm the interconnect, so a process running on a different
node than the memory it needs to access can see significantly increased latency.

As a result, many applications see a performance benefit when the workload
is affined to a particular NUMA node and CPU core(s).

## NUMA topology

In order to support NUMA affined workloads, the `Node` must make its
NUMA topology available for introspection by other agents that schedule
pods.

This proposal recommends that the `NodeStatus` be augmented as follows:

```go
// NodeStatus is information about the current status of a node.
type NodeStatus struct {
    ...
    // Topology represents the NUMA topology of a node to aid NUMA aware scheduling.
    // +optional
    Topology NUMATopology
}

// NUMATopology describes the NUMA topology of a node.
type NUMATopology struct {
    // NUMANodes represents the list of NUMA nodes in the topology.
    NUMANodes []NUMANode
}

// NUMANode describes a single NUMA node.
type NUMANode struct {
    // Identifies a NUMA node on a single host.
    NUMANodeID string
    // Capacity represents the total resources associated with the NUMA node.
    // cpu: 4 <number of logical CPU cores>
    // memory: <amount of memory in normal page size>
    // hugepages: <amount of memory in huge page size>
    Capacity ResourceList
    // Allocatable represents the resources of a NUMA node that are available for scheduling.
    // +optional
    Allocatable ResourceList
    // CPUSet represents the logical numbers of the CPU cores
    // associated with this node.
    // Example: 0-3 or 0,2,4,6
    // The values are expressed in the List Format syntax specified
    // here: http://man7.org/linux/man-pages/man7/cpuset.7.html
    CPUSet string
}
```

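For illustration, a hypothetical two-socket node might report its topology as
sketched below. The resource quantities, node IDs, and use of the
`resource.MustParse` helper are assumptions for the example only and are not
prescribed by this proposal.

```go
// Hypothetical example: how a machine with two NUMA nodes (8 logical CPUs,
// 64Gi of memory) might populate the proposed NodeStatus.Topology field.
// Node 0's allocatable is reduced to reflect a system reservation.
topology := NUMATopology{
    NUMANodes: []NUMANode{
        {
            NUMANodeID:  "0",
            Capacity:    ResourceList{"cpu": resource.MustParse("4"), "memory": resource.MustParse("32Gi")},
            Allocatable: ResourceList{"cpu": resource.MustParse("3"), "memory": resource.MustParse("30Gi")},
            CPUSet:      "0-3",
        },
        {
            NUMANodeID:  "1",
            Capacity:    ResourceList{"cpu": resource.MustParse("4"), "memory": resource.MustParse("32Gi")},
            Allocatable: ResourceList{"cpu": resource.MustParse("4"), "memory": resource.MustParse("32Gi")},
            CPUSet:      "4-7",
        },
    },
}
```
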
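Because `CPUSet` is expressed in the cpuset List Format string, agents that
consume the topology will likely need a small utility to expand it into
individual CPU numbers. A minimal sketch of such a helper is shown below; the
function name and error handling are illustrative assumptions, not part of the
proposed API.

```go
import (
    "fmt"
    "strconv"
    "strings"
)

// parseCPUList expands a cpuset List Format string (e.g. "0-3,8,10-11")
// into the individual logical CPU numbers it describes.
func parseCPUList(s string) ([]int, error) {
    var cpus []int
    for _, part := range strings.Split(s, ",") {
        part = strings.TrimSpace(part)
        if part == "" {
            continue
        }
        bounds := strings.SplitN(part, "-", 2)
        lo, err := strconv.Atoi(bounds[0])
        if err != nil {
            return nil, fmt.Errorf("invalid cpu list %q: %v", s, err)
        }
        hi := lo
        if len(bounds) == 2 {
            if hi, err = strconv.Atoi(bounds[1]); err != nil {
                return nil, fmt.Errorf("invalid cpu list %q: %v", s, err)
            }
        }
        for cpu := lo; cpu <= hi; cpu++ {
            cpus = append(cpus, cpu)
        }
    }
    return cpus, nil
}
```
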
## Node Configuration

### Isolating host processes

By default, the kernel load balances processes across all CPUs, except those marked
isolated using the boot time `isolcpus=` argument. When configuring a node to support
CPU and NUMA affinity, many operators may wish to isolate host processes to particular
cores. Note that `isolcpus=` also disables scheduler load balancing on the isolated
CPUs, so it is best suited to real-time workloads whose processes are pinned
individually; for general purpose isolation, the `systemd` `CPUAffinity` setting
described below is usually preferable.

It is recommended that operators set a CPU value for `--system-reserved`
in whole cores that aligns with the set of CPUs that are made available to the default
kernel scheduling algorithm. If an operator is on a `systemd` managed platform, they
may choose instead to set the `CPUAffinity` value for the root slice to the set of CPU
cores that are reserved for host processes.

**TODO**

1. How should the `kubelet` discover the reserved `cpu-set` value? One option, sketched
   below, is to infer it from the CPU affinity mask the `kubelet` process itself runs with.
1. In a NUMA system, the `kubelet` reservation for memory needs to be removed from
   a particular NUMA node's capacity so that the NUMA node allocatable is as expected.

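A minimal sketch of the inference approach mentioned in the first TODO item:
read the kubelet's own affinity mask and treat every other CPU as available
for pods. It assumes `golang.org/x/sys/unix` for `sched_getaffinity` and is
illustrative only, not a prescribed mechanism.

```go
import "golang.org/x/sys/unix"

// reservedAndAvailableCPUs is a sketch: if the kubelet is started with
// CPUAffinity=0-3 on a 16 CPU machine, it infers that CPUs 0-3 are reserved
// for host processes and CPUs 4-15 are available for pods.
func reservedAndAvailableCPUs(totalCPUs int) (reserved, available []int, err error) {
    var mask unix.CPUSet
    // pid 0 means "the calling process", i.e. the kubelet itself.
    if err := unix.SchedGetaffinity(0, &mask); err != nil {
        return nil, nil, err
    }
    for cpu := 0; cpu < totalCPUs; cpu++ {
        if mask.IsSet(cpu) {
            reserved = append(reserved, cpu)
        } else {
            available = append(available, cpu)
        }
    }
    return reserved, available, nil
}
```
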
### Configuring Taints

Taints and tolerations are used here as a means of decorating the node's resource
description and a pod's scheduling requirements for dedicated and NUMA affined compute.

The following `Taint` keys are defined to enable CPU pinning and NUMA awareness.

#### CPUAffinity

* Effect: `NoScheduleNoAdmitNoExecute`
* Potential values:
  * `dedicated`

If `dedicated`, all pods that tolerate this taint will require dedicated compute resources. Each
pod bound to this node must request CPU in whole cores. The CPU limit must equal the request.

#### NUMACPUAffinity

* Effect: `NoScheduleNoAdmitNoExecute`
* Potential values:
  * `strict`

If `strict`, all pods that tolerate this taint must request CPU (whole or fractional cores) that
fits within a single NUMA node's `cpu` allocatable.

#### NUMAMemoryPolicy

* Effect: `NoScheduleNoAdmitNoExecute`
* Potential values:
  * `strict`
  * `preferred`

If `strict`, all pods that tolerate this taint must request `memory` that fits within their
assigned NUMA node's `memory` allocatable.

If `preferred`, all pods that tolerate this taint are not required to have their `memory` request
fit within their assigned NUMA node's `memory` allocatable.

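For illustration, a node reserved for pinned workloads might carry the taints
below, with pods opting in through matching tolerations. The literals assume
the existing `Taint`/`Toleration` types with `Key`, `Operator`, `Value`, and
`Effect` fields and use the proposed `NoScheduleNoAdmitNoExecute` effect; this
is a sketch, not a prescribed configuration.

```go
// Sketch: taints an operator might place on a node dedicated to NUMA affined,
// dedicated-core workloads (values as defined in the sections above).
nodeTaints := []Taint{
    {Key: "CPUAffinity", Value: "dedicated", Effect: "NoScheduleNoAdmitNoExecute"},
    {Key: "NUMACPUAffinity", Value: "strict", Effect: "NoScheduleNoAdmitNoExecute"},
    {Key: "NUMAMemoryPolicy", Value: "preferred", Effect: "NoScheduleNoAdmitNoExecute"},
}

// Sketch: the tolerations a pod would carry to opt in to those taints.
podTolerations := []Toleration{
    {Key: "CPUAffinity", Operator: "Equal", Value: "dedicated", Effect: "NoScheduleNoAdmitNoExecute"},
    {Key: "NUMACPUAffinity", Operator: "Equal", Value: "strict", Effect: "NoScheduleNoAdmitNoExecute"},
    {Key: "NUMAMemoryPolicy", Operator: "Equal", Value: "preferred", Effect: "NoScheduleNoAdmitNoExecute"},
}
```
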
## Pod Specification

### API changes

The following API changes are proposed to the `PodSpec` to allow CPU and NUMA affinity to be defined.

```go
// PodSpec is a description of a pod
type PodSpec struct {
    ...
    // NodeName is a request to schedule this pod onto a specific node. If it is non-empty,
    // the scheduler simply schedules this pod onto that node, assuming that it fits resource
    // requirements.
    // +optional
    NodeName string
    // Identifies a NUMA node that affines the pod. If it is non-empty, the value must
    // correspond to a particular NUMA node on the same node that the pod is scheduled against.
    // This value is only set if either the `CPUAffinity` or `NUMACPUAffinity` tolerations
    // are present on the pod.
    // +optional
    NUMANodeID string
    // CPUAffinity controls the CPU affinity of the executed pod.
    // If it is non-empty, the value must correspond to a particular set
    // of CPU cores in the matching NUMA node on the machine that the pod is scheduled against.
    // This value is only set if either the `CPUAffinity` or `NUMACPUAffinity` tolerations
    // are present on the pod.
    // The values are expressed in the List Format syntax specified here:
    // http://man7.org/linux/man-pages/man7/cpuset.7.html
    // +optional
    CPUAffinity string
}
```

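As an illustration, after a scheduling agent has placed a pod, the new fields
might be populated as follows. The node name, NUMA node ID, and CPU values are
hypothetical; these fields are intended to be set by a scheduling agent, not
directly by the end user.

```go
// Sketch: a pod bound to node "node-1", affined to NUMA node "0" and to two
// dedicated logical CPUs on that node.
spec := PodSpec{
    NodeName:    "node-1",
    NUMANodeID:  "0",
    CPUAffinity: "2-3",
}
```
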
### REST API changes

The `/pod/<pod-name>/bind` operation will allow updating the NUMA and CPU
affinity values. The same permissions required to schedule a pod to a
node in the cluster will be required to bind a pod to a particular NUMA node
and CPU set.

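One possible shape for such a request, purely illustrative since this proposal
does not prescribe the payload, would extend the existing `Binding` object with
the new affinity fields. The `NUMANodeID` and `CPUAffinity` fields shown below
are assumed extensions, not part of the current `Binding` type.

```go
// Hypothetical sketch: a bind request that also carries the affinity decision.
binding := Binding{
    ObjectMeta: ObjectMeta{Name: "my-pod", Namespace: "default"},
    Target:     ObjectReference{Kind: "Node", Name: "node-1"},
    // Assumed additional fields carrying the affinity decision:
    // NUMANodeID:  "0",
    // CPUAffinity: "2-3",
}
```
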
### Tolerations

Pods that require CPU and NUMA affinity prior to execution must set the
appropriate `Tolerations` for the associated taints.

### Multiple containers

If a pod has multiple containers, the set of containers must all fit
a specific NUMA node, and the set of affined CPUs is shared among the containers.

Pod level cgroups are used to actually affine the containers to the specified
CPU set.

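Concretely, with pod level cgroups in place, the affinity could be realized by
writing the pod's CPU list and NUMA node into the pod's `cpuset` cgroup, which
each container cgroup then inherits. The sketch below assumes the cgroup v1
`cpuset` controller mounted at `/sys/fs/cgroup/cpuset` and a hypothetical pod
cgroup path; it is illustrative, not the prescribed implementation.

```go
import (
    "os"
    "path/filepath"
)

// applyPodCPUSet is a sketch: pin a pod level cgroup to the CPUs and NUMA node
// chosen by the scheduling agent (e.g. cpus="2-3", mems="0").
func applyPodCPUSet(podCgroup, cpus, mems string) error {
    base := filepath.Join("/sys/fs/cgroup/cpuset", podCgroup)
    // cpuset.cpus restricts which logical CPUs tasks in the cgroup may run on.
    if err := os.WriteFile(filepath.Join(base, "cpuset.cpus"), []byte(cpus), 0644); err != nil {
        return err
    }
    // cpuset.mems restricts which NUMA nodes the tasks may allocate memory from.
    return os.WriteFile(filepath.Join(base, "cpuset.mems"), []byte(mems), 0644)
}
```
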
## Resource Quota changes

Operators must be able to limit the consumption of dedicated CPU cores via quota.

## Kubelet changes

The `kubelet` will enforce the presence of the required pod tolerations assigned to the node.

The `kubelet` will pend the execution of any pod that is assigned to the node but has
not yet had the required fields populated for a particular toleration. If the requested
dedicated cores collide with cores already assigned to an existing pod, the `kubelet`
will reject the pod.

* If the toleration `CPUAffinity` is present on a `Pod`, the pod will not start
  any associated container until `Pod.Spec.CPUAffinity` is populated.
* If the toleration `NUMACPUAffinity` is present on a `Pod`, the pod will not start
  any associated container until `Pod.Spec.NUMANodeID` is populated.

The delayed execution of the pod enables both a single-phase scheduler (one that performs
machine and NUMA aware placement together) and a dual-phase scheduler (one scheduler that
assigns the machine and another that assigns the NUMA node and CPU cores) to
place pods on a particular NUMA node and set of CPU cores.

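A minimal sketch of the kubelet-side gate described above, assuming tolerations
are accessible on the pod spec as `Spec.Tolerations`; the helper and messages
are illustrative only.

```go
// shouldStartPod holds a pod that tolerates the affinity taints until a
// scheduling agent has populated the corresponding fields on its spec.
func shouldStartPod(pod *Pod) (bool, string) {
    if hasToleration(pod, "CPUAffinity") && pod.Spec.CPUAffinity == "" {
        return false, "waiting for Pod.Spec.CPUAffinity to be populated"
    }
    if hasToleration(pod, "NUMACPUAffinity") && pod.Spec.NUMANodeID == "" {
        return false, "waiting for Pod.Spec.NUMANodeID to be populated"
    }
    return true, ""
}

// hasToleration reports whether the pod carries a toleration with the given key
// (illustrative helper over the assumed Spec.Tolerations field).
func hasToleration(pod *Pod, key string) bool {
    for _, t := range pod.Spec.Tolerations {
        if t.Key == key {
            return true
        }
    }
    return false
}
```
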
## Required work

1. pod level cgroup support roll-out
1. implement support for the `NoScheduleNoAdmitNoExecute` taint effect
1. expose NUMA topology in cAdvisor
1. expose NUMA topology in node status
1. pod level cgroup support for enabling cpu set

## Future considerations

1. Author a `NUMATopologyPredicate` in the scheduler to enable NUMA aware scheduling.
1. Ensure vertical autoscaling of CPU and NUMA affined workloads is not performed ignorant of NUMA topology.