
Proposal: CPU Affinity and NUMA Topology Awareness #171

Closed

Conversation

@derekwaynecarr (Member) commented Dec 13, 2016

This proposal describes a potential mechanism to support CPU affinity and NUMA topology.

It is intended to help guide future discussions around resource management in @kubernetes/sig-node and @kubernetes/sig-scheduling in the resource management work-group.

An interconnect bus provides connections between nodes so each CPU can
access all memory. The interconnect can be overwhelmed by concurrent
cross-node traffic, and as a result, processes that need to access memory
on a different node can experience increased latency.

NB, the increased latency of cross-node memory access exists regardless of whether the interconnect is overwhelmed or not - remote node memory access always has a penalty based on the distance between nodes. For example, if the local node distance is 10 and the distance to a remote node is 15, then memory access will always be at least 1.5x slower.

Member Author

@berrange - agreed. will clean up text.

Contributor

maybe "a process running on a different node from a memory bank it needs to access can experience increased latency"

@berrange left a comment

As a general point I think that the huge page proposal here kubernetes/kubernetes@68b82f3 needs to be folded into this proposal, as NUMA & huge pages need to be modeled together in order to get a long-term, future-proof design.

// Capacity represents the total resources associated to the NUMA node.
// cpu: 4 <number of cores>
// memory: <amount of memory in normal page size>
// hugepages: <amount of memory in huge page size>


This needs much more work to correctly future-proof it to cope well with huge pages. Historically x86 only had one huge page size (2MB or 4MB depending on kernel address mode), but these days you can also have 1 GB huge pages. On the PPC architecture there is a huge number of page sizes available. Any single host can have multiple page sizes available concurrently.

So memory needs to be represented as a struct in its own right, e.g. something more like:

type PageInfo struct {
    PageSize uint
    Count    uint64
    Used     uint64
}

Then the node would have an array of page info: Memory []PageInfo

The 'cpu: 4' count seems redundant given that you have a CPUSet string below from which you could easily determine the count if represented as an array instead of string.
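For illustration, a minimal sketch pulling these suggestions together (field names are hypothetical, not from the proposal):

```go
// Hypothetical sketch of a per-NUMA-node description that models memory
// per page size and CPUs as an explicit list, as suggested above.
type PageInfo struct {
	PageSize uint   // page size in bytes (4KiB, 2MiB, 1GiB, ...)
	Count    uint64 // total pages of this size on the node
	Used     uint64 // pages currently in use
}

type NUMANode struct {
	ID     int        // NUMA node id as reported by the kernel
	CPUs   []int      // logical CPU ids local to this node
	Memory []PageInfo // one entry per supported page size
}
```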

Member Author

Generally, we have discussed the need for a mechanism to describe the attributes of a resource better than what we have in the current model. Usage information fluctuates, so we tend not to capture it in the node status and would instead use a monitoring pipeline for that information. The page size will need to be reflected; it's possible the convention will be hugepages.<size>: X. Mostly trying to capture the relationship between NUMA and hugepages that you have called out.
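A sketch of what that flat convention might look like in a per-NUMA-node capacity list (keys and quantities here are illustrative, not a final convention):

```yaml
capacity:
  cpu: "4"
  memory: 16Gi
  hugepages.2Mi: 1Gi   # hypothetical per-size key, following the convention above
  hugepages.1Gi: 2Gi
```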

Contributor

I think I agree with Derek here, they could be represented as separate entries in the resource list. Given that pagesize could be represented by a conventional resource name, nr_hugepages and free_hugepages can be derived from the values in capacity and allocatable.

Also agree that the better way to address the need for more expressive resource types is to iterate the resource model versus working around it with many custom fields. By the same token, the CPUSet field could be represented as a Range type if it were available.


As long as there's a way to represent multiple different huge page sizes concurrently I'm OK - if we want to use a flat structure and so have multiple "hugepages.<size>: X" entries, that'll work fine.

Member

Or, and I know this is contrarian, we don't. Or maybe we don't YET. We cannot represent every facet of every hardware system.

@jaypipes (Contributor) Dec 28, 2016

@thockin Amen. I'm kind of chuckling as I read some of this (and the huge page proposal) thinking how much crazy code went into OpenStack Nova for all of this NFV (or NFV-centric) functionality. It made (and continues to make) the scheduler and resource tracking subsystems in Nova a total pain in the behind to read and work with.

Also worth noting: what is the plan to functionally test all of this in a continuous manner? Who will provide the hardware and systems? This is something we spent literally years trying to get companies to contribute and even now still struggle to get stable third-party CI systems integrated.

Member

There are and will be some workloads that we don't handle perfectly. I can live with that.

// Example: 0-3 or 0,2,4,6
// The values are expressed in the List Format syntax specified
// here: http://man7.org/linux/man-pages/man7/cpuset.7.html
CPUSet string


Using the string syntax is user friendly, but not very machine friendly as all operations would have to convert to a bitmap to do any useful calculations. IOW, it'd be better to represent as something like '[]int' storing the CPU numbers in that node.

Member Author

I assume we would have internal utilities to convert to []int.


Yes, you could provide utility functions to convert to a []int format, but IMHO it is preferable to store the data in a canonical format in the first place and avoid the need to convert it everywhere
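For illustration, the sort of internal utility being discussed might look like this (a sketch, not code from the proposal):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUSet expands a cpuset List Format string such as "0-3,8,10-11"
// into the individual logical CPU ids it names.
func parseCPUSet(s string) ([]int, error) {
	var cpus []int
	if strings.TrimSpace(s) == "" {
		return cpus, nil
	}
	for _, part := range strings.Split(s, ",") {
		bounds := strings.SplitN(strings.TrimSpace(part), "-", 2)
		lo, err := strconv.Atoi(bounds[0])
		if err != nil {
			return nil, fmt.Errorf("invalid cpuset %q: %v", s, err)
		}
		hi := lo
		if len(bounds) == 2 {
			if hi, err = strconv.Atoi(bounds[1]); err != nil {
				return nil, fmt.Errorf("invalid cpuset %q: %v", s, err)
			}
		}
		for cpu := lo; cpu <= hi; cpu++ {
			cpus = append(cpus, cpu)
		}
	}
	return cpus, nil
}

func main() {
	cpus, _ := parseCPUSet("0-3,8,10-11")
	fmt.Println(cpus) // [0 1 2 3 8 10 11]
}
```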

By default, load balancing is done across all CPUs, except those marked isolated
using the kernel boot time `isolcpus=` argument. When configuring a node to support
CPU and NUMA affinity, many operators may wish to isolate host processes to particular
cores.


FWIW, the isolcpus= kernel argument is not suitable for general purpose isolation of host OS processes from container/pod processes because it has two effects, only one of which is desirable. Aside from isolating the host processes, it also disables scheduler load balancing on those isolated CPUs. If you have a container that is placed on isolated CPUs, the processes in that container will never move between the available CPUs - they'll forever run on the first CPU on which they were placed, even if all other CPUs are idle.

The systemd CPUAffinity setting should pretty much always be used; reserve isolcpus for running real-time processes on dedicated CPUs where you genuinely don't want the kernel scheduler rebalancing CPUs.
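To make the distinction concrete, a sketch of the two mechanisms discussed above (values are illustrative for a 16-CPU host; older systemd versions take a space-separated CPU list):

```
# /etc/systemd/system.conf -- confine host OS services to CPUs 0-3,
# leaving 4-15 free for pods; the kernel scheduler still load-balances
# pod processes across 4-15.
[Manager]
CPUAffinity=0 1 2 3

# Kernel command line alternative -- also removes CPUs 4-15 from
# scheduler load balancing, so it is only appropriate for manually
# pinned real-time workloads:
isolcpus=4-15
```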

Member Author

I raised the former because not all distros running kube were on systemd. Agreed with the above.


Even if a distro is not running systemd, it should never use isolcpus= unless the containers are intended to run real-time tasks where each individual process is manually pinned to a suitable host CPU. Non-systemd distros would have to configure the cgroups cpuset controller in some custom manner to set up isolation.


**TODO**

1. how should `kubelet` discover the reserved `cpu-set` value?


It could infer it based on the CPU mask it is running with itself, e.g. if you have CPUAffinity=0-3 on a 16-CPU system, then kubelet would see its affinity as 0-3, and so it can infer that 4-15 are the ones reserved for pod usage.
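A rough sketch of that inference (reading the kubelet's own affinity from procfs; not actual kubelet code):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// ownCPUAffinity returns the cpuset List Format string the current
// process is confined to, e.g. "0-3" when started under CPUAffinity=0-3.
func ownCPUAffinity() (string, error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return "", err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "Cpus_allowed_list:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "Cpus_allowed_list:")), nil
		}
	}
	return "", fmt.Errorf("Cpus_allowed_list not found: %v", sc.Err())
}

func main() {
	affinity, err := ownCPUAffinity()
	if err != nil {
		panic(err)
	}
	// On a 16-CPU host started with CPUAffinity=0-3 this prints "0-3";
	// the complement (4-15) would then be treated as reserved for pods.
	fmt.Println(affinity)
}
```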

Member Author

Yeah, I was hoping to avoid the need for additional flags; will think on this some.

* `strict`

If `strict`, all pods that match this taint must request CPU (whole or fractional cores) that
fit a single NUMA node `cpu` allocatable.


This feels somewhat inflexible wrt potential future expansion. It would be better to express a desired NUMA node count, e.g. NUMANodes=N, where N is one or more, to indicate how many NUMA nodes the pod is willing/able to consume. For example, an application inside a container may be capable of dividing its workload into 2 pieces, so it could deal with being spread across 2 NUMA nodes; it could then request NUMANodes=1,2 to indicate it is able to deal with 1 or 2 NUMA nodes, but not more than that.

// The values are expressed in the List Format syntax specified here:
// here: http://man7.org/linux/man-pages/man7/cpuset.7.html
// +optional
CPUAffinity string


What is responsible for setting the NUMANodeID / CPUAffinity values? Is that to be set by the end user directly, or set indirectly by the scheduler when making placement decisions? If the latter, then that's ok, but the former is not a good idea.

Member Author

It's intended to be set by a scheduling agent of some kind.

Member

Stupid question: are the identifiers used in this list global across the machine, or local per NUMA node? Is there any reason why you would specify CPUAffinity without specifying NUMANodeID? (I assume a scheduling agent would know both aspects of the topology.)

Member Author

The identifiers are global across the machine.

I actually debated not including NUMANodeID and just having CPUAffinity, and am open to discussion on that. If NUMANodeID is set, CPUAffinity is either the 1-N cores on that node or some subset.

a specific NUMA node, and the set of affined CPUs are shared among containers.

Pod level cgroups are used to actually affine the container to the specified
CPU set.


Is there a proposal or other reference on how pod-level cgroups will work and programs can understand their assignments?

We've had a long-standing discussion around how a C++ program can discover/understand its resource boundaries so that it can operate properly within them. A classic third-party example is a Java JVM auto-discovering how many GC threads it can create and how much memory it's been allocated. It would be desirable to have a path for codebases that previously assumed they had to manage a whole machine's worth of resources, and that set their own scheduler affinities today, to understand their partial-machine resource set so they can make their own proper suballocations.

Member

@lcarstensen They could either feed in from ENV vars, or IIRC there is a mount trick that is done to allow them to see the world through a cgroup-i-fied lens.

Member Author

@lcarstensen -- recommend using the downward api in that context: kubernetes/website#591
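For reference, the downward API pattern being pointed to exposes a container's own requests/limits as environment variables; a minimal illustrative pod fragment (the container name is hypothetical):

```yaml
# Expose the container's own resource limits to the process via env vars,
# so a runtime (e.g. a JVM) can size thread pools and heaps accordingly.
env:
- name: CPU_LIMIT
  valueFrom:
    resourceFieldRef:
      containerName: app        # hypothetical container name
      resource: limits.cpu
- name: MEMORY_LIMIT
  valueFrom:
    resourceFieldRef:
      containerName: app
      resource: limits.memory
```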

@ConnorDoyle (Contributor) left a comment

Nice work on this write up.

* `strict`
* `preferred`

If `strict`, all pods that match this taint must request `memory` that fits it's assigned
Contributor

s/it's/its

NUMA node `memory` allocatable.

If `preferred`, all pods that match this taint are not required to have their `memory` request
fit it's assigned NUMA node `memory` allocatable.
Contributor

s/it's/its

* Potential values:
* `dedicated`

If `dedicated`, all pods that match this taint will require dedicated compute resources. Each
Contributor

nit (here and repeated below): s/match this taint/tolerate this taint




The `kubelet` will enforce the presence of the required pod tolerations assigned to the node.

The `kubelet` will pend the execution of any pod that is assigned to the node, but has
Contributor

+1 for how the proposal solves this problem.

Member

Does Kubelet reject the pod if the specified NUMANodeID and/or CPUAffinity are already in use by an existing pod?

Member Author

Yes, it would reject the pod if there was a collision in the case where you asked for dedicated cores and were not getting them.


This proposal recommends that the `NodeStatus` is augmented as follows:

```
Contributor

use "```go"?


This proposal describes enhancements to the Kubernetes API to improve the
utilization of compute resources for containers that have a need to avoid
cross NUMA node memory access by containers.


s/by containers//


#### CPUAffinity

* Effect: `NoScheduleNoAdmitNoExecute`


This might have to be represented as two taints here and elsewhere. See kubernetes/kubernetes#28082 for details.

Member Author

@balajismaniam ack, was talking with @aveshagarwal about this in the background. In addition, just using NoScheduleNoAdmit would be sufficient if we drain nodes before applying the taint.


* If the toleration `CPUAffinity` is present on a `Pod`, the pod will not start
any associated container until the `Pod.Spec.CPUAffinity` is populated.
* If the toleration `NUMAAffinity` is present on a `Pod`, the pod will not start


s/NUMAAffinity/NUMACPUAffinity/

}

// NUMATopology describes the NUMA topology of a node.
type NUMATopology struct {
Contributor

Following the convention of "labels everywhere", would you embed k8s.io/kubernetes/pkg/apis/meta/v1.TypeMeta here?

Member Author

TypeMeta is more appropriate for top-level objects in the API; the topology is embedded in Node. Maybe I am missing a use case you had for labeling NUMA nodes within a Node? Isn't the ID sufficient?

Contributor

Sorry for not including the use case. I was thinking of representing shared (but not consumable) per-numa-node resources like device locality. Anyway, in that case it would be more appropriate as part of NUMANode.
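Pulling together the type fragments quoted throughout this review, the proposed NodeStatus addition reads roughly as follows (reassembled for readability; the containing Nodes field and exact field grouping are inferred, so treat this as a sketch rather than the authoritative proposal text):

```go
// NUMATopology describes the NUMA topology of a node.
type NUMATopology struct {
	// Nodes is the set of NUMA nodes on the machine (field name assumed).
	Nodes []NUMANode
}

// NUMANode describes a single NUMA node.
type NUMANode struct {
	// Identifies a NUMA node on a single host.
	NUMANodeID string
	// Capacity represents the total resources associated to the NUMA node.
	// cpu: 4 <number of cores>
	// memory: <amount of memory in normal page size>
	// hugepages: <amount of memory in huge page size>
	Capacity ResourceList
	// Allocatable represents the resources of the NUMA node available
	// for scheduling.
	Allocatable ResourceList
	// CPUSet represents the physical numbers of the CPU cores
	// associated with this node.
	// Example: 0-3 or 0,2,4,6
	// The values are expressed in the List Format syntax specified
	// here: http://man7.org/linux/man-pages/man7/cpuset.7.html
	CPUSet string
}
```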

## Future considerations

1. Author `NUMATopologyPredicate` in scheduler to enable NUMA aware scheduling.
1. Restrict vertical autoscaling of CPU and NUMA affined workloads.
Contributor

Does this have to do with limiting dynamic logical core assignments, or specifically fixing the number of cores allocated for the lifetime of a pod? If the latter, what's the rationale for it? The default vertical auto-scaler (non-numa-aware) could emit impossible allocations (spanning sockets, fractional cores, etc) but you could also imagine a numa-aware vertical auto-scaler that can do the right thing.

Member Author

Maybe I should phrase this differently. I mean to say that we should not just auto-scale these things ignorant of NUMA.

Contributor

sgtm

// Identifies a NUMA node on a single host.
NUMANodeID string
// Capacity represents the total resources associated to the NUMA node.
// cpu: 4 <number of cores>
Contributor

Clarification: is this intended to represent the logical CPU ids as reported by Linux, or only the number of physical cores on the socket?

Member Author

I had intended for this to align with what is returned by the following:

$ sudo lscpu | grep NUMA
NUMA node(s):          1
NUMA node0 CPU(s):     0-3

So in that case it was logical and not physical cores. This is consistent with what we do today when reporting CPU counts in cAdvisor via $ cat /proc/cpuinfo | grep 'processor\t*' | wc -l

I agree we will need to know what the siblings are in the NUMA topology.

Member

> this was consistent with what we do today when reporting CPU counts in cAdvisor

Is this (logical) also what is returned in "cpu" in Allocatable and Capacity in NodeStatus?

Member Author

@davidopp -- yes, the node returns logical cores in "cpu" in Allocatable and Capacity.

Allocatable ResourceList
// CPUSet represents the physical numbers of the CPU cores
// associated with this node.
// Example: 0-3 or 0,2,4,6
Contributor

Clarification: are these limited to the physical cpu ids, and if so is it planned to support scheduling onto their hyperthread siblings?

Contributor

FWIW, we have heard from some users that they would like to consume a pair of hyperthreads from one physical core.

Member Author

This was a logical core listing; we will need to let you know what the siblings are.

Member

For simplicity, IMO we should bound affinity to the core. I question the cost/benefit of allocating to hyperthreads.


For the 90% general usage, I think it is fine to ignore the sockets/cores/threads distinction and just treat all logical cpus as having equivalent "power". In the more niche use cases there are definitely cases where considering threads as special will help. There are applications which are very latency sensitive and benefit from having their threads run on a pair of hyperthread siblings in order to maximise cache hits. Conversely, there are applications which absolutely don't ever want to be on hyperthread siblings because they'll contend on the same shared resources.

That all said, I think it is possible to retrofit support for thread at a later date. It simply requires adding a new array attribute later listing the thread sibling IDs. So this simple list of logical CPU IDs is sufficient granularity for a first impl.
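For reference, the kernel already exposes sibling information per logical CPU via sysfs, which such an attribute could be populated from (the output shown is from a hypothetical hyperthreaded host):

```
$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0,20
$ cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list
1,21
```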

Contributor

@berrange but why wait to address this use case? Most of the users we have talked to that need core pinning are the extreme low-latency people. Without this level of topology awareness the feature will not meet their needs and we'll still have users hacking around the problem.

Member

Is cpu in Capacity always guaranteed to be equal to the number of cores listed here? (If not, why not?)

Member Author

@davidopp -- the list of cpus should always match the number of cores in allocatable. If the kubelet itself is pegged to a particular core, it's possible that capacity != allocatable.


// NUMANode describes a single NUMA node.
type NUMANode struct {
// Identifies a NUMA node on a single host.
Member

FWIW the way this is structured would complicate the scheduler, which currently rips through capacities for most of its matching. If it's possible to expand capacities, that might make this easier.

Member Author

@timothysc -- I am open to ideas on how to flatten this, but we still need the topology to know what cores are in which node, the siblings, and (possibly) their distances in future.



// the scheduler simply schedules this pod onto that node, assuming that it fits resource
// requirements.
// +optional
NodeName string
Member

Why did you list this here? Are you planning to somehow use the existing NodeName field for NUMA node scheduling? It wasn't clear to me from the comment (which appears to be identical to the existing NodeName comment).

Member Author

I listed it here just to be clear that NUMA- and CPU-affined workloads are a triplet of node name + NUMA node + cpuset.

The only thing that is new is the NUMA node ID and CPU affinity values.

* If the toleration `NUMAAffinity` is present on a `Pod`, the pod will not start
any associated container until the `Pod.Spec.NUMANodeID` is populated.

The delayed execution of the pod enables both a single and dual-phase scheduler to
Member

What do you mean by "single and dual-phase scheduler"? (And what do you mean by "delayed execution of the pod"?)

Member Author

It seems that there is some debate in the community on whether the cluster scheduler assigns pods to cpu cores and NUMA nodes, or whether a node-local agent schedules them. We had some of this debate at KubeCon. When I mention a single or dual-phase scheduler, I mean either a scheduler that does machine+NUMA aware scheduling (single) or a pair of schedulers where one does machine assignment and another does NUMA assignment. Personally, I am biased to the single approach, but I got the sense there was some debate in this space.




1. in a numa system, `kubelet` reservation for memory needs to be removed from
a particular numa node capacity so numa node allocatable is as expected.

### Configuring Taints
Member

I didn't understand the reason for the taints. I suspect there is some misunderstanding about the intention of taints/tolerations, but I'm not clear enough on why you are using them to be able to judge. Can you explain what problem(s) you are trying to solve?

Member Author

The problems:

  1. nodes that support dedicated cpu cores are special
  2. pods that require dedicated cpu cores are special
  3. needed a way to distinguish consumption of a dedicated core as distinct/special and did not want to change resource model

I basically chose to use taints/tolerations as a means of decorating the resource description and scheduling requirement.
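As an illustration of that decoration, a node taint and matching pod toleration might look roughly like this (the key and value follow the proposal's names, but the manifest shape and the standard NoSchedule effect are placeholders, since the exact effect is still under discussion above):

```yaml
# Node: marked by the operator as offering dedicated cores.
taints:
- key: CPUAffinity
  value: dedicated
  effect: NoSchedule

# Pod: opts in to (and is restricted to) such nodes.
tolerations:
- key: CPUAffinity
  operator: Equal
  value: dedicated
  effect: NoSchedule
```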

// Capacity represents the total resources associated to the NUMA node.
// cpu: 4 <number of cores>
// memory: <amount of memory in normal page size>
// hugepages: <amount of memory in huge page size>
Member

Or, and I know this is contrarian, we don't. Or maybe we don't YET. We cannot represent every facet of every hardware system.

// The values are expressed in the List Format syntax specified here:
// here: http://man7.org/linux/man-pages/man7/cpuset.7.html
// +optional
CPUAffinity string
Member

^^^ this too.

// This value is only set if either the `CPUAffinity` or `NUMACPUAffinity` tolerations
// are present on the pod.
// +optional
NUMANodeID string
Member

I've said before and I still assert - this is a terrible plan. Offering up this much control is going to kill off any chances we have of actually doing the right thing automatically for most users. This is an attractive nuisance (https://en.wikipedia.org/wiki/Attractive_nuisance_doctrine) - people are going to get hurt by it. If the borg experience counts for anything, the lesson is that saying "yes" to things like this turned out terribly, and is a drag on the system.

I proposed a less first-class alternative. Something like: annotate pods and run a highly-privileged agent on the node that reacts to pods being scheduled by programming cpusets, calling numactl, and whatever else it needs to do. Make this VERY explicit to use and DO NOT make it part of the API.

If we need to collect NUMA status and use something about NUMA as a hint to scheduling, I can MAYBE see this, maybe even a different scheduler entirely.

If we need to sell "dedicated caches", then we should add that as a resource. If I need NUMA because I want an LLC to myself, let me spec that.

If we need to sell "link to high-perf IO device", then we should add that as a resource.

I'm trying to not overreact, but I have a visceral response to this.

@derekwaynecarr (Member Author) commented Dec 23, 2016 via email

@thockin (Member) commented Dec 23, 2016 via email

## Future considerations

1. Author `NUMATopologyPredicate` in scheduler to enable NUMA aware scheduling.
1. Restrict vertical autoscaling of CPU and NUMA affined workloads.

I think this proposal assumes that we run workloads on monolithic hardware.

Our use case is different.
We have servers with 1, 2, or 4 NUMA nodes. They have different CPU models (different numbers of cores) and different NUMA topologies. For example:
Server 1 has topology:

NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39

Server 2 has topology:

NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39

Consider a scenario:

We want to run workloads that perform best if they use all CPUs on one NUMA node.
If the server has 4 NUMA nodes, we will run 4 workloads on them.
It will be difficult to run this scenario with the proposed implementation.

Can we try a different approach?

Add a capability to specify a CPUset for Opaque Integer Resources. One Opaque Integer Resource could be predefined (let's call it "numa"). When a user requests OIR numa 1, the scheduler would specify CPUset 0-9,20-29 on server-1 from my example, and 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40 on server-2.

This approach would also solve more advanced use cases. Let's say we want to run a workload that is pinned to one CPU that is close to a specific PCI device. On one server this CPU may be located on numa0, on another on numa1. In this case a user defines a new OIR (let's call it "fast-pci") and specifies a CPUset for this OIR. It would be different for different servers. The scheduler would pin the workload to the CPU close to the PCI device on each server.

I assume that the OIR definition would be performed by a Node-Feature-Discovery pod (as defined in kubernetes/kubernetes#10570 (comment)).
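For context, an opaque integer resource request of that era looked roughly like the following (assuming the alpha OIR naming prefix; the "numa" resource name is the hypothetical one proposed above):

```yaml
resources:
  requests:
    pod.alpha.kubernetes.io/opaque-int-resource-numa: "1"
  limits:
    pod.alpha.kubernetes.io/opaque-int-resource-numa: "1"
```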

Member Author

@jakub-d -- I don't think NUMA should be handled via OIR, but I see value in letting you have NUMA node capacity as a counted resource on the node. I will think about how to handle that in this proposal. I think the major tension on the proposal is just the presence of NUMANodeID/CPUAffinity on the pod declaration, which I think I would still want even with your proposed model. Thanks for the scenario feedback!

@derekwaynecarr (Member Author)

Recording a pointer to kubernetes/kubernetes#39818, which means to me that a Toleration alone is not enough to provide additional meaning to a resource, as there is no mechanism (at this time) to assert that the Toleration must be coupled with a node with a matching taint.

@ConnorDoyle (Contributor)

This discussion seems to have stalled a bit.

@thockin: to grossly oversimplify your concerns about this proposal (please correct if this is wrong), you are opposed to exposing the NUMANodeID and CPUAffinity fields to user pod submissions?

What if these fields were only settable by a privileged agent like you described? That way those fields become part of the API between the system and the cluster operator. Users would have to specify something higher-level. Examples of higher-level could be anything from "I am a workload of class XYZ" to "I need to deliver 95% tail response latency of 100ms or less per my SLA".

@derekwaynecarr (Member Author)

@ConnorDoyle -- I need to make an update to this proposal around its usage of tolerations, but I agree that the fundamental issue is around those two fields. In addition, I am wondering if people feel differently about "exclusive cores" versus NUMA/CPU affinity. In that mode, we still want something similar to CPUAffinity to provide the contract, but the NUMANodeID would not necessarily be relevant.

@vishh (Contributor) commented Jan 17, 2017 via email

@ConnorDoyle (Contributor) commented Jan 17, 2017

@derekwaynecarr good point. Following the above, if NUMA is what the user requires, they could say so in a way that the operator / system understands and the privileged agent could set up CPUAffinity accordingly.

@vishh It sounds like most people are on board with that goal. Teaching Kubelet to auto-tune settings to uphold SLAs is a much harder problem than simply allowing operators to provide some pre-configured settings on behalf of users. Also, it sounds like there may be some scenarios that Kubelet won't commit to supporting. In the meantime, and for the latter cases, could we build a bridge that avoids exposing too much detail (as per @thockin's comments)? Landing the low-level mechanism described in this proposal could be the groundwork for the default controller that ships w/ core to implement those higher level things.

@vishh (Contributor) commented Jan 17, 2017

@ConnorDoyle

> Teaching Kubelet to auto-tune settings to uphold SLAs is a much harder problem than simply allowing operators to provide some pre-configured settings on behalf of users.

It is better to make Kubernetes a bit more complex than exposing that complexity to end users. Also, we are working on providing extensible isolators in the node, which would allow for varying levels of isolation.
And, I don't really see how cpu & numa allocation can be pre-configured? It has to be dynamic right?

> Landing the low-level mechanism described in this proposal could be the groundwork for the default controller that ships w/ core to implement those higher level things.

The uncertainty around this proposal indicates that we should iterate on this feature rather than make API decisions now. In that regard, let's avoid exposing more knobs unless necessary.
In addition to that, let's try to experiment with this feature using a kubelet extension rather than make upstream changes right away.

@thockin (Member) commented Jan 18, 2017 via email

@ghost commented Jan 18, 2017 via email

@vishh (Contributor) commented Jan 18, 2017

@thockin IIUC, I think we are in agreement. Like I mentioned earlier, we need to express an intent and have NUMA and core allocation be one of the means to achieve that intent. The intent could be that a pod is "Latency sensitive".
At the same time, if we let the intent become too loose (similar to a broad scale that @quinton-hoole mentioned), users might not be able to consume such an API at all.

IIUC, this feature is primarily meant to tackle workloads that are sensitive to scheduling and memory access latencies. So let's try to model based on that "need" rather than the means.

@derekwaynecarr (Member Author) commented Jan 20, 2017 via email

@jakub-d commented Mar 17, 2017

Any updates on this subject?


## What is NUMA?

Non-uniform memory architecture (NUMA) describes multi-socket machines that
Contributor

Non-uniform memory architecture => Non-Uniform Memory Architecture (NUMA)
Very minor change, enhances readability.


A in NUMA is "Access" https://en.wikipedia.org/wiki/Non-uniform_memory_access

Should it be "Non-Uniform Memory Access (NUMA) architecture" ?

@derekwaynecarr (Member Author)

For reference, I plan to update this proposal to align w/ more recent discussions in our resource mgmt workgroup. I hope to have an update for review in time for our planned f2f on 5/1/2017.

@derekwaynecarr (Member Author)

I am closing this proposal given the outcome of the resource mgmt f2f where we agreed on a different high-level approach.

@gmarkey commented May 12, 2017

@derekwaynecarr are you able to share more details about the F2F?

@vishh (Contributor) commented May 12, 2017 via email

@teferi commented May 23, 2017

A relevant formal proposal to k8s/community #654
