
Proposal: CPU Affinity and NUMA Topology Awareness #171

Closed

Conversation

@derekwaynecarr (Member) commented Dec 13, 2016

This proposal describes a potential mechanism to support CPU affinity and NUMA topology.

It is intended to help guide future discussions around resource management in @kubernetes/sig-node and @kubernetes/sig-scheduling in the resource management work-group.

An interconnect bus provides connections between nodes so each CPU can
access all memory. The interconnect can be overwhelmed by concurrent
cross-node traffic, and as a result, processes that need to access memory
on a different node can experience increased latency.

NB, the increased latency of cross-node memory access exists regardless of whether the interconnect is overwhelmed or not - remote node memory access always has a penalty based on the distance between nodes. For example, if the local node distance is 10 and the distance to a remote node is 15, then memory access will always be at least 1.5x slower.

Member Author

@berrange - agreed. will clean up text.

Contributor

maybe "a process running on a different node from a memory bank it needs to access can experience increased latency"

@berrange left a comment

As a general point I think that the huge page proposal here kubernetes/kubernetes@68b82f3 needs to be folded into this proposal, as NUMA & huge pages need to be modeled together in order to get a long-term, future-proof design.

// Capacity represents the total resources associated to the NUMA node.
// cpu: 4 <number of cores>
// memory: <amount of memory in normal page size>
// hugepages: <amount of memory in huge page size>


This needs much more work to correctly future-proof it to cope well with huge pages. Historically x86 only had one huge page size (2MB or 4MB depending on kernel address mode), but these days you can also have 1 GB huge pages. On the PPC architecture there is a huge number of page sizes available. Any single host can have multiple page sizes available concurrently.

So memory needs to be represented as a struct in its own right, e.g. something more like:

type PageInfo struct {
    PageSize uint
    Count    uint64
    Used     uint64
}

Then the node would have an array of page info: Memory []PageInfo

The 'cpu: 4' count seems redundant given that you have a CPUSet string below from which you could easily determine the count if represented as an array instead of string.
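For illustration, a minimal sketch pulling these suggestions together (field names are hypothetical, not from the proposal):

```go
// Hypothetical sketch of a per-NUMA-node description that models memory
// per page size and CPUs as an explicit list, as suggested above.
type PageInfo struct {
	PageSize uint   // page size in bytes (4KiB, 2MiB, 1GiB, ...)
	Count    uint64 // total pages of this size on the node
	Used     uint64 // pages currently in use
}

type NUMANode struct {
	ID     int        // NUMA node id as reported by the kernel
	CPUs   []int      // logical CPU ids local to this node
	Memory []PageInfo // one entry per supported page size
}
```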

Member Author

Generally, we have discussed the need for a mechanism to describe the attributes of a resource better than what we have in the current model. Usage information fluctuates, so we tend not to capture it in the node status and would instead use a monitoring pipeline for that information. The page size will need to be reflected; it's possible the convention will be hugepages.<size>: X. Mostly trying to capture the relationship between NUMA and hugepages that you have called out.
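A sketch of what that flat convention might look like in a per-NUMA-node capacity list (keys and quantities here are illustrative, not a final convention):

```yaml
capacity:
  cpu: "4"
  memory: 16Gi
  hugepages.2Mi: 1Gi   # hypothetical per-size key, following the convention above
  hugepages.1Gi: 2Gi
```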

Contributor

I think I agree with Derek here, they could be represented as separate entries in the resource list. Given that pagesize could be represented by a conventional resource name, nr_hugepages and free_hugepages can be derived from the values in capacity and allocatable.

Also agree that the better way to address the need for more expressive resource types is to iterate the resource model versus working around it with many custom fields. By the same token, the CPUSet field could be represented as a Range type if it were available.


As long as there's a way to represent multiple different huge page sizes concurrently I'm OK - if we want to use a flat structure and so have multiple "hugepages.<size>: X" entries, that'll work fine.

Member

Or, and I know this is contrarian, we don't. Or maybe we don't YET. We cannot represent every facet of every hardware system.

@jaypipes (Contributor) Dec 28, 2016

@thockin Amen. I'm kind of chuckling as I read some of this (and the huge page proposal) thinking how much crazy code went into OpenStack Nova for all of this NFV (or NFV-centric) functionality. It made (and continues to make) the scheduler and resource tracking subsystems in Nova a total pain in the behind to read and work with.

Also worth noting: what is the plan to functionally test all of this in a continuous manner? Who will provide the hardware and systems? This is something we spent literally years trying to get companies to contribute and even now still struggle to get stable third-party CI systems integrated.

Member

There are and will be some workloads that we don't handle perfectly. I can live with that.

// Example: 0-3 or 0,2,4,6
// The values are expressed in the List Format syntax specified
// here: http://man7.org/linux/man-pages/man7/cpuset.7.html
CPUSet string


Using the string syntax is user friendly, but not very machine friendly as all operations would have to convert to a bitmap to do any useful calculations. IOW, it'd be better to represent as something like '[]int' storing the CPU numbers in that node.

Member Author

I assume we would have internal utilities to convert to []int.


Yes, you could provide utility functions to convert to a []int format, but IMHO it is preferable to store the data in a canonical format in the first place and avoid the need to convert it everywhere
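For illustration, the sort of internal utility being discussed might look like this (a sketch, not code from the proposal):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUSet expands a cpuset List Format string such as "0-3,8,10-11"
// into the individual logical CPU ids it names.
func parseCPUSet(s string) ([]int, error) {
	var cpus []int
	if strings.TrimSpace(s) == "" {
		return cpus, nil
	}
	for _, part := range strings.Split(s, ",") {
		bounds := strings.SplitN(strings.TrimSpace(part), "-", 2)
		lo, err := strconv.Atoi(bounds[0])
		if err != nil {
			return nil, fmt.Errorf("invalid cpuset %q: %v", s, err)
		}
		hi := lo
		if len(bounds) == 2 {
			if hi, err = strconv.Atoi(bounds[1]); err != nil {
				return nil, fmt.Errorf("invalid cpuset %q: %v", s, err)
			}
		}
		for cpu := lo; cpu <= hi; cpu++ {
			cpus = append(cpus, cpu)
		}
	}
	return cpus, nil
}

func main() {
	cpus, _ := parseCPUSet("0-3,8,10-11")
	fmt.Println(cpus) // [0 1 2 3 8 10 11]
}
```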

By default, load balancing is done across all CPUs, except those marked isolated
using the kernel boot time `isolcpus=` argument. When configuring a node to support
CPU and NUMA affinity, many operators may wish to isolate host processes to particular
cores.


FWIW, the isolcpus= kernel argument is not suitable for general purpose isolation of host OS processes from container/pod processes because it has two effects, only one of which is desirable. Aside from isolating the host processes, it also disables scheduler load balancing on those isolated CPUs. If you have a container that is placed on isolated CPUs, the processes in that container will never move between the available CPUs - they'll forever run on the first CPU on which they were placed, even if all other CPUs are idle.

The systemd CPUAffinity setting should pretty much always be used; reserve isolcpus for running real-time processes on dedicated CPUs where you genuinely don't want the kernel scheduler rebalancing CPUs.
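To make the distinction concrete, a sketch of the two mechanisms discussed above (values are illustrative for a 16-CPU host; older systemd versions take a space-separated CPU list):

```
# /etc/systemd/system.conf -- confine host OS services to CPUs 0-3,
# leaving 4-15 free for pods; the kernel scheduler still load-balances
# pod processes across 4-15.
[Manager]
CPUAffinity=0 1 2 3

# Kernel command line alternative -- also removes CPUs 4-15 from
# scheduler load balancing, so it is only appropriate for manually
# pinned real-time workloads:
isolcpus=4-15
```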

Member Author

I raised the former because not all distros running kube were on systemd. Agreed with the above.


Even if a distro is not running systemd, it should never use isolcpus= unless the containers are intended to run real-time tasks where each individual process is manually pinned to a suitable host CPU. Non-systemd distros would have to configure the cgroups cpuset controller in some custom manner to set up isolation.


**TODO**

1. how should `kubelet` discover the reserved `cpu-set` value?


It could infer it based on the CPU mask it is running with itself, e.g. if you have CPUAffinity=0-3 on a 16-CPU system, then kubelet would see its affinity as 0-3, and so it can infer that 4-15 are the ones reserved for pod usage.
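A rough sketch of that inference (reading the kubelet's own affinity from procfs; not actual kubelet code):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// ownCPUAffinity returns the cpuset List Format string the current
// process is confined to, e.g. "0-3" when started under CPUAffinity=0-3.
func ownCPUAffinity() (string, error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return "", err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "Cpus_allowed_list:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "Cpus_allowed_list:")), nil
		}
	}
	return "", fmt.Errorf("Cpus_allowed_list not found: %v", sc.Err())
}

func main() {
	affinity, err := ownCPUAffinity()
	if err != nil {
		panic(err)
	}
	// On a 16-CPU host started with CPUAffinity=0-3 this prints "0-3";
	// the complement (4-15) would then be treated as reserved for pods.
	fmt.Println(affinity)
}
```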

Member Author

Yeah, I was hoping to avoid the need for additional flags; will think on this some.

* `strict`

If `strict`, all pods that match this taint must request CPU (whole or fractional cores) that
fit a single NUMA node `cpu` allocatable.


This feels somewhat inflexible wrt potential future expansion. It would be better to express a desired NUMA node count, e.g. NUMANodes=N, where N is one or more, to indicate how many NUMA nodes the pod is willing/able to consume. For example, an application inside a container may be capable of dividing its workload into 2 pieces, so it could deal with being spread across 2 NUMA nodes; it could then request NUMANodes=1,2 to indicate it is able to deal with 1 or 2 NUMA nodes, but not more than that.

// The values are expressed in the List Format syntax specified here:
// here: http://man7.org/linux/man-pages/man7/cpuset.7.html
// +optional
CPUAffinity string


What is responsible for setting the NUMANodeID / CPUAffinity values? Is that to be set by the end user directly, or set indirectly by the scheduler when making placement decisions? If the latter, then that's ok, but the former is not a good idea.

Member Author

It's intended to be set by a scheduling agent of some kind.

Member

Stupid question: are the identifiers used in this list global across the machine, or local per NUMA node? Is there any reason why you would specify CPUAffinity without specifying NUMANodeID? (I assume a scheduling agent would know both aspects of the topology.)

Member Author

The identifiers are global across the machine.

I actually debated not including NUMANodeID and just having CPUAffinity, and am open to discussion on that. If NUMANodeID is set, CPUAffinity is either the 1-N cores on that node or some subset.

a specific NUMA node, and the set of affined CPUs are shared among containers.

Pod level cgroups are used to actually affine the container to the specified
CPU set.


Is there a proposal or other reference on how pod-level cgroups will work and programs can understand their assignments?

We've had a long-standing discussion around how a C++ program can discover/understand its resource boundaries so that it can operate properly within them. A classic third-party example is a Java JVM auto-discovering how many GC threads it can create and how much memory it's been allocated. It would be desirable to have a path for codebases that previously assumed they had to manage a whole machine's worth of resources, and that set their own scheduler affinities today, to understand their partial-machine resource set so they can make their own proper suballocations.

Member

@lcarstensen They could either feed in from ENV vars, or IIRC there is a mount trick that is done to allow them to see the world through a cgroup-i-fied lens.

Member Author

@lcarstensen -- recommend using the downward api in that context: kubernetes/website#591
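For reference, the downward API pattern being pointed to exposes a container's own requests/limits as environment variables; a minimal illustrative pod fragment (the container name is hypothetical):

```yaml
# Expose the container's own resource limits to the process via env vars,
# so a runtime (e.g. a JVM) can size thread pools and heaps accordingly.
env:
- name: CPU_LIMIT
  valueFrom:
    resourceFieldRef:
      containerName: app        # hypothetical container name
      resource: limits.cpu
- name: MEMORY_LIMIT
  valueFrom:
    resourceFieldRef:
      containerName: app
      resource: limits.memory
```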

@ConnorDoyle (Contributor) left a comment

Nice work on this write up.

* `strict`
* `preferred`

If `strict`, all pods that match this taint must request `memory` that fits it's assigned
Contributor

s/it's/its

NUMA node `memory` allocatable.

If `preferred`, all pods that match this taint are not required to have their `memory` request
fit it's assigned NUMA node `memory` allocatable.
Contributor

s/it's/its

* Potential values:
* `dedicated`

If `dedicated`, all pods that match this taint will require dedicated compute resources. Each
Contributor

nit (here and repeated below): s/match this taint/tolerate this taint




The `kubelet` will enforce the presence of the required pod tolerations assigned to the node.

The `kubelet` will pend the execution of any pod that is assigned to the node, but has
Contributor

+1 for how the proposal solves this problem.

Member

Does Kubelet reject the pod if the specified NUMANodeID and/or CPUAffinity are already in use by an existing pod?

Member Author

Yes, it would reject the pod if there was a collision in the case where you asked for dedicated cores and were not getting them.


This proposal recommends that the `NodeStatus` is augmented as follows:

```
Contributor

use "```go"?


This proposal describes enhancements to the Kubernetes API to improve the
utilization of compute resources for containers that have a need to avoid
cross NUMA node memory access by containers.


s/by containers//


#### CPUAffinity

* Effect: `NoScheduleNoAdmitNoExecute`


This might have to be represented as two taints here and elsewhere. See kubernetes/kubernetes#28082 for details.

Member Author

@balajismaniam ack, was talking with @aveshagarwal about this in the background. In addition, just using NoScheduleNoAdmit would be sufficient if we drain nodes before applying the taint.


* If the toleration `CPUAffinity` is present on a `Pod`, the pod will not start
any associated container until the `Pod.Spec.CPUAffinity` is populated.
* If the toleration `NUMAAffinity` is present on a `Pod`, the pod will not start


s/NUMAAffinity/NUMACPUAffinity/

}

// NUMATopology describes the NUMA topology of a node.
type NUMATopology struct {
Contributor

Following the convention of "labels everywhere", would you embed k8s.io/kubernetes/pkg/apis/meta/v1.TypeMeta here?

Member Author

TypeMeta is more appropriate for top-level objects in the API; the topology is embedded in Node. Maybe I am missing a use case you had for labeling NUMA nodes within a Node? Isn't the ID sufficient?

Contributor

Sorry for not including the use case. I was thinking of representing shared (but not consumable) per-numa-node resources like device locality. Anyway, in that case it would be more appropriate as part of NUMANode.
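Pulling together the type fragments quoted throughout this review, the proposed NodeStatus addition reads roughly as follows (reassembled for readability; the containing Nodes field and exact field grouping are inferred, so treat this as a sketch rather than the authoritative proposal text):

```go
// NUMATopology describes the NUMA topology of a node.
type NUMATopology struct {
	// Nodes is the set of NUMA nodes on the machine (field name assumed).
	Nodes []NUMANode
}

// NUMANode describes a single NUMA node.
type NUMANode struct {
	// Identifies a NUMA node on a single host.
	NUMANodeID string
	// Capacity represents the total resources associated to the NUMA node.
	// cpu: 4 <number of cores>
	// memory: <amount of memory in normal page size>
	// hugepages: <amount of memory in huge page size>
	Capacity ResourceList
	// Allocatable represents the resources of the NUMA node available
	// for scheduling.
	Allocatable ResourceList
	// CPUSet represents the physical numbers of the CPU cores
	// associated with this node.
	// Example: 0-3 or 0,2,4,6
	// The values are expressed in the List Format syntax specified
	// here: http://man7.org/linux/man-pages/man7/cpuset.7.html
	CPUSet string
}
```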

## Future considerations

1. Author `NUMATopologyPredicate` in scheduler to enable NUMA aware scheduling.
1. Restrict vertical autoscaling of CPU and NUMA affined workloads.
Contributor

Does this have to do with limiting dynamic logical core assignments, or specifically fixing the number of cores allocated for the lifetime of a pod? If the latter, what's the rationale for it? The default vertical auto-scaler (non-numa-aware) could emit impossible allocations (spanning sockets, fractional cores, etc) but you could also imagine a numa-aware vertical auto-scaler that can do the right thing.

Member Author

Maybe I should phrase this differently. I mean to say that we should not just auto-scale these things ignorant of NUMA.

Contributor

sgtm

// Identifies a NUMA node on a single host.
NUMANodeID string
// Capacity represents the total resources associated to the NUMA node.
// cpu: 4 <number of cores>
Contributor

Clarification: is this intended to represent the logical CPU ids as reported by Linux, or only the number of physical cores on the socket?

Member Author

I had intended for this to align with what is returned by the following:

$ sudo lscpu | grep NUMA
NUMA node(s):          1
NUMA node0 CPU(s):     0-3

So in that case it was logical and not physical cores. This is consistent with what we do today when reporting CPU counts in cAdvisor via $ cat /proc/cpuinfo | grep 'processor\t*' | wc -l

I agree we will need to know what the siblings are in the NUMA topology.

Member

> this was consistent with what we do today when reporting CPU counts in cAdvisor

Is this (logical) also what is returned in "cpu" in Allocatable and Capacity in NodeStatus?

Member Author

@davidopp -- yes, the node returns logical cores in "cpu" in Allocatable and Capacity.

Allocatable ResourceList
// CPUSet represents the physical numbers of the CPU cores
// associated with this node.
// Example: 0-3 or 0,2,4,6
Contributor

Clarification: are these limited to the physical cpu ids, and if so is it planned to support scheduling onto their hyperthread siblings?

Contributor

FWIW, we have heard from some users that they would like to consume a pair of hyperthreads from one physical core.

Member Author

This was a logical core listing; we will need to let you know what the siblings are.

Member

For simplicity, IMO we should bound affinity to the core. I question the cost/benefit of allocating to hyperthreads.


For the 90% general usage, I think it is fine to ignore the sockets/cores/threads distinction and just treat all logical cpus as having equivalent "power". In the more niche use cases there are definitely cases where considering threads as special will help. There are applications which are very latency sensitive and benefit from having their threads run on a pair of hyperthread siblings in order to maximise cache hits. Conversely, there are applications which absolutely don't ever want to be on hyperthread siblings because they'll contend on the same shared resources.

That all said, I think it is possible to retrofit support for thread at a later date. It simply requires adding a new array attribute later listing the thread sibling IDs. So this simple list of logical CPU IDs is sufficient granularity for a first impl.
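For reference, the kernel already exposes sibling information per logical CPU via sysfs, which such an attribute could be populated from (the output shown is from a hypothetical hyperthreaded host):

```
$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0,20
$ cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list
1,21
```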

Contributor

@berrange but why wait to address this use case? Most of the users we have talked to that need core pinning are the extreme low-latency people. Without this level of topology awareness the feature will not meet their needs and we'll still have users hacking around the problem.

Member

Is cpu in Capacity always guaranteed to be equal to the number of cores listed here? (If not, why not?)

Member Author

@davidopp -- the list of cpus should always match the number of cores in allocatable. If the kubelet itself is pegged to a particular core, it's possible that capacity != allocatable.


// NUMANode describes a single NUMA node.
type NUMANode struct {
// Identifies a NUMA node on a single host.
Member

FWIW the way this is structured would complicate the scheduler, which currently rips through capacities for most of its matching. If it's possible to expand capacities, that might make this easier.

Member Author

@timothysc -- I am open to ideas on how to flatten this, but we still need the topology to know what cores are in which node, the siblings, and (possibly) their distances in future.



// the scheduler simply schedules this pod onto that node, assuming that it fits resource
// requirements.
// +optional
NodeName string
Member

Why did you list this here? Are you planning to somehow use the existing NodeName field for NUMA node scheduling? It wasn't clear to me from the comment (which appears to be identical to the existing NodeName comment).

Member Author

I listed it here just to be clear that NUMA- and CPU-affined workloads are a triplet of node name + NUMA node + cpuset.

The only thing that is new is the NUMA node ID and CPU affinity values.

* If the toleration `NUMAAffinity` is present on a `Pod`, the pod will not start
any associated container until the `Pod.Spec.NUMANodeID` is populated.

The delayed execution of the pod enables both a single and dual-phase scheduler to
Member

What do you mean by "single and dual-phase scheduler"? (And what do you mean by "delayed execution of the pod"?)

Member Author

It seems that there is some debate in the community on whether the cluster scheduler assigns pods to cpu cores and NUMA nodes, or whether a node-local agent schedules them. We had some of this debate at KubeCon. When I mention a single or dual-phase scheduler, I mean either a scheduler that does machine+NUMA aware scheduling (single) or a pair of schedulers where one does machine assignment and another does NUMA assignment. Personally, I am biased to the single approach, but I got the sense there was some debate in this space.




1. in a numa system, `kubelet` reservation for memory needs to be removed from
a particular numa node capacity so numa node allocatable is as expected.

### Configuring Taints
Member

I didn't understand the reason for the taints. I suspect there is some misunderstanding about the intention of taints/tolerations, but I'm not clear enough on why you are using them to be able to judge. Can you explain what problem(s) you are trying to solve?

Member Author

The problems:

  1. nodes that support dedicated cpu cores are special
  2. pods that require dedicated cpu cores are special
  3. needed a way to distinguish consumption of a dedicated core as distinct/special and did not want to change resource model

I basically chose to use taints/tolerations as a means of decorating the resource description and scheduling requirement.
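As an illustration of that decoration, a node taint and matching pod toleration might look roughly like this (the key and value follow the proposal's names, but the manifest shape and the standard NoSchedule effect are placeholders, since the exact effect is still under discussion above):

```yaml
# Node: marked by the operator as offering dedicated cores.
taints:
- key: CPUAffinity
  value: dedicated
  effect: NoSchedule

# Pod: opts in to (and is restricted to) such nodes.
tolerations:
- key: CPUAffinity
  operator: Equal
  value: dedicated
  effect: NoSchedule
```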

// Capacity represents the total resources associated to the NUMA node.
// cpu: 4 <number of cores>
// memory: <amount of memory in normal page size>
// hugepages: <amount of memory in huge page size>
Member

Or, and I know this is contrarian, we don't. Or maybe we don't YET. We cannot represent every facet of every hardware system.

// The values are expressed in the List Format syntax specified here:
// here: http://man7.org/linux/man-pages/man7/cpuset.7.html
// +optional
CPUAffinity string
Member

^^^ this too.

// This value is only set if either the `CPUAffinity` or `NUMACPUAffinity` tolerations
// are present on the pod.
// +optional
NUMANodeID string
Member

I've said before and I still assert - this is a terrible plan. Offering up this much control is going to kill off any chances we have of actually doing the right thing automatically for most users. This is an attractive nuisance (https://en.wikipedia.org/wiki/Attractive_nuisance_doctrine) - people are going to get hurt by it. If the borg experience counts for anything, the lesson is that saying "yes" to things like this turned out terribly, and is a drag on the system.

I proposed a less first-class alternative. Something like: annotate pods and run a highly-privileged agent on the node that reacts to pods being scheduled by programming cpusets, calling numactl, and whatever else it needs to do. Make this VERY explicit to use and DO NOT make it part of the API.

If we need to collect NUMA status and use something about NUMA as a hint to scheduling, I can MAYBE see this, maybe even a different scheduler entirely.

If we need to sell "dedicated caches", then we should add that as a resource. If I need NUMA because I want an LLC to myself, let me spec that.

If we need to sell "link to high-perf IO device", then we should add that as a resource.

I'm trying to not overreact, but I have a visceral response to this.

@derekwaynecarr (Member Author) commented Dec 23, 2016 via email

@thockin (Member) commented Dec 23, 2016 via email

## Future considerations

1. Author `NUMATopologyPredicate` in scheduler to enable NUMA aware scheduling.
1. Restrict vertical autoscaling of CPU and NUMA affined workloads.

I think this proposal assumes that we run workloads on monolithic hardware.

Our use case is different.
We have servers with 1, 2, or 4 NUMA nodes. They have different CPU models (different numbers of cores) and different NUMA topologies. For example:
Server 1 has topology:

NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39

Server 2 has topology:

NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39

Consider a scenario:

We want to run workloads that perform best if they use all CPUs on one NUMA node.
If the server has 4 NUMA nodes, we will run 4 workloads on them.
It will be difficult to run this scenario with the proposed implementation.

Can we try a different approach?

Add a capability to specify a CPUset for Opaque Integer Resources. One Opaque Integer Resource could be predefined (let's call it "numa"). When a user requests OIR numa 1, the scheduler would specify CPUset 0-9,20-29 on server-1 from my example, and 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40 on server-2.

This approach would also solve more advanced use cases. Let's say we want to run a workload that is pinned to one CPU that is close to a specific PCI device. On one server this CPU may be located on numa0, on another on numa1. In this case a user defines a new OIR (let's call it "fast-pci") and specifies a CPUset for this OIR. It would be different for different servers. The scheduler would pin the workload to the CPU close to the PCI device on each server.

I assume that the OIR definition would be performed by a Node-Feature-Discovery pod (as defined in kubernetes/kubernetes#10570 (comment)).
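For context, an opaque integer resource request of that era looked roughly like the following (assuming the alpha OIR naming prefix; the "numa" resource name is the hypothetical one proposed above):

```yaml
resources:
  requests:
    pod.alpha.kubernetes.io/opaque-int-resource-numa: "1"
  limits:
    pod.alpha.kubernetes.io/opaque-int-resource-numa: "1"
```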

Member Author

@jakub-d -- I don't think NUMA should be handled via OIR, but I see value in letting you have NUMA node capacity as a counted resource on the node. I will think about how to handle that in this proposal. I think the major tension on the proposal is just the presence of NUMANodeID/CPUAffinity on the pod declaration, which I think I would still want even with your proposed model. Thanks for the scenario feedback!

@derekwaynecarr (Member Author)

Recording a pointer to kubernetes/kubernetes#39818, which means to me that a Toleration alone is not enough to provide additional meaning to a resource, as there is no mechanism (at this time) to assert that the Toleration must be coupled with a node with a matching taint.

@ConnorDoyle (Contributor)

This discussion seems to have stalled a bit.

@thockin: to grossly oversimplify your concerns about this proposal (please correct if this is wrong), you are opposed to exposing the NUMANodeID and CPUAffinity fields to user pod submissions?

What if these fields were only settable by a privileged agent like you described? That way those fields become part of the API between the system and the cluster operator. Users would have to specify something higher-level. Examples of higher-level could be anything from "I am a workload of class XYZ" to "I need to deliver 95% tail response latency of 100ms or less per my SLA".

@derekwaynecarr (Member Author)

@ConnorDoyle -- I need to make an update to this proposal around its usage of tolerations, but I agree that the fundamental issue is around those two fields. In addition, I am wondering if people feel differently about "exclusive cores" versus NUMA/CPU affinity. In that mode, we still want something similar to CPUAffinity to provide the contract, but the NUMANodeID would not necessarily be relevant.

@vishh (Contributor) commented Jan 17, 2017 via email

@ConnorDoyle (Contributor) commented Jan 17, 2017

@derekwaynecarr good point. Following the above, if NUMA is what the user requires, they could say so in a way that the operator / system understands and the privileged agent could set up CPUAffinity accordingly.

@vishh It sounds like most people are on board with that goal. Teaching Kubelet to auto-tune settings to uphold SLAs is a much harder problem than simply allowing operators to provide some pre-configured settings on behalf of users. Also, it sounds like there may be some scenarios that Kubelet won't commit to supporting. In the meantime, and for the latter cases, could we build a bridge that avoids exposing too much detail (as per @thockin's comments)? Landing the low-level mechanism described in this proposal could be the groundwork for the default controller that ships w/ core to implement those higher level things.

@vishh (Contributor) commented Jan 17, 2017

@ConnorDoyle

> Teaching Kubelet to auto-tune settings to uphold SLAs is a much harder problem than simply allowing operators to provide some pre-configured settings on behalf of users.

It is better to make Kubernetes a bit more complex than exposing that complexity to end users. Also, we are working on providing extensible isolators in the node, which would allow for varying levels of isolation.
And, I don't really see how cpu & numa allocation can be pre-configured? It has to be dynamic right?

> Landing the low-level mechanism described in this proposal could be the groundwork for the default controller that ships w/ core to implement those higher level things.

The uncertainty around this proposal indicates that we should iterate on this feature rather than make API decisions now. In that regard, let's avoid exposing more knobs unless necessary.
In addition to that, let's try to experiment with this feature using a kubelet extension rather than make upstream changes right away.

@thockin (Member) commented Jan 18, 2017 via email

@ghost commented Jan 18, 2017 via email

@vishh (Contributor) commented Jan 18, 2017

@thockin IIUC, I think we are in agreement. Like I mentioned earlier, we need to express an intent and have NUMA and core allocation be one of the means to achieve that intent. The intent could be that a pod is "Latency sensitive".
At the same time, if we let the intent become too loose (similar to a broad scale that @quinton-hoole mentioned), users might not be able to consume such an API at all.

IIUC, this feature is primarily meant to tackle workloads that are sensitive to scheduling and memory access latencies. So let's try to model based on that "need" rather than the means.

@derekwaynecarr (Member Author) commented Jan 20, 2017 via email

@jakub-d commented Mar 17, 2017

Any updates on this subject?


## What is NUMA?

Non-uniform memory architecture (NUMA) describes multi-socket machines that
Contributor

Non-uniform memory architecture => Non-Uniform Memory Architecture (NUMA)
Very minor change, enhances readability.


A in NUMA is "Access" https://en.wikipedia.org/wiki/Non-uniform_memory_access

Should it be "Non-Uniform Memory Access (NUMA) architecture" ?

@derekwaynecarr (Member Author)

For reference, I plan to update this proposal to align w/ more recent discussions in our resource mgmt workgroup. I hope to have an update for review in time for our planned f2f on 5/1/2017.

@derekwaynecarr (Member Author)

I am closing this proposal given the outcome of the resource mgmt f2f where we agreed on a different high-level approach.

@gmarkey commented May 12, 2017

@derekwaynecarr are you able to share more details about the F2F?

@vishh (Contributor) commented May 12, 2017 via email

@teferi commented May 23, 2017

A relevant formal proposal to k8s/community #654
