Proposal: CPU Affinity and NUMA Topology Awareness #171
Conversation
> An interconnect bus provides connections between nodes so each CPU can
> access all memory. The interconnect can be overwhelmed by concurrent
> cross-node traffic, and as a result, processes that need to access memory
> on a different node can experience increased latency.
NB, the increased latency of cross-node memory access exists regardless of whether the interconnect is overwhelmed or not - remote node memory access always has a penalty based on the distance between nodes. For example, if local node distance is 10 and distance to a remote node is 15, then memory access will always be at least 1.5x slower.
@berrange - agreed. will clean up text.
maybe "a process running on a different node from a memory bank it needs to access can experience increased latency"
As a general point, I think the huge page proposal here, kubernetes/kubernetes@68b82f3, needs to be folded into this proposal, as NUMA and huge pages need to be modeled together in order to get a long-term, future-proof design.
> // Capacity represents the total resources associated to the NUMA node.
> // cpu: 4 <number of cores>
> // memory: <amount of memory in normal page size>
> // hugepages: <amount of memory in huge page size>
This needs much more work to correctly future-proof it to cope well with huge pages. Historically x86 only had one huge page size (2MB or 4MB depending on kernel address mode), but these days you can also have 1 GB huge pages. On the PPC architecture there are a huge number of page sizes available. Any single host can have multiple page sizes available concurrently.
So memory needs to be represented as a struct in its own right, e.g. something more like:

type PageInfo struct {
	PageSize uint   // size of the pages described by this entry
	Count    uint64 // total pages of this size on the node
	Used     uint64 // pages of this size currently in use
}

Then the node would have an array of this info: Memory []PageInfo
The 'cpu: 4' count seems redundant given that you have a CPUSet string below from which you could easily determine the count if represented as an array instead of string.
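Putting the two suggestions together, the per-node shape might look roughly like the following; this is only an illustrative sketch of berrange's idea, and none of these field names are from the proposal itself.

```go
// Illustrative only: a NUMA node carrying per-page-size memory info (the
// PageInfo struct suggested above) and an explicit slice of logical CPU IDs
// instead of a cpuset string, which also makes the separate cpu count redundant.
type NUMANode struct {
	NUMANodeID string
	CPUs       []int      // logical CPU IDs local to this node; len(CPUs) gives the count
	Memory     []PageInfo // one entry per page size available on this node
}
```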
generally, we have discussed the need for a mechanism to describe the attributes of a resource better than what we have in the current model. usage information fluctuates, so we tend not to capture it in the node status and instead would use a monitoring pipeline for that information. the pagesize will need to be reflected; it's possible the convention will be `hugepages.<size>: X`. Mostly trying to capture the relationship between NUMA and hugepages that you have called out.
I think I agree with Derek here, they could be represented as separate entries in the resource list. Given that pagesize could be represented by a conventional resource name, `nr_hugepages` and `free_hugepages` can be derived from the values in capacity and allocatable.

Also agree that the better way to address the need for more expressive resource types is to iterate the resource model versus working around it with many custom fields. By the same token, the CPUSet field could be represented as a Range type if it were available.
As long as there's a way to represent multiple different huge page sizes concurrently I'm ok - if we want to use a flat structure and so have multiple `hugepages.<size>: X` entries, that'll work fine.
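As a purely illustrative aside, the flat per-size convention being agreed to here might surface in a node's capacity along these lines; the exact key format is an assumption taken from the `hugepages.<size>` suggestion above, not settled API.

```go
package main

import "fmt"

func main() {
	// Hypothetical flat capacity listing with one entry per huge page size,
	// alongside the existing cpu and memory resources. Key names follow the
	// hugepages.<size> convention floated earlier in this thread.
	capacity := map[string]string{
		"cpu":           "4",
		"memory":        "16Gi",
		"hugepages.2MB": "512", // count of 2MB pages
		"hugepages.1GB": "4",   // count of 1GB pages
	}
	fmt.Println(capacity)
}
```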
Or, and I know this is contrarian, we don't. Or maybe we don't YET. We cannot represent every facet of every hardware system.
@thockin Amen. I'm kind of chuckling as I read some of this (and the huge page proposal) thinking how much crazy code went into OpenStack Nova for all of this NFV (or NFV-centric) functionality. It made (and continues to make) the scheduler and resource tracking subsystems in Nova a total pain in the behind to read and work with.
Also worth noting: what is the plan to functionally test all of this in a continuous manner? Who will provide the hardware and systems? This is something we spent literally years trying to get companies to contribute to, and even now we still struggle to get stable third-party CI systems integrated.
There are and will be some workloads that we don't handle perfectly. I can live with that.
> // Example: 0-3 or 0,2,4,6
> // The values are expressed in the List Format syntax specified
> // here: http://man7.org/linux/man-pages/man7/cpuset.7.html
> CPUSet string
Using the string syntax is user friendly, but not very machine friendly, as all operations would have to convert to a bitmap to do any useful calculations. IOW, it'd be better to represent it as something like `[]int` storing the CPU numbers in that node.
i assume we would have internal utilities to convert to []int
Yes, you could provide utility functions to convert to a []int format, but IMHO it is preferable to store the data in a canonical format in the first place and avoid the need to convert it everywhere
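For reference, converting the kernel's List Format into CPU IDs is a small utility; a rough sketch of the kind of helper being discussed (not code from the proposal) could look like this:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList converts a cpuset List Format string such as "0-3,8,10-11"
// into a slice of logical CPU IDs.
func parseCPUList(list string) ([]int, error) {
	var cpus []int
	for _, part := range strings.Split(list, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		lo, hi, isRange := strings.Cut(part, "-")
		if !isRange {
			hi = lo // single CPU entry, e.g. "8"
		}
		start, err := strconv.Atoi(lo)
		if err != nil {
			return nil, err
		}
		end, err := strconv.Atoi(hi)
		if err != nil {
			return nil, err
		}
		for cpu := start; cpu <= end; cpu++ {
			cpus = append(cpus, cpu)
		}
	}
	return cpus, nil
}

func main() {
	cpus, err := parseCPUList("0-3,8")
	if err != nil {
		panic(err)
	}
	fmt.Println(cpus) // [0 1 2 3 8]
}
```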
> By default, load balancing is done across all CPUs, except those marked isolated
> using the kernel boot time `isolcpus=` argument. When configuring a node to support
> CPU and NUMA affinity, many operators may wish to isolate host processes to particular
> cores.
FWIW, the isolcpus= kernel argument is not suitable for general purpose isolation of host OS processes from container/pod processes because it has two effects, only one of which is desirable. Aside from isolating the host processes, it also disables scheduler load balancing on those isolated CPUs. If you have a container that is placed on isolated CPUs, the processes in that container will never move between available CPUs - they'll forever run on the first CPU on which they were placed, even if all other CPUs are idle.

The systemd CPUAffinity setting should pretty much always be used instead - reserve isolcpus= for running real-time processes on dedicated CPUs where you genuinely don't want the kernel scheduler rebalancing CPUs.
i raised the former because not all distros running kube were on systemd. agreed with above.
Even if a distro is not running systemd, they should never use isolcpus= unless the containers are intended to run real-time tasks where each individual process is manually pinned to a suitable host CPU. Non-systemd distros would have to configure the cgroups cpuset controller in some custom manner to set up isolation.
> **TODO**
>
> 1. how should `kubelet` discover the reserved `cpu-set` value?
It could infer it based on the CPU mask it is running with itself. E.g. if you have CPUAffinity=0-3 on a 16-CPU system, then kubelet would see its own affinity as 0-3, and so it can infer that 4-15 are the ones reserved for pod usage.
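As a rough sketch of that inference (assuming the kubelet process itself is launched with the restricted affinity; this is illustrative and not part of the proposal):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// reservedAndPodCPUs inspects the calling process's own CPU affinity mask,
// treats those CPUs as host-reserved, and returns the remainder as available
// for pods. totalCPUs would come from machine discovery (e.g. cAdvisor),
// since the process's own mask cannot reveal CPUs it is excluded from.
func reservedAndPodCPUs(totalCPUs int) (reserved, pods []int, err error) {
	var set unix.CPUSet
	if err := unix.SchedGetaffinity(0, &set); err != nil { // pid 0 == self
		return nil, nil, err
	}
	for cpu := 0; cpu < totalCPUs; cpu++ {
		if set.IsSet(cpu) {
			reserved = append(reserved, cpu)
		} else {
			pods = append(pods, cpu)
		}
	}
	return reserved, pods, nil
}

func main() {
	reserved, pods, err := reservedAndPodCPUs(16)
	if err != nil {
		panic(err)
	}
	fmt.Println("host-reserved:", reserved, "available to pods:", pods)
}
```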
yeah, i was hoping to avoid the need for additional flags, will think on this some.
> * `strict`
>
> If `strict`, all pods that match this taint must request CPU (whole or fractional cores) that
> fit a single NUMA node `cpu` allocatable.
This feels somewhat inflexible wrt potential future expansion. It would be better to express a desired NUMA node count, e.g. NUMANodes=N, where N is one or more, to indicate how many NUMA nodes the pod is willing / able to consume. E.g. an application inside a container may be capable of dividing its workload into 2 pieces, so it could deal with being spread across 2 NUMA nodes, and thus could request NUMANodes=1,2 to indicate it is able to deal with 1 or 2 NUMA nodes, but not more than that.
> // The values are expressed in the List Format syntax specified here:
> // here: http://man7.org/linux/man-pages/man7/cpuset.7.html
> // +optional
> CPUAffinity string
What is responsible for setting the NUMANodeID / CPUAffinity values? Is that to be set by the end user directly, or to be set indirectly by the scheduler when making placement decisions? If the latter, then that's ok, but the former is not a good idea.
it's intended to be set by a scheduling agent of some kind.
Stupid question: are the identifiers used in this list global across the machine, or local per NUMA node? Is there any reason why you would specify CPUAffinity without specifying NUMANodeID? (I assume a scheduling agent would know both aspects of the topology.)
the identifiers are global across the machine.
i actually debated not including NUMANodeID and just having CPUAffinity, and am open to discussion on that. If NUMANodeID is set, CPUAffinity is either all of the cores on that node, or some subset.
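For what it's worth, the consistency check this implies is simple to sketch; the names below are illustrative only, and in the proposal both values are cpuset list-format strings that would need to be parsed into CPU IDs first:

```go
package main

import "fmt"

// validateAffinity checks that a pod's requested CPU affinity is a subset of
// the CPUs belonging to its assigned NUMA node.
func validateAffinity(podCPUs, nodeCPUs []int) error {
	nodeSet := make(map[int]bool, len(nodeCPUs))
	for _, cpu := range nodeCPUs {
		nodeSet[cpu] = true
	}
	for _, cpu := range podCPUs {
		if !nodeSet[cpu] {
			return fmt.Errorf("cpu %d is not part of the assigned NUMA node", cpu)
		}
	}
	return nil
}

func main() {
	// The NUMA node owns CPUs 0-3; the pod asks for 2-3, which passes.
	fmt.Println(validateAffinity([]int{2, 3}, []int{0, 1, 2, 3}))
}
```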
> a specific NUMA node, and the set of affined CPUs are shared among containers.
>
> Pod level cgroups are used to actually affine the container to the specified
> CPU set.
Is there a proposal or other reference on how pod-level cgroups will work and programs can understand their assignments?
We've had a long-standing discussion around how a C++ program can discover/understand its resource boundaries so that it can operate properly within them. A classic third-party example is a Java JVM auto-discovering how many GC threads it can create and how much memory it's been allocated. It would be desirable to have a path for codebases that may have previously assumed they had to manage a whole machine's worth of resources, and that handle setting their own scheduler affinities today, to be able to understand their partial-machine resource set so they can make their own proper suballocations.
@lcarstensen They could either feed it in from ENV vars, or iirc there is a mount trick that is done to allow them to see the world through a cgroup-i-fied lens.
@lcarstensen -- recommend using the downward api in that context: kubernetes/website#591
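As one concrete illustration of the cgroup-based discovery mentioned above (assuming a cgroup v1 host where the container sees its own cgroup at the conventional mount point, per the "mount trick"; this is a sketch, not part of the proposal):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// cpusetFromCgroup reads the CPUs assigned to this process via the cgroup v1
// cpuset controller. The path assumes the container's own cgroup is visible
// at the conventional mount point, which may differ between setups.
func cpusetFromCgroup() (string, error) {
	data, err := os.ReadFile("/sys/fs/cgroup/cpuset/cpuset.cpus")
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

func main() {
	cpus, err := cpusetFromCgroup()
	if err != nil {
		fmt.Println("could not read cpuset:", err)
		return
	}
	fmt.Println("assigned CPUs (list format):", cpus)
}
```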
Nice work on this write up.
> * `strict`
> * `preferred`
>
> If `strict`, all pods that match this taint must request `memory` that fits it's assigned
s/it's/its
> NUMA node `memory` allocatable.
>
> If `preferred`, all pods that match this taint are not required to have their `memory` request
> fit it's assigned NUMA node `memory` allocatable.
s/it's/its
> * Potential values:
> * `dedicated`
>
> If `dedicated`, all pods that match this taint will require dedicated compute resources. Each
nit (here and repeated below): s/match this taint/tolerate this taint
> The `kubelet` will enforce the presence of the required pod tolerations assigned to the node.
>
> The `kubelet` will pend the execution of any pod that is assigned to the node, but has
+1 for how the proposal solves this problem.
Does Kubelet reject the pod if the specified NUMANodeID and/or CPUAffinity are already in use by an existing pod?
yes, it would reject if there was a collision in the case that you asked for dedicated cores, and were not getting them.
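A rough sketch of that collision check on the node (illustrative only; how the kubelet would actually track exclusively assigned cores is not specified here):

```go
package main

import "fmt"

// hasCollision reports whether any CPU requested by a new dedicated pod is
// already exclusively assigned to another pod on this node.
func hasCollision(requested []int, assigned map[int]string) (string, bool) {
	for _, cpu := range requested {
		if pod, ok := assigned[cpu]; ok {
			return fmt.Sprintf("cpu %d already dedicated to pod %s", cpu, pod), true
		}
	}
	return "", false
}

func main() {
	assigned := map[int]string{0: "pod-a", 1: "pod-a"} // cores already handed out exclusively
	if msg, collision := hasCollision([]int{1, 2}, assigned); collision {
		fmt.Println("reject:", msg) // the kubelet would reject the incoming pod
	}
}
```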
> This proposal recommends that the `NodeStatus` is augmented as follows:
>
> ```
use "```go"?
> This proposal describes enhancements to the Kubernetes API to improve the
> utilization of compute resources for containers that have a need to avoid
> cross NUMA node memory access by containers.
s/by containers//
> #### CPUAffinity
>
> * Effect: `NoScheduleNoAdmitNoExecute`
This might have to be represented as two taints here and elsewhere. See kubernetes/kubernetes#28082 for details.
@balajismaniam ack, was talking with @aveshagarwal about this in the background. in addition, just using `NoScheduleNoAdmit` would be sufficient if we drain nodes before applying the taint.
> * If the toleration `CPUAffinity` is present on a `Pod`, the pod will not start
>   any associated container until the `Pod.Spec.CPUAffinity` is populated.
> * If the toleration `NUMAAffinity` is present on a `Pod`, the pod will not start
s/NUMAAffinity/NUMACPUAffinity/
> }
>
> // NUMATopology describes the NUMA topology of a node.
> type NUMATopology struct {
Following the convention of "labels everywhere", would you embed k8s.io/kubernetes/pkg/apis/meta/v1.TypeMeta here?
TypeMeta is more appropriate for top-level objects in the API; the topology is embedded in Node. Maybe I am missing a use case you had for labeling of numa nodes within a Node? Isn't the id sufficient?
Sorry for not including the use case. I was thinking of representing shared (but not consumable) per-numa-node resources like device locality. Anyway, in that case it would be more appropriate as part of NUMANode.
> ## Future considerations
>
> 1. Author `NUMATopologyPredicate` in scheduler to enable NUMA aware scheduling.
> 1. Restrict vertical autoscaling of CPU and NUMA affined workloads.
Does this have to do with limiting dynamic logical core assignments, or specifically fixing the number of cores allocated for the lifetime of a pod? If the latter, what's the rationale for it? The default vertical auto-scaler (non-numa-aware) could emit impossible allocations (spanning sockets, fractional cores, etc) but you could also imagine a numa-aware vertical auto-scaler that can do the right thing.
maybe i should phrase this differently. i mean to say that we should not just auto-scale these things ignorant of NUMA.
sgtm
> // Identifies a NUMA node on a single host.
> NUMANodeID string
> // Capacity represents the total resources associated to the NUMA node.
> // cpu: 4 <number of cores>
Clarification: is this intended to represent the logical CPU ids as reported by Linux, or only the number of physical cores on the socket?
i had intended for this to align with what was returned in the following:
$ sudo lscpu | grep NUMA
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
so in that case it was logical and not physical cores. this was consistent with what we do today when reporting CPU counts in cAdvisor via $ cat /proc/cpuinfo | grep 'processor\t*' | wc -l
i agree we will need to know what the siblings are in the numa topology.
> this was consistent with what we do today when reporting CPU counts in cAdvisor

Is this (logical) also what is returned in "cpu" in Allocatable and Capacity in NodeStatus?
@davidopp -- yes. the node returns logical cores in "cpu" in Allocatable and Capacity.
> Allocatable ResourceList
> // CPUSet represents the physical numbers of the CPU cores
> // associated with this node.
> // Example: 0-3 or 0,2,4,6
Clarification: are these limited to the physical cpu ids, and if so is it planned to support scheduling onto their hyperthread siblings?
FWIW, we have heard from some users that they would like to consume a pair of hyperthreads from one physical core.
this was a logical core listing; we will need to let you know what the siblings are.
for simplicity imo we should bound affinity to the core. I question the cost/benefit of allocating to hyperthreads.
For the 90% general usage, I think it is fine to ignore the sockets/cores/threads distinction and just treat all logical cpus as having equivalent "power". In the more niche use cases there are definitely cases where considering threads as special will help. There are applications which are very latency sensitive and benefit from having their threads run on a pair of hyperthread siblings in order to maximise cache hits. Conversely, there are applications which absolutely don't ever want to be on hyperthread siblings because they'll contend on the same shared resources.
That all said, I think it is possible to retrofit support for thread at a later date. It simply requires adding a new array attribute later listing the thread sibling IDs. So this simple list of logical CPU IDs is sufficient granularity for a first impl.
@berrange but why wait to address this use case? Most of the users we have talked to that need core pinning are the extreme low-latency people. Without this level of topology awareness the feature will not meet their needs and we'll still have users hacking around the problem.
Is cpu in Capacity always guaranteed to be equal to the number of cores listed here? (If not, why not?)
@davidopp -- the list of cpus should always match the number of cores in allocatable. if the kubelet itself is pegged to a particular core, it's possible that capacity != allocatable.
> // NUMANode describes a single NUMA node.
> type NUMANode struct {
> // Identifies a NUMA node on a single host.
FWIW the way this is structured would complicate the scheduler, which currently rips through capacities for most of its matching. If it's possible to expand capacities, that might make this easier.
@timothysc -- i am open to ideas on how to flatten, but we still need to know the topology to know what cores are in what node, the siblings, and their distances (possibly) in future.
> // the scheduler simply schedules this pod onto that node, assuming that it fits resource
> // requirements.
> // +optional
> NodeName string
Why did you list this here? Are you planning to somehow use the existing NodeName field for NUMA node scheduling? It wasn't clear to me from the comment (which appears to be identical to the existing NodeName comment).
i listed it here just to be clear that numa and cpu affined workloads are a triplet of node name + numa node + cpuset.
the only thing that is new is the numa node id and cpu affinity values.
> * If the toleration `NUMAAffinity` is present on a `Pod`, the pod will not start
>   any associated container until the `Pod.Spec.NUMANodeID` is populated.
>
> The delayed execution of the pod enables both a single and dual-phase scheduler to
What do you mean by "single and dual-phase scheduler"? (And what do you mean by "delayed execution of the pod"?)
it seems that there is some debate in the community on whether the cluster scheduler assigns pods to cpu cores and numa nodes, or if a node local agent schedules it. we had some of this debate at kubecon. when i mention single or dual phase scheduler, i mean either a scheduler that does machine+numa aware scheduling (single) or a pair of schedulers where one does machine assignment, and another does numa assignment. personally, i am biased to the single approach, but i got the sense there was some debate in this space.
> 1. in a numa system, `kubelet` reservation for memory needs to be removed from
>    a particular numa node capacity so numa node allocatable is as expected.
>
> ### Configuring Taints
I didn't understand the reason for the taints. I suspect there is some misunderstanding about the intention of taints/tolerations, but I'm not clear enough on why you are using them to be able to judge. Can you explain what problem(s) you are trying to solve?
the problems:
- nodes that support dedicated cpu cores are special
- pods that require dedicated cpu cores are special
- needed a way to distinguish consumption of a dedicated core as distinct/special and did not want to change resource model
I basically chose to use taints/tolerations as a means of decorating the resource description and scheduling requirement.
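To make that decoration concrete, a minimal sketch of the shapes involved follows; the key, value, and effect strings are taken from this proposal's discussion (the `CPUAffinity` taint with effect `NoScheduleNoAdmitNoExecute` and the `strict` value) and are not settled API:

```go
package main

import "fmt"

// Illustrative stand-ins for the core Taint/Toleration types.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key, Operator, Value, Effect string
}

func main() {
	// A node configured for CPU-affined workloads advertises the taint...
	nodeTaint := Taint{Key: "CPUAffinity", Value: "strict", Effect: "NoScheduleNoAdmitNoExecute"}
	// ...and only pods carrying a matching toleration are admitted to it.
	podToleration := Toleration{Key: "CPUAffinity", Operator: "Equal", Value: "strict", Effect: "NoScheduleNoAdmitNoExecute"}
	fmt.Println(nodeTaint, podToleration)
}
```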
> // The values are expressed in the List Format syntax specified here:
> // here: http://man7.org/linux/man-pages/man7/cpuset.7.html
> // +optional
> CPUAffinity string
^^^ this too.
> // This value is only set if either the `CPUAffinity` or `NUMACPUAffinity` tolerations
> // are present on the pod.
> // +optional
> NUMANodeID string
I've said before and I still assert - this is a terrible plan. Offering up this much control is going to kill off any chances we have of actually doing the right thing automatically for most users. This is an attractive nuisance (https://en.wikipedia.org/wiki/Attractive_nuisance_doctrine) - people are going to get hurt by it. If the borg experience counts for anything, the lesson is that saying "yes" to things like this turned out terribly, and is a drag on the system.

I proposed a less first-class alternative. Something like: annotate pods and run a highly-privileged agent on the node that reacts to pods being scheduled by programming cpusets, calling numactl, and whatever else it needs to do. Make this VERY explicit to use and DO NOT make it part of the API.

If we need to collect NUMA status and use something about NUMA as a hint to scheduling, I can MAYBE see this, maybe even a different scheduler entirely.

If we need to sell "dedicated caches", then we should add that as a resource. If I need NUMA because I want an LLC to myself, let me spec that.

If we need to sell "link to high-perf IO device", then we should add that as a resource.

I'm trying to not overreact, but I have a visceral response to this.
@thockin I'm not trying to cause you a heart attack this close to the holiday :-) I thought the approach outlined here was closer to what was discussed at Kubecon as the pod tolerations captured an intent based model.

This proposal was not meant to exclude the alternative option. It's clear the alternative you have in mind satisfied a broad swath of workloads at Google, but I suspected we would still want to persist the information on the node and pod similar to what was presented, even when an alternative privileged agent was doing the NUMA assignment, so we could do checkpointing and handle restart policies gracefully in the kubelet.
> @thockin I'm not trying to cause you a heart attack this close to the holiday :-) I thought the approach outlined here was closer to what was discussed at Kubecon as the pod tolerations captured an intent based model.

The intent is more like "raw performance matters a lot for this pod" or even "raw perf is more important than access to extra cores", to which the system would automatically affine it to a NUMA node. Specifying "lock me to NUMA node 3" or even "lock me to any one NUMA node" is mechanism.

> This proposal was not meant to exclude the alternative option.

To be clear, my "alternative" is hand-wavy at best.

> I suspected we would still want to persist the information on the node and pod similar to what was presented even when an alternative privileged agent was doing the NUMA assignment so we could do checkpointing and handle restart policies gracefully in the kubelet.

I would consider that a pretty niche case. This numa-agent could checkpoint itself, sure, but I doubt we would want to spec that checkpoint in a way that some OTHER numa agent could consume it. That seems like a lot of work for almost no reward.
> ## Future considerations
>
> 1. Author `NUMATopologyPredicate` in scheduler to enable NUMA aware scheduling.
> 1. Restrict vertical autoscaling of CPU and NUMA affined workloads.
I think this proposal assumes that we run workloads on monolithic hardware.
Our use case is different.
We have servers with 1,2 or 4 NUMA nodes. They have different CPU models (different number of cores) and different NUMA topologies. For example:
Server 1 has topology:
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
Server 2 has topology:
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39
Consider a scenario:
We want to run workloads that perform best if they use all CPUs on one NUMA node.
If the server has 4 NUMA nodes, we will run 4 workloads on them.
It will be difficult to run this scenario with the proposed implementation.
Can we try a different approach?
Add a capability to specify a CPUset for Opaque Integer Resources. One Opaque Integer Resource could be predefined (let's call it "numa"). When a user requests OIR numa 1, the scheduler would specify CPUset 0-9,20-29 on server-1 from my example, and 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40 on server-2.
This approach would also solve more advanced use cases. Let's say we want to run a workload that is pinned to one CPU that is close to a specific PCI device. On one server this CPU may be located on numa0, on another on numa1. In this case a user defines a new OIR (let's call it "fast-pci") and specifies a CPUset for this OIR. It would be different for different servers. The scheduler would pin the workload to the CPU close to the PCI device on each server.
I assume that the OIR definition would be performed by a Node-Feature-Discovery pod (as defined in kubernetes/kubernetes#10570 (comment)).
@jakub-d -- i don't think numa should be handled via OIR, but I see value in letting you have numa node capacity as a counted resource on node. I will think about how to handle that in this proposal. I think the major tension on the proposal is just the presence of NUMANodeID/CPUAffinity on the pod declaration which I think I would still want even with your proposed model. Thanks for the scenario feedback!
Recording a pointer to kubernetes/kubernetes#39818, which means to me that a Toleration alone is not enough to provide additional meaning to a resource, as there is no mechanism to assert that the Toleration must be coupled with a node with a matching taint (at this time).
This discussion seems to have stalled a bit. @thockin: to grossly oversimplify your concerns about this proposal (please correct if this is wrong), you are opposed to exposing the `NUMANodeID` and `CPUAffinity` fields in the pod spec directly to users.

What if these fields were only settable by a privileged agent like you described? That way those fields become part of the API between the system and the cluster operator. Users would have to specify something higher-level. Examples of higher-level could be anything from "I am a workload of class XYZ" to "I need to deliver 95% tail response latency of 100ms or less per my SLA".
@ConnorDoyle -- i need to make an update to this proposal around its usage of tolerations, but i agree that the fundamental issue is around those two fields. in addition, i am wondering if people feel differently about "exclusive cores" versus numa/cpu affinity. in that mode, we still want something similar to `CPUAffinity` to provide the contract, but the `NUMANodeID` would not necessarily be relevant.
Can core allocation & NUMA be just a node level optimization for now to begin with? For example, we could consider letting applications specify that they are sensitive to latencies and have kubelet automatically provide these optimizations.
@derekwaynecarr good point. Following the above, if NUMA is what the user requires, they could say so in a way that the operator / system understands, and the privileged agent could set up CPUAffinity accordingly.

@vishh It sounds like most people are on board with that goal. Teaching Kubelet to auto-tune settings to uphold SLAs is a much harder problem than simply allowing operators to provide some pre-configured settings on behalf of users. Also, it sounds like there may be some scenarios that Kubelet won't commit to supporting. In the meantime, and for the latter cases, could we build a bridge that avoids exposing too much detail (as per @thockin's comments)? Landing the low-level mechanism described in this proposal could be the groundwork for the default controller that ships w/ core to implement those higher level things.
It is better to make Kubernetes a bit more complex than exposing that complexity to end users. Also, we are working on providing extensible isolators in the node, which would allow for varying levels of isolation.
The uncertainty around this proposal indicates that we should iterate on this feature rather than make API decisions now. In that regard, let's avoid exposing more knobs unless necessary.
> if NUMA is what the user requires

NUMA is never what the user requires. It's a means to an end. The things they need are more deterministic memory latency, protection from cache-thrashers, proximity to a particular device, etc. Think of it another way - if a machine had low-latency uniform memory access from all cores and had no shared LLC, users would have everything they need without NUMA, right?

So let's describe those things as much as we can.

I'm less averse to "I need a dedicated LLC" than I am to "I need node 1", though it is still mechanism oriented, rather than intent.

Here's my real scar: Once upon a time, perf-sensitive Borg users said "we need dedicated cores", so borg gave them dedicated cores. It meshed well enough with chargeback and things like that. Over time, we made CPU isolation better and more automatic. We could offer an SLA on scheduler wakeups, and even super latency-sensitive apps were happy - without dedicated cores! The existence of dedicated cores was now a drag on the system - it was a constraint around which we could not automate. So we tried to get rid of them, but we met with much resistance - I think the phrase "cold dead hands" was used. We asked people to just TEST without it, and we realized that many MANY teams were effectively unstaffed, or had ZERO bandwidth for an experiment. And so the feature stuck, and the automatic efficiency of Borg is capped by the number of people using (and cargo-culting) dedicated cores.

We know we CAN get to automated awesomeness in many (not all) ways. I don't want to make the exact same decisions. I can't call them "bad" decisions because they were the right decisions AT THE TIME.

So, can we make better decisions now? Can we side-load this stuff until it's either painfully obvious we should have it first-class or until we don't need it? Let's make the users of these features be the ones to carry the burden.
One potential way around the problems Tim mentions might be to think in terms of a general proximity scale (beware, some wild handwaving to follow, but I think we might be able to bash the vague idea into something usable).

At one end of the scale we have pods very far apart (think different data centers), with high inter-pod communication cost/latency, low failure correlation, and low interference (including network bandwidth contention, disk contention etc).

At the other end of the scale we have pods very close together, with low inter-pod communication cost/latency, high failure correlation, and high interference (including LLC, network, disk etc).

Think of it as a scale from 1 to 100 if that helps.

One fairly typical requirement is for pods to be proximate to one another within some range on that scale (typically for some combination of the above reasons of communication, failure correlation and interference). For example, "I want my master and slave to be at least on different rack switches (for failure isolation), but within the same building (for low network cost/latency)" might be expressed as a simple proximity range from, say, 30 to 70, which expresses the general intent, but leaves the scheduler to make the actual placement based on this intent.

Similarly, "I want my pod to be on a machine all by itself (to avoid memory, disk and network contention)" might be expressed as a proximity requirement with a range of 10-100 w.r.t. all other pods.

Having your own core, NUMA channel, LLC etc just becomes a range (with a lower bound on the low end) of that scale.

Thoughts?
@thockin IIUC, I think we are in agreement. Like I mentioned earlier, we need to express an intent and have NUMA and core allocation be one of the means to achieve that intent. The intent could be that a pod is "Latency sensitive". At the same time, if we let the intent become too loose (similar to a broad scale that @quinton-hoole mentioned), users might not be able to consume such an API at all. IIUC, this feature is primarily meant to tackle workloads that are sensitive to scheduling and memory access latencies. So let's try to model based on that "need" rather than the means. |
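As a rough illustration of the intent-versus-mechanism distinction, here is a sketch in which the user declares something like "latency-sensitive" and the node decides which isolators to apply. All names are hypothetical and nothing here is an existing or proposed kubelet API.

```go
// Hypothetical sketch: an intent-oriented knob that the node, not the
// user, translates into concrete mechanisms.
package main

import "fmt"

// IsolationIntent is what the user declares ("what I need"), as opposed
// to mechanisms ("give me NUMA node 1" or "pin me to cores 4-7").
type IsolationIntent string

const (
	BestEffort       IsolationIntent = "best-effort"
	LatencySensitive IsolationIntent = "latency-sensitive"
)

// isolatorsFor is a stand-in for whatever isolators a node implementation
// has available; the mapping below is illustrative, not a proposed policy.
func isolatorsFor(intent IsolationIntent) []string {
	switch intent {
	case LatencySensitive:
		return []string{
			"reserve exclusive cores for the container",
			"bind memory allocations to the local NUMA node",
			"steer device interrupts toward the same node",
		}
	default:
		return []string{"run in the shared CPU pool"}
	}
}

func main() {
	for _, action := range isolatorsFor(LatencySensitive) {
		fmt.Println(action)
	}
}
```

With a mapping like this, the mechanism choices stay inside the node and can improve over time without users holding onto them, which is the concern raised about dedicated cores above.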
@vishh and I spoke in more detail, and I will be writing up an alternate
document to contrast with this proposal around extensible pod isolators,
hopefully in time for the next resource mgmt wg meeting so we can weigh
pros/cons.
|
Any updates on this subject? |
|
||
## What is NUMA? | ||
|
||
Non-uniform memory architecture (NUMA) describes multi-socket machines that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-uniform memory architecture => Non-Uniform Memory Architecture (NUMA)
Very minor change, enhances readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A in NUMA is "Access" https://en.wikipedia.org/wiki/Non-uniform_memory_access
Should it be "Non-Uniform Memory Access (NUMA) architecture" ?
For reference, I plan to update this proposal to align w/ more recent discussions in our resource mgmt workgroup. I hope to have an update for review in time for our planned f2f on 5/1/2017 |
I am closing this proposal given the outcome of the resource mgmt f2f where we agreed on a different high-level approach. |
@derekwaynecarr are you able to share more details about the F2F? |
A relevant formal proposal is in k8s/community #654 |
Signed-off-by: Andrew Burden <aburden@redhat.com>
This proposal describes a potential mechanism to support CPU affinity and NUMA topology awareness.
It is intended to help guide future discussions around resource management in @kubernetes/sig-node and @kubernetes/sig-scheduling in the resource management work-group.