Add CPU manager proposal. #654
Conversation
Attn @kubernetes/sig-node-proposals
@ConnorDoyle: These labels do not exist in this repository: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
fyi @jeremyeder
Guaranteed pod with 1 or more cores of cpu, the system will try to make
sure that the pod gets its cpu quota primarily from reserved core(s),
resulting in fewer context switches and higher cache affinity".
1. Support the case where in a given pod, one container is latency-critical
I assume this is the case where a pod has a container that uses integral cores and another container that uses fractional?
Yes, that's how the static policy as written satisfies this requirement. Does that need clarification here?
It's probably OK. I have too much history in the discussion; as phrased here, it signals something deeper.
CRI update needed by CPU Manager - kubernetes/kubernetes#46105
Will it be possible to reserve a core for a pod but not actually assign a "container" to it? We actually dedicate cores to very specific application "threads". We currently do this by setting aside cores at boot time using `isolcpus`. Not disturbing the reactor thread is critical for our low latency / low jitter workloads. Instead of using taskset it might make more sense to perhaps communicate with the manager at runtime from the container?
@chino Making CPU Manager aware of
@chino there has been some debate about similar use cases so far. This proposal represents a simplified compromise between the most sophisticated scenarios and some level of support that will benefit many users. We at Intel have worked with users with exactly this requirement, and the strategy you described is pretty common among the low latency crowd. Process scheduler load balancing is off for those CPUs, so applications need awareness to call sched_setaffinity() (or wrap with taskset/numactl) in order to use more than one. You could argue that the additional complexity breaks fungibility, so one way to look at isolcpus could be as a separate resource. In that case, one of the policies could be expanded to discover, advertise, and make node-level allocation decisions for an isolated-cpu resource. If the use case isn't a natural fit with the upstream CPU Manager policies, then it may motivate a plugin interface so more specialized policies can be added by users. Before we retreat to that alternative, we should try to see how it could be satisfied in this proposal.
Yes, I would like to replace the isolcpus method by working with the manager to achieve the same result. I was only describing what we currently do as an example. The main point is that we want to lock a specific app "thread" to a core and not have anything else but that thread on the core. We don't know what the thread-id is until the process starts so we have a wrapper script that handles that in the container.
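The wrapper-script approach described above ultimately comes down to calling sched_setaffinity(2) for the thread in question. Purely as an illustration (not part of this proposal), a minimal Go sketch of pinning the calling thread to one core; the core number and the helper name are hypothetical:

```go
package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

// pinCurrentThread binds the calling OS thread to a single CPU.
// In the scenario described above, the CPU would be one of the cores
// set aside at boot (e.g. via isolcpus); here it is just an example value.
func pinCurrentThread(cpu int) error {
	runtime.LockOSThread() // keep this goroutine on one OS thread

	var set unix.CPUSet
	set.Zero()
	set.Set(cpu)
	// pid 0 means "the calling thread" for sched_setaffinity(2).
	return unix.SchedSetaffinity(0, &set)
}

func main() {
	if err := pinCurrentThread(3); err != nil {
		fmt.Println("failed to set affinity:", err)
		return
	}
	fmt.Println("reactor thread pinned; run the latency-critical loop here")
}
```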
@chino by coincidence, a similar point just came up in the resource management workgroup weekly while discussing constraining cache partitions for garbage collector threads in a JVM. How about exploring the other direction of communication first? That is, Kubelet/CPU Manager could communicate down to the container the intersection of its CPU mask and isolcpus? That way the subprocess or thread could continue to set its own affinity without needing to call back into the CRI or Kubelet. The container would gain the advantage of allowing the CPU Manager to track the assigned cpusets, therefore guaranteeing that it won't accidentally collide with another user of an isolated CPU. My gut reaction is that implementing a bidirectional API for Kubelet/CRI to collaborate with the user application for these settings would get pretty complicated, but maybe we just haven't seen the right abstractions for it yet.
@ConnorDoyle @chino -- the cpu assigned by the kubelet to a container needs to be discoverable via downward API (should be in this spec, if not there yet).
Yes, exactly. I figured the simplest solution which wouldn't introduce too much complexity would be to We generally have 1 core on each socket set to handle interrupts and os/background threads. Sometimes we need to leave a few extra free to handle the interrupts on them fast enough for disk and what not. Obviously in a true multi-tenant cluster it would be nice to leave cores for os/k8 and then also group up cores into "shared pools" between certain pods and, as noted, "exclusive cores" for specific containers/threads. It could be interesting if there was some way to specify how to find the right thread to pin, but it seems complicated, and in our case that could be Java (hence the need to use jstack) or could be a native application where you need to use another method to identify the right thread. Another issue we currently have is we'd like to share the ipc namespace and mmap files across sets of pods in the same tenant. Right now the only way to do that is to share it with the entire host, which isn't very multi-tenant friendly/secure.
@ConnorDoyle I like that idea, as long as we can specify easily shared vs dedicated. We previously used a very simple set of cgroups pre-docker, but container runtimes create a whole set of different cgroup hierarchies and sometimes the container might not have write access. If something like cgroup namespaces or something formal can make this process easy to manage inside the private region then that's great. Again to be clear, I think multiple pods will want to share the same cores at times for background threads because there aren't enough cores to go around, and controlling interrupt settings will be important.
for “fast” virtual network functions (want to approach line rate on
modern server NICs.)

_Solution requirements:_
add a requirement that a container must be able to discover the set of cpus it has been assigned.
I'm not sure I agree that this should be a requirement. I see the value where people want to do thread affinity inside the pod, but it comes with baggage.
It pins us in when we go to add the dynamic cpuset manager later.
It also creates a problem in the kubelet restart case. If our contract is "I'll put you on an exclusive core", we have the freedom to "reschedule" on kubelet restart rather than writing a nasty cgroup parser to reconstitute the cpuset manager state.
Reschedule as in possibly shift a container to a new core after it's already been running? That would be rather scary for many workloads/users if that happened.
The cpuset is just a bit flag which is easy to parse?
Could the assigned cores be saved to etcd or a local cache so it's recoverable perhaps?
@chino the CPU mask would only be changed after process start if the dynamic CPU Manager policy is enabled. The static policy guarantees never to do that.
@derekwaynecarr and @sjenning: Inside the container, workloads that care about their CPU mask can easily get it by reading /proc/self/status
(see below). Is that sufficient?
connor:~/ $ grep -i cpus /proc/self/status
Cpus_allowed: 77
Cpus_allowed_list: 0-2,4-6
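For illustration only (not something the proposal mandates), a small Go sketch of a workload reading that field for itself:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// cpusAllowedList returns the Cpus_allowed_list value from
// /proc/self/status, e.g. "0-2,4-6".
func cpusAllowedList() (string, error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return "", err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		if strings.HasPrefix(s.Text(), "Cpus_allowed_list:") {
			return strings.TrimSpace(strings.TrimPrefix(s.Text(), "Cpus_allowed_list:")), nil
		}
	}
	return "", s.Err()
}

func main() {
	cpus, err := cpusAllowedList()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("allowed CPUs:", cpus)
}
```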
For a dynamic policy, we should send a signal. For static, I agree it's not needed.
Backfilling from off-list discussion: we talked today about providing a signal in the following way. We could project (a subset of) the CPU manager state into a volume visible to selected containers. User workloads could subscribe to update events in a normal Linux manner (e.g. inotify.) We also decided, more or less, that this won't be necessary until the dynamic policy is implemented. We can document that user containers in the shared pool must avoid setting their own CPU affinity, since it will be overwritten without notice any time the membership of the shared pool changes.
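To make the volume-plus-inotify idea concrete, here is a rough sketch of how a workload could subscribe to such updates. The file path is hypothetical (the proposal does not define a path or format), and the fsnotify library is used here only as a convenient wrapper around inotify:

```go
package main

import (
	"fmt"
	"os"

	"github.com/fsnotify/fsnotify"
)

// statePath is a hypothetical location for projected CPU manager state.
const statePath = "/etc/podinfo/cpuset"

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer watcher.Close()

	if err := watcher.Add(statePath); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	for {
		select {
		case event := <-watcher.Events:
			if event.Op&fsnotify.Write == fsnotify.Write {
				data, err := os.ReadFile(statePath)
				if err == nil {
					// A real workload would re-apply its thread affinity here.
					fmt.Println("cpuset updated:", string(data))
				}
			}
		case err := <-watcher.Errors:
			fmt.Fprintln(os.Stderr, "watch error:", err)
		}
	}
}
```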
catching up on doc, so this may be covered elsewhere, but we should say something about signalling in a future iteration of this proposal.
Will add it to the dynamic policy section.
added
@derekparker yes, that would be great. The main thing that worries me is we literally turn off interrupts and even high performance tick timers. We cannot have the top level container pid touching the core at all. Hence why I asked if we could dedicate but not assign anything to the core, so that we can taskset ourselves and know that it's completely devoid of anything running on it. When you work with low latency apps on the nano/microsecond time scale, any interrupt will cause measurable blips and jitter in the otherwise steady message rates. Especially in distributed applications this can cause race conditions and outliers.
Apologies for this delay. I'll review this proposal next week.
I do not think that it's required in a first step, but: Some applications need a way to specify how threads are assigned to cores, to optimize performance. qemu/libvirt is one such example. Thus it might be valuable if at some point a workload can express how it expects the threads to be pinned to cores.
| ------------------------------------------ | ------------------------------ |
| Pod [Guaranteed]:<br /> A:<br />  cpu: 0.5 | Container **A** is assigned to the shared cpuset. |
| Pod [Guaranteed]:<br /> A:<br />  cpu: 2.0 | Container **A** is assigned two sibling threads on the same physical core (HT) or two physical cores on the same socket (no HT.)<br /><br /> The shared cpuset is shrunk to make room for the exclusively allocated CPUs. |
| Pod [Guaranteed]:<br /> A:<br />  cpu: 1.0<br /> A:<br />  cpu: 0.5 | Container **A** is assigned one exclusive CPU and container **B** is assigned to the shared cpuset. |
nit: here and next line Pod column mentions 2 A containers. easy to spot on rendered page, but not here
Thank you, will fix.
CPUs topologically. For example, allocating two CPUs on a system that has
hyper-threading turned on yields both sibling threads on the same
physical core. Likewise, allocating two CPUs on a non-hyper-threaded
system yields two cores on the same socket. |
What should happen if there are enough CPUs on a node, but they're far away from each other (I'm requesting 2 CPUs, but the ones left are on different cores). For some workloads admitting such Pod may be acceptable, for some not. Would it be a good idea to make this behaviour configurable?
For now, our plan is to have the first policy (static) provide the "best" allocation when assigning dedicated cores. So, at higher G pod density it would be possible to observe the topology fragmentation you described.
If it were configurable in the policy, do you have ideas about how you would like that to look (name for the knob, values, defaults etc.)?
The more I think about it — the more I realize how complicated things actually are =/
Generally my perception is that by default the CPUs of a single container inside a Pod should be as close to each other as possible, while the CPUs of different containers inside a Pod should be as far from each other as possible. The only case I can imagine where I would want to pack as many dedicated containers onto a single socket as possible is when I want to launch multiple small (1-CPU) containers and then launch a single giant one that asks for half the CPUs of a core.
For a container, if we want to allow control over how close the CPUs are, I imagine it might be a double parameter in the pod spec. Something like:

```yaml
corePolicy:
  placement: dense|separate
  strict: true|false
```

with `dense` and `false` as defaults. `separate` would mean attempting to spread CPUs as much as possible, and `dense` would mean packing them as close as possible. `strict: false` would mean that the workload can tolerate it if the requirements are not met, while `strict: true` would mean that the container has to be rescheduled.
And then there is a whole new level with HT enabled. I can imagine something like:

```yaml
coreThreadPolicy:
  placement: dense|separate|separate_restrict
  strict: true|false
```

with `dense` and `false` as defaults. `dense` here would mean allowing a container to be given multiple sibling CPUs, `separate` would mean attempting to spread CPUs across non-siblings, and `separate_restrict` would mean spreading CPUs across non-siblings and reserving the siblings, so that they are not assigned to anyone.
I'm afraid that I might be over-thinking the whole problem =)
As a small disclaimer I'd like to mention that I come from OpenStack, so my perception might be tainted with how OS does it =) Here is a link to a doc that describes how OS approaches similar problems.
(cc @ashish-billore)
@teferi I like the idea of `corePolicy`. As for `coreThreadPolicy`, isn't it a little bit too much? i.e. not all environments are homogeneous; some nodes may not have HT enabled (probably not many). Besides, `separate_restrict` can be accomplished by requesting e.g. 2 cores. If the node has HT and

```yaml
corePolicy:
  placement: dense
  strict: true
```

then it would result in 2 CPUs on 1 physical core.
@flyingcougar `separate_restrict` is more about not allowing others to use a core's sibling. It would probably be very difficult to implement, since k8s would still think that we have those cores available for scheduling. So anyway, it was probably a weird idea.
`corePolicy` is about packing the CPUs as tightly as possible on cores in the same socket, while `coreThreadPolicy` is about packing CPUs as tightly as possible within one core. So IMO both of those have their uses. I think I need to draw a couple of diagrams to illustrate these ideas...
cc @fabiand I guess you should participate in ^^ part of the discussion
> The more I think about it — the more I realize how complicated things actually are =/
Yes :)
I'm not an expert on this domain, but …
It's pretty clear that we - kubevirt - need the 'strict dense placement' option as a first step.
I actually wonder if additional differentiations are needed: Either you care, and you require an optimal placement, or you don't and get best effort.
Some workloads, like kubevirt (again), might want to manage parts of NUMA nodes themselves, in a more fine-grained fashion. In those cases it might make sense to be able to tell the kubelet that some nodes, sockets, or cores should be reserved/exclusive to a specific container. A process can then take those reserved parts and manage them as needed (i.e. in our case, libvirt would take care of managing those nodes).
Long story short, we might want a mechanism to lend nodes to a process.
Another thing is that in the future we need to combine the placement with ResourceClasses #782 (and thus should already be thinking about it), because in reality you want a process close to the device it uses, to avoid remote NUMA-node memory access.
Closely related to this is the relationship to huge pages.
I do not think that we need to solve all of this now, but we should understand the picture, to not create something which will already contradict with what we see coming up.
@berrange @mpolednik thoughts?
@fabiand there's exactly that topic on the resource management workgroup agenda for this morning. Besides the device binding logic and the hugetlb controller settings, CNI plugins could also take topology into account on a multi-socket system with multiple network interfaces. I hope we can achieve agreement that a solution is needed to unify affinity somehow in the v1.9 or v1.10 releases.
On the topic of the CPU manager itself, policies that operators want can get really complicated. Hopefully we can get a good default scoped and implemented first and then extend with more advanced policies. In the resource management face-to-face where this component was conceived, we had a pretty long discussion about implementing a more flexible policy driven by explicit pool configuration. Maybe we could think of ways to represent the behaviors needed for the advanced cases in the config domain. In my mind, a rich configurable policy may buy enough flexibility so that we don't need another plugin system for external policies. At the same time, for v1.8 at least, changes to the pod spec API are probably out of scope.
Let me just raise our use-case: When spawning VMs it is important to map the virtual topology to the real one, to achieve optimal performance. /cc @berrange
1. _A container that was assigned exclusive cores terminates._
1. Kuberuntime unregisters the container with the CPU manager.
1. CPU manager unregisters the contaner with the static policy.
nit: container
1. Asynchronously, the CPU manager's reconcile loop updates the
cpuset for all containers running in the shared pool.

1. _The shared pool becomes empty._
we previously discussed a node condition for cpu pressure. is the taint the replacement?
for pre-emption, will the taint be understood?
Apologies, lost track of the progress to convert node conditions to taints. Will update.
updated
#### Policy 3: "dynamic" cpuset control

_TODO: Describe the policy._
summarize what was discussed at f2f, and just state that this is planned for prototyping post kube 1.8.
* Checkpointing assignments
* The CPU Manager must be able to pick up where it left off in case the
Kubelet restarts for any reason.
* Read effective CPU assinments at runtime for alerting. This could be
assignments
CRI. Runc/libcontainer allows container cgroup settings to be updated
after creation, but neither the Kubelet docker shim nor the CRI
implement a similar interface.
1. Mitigation: [PR 46105](https://github.com/kubernetes/kubernetes/pull/46105)
this is needed independent of cpu policy.
Any update required in this text? (what are the other things that need it, VPA?)
## Implementation roadmap

### Phase 1: No-op policy
make clear kube 1.8 targets phase 1 and 2.
@ConnorDoyle @sjenning -- we need to capture what happens when all cpus are used, and make clear that we need to evict a BE pod if already resident on the node.
@derekwaynecarr updated
@derekwaynecarr and @vishh (approvers per the rm working group planning spreadsheet): aside from the node condition details that @sjenning is currently working to add, are there any other blockers to accepting this proposal?
log forwarding, metrics collection and the like.)
1. Do not cap CPU quota for guaranteed containers that are granted
exclusive cores, since that would be antithetical to (1) above.
1. Take physical processor topology into account in the CPU affinity policy.
nit: incorrect numbering.
This is github markdown. When viewed as markdown, not source, the numbering is correct.
document][node-allocatable] for details. The CPU manager will claim
`ceiling(node.status.allocatable.cpu)` as the number of CPUs available to
assign to pods, starting from the highest-numbered physical core and
descending topologically. It is recommended to configure an integer value for
@ConnorDoyle Can you please give an example here illustrating with a sample topology, `kube-reserved`, `system-reserved`, and `node.status.allocatable.cpu`?
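A hypothetical illustration of what such an example might look like (the numbers below are invented for this sketch, not taken from the proposal): on a node with 16 CPUs, setting `--kube-reserved=cpu=500m` and `--system-reserved=cpu=500m` would leave `node.status.allocatable.cpu = 15`, so the CPU manager would claim `ceiling(15) = 15` CPUs for pod assignment, taken from the highest-numbered cores downward, leaving the low-numbered core for the kube-reserved and system-reserved slices.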
Operator documentation will be updated to explain how to configure the
system to use the low-numbered physical cores for kube-reserved and
system-reserved slices.
@ConnorDoyle Will it make sense to add minimal details in this document as well, for example a "user story"?
that prevents BestEffort and Burstable QoS class pods from
running on the node.

1. _The shared pool becomes nonempty._
nit: numbering.
same as before re: github markdown
1. _The shared pool becomes empty._
1. The CPU manager adds a taint with effect NoSchedule, NoExecute
that prevents BestEffort and Burstable QoS class pods from
running on the node.
Is reading the state and updating the container resources via CRI synchronous with "start the container"? If yes, how does kuberuntime know when to invoke "start"?
@ConnorDoyle -- i will take one final pass by eod tomorrow. right now, this generally lgtm.
@ConnorDoyle @sjenning -- please make sure to keep this updated with implementation progress, but I think this proposal is fine to proceed with implementing. Thanks! /lgtm
@derekwaynecarr looks like merging is manual in this repo. Should I just hit the Big Green Button?
@sjenning Yes, feel free to manually merge this :)
@sjenning @derekwaynecarr @vishh I believe your PR covers this, but the EPYC processor from AMD has some new (somewhat odd) differences from the Xeons when it comes to core affinity and NUMA. You can see in the diagram here that there are multiple latency paths possible when you cross NUMA boundaries, depending on which core originated the request: http://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd-epyc-7000-cpu-battle-of-the-decade/2
@cliffburdick it is true that LLC latency is not uniform among cores on the same socket for the new AMD CCX-based processors. If a CCX must access a remote CCX's LLC, there is additional latency. It is basically a topology level between cores and sockets. Note this is only an issue if you have two threads sharing data. If the working sets are independent, this is not an issue. Right now, we can't make the distinction as cadvisor does not make the distinction. But this is a last-5% optimization in my mind. The existing cpuset allocator should work well, even without explicit knowledge of the CCXs.
@cliffburdick there's some historical context here, too. We agreed to defer a bunch of what this feature was originally spec'd out to be to a later release. That includes some of the more advanced topology detection and affinity (incl. PCI locality). It is a roadmap item for the working group. From the RH perspective, and I would say others' too, it is a very important item to tackle to make sure this cpumanager delivers on the intended use-cases. The compromise we made in order to even get this far was that the complexity of topology discovery and pinning would be hidden behind QoS tiers. So users request a QoS tier and a core count in whole integers (rather than millicores) to get the performance they want. If you have more concrete requirements, I would love it if you wrote them up, sent them to our mailing list, and hopefully participated directly in the next resource management working group meeting. Please review this: kubernetes/kubernetes#49964. You can find all of the mailing list, meeting, and agenda details here:
Hi @jeremyeder, I'd be glad to list a couple here, and some of it is based on your work. We're running DPDK in Docker and have IO cards that should be affined to the same NUMA zone/CPU. For DPDK, isolcpus is the first obvious thing to turn on for all the packet processing and worker threads; we do not want any Linux housekeeping or other applications floating there. I've seen some recent threads on other message boards related to using isolcpus with containers, and it appears that it either works, or there are workarounds to make it work. Taking it a step further, Intel recommends disabling the RCU threads on any isolcpus cores. I don't believe this changes anything, but it's another thing enabled on the kernel boot line. Lastly, I wanted to experiment with the tickless kernel, since many applications would see latency benefits from a single thread per core. I don't know how well this will work in a container since I haven't tried it, but I believe if two threads try to run on the same core, the kernel will revert to the normal scheduler for that core. I think it should work based on looking at the processes started by Docker, though. In my case, I will almost never be crossing the NUMA boundary. All of the work will come in on a NIC attached to a single processor, and all of the work will be done on that processor or accelerator cards attached to that processor before going back out the same NIC. Of course, DPDK has other implications as well, which I'm currently working through. I will send this to the mailing list as well.
@cliffburdick this is a fantastic discussion, but comments on a closed PR are the wrong place for it. I'm afraid all this good info will be lost. Could we please move to the ML?
@ConnorDoyle sure. Thanks.
Add CPU manager proposal.
Notes for reviewers
First proposal submitted to the community repo; please advise if something's not right with the format or procedure, etc.
cc @flyingcougar @sjenning @derekwaynecarr