-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set temporary single CPU affinity before cgroup cpuset transition. #3923
Conversation
Thanks for working on this. I changed it to draft until all the issues with the code and test cases are fixed, and left some minor comments. |
85f2d35
to
3e05b1c
Compare
Thanks @kolyshkin , it should be ready for review now |
@cclerget Please rebase |
@lifubang Done |
// Use a goroutine to dedicate an OS thread. | ||
go func() { | ||
cpuSet := new(unix.CPUSet) | ||
cpuSet.Zero() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK this is not needed, in Go everything is initialized to default values (0s in this case), and you've just instantiated a new CPUSet.
tests/integration/exec.bats
Outdated
@@ -340,3 +340,168 @@ EOF | |||
[ ${#lines[@]} -eq 1 ] | |||
[[ ${lines[0]} = *"exec /run.sh: no such file or directory"* ]] | |||
} | |||
|
|||
@test "runc exec with isolated cpus affinity temporary transition [cgroup cpuset]" { | |||
requires root |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to add cgroups_cpuset
to the requires list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for other tests
tests/integration/exec.bats
Outdated
local all_cpus | ||
all_cpus="$(cat /sys/devices/system/cpu/online)" | ||
|
||
update_config ".linux.resources.cpu.cpus = \"$all_cpus\"" | ||
|
||
# set temporary isolated CPU affinity transition | ||
update_config '.annotations += {"org.opencontainers.runc.exec.isolated-cpu-affinity-transition": "temporary"}' | ||
|
||
local mems | ||
mems="$(cat /sys/devices/system/node/online 2>/dev/null || true)" | ||
[[ -n $mems ]] && update_config ".linux.resources.cpu.mems = \"$mems\"" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps you can separate this into a function (at least the mems and all_cpus part).
tests/integration/exec.bats
Outdated
# fix unbound variable in condition below | ||
PLATFORM_ID=${PLATFORM_ID:-} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you use VERSION_ID
instead, it looks easier?
@kolyshkin addressed your comments in 9eb05cc, will squash the commits after approval |
The introduction of the kernel commit 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76 | ||
in 5.7 has affected a deterministic scheduling behavior by distributing tasks | ||
across CPU cores within a cgroups cpuset. It means that some runc operations | ||
like `runc exec` might be impacted under some circumstances, by example when |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a review, just a question (but maybe you'll decide to clarify this further). Without looking at the code, and only at this piece of documentation (and at the commit message):
- will this only improve the behavior of threads launched with
runc exec
? (if so, then the documentation should not be "some runc operations like") - otherwise, what other commands / situations / runc operations will benefit from this patch? (i.e., in which cases do I want to annotate my pods with this annotation to revert to the behavior pre 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76). The documentation so far only mentions
runc exec
- related to the above, will this change the behavior of any new process spawned in the container (an example would be a DPDK application using the vhost device to create kernel vhost threads, would those still float around freely, or be sent to the first CPU with the annotation in place - in this scenario, no
runc exec
session is involved) - adjust the commit message also, because "This handles a corner case when joining a container having all
the processes running exclusively on isolated CPU cores to force
the kernel to schedule runc process on the first CPU core within the
cgroups cpuset." sounds ambiguous to me? "joining" as in connecting to the container with an exec session? (because joining could also mean "joining something together")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this only improve the behavior of threads launched with runc exec? (if so, then the documentation should not be "some runc operations like")
Yes it does affect runc exec operation only, I will fix that part, thanks !
otherwise, what other commands or situations will benefit from this patch? (i.e., in which cases do I want to annotate my pods with this annotation to revert to the behavior pre 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76)
No other commands will benefit from this patch.
For situations, by example when kubernetes is configured with a static policy (https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy) and --reserved-cpus
contains the isolated CPUs, all pods running with resources limits/requests set to the same number of CPUs >= 1 will be granted X exclusive isolated CPUs, with such setup and pod spec, a container real-time application running on those isolated CPUs could create threads with SCHED_FIFO policy except on the first isolated CPU, such that things like kubernetes exec probes or kubectl exec
will benefit from this patch to use the first isolated CPU without interfering with the real-time application threads. You can look at the original issue #3922 to have more context.
related to the above, will this change the behavior of any new process spawned in the container (an example would be a DPDK application using the vhost device to create kernel vhost threads, would those still float around freely, or be sent to the first CPU with the annotation in place
It won't change the behavior for the new processes, only processes spawned through runc exec
are impacted by this annotation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks for the clarification and thanks for your work on this!
@kolyshkin changes ok with you ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM; please squash the commits
This handles a corner case when joining a container having all the processes running exclusively on isolated CPU cores to force the kernel to schedule runc process on the first CPU core within the cgroups cpuset. The introduction of the kernel commit 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76 has affected this deterministic scheduling behavior by distributing tasks across CPU cores within the cgroups cpuset. Some intensive real-time application are relying on this deterministic behavior and use the first CPU core to run a slow thread while other CPU cores are fully used by real-time threads with SCHED_FIFO policy. Such applications prevents runc process from joining a container when the runc process is randomly scheduled on a CPU core owned by a real-time thread. Introduces isolated CPU affinity transition OCI runtime annotation org.opencontainers.runc.exec.isolated-cpu-affinity-transition to restore the behavior during runc exec. Fix issue with kernel >= 6.2 not resetting CPU affinity for container processes. Signed-off-by: Cédric Clerget <cedric.clerget@gmail.com>
Done thanks ! |
Thank goodness this is finally merged! Really appreciate your help everyone! |
The more I look into this the more I think I did a bad job reviewing this, and it needs to be redone in a different way:
Let's fix this:
|
while I recognize that this is a complex issue, I am unsure if reverting this is the best path forward, considering implementations of this are already deployed and in use. This review has taken an exceedingly long time and we are now on the cusp of even more review. It would be disappointing if we have to start this whole process over again. |
I have to admit I did a sloppy job reviewing this; yet this is not in any of the released runc versions (and this is why it needs to be reverted now, before we officially release it). Now, this has to be re-implemented in the right way, starting from runtime-spec (see opencontainers/runtime-spec#1253), then in runtimes (such as cri-o and containerd), when in low level runtimes (such as runc and crun). Any help (esp in cri-o and containerd) is appreciated. |
@cclerget I am sorry this had to be reverted, but now when opencontainers/runtime-spec#1253 is merged we can work on reviving it if you're still interested. Now, 90% of the work is to be done by containerd/cri-o (they should do the work of converting pod annotation (what used to be Let me know if you still want to work on this and we'll figure it out. |
@kolyshkin It's unfortunate but understand the rational behind this decision. I'm more concerned about the revert of the fix not directly related to the original issue like mentioned in my comment #3923 (comment). It has also been noted by the customer when using a tuned profile like CPU-partitioning, the CPU affinity patch at least on kernel < 6.2 or RHEL < 9 doesn't fix the runc "noisy neighbor" behavior after cgroup transition, on those kernels and systems I'd like to but right now I haven't much room to work on it, maybe in two weeks |
@kolyshkin has there been any commitment by |
This handles a corner case when joining a container having all the processes running exclusively on isolated CPU cores to force the kernel to schedule runc process on the first CPU core within the cgroups cpuset.
The introduction of the kernel commit
46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76 has affected this deterministic scheduling behavior by distributing tasks across CPU cores within the cgroups cpuset. Some intensive real-time application are relying on this deterministic behavior and use the first CPU core to run a slow thread while other CPU cores are fully used by real-time threads with SCHED_FIFO policy. Such applications prevents runc process from joining a container when the runc process is randomly scheduled on a CPU core owned by a real-time thread.