Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set temporary single CPU affinity before cgroup cpuset transition. #3923

Merged
merged 1 commit into from
Apr 16, 2024

Conversation

cclerget
Copy link
Contributor

This handles a corner case when joining a container having all the processes running exclusively on isolated CPU cores to force the kernel to schedule runc process on the first CPU core within the cgroups cpuset.

The introduction of the kernel commit
46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76 has affected this deterministic scheduling behavior by distributing tasks across CPU cores within the cgroups cpuset. Some intensive real-time application are relying on this deterministic behavior and use the first CPU core to run a slow thread while other CPU cores are fully used by real-time threads with SCHED_FIFO policy. Such applications prevents runc process from joining a container when the runc process is randomly scheduled on a CPU core owned by a real-time thread.

@kolyshkin kolyshkin marked this pull request as draft June 30, 2023 20:49
@kolyshkin
Copy link
Contributor

Thanks for working on this. I changed it to draft until all the issues with the code and test cases are fixed, and left some minor comments.

@cclerget cclerget force-pushed the issue-3922 branch 16 times, most recently from 85f2d35 to 3e05b1c Compare July 3, 2023 14:01
@cclerget
Copy link
Contributor Author

cclerget commented Jul 3, 2023

Thanks @kolyshkin , it should be ready for review now

@cclerget cclerget marked this pull request as ready for review July 3, 2023 14:01
@lifubang
Copy link
Member

lifubang commented Jul 7, 2023

@cclerget Please rebase

@cclerget
Copy link
Contributor Author

@lifubang Done

// Use a goroutine to dedicate an OS thread.
go func() {
cpuSet := new(unix.CPUSet)
cpuSet.Zero()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK this is not needed, in Go everything is initialized to default values (0s in this case), and you've just instantiated a new CPUSet.

@@ -340,3 +340,168 @@ EOF
[ ${#lines[@]} -eq 1 ]
[[ ${lines[0]} = *"exec /run.sh: no such file or directory"* ]]
}

@test "runc exec with isolated cpus affinity temporary transition [cgroup cpuset]" {
requires root
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to add cgroups_cpuset to the requires list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for other tests

Comment on lines 349 to 359
local all_cpus
all_cpus="$(cat /sys/devices/system/cpu/online)"

update_config ".linux.resources.cpu.cpus = \"$all_cpus\""

# set temporary isolated CPU affinity transition
update_config '.annotations += {"org.opencontainers.runc.exec.isolated-cpu-affinity-transition": "temporary"}'

local mems
mems="$(cat /sys/devices/system/node/online 2>/dev/null || true)"
[[ -n $mems ]] && update_config ".linux.resources.cpu.mems = \"$mems\""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you can separate this into a function (at least the mems and all_cpus part).

Comment on lines 413 to 414
# fix unbound variable in condition below
PLATFORM_ID=${PLATFORM_ID:-}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you use VERSION_ID instead, it looks easier?

@kolyshkin kolyshkin added this to the 1.2.0 milestone Apr 2, 2024
@cclerget
Copy link
Contributor Author

cclerget commented Apr 3, 2024

@kolyshkin addressed your comments in 9eb05cc, will squash the commits after approval

The introduction of the kernel commit 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76
in 5.7 has affected a deterministic scheduling behavior by distributing tasks
across CPU cores within a cgroups cpuset. It means that some runc operations
like `runc exec` might be impacted under some circumstances, by example when
Copy link

@andreaskaris andreaskaris Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a review, just a question (but maybe you'll decide to clarify this further). Without looking at the code, and only at this piece of documentation (and at the commit message):

  • will this only improve the behavior of threads launched with runc exec? (if so, then the documentation should not be "some runc operations like")
  • otherwise, what other commands / situations / runc operations will benefit from this patch? (i.e., in which cases do I want to annotate my pods with this annotation to revert to the behavior pre 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76). The documentation so far only mentions runc exec
  • related to the above, will this change the behavior of any new process spawned in the container (an example would be a DPDK application using the vhost device to create kernel vhost threads, would those still float around freely, or be sent to the first CPU with the annotation in place - in this scenario, no runc exec session is involved)
  • adjust the commit message also, because "This handles a corner case when joining a container having all
    the processes running exclusively on isolated CPU cores to force
    the kernel to schedule runc process on the first CPU core within the
    cgroups cpuset." sounds ambiguous to me? "joining" as in connecting to the container with an exec session? (because joining could also mean "joining something together")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this only improve the behavior of threads launched with runc exec? (if so, then the documentation should not be "some runc operations like")

Yes it does affect runc exec operation only, I will fix that part, thanks !

otherwise, what other commands or situations will benefit from this patch? (i.e., in which cases do I want to annotate my pods with this annotation to revert to the behavior pre 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76)

No other commands will benefit from this patch.
For situations, by example when kubernetes is configured with a static policy (https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy) and --reserved-cpus contains the isolated CPUs, all pods running with resources limits/requests set to the same number of CPUs >= 1 will be granted X exclusive isolated CPUs, with such setup and pod spec, a container real-time application running on those isolated CPUs could create threads with SCHED_FIFO policy except on the first isolated CPU, such that things like kubernetes exec probes or kubectl exec will benefit from this patch to use the first isolated CPU without interfering with the real-time application threads. You can look at the original issue #3922 to have more context.

related to the above, will this change the behavior of any new process spawned in the container (an example would be a DPDK application using the vhost device to create kernel vhost threads, would those still float around freely, or be sent to the first CPU with the annotation in place

It won't change the behavior for the new processes, only processes spawned through runc exec are impacted by this annotation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for the clarification and thanks for your work on this!

@cclerget
Copy link
Contributor Author

cclerget commented Apr 5, 2024

@kolyshkin changes ok with you ?

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; please squash the commits

This handles a corner case when joining a container having all
the processes running exclusively on isolated CPU cores to force
the kernel to schedule runc process on the first CPU core within the
cgroups cpuset.

The introduction of the kernel commit
46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76 has affected this deterministic
scheduling behavior by distributing tasks across CPU cores within the
cgroups cpuset. Some intensive real-time application are relying on this
deterministic behavior and use the first CPU core to run a slow thread
while other CPU cores are fully used by real-time threads with SCHED_FIFO
policy. Such applications prevents runc process from joining a container
when the runc process is randomly scheduled on a CPU core owned by a
real-time thread.

Introduces isolated CPU affinity transition OCI runtime annotation
org.opencontainers.runc.exec.isolated-cpu-affinity-transition to restore
the behavior during runc exec.

Fix issue with kernel >= 6.2 not resetting CPU affinity for container processes.

Signed-off-by: Cédric Clerget <cedric.clerget@gmail.com>
@cclerget
Copy link
Contributor Author

Done thanks !

@kolyshkin kolyshkin merged commit 6a2813f into opencontainers:main Apr 16, 2024
38 checks passed
@MatthewHink
Copy link

Thank goodness this is finally merged! Really appreciate your help everyone!

@kolyshkin
Copy link
Contributor

The more I look into this the more I think I did a bad job reviewing this, and it needs to be redone in a different way:

  1. There's too much logic here figuring out which CPUs to use. Runc is a low level tool and is not supposed to be that "smart".
  2. What's worse, this logic is executed on every exec, making it slower.
  3. Some of the logic in (*setnsProcess).start is executed even if no annotation is set, thus making ALL execs slow.
  4. As pointed out in config: add annotation for exec isolated CPU affinity runtime-spec#1252, if we want to support this across different runtimes (e.g. crun and runc), this should not be an annotation, but rather a process parameter.

Let's fix this:

  1. Revert this PR.
  2. Move some of the functionality (determining which CPU to pin exec to) to upper level runtime (cri-o/containerd).
  3. Open a PR in runtime-spec proposing the changes to runc/crun.
  4. Open a PR to runc implementing that proposal.

@NeilHanlon
Copy link

while I recognize that this is a complex issue, I am unsure if reverting this is the best path forward, considering implementations of this are already deployed and in use.

This review has taken an exceedingly long time and we are now on the cusp of even more review.

It would be disappointing if we have to start this whole process over again.

@kolyshkin
Copy link
Contributor

while I recognize that this is a complex issue, I am unsure if reverting this is the best path forward, considering implementations of this are already deployed and in use.

This review has taken an exceedingly long time and we are now on the cusp of even more review.

It would be disappointing if we have to start this whole process over again.

I have to admit I did a sloppy job reviewing this; yet this is not in any of the released runc versions (and this is why it needs to be reverted now, before we officially release it).

Now, this has to be re-implemented in the right way, starting from runtime-spec (see opencontainers/runtime-spec#1253), then in runtimes (such as cri-o and containerd), when in low level runtimes (such as runc and crun).

Any help (esp in cri-o and containerd) is appreciated.

@lifubang lifubang mentioned this pull request Jun 10, 2024
@kolyshkin
Copy link
Contributor

@cclerget I am sorry this had to be reverted, but now when opencontainers/runtime-spec#1253 is merged we can work on reviving it if you're still interested. Now, 90% of the work is to be done by containerd/cri-o (they should do the work of converting pod annotation (what used to be org.opencontainers.runc.exec.isolated-cpu-affinity-transition) to process.ExecCPUAffinity setting for runc, and runc should only enforce the affinity.

Let me know if you still want to work on this and we'll figure it out.

@cclerget
Copy link
Contributor Author

cclerget commented Jul 9, 2024

@kolyshkin It's unfortunate but understand the rational behind this decision.

I'm more concerned about the revert of the fix not directly related to the original issue like mentioned in my comment #3923 (comment).

It has also been noted by the customer when using a tuned profile like CPU-partitioning, the CPU affinity patch at least on kernel < 6.2 or RHEL < 9 doesn't fix the runc "noisy neighbor" behavior after cgroup transition, on those kernels and systems execCPUAffinity.final won't be effective without a rework of cgroup handling, I'm not sure if this would work with the various cgroup managers, but the idea is to create a child cgroup (let's call it nsexec) under the container cgroup when execCPUAffinity.initial is set and execCPUAffinity.final is set and different, in order to put the runc exec process in this nsexec child cgroup, such that this child cgroup could set a more restricted cpuset reflecting execCPUAffinity.final when different from execCPUAffinity.initial and restrict the executed process to this specific cpuset to remediate to the "noisy neighbor" behavior. Does that make sense ?

I'd like to but right now I haven't much room to work on it, maybe in two weeks

@jsco2t
Copy link

jsco2t commented Aug 8, 2024

@kolyshkin has there been any commitment by containerd/cri-o as to when they might land their portion of this solution? I looked through the recent issues in containerd but didn't see anything to suggest to me that they were doing planning around this feature area (though I certainly could have missed it).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issue joining cgroups cpuset with kernel scheduler task "random" distribution