Issue joining cgroups cpuset with kernel scheduler task "random" distribution #3922
Open · Tracked by #4114
kolyshkin added commits to kolyshkin/runtime-spec that referenced this issue (May 18 – Jun 11, 2024):
This allows to set initial and final CPU affinity for a process being run in a container, which is needed to solve the issue described in [1]. [1] opencontainers/runc#3922 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is going to be implemented via opencontainers/runtime-spec#1253.
Moving to 1.3.0 since it's a spec issue, and we agreed to move it in the 1.2.0 mega-thread.
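For context, a rough sketch of what this could look like in the container's config.json, assuming the initial/final CPU affinity fields proposed in opencontainers/runtime-spec#1253 (field names, placement, and the CPU values below are illustrative and may differ from whatever the spec finally adopts):

```json
{
  "process": {
    "args": ["sh"],
    "execCPUAffinity": {
      "initial": "2",
      "final": "2-3,5"
    }
  }
}
```

As I understand the proposal, `initial` constrains where the joining process may run before it is moved into the container's cgroup, and `final` is applied afterwards, so it never ends up on a core fully occupied by a SCHED_FIFO thread.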
Description
A customer reported an issue to us when attempting to join a running container inside Kubernetes (`kubectl exec ...`). The container runs a real-time application that takes advantage of the cores allocated to this container: the application uses the first CPU core of the allocated range for a slow thread (SCHED_OTHER policy) responsible for spawning RT threads (running under the SCHED_FIFO policy), each running on its own core.

They have configured Kubernetes to ensure that it allocates CPU cores within a specific range (all marked as isolated CPUs); they are using the Kubernetes CPU manager with the static policy and have excluded all housekeeping CPUs from being allocated to a pod/container. Their machine is configured like this:
The customer had used this configuration successfully until RHEL 8.4, but with the introduction of this patch in 8.4, a random CPU assignment/scheduling occurs when a process (`runc` in this context) enters a cgroup cpuset. Before that patch, `runc` was always scheduled on the first CPU core of the cgroup cpuset, which worked fine because the first core was used by the slow thread running under the SCHED_OTHER policy. Since the introduction of the kernel patch, `runc` can be randomly scheduled on a core that is fully taken by an RT thread running under the SCHED_FIFO policy; with `kernel.sched_rt_runtime_us=-1` there is no room left for `runc` to execute, and the process gets stuck. When this occurs, some other processes have been observed to become unresponsive as well; so far, `systemd` (PID 1) was also seen stuck in a kernel call to `proc_cgroup_show`.

This is a corner-case issue, but it is serious enough to lock up a system.
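To illustrate the starvation mechanism outside of the container stack, here is a minimal sketch (not the attached reproducer; the CPU number and RT priority are arbitrary): a SCHED_FIFO busy loop pinned to a core, with RT throttling disabled as in the customer's setup, prevents any SCHED_OTHER task pinned to the same core from ever running.

```sh
# Disable RT throttling, as in the customer's setup.
sudo sysctl -w kernel.sched_rt_runtime_us=-1

# SCHED_FIFO busy loop pinned to CPU 3 (arbitrary core, priority 50 for illustration).
sudo taskset -c 3 chrt -f 50 sh -c 'while :; do :; done' &

# A SCHED_OTHER task pinned to the same core never gets CPU time while the loop runs;
# this is effectively what happens to runc when the kernel places it on such a core.
sudo taskset -c 3 sh -c 'echo "this line is never printed"'
```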
Steps to reproduce the issue
Please find attached an archive with a reproducer based on vagrant/libvirt.

1. Decompress the archive and run `vagrant up && vagrant halt && vagrant up`.
2. Open a vagrant VM terminal with `vagrant ssh` and execute:
3. In another vagrant VM terminal, run `./reproducer.sh exec sh`. The command should get stuck, and so should the system: you shouldn't be able to open another vagrant terminal with `vagrant ssh` until the command in the first terminal is interrupted.
4. If you retry by running `./reproducer.sh run 2-3,5` in the first terminal but `./reproducer.sh exec-patch sh` in the second terminal, the system now operates correctly (a PR with this patch is in progress).

cpuset-issue-runc-repro.tar.gz
Describe the results you received and expected
The system gets stuck instead of operating correctly.
What version of runc are you using?
runc 1.0.2 (but doesn't really matter here)
Host OS information
RHEL 8.X
Host kernel information
RHEL 8.X kernels