Issue joining cgroups cpuset with kernel scheduler task "random" distribution #3922

Open
Tracked by #4114
cclerget opened this issue Jun 30, 2023 · 2 comments · Fixed by #3923 · May be fixed by #4327
@cclerget
Contributor

Description

A customer reported an issue to us when attempting to join a running container inside Kubernetes (kubectl exec ...). The container runs a real-time application that takes advantage of the cores allocated to it: the application uses the first CPU core of the allocated range for a slow thread (SCHED_OTHER policy) responsible for spawning RT threads (running under the SCHED_FIFO policy), each running on a core.

They have configured Kubernetes to allocate CPU cores only within a specific range (all marked as isolated CPUs): they use the Kubernetes CPU manager with the static policy and have excluded all housekeeping CPUs from being allocated to a pod/container. Their machine is configured like this:

  • For the kernel command line:
isolcpus=managed_irq,domain,2-23,26-47 nmi_watchdog=0 nohz=on nohz_full=2-23,26-47
rcu_nocb_poll=1 rcu_nocbs=2-23,26-47 irqaffinity=0,1,24,25
  • Relevant sysctl:
kernel.hung_task_timeout_secs = 600
kernel.nmi_watchdog = 0
kernel.sched_rt_runtime_us = -1
vm.stat_interval = 10
kernel.timer_migration = 0
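
As a rough cross-check of the settings above, the following sketch expands the isolcpus= list and derives the remaining housekeeping CPUs. The helper names (parse_cpu_list, isolated_cpus) are hypothetical, and the total of 48 CPUs is an assumption inferred from the highest CPU number in the ranges, not something stated in the report. Only the simple "a-b,c" list form is handled.

```python
def parse_cpu_list(text):
    """Expand a CPU list such as "2-23,26-47" into a sorted list of ints."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def isolated_cpus(cmdline):
    """Return the CPUs named by isolcpus= in a kernel command line string."""
    for token in cmdline.split():
        if token.startswith("isolcpus="):
            value = token[len("isolcpus="):]
            # Leading items such as "managed_irq" or "domain" are flags, not CPUs.
            cpu_parts = [p for p in value.split(",") if p[:1].isdigit()]
            return parse_cpu_list(",".join(cpu_parts))
    return []


# Kernel command line copied from the report, joined onto one line.
CMDLINE = (
    "isolcpus=managed_irq,domain,2-23,26-47 nmi_watchdog=0 nohz=on "
    "nohz_full=2-23,26-47 rcu_nocb_poll=1 rcu_nocbs=2-23,26-47 "
    "irqaffinity=0,1,24,25"
)
TOTAL_CPUS = 48  # assumption: CPUs are numbered 0-47

isolated = isolated_cpus(CMDLINE)
housekeeping = [c for c in range(TOTAL_CPUS) if c not in isolated]
print(housekeeping)  # [0, 1, 24, 25]
```

Under that assumption, the housekeeping set matches the irqaffinity=0,1,24,25 value on the command line.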

The customer used this configuration successfully until RHEL 8.4. With the introduction of this patch in 8.4, a process (runc in this context) that enters a cgroup cpuset is scheduled on a random CPU of that cpuset. Before the patch, runc was always scheduled on the first CPU core of the cgroup cpuset, which worked fine because that core was used by the slow thread running under the SCHED_OTHER policy. Since the kernel patch, runc can be scheduled on a core that is fully occupied by an RT thread running under the SCHED_FIFO policy; with kernel.sched_rt_runtime_us=-1 there is no room for runc to execute, and the process gets stuck. When this occurs, some other processes were observed to become unresponsive; so far, systemd (PID 1) was also seen stuck in a kernel call to proc_cgroup_show.

This is a corner case, but it is serious enough to lock up a system.
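
To make the idea of "landing on the first core of the cpuset" concrete, here is a minimal Python sketch that pins a process to the lowest-numbered CPU of a cpuset string before writing its PID into a cgroup. It is an illustration only and is not taken from the referenced PRs: the function names and the cgroup_procs_path parameter are hypothetical, and it assumes a Linux host where the caller's current cpuset already permits that CPU (otherwise os.sched_setaffinity fails).

```python
import os


def first_cpu(cpuset):
    """Return the lowest CPU number in a list such as "2-3,5"."""
    lowest = None
    for part in cpuset.split(","):
        part = part.strip()
        if not part:
            continue
        start = int(part.split("-", 1)[0])  # a range "a-b" starts at a
        if lowest is None or start < lowest:
            lowest = start
    if lowest is None:
        raise ValueError("empty cpuset")
    return lowest


def pin_then_join(cpuset, cgroup_procs_path):
    """Pin the calling process to one CPU, then join the cgroup.

    cgroup_procs_path is a hypothetical path to the target cgroup's
    cgroup.procs file; requires Linux and sufficient privileges.
    """
    cpu = first_cpu(cpuset)
    # Restrict the affinity mask first, so the task has a single allowed
    # CPU at the moment it enters the cpuset.
    os.sched_setaffinity(0, {cpu})
    with open(cgroup_procs_path, "w") as f:
        f.write(str(os.getpid()))
    return cpu


print(first_cpu("2-3,5"))  # 2
```

In the reported setup the first core hosts the SCHED_OTHER thread, so a process pinned there can still make progress; whether pinning before or after the cgroup move is appropriate depends on the kernel behavior and is left open here.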

Steps to reproduce the issue

Please find attached an archive with a reproducer based on vagrant/libvirt.

Decompress the archive and run vagrant up && vagrant halt && vagrant up

Then open a terminal in the vagrant VM with vagrant ssh and execute:

./reproducer.sh install
./reproducer.sh run 2-3,5

In another vagrant VM terminal, run ./reproducer.sh exec sh. The command should hang, and the system with it: you should not be able to open another vagrant terminal with vagrant ssh until the command in the first terminal is interrupted.

If you retry by running ./reproducer.sh run 2-3,5 in the first terminal, but this time ./reproducer.sh exec-patch sh in the second terminal, the system operates correctly (a PR with the patch is in progress).

cpuset-issue-runc-repro.tar.gz

Describe the results you received and expected

The system hangs instead of operating correctly.

What version of runc are you using?

runc 1.0.2 (but the version doesn't really matter here)

Host OS information

RHEL 8.X

Host kernel information

RHEL 8.X kernels

@kolyshkin kolyshkin reopened this May 18, 2024
kolyshkin added a commit to kolyshkin/runtime-spec that referenced this issue May 18, 2024
This allows to set initial and final CPU affinity for a process being
run in a container, which is needed to solve the issue described in [1].

[1] opencontainers/runc#3922

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
kolyshkin added a commit to kolyshkin/runtime-spec that referenced this issue May 19, 2024
This allows to set initial and final CPU affinity for a process being
run in a container, which is needed to solve the issue described in [1].

[1] opencontainers/runc#3922

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
kolyshkin added a commit to kolyshkin/runtime-spec that referenced this issue Jun 2, 2024
This allows to set initial and final CPU affinity for a process being
run in a container, which is needed to solve the issue described in [1].

[1] opencontainers/runc#3922

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
kolyshkin added a commit to kolyshkin/runtime-spec that referenced this issue Jun 11, 2024
This allows to set initial and final CPU affinity for a process being
run in a container, which is needed to solve the issue described in [1].

[1] opencontainers/runc#3922

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin kolyshkin added this to the 1.2.0 milestone Jul 3, 2024
@kolyshkin kolyshkin mentioned this issue Jul 3, 2024
21 tasks
@kolyshkin
Contributor

This is going to be implemented via opencontainers/runtime-spec#1253

@cyphar
Member

cyphar commented Oct 21, 2024

Moving to 1.3.0 since it's a spec issue, and we agreed to move it in the 1.2.0 mega-thread.

@cyphar cyphar modified the milestones: 1.2.0, 1.3.0 Oct 21, 2024
Zheaoli pushed a commit to Zheaoli/runtime-spec that referenced this issue Dec 20, 2024
This allows to set initial and final CPU affinity for a process being
run in a container, which is needed to solve the issue described in [1].

[1] opencontainers/runc#3922

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin kolyshkin linked a pull request Jan 16, 2025 that will close this issue