Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cgroup: Use kernel command line to disable memory cgroup #6439

Merged

Conversation

mairacanal
Copy link
Contributor

Commit 94a23e9 ("cgroup: Disable cgroup "memory" by default") disabled the memory cgroup by default when initing the cgroups. However, it's possible to disable the memory cgroup by a kernel command line. Hard-coding such a feature can be problematic as some memory management features depend on the order that things are set.

For example, it is possible to see a NULL pointer dereference caused by commit 94a23e9. The NULL pointer dereference is triggered by the memory shrinker and ends up in a kernel crash.

[ 50.028629] ==================================================================
[ 50.028645] BUG: KASAN: null-ptr-deref in do_shrink_slab+0x1fc/0x978
[ 50.028663] Write of size 8 at addr 0000000000000000 by task gfxrecon-replay/1965

[ 50.028676] CPU: 3 UID: 1000 PID: 1965 Comm: gfxrecon-replay Tainted: G C 6.12.0-rc4-v8-thp-kasan+ #85
[ 50.028685] Tainted: [C]=CRAP
[ 50.028689] Hardware name: Raspberry Pi 5 Model B Rev 1.0 (DT)
[ 50.028694] Call trace:
[ 50.028697] dump_backtrace+0xfc/0x120
[ 50.028706] show_stack+0x24/0x38
[ 50.028711] dump_stack_lvl+0x40/0x88
[ 50.028720] print_report+0xe4/0x708
[ 50.028728] kasan_report+0xcc/0x130
[ 50.028733] kasan_check_range+0x254/0x298
[ 50.028738] __kasan_check_write+0x20/0x30
[ 50.028745] do_shrink_slab+0x1fc/0x978
[ 50.028751] shrink_slab+0x318/0xc38
[ 50.028756] shrink_one+0x254/0x6d8
[ 50.028762] shrink_node+0x26b4/0x2848
[ 50.028767] do_try_to_free_pages+0x3e4/0x1190
[ 50.028773] try_to_free_pages+0x5a4/0xb40
[ 50.028778] __alloc_pages_direct_reclaim+0x144/0x298
[ 50.028787] __alloc_pages_slowpath+0x5c4/0xc70
[ 50.028793] __alloc_pages_noprof+0x4a8/0x6a8
[ 50.028800] __folio_alloc_noprof+0x24/0xa8
[ 50.028806] shmem_alloc_and_add_folio+0x2ec/0xce0
[ 50.028812] shmem_get_folio_gfp+0x380/0xc20
[ 50.028818] shmem_read_folio_gfp+0xe0/0x160
[ 50.028824] drm_gem_get_pages+0x238/0x620 [drm]
[ 50.029039] drm_gem_shmem_get_pages_sgt+0xd8/0x4b8 [drm_shmem_helper]
[ 50.029053] v3d_bo_create_finish+0x58/0x1e0 [v3d]
[ 50.029083] v3d_create_bo_ioctl+0xac/0x210 [v3d]
[ 50.029105] drm_ioctl_kernel+0x1d8/0x2b8 [drm]
[ 50.029220] drm_ioctl+0x4b4/0x920 [drm]
[ 50.029330] __arm64_sys_ioctl+0x11c/0x160
[ 50.029337] invoke_syscall+0x88/0x268
[ 50.029345] el0_svc_common+0x160/0x1d8
[ 50.029351] do_el0_svc+0x50/0x68
[ 50.029358] el0_svc+0x34/0x80
[ 50.029364] el0t_64_sync_handler+0x84/0x100
[ 50.029371] el0t_64_sync+0x190/0x198
[ 50.029376] ==================================================================

This happens because the memory shrinker is unaware that we are artificially disabling the memory cgroups and therefore it doesn't allocate nr_deferred (as it would if we used the kernel command line).

To avoid such an issue, revert the artificial disablement and disable it through the command line. If a user wants to enable the feature, it can use the cgroup_enable= command line.

@pelwell
Copy link
Contributor

pelwell commented Oct 25, 2024

Hi Maira,

Thanks for this. Your patches missed out a few platforms because of the non-trivial include hierarchy. You may find it easier to build on top of #6442, which is effectively a cosmetic change that reduces the number of files to change to 6:

arch/arm/boot/dts/broadcom/bcm2708-rpi-bt.dtsi:27:              bootargs = "coherent_pool=1M 8250.nr_uarts=1 snd_bcm2835.enable_headphones=0";
arch/arm/boot/dts/broadcom/bcm270x.dtsi:7:              bootargs = "coherent_pool=1M snd_bcm2835.enable_headphones=0";
arch/arm/boot/dts/broadcom/bcm2711-rpi-cm4s.dts:151:            bootargs = "coherent_pool=1M snd_bcm2835.enable_headphones=0 numa_policy=interleave";
arch/arm/boot/dts/broadcom/bcm2711-rpi-ds.dtsi:6:               bootargs = "coherent_pool=1M 8250.nr_uarts=1 snd_bcm2835.enable_headphones=0 numa_policy=interleave";
arch/arm/boot/dts/broadcom/bcm271x-rpi-bt.dtsi:27:              bootargs = "coherent_pool=1M 8250.nr_uarts=1 snd_bcm2835.enable_headphones=0";
arch/arm64/boot/dts/broadcom/bcm2712-rpi.dtsi:102:              bootargs = "reboot=w coherent_pool=1M 8250.nr_uarts=1 pci=pcie_bus_safe numa_policy=interleave iommu_dm

@mairacanal
Copy link
Contributor Author

Thanks @pelwell! I'll wait #6442 to land and then I'll rebase this PR modifying the DTS needed.

@pelwell
Copy link
Contributor

pelwell commented Oct 25, 2024

#6442 is merged, so over to you.

Commit 94a23e9 ("cgroup: Disable cgroup "memory" by default")
disabled the memory cgroup by default when initing the cgroups. However,
it's possible to disable the memory cgroup by a kernel command line.
Hard-coding such a feature can be problematic as some memory management
features depend on the order that things are set.

For example, it is possible to see a NULL pointer dereference caused by
commit 94a23e9. The NULL pointer
dereference is triggered by the memory shrinker and ends up in a kernel
crash.

[   50.028629] ==================================================================
[   50.028645] BUG: KASAN: null-ptr-deref in do_shrink_slab+0x1fc/0x978
[   50.028663] Write of size 8 at addr 0000000000000000 by task gfxrecon-replay/1965

[   50.028676] CPU: 3 UID: 1000 PID: 1965 Comm: gfxrecon-replay Tainted: G         C         6.12.0-rc4-v8-thp-kasan+ raspberrypi#85
[   50.028685] Tainted: [C]=CRAP
[   50.028689] Hardware name: Raspberry Pi 5 Model B Rev 1.0 (DT)
[   50.028694] Call trace:
[   50.028697]  dump_backtrace+0xfc/0x120
[   50.028706]  show_stack+0x24/0x38
[   50.028711]  dump_stack_lvl+0x40/0x88
[   50.028720]  print_report+0xe4/0x708
[   50.028728]  kasan_report+0xcc/0x130
[   50.028733]  kasan_check_range+0x254/0x298
[   50.028738]  __kasan_check_write+0x20/0x30
[   50.028745]  do_shrink_slab+0x1fc/0x978
[   50.028751]  shrink_slab+0x318/0xc38
[   50.028756]  shrink_one+0x254/0x6d8
[   50.028762]  shrink_node+0x26b4/0x2848
[   50.028767]  do_try_to_free_pages+0x3e4/0x1190
[   50.028773]  try_to_free_pages+0x5a4/0xb40
[   50.028778]  __alloc_pages_direct_reclaim+0x144/0x298
[   50.028787]  __alloc_pages_slowpath+0x5c4/0xc70
[   50.028793]  __alloc_pages_noprof+0x4a8/0x6a8
[   50.028800]  __folio_alloc_noprof+0x24/0xa8
[   50.028806]  shmem_alloc_and_add_folio+0x2ec/0xce0
[   50.028812]  shmem_get_folio_gfp+0x380/0xc20
[   50.028818]  shmem_read_folio_gfp+0xe0/0x160
[   50.028824]  drm_gem_get_pages+0x238/0x620 [drm]
[   50.029039]  drm_gem_shmem_get_pages_sgt+0xd8/0x4b8 [drm_shmem_helper]
[   50.029053]  v3d_bo_create_finish+0x58/0x1e0 [v3d]
[   50.029083]  v3d_create_bo_ioctl+0xac/0x210 [v3d]
[   50.029105]  drm_ioctl_kernel+0x1d8/0x2b8 [drm]
[   50.029220]  drm_ioctl+0x4b4/0x920 [drm]
[   50.029330]  __arm64_sys_ioctl+0x11c/0x160
[   50.029337]  invoke_syscall+0x88/0x268
[   50.029345]  el0_svc_common+0x160/0x1d8
[   50.029351]  do_el0_svc+0x50/0x68
[   50.029358]  el0_svc+0x34/0x80
[   50.029364]  el0t_64_sync_handler+0x84/0x100
[   50.029371]  el0t_64_sync+0x190/0x198
[   50.029376] ==================================================================

This happens because the memory shrinker is unaware that we are
artificially disabling the memory cgroups and therefore it doesn't
allocate `nr_deferred` (as it would if we used the kernel command line).

To avoid such an issue, revert the artificial disablement and disable it
through the command line. If a user wants to enable the feature, it can
use the `cgroup_enable=` command line.

Signed-off-by: Maíra Canal <mcanal@igalia.com>
@mairacanal mairacanal force-pushed the cgroup/disable-cgroup-cmdline branch from 0974ecb to a870565 Compare October 25, 2024 15:13
@mairacanal
Copy link
Contributor Author

I rebased and updated the DTS files.

@pelwell pelwell merged commit 86099de into raspberrypi:rpi-6.6.y Oct 25, 2024
11 of 12 checks passed
@pelwell
Copy link
Contributor

pelwell commented Oct 25, 2024

Thanks!

popcornmix added a commit to raspberrypi/firmware that referenced this pull request Nov 1, 2024
kernel: drivers: media: bcm2835_isp: Cache LS table dmabuf
See: raspberrypi/linux#6429

kernel: arm64: dts: Sort out CM5 and I/O board I2C ports
See: raspberrypi/linux#6441

kernel: cgroup: Use kernel command line to disable memory cgroup
See: raspberrypi/linux#6439
popcornmix added a commit to raspberrypi/rpi-firmware that referenced this pull request Nov 1, 2024
kernel: drivers: media: bcm2835_isp: Cache LS table dmabuf
See: raspberrypi/linux#6429

kernel: arm64: dts: Sort out CM5 and I/O board I2C ports
See: raspberrypi/linux#6441

kernel: cgroup: Use kernel command line to disable memory cgroup
See: raspberrypi/linux#6439
@dkadioglu
Copy link

dkadioglu commented Dec 9, 2024

As far as I understand this change correctly, I should be able to reenable the cgroups memory controller by adding cgroup_enable=memory to the cmdline.txt. However, although I have set that, the cgroups memory controller is still not enabled:

dmesg | grep cgroup
[    0.000000] Kernel command line: reboot=w coherent_pool=1M 8250.nr_uarts=1 pci=pcie_bus_safe cgroup_disable=memory numa_policy=interleave  smsc95xx.macaddr=D8:3A:DD:9F:BE:95 vc_mem.mem_base=0x3fc00000 vc_mem.mem_size=0x40000000  root=PARTUUID=0008f858-02 rw rootwait console=ttyAMA4,115200 console=tty1 rootfstype=ext4 fsck.repair=yes kgdboc=ttyAMA4,115200 cpufreq.default_governor=performance net.ifnames=0 cgroup_memory=1 cgroup_enable=memory
[    0.000000] cgroup: Disabling memory control group subsystem
[    0.000000] Unknown kernel command line parameters "cgroup_memory=1 cgroup_enable=memory", will be passed to user space.
[    3.434730]     cgroup_memory=1
[    3.434731]     cgroup_enable=memory
#subsys_name  hierarchy  num_cgroups  enabled
cpuset        0          95           1
cpu           0          95           1
cpuacct       0          95           1
blkio         0          95           1
memory        0          95           0
devices       0          95           1
freezer       0          95           1
net_cls       0          95           1
perf_event    0          95           1
net_prio      0          95           1
pids          0          95           1

System is a RPi 5 with Arch Linux Arm and the most recent version of the linux-rpi kernel: Linux host 6.12.3-1-rpi #1 SMP PREEMPT Sat Dec 7 07:39:05 MST 2024 aarch64 GNU/Linux
Am I doing something wrong here?

@mairacanal
Copy link
Contributor Author

@dkadioglu Please note that cgroup_enable is unchanged in this PR. Therefore, your issue is unrelated to it.

Also, note that commit 94a23e9 ("cgroup: Disable cgroup "memory" by default") doesn't exist in the rpi-6.12.y tree. This means that cgroup_enable= doesn't exist in the branch. I'd suggest you create an issue with your report.

@pelwell
Copy link
Contributor

pelwell commented Dec 9, 2024

I think that's a bit of a waste of effort - we'll sort it out.

@pelwell
Copy link
Contributor

pelwell commented Dec 9, 2024

Do you really need the V1 memory cgroup support? I have a patch that re-enables the memory cgroup (see #6524), but if you look in /proc/cgroups you'll find that it is absent. However, /sys/fs/cgroup/ includes memory.numa_state, memory.reclaim, etc., all of which disappear without my patch and cgroup_enable=memory.

If you were an RPiOS user I'd suggest you wait about 35 minutes for the auto-builds to complete, then run sudo rpi-update pulls/6524, after first backing up any important data. I'm not sure if that will work on Arch.

If you don't know for certain whether the legacy V1 memory cgroup support is needed, perhaps you could say what your use case is so we can test it here.

@dkadioglu
Copy link

dkadioglu commented Dec 9, 2024

Actually, I am not sure, if I need V1 memory cgroup support. My use case is running docker containers with resource limitations, for example:

deploy:
   resources:
    limits:
      cpus: '0.70'
      memory: 1G
    reservations:
      cpus: '0.5'
      memory: 512M

However, with the kernels 6.6.x I needed cgroup_enable=memory to run a container with such resource limits without error. Now, with 6.12 I see again 'Your kernel does not support memory soft limit capabilities or the cgroup is not mounted. Limitation discarded.', although I have cgroup_enable=memory in the cmdline. Is maybe Docker the culprit here, that they still use V1 cgroups? Or do I miss anything else here?

When I run ls /sys/fs/cgroup/ I see the following:

cgroup.controllers  cgroup.max.descendants  cgroup.procs  cgroup.subtree_control  cpu.pressure           cpuset.cpus.isolated   cpu.stat        dev-mqueue.mount  io.pressure  memory.pressure     proc-sys-fs-binfmt_misc.mount  sys-kernel-config.mount  sys-kernel-tracing.mount  user.slice
cgroup.max.depth    cgroup.pressure         cgroup.stat   cgroup.threads          cpuset.cpus.effective  cpuset.mems.effective  cpu.stat.local  init.scope        io.stat      proc-fs-nfsd.mount  sys-fs-fuse-connections.mount  sys-kernel-debug.mount   system.slice

The only memory related entry seems to be memory.pressure. Why do you see more here?

@pelwell
Copy link
Contributor

pelwell commented Dec 9, 2024

I'm pretty sure that you'll need my patch - the question is whether you also need CONFIG_MEMCG_V1.

@dkadioglu
Copy link

At least on my current kernel it seems to be enabled, zgrep 'CONFIG_MEMCG' /proc/config.gz says:

CONFIG_MEMCG=y
CONFIG_MEMCG_V1=y

@pelwell
Copy link
Contributor

pelwell commented Dec 9, 2024

OK - in that case I'll just merge the PR, and eventually it will appear in an Arch kernel build.

@dkadioglu
Copy link

Thank you very much!
One question though, as I am a bit puzzled here: Is Docker doing something wrong here or do they not (fully) support cgroup V2? And do I understand it right, that the cgroup V2 memory endpoints do not appear under /sys/fs/cgroup/ due to the cgroup_disable=memory in the device tree?

@pelwell
Copy link
Contributor

pelwell commented Dec 9, 2024

Is Docker doing something wrong here or do they not (fully) support cgroup V2?

There's no evidence to support that. In order to find out, you'd need a kernel without CONFIG_MEMCG_V1 and with my patch.

And do I understand it right, that the cgroup V2 memory endpoints do not appear under /sys/fs/cgroup/ due to the cgroup_disable=memory in the device tree?

Yes, but with my patch you'll be able to override that with cgroup_enable=memory in cmdline.txt.

@dkadioglu
Copy link

dkadioglu commented Dec 10, 2024

I can confirm that the patch fixed it for me, thank you very much for your immediate support! ls /sys/fs/cgroup/:

cgroup.controllers      cgroup.pressure  cgroup.subtree_control  cpuset.cpus.effective  cpu.stat          init.scope   memory.numa_stat  memory.stat             proc-sys-fs-binfmt_misc.mount  sys-kernel-debug.mount    user.slice
cgroup.max.depth        cgroup.procs     cgroup.threads          cpuset.cpus.isolated   cpu.stat.local    io.pressure  memory.pressure   memory.zswap.writeback  sys-fs-fuse-connections.mount  sys-kernel-tracing.mount
cgroup.max.descendants  cgroup.stat      cpu.pressure            cpuset.mems.effective  dev-mqueue.mount  io.stat      memory.reclaim    proc-fs-nfsd.mount      sys-kernel-config.mount        system.slice

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants