Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workqueue: events_freezable mmc_rescan crash with Raspbian kernel 4.14.79-v7+ #2810

Open
op-ecoprintq opened this issue Jan 14, 2019 · 71 comments

Comments

@op-ecoprintq
Copy link

I have a few Raspberry Pi 3 B+ exhibiting the same problem. They crash after 2-3 days of uptime with the following error:

[169451.220021] INFO: task kworker/0:3:10949 blocked for more than 120 seconds.
[169451.220036] Tainted: G C 4.14.79-v7+ #1159
[169451.220041] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[169451.220048] kworker/0:3 D 0 10949 2 0x00000000
[169451.220077] Workqueue: events_freezable mmc_rescan
[169451.220110] [<8079ef70>] (__schedule) from [<8079f5d8>] (schedule+0x50/0xa8)
[169451.220130] [<8079f5d8>] (schedule) from [<8061a2d0>] (__mmc_claim_host+0xb8/0x1cc)
[169451.220147] [<8061a2d0>] (__mmc_claim_host) from [<8061a414>] (mmc_get_card+0x30/0x34)
[169451.220163] [<8061a414>] (mmc_get_card) from [<80623010>] (mmc_sd_detect+0x20/0x74)
[169451.220179] [<80623010>] (mmc_sd_detect) from [<8061ccdc>] (mmc_rescan+0x1c8/0x394)
[169451.220197] [<8061ccdc>] (mmc_rescan) from [<801379b4>] (process_one_work+0x158/0x454)
[169451.220212] [<801379b4>] (process_one_work) from [<80137d14>] (worker_thread+0x64/0x5b8)
[169451.220227] [<80137d14>] (worker_thread) from [<8013dd98>] (kthread+0x13c/0x16c)
[169451.220246] [<8013dd98>] (kthread) from [<801080ac>] (ret_from_fork+0x14/0x28)

The machines are running Rasbian Stretch

@ldrolez
Copy link

ldrolez commented Jan 18, 2019

Same problem on a 3B after upgrading to 4.14.79-v7+
Crash every 1-3 days. Now downgrading to 4.14.52-v7+

@JamesH65
Copy link
Contributor

Hmm, what are the Pi's doing during that time? We haven't seen anything like this in the office, so would be interested to know what they are doing that might cause this.

@JamesH65
Copy link
Contributor

JamesH65 commented Jan 18, 2019

Actually, this is a duplicate of an issue on our Linux tracker which is the correct place for it to be. Please continue conversation there...#2810

@JamesH65 JamesH65 reopened this Jan 18, 2019
@JamesH65
Copy link
Contributor

Bum, closed wrong one. Reopened. Will now attempt to close the one in firmware...

@op-ecoprintq
Copy link
Author

So the machines in question are running LXDE, a Java application, and the onboard keyboard.

@pelwell
Copy link
Contributor

pelwell commented Jan 18, 2019

That doesn't tell me anything about their level of activity - CPU, file system, network etc.

@Alexurb
Copy link

Alexurb commented Jan 19, 2019

Same issue here. Attached is my kernel log.
kern.log
The issue occurs around Jan 18 19:57:02
This is on a headless install. The green light is constant, the PI responds to pings but is unable to open up an ssh session.
This actually occurs for me during a movie being streamed by plex , around an hour and a half in.
I left the pi to see if it would recover overnight, sadly it didn't.

@Alexurb
Copy link

Alexurb commented Jan 21, 2019

Happened again mid movie. Attached is another kernel log
kern.log

@Alexurb
Copy link

Alexurb commented Jan 23, 2019

I've so far found some advice from https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/
summary:
sudo sysctl -w vm.dirty_ratio=10
sudo sysctl -w vm.dirty_background_ratio=5
sudo sysctl -p
seems to do the trick so far, still testing (3 movies down)
I'll comment if it occurs again.
To make it permanent(after reboot) read the link

@Alexurb
Copy link

Alexurb commented Jan 23, 2019

I've so far found some advice from https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/
summary:
sudo sysctl -w vm.dirty_ratio=10
sudo sysctl -w vm.dirty_background_ratio=5
sudo sysctl -p
seems to do the trick so far, still testing (3 movies down)
I'll comment if it occurs again.
To make it permanent(after reboot) read the link

So far so good.

@peabody
Copy link

peabody commented Aug 29, 2019

Just going say this thread saved my pi 3b+. Was hanging terrible on any intense IO. The solution by Alexurb fixed the problem for me.

@jdrch
Copy link

jdrch commented Apr 4, 2020

Same problem here on my Pi 3B+ running Raspbian Buster. Gonna try that edit and then reboot. Fingers crossed!

@paralin
Copy link

paralin commented Sep 22, 2020

Just ran into this with kernel 5.4.51-v8 aarch64 on Pi 4.

Ill try the fix recommended above in this thread:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

Here's the dmesg errors:

[11722.001585] INFO: task kworker/1:0:1663 blocked for more than 120 seconds.
[11722.005139]       Tainted: G         C        5.4.51-v8 #1
[11722.008672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11722.012222] kworker/1:0     D    0  1663      2 0x00000028
[11722.015773] Workqueue: events_freezable mmc_rescan
[11722.019269] Call trace:
[11722.022724]  __switch_to+0x110/0x180
[11722.026149]  __schedule+0x2f4/0x750
[11722.029523]  schedule+0x44/0xe0
[11722.032807]  __mmc_claim_host+0xb8/0x210
[11722.036111]  mmc_get_card+0x38/0x50
[11722.039407]  mmc_sd_detect+0x24/0x90
[11722.042671]  mmc_rescan+0xc8/0x390
[11722.045892]  process_one_work+0x1c0/0x470
[11722.049078]  worker_thread+0x50/0x430
[11722.052405]  kthread+0x100/0x130
[11722.055223]  ret_from_fork+0x10/0x18

paralin added a commit to skiffos/SkiffOS that referenced this issue Sep 22, 2020
May fix a lockup observed against Pi 4 64 bit:

INFO: task kworker/1:0:1663 blocked for more than 120 seconds.
Workqueue: events_freezable mmc_rescan
Call trace:
  __switch_to+0x110/0x180
  __schedule+0x2f4/0x750
  schedule+0x44/0xe0
  __mmc_claim_host+0xb8/0x210
  mmc_get_card+0x38/0x50
  mmc_sd_detect+0x24/0x90
  mmc_rescan+0xc8/0x390
  process_one_work+0x1c0/0x470
  worker_thread+0x50/0x430
  kthread+0x100/0x130

sets syctl values for pi boards:

+vm.dirty_background_ratio = 5
+vm.dirty_ratio = 10

Reference: raspberrypi/linux#2810

Signed-off-by: Christian Stewart <christian@paral.in>
paralin added a commit to skiffos/SkiffOS that referenced this issue Sep 22, 2020
May fix a lockup observed against Pi 4 64 bit:

INFO: task kworker/1:0:1663 blocked for more than 120 seconds.
Workqueue: events_freezable mmc_rescan
Call trace:
  __switch_to+0x110/0x180
  __schedule+0x2f4/0x750
  schedule+0x44/0xe0
  __mmc_claim_host+0xb8/0x210
  mmc_get_card+0x38/0x50
  mmc_sd_detect+0x24/0x90
  mmc_rescan+0xc8/0x390
  process_one_work+0x1c0/0x470
  worker_thread+0x50/0x430
  kthread+0x100/0x130

sets syctl values for pi boards:

+vm.dirty_background_ratio = 5
+vm.dirty_ratio = 10

Reference: raspberrypi/linux#2810

Signed-off-by: Christian Stewart <christian@paral.in>
paralin added a commit to skiffos/SkiffOS that referenced this issue Sep 22, 2020
May fix a lockup observed against Pi 4 64 bit:

INFO: task kworker/1:0:1663 blocked for more than 120 seconds.
Workqueue: events_freezable mmc_rescan
Call trace:
  __switch_to+0x110/0x180
  __schedule+0x2f4/0x750
  schedule+0x44/0xe0
  __mmc_claim_host+0xb8/0x210
  mmc_get_card+0x38/0x50
  mmc_sd_detect+0x24/0x90
  mmc_rescan+0xc8/0x390
  process_one_work+0x1c0/0x470
  worker_thread+0x50/0x430
  kthread+0x100/0x130

sets syctl values for pi boards:

+vm.dirty_background_ratio = 5
+vm.dirty_ratio = 10

Reference: raspberrypi/linux#2810

Signed-off-by: Christian Stewart <christian@paral.in>
@jdrch
Copy link

jdrch commented Nov 22, 2020

will Too much of write can reduce life of SD CARDS?
vm.dirty_background_ratio = 5
vm.dirty_ratio = 80
Can we try this setting.

My card hasn't failed since I made the change on April 4 of this year. I've been using the same SanDisk microSD card I bought new since mid-2018. I think you only need to worry about microSD card life if you're using super cheap or no name cards. If you use a high tier SanDisk you should be just fine.

@ranma
Copy link
Contributor

ranma commented Feb 23, 2021

Seeing the same, seems to be a memory leak in lxpanel causing it to run out of memory and start swapping until it dies.
htop output from ssh session I left running:

  Mem[||||||||||||||||||||||||||||||||||||||||||||882M/924M]   Uptime: 07:48:17
[...]
  719 pi         20   0  882M  488M   204 D  0.7 52.9 35:19.23 lxpanel --profile LXDE-pi
[...]

Straight after reboot:

  Mem[||||||||||||||||||||||||||||||||||||||||||||390M/924M]   Uptime: 00:03:50
[...]
 726 pi         20   0  401M 30128 23492 S  5.1  3.2  0:05.68 lxpanel --profile LXDE-pi
[...]

See also https://www.raspberrypi.org/forums/viewtopic.php?t=267015

@jdrch
Copy link

jdrch commented Feb 23, 2021

I haven't noticed this problem on my 3B+ since July 21 of last year, which is just before The Foundation engineer announced a fix had been committed.

Are you fully patched and updated on Buster and still seeing the issue?

One thing I'm thinking based on both my repo issue thread and the Raspberry Pi Forum one you linked to (thanks!) is this could be triggered by having the CPU % lxpanel plugin AND a process that causes CPU % to spike for a significant length of time, e.g. backintime as I did. If you read through my issue thread you'll notice the whole thing went away once I removed backintime.

@ranma
Copy link
Contributor

ranma commented Mar 6, 2021

This is on a 1GB 3B+ with a freshly installed and updated raspbian image, so should it is buster:
Raspberry Pi 3 Model B Plus Rev 1.3
Raspberry Pi reference 2021-01-11
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 21090519d85bdaa1615d5d5057d37b09368ea5d2, stage4
Linux raspberrypi 5.10.11-v7+ #1399 SMP Thu Jan 28 12:06:05 GMT 2021 armv7l GNU/Linux
Videocore Jan 27 2021 22:26:53

It was crashing every ~6 hours from high lxpanel memory usage until I set up a script to restart lxpanel once an hour.

I also added:
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5

Besides lxde there's influxdb and a autostart chrome with grafana running, plus I'm using the official 7" DSI touch display.
It's actually still likes to crash but less often, I haven't investigated the more recent ones yet.

@ranma
Copy link
Contributor

ranma commented Mar 6, 2021

Actually it's still crashing from lxpanel it seems, so the hourly restart isn't good enough to prevent whatever lxpanel is doing.
I also would have expected the OOM-killer to kick in, but somehow the system seems to lock up instead?

htop excerpt:

  1  [||||||||                                        13.0%]   Tasks: 78, 157 thr; 3 running
  2  [|||||||||                                       15.3%]   Load average: 39.39 40.12 37.97 
  3  [|||||||||                                       14.3%]   Uptime: 07:53:55
  4  [||||||||||                                      16.8%]
  Mem[||||||||||||||||||||||||||||||||||||||||||||881M/924M]
  Swp[||||||||||||||||||||||||||||||||||||||||100.0M/100.0M]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command                                                         
  457 grafana    20   0  934M 10728     0 S 32.2  1.1 29:40.85 /usr/sbin/grafana-server --config=/etc/grafana/grafana.ini --pid
  451 influxdb   20   0  978M  9476     0 D 28.4  1.0 35:29.89 /usr/bin/influxd -config /etc/influxdb/influxdb.conf
  494 grafana    20   0  934M 10728     0 R 20.4  1.1 15:36.50 /usr/sbin/grafana-server --config=/etc/grafana/grafana.ini --pid
  731 grafana    20   0  934M 10728     0 S 11.6  1.1  9:18.06 /usr/sbin/grafana-server --config=/etc/grafana/grafana.ini --pid
 1347 influxdb   20   0  978M  9476     0 D  6.1  1.0  5:24.33 /usr/bin/influxd -config /etc/influxdb/influxdb.conf
[...]
  795 pi         20   0  636M 39568     0 D  1.0  4.2  3:43.82 /usr/lib/chromium-browser/chromium-browser-v7 --force-renderer-a
 1282 pi         20   0  429M 72396   372 D  0.8  7.6  4:00.12 /usr/lib/chromium-browser/chromium-browser-v7 --type=renderer --
[...]
 760 pi         20   0 1012M  591M     0 D  0.4 64.0 28:24.12 lxpanel --profile LXDE-pi

Kernel messages excerpt:

[  119.755918] logitech-hidpp-device 0003:046D:404D.0004: HID++ 4.1 device connected.
[14620.750775] INFO: task kworker/2:0:7862 blocked for more than 122 seconds.
[14620.750798]       Tainted: G         C        5.10.11-v7+ #1399
[14620.750807] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[14620.750818] task:kworker/2:0     state:D stack:    0 pid: 7862 ppid:     2 flags:0x00000000
[14620.750859] Workqueue: events_freezable mmc_rescan
[14620.750876] Backtrace:
[14620.750906] [<809e5190>] (__schedule) from [<809e5b60>] (schedule+0x68/0xe4)
[14620.750924]  r10:822b1800 r9:ffffe000 r8:00000000 r7:00000000 r6:40000113 r5:9d8b2e80
[14620.750934]  r4:ffffe000
[14620.750951] [<809e5af8>] (schedule) from [<807f72f8>] (__mmc_claim_host+0xe0/0x238)
[14620.750962]  r5:822b1a18 r4:00000002
[14620.750978] [<807f7218>] (__mmc_claim_host) from [<807f7488>] (mmc_get_card+0x38/0x3c)
[14620.750994]  r10:b6b53640 r9:00000000 r8:b6b56c00 r7:00000080 r6:822b1a18 r5:00000000
[14620.751005]  r4:822b2800
[14620.751041] [<807f7450>] (mmc_get_card) from [<808010a0>] (mmc_sd_detect+0x24/0x7c)
[14620.751053]  r5:822b1800 r4:822b1800
[14620.751071] [<8080107c>] (mmc_sd_detect) from [<807f9cc0>] (mmc_rescan+0xdc/0x3b0)
[14620.751083]  r5:822b1800 r4:822b1a7c
[14620.751103] [<807f9be4>] (mmc_rescan) from [<8013b78c>] (process_one_work+0x254/0x584)
[14620.751119]  r9:00000000 r8:b6b56c00 r7:00000080 r6:b6b53640 r5:89128200 r4:822b1a7c
[14620.751137] [<8013b538>] (process_one_work) from [<8013bb1c>] (worker_thread+0x60/0x5c4)
[14620.751153]  r10:b6b53640 r9:80e03d00 r8:b6b53658 r7:00000008 r6:b6b53640 r5:89128214
[14620.751164]  r4:89128200
[14620.751182] [<8013babc>] (worker_thread) from [<801435a4>] (kthread+0x170/0x174)
[14620.751198]  r10:adf9be74 r9:89128200 r8:8013babc r7:9b4ba000 r6:00000000 r5:ade02c80
[14620.751209]  r4:9170bf80
[14620.751225] [<80143434>] (kthread) from [<801000ec>] (ret_from_fork+0x14/0x28)
[14620.751236] Exception stack(0x9b4bbfb0 to 0x9b4bbff8)
[14620.751249] bfa0:                                     00000000 00000000 00000000 00000000
[14620.751264] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[14620.751277] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000  
[14620.751293]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:80143434
[14620.751303]  r4:ade02c80

@jdrch
Copy link

jdrch commented Mar 6, 2021

Hmmm ... I would suggest disabling cron jobs and/or processes and services one by one and seeing how that goes. As I said, once I removed backintime the issue went away (or at least is now sufficiently infrequent for me to not notice it at all). While lxpanel is crashing, I believe in some cases the crash is being prompted by something else.

@ranma
Copy link
Contributor

ranma commented Mar 7, 2021

So I've enabled memory cgroups + swap on zram and put lxpanel into a limited cgroup.
As expected, this morning I found it almost maxing out it's allocation limit and heavily swapping (but only the lxpanel process, thanks to the cgroup).

Time for some investigation:

(gdb) attach 3924
Attaching to program: /usr/bin/lxpanel, process 3924
[New LWP 3928]
[New LWP 3942]
[New LWP 3948]
[New LWP 3988]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
0x73822510 in g_dbus_object_get_object_path@plt () from /usr/lib/arm-linux-gnueabihf/lxpanel/plugins/bluetooth.so
(gdb) bt
#0  0x73822510 in g_dbus_object_get_object_path@plt () from /usr/lib/arm-linux-gnueabihf/lxpanel/plugins/bluetooth.so
#1  0x73824ae0 in ?? () from /usr/lib/arm-linux-gnueabihf/lxpanel/plugins/bluetooth.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) gcore lxpanel.core
warning: target file /proc/3924/cmdline contained unexpected null characters
warning: Memory read failed for corefile section, 1048576 bytes at 0x60aff000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x64aff000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x68aff000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x6cb00000.
warning: Memory read failed for corefile section, 135168 bytes at 0x6dd00000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x6df00000.
warning: Memory read failed for corefile section, 135168 bytes at 0x6e700000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x6e8ff000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x6f100000.
warning: Memory read failed for corefile section, 135168 bytes at 0x6f900000.
warning: Memory read failed for corefile section, 393216 bytes at 0x6fa30000.
warning: Memory read failed for corefile section, 237568 bytes at 0x6fac5000.
warning: Memory read failed for corefile section, 135168 bytes at 0x70300000.
warning: Memory read failed for corefile section, 4096 bytes at 0x70465000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x70468000.
warning: Memory read failed for corefile section, 135168 bytes at 0x74600000.
warning: Memory read failed for corefile section, 8192 bytes at 0x75b32000.
warning: Memory read failed for corefile section, 12288 bytes at 0x7639d000.
warning: Memory read failed for corefile section, 131072 bytes at 0x76ed6000.
warning: Memory read failed for corefile section, 135168 bytes at 0x7ea79000.
Saved corefile lxpanel.core

Let's try looking at that core file

$ ls -lh lxpanel.core 
-rw-r--r-- 1 pi pi 432M Mar  7 10:37 lxpanel.core
$ strings lxpanel.core  | sort | uniq -c | sort -n -k1 -r | head -20
   6272 :1.18
   4376 daRvPaRv
   2569 org.bluez.Device1
   2414 org.freedesktop.DBus.Properties
   2080 org.freedesktop.DBus.Introspectable
   1595 RSSI
   1094 UUIDs
   1094 Alias
   1094 AddressType
   1094 Address
   1092 Trusted
   1092 ServicesResolved
   1092 Paired
   1092 LegacyPairing
   1092 Connected
   1092 Blocked
   1092 Adapter
    980 MXo=
    972 ManufacturerData
    888 /org/bluez/hci0

So I think it's likely that it's the bluetooth lxpanel plugin that is leaking memory in my case, I'll try disabling the bluetooth plugin and see how it behaves after that.

@ranma
Copy link
Contributor

ranma commented Mar 7, 2021

While lxpanal is crashing, I believe in some cases the crash is being prompted by something else.

Technically it looks more like a lockup from out-of-memory, not a crash.
lxpanel is the clear culprit because its resident set size is initially small, but is at 591M out of ~1GB total at the point of the lockup.
Also putting lxpanel into a cgroup prevents the lockup for me.

There's actually two separate issues here:

  1. The system seems to lock up instead of the OOM-killer triggering and killing the offending lxpanel process (at which point the system would recover and continue working since lxpanel is auto-respawned).
    It seems it might be that swapping on SD card is running into some kernel issue with the default kernel settings?

  2. The memory leak in lxpanel (or rather most likely the raspi-specific lxplug-bluetooth)
    I'll open a new bug at https://github.com/raspberrypi-ui/lxplug-bluetooth/ for this one, I think there are missing calls to g_variant_unref in cb_interface_properties. I'll also need to see if heaptrack is usable, I tried valgrinds massif yesterday and it was just too slow.

@nickidw
Copy link

nickidw commented Dec 24, 2021

I'm running homeassistant on Pi4, after 2 SD cards destroyed in relatively short time I changed to USB SSD, and zero issues after months. If the distro used does a lot of writing (log files, db updates, etc) expect the sd cards not to last (that's been my experience).

@joetriskaide
Copy link

joetriskaide commented Apr 10, 2022

A few observations from my side to the mmc_rescan event:
I had this issue yesterday (again). What strikes me is the fact that the PI is running for days, getting into trouble and after having the trouble the first time, even a reboot leads more or less immediately into the same trouble after a few minutes.

Here is my setup:
/proc/cpuinfo:
Hardware	: BCM2835
Revision	: a020d3
Serial		: 0000000040ea362b
Model		: Raspberry Pi 3 Model B Plus Rev 1.3

I started with a traditional sd card, but recognized the speed would be to slow to run iobroker. So I decided to run the entire OS and everything else from an external USB-SSD. I created a 1:1 copy of the SD card on the USB.
I did not use the fuse in the RPi to let the Pi boot for USB, I did change the "root=" command line parameter in the boot sector on the SD card to

/boot/cmdline.txt
console=serial0,115200 console=tty1 root=/dev/sda2 rootfstype=ext4 elevator=deadline fsck.repair=yes rootwait dwc_otg.lpm_enable=0 

consequently, /boot was mounted via fstab to the boot sector on the SD-Card

Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/mmcblk0p1     97070   49522     47548  52% /boot
/dev/root        4159192 3317080    825728  81% /

Given this setup, there should be no or very low file activity on the SD-Card. When I initially created the USB-SDD disk setup of my iobroker Instance, I used the remaining space on the SD-Card as swap space. I assumed swap to become slow, but keeping an eye on it should be sufficient.
When I encountered the first "Workqueue: events_freezable mmc_rescan" event, I changed from SD-Card based swap device to an USB-SDD based swap device. I got rid of the /var/swap swap file at all.
I need to mention that I use LVM for my files except /boot and / file systems and my swap device is a logical volume on the USB-SDD. No PV on the SD Card at all.
What I do not understand is the fact, that with only /boot mounted and nothing else accessing the SD-Card, my PI yesterday ran into the unresponsive mode. Even watchdog seem to have stopped:

2022-04-09 17:58:34 test        addr b8:27:eb:xx:xx:xx  ssid <ssid> 
2022-04-09 17:58:45 test        addr b8:27:eb:xx:xx:xx  ssid <ssid> 
2022-04-09 17:58:52 test        addr b8:27:eb:xx:xx:xx  ssid <ssid> 
2022-04-09 17:59:08 test        addr b8:27:eb:xx:xx:xx  ssid <ssid> 
2022-04-09 20:27:35 test        addr b8:27:eb:xx:xx:xx  ssid <ssid> 
2022-04-09 20:27:45 test        addr b8:27:eb:xx:xx:xx  ssid <ssid> 

I just started to use watchdog in a test mode and as you can see, watchdog did not run the test script after 17:59. The entry at 20:27 was created after the reboot of the Pi. But that might (or might not) be another issue.
Another strange observation:
After running for three weeks without issues, the Pi became unresponsive and when I powered off the Pi (by pulling the plug!), it needs three or four reboot (=pulling the plug) before I could access the Pi long enough to make changes to the configuration.
Another observation, not sure it has to do with this rescan_mmc issue, is that my Pi from time to time is not reachable by Wifi, but usually comes back online after some time (tens of minutes).
Finally I have to mention that I run a 2nd Pi, same model, same setup with USB-SDD with kodi media center and I do not yet see any issues with mmc_rescan as I can see in my Pi running iobroker.
I changed system settings a proposed by samyk, keep fingers crossed and let you know, hat happens - or not.

@jdrch
Copy link

jdrch commented Apr 10, 2022

FWIW I haven't seen this problem on my 3 B+ since I updated to Raspberry Pi OS 11.x. Currently:

$ cat /etc/debian_version
11.3

I would encourage everyone to update and see what happens. And yes, I updated in place with no issues at all using this method.

@All3xJ
Copy link

All3xJ commented Apr 24, 2022

I still have this bug with 5.15.30-v8+ 64 bit Raspberry Pi 3B+.... the system freezed and resumed after 3 hours of freeze.
Here are journalctl logs:
journalctl.txt

popcornmix pushed a commit that referenced this issue Apr 25, 2022
…mpound

Huge vmalloc higher-order backing pages were allocated with __GFP_COMP
in order to allow the sub-pages to be refcounted by callers such as
"remap_vmalloc_page [sic]" (remap_vmalloc_range).

However a similar problem exists for other struct page fields callers
use, for example fb_deferred_io_fault() takes a vmalloc'ed page and
not only refcounts it but uses ->lru, ->mapping, ->index.

This is not compatible with compound sub-pages, and can cause bad page
state issues like

  BUG: Bad page state in process swapper/0  pfn:00743
  page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x743
  flags: 0x7ffff000000000(node=0|zone=0|lastcpupid=0x7ffff)
  raw: 007ffff000000000 c00c00000001d0c8 c00c00000001d0c8 0000000000000000
  raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: corrupted mapping in tail page
  Modules linked in:
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc3-00082-gfc6fff4a7ce1-dirty #2810
  Call Trace:
    dump_stack_lvl+0x74/0xa8 (unreliable)
    bad_page+0x12c/0x170
    free_tail_pages_check+0xe8/0x190
    free_pcp_prepare+0x31c/0x4e0
    free_unref_page+0x40/0x1b0
    __vunmap+0x1d8/0x420
    ...

The correct approach is to use split high-order pages for the huge
vmalloc backing. These allow callers to treat them in exactly the same
way as individually-allocated order-0 pages.

Link: https://lore.kernel.org/all/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Song Liu <songliubraving@fb.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
@delgurth
Copy link

delgurth commented May 8, 2022

...

hardware issue, removed comment since it's not relevant.

@jdrch
Copy link

jdrch commented May 8, 2022

FWIW I haven't seen this problem on my 3 B+ since I updated to Raspberry Pi OS 11.x. Currently:

$ cat /etc/debian_version
11.3

I would encourage everyone to update and see what happens. And yes, I updated in place with no issues at all using this method.

Welp, within the past 2 days I had my 1st recurrence of this issue.

@paralin
Copy link

paralin commented May 8, 2022

If this happens your SD card is probably bad. Best replace it.

Even relatively new SD cards can go bad, because they are not designed for repeated continuous writes to the same location over & over, nor continuous 24/7 operation.

I have had the best luck with SanDisk Ultra microSDHC UHS-I Memory Card DSQUA4-032G-GN6MT which rarely fail. Regardless I would recommend swapping out SD cards once every year to be safe.

The kernel should be updated nevertheless with better error handling, in my opinion.

@delgurth
Copy link

delgurth commented May 8, 2022

If this happens your SD card is probably bad. Best replace it.

Even relatively new SD cards can go bad, because they are not designed for repeated continuous writes to the same location over & over, nor continuous 24/7 operation.

I have had the best luck with SanDisk Ultra microSDHC UHS-I Memory Card DSQUA4-032G-GN6MT which rarely fail. Regardless I would recommend swapping out SD cards once every year to be safe.

The kernel should be updated nevertheless with better error handling, in my opinion.

Seems you are right. Can't even clear the partition table on the darn thing now. Will order some new cards. Thanks for the advice on what works well for you. Will also update my comment.

@All3xJ
Copy link

All3xJ commented May 24, 2022

I did change the sysctl setting as described here: https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/
but the freeze still happens. It happens ALWAYS in this situation:

I noticed this when I used VS Code's SSH Remote Extension which install a bunch of stuff in the remote server. This cause a HUGE use of CPU by "node" program, and then everything freezes, I couldn't even connect via SSH because the server is freezed. My remote server is a simple Raspberry Pi 3 B+, so not very powerful.

Something curious: when all this happens, I noticed in both my host machine and remote server a process "kswapd0" that consumes much CPU too, can this be related?

Also, if the scheduler sends interrupts constantly to context switch and to pass to another process, so why a certain process that consumes too much CPU can freeze the pi? Shouldn't scheduler go on with other processes equally? Why can it monopolize the CPU and freeze the pi?

@paralin
Copy link

paralin commented May 24, 2022

@All3xJ I'm regularly running entire Gentoo builds on the pi itself that completely saturate all cpu cores, and yet never run into this error after replacing the faulty microsd cards.

Try a brand new card.

@JamesH65
Copy link
Contributor

I did change the sysctl setting as described here: https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/ but the freeze still happens. It happens ALWAYS in this situation:

I noticed this when I used VS Code's SSH Remote Extension which install a bunch of stuff in the remote server. This cause a HUGE use of CPU by "node" program, and then everything freezes, I couldn't even connect via SSH because the server is freezed. My remote server is a simple Raspberry Pi 3 B+, so not very powerful.

Something curious: when all this happens, I noticed in both my host machine and remote server a process "kswapd0" that consumes much CPU too, can this be related?

Also, if the scheduler sends interrupts constantly to context switch and to pass to another process, so why a certain process that consumes too much CPU can freeze the pi? Shouldn't scheduler go on with other processes equally? Why can it monopolize the CPU and freeze the pi?

I believe you are running out of memory, which is why kswapd is going mad, rather than running out of CPU power.

@paralin
Copy link

paralin commented May 24, 2022

@JamesH65 is probably right, you're out of memory.

My bad, I assumed that the comment was related to this issue and not some other unrelated thing :)

@jdrch
Copy link

jdrch commented May 24, 2022

If this happens your SD card is probably bad. Best replace it.

Even relatively new SD cards can go bad, because they are not designed for repeated continuous writes to the same location over & over, nor continuous 24/7 operation.

I have had the best luck with SanDisk Ultra microSDHC UHS-I Memory Card DSQUA4-032G-GN6MT which rarely fail. Regardless I would recommend swapping out SD cards once every year to be safe.

The kernel should be updated nevertheless with better error handling, in my opinion.

All storage devices fail eventually ;) But yes, I have had the best luck with SanDisk, and their warranty (both length and replacement policy) is the best in the business.

Last Friday I retired the 32 GB card I'd been using since I got the Pi in 2018 and replaced it with a 200 GB card. We'll see what happens from there.

For those wondering about the process for doing so:

  1. Use SD Card Copier (Raspberry Pi OS GUI app) to copy existing card to new card
  2. Physically swap in new card
  3. (If desired) Use GParted to:
    • resize the shared data partition to its minimum size at the end of the partition table
    • extend the root partition into the unallocated space resulting from the above

@All3xJ
Copy link

All3xJ commented May 24, 2022

I believe you are running out of memory, which is why kswapd is going mad, rather than running out of CPU power.

@JamesH65 @paralin
Is there a way to fix this? Or to workaround this?

@JamesH65
Copy link
Contributor

Use less memory? Use Pi4 with more memory? Use a USB SSD and put a swap file on it, that might be faster than swap on the SD card?

What is the Pi doing that is using all the RAM?

@All3xJ
Copy link

All3xJ commented May 25, 2022

What is the Pi doing that is using all the RAM?

I noticed this when I used VS Code's SSH Remote Extension which install and runs a bunch of stuff in the remote server. This cause a HUGE use of memory, and then everything freezes, I couldn't even connect via SSH because the server is freezed.

EDIT: I fixed by increasing the swap partition size!

@shakin89
Copy link

I would like to thanks everyone for all the comments i've found on this page.
I've an rpi4 4G memory, and from this summer i've began to have this issue too frequently.

My system is a plain debian 11, with /boot/firmware and / mounted on sdcard; /home, /var and /tmp are mounted on ssd.
I think i'll move my / to ssd to and activate the swap, and first of all i'll replace my sdcard!!! 😄

@jdrch
Copy link

jdrch commented Oct 26, 2022

If this happens your SD card is probably bad. Best replace it.
Even relatively new SD cards can go bad, because they are not designed for repeated continuous writes to the same location over & over, nor continuous 24/7 operation.
I have had the best luck with SanDisk Ultra microSDHC UHS-I Memory Card DSQUA4-032G-GN6MT which rarely fail. Regardless I would recommend swapping out SD cards once every year to be safe.
The kernel should be updated nevertheless with better error handling, in my opinion.

All storage devices fail eventually ;) But yes, I have had the best luck with SanDisk, and their warranty (both length and replacement policy) is the best in the business.

Last Friday I retired the 32 GB card I'd been using since I got the Pi in 2018 and replaced it with a 200 GB card. We'll see what happens from there.

For those wondering about the process for doing so:

1. Use SD Card Copier (Raspberry Pi OS GUI app) to copy existing card to new card

2. Physically swap in new card

3. (If desired) Use GParted to:
   
   * resize the shared data partition to its minimum size at the end of the partition table
   * extend the root partition into the unallocated space resulting from the above

FWIW replacing the microSD card has fixed the problem. I think I've experienced only 1 crash since.

@sirjeannot
Copy link

sirjeannot commented May 23, 2023

Hello, just joining the thread on this issue, which also happens to my raspberry pi 3b+ rev1.2.

It had been running fine since ~2018 on raspbian armv7, however this issue has started ever since I migrated to aarch64 raspiOS in march. I'm currently running kernel 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr 3 17:24:16 BST 2023 aarch64 GNU/Linux. It is running docker with HomeAssistant, Nodered, Signal-CLI and mqtt. The raspi locks up every 6 to 8 days.
Also:

cat /etc/debian_version
11.7

I've had to replace the SD card before this migration several times, but the symptoms of a dying SD were different, being random crashes of containers, not a full system lockup.
In this case replacing the SD doesn't change anything. (suing Sandisk 16/32GiB A1 class 10)
I'm currently trying the proposed fix (applied it 3 days ago, currently monitoring):

sudo sysctl -w vm.dirty_ratio=10
sudo sysctl -w vm.dirty_background_ratio=5
sudo sysctl -p

If it doesn't work, I'll move the swap file to a USB stick and likely the whole system to a USB SSD afterwards.

edit:
the system partly locked up with the fix. the system still runs (replies to pings) but any write will not complete.
now moving swap to a usb key (not ssd, no power budget left)

edit2 (+1month):
it appears the system running low on memory on many occasions mainly due to 2 docker containers requiring more and more ressources (homeassistant and signal-cli in json-rpc mode). Some are now set to restart daily to contain the increasing memory consumption.
the swap on a usb key did work out quite well. However the USB key died in a few weeks : the system crashed within minutes of booting, corrupting docker log files ('\x00' error).
I've "moved" the swap to zram with aggressive values.

vm.vfs_cache_pressure = 500
vm.swappiness = 100
vm.dirty_background_ratio = 1
vm.dirty_ratio = 50

It's been a few days, some more time is needed to see if it helps mitigating the issue.
SSD migration still needs to be done, but zram swap doesn't interfere with this upgrade.

edit3 (+1.5months):
so far there are no more hangs with the zram swap. I've graphed the ram and swap usage to get a view of what is happening. it shows there is a major memory consumption after a few days with my workload, with only a few MiB of memory left even in the swap. so the swap increasing slowness on the sd-card was the culprit. no more hangs but I'll have to migrate to a sbc with more memory.
image

edit4 (+3 months)
still using the raspi3b+ 1GiB of RAM and no hang ups. The memory is heavily occupied but no more crashes.
over the last 1.5months (no reboot)
image
image

edit5 (+10 months)
still using raspi3b+ but encountering memory depletion and usb ssd disconnects (other kernel module issue, not power replated).
moved back to armv7l and emmc disk, not a single issue since : 300MiB+ free and no hangups or disconnects.

@djr-spectrummfg
Copy link

I am able to reproduce this on a CM4 (lite, 1GB ram, no wifi, no heatsink) using the official CM4 IO board and a Sandisk EVO Select 128GB microSD card (which has not seen much abuse). I am using a 64bit kernel, v5.15.34-v8, built using Yocto and meta-raspberrypi.

I can trigger it by compiling a large C++ project (https://github.com/OpenLightingProject/ola using make -j4) on the RPi. After some time the build will often hang and eventually produce the hung task kernel message. As well as stressing the CPU, this build does seem to consume most of the available RAM as at times I have seen oom-killer messages.

Sample of kernel message:

[ 7250.956746] INFO: task kworker/2:1:312 blocked for more than 120 seconds.
[ 7250.963695]       Tainted: G         C O      5.15.34-v8 #1
[ 7250.969448] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7250.977466] task:kworker/2:1     state:D stack:    0 pid:  312 ppid:     2 flags:0x00000208
[ 7250.986034] Workqueue: events_freezable mmc_rescan
[ 7250.990939] Call trace:
[ 7250.993481]  __switch_to+0x110/0x164
[ 7250.997144]  __schedule+0x314/0x8a0
[ 7251.000768]  schedule+0x50/0x10c
[ 7251.004065]  __mmc_claim_host+0xc0/0x1f0
[ 7251.008084]  mmc_get_card+0x40/0x50
[ 7251.011652]  mmc_sd_detect+0x2c/0xa0
[ 7251.015305]  mmc_rescan+0x98/0x2f0
[ 7251.018776]  process_one_work+0x1e0/0x48c
[ 7251.022871]  worker_thread+0x144/0x470
[ 7251.027399]  kthread+0x144/0x160
[ 7251.030716]  ret_from_fork+0x10/0x20

@gec75
Copy link

gec75 commented Feb 16, 2024

I also found a similar problem in a completely different setup.
https://bugzilla.kernel.org/show_bug.cgi?id=218493

@sresam89
Copy link

sresam89 commented Mar 3, 2024

I finding the same issue on newer kernel
Trying not to cross-post
Issue linked here

@djr-spectrummfg
Copy link

Observed this again after updating my kernel to 5.15.92 (still on a CM4 with the same uSD card, this time without running anything stressing the CPU/RAM/SD card IO):

[236227.555571] INFO: task kworker/1:0:1010415 blocked for more than 120 seconds.
[236227.562933]       Tainted: G         C O      5.15.92-v8 #1
[236227.569081] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[236227.577030] task:kworker/1:0     state:D stack:    0 pid:1010415 ppid:     2 flags:0x00000208
[236227.585685] Workqueue: events_freezable mmc_rescan
[236227.590613] Call trace:
[236227.593207]  __switch_to+0x110/0x164
[236227.597452]  __schedule+0x32c/0x8f4
[236227.601081]  schedule+0x70/0x12c
[236227.604426]  __mmc_claim_host+0xc0/0x1f0
[236227.608463]  mmc_get_card+0x3c/0x50
[236227.612079]  mmc_sd_detect+0x28/0xa0
[236227.616265]  mmc_rescan+0x94/0x2f0
[236227.619794]  process_one_work+0x1dc/0x4a4
[236227.623935]  worker_thread+0x144/0x490
[236227.627810]  kthread+0x140/0x154
[236227.631256]  ret_from_fork+0x10/0x20

@jdrch
Copy link

jdrch commented Apr 15, 2024

updating my kernel to 5.15.92

Isn't Raspberry Pi OS on v6+ for all devices? I'd think that's where any fixes are most likely to be.

@djr-spectrummfg
Copy link

I'm not using Raspberry Pi OS, I'm using an image built with Yocto kirkstone and meta-raspberrypi. This kernel update was a result of the update in meta-raspberrypi.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests