Add option for watchdog service #298

Closed
diederikdehaas opened this issue Oct 16, 2015 · 49 comments

@diederikdehaas
Member

Apparently the Pi has a hardware watchdog timer, see http://blog.ricardoarturocabral.com/2013/01/auto-reboot-hung-raspberry-pi-using-on.html

It could be a nice feature to add.

@Mausy5043
Contributor

That would be very nice. Installing this after the fact is a piece of cake ;-) (Dutch: "een eitje")
Testing this now on my 1.0.x jessie install.
There's more to it than just a hung-process detector: http://linux.die.net/man/8/watchdog

@Mausy5043
Contributor

I've done some testing. It looks like it's working as described on the page you referred to. The Arch instructions are the same as for jessie.
However, a service started with sudo systemctl start watchdog.service doesn't seem to survive a reboot.

@diederikdehaas
Member Author

Do you get an error message? Often systemctl status watchdog.service can tell you that.

@Mausy5043
Contributor

Mmm. While stressing:

pi@rbian:~ $ journalctl |grep dog
Dec 25 11:55:58 rbian kernel: bcm2708 watchdog, heartbeat=10 sec (nowayout=0)
Dec 25 11:55:58 rbian systemd-modules-load[119]: Inserted module 'bcm2708_wdog'
pi@rbian:~ $ systemctl list-unit-files|grep dog
watchdog.service                       static
pi@rbian:~ $ systemctl status watchdog.service
● watchdog.service - watchdog daemon
   Loaded: loaded (/lib/systemd/system/watchdog.service; static)
   Active: inactive (dead)
pi@rbian:~ $

[image: stressedpi]

Only oom-killer is resisting 😉

@Mausy5043
Contributor

But, when I do this:

pi@rbian:~ $ journalctl |grep dog
Dec 25 13:57:39 rbian kernel: bcm2708 watchdog, heartbeat=10 sec (nowayout=0)
Dec 25 13:57:39 rbian systemd-modules-load[118]: Inserted module 'bcm2708_wdog'
pi@rbian:~ $ systemctl status watchdog.service
● watchdog.service - watchdog daemon
   Loaded: loaded (/lib/systemd/system/watchdog.service; static)
   Active: inactive (dead)

OK, so I do:

pi@rbian:~ $ sudo systemctl start watchdog.service
pi@rbian:~ $ systemctl status watchdog.service
● watchdog.service - watchdog daemon
   Loaded: loaded (/lib/systemd/system/watchdog.service; static)
   Active: active (running) since Fri 2015-12-25 13:59:59 CET; 7s ago
  Process: 560 ExecStart=/bin/sh -c [ $run_watchdog != 1 ] || exec /usr/sbin/watchdog $watchdog_options (code=exited, status=0/SUCCESS)
  Process: 558 ExecStartPre=/bin/sh -c [ -z "${watchdog_module}" ] || [ "${watchdog_module}" = "none" ] || /sbin/modprobe $watchdog_module (code=exited, status=0/SUCCESS)
 Main PID: 563 (watchdog)
   CGroup: /system.slice/watchdog.service
           └─563 /usr/sbin/watchdog

Dec 25 13:59:59 rbian watchdog[563]: int=1s realtime=yes sync=no soft=no mla=24 mem=0
Dec 25 13:59:59 rbian watchdog[563]: ping: no machine to check
Dec 25 13:59:59 rbian watchdog[563]: file: no file to check
Dec 25 13:59:59 rbian watchdog[563]: pidfile: no server process to check
Dec 25 13:59:59 rbian watchdog[563]: interface: no interface to check
Dec 25 13:59:59 rbian watchdog[563]: temperature: no sensors to check
Dec 25 13:59:59 rbian watchdog[563]: test=none(0) repair=none(0) alive=/dev/watchdog heartbeat=none to=root no_act=no force=no
Dec 25 13:59:59 rbian watchdog[563]: cannot set timeout 60 (errno = 22 = 'Invalid argument')
Dec 25 13:59:59 rbian watchdog[563]: hardware watchdog identity: BCM2708
Dec 25 13:59:59 rbian systemd[1]: Started watchdog daemon.

Then if I stress the Pi it reboots:

Dec 25 14:03:16 rbian kernel: Out of memory: Kill process 738 (stress) score 138 or sacrifice child
Dec 25 14:03:16 rbian kernel: Killed process 738 (stress) total-vm:104704kB, anon-rss:67572kB, file-rss:576kB
Dec 25 14:03:17 rbian watchdog[563]: loadavg 25 9 3 is higher than the given threshold 24 18 12!
Dec 25 14:03:17 rbian watchdog[563]: /usr/lib/sendmail does not exist or is not executable (errno = 2)
Dec 25 14:03:17 rbian watchdog[563]: shutting down the system because of error 253

@diederikdehaas
Member Author

I enabled (uncommented) watchdog-device and max-load-1 = 24 in /etc/watchdog.conf and can confirm that the service doesn't start up after a reboot :-/

root@rasppi-2b:/home/diederik# journalctl | grep dog
dec 25 14:00:20 rasppi-2b kernel: bcm2708 watchdog, heartbeat=10 sec (nowayout=0)
dec 25 14:00:20 rasppi-2b systemd-modules-load[166]: Inserted module 'bcm2708_wdog'
root@rasppi-2b:/home/diederik# systemctl list-unit-files|grep dog
watchdog.service                       static  
root@rasppi-2b:/home/diederik# systemctl status watchdog.service
● watchdog.service - watchdog daemon
   Loaded: loaded (/lib/systemd/system/watchdog.service; static)
   Active: inactive (dead)
root@rasppi-2b:/home/diederik# systemctl start watchdog.service
root@rasppi-2b:/home/diederik# systemctl status watchdog.service
● watchdog.service - watchdog daemon
   Loaded: loaded (/lib/systemd/system/watchdog.service; static)
   Active: active (running) since vr 2015-12-25 14:02:19 CET; 23s ago
  Process: 448 ExecStart=/bin/sh -c [ $run_watchdog != 1 ] || exec /usr/sbin/watchdog $watchdog_options (code=exited, status=0/SUCCESS)
  Process: 445 ExecStartPre=/bin/sh -c [ -z "${watchdog_module}" ] || [ "${watchdog_module}" = "none" ] || /sbin/modprobe $watchdog_module (code=exited, status=0/SUCCESS)
 Main PID: 450 (watchdog)
   CGroup: /system.slice/watchdog.service
           └─450 /usr/sbin/watchdog

dec 25 14:02:19 rasppi-2b watchdog[450]: int=1s realtime=yes sync=no soft=no mla=24 mem=0
dec 25 14:02:19 rasppi-2b watchdog[450]: ping: no machine to check
dec 25 14:02:19 rasppi-2b watchdog[450]: file: no file to check
dec 25 14:02:19 rasppi-2b watchdog[450]: pidfile: no server process to check
dec 25 14:02:19 rasppi-2b watchdog[450]: interface: no interface to check
dec 25 14:02:19 rasppi-2b watchdog[450]: temperature: no sensors to check
dec 25 14:02:19 rasppi-2b watchdog[450]: test=none(0) repair=none(0) alive=/dev/watchdog heartbeat=none to=root no_act=no force=no
dec 25 14:02:19 rasppi-2b watchdog[450]: cannot set timeout 60 (errno = 22 = 'Invalid argument')
dec 25 14:02:19 rasppi-2b watchdog[450]: hardware watchdog identity: BCM2708
dec 25 14:02:19 rasppi-2b systemd[1]: Started watchdog daemon.

Haha! While you added your latest post, I collected the same data 😉

@Mausy5043
Contributor

👍

@diederikdehaas
Member Author

We're not the only ones who noticed that it doesn't start: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=793309

@Mausy5043
Contributor

OK. Let's wait for them to figure this out. I don't think this is a showstopper even for v1.1.x

@diederikdehaas
Member Author

I'm not really the guy who likes to wait ... 😉
There is a 'solution' posted in the bug report, but IIRC you're not supposed to edit /lib/systemd/system/watchdog.service (or any file in /lib/systemd/)

Time to play 😄

@diederikdehaas
Member Author

One issue fixed: add watchdog-timeout = 15 to /etc/watchdog.conf.
Source: https://www.raspberrypi.org/forums/viewtopic.php?p=479021#p479021

@diederikdehaas
Member Author

Another one: /lib/systemd/system/watchdog.service is missing a ' at the end of the ExecStartPre= line.
Source: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=798294

@Mausy5043
Contributor

What issue is fixed by setting the time-out? I see no differences.

@diederikdehaas
Member Author

dec 25 14:02:19 rasppi-2b watchdog[450]: cannot set timeout 60 (errno = 22 = 'Invalid argument')

@Mausy5043
Contributor

Fixing /lib/systemd/system/watchdog.service doesn't help either.

@Mausy5043
Contributor

The timeout must be <=15 (methinks),
and it's watchdog-timeout, not timeout.

@diederikdehaas
Member Author

Correct. It's not that it must be 15, but it can't be bigger than 15.
Setting it to <=15 does make that error/warning go away
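
To make the limit concrete, here is a minimal sketch that writes watchdog-timeout into a watchdog.conf-style file and refuses values above 15. The 15-second ceiling is only what was observed in this thread for the BCM2708 driver, and the function name set_watchdog_timeout is made up for illustration.

```shell
#!/bin/sh
# Assumption: 15 s is the largest timeout the BCM2708 watchdog accepted here.
MAX_TIMEOUT=15

# set_watchdog_timeout CONF SECONDS  (hypothetical helper)
set_watchdog_timeout() {
    conf=$1
    secs=$2
    if [ "$secs" -gt "$MAX_TIMEOUT" ]; then
        echo "watchdog-timeout $secs exceeds the maximum of $MAX_TIMEOUT" >&2
        return 1
    fi
    # Replace an existing (possibly commented-out) setting, otherwise append one.
    if grep -q '^#\{0,1\}watchdog-timeout[[:space:]]*=' "$conf"; then
        sed -i "s/^#\{0,1\}watchdog-timeout[[:space:]]*=.*/watchdog-timeout = $secs/" "$conf"
    else
        printf 'watchdog-timeout = %s\n' "$secs" >> "$conf"
    fi
}

# Usage (as root): set_watchdog_timeout /etc/watchdog.conf 15
```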

@Mausy5043
Contributor

Anyway, those "fixes" don't work for me.

@diederikdehaas
Member Author

I've made the following changes, and with them the watchdog service starts at boot without errors on my system.
I modified /lib/systemd/system/watchdog.service as follows:

diff --git a/lib/systemd/system/watchdog.service.orig b/lib/systemd/system/watchdog.service
index 40d3a72..1dc6694 100644
--- a/lib/systemd/system/watchdog.service.orig
+++ b/lib/systemd/system/watchdog.service
@@ -7,8 +7,9 @@ OnFailure=wd_keepalive.service
 [Service]
 Type=forking
 EnvironmentFile=/etc/default/watchdog
-ExecStartPre=/bin/sh -c '[ -z "${watchdog_module}" ] || [ "${watchdog_module}" = "none" ] || /sbin/modprobe $watchdog_module
+ExecStartPre=/bin/sh -c '[ -z "${watchdog_module}" ] || [ "${watchdog_module}" = "none" ] || /sbin/modprobe $watchdog_module'
 ExecStart=/bin/sh -c '[ $run_watchdog != 1 ] || exec /usr/sbin/watchdog $watchdog_options'
 ExecStopPost=/bin/sh -c '[ $run_wd_keepalive != 1 ] || false'

 [Install]
+WantedBy=multi-user.target

And modified /etc/watchdog.conf as follows:

diff --git a/etc/watchdog.conf.orig b/etc/watchdog.conf
index 44f7886..0d689c2 100644
--- a/etc/watchdog.conf.orig
+++ b/etc/watchdog.conf
@@ -7,7 +7,7 @@
 # Uncomment to enable test. Setting one of these values to '0' disables it.
 # These values will hopefully never reboot your machine during normal use
 # (if your machine is really hung, the loadavg will go much higher than 25)
-#max-load-1            = 24
+max-load-1             = 24
 #max-load-5            = 18
 #max-load-15           = 12

@@ -21,7 +21,8 @@
 #test-binary           = 
 #test-timeout          = 

-#watchdog-device       = /dev/watchdog
+watchdog-device        = /dev/watchdog
+watchdog-timeout       = 15

 # Defaults compiled into the binary
 #temperature-device    =

After a reboot I get this:

diederik@bagend:~$ ssh rasppi-2b

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Fri Dec 25 15:23:22 2015 from bagend.home.cknow.org
diederik@rasppi-2b:~$ su
Password: 
root@rasppi-2b:/home/diederik# systemctl status watchdog.service
● watchdog.service - watchdog daemon
   Loaded: loaded (/lib/systemd/system/watchdog.service; enabled)
   Active: active (running) since vr 2015-12-25 15:36:55 CET; 32s ago
  Process: 416 ExecStart=/bin/sh -c [ $run_watchdog != 1 ] || exec /usr/sbin/watchdog $watchdog_options (code=exited, status=0/SUCCESS)
  Process: 412 ExecStartPre=/bin/sh -c [ -z "${watchdog_module}" ] || [ "${watchdog_module}" = "none" ] || /sbin/modprobe $watchdog_module (code=exited, status=0/SUCCESS)
 Main PID: 418 (watchdog)
   CGroup: /system.slice/watchdog.service
           └─418 /usr/sbin/watchdog

dec 25 15:36:55 rasppi-2b watchdog[418]: int=1s realtime=yes sync=no soft=no mla=24 mem=0
dec 25 15:36:55 rasppi-2b watchdog[418]: ping: no machine to check
dec 25 15:36:55 rasppi-2b watchdog[418]: file: no file to check
dec 25 15:36:55 rasppi-2b watchdog[418]: pidfile: no server process to check
dec 25 15:36:55 rasppi-2b watchdog[418]: interface: no interface to check
dec 25 15:36:55 rasppi-2b watchdog[418]: temperature: no sensors to check
dec 25 15:36:55 rasppi-2b watchdog[418]: test=none(0) repair=none(0) alive=/dev/watchdog heartbeat=none to=root no_act=no force=no
dec 25 15:36:55 rasppi-2b watchdog[418]: watchdog now set to 15 seconds
dec 25 15:36:55 rasppi-2b watchdog[418]: hardware watchdog identity: BCM2708
dec 25 15:36:55 rasppi-2b systemd[1]: Started watchdog daemon.
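
Since files under /lib/systemd/ are not supposed to be edited, the same two unit changes could instead be applied to a copy in /etc/systemd/system/, which systemd prefers over a packaged unit of the same name. The following is only a sketch under that assumption, not what was done above; override_unit and its arguments are hypothetical, and it relies on GNU sed.

```shell
#!/bin/sh
# override_unit SRC DST_DIR  (hypothetical helper)
# Copies a packaged unit into an override directory and applies the two fixes
# from the diff above: the missing closing quote and the missing WantedBy=.
override_unit() {
    src=$1       # e.g. /lib/systemd/system/watchdog.service
    dst_dir=$2   # e.g. /etc/systemd/system
    dst="$dst_dir/$(basename "$src")"

    cp "$src" "$dst" || return 1

    # Close the unterminated single quote on the ExecStartPre= line, if any.
    sed -i "/^ExecStartPre=.*-c '[^']*\$/s/\$/'/" "$dst"

    # Add an install target so 'systemctl enable' has something to link.
    if ! grep -q '^WantedBy=' "$dst"; then
        if grep -q '^\[Install\]' "$dst"; then
            sed -i '/^\[Install\]/a WantedBy=multi-user.target' "$dst"
        else
            printf '\n[Install]\nWantedBy=multi-user.target\n' >> "$dst"
        fi
    fi
}

# Usage (as root), followed by a reload so systemd picks up the copy:
#   override_unit /lib/systemd/system/watchdog.service /etc/systemd/system
#   systemctl daemon-reload
```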

@diederikdehaas
Member Author

Just informed http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=798294 (and indirectly #783166) and https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=793309 about these issues.

'Luckily' there is also a serious (=RC (Release Critical)) FTBFS (Fails to Build from Source) bug and that often results in action on the package 😄

@Mausy5043
Contributor

@diederikdehaas your solution does not work for my R-Pi (Pi1B+; Linux rbian 3.18.0-trunk-rpi #1 PREEMPT Debian 3.18.5-1~exp1+rpi19 (2015-08-08) armv6l GNU/Linux)

@diederikdehaas
Member Author

Good to know 👍
It does work on my Pi 2, but it has to work on all of them (also on wheezy)

@Mausy5043
Contributor

I can test on wheezy tomorrow. Going to watch "Life of Brian" now 😉

@diederikdehaas
Member Author

Have fun, Collaborator 😄

@Mausy5043
Contributor

In the coffee-break I did quickly run a test (couldn't help myself).
This is required for things to work on wheezy (tested on RPi1 B; not B+):

sudo apt-get install watchdog
echo "bcm2708_wdog" | sudo tee -a /etc/modules
sudo modprobe bcm2708_wdog
sudo update-rc.d watchdog defaults

Then make the changes to /etc/watchdog.conf as discussed above, and start the service:
sudo /etc/init.d/watchdog start

stress --cpu 30 (requires a package installed by sudo apt-get install stress) will cause the Pi to reboot automagically.

@diederikdehaas
Member Author

In the coffee-break I did quickly run a test (couldn't help myself).

LOL

sudo update-rc.d watchdog defaults

That shouldn't be needed and is a (very) old way of doing things. In most cases installing the package would also enable and start the service automatically, but given the issues in the package and the need to modify /etc/watchdog.conf, in this case you (very) likely do need it.
But does this mean that it works pretty much OOTB on wheezy? That would be nice 😄

@diederikdehaas
Member Author

I just installed a jessie system on a Pi 1B+ (and upgraded it to stretch :-P) and encountered the same issue as you did.
The solution is rather simple though: systemctl enable watchdog.service
And then it did work properly 😄
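
For background: the "static" state shown in the earlier listings means the unit file carries no installation information, so systemctl enable has nothing to link until a WantedBy= (or similar) line is present. A rough sketch for checking a unit file by hand follows; has_install_target is a hypothetical helper and its parsing is deliberately simple.

```shell
#!/bin/sh
# has_install_target UNIT_FILE  (hypothetical helper)
# Succeeds if the [Install] section contains at least one install directive.
has_install_target() {
    awk '
        /^\[/ { in_install = ($0 == "[Install]"); next }
        in_install && /^(WantedBy|RequiredBy|Alias|Also)=/ { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Usage:
#   if has_install_target /lib/systemd/system/watchdog.service; then
#       echo "unit can be enabled"
#   else
#       echo "unit is static"
#   fi
```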

@diederikdehaas diederikdehaas added this to the v1.1.0 milestone Dec 26, 2015
@Mausy5043
Contributor

@diederikdehaas : Confirmed. jessie works.
(I thought I had started that service already, but I guess it didn't actually get started because the ' was missing in /lib/systemd/system/watchdog.service).

For wheezy: IMHO you need to "enable" the service in init.d somehow. The update-rc.d does the trick.

@diederikdehaas
Member Author

I found some 'interesting' things in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=768168 ... (to be continued)

@tvannahl

On a distribution with systemd there is no need for running a watchdog. I set everything up as follows:

post-install.txt:

echo "bcm2708_wdog" >> /rootfs/etc/modules   # Loads the kernel module for the watchdog
echo "RuntimeWatchdogSec=10" >> /rootfs/etc/systemd/system.conf   # Activates watchdog in systemd
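
Note that appending with >> adds another copy of each line every time the script runs. A minimal sketch of an append-once variant is shown below; append_once and configure_systemd_watchdog are hypothetical names, and the target root is passed in rather than hard-coded.

```shell
#!/bin/sh
# append_once LINE FILE  (hypothetical helper)
# Appends LINE to FILE only if that exact line is not already present.
append_once() {
    line=$1
    file=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# configure_systemd_watchdog ROOT SECONDS  (hypothetical helper)
# Caveat: calling it again with a different SECONDS value would add a second,
# conflicting RuntimeWatchdogSec= line instead of replacing the first.
configure_systemd_watchdog() {
    root=$1
    secs=$2
    append_once "bcm2708_wdog" "$root/etc/modules"
    append_once "RuntimeWatchdogSec=$secs" "$root/etc/systemd/system.conf"
}

# Usage in post-install.txt: configure_systemd_watchdog /rootfs 10
```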

@goranche
Contributor

On a distribution with systemd there is no need for running a watchdog

uhm... what? 😄

your example sets up the watchdog, and instructs systemd to let it know it's still alive, every 10 seconds

@tvannahl

That's exactly what I am doing ;-). I forgot to mention that I do that in the post-install.txt script. If you need more information or background, or want to bind your services to a software watchdog, I can recommend the article at 0pointer.

@goranche
Contributor

yes, another piece of software that systemd feels a need to (re)implement (even though this should be the kernel's responsibility)... because ... reasons

but beside that, relying on systemd to handle your watchdog needs is ... well, it makes me question if you actually understand the benefits a watchdog offers...

@goranche
Contributor

and in case someone reads this and asks "ok, how to use it then"... (this is pretty specific to the RPi, other use cases, like servers, might have other needs, of course)

if you set up a mission critical system, make sure you reset the watchdog's timer, usually inside your main runloop... this way even if your process hangs (or maybe "hangs" because some callback from a library doesn't get called... which wouldn't be picked up by systemd... actual example), a reboot will be triggered by the watchdog...

don't rely on "everything and the kitchen sink" software, as you will have problems tracking down issues...
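
A minimal sketch of that pattern, assuming the standard Linux watchdog character device: opening the device arms the timer, each write resets it, and writing 'V' before closing asks the driver to disarm on drivers that support it and were not loaded with nowayout. WATCHDOG_DEV, do_work and run_loop are hypothetical names.

```shell
#!/bin/sh
# Device path is overridable so the loop can be tried against a plain file.
WATCHDOG_DEV=${WATCHDOG_DEV:-/dev/watchdog}

do_work() {
    # Placeholder for one iteration of the real application logic.
    sleep 1
}

# run_loop ITERATIONS  (hypothetical helper)
run_loop() {
    iterations=$1
    # Keep the device open for the whole run; closing and reopening it on
    # every kick is not how the interface is meant to be used.
    exec 3> "$WATCHDOG_DEV"
    i=0
    while [ "$i" -lt "$iterations" ]; do
        do_work          # if this never returns, the kicks stop and the hardware resets
        printf '.' >&3   # any write resets the countdown
        i=$((i + 1))
    done
    printf 'V' >&3       # request a clean disarm before closing
    exec 3>&-
}

# Usage (as root, with nothing else holding the device): run_loop 3600
```

Only one process can hold the device open, so this cannot run alongside the watchdog daemon or systemd's RuntimeWatchdogSec= on the same device.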

@tvannahl

I just wanted to point out this approach so it can be considered if there is to be a watchdog option in raspbian-ua-netinst. Such an option would not make much sense if you want to put the watchdog in your own application's hands.
Using systemd's watchdog gives you the advantage of being able to watch multiple services. So if you set up a critical system, your service resets its own timer within systemd. If the service gets stuck for whatever reason, it gets restarted by systemd. If the whole system gets stuck, systemd is no longer able to reset the hardware timer and the Raspberry Pi reboots. From a development perspective it now makes sense to implement exactly that mechanism in your applications, since most modern distributions ship systemd. That way you can cover the needs of embedded systems as well as those of servers without having to deal with other topics like access control (who gets write access to /dev/watchdog?).

Another advantage I see in using systemd's watchdog is that you don't rely on a daemon being started after boot. systemd has to be started anyway, and the implementation doesn't seem to be much of a hassle (if I may refer to L. Poettering's blog).

I hope I was able to clarify how that mechanism works and what it provides. Sorry for not pointing that out earlier.
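
The per-service mechanism described here could look roughly like the sketch below: a unit with WatchdogSec= and Restart=on-watchdog, plus a worker that sends keep-alives with systemd-notify. The unit contents, the worker path and the 30-second interval are all made-up examples. NotifyAccess=all is set because systemd-notify runs as a separate process from the service's main process.

```shell
#!/bin/sh
# Print an example unit file (all values hypothetical).
print_unit() {
    cat <<'EOF'
[Unit]
Description=Example worker supervised by the systemd watchdog

[Service]
Type=notify
NotifyAccess=all
ExecStart=/usr/local/bin/worker.sh
WatchdogSec=30
Restart=on-watchdog

[Install]
WantedBy=multi-user.target
EOF
}

# Send a notification only when started by systemd (NOTIFY_SOCKET is set).
notify() {
    [ -n "${NOTIFY_SOCKET:-}" ] || return 0
    systemd-notify "$@"
}

# worker_loop ITERATIONS  (hypothetical main loop of worker.sh)
worker_loop() {
    iterations=$1
    notify --ready
    i=0
    while [ "$i" -lt "$iterations" ]; do
        # ... one unit of real work goes here, well under WatchdogSec ...
        notify WATCHDOG=1
        i=$((i + 1))
    done
}
```

If the keep-alives stop for longer than WatchdogSec=, systemd treats the service as failed and, with Restart=on-watchdog, restarts it.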

@goranche
Contributor

advantage I see in using systemd's watchdog is that you don't rely on a daemon being started

you see that as an advantage? oh well 😒

@tvannahl

I am just assuming that you misunderstand that sentence because I didn't take enough care when writing that paragraph. So let me clarify:

What I tried to say is that you do not have to rely on the watchdog daemon (e.g. watchdog.service) starting in the first place. As you are probably aware, if it fails to start, the watchdog is never activated.

Which advantages do you see in using watchdog.service instead of the systemd built-in?

@diederikdehaas
Member Author

Thanks for your info @tvannahl 👍

I am just assuming that you misunderstand that sentence

No, he doesn't. He just hates systemd.
If that is a wrong assumption and there are technical grounds for not wanting to consider using systemd's service, I'd love to hear them. My experience with watchdog services is very limited, so I'd love to be educated.
I do prefer to use an init system neutral solution over init system specifics though.
But I also have to conclude that the watchdog package itself is not in good shape.
On the bright side, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=798294 now has the pending tag added to it.

@goranche
Contributor

Which advantages do you see

choice, reliability (do one thing and do that well, systemd is quite the opposite), battle-proven, ...

or if we go by the exact wording from the unix philosophy: do one thing and do it well

@goranche
Contributor

and it's true... I'm not a fan of systemd... and that would be putting it mildly

@Mausy5043
Contributor

@tvannahl backward compatibility. systemd doesn't work on wheezy.

@tvannahl

@diederikdehaas It has been a pleasure. I can understand that you're more attracted to an init-independent solution. @Mausy5043 made a good point about backward compatibility with wheezy. I have no need for wheezy anymore since I have everything ported to jessie, but there are probably many out there who still require wheezy.

@imasonaz

imasonaz commented Feb 2, 2016

I know that this thread has strayed a bit, but I just wanted to throw in that the built-in watchdog (part of the processor) has worked best for me when set up as below. Everyone seems to have their own opinion on how to set it up from everything I've read, so I thought I'd throw in my 2¢.

sudo modprobe bcm2708_wdog
echo "bcm2708_wdog" | sudo tee -a /etc/modules
sudo apt-get install watchdog chkconfig
sudo chkconfig watchdog on
sudo /etc/init.d/watchdog start
sudo nano /etc/watchdog.conf

Uncomment the line:

watchdog-device = /dev/watchdog

This part is more optional, and is simply to have the system be more careful - it helped me with a couple critical Pi's that would get stuck.

echo "kernel.panic = 60" | sudo tee -a /etc/sysctl.conf
echo "kernel.panic_on_oops = 60" | sudo tee -a /etc/sysctl.conf

I hope this isn't too late to be of use for you in adding a WDT option!

@goranche
Contributor

since this issue is marked for the v1.1.x milestone, we need to do something about it...

I personally feel that this is something that's very user specific and as such should be left to the individual end-user to decide on, and of course implement in a way that suits their preference...

that said, since we're stuck with systemd, I guess we could load the hardware watchdog module, and set systemd up to make use of it...

I would like to have it as an option (on by default), though 💭 this way anyone who would like to use a different (better? 😈 ) way of handling the watchdog can turn the systemd support off and enable the watchdog the way it suits them...

this would include just the following:

  • load the watchdog module
  • add RuntimeWatchdogSec=20s to /etc/systemd/system.conf

this would cause systemd to ping the watchdog every 10 seconds, and the SOC to restart itself after 20 seconds, should the pinging stop... this way, the worst case scenario is that if something goes horribly wrong, the device will reboot itself after 30s

thoughts?

@diederikdehaas
Member Author

Just informed http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=798294 (and indirectly #783166) and https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=793309 about these issues.

Those upstream bugs have been fixed and I have just requested a backport to stable for both of them (798294 and 793309). Hopefully he'll do it (regardless of whether we'll use it or not).

@diederikdehaas
Member Author

thoughts?

I'm not going to block #390 but if the watchdog maintainer does indeed backport those changes, which very much looks like my patch in #298 (comment), I would prefer that.
I'll try/verify if that solution works on all my Pi's in the meantime (even if not used).
If you feel the solution in #390 works good enough, you can of course merge it and we'll discard the watchdog package.

@goranche
Contributor

If you feel the solution in #390 works good enough

nope... I don't... just wanted to get the ball rolling 😇

@goranche
Contributor

nope... I don't

well... not anymore, that is... I was acting on the assumption that systemd would actually work as "advertised"... 😒
(no one can say I didn't at least try to accept systemd... I'm really trying)

so yeah, if the watchdog package works better, I'm all for it... Just close #390 when you feel it's not needed (for discussions / tests) anymore, and go with the watchdog package 👍

Mausy5043 added a commit that referenced this issue Mar 23, 2016
Add watchdog functionality. Solves issue #298.
@Mausy5043
Contributor

Support for a watchdog is added with PR#390. Default setting is disabled. Users may enable the watchdog by adding enable_watchdog=1 in the installer-config.txt.

Closing this issue. Please re-open if you think the issue is not resolved.
