Hardware watchdog... how to find the cause? #1656

giig1967g · 2018-08-16T17:31:34Z

Hi,
With recent versions I am experiencing regular reboots (about every other day) with "Hardware watchdog" as the reboot cause.
I have also changed from Static IP to DHCP.
How can I find the cause of the Hardware watchdog?
What exactly does it mean?

TD-er · 2018-08-16T18:13:29Z

If some piece of code is running for over 6 seconds without calling any delay or yield, it will trigger a hardware watchdog, which performs a reset.

So there is some code in your setup either waiting for that long, or running an "infinite loop".
Could you please give more info on your setup?
Also do not set the "MessageDelay" too high, nor use "delay" in the rules.
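To illustrate the point about long-running code: a minimal sketch of splitting work into chunks and handing control back between them. It is portable C++ so it can be tried off-device. `checksumWithYield` and `YieldHook` are hypothetical names, not ESPEasy code; on an ESP8266 Arduino build the hook would presumably just call `delay(0)` or `yield()`.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for the platform's yield()/delay(0). On the device
// this hook would hand control back so the watchdog gets fed.
using YieldHook = std::function<void()>;

// Sums a buffer in chunks and calls the hook after every chunk, so no single
// stretch of work runs for seconds without giving control back.
std::uint32_t checksumWithYield(const std::uint8_t* data, std::size_t len,
                                std::size_t chunk, const YieldHook& yieldHook) {
  std::uint32_t sum = 0;
  if (chunk == 0) chunk = 1;  // a zero chunk size would never advance
  for (std::size_t pos = 0; pos < len; pos += chunk) {
    const std::size_t end = (len - pos < chunk) ? len : pos + chunk;
    for (std::size_t i = pos; i < end; ++i) sum += data[i];
    yieldHook();  // give control back between chunks
  }
  return sum;
}
```

The chunk size is the tuning knob: smaller chunks yield more often at the cost of some overhead.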

thomastech · 2018-08-16T18:52:15Z

@TD-er: FYI, the latest crash/reboot I mentioned in #1643 was reported in the GUI as Reset Reason | Hardware Watchdog. This was the test system that had run for a day, then the WiFi went offline for a couple hours, then the board rebooted on its own. Might be related to this issue, or maybe not.

Thomas

giig1967g · 2018-08-18T09:44:51Z

Here is my config:

TD-er · 2018-08-18T10:22:44Z

See my comment here: #1659 (comment)
Looks like my nodes are also "affected", which is good :)

thomastech · 2018-08-18T15:33:49Z

@TD-er: Yesterday I loaded the ESP_Easy_mega-20180815_test_ESP8266_4096 build on a NodeMCU. It ran great for 18 hours then rebooted. System Info says: Boot : Manual reboot (1), Reset Reason : Hardware Watchdog.

A second duplicate NodeMCU is still running fine. But it has only been running for 17 hrs, so it may face the reboot dance soon.

Thomas

TD-er · 2018-08-18T15:41:26Z

Would be great to see if it occurs at the same interval, or time of day.
Maybe it is some NTP refresh, or something else, who knows.

thomastech · 2018-08-18T15:57:05Z

@TD-er: That would be great. But so far I have not seen a pattern that indicates it is triggered by run duration or time of day.

My hunch is that it is something related to WiFi, such as a reconnect. But I have tried to torture the WiFi connection (force router offline, create weak RF signal levels) and nothing bad happened. So my hunch seems to be nonsense. Hopefully you find the cause and save us.

Thomas

thomastech · 2018-08-18T17:26:21Z

It rebooted again after running for about 2 hours. Now reports Manual reboot (2), Reset Reason : Hardware Watchdog.

The second duplicate NodeMCU is still running fine. About 19+ hours so far.

Thomas

TD-er · 2018-08-18T18:10:33Z

I found an issue with the handling of UDP traffic (when C013 is used). That could cause Exception crashes. (not likely a Watchdog reset)
I also added some checks when creating a UDP client for NTP, to see if that may cause infinite waiting.
Those can cause a watchdog reset.

When tested, I will make a commit for it.

thomastech · 2018-09-24T16:34:12Z

@TD-er
I've been testing the new builds as they are released. So far none have solved the Watchdog reset. However, the latest ESP_Easy_mega-20180922_dev_ESP8266_4096.bin build seems better than the last release.

The latest firmware has been successfully running on one of my NodeMCU's (current duration 1 day 22 hrs, no reboots). But the second NodeMCU has rebooted several times.

I have a feeling the rebooting is related to the WiFi access since the latest reboot occurred when I accessed the device from my browser. I also noticed a previous reboot occurred at a time when the good "working" device reported a beacon timeout. But I can't replicate the reboots on demand.

SysInfo reports this:
Boot: Manual reboot (8)
Reset Reason: Hardware Watchdog

Not sure if you are interested in this feedback. But it's been a month+ since my last update and I thought I'd keep you posted on my findings.

Thomas

TD-er · 2018-09-24T17:00:03Z

I'm always interested in feedback, especially when there seems to be a bit of improvement :)

thomastech · 2018-09-24T19:25:32Z

The reboot issue certainly has been a tough nut to crack.

My winning streak with the "working" device just ended. After 48 hours runtime it rebooted. Sysinfo reports Manual reboot (1) Hardware Watchdog

Thomas

TD-er · 2018-09-30T23:30:29Z

See PR #1834
I just merged a change which moves the address space of the Arduino stack to be on top of the System stack.
The latest core library appears to have shifted the Arduino stack to overlap a bit with the System stack to save about 4k of memory.
But since we're allocating quite a lot on the System stack, this may have led to an increase in reports of HW Watchdog resets.

So please test with the October 1st build, as soon as that one is ready.

thomastech · 2018-10-01T15:54:22Z

I've installed ESP_Easy_mega-20181001_dev_ESP8266_4096.bin on two NodeMCUs. I will report back tomorrow (or sooner) if they experience a reboot.

Some initial comments:

The 4K decrease in system free RAM is disappointing. If this change to stack memory does not help then please consider reverting it.
Not sure if it is related to the new firmware, but web page access is slow. Sometimes navigating the tabs takes several seconds for each new page to populate.

Thanks again for your efforts. Fingers are crossed that this merge helps cure the W-dog reboot issue.

Thomas

TD-er · 2018-10-01T16:18:37Z

Hmm I had the impression it was loading a bit faster with the changed Arduino stack location.
But maybe getting the free stack for statistics uses quite some resources and is called a bit more often in some functions.

thomastech · 2018-10-01T19:57:47Z

@TD-er: I don't know why, but the slow page loading has gone away and refresh is OK now. If there are no other reports of it then I'd say it was an isolated issue related to my WiFi router.

Thomas

TD-er · 2018-10-01T20:00:52Z

Phew, I was afraid you had to report a crash/reboot already.

TD-er · 2018-10-01T20:08:46Z

I was thinking about this....
The webserver has some mechanism to free memory when it is too low.
This freeing memory may take some time which makes the webinterface slow down.

thomastech · 2018-10-02T02:32:59Z

Interesting, I wasn't aware the webserver could do that kind of magic. It may have been involved because during the slow browser response the system load was about 40%. But after the problem went away it settled down to about 30% load.

Now the bad news. One of the test devices rebooted after 10 hours.
System info report:
Boot: Manual reboot (1)
Reset Reason: Software Watchdog

I'll let the other device continue to run until it reboots. Maybe the force is stronger with this one.

Thomas

TD-er · 2018-10-02T05:32:54Z

It is not that bad, since a software watchdog is different from a hardware watchdog.
The software version means it is still doing stuff

thomastech · 2018-10-02T14:48:35Z

You shouldn't have said that it is not that bad because Murphy is watching us. The second device rebooted at 19 hours due to hardware reset.
System info report:
Boot: Manual reboot (2)
Reset Reason: Hardware Watchdog

BTW, the other device rebooted again a few minutes ago. Another Software reset.
System info report:
Boot: Manual reboot (3)
Reset Reason: Software Watchdog

Thomas

TD-er · 2018-10-02T15:24:44Z

OK, so that's not the fix :(
Can you also/already test using this PR: #1838 ?
I will later this evening add extra settimeout calls to other WiFi client instances, so it is not complete yet.

thomastech · 2018-10-02T15:53:44Z

No problem, I'll test PR #1838 after it is incorporated in the nightly build.

Thomas

thomastech · 2018-10-04T16:26:56Z

I'm running both devices on ESP_Easy_mega-20181004_dev_ESP8266_4096.bin. Twelve hours so far, no reboots. Fingers crossed.

Thomas

giig1967g · 2018-10-04T17:43:54Z

I am running 20181002 with Mikrotik router. No reboots so far. Running for 2 days and 4 hours.

s0170071 · 2018-10-04T18:09:13Z

@TD-er do you think this could help?
Somebody just needs to figure out how to conveniently link the .elf file from the build date to the exception decoder...

Another thought on this: If it is possible to catch the exception and save stuff, can't we just catch that exception too and do something with it ? Save some text, send an email, ignore it and carry on ?

TD-er · 2018-10-04T18:25:21Z

@thomastech Just curious, what are the memory and stack stats of that node running the dev build?

@s0170071 That's a very nice library.
I think we could try it, to see what's happening.
Also I could read the last crash log at boot and write it to SPIFFS.
Or just add a 'crash log report' option to send the crash to a server
I think it deserves its own issue. (to make it easier to find)

TD-er · 2018-10-09T08:39:23Z

That last remark sounds reasonable and may save a lot of searching :)

s0170071 · 2018-10-09T09:13:48Z

If we had heap fragmentation, what would happen if you cannot allocate new memory ? I would assume the pointer returned by new() to be null. Is this correct ?
If so, a viable test could be to now and then try to allocate some useful heap (leave 3k free for wifi) and see if that worked. And then just free it again.
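A rough sketch of the probe described above, assuming a non-throwing allocation so that failure shows up as a null pointer. `canAllocateContiguous` is a hypothetical helper name, and the block size to probe (and the 3k to leave free for WiFi) would be up to the caller.

```cpp
#include <cstddef>
#include <new>

// Tries to allocate one contiguous block of `bytes` and frees it right away.
// Returns false when the allocator cannot provide such a block, which on a
// fragmented heap can happen even though total free memory looks fine.
bool canAllocateContiguous(std::size_t bytes) {
  char* probe = new (std::nothrow) char[bytes];
  if (probe == nullptr) return false;
  // Touch the block through a volatile pointer so an optimizer cannot
  // drop the allocate/free pair entirely.
  volatile char* touch = probe;
  touch[0] = 0;
  delete[] probe;
  return true;
}
```

Note that a successful probe only says the block was available at that moment; another allocation may take it before the real request is made.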

TD-er · 2018-10-09T09:18:53Z

In the staged version of the core lib there is some development on that: https://github.com/esp8266/Arduino/pull/5090/commits

So I can have a look at that and make a test build which will also show the heap statistics when available.
Just to get some idea on what's happening.

Allocations with new should indeed return a NULL pointer, but String will fail silently.

s0170071 · 2018-10-09T11:02:03Z

very good. Seems to be an issue then.
About the strings failing silently: if you allocate / new() some memory block, free() it again you should be able to string.reserve() it afterwards without trouble, right ?

Sounds like a wrapper function #define for String.reserve()

TD-er · 2018-10-09T14:51:19Z

std::string is traditionally (in the STL library for C++) a standard container, which does the allocation/deallocation for you. The Arduino String class is loosely based on the same principle, only with some extras and with some other functions missing.
Maybe we can check for the actual capacity of the String after calling a reserve. Not sure yet if those are publicly accessible. But you shouldn't do new and free/delete on the members of String or else members will get out of sync.

s0170071 · 2018-10-09T15:40:59Z

No free on the string. I meant to check if there is heap available, try to allocate it, free it and then reserve the string buffer.

uzi18 · 2018-10-09T16:20:58Z

@TD-er maybe it is possible to just allocate/reserve some big buffer 200+ chars and use it as static place to manipulate with strings?

TD-er · 2018-10-09T16:34:46Z

Then you have to implement a lot of operations yourself.

TD-er · 2018-10-23T08:39:56Z

There is no mention of MQTT in this thread as far as I can see.

Yesterday I added a delay(1) to the readByte part of MQTT client PubSubClient.
Can you please test if this is now still an issue?

And if another plugin is active, please mention that one too.
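For readers wondering what a pause inside a read loop buys: a minimal sketch of a bounded wait that pauses on every iteration and gives up after a timeout. This is not the PubSubClient code. `PollHooks` and `waitForData` are hypothetical, and the three hooks are assumed to map to something like `client.available()`, `millis()` and `delay(1)` on the device.

```cpp
#include <cstdint>
#include <functional>

// All three hooks are hypothetical stand-ins so the sketch compiles anywhere.
struct PollHooks {
  std::function<bool()> dataAvailable;    // e.g. client.available() > 0
  std::function<std::uint32_t()> nowMs;   // e.g. millis()
  std::function<void()> pause;            // e.g. delay(1)
};

// Waits until data is available or timeoutMs has elapsed, pausing on every
// iteration. Returns true if data arrived, false on timeout.
bool waitForData(const PollHooks& hooks, std::uint32_t timeoutMs) {
  const std::uint32_t start = hooks.nowMs();
  while (!hooks.dataAvailable()) {
    // Unsigned subtraction stays correct when the millisecond counter wraps.
    if (hooks.nowMs() - start >= timeoutMs) return false;
    hooks.pause();
  }
  return true;
}
```

The timeout keeps the loop from waiting forever, and the pause lets background tasks run while it waits.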

thomastech · 2018-10-25T15:09:41Z

I've been running ESP_Easy_mega-20181023_dev_ESP8266_4096.bin on two NodeMCU devices. One rebooted today, hardware Wdog reset.

Load: 23.20% (LC=9683)
Free Mem: 10520 (7232 - ruleMatch2)
Free Stack: 3536 (640 - LoadTaskSettings)
Boot: Manual reboot (3)
Reset Reason: Hardware Watchdog

Domosapiens · 2018-10-25T16:15:14Z

This could be a major game changer:
Release mega-20181025:
[WDT] Change yield() to delay(0)

thomastech · 2018-10-25T16:35:58Z

@Domosapiens: Thanks for the heads-up. I will flash ESP_Easy_mega-20181025_dev_ESP8266_4096.bin into my two devices.

Grovkillen · 2018-10-25T16:36:43Z

Yep, we hope to close this one. 👍

TD-er · 2018-10-25T19:04:45Z

@thomastech What uptime did you get on your node?
And please have a look at the controller settings.
Especially those that may increase memory usage, like Max Queue depth and minimum send interval.

thomastech · 2018-10-25T19:36:57Z

@TD-er: The device that rebooted had been manually reset (RST button press). Then about 18 hours later it rebooted due to hardware wdog.

MQTT controller settings:

TD-er · 2018-10-25T19:41:17Z

Hmm, those are "interesting" settings.
No retries, no queue and "ignore new".
So in other words, a new sample will be tried once and kept in the queue when there is no wifi connection.
Also at first attempt it will be removed from the queue.

I would expect "delete oldest" when using no queue, or else you may prefer an older value when the broker has been unreachable for a while.
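To make the difference between the two "queue full" behaviours concrete, here is a small sketch of a bounded queue with both policies. The class and enum names are illustrative only and are not ESPEasy identifiers; the real controller queue may well behave differently in its details.

```cpp
#include <cstddef>
#include <deque>

// Illustrative names only; these are not ESPEasy identifiers.
enum class FullPolicy { IgnoreNew, DeleteOldest };

class BoundedQueue {
 public:
  BoundedQueue(std::size_t maxDepth, FullPolicy policy)
      : maxDepth_(maxDepth), policy_(policy) {}

  // Returns true if the sample ended up in the queue.
  bool push(int sample) {
    if (maxDepth_ == 0) return false;  // nothing can be stored at depth 0
    if (items_.size() >= maxDepth_) {
      if (policy_ == FullPolicy::IgnoreNew) return false;  // keep old data
      items_.pop_front();                                  // drop the oldest
    }
    items_.push_back(sample);
    return true;
  }

  const std::deque<int>& items() const { return items_; }

 private:
  std::size_t maxDepth_;
  FullPolicy policy_;
  std::deque<int> items_;
};
```

With a depth of 2 and samples 1, 2, 3 pushed while the broker is unreachable, "ignore new" ends up holding 1 and 2, whereas "delete oldest" holds 2 and 3. That is why "ignore new" can leave you sending stale values after a long outage.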

thomastech · 2018-10-25T20:27:21Z

> Hmm, those are "interesting" settings.

They were the defaults when I originally installed the Controller. What should all the settings be for a typical OpenHab MQTT controller?

Thomas

TD-er · 2018-10-25T21:14:54Z

You can delete the controller and re-add it.
Then you have the new defaults. (make sure to press save after adding it)

Proper defaults are:

You may lower the minimum send interval if your broker is fast enough.
I run 10 msec here on a raspberry pi 3

Domosapiens · 2018-10-25T23:05:16Z

Installed Release mega-20181025 yesterday (because it was not available earlier ;)
No conclusions yet, but with the last daily releases, I have seen no memory nor stack problems.

(so great that you just can paste a snapshot!)

Uptime still seems to be a problem.
But ... I'm hunting also for the cause of excessive RCWL-0516 (multiple units in the lab interfering?) detections
As with #1857 I need to use a rule for LDC On/Off resulting in excessive rule calls. So no conclusions yet.

TD-er · 2018-10-25T23:08:58Z

Nice to see the free stack is also increasing a few bytes at a time on new builds :)

thomastech · 2018-10-25T23:39:05Z

> You can delete the controller and re-add it. Then you have the new defaults.

@TD-er: Thanks, MQTT controller has been updated with new defaults.

s0170071 · 2018-10-26T05:55:52Z

@Domosapiens I have a set of nodes running for several days now. The build from yesterday evening was running all night.
Your uptime problems must be due to something else. Try fresh hardware and another power supply, with no devices/plugins. Please report back if that works better.

Domosapiens · 2018-10-26T09:32:58Z

@s0170071 Thanks for your advice.

I have 4 boxes under test as described here:
https://www.letscontrolit.com/forum/viewtopic.php?f=2&t=5955&sid=db230a574377fbb18394ecdcb9e9b75a
So fresh HW is not an option, power supply is sufficient and clean, and with no devices/plugins they are useless.

Yes, I can understand your positive experience ....
Without hardware there is no reason for the Hardware Watchdog to reboot ;)
But I will flash a few bare Wemos units.

One unit is running mega-2080322 for over 141 hr !!! No reboot. No DS18B20 NAN.

With the other 3, I follow the latest developments.
One unit did 40 hr, the others less.

Still hunting for the dog!

thomastech · 2018-10-27T01:13:19Z

Feedback on ESP_Easy_mega-20181025_dev_ESP8266_4096.bin

One NodeMCU still running without reboot. ~28 hrs.
Second NodeMCU rebooted at 27 hrs. Details below.

Load: 25.50% (LC=9670)
Free Mem: 10848 (8144 - sendContentBlocking)
Free Stack: 3584 (720 - LoadTaskSettings)
Boot: Manual reboot (2)
Reset Reason: Hardware Watchdog

Thomas

TD-er · 2019-10-27T11:56:35Z

I think this is no longer an issue. If it still is, please open a new issue.

I will close this one now, since its last post was a year ago.

TD-er added the labels "Status: Needs Info" (Needs more info before action can be taken) and "Category: Stabiliy" (Things that work, but not as long as desired) on Aug 16, 2018

thomastech mentioned this issue Oct 25, 2018

[OpenHAB MQTT] Crash due to yield panic #1625

Closed

TD-er closed this as completed Oct 27, 2019

ptr727 mentioned this issue Oct 22, 2020

Add physical resource consumption diagnostic sensors esphome/feature-requests#963

Closed
