Hardware watchdog... how to find the cause? #1656

Closed
giig1967g opened this issue Aug 16, 2018 · 70 comments
Labels
Category: Stabiliy (Things that work, but not as long as desired)
Status: Needs Info (Needs more info before action can be taken)


@giig1967g
Contributor

Hi,
With the latest versions I am experiencing regular reboots (every other day) with "Hardware watchdog" as the reboot cause.
I have also changed from static IP to DHCP.
How can I find the cause of the hardware watchdog reset?
What exactly does it mean?

@TD-er
Member

TD-er commented Aug 16, 2018

If some piece of code is running for over 6 seconds without calling any delay or yield, it will trigger a hardware watchdog, which performs a reset.

So there is some code in your setup either waiting for that long, or running an "infinite loop".
Could you please give more info on your setup?
Also do not set the "MessageDelay" too high, nor use "delay" in the rules.
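A minimal sketch of the pattern described above: long-running work is split into chunks and control is handed back between chunks. The names `sumWithYield` and `chunkSize` are hypothetical and not taken from ESPEasy. The yield hook is injected so the code also compiles on a desktop; on an ESP8266 it would typically wrap `delay(0)` or `yield()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Sum a large buffer in bounded chunks, calling `yieldFn` between chunks.
// Assumption: on the device, `yieldFn` would be something like
// `[] { delay(0); }`, so no stretch of code runs for seconds without yielding.
std::uint64_t sumWithYield(const std::vector<std::uint32_t>& data,
                           std::size_t chunkSize,
                           const std::function<void()>& yieldFn) {
  assert(chunkSize > 0);  // a zero chunk size would never yield
  std::uint64_t total = 0;
  for (std::size_t i = 0; i < data.size(); ++i) {
    total += data[i];
    // After every full chunk, hand control back to the system.
    if ((i + 1) % chunkSize == 0) {
      yieldFn();
    }
  }
  return total;
}
```

The same idea applies to any loop whose iteration count depends on external input: bound the work done between two yields rather than the total work.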

@TD-er TD-er added Status: Needs Info Needs more info before action can be taken Category: Stabiliy Things that work, but not as long as desired labels Aug 16, 2018
@thomastech
Contributor

thomastech commented Aug 16, 2018

@TD-er: FYI, the latest crash/reboot I mentioned in #1643 was reported in the GUI as Reset Reason | Hardware Watchdog. This was the test system that had run for a day, then the WiFi went offline for a couple hours, then the board rebooted on its own. Might be related to this issue, or maybe not.

  • Thomas

@giig1967g
Contributor Author

Here is my config:
[Two screenshots dated 2018-08-18; images not preserved]

@TD-er
Member

TD-er commented Aug 18, 2018

See my comment here: #1659 (comment)
Looks like my nodes are also "affected", which is good :)

@thomastech
Contributor

@TD-er: Yesterday I loaded the ESP_Easy_mega-20180815_test_ESP8266_4096 build on a NodeMCU. It ran great for 18 hours then rebooted. System Info says: Boot : Manual reboot (1), Reset Reason : Hardware Watchdog.

A second duplicate NodeMCU is still running fine. But it has only been running for 17 hrs, so it may face the reboot dance soon.

  • Thomas

@TD-er
Member

TD-er commented Aug 18, 2018

Would be great to see if it occurs at the same interval, or time of day.
Maybe it is some NTP refresh, or something else, who knows.

@thomastech
Contributor

@TD-er: That would be great. But so far I have not seen a pattern that indicates it is triggered by run duration or time of day.

My hunch is that it is something related to WiFi, such as a reconnect. But I have tried to torture the WiFi connection (force router offline, create weak RF signal levels) and nothing bad happened. So my hunch seems to be nonsense. Hopefully you find the cause and save us.

  • Thomas

@thomastech
Contributor

It rebooted again after running for about 2 hours. Now reports Manual reboot (2), Reset Reason : Hardware Watchdog.

The second duplicate NodeMCU is still running fine. About 19+ hours so far.

  • Thomas

@TD-er
Member

TD-er commented Aug 18, 2018

I found an issue with the handling of UDP traffic (when C013 is used). That could cause Exception crashes (not likely a Watchdog reset).
I also added some checks when creating a UDP client for NTP, to see whether that may cause infinite waiting.
Infinite waits like that can cause a watchdog reset.

When tested, I will make a commit for it.

@thomastech
Contributor

@TD-er
I've been testing the new builds as they are released. So far none have solved the Watchdog reset. However, the latest ESP_Easy_mega-20180922_dev_ESP8266_4096.bin build seems better than the last release.

The latest firmware has been successfully running on one of my NodeMCU's (current duration 1 day 22 hrs, no reboots). But the second NodeMCU has rebooted several times.

I have a feeling the rebooting is related to the WiFi access since the latest reboot occurred when I accessed the device from my browser. I also noticed a previous reboot occurred at a time when the good "working" device reported a beacon timeout. But I can't replicate the reboots on demand.

SysInfo reports this:
Boot: Manual reboot (8)
Reset Reason: Hardware Watchdog

Not sure if you are interested in this feedback. But it's been more than a month since my last update and I thought I'd keep you posted on my findings.

  • Thomas

@TD-er
Member

TD-er commented Sep 24, 2018

I'm always interested in feedback, especially when there seems to be a bit of improvement :)

@thomastech
Contributor

thomastech commented Sep 24, 2018

The reboot issue certainly has been a tough nut to crack.

My winning streak with the "working" device just ended. After 48 hours runtime it rebooted. Sysinfo reports Manual reboot (1) Hardware Watchdog

  • Thomas

@TD-er
Member

TD-er commented Sep 30, 2018

See PR #1834
I just merged a change which moves the address space of the Arduino stack to be on top of the System stack.
The latest core library appears to have shifted the Arduino stack to overlap a bit with the System stack to save about 4k of memory.
But since we're allocating quite a lot on the System stack, this may have led to an increase in reports of HW Watchdog resets.

So please test with the October 1st build, as soon as that one is ready.

@thomastech
Contributor

I've installed ESP_Easy_mega-20181001_dev_ESP8266_4096.bin on two NodeMCUs. I will report back tomorrow (or sooner) if they experience a reboot.

Some initial comments:

  1. The 4K decrease in free system RAM is disappointing. If this change to stack memory does not help then please consider reverting it.
  2. Not sure if it is related to the new firmware, but web page access is slow. Sometimes navigating the tabs takes several seconds for each new page to populate.

Thanks again for your efforts. Fingers are crossed that this merge helps cure the W-dog reboot issue.

  • Thomas

@TD-er
Member

TD-er commented Oct 1, 2018

Hmm, I had the impression it was loading a bit faster with the changed Arduino stack location.
But maybe getting the free stack for statistics uses quite some resources and is called a bit more often in some functions.

@thomastech
Contributor

@TD-er: I don't know why, but the slow page loading has gone away and refresh is OK now. If there are no other reports of it then I'd say it was an isolated issue related to my WiFi router.

  • Thomas

@TD-er
Member

TD-er commented Oct 1, 2018

Phew, I was afraid you had to report a crash/reboot already.

@TD-er
Member

TD-er commented Oct 1, 2018

I was thinking about this...
The webserver has a mechanism to free memory when it runs too low.
Freeing this memory may take some time, which slows down the web interface.

@thomastech
Contributor

Interesting, I wasn't aware the webserver could do that kind of magic. It may have been involved because during the slow browser response the system load was about 40%. But after the problem went away it settled down to about 30% load.

Now the bad news. One of the test devices rebooted after 10 hours.
System info report:
Boot: Manual reboot (1)
Reset Reason: Software Watchdog

I'll let the other device continue to run until it reboots. Maybe the force is stronger with this one.

  • Thomas

@TD-er
Member

TD-er commented Oct 2, 2018

It is not that bad, since a software watchdog is different from a hardware watchdog.
A software watchdog reset means the node was still doing stuff.

@thomastech
Contributor

You shouldn't have said that it is not that bad because Murphy is watching us. The second device rebooted at 19 hours due to hardware reset.
System info report:
Boot: Manual reboot (2)
Reset Reason: Hardware Watchdog

BTW, the other device rebooted again a few minutes ago. Another Software reset.
System info report:
Boot: Manual reboot (3)
Reset Reason: Software Watchdog

  • Thomas

@TD-er
Member

TD-er commented Oct 2, 2018

OK, so that's not the fix :(
Can you also/already test using this PR: #1838 ?
Later this evening I will add extra setTimeout calls to other WiFi client instances, so it is not complete yet.

@thomastech
Contributor

No problem, I'll test PR #1838 after it is incorporated in the nightly build.

  • Thomas

@thomastech
Contributor

I'm running both devices on ESP_Easy_mega-20181004_dev_ESP8266_4096.bin. Twelve hours so far, no reboots. Fingers crossed.

  • Thomas

@giig1967g
Contributor Author

I am running 20181002 with Mikrotik router. No reboots so far. Running for 2 days and 4 hours.

@s0170071
Contributor

s0170071 commented Oct 4, 2018

@TD-er do you think this could help?
Somebody just needs to figure out how to conveniently link the .elf file from the build date to the exception decoder...

Another thought on this: if it is possible to catch the exception and save stuff, can't we just catch that exception too and do something with it? Save some text, send an email, ignore it and carry on?

@TD-er
Member

TD-er commented Oct 4, 2018

@thomastech Just curious, what are the memory and stack stats of that node running the dev build?

@s0170071 That's a very nice library.
I think we could try it, to see what's happening.
I could also read the last crash log at boot and write it to SPIFFS,
or just add a 'crash log report' option to send the crash to a server.
I think it deserves its own issue (to make it easier to find).

@TD-er
Member

TD-er commented Oct 9, 2018

That last remark sounds reasonable and may save a lot of searching :)

@s0170071
Contributor

s0170071 commented Oct 9, 2018

If we had heap fragmentation, what would happen if you cannot allocate new memory? I would assume the pointer returned by new() to be null. Is this correct?
If so, a viable test could be to try now and then to allocate a useful amount of heap (leaving 3k free for WiFi) and see whether that worked. And then just free it again.
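A minimal sketch of such a probe, assuming the allocator returns a null pointer on failure. `canAllocateContiguous` is a hypothetical helper, not an existing ESPEasy function. On the device, the probe size could be derived from `ESP.getFreeHeap()` minus the 3k margin mentioned above.

```cpp
#include <cstddef>
#include <cstdlib>

// Returns true if one contiguous block of `bytes` can be allocated right now.
// The block is released immediately; this only probes, it reserves nothing.
bool canAllocateContiguous(std::size_t bytes) {
  if (bytes == 0) return true;
  // The volatile write below keeps the compiler from optimising the
  // malloc/free pair away and assuming success.
  volatile unsigned char* probe =
      static_cast<volatile unsigned char*>(std::malloc(bytes));
  if (probe == nullptr) {
    return false;  // no free block of this size: out of memory or fragmented
  }
  probe[0] = 0;  // touch the block so the allocation is observable
  std::free(const_cast<unsigned char*>(probe));
  return true;
}
```

If the total free heap is large but this probe fails for a much smaller size, fragmentation is a plausible suspect. A successful probe only describes the heap at that instant, so it cannot guarantee that a later allocation succeeds.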

@TD-er
Member

TD-er commented Oct 9, 2018

In the staged version of the core lib there is some development on that: https://github.com/esp8266/Arduino/pull/5090/commits

So I can have a look at that and make a test build which will also show the heap statistics when available.
Just to get some idea on what's happening.

Allocations with new should indeed return a NULL pointer, but String will fail silently.

@s0170071
Contributor

s0170071 commented Oct 9, 2018

Very good. Seems to be an issue then.
About the strings failing silently: if you allocate / new() some memory block and free() it again, you should be able to string.reserve() it afterwards without trouble, right?

Sounds like a wrapper function #define for String.reserve()

@TD-er
Member

TD-er commented Oct 9, 2018

std::string is traditionally (in the C++ STL) a standard container, which does the allocation/deallocation for you. The Arduino String class is loosely based on the same principle, only with some extras and with some other functions missing.
Maybe we can check the actual capacity of the String after calling reserve. Not sure yet whether that is publicly accessible. But you shouldn't do new and free/delete on the members of String, or else the members will get out of sync.
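A minimal sketch of the "check after reserve" idea, written against `std::string` so that it runs on a desktop compiler. `tryReserve` is a hypothetical wrapper name. As far as I know, the Arduino `String::reserve()` returns a success flag that could be checked instead, but treat that as an assumption to verify against the core version in use.

```cpp
#include <cstddef>
#include <new>
#include <stdexcept>
#include <string>

// Try to reserve `wanted` characters and report whether it actually worked,
// instead of letting a failed reservation pass silently.
bool tryReserve(std::string& s, std::size_t wanted) {
  try {
    s.reserve(wanted);
  } catch (const std::length_error&) {
    return false;  // request exceeds max_size()
  } catch (const std::bad_alloc&) {
    return false;  // allocator could not supply the memory
  }
  // Confirm the buffer really is large enough before the caller relies on it.
  return s.capacity() >= wanted;
}
```

A caller would then skip or shorten the string operation when `tryReserve` returns false, rather than continuing with a truncated or empty string.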

@s0170071
Contributor

s0170071 commented Oct 9, 2018

No free on the string. I meant to check if there is heap available, try to allocate it, free it and then reserve the string buffer.

@uzi18
Contributor

uzi18 commented Oct 9, 2018

@TD-er maybe it is possible to just allocate/reserve one big buffer (200+ chars) and use it as a static place to manipulate strings?

@TD-er
Member

TD-er commented Oct 9, 2018

Then you have to implement a lot of operations yourself.

@TD-er
Member

TD-er commented Oct 23, 2018

As far as I can see, there is no mention of MQTT in this thread.

Yesterday I added a delay(1) to the readByte part of MQTT client PubSubClient.
Can you please test if this is now still an issue?

And if another plugin is active, please mention that one too.
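A minimal sketch of a byte-read loop that both yields and gives up after a timeout, in the spirit of the `delay(1)` change mentioned above. This is not the PubSubClient code. `ByteSource` and `readByteWithTimeout` are hypothetical names, and the platform calls are injected so the sketch can run off-device.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical stand-ins for client and platform calls. On an ESP8266 they
// might map to client.available(), client.read(), millis() and delay(1).
struct ByteSource {
  std::function<int()> available;        // number of bytes ready to read
  std::function<int()> read;             // next byte, 0..255
  std::function<std::uint32_t()> nowMs;  // monotonic millisecond clock
  std::function<void()> idle;            // yields to the system
};

// Wait for one byte, but never longer than timeoutMs and never without yielding.
bool readByteWithTimeout(const ByteSource& src, std::uint32_t timeoutMs,
                         std::uint8_t& out) {
  const std::uint32_t start = src.nowMs();
  while (src.available() <= 0) {
    // Unsigned subtraction stays correct across clock rollover.
    if (src.nowMs() - start >= timeoutMs) {
      return false;  // give up instead of spinning until the watchdog bites
    }
    src.idle();
  }
  out = static_cast<std::uint8_t>(src.read());
  return true;
}
```

The two safeguards are independent: the idle call keeps the system serviced while waiting, and the timeout bounds the wait when the peer never answers.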

@thomastech
Contributor

I've been running ESP_Easy_mega-20181023_dev_ESP8266_4096.bin on two NodeMCU devices. One rebooted today, hardware Wdog reset.

Load: 23.20% (LC=9683)
Free Mem: 10520 (7232 - ruleMatch2)
Free Stack: 3536 (640 - LoadTaskSettings)
Boot: Manual reboot (3)
Reset Reason: Hardware Watchdog

[Screenshots: controllers, devices; images not preserved]

@Domosapiens

This could be a major game changer:
Release mega-20181025:
[WDT] Change yield() to delay(0)

@thomastech
Copy link
Contributor

@Domosapiens: Thanks for the heads-up. I will flash ESP_Easy_mega-20181025_dev_ESP8266_4096.bin into my two devices.

@Grovkillen
Member

Yep, we hope to close this one. 👍

@TD-er
Member

TD-er commented Oct 25, 2018

@thomastech What uptime did you get on your node?
And please have a look at the controller settings.
Especially those that may increase memory usage, like Max Queue depth and minimum send interval.

@thomastech
Contributor

@TD-er: The device that rebooted had been manually reset (RST button press). Then about 18 hours later it rebooted due to hardware wdog.

MQTT controller settings:
[Screenshot: controller_1; image not preserved]

@TD-er
Member

TD-er commented Oct 25, 2018

Hmm, those are "interesting" settings.
No retries, no queue and "ignore new".
So in other words, a new sample will be tried once and kept in the queue when there is no wifi connection.
Also at first attempt it will be removed from the queue.

I would expect "delete oldest" when using no queue, or else you may prefer an older value when the broker has been unreachable for a while.
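A minimal sketch of the two overflow policies being compared, using a queue depth of 1 to approximate the "no queue" setting. The names `FullPolicy` and `BoundedQueue` are hypothetical; ESPEasy's own controller queue may be implemented differently.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// What to do with a new sample when the queue is already full.
enum class FullPolicy { IgnoreNew, DeleteOldest };

class BoundedQueue {
 public:
  BoundedQueue(std::size_t maxDepth, FullPolicy policy)
      : maxDepth_(maxDepth), policy_(policy) {}

  // Returns true if `sample` ended up in the queue.
  bool push(const std::string& sample) {
    if (maxDepth_ == 0) return false;
    if (items_.size() >= maxDepth_) {
      if (policy_ == FullPolicy::IgnoreNew) return false;  // keep the old value
      items_.pop_front();                                  // drop the oldest
    }
    items_.push_back(sample);
    return true;
  }

  const std::deque<std::string>& items() const { return items_; }

 private:
  std::size_t maxDepth_;
  FullPolicy policy_;
  std::deque<std::string> items_;
};
```

With `IgnoreNew`, a value queued while the broker was unreachable stays in place and newer readings are discarded. With `DeleteOldest`, the queue always holds the most recent reading.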

@thomastech
Contributor

thomastech commented Oct 25, 2018

> Hmm, those are "interesting" settings.

They were the defaults when I originally installed the Controller. What should all the settings be for a typical OpenHab MQTT controller?

  • Thomas

@TD-er
Member

TD-er commented Oct 25, 2018

You can delete the controller and re-add it.
Then you have the new defaults. (make sure to press save after adding it)

Proper defaults are:
[Screenshot of the default controller settings; image not preserved]

You may lower the minimum send interval if your broker is fast enough.
I run 10 msec here on a Raspberry Pi 3.

@Domosapiens

Installed Release mega-20181025 yesterday (because it was not available earlier ;)
No conclusions yet, but with the last daily releases, I have seen no memory nor stack problems.
[Two screenshots; images not preserved]
(So great that you can just paste a snapshot!)

[Screenshot; image not preserved]
Uptime still seems to be a problem.
But ... I'm also hunting for the cause of excessive RCWL-0516 detections (multiple units in the lab interfering?).
As with #1857, I need to use a rule for LDC On/Off, resulting in excessive rule calls. So no conclusions yet.

@TD-er
Member

TD-er commented Oct 25, 2018

Nice to see the free stack is also increasing a few bytes at a time on new builds :)

@thomastech
Contributor

> You can delete the controller and re-add it. Then you have the new defaults.

@TD-er: Thanks, MQTT controller has been updated with new defaults.

@s0170071
Contributor

@Domosapiens I have a set of nodes running for several days now. The build from yesterday evening was running all night.
Your uptime problems must be due to something else. Try fresh hardware, another power supply, and no devices/plugins. Please report back if that worked better.

@Domosapiens

@s0170071 Thanks for your advice.

I have 4 boxes under test as described here:
https://www.letscontrolit.com/forum/viewtopic.php?f=2&t=5955&sid=db230a574377fbb18394ecdcb9e9b75a
So fresh HW is not an option, power supply is sufficient and clean, and with no devices/plugins they are useless.

Yes, I can understand your positive experience ....
Without hardware there is no reason for the Hardware Watchdog to reboot ;)
But I will flash a few bare Wemos units.

One unit is running mega-2080322 for over 141 hr !!! No reboot. No DS18B20 NAN.

With the other 3, I follow the latest developments.
One unit did 40 hr, the others less.

Still hunting for the dog!

@thomastech
Contributor

Feedback on ESP_Easy_mega-20181025_dev_ESP8266_4096.bin

One NodeMCU still running without reboot. ~28 hrs.
Second NodeMCU rebooted at 27 hrs. Details below.

Load: 25.50% (LC=9670)
Free Mem: 10848 (8144 - sendContentBlocking)
Free Stack: 3584 (720 - LoadTaskSettings)
Boot: Manual reboot (2)
Reset Reason: Hardware Watchdog

Thomas

@TD-er
Member

TD-er commented Oct 27, 2019

I think this is no longer an issue. If it still is, please open a new issue.

I will close this one now, since its last post was a year ago.
