Hardware watchdog... how to find the cause? #1656
If some piece of code is running for over 6 seconds without calling any delay or yield, it will trigger a hardware watchdog, which performs a reset. So there is some code in your setup either waiting for that long, or running an "infinite loop". |
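To illustrate the point above, here is a minimal host-side sketch of why a long loop needs to yield. It does not use the real ESP8266 watchdog: `SimWatchdog` and `runLongTask` are hypothetical names, the 6000 ms timeout is simply the figure quoted in the comment, and `feed()` stands in for what `delay()` or `yield()` would do on the device.

```cpp
#include <cstdint>

// Host-side model of a watchdog: it "bites" when more than timeoutMs of
// simulated time passes between two feeds. All names here are hypothetical.
struct SimWatchdog {
    uint32_t timeoutMs;
    uint32_t sinceFeedMs;
    bool bitten;

    explicit SimWatchdog(uint32_t timeout)
        : timeoutMs(timeout), sinceFeedMs(0), bitten(false) {}

    // Let simulated time pass without feeding the watchdog.
    void advance(uint32_t ms) {
        sinceFeedMs += ms;
        if (sinceFeedMs > timeoutMs) bitten = true;
    }

    // Stands in for delay()/yield() in a sketch.
    void feed() { sinceFeedMs = 0; }
};

// Runs `steps` units of work costing `stepMs` each. If yieldEvery > 0, the
// loop feeds the watchdog every `yieldEvery` steps.
// Returns true when the work finished without a (simulated) watchdog reset.
bool runLongTask(SimWatchdog& wdt, uint32_t steps, uint32_t stepMs, uint32_t yieldEvery) {
    for (uint32_t i = 0; i < steps; ++i) {
        wdt.advance(stepMs);
        if (wdt.bitten) return false;  // real hardware would reset here
        if (yieldEvery != 0 && (i + 1) % yieldEvery == 0) wdt.feed();
    }
    return true;
}
```

With a 6000 ms timeout, 100 steps of 100 ms fail when the loop never yields, and complete when it yields every 10 steps.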
@TD-er: FYI, the latest crash/reboot I mentioned in #1643 was reported in the GUI as Reset Reason | Hardware Watchdog. This was the test system that had run for a day, then the WiFi went offline for a couple hours, then the board rebooted on its own. Might be related to this issue, or maybe not.
|
See my comment here: #1659 (comment) |
@TD-er: Yesterday I loaded the ESP_Easy_mega-20180815_test_ESP8266_4096 build on a NodeMCU. It ran great for 18 hours then rebooted. System Info says: Boot : Manual reboot (1), Reset Reason : Hardware Watchdog. A second duplicate NodeMCU is still running fine. But it has only been running for 17 hrs, so it may face the reboot dance soon.
|
Would be great to see if it occurs at the same interval, or time of day. |
@TD-er: That would be great. But so far I have not seen a pattern that indicates it is triggered by run duration or time of day. My hunch is that it is something related to WiFi, such as a reconnect. But I have tried to torture the WiFi connection (force router offline, create weak RF signal levels) and nothing bad happened. So my hunch seems to be nonsense. Hopefully you find the cause and save us.
|
It rebooted again after running for about 2 hours. Now reports Manual reboot (2), Reset Reason : Hardware Watchdog. The second duplicate NodeMCU is still running fine. About 19+ hours so far.
|
I found an issue with the handling of UDP traffic (when C013 is used). That could cause Exception crashes. (not likely a Watchdog reset) When tested, I will make a commit for it. |
@TD-er The latest firmware has been successfully running on one of my NodeMCUs (current duration 1 day 22 hrs, no reboots). But the second NodeMCU has rebooted several times. I have a feeling the rebooting is related to WiFi access, since the latest reboot occurred when I accessed the device from my browser. I also noticed a previous reboot occurred at a time when the good "working" device reported a beacon timeout. But I can't replicate the reboots on demand. SysInfo reports this: Not sure if you are interested in this feedback. But it's been a month+ since my last update and I thought I'd keep you posted on my findings.
|
I'm always interested in feedback, especially when there seems to be a bit of improvement :) |
The reboot issue certainly has been a tough nut to crack. My winning streak with the "working" device just ended. After 48 hours runtime it rebooted. Sysinfo reports Manual reboot (1) Hardware Watchdog
|
See PR #1834 So please test with the October 1st build, as soon as that one is ready. |
I've installed ESP_Easy_mega-20181001_dev_ESP8266_4096.bin on two NodeMCUs. I will report back tomorrow (or sooner) if they experience a reboot. Some initial comments:
Thanks again for your efforts. Fingers are crossed that this merge helps cure the W-dog reboot issue.
|
Hmm I had the impression it was loading a bit faster with the changed Arduino stack location. |
@TD-er: I don't know why, but the slow page loading has gone away and refresh is OK now. If there are no other reports of it then I'd say it was an isolated issue related to my WiFi router.
|
Phew, I was afraid you had to report a crash/reboot already. |
I was thinking about this.... |
Interesting, I wasn't aware the webserver could do that kind of magic. It may have been involved because during the slow browser response the system load was about 40%. But after the problem went away it settled down to about 30% load. Now the bad news. One of the test devices rebooted after 10 hours. I'll let the other device continue to run until it reboots. Maybe the force is stronger with this one.
|
It is not that bad, since a software watchdog is different from a hardware watchdog. |
You shouldn't have said that it is not that bad because Murphy is watching us. The second device rebooted at 19 hours due to hardware reset. BTW, the other device rebooted again a few minutes ago. Another Software reset.
|
OK, so that's not the fix :( |
No problem, I'll test PR #1838 after it is incorporated in the nightly build.
|
I'm running both devices on ESP_Easy_mega-20181004_dev_ESP8266_4096.bin. Twelve hours so far, no reboots. Fingers crossed.
|
I am running 20181002 with Mikrotik router. No reboots so far. Running for 2 days and 4 hours. |
@TD-er do you think this could help? Another thought on this: If it is possible to catch the exception and save stuff, can't we just catch that exception too and do something with it ? Save some text, send an email, ignore it and carry on ? |
@thomastech Just curious, what are the memory and stack stats of that node running the dev build? @s0170071 That's a very nice library. |
That last remark sounds reasonable and may save a lot of searching :) |
If we had heap fragmentation, what would happen if you cannot allocate new memory ? I would assume the pointer returned by new() to be null. Is this correct ? |
In the staged version of the core lib there is some development on that: https://github.com/esp8266/Arduino/pull/5090/commits So I can have a look at that and make a test build which will also show the heap statistics when available. Allocations with new should indeed return a NULL pointer, but String will fail silently. |
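As a small illustration of checking the pointer returned by `new`, here is a sketch that can be compiled on a desktop. It uses `new (std::nothrow)` so that a hosted compiler also returns a null pointer on failure instead of throwing; whether plain `new` does the same depends on how the core is built. `duplicateBuffer` is a hypothetical helper, not an ESPEasy function.

```cpp
#include <cstddef>
#include <cstring>
#include <new>

// Copies `len` bytes from `src` into a freshly allocated, zero-terminated
// buffer. Returns nullptr when the allocation fails, so the caller can skip
// the work instead of writing through a bad pointer.
// The caller owns the result and must release it with delete[].
char* duplicateBuffer(const char* src, size_t len) {
    char* copy = new (std::nothrow) char[len + 1];
    if (copy == nullptr) return nullptr;  // heap exhausted or too fragmented
    std::memcpy(copy, src, len);
    copy[len] = '\0';
    return copy;
}
```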
Very good. That seems to be an issue then. Sounds like a case for a wrapper function or #define around String.reserve(). |
std::string is traditionally (in the STL library for C++) a standard container, which does the allocation/deallocation for you. The Arduino String class is loosely based on the same principle, only with some extras and with some other functions missing. |
No free on the string. I meant to check if there is heap available, try to allocate it, free it and then reserve the string buffer. |
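The probe-then-reserve idea could look roughly like the sketch below. It is written against `std::string` so it can run on a desktop; an Arduino version would call `String::reserve()` and check its return value instead of catching an exception. `tryReserve` and `appendChecked` are hypothetical names. Note that a successful probe does not guarantee the later reserve succeeds, which is why the result is still checked.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <string>

// Tries to make room for `len` characters in `s`.
// Returns false instead of failing silently when the memory is not available.
bool tryReserve(std::string& s, size_t len) {
    if (len == 0) return true;                 // nothing to reserve
    if (len > s.max_size()) return false;      // can never succeed
    void* probe = std::malloc(len);            // is a block this large available right now?
    if (probe == nullptr) return false;
    std::free(probe);                          // release it so reserve() can use the space
    try {
        s.reserve(len);
    } catch (const std::bad_alloc&) {
        return false;                          // the heap changed between probe and reserve
    }
    return s.capacity() >= len;
}

// Appends `extra` only if the combined length could be reserved first.
bool appendChecked(std::string& s, const std::string& extra) {
    if (!tryReserve(s, s.size() + extra.size())) return false;
    s += extra;
    return true;
}
```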
@TD-er maybe it is possible to just allocate/reserve some big buffer 200+ chars and use it as static place to manipulate with strings? |
Then you have to implement a lot of operations yourself. |
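To show what "implementing the operations yourself" means in practice, here is a minimal sketch of a fixed-size text buffer. `FixedText` is a hypothetical type; the 200-character capacity is only taken from the "200+ chars" suggestion above. It never touches the heap, but even appending text and numbers has to be written by hand.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

// Fixed-capacity text buffer: no heap use, so it cannot fragment the heap.
struct FixedText {
    static constexpr size_t kCapacity = 200;  // assumed size, not an ESPEasy value
    char data[kCapacity + 1];                 // +1 for the terminating zero
    size_t length;

    FixedText() : length(0) { data[0] = '\0'; }

    // Appends text. Returns false and leaves the buffer unchanged if it does not fit.
    bool append(const char* text) {
        const size_t n = std::strlen(text);
        if (n > kCapacity - length) return false;
        std::memcpy(data + length, text, n + 1);  // copy includes the terminator
        length += n;
        return true;
    }

    // Appends a decimal number, formatted into a small stack buffer first.
    bool appendInt(long value) {
        char tmp[24];
        std::snprintf(tmp, sizeof(tmp), "%ld", value);
        return append(tmp);
    }
};
```

Every further operation a String offers (search, replace, substring, and so on) would need the same kind of hand-written, bounds-checked code.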
There is no mention of MQTT in this thread as far as I can see. Yesterday I added a delay(1) to the readByte part of the MQTT client PubSubClient. And if another plugin is active, please mention that one too. |
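The effect of a `delay(1)` inside a read loop can be sketched on a desktop as follows. This is not the PubSubClient source: `FakeClient`, `FakeClock` and `readByteWithDelay` are hypothetical stand-ins, and `delayMs()` plays the role of Arduino's `delay()`, which hands control back to the system while waiting.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical stand-in for the network client.
struct FakeClient {
    std::deque<uint8_t> rx;   // bytes that will become readable
    uint32_t arrivesAtMs;     // simulated time at which rx becomes readable
    FakeClient() : arrivesAtMs(0) {}
};

// Hypothetical stand-in for the Arduino timing calls.
struct FakeClock {
    uint32_t nowMs;
    uint32_t delayCalls;
    FakeClock() : nowMs(0), delayCalls(0) {}
    void delayMs(uint32_t ms) { nowMs += ms; ++delayCalls; }
};

// Waits for one byte, giving up after timeoutMs. Each pass through the wait
// loop calls delayMs(1), so the loop never spins without yielding.
bool readByteWithDelay(FakeClient& client, FakeClock& clock,
                       uint32_t timeoutMs, uint8_t* result) {
    const uint32_t start = clock.nowMs;
    while (client.rx.empty() || clock.nowMs < client.arrivesAtMs) {
        if (clock.nowMs - start >= timeoutMs) return false;  // timed out
        clock.delayMs(1);
    }
    *result = client.rx.front();
    client.rx.pop_front();
    return true;
}
```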
This could be a major game changer: |
@Domosapiens: Thanks for the heads-up. I will flash ESP_Easy_mega-20181025_dev_ESP8266_4096.bin into my two devices. |
Yep, we hope to close this one. 👍 |
@thomastech What uptime did you get on your node? |
@TD-er: The device that rebooted had been manually reset (RST button press). Then about 18 hours later it rebooted due to hardware wdog. |
Hmm, those are "interesting" settings. I would expect "delete oldest" when using no queue, or else you may prefer an older value when the broker has been unreachable for a while. |
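The difference between the two full-queue behaviours can be sketched like this. `BoundedQueue` and `FullQueuePolicy` are hypothetical names, not the ESPEasy controller queue, and treating "no queue" as a depth of 1 is an assumption made for the example.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// What to do with a new message when the queue is already full.
enum class FullQueuePolicy { DeleteOldest, IgnoreNew };

class BoundedQueue {
public:
    BoundedQueue(size_t maxDepth, FullQueuePolicy policy)
        : maxDepth_(maxDepth), policy_(policy) {}

    // Returns true if the message was stored.
    bool push(const std::string& msg) {
        if (maxDepth_ == 0) return false;  // nothing can be stored at all
        if (items_.size() >= maxDepth_) {
            if (policy_ == FullQueuePolicy::IgnoreNew) return false;  // keep older values
            items_.pop_front();            // DeleteOldest: make room for the new value
        }
        items_.push_back(msg);
        return true;
    }

    const std::deque<std::string>& items() const { return items_; }

private:
    size_t maxDepth_;
    FullQueuePolicy policy_;
    std::deque<std::string> items_;
};
```

With a depth of 1, "delete oldest" always leaves the most recent value waiting for the broker, while "ignore new" keeps sending the value from the moment the broker became unreachable.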
They were the defaults when I originally installed the Controller. What should all the settings be for a typical OpenHab MQTT controller?
|
Installed Release mega-20181025 yesterday (because it was not available earlier ;)
|
Nice to see the free stack is also increasing a few bytes at a time on new builds :) |
@TD-er: Thanks, MQTT controller has been updated with new defaults. |
@Domosapiens I have a set of nodes running for several days now. The build from yesterday evening was running all night. |
@s0170071 Thanks for your advice. I have 4 boxes under test as described here: Yes, I can understand your positive experience .... One unit is running mega-2080322 for over 141 hr !!! No reboot. No DS18B20 NAN. With the other 3, I follow the latest developments. Still hunting for the dog! |
Feedback on ESP_Easy_mega-20181025_dev_ESP8266_4096.bin One NodeMCU still running without reboot. ~28 hrs.
Thomas |
I think this is no longer an issue. If it still is an issue, please open a new issue. I will close this one now, since its last post was a year ago. |
Hi,
In the latest versions I am experiencing a regular reboot (every other day) with "Hardware watchdog" as the reboot cause.
I have also changed from Static IP to DHCP.
How can I find the cause of the Hardware watchdog?
What exactly does it mean?