Manual reboot () due to hardware watchdog in release mega-20180922 #1774
Comments
Do they have all these sensors at the same node? |
Yes |
I just flashed a node with the SDS011 with this firmware, to see if serial I/O may be causing these issues. |
Because the PMS7003 still does not work, most units have an SDS021 instead. Sensors: BME280, BH1750, PIR, MH-Z19, TVoc CCS811, SDS021. So should I disable the BME280? |
But we still have a general problem, because even the units without connected sensors do not run long. I believe that deactivating sensors does not help us. First of all, the units would have to run without sensors for days or weeks. Then you can gradually add sensors. |
On the other hand, the reboot intervals you report are way shorter than anyone else's. |
Yes, power is always a tricky issue to evaluate. I use 5V USB UPS on some of my units and they never reboot. So could you give us info about the setup you're using? |
My 16 nodes are all running fine now with all current changes for 1 day and 16 hours. Not a single reboot! :-) And I have all kinds of sensors in use, using GPIO, I2C and HW serial... |
OK, I will check that. Each unit has its own power supply. I have always bought high-quality power supplies (e.g. Aukey 2.4A with 48W power supply adapter, AUKEY USB C charger with 46W power, Volutz 60 Watt 12A 5V). |
Please do, but if they all are reporting reboots it may very well be a network issue as well? |
And just a curious question. Do they ALL have these capacitors? What if you remove those on one? As a test? |
The power sounds OK to me. It could still be WiFi related. |
I do not believe it is a network problem. The network has about 20 Wemos D1 running old, simple software that I wrote myself. Several LED drivers with 3-6 100 Watt LEDs are connected to these units. I have been using this for years to switch the lights in the apartment via PIR. These units run for months without rebooting. So I use the same hardware for ESPEasy. @Grovkillen |
What core lib do these other nodes use? And WiFi related doesn't mean it is a problem in your access point. It can also be something in the core libraries. |
No idea. The version is already several years old. Because they work, I have not changed anything. |
I have just been looking at the uptime of my nodes. So I will look into what core lib that was. |
Now we have 2.4.2? |
Yep, so simply changing to 2.4.1 in the platformio.ini could help. |
Thank you Gijs. |
It may be a coincidence, but the unit with the BME280 disabled has been running for 5 hours now. |
That's also good news. |
Yes, the PMSx003 plugin only worked for me for a few minutes. The BME280 is quite important, I think. For temperature, pressure and humidity, I have found no alternatives at a good price. |
It sure is and it is probably one of the more popular plugins. |
Wow, that's a lot of time. |
Just one such line from my stats dump: So that is quite close to the (software) watchdog timeout, and maybe the recent versions changed the hardware watchdog timeout to match the 2 sec. |
That may well be possible. Can you perhaps adjust something in the library (filter standby ...)? |
I found the bug in the BME280 plugin. Just curious, is it still running fine with BME280 disabled? The line I posted earlier now shows: |
I switched off the reporting of the pulse counter to the controller. I don't understand your interrupt disable/enable remark... Is it that the interrupt routine is corrupting something or has a misaligned return address? |
Nope, what I meant is that you're comparing behavior with some version of about 2 years ago. So it could be that the older one had a bug which resulted in HW watchdog reboots, but that bug may be fixed now and now we're looking at something else that may result in a HW watchdog reboot also. |
Inspired by this remark: #1774 (comment) When looking for the use of interrupts, I found this bug, where the interrupts could be left disabled if there was no reading.
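To illustrate the kind of bug described here, below is a minimal sketch of an early return that skips re-enabling interrupts, next to a scope-guard variant that cannot skip it. It is not the actual ESPEasy plugin code: `g_interruptsEnabled`, `InterruptLock`, `readSensorBuggy`, `readSensorFixed` and the placeholder reading `42` are all hypothetical, and a plain flag stands in for the `noInterrupts()` / `interrupts()` calls of the Arduino core so the sketch also compiles on a PC.

```cpp
#include <cassert>

// Stand-in for the MCU interrupt-enable state (assumption for this sketch).
// On the Arduino core the real calls would be noInterrupts() / interrupts().
static bool g_interruptsEnabled = true;

// Scope guard: disables "interrupts" on construction and re-enables them
// on destruction, so every return path restores the state.
class InterruptLock {
public:
  InterruptLock() { g_interruptsEnabled = false; }
  ~InterruptLock() { g_interruptsEnabled = true; }
  InterruptLock(const InterruptLock&) = delete;
  InterruptLock& operator=(const InterruptLock&) = delete;
};

// Buggy pattern: when there is no reading, the early return skips the
// re-enable step and interrupts stay off.
bool readSensorBuggy(bool dataAvailable, int& value) {
  g_interruptsEnabled = false;
  if (!dataAvailable) {
    return false;  // interrupts are still disabled here
  }
  value = 42;  // placeholder reading
  g_interruptsEnabled = true;
  return true;
}

// Fixed pattern: the guard re-enables interrupts on every exit path.
bool readSensorFixed(bool dataAvailable, int& value) {
  InterruptLock lock;
  if (!dataAvailable) {
    return false;  // guard destructor re-enables interrupts
  }
  value = 42;  // placeholder reading
  return true;
}

// Small usage example comparing both variants when no reading is available.
void exampleEarlyReturn() {
  int value = 0;
  g_interruptsEnabled = true;
  readSensorBuggy(false, value);
  assert(!g_interruptsEnabled);  // left disabled by the buggy variant
  g_interruptsEnabled = true;
  readSensorFixed(false, value);
  assert(g_interruptsEnabled);   // restored by the guard
}
```

If interrupts stay disabled like this, timer and WiFi handling can stall, which would be one plausible route to a hardware watchdog reset.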
As I described above, I downgraded my most unstable devices (four Wemos D1, each with 1x DS18B20 and 1x INA219, sending voltage every 10 sec and temperature every 30 sec to Domoticz MQTT) from mega-20190108 and mega-20190315 back to release mega-20181008, the first of them on 4th April. The graphs show the maximum uptime of this device in minutes: |
Last week I made some extensive logging on one of my own nodes. It shows the uptime and the WiFi RSSI value. |
@TD-er Glad you found something, ... very interesting, |
@Domosapiens |
@TD-er But how do you explain the RSSI value of +31? |
The +31 value is an error state of the RSSI function. |
Great! (Found it in the ESP8266 SDK API guide.) |
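As a rough sketch of how the +31 error value could be filtered out before the RSSI is used for anything else: the helper name `rssiIndicatesConnection` and the sample reading `-67` are hypothetical, and treating every non-negative value as "not connected" is an assumption of this sketch, not something confirmed for the ESPEasy code.

```cpp
#include <cassert>

// Error value of the RSSI function, as mentioned in this thread.
constexpr int kRssiError = 31;

// Hypothetical helper: a usable RSSI is a negative dBm value. The error
// value, and (as an assumption) any other non-negative reading, is treated
// as "no valid connection".
constexpr bool rssiIndicatesConnection(int rssi) {
  return rssi != kRssiError && rssi < 0;
}

// Usage example with illustrative readings.
void exampleRssiChecks() {
  assert(rssiIndicatesConnection(-67));         // plausible reading
  assert(!rssiIndicatesConnection(kRssiError)); // error state
}
```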
Well, the WiFi connection can be lost at any time.
But what does worry me is that the ESP misses beacon frames and also some other packets or events.
|
@TD-er ... related to RSSI Any positive result on your forked commit ... [WiFi] Use RSSI value to determine connected state ? |
@Domosapiens Using the RSSI value does seem to help a bit. One of my nodes that was rebooting about every day is now reconnecting instead. But it still had a HW watchdog reboot 4 days ago. About the measured RSSI: it is a value that changes constantly. |
Thanks @TD-er, I see the same fluctuations. A reboot once every 4 days would already be a big improvement! Will this change be applied to the main branch? |
Yep, I will merge it soon. |
@TD-er Ok, top RSSI difference ....I was talking about looking at WiFi scan and then directly at the Main page and vice versa. |
Not as far as I know. The only difference I can think of is that the scan is mainly listening (and switching channels) while during normal operations (especially when loading a web page), the ESP is also sending. But that's just speculation. |
I can confirm the reboots. Since I activated the pulse counter on a previously stable unit, it restarted several times. |
Intermediate report on the ongoing testing of different releases. The device mentioned above was flashed back to various older releases without a stable result. Meanwhile, on 12th July, I flashed it back to the release of 2018-05-22, with the result: |
Do you have any wifi reconnects in those 17 days? |
That last one. The ConnectFailures value is used in the WiFi reconnect threshold to trigger a reboot if the node wasn't able to connect to a host within the set number of attempts. I really have no idea why some nodes perform reconnects very well and others don't. Edit: |
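A minimal sketch of such a failure threshold, for readers unfamiliar with the mechanism. Only the name ConnectFailures comes from the comment above. The class `ConnectFailureTracker`, resetting the counter on a successful connect, and treating a threshold of 0 as "disabled" are assumptions of this sketch, not the actual ESPEasy implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tracker for consecutive connect failures.
class ConnectFailureTracker {
public:
  explicit ConnectFailureTracker(std::uint8_t threshold)
      : threshold_(threshold) {}

  // Record the outcome of one connect attempt.
  // Returns true when the caller should trigger a reboot.
  bool recordAttempt(bool connected) {
    if (connected) {
      connectFailures_ = 0;  // assumption: success clears the counter
      return false;
    }
    if (connectFailures_ < UINT8_MAX) {
      ++connectFailures_;    // saturate instead of wrapping around
    }
    // Assumption: a threshold of 0 disables the reboot entirely.
    return threshold_ > 0 && connectFailures_ >= threshold_;
  }

  std::uint8_t connectFailures() const { return connectFailures_; }

private:
  std::uint8_t threshold_;
  std::uint8_t connectFailures_ = 0;
};

// Usage example: with a threshold of 2, the second failure in a row
// requests a reboot.
void exampleThreshold() {
  ConnectFailureTracker tracker(2);
  assert(!tracker.recordAttempt(false));
  assert(tracker.recordAttempt(false));
}
```

With this shape, a node that reconnects successfully now and then never reaches the threshold, while a node stuck in failed attempts reboots after a bounded number of tries.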
I will mark this one as fixed, especially with the fixes I merged this morning and considering almost all of my own devices run now for double-digit days without a problem. |
The current version still has the same problem as the versions of the last few weeks.
I use ESP_Easy_mega-20180922_test_ESP8266_4096.bin
The units without sensors run only slightly longer, between one and four hours.
Units with sensors (BME280, BH1750, PIR, MH-Z19, TVoc CCS811, PMS7003)
reboot after a few minutes.
The running time is between 3 minutes and 30 minutes. It is always different.