Manual reboot () due to hardware watchdog in release mega-20180922 #1774

micropet · 2018-09-22T09:07:23Z

The current version still has the same problem as the versions of the last few weeks.
I use ESP_Easy_mega-20180922_test_ESP8266_4096.bin

The units without sensors run only slightly longer, between one and four hours.

Units with sensors (BME280 BH1750 Pir MH-Z19 TVoc CCS811 PMS7003)
reboot after a few minutes.
The running time is between 3 minutes and 30 minutes. It is always different.

TD-er · 2018-09-22T09:10:03Z

Do they all have these sensors on the same node?

micropet · 2018-09-22T09:11:26Z

Yes

TD-er · 2018-09-22T09:14:33Z

I just flashed a node with the SDS011 with this firmware, to see if serial I/O may be causing these issues.
I will now flash another node with the BME280 and MH-Z19 to see if those are enough to get the same behavior.
I noticed the BME280 plugin sometimes has a long time logged in my statistics, so that may be one of the culprits. Maybe you could disable that one as a test to see if it improves stability.

micropet · 2018-09-22T09:14:56Z

Because the PMS7003 still does not work, most units have an SDS021.
So:

BME280 BH1750 Pir MH-Z19 TVoc CCS811 SDS021

So disable the BME280?
Good, I'll do that.

micropet · 2018-09-22T09:23:47Z

But we still have a general problem, because even the units without connected sensors do not run long.

I believe that deactivating sensors does not help us.

First of all, the units would have to run without sensors for days or weeks.

Then you can gradually add sensors.

TD-er · 2018-09-22T10:05:25Z

On the other hand, the reboot intervals you report are way shorter than anyone else's.
So we should start to dig down somewhere.

Grovkillen · 2018-09-22T10:10:07Z

Yes, power is always a tricky issue to evaluate. I use 5V USB UPS on some of my units and they never reboot.

So could you give us info about the setup you're using?

v-a-d-e-r · 2018-09-22T10:10:53Z

My 16 nodes are all running fine now with all current changes for 1 day and 16 hours. No single reboot! :-) And I have all kind of sensors in use with usage of GPIO, I2C and HW serial....

micropet · 2018-09-22T10:25:14Z

OK. I will check that.
That is difficult. There are currently 15 units running on different power supplies.

Each unit has its own power supply.
Each Wemos D1 has an 1800 μF capacitor on the 3.3 V line and an 1800 μF capacitor on the 5 V line.

I have always bought high-quality power supplies. (eg Aukey 2.4A with 48W power supply adapter, AUKEY USB C charger with 46W power, Volutz 60 Watt 12A 5V)

Grovkillen · 2018-09-22T10:27:30Z

Please do, but if they all are reporting reboots it may very well be a network issue as well?

Grovkillen · 2018-09-22T10:29:02Z

And just a curious question. Do they ALL have these capacitors? What if you remove those on one? As a test?

TD-er · 2018-09-22T10:32:39Z

The power sounds OK to me.
Also those would not likely result in watchdog issues.
Maybe I could lower the core library to 1.7.3, since the number of reported watchdog reboots has increased a lot after the update to 1.8.0.
Not that they were not reported before, but there are a lot more reports of those reboots than before.

It could still be WiFi related.

micropet · 2018-09-22T10:36:15Z

I do not believe in a network problem.
Currently, 51 WLAN devices are registered on both Unifi Access Points.

In the network there are about 20 Wemos D1 running old and simple software that I programmed myself.

There are several LED drivers with 3-6 100 Watt LEDs connected to these units. I have been using this for years to switch the light in the apartment via PIR.

These units run for months without rebooting.

So, the same hardware I use for ESPEasy.

@Grovkillen
Yes. All have these capacitors, including my own units.

TD-er · 2018-09-22T10:37:21Z

What core lib do these other nodes use?
Could be 2.3.x or older even?

And wifi related doesn't mean it is a problem in your access point. It can also be something in the core libraries.

micropet · 2018-09-22T10:38:43Z

@TD-er

No idea. The version is already several years old.

Because they work, I have not changed anything.

TD-er · 2018-09-22T10:41:22Z

I have just been looking at the uptime of my nodes.
One of them is running for 42 days now and is running ESP_Easy_mega-20180513_normal_ESP8266_4096.bin

So I will look into what core lib that was.
It is core 2.4.1

micropet · 2018-09-22T13:16:27Z

Now we have 2.4.2?

TD-er · 2018-09-22T14:43:45Z

Yep, so simply changing to 2.4.1 in the platformio.ini could help.
I can make a build with that for you to try if you like.
What build version do you need? (normal/test and flash size)

micropet · 2018-09-22T14:47:04Z

Thank you Gijs.
Is not necessary. I can change the core version myself and compile with platformio.

micropet · 2018-09-22T14:49:38Z

It may be a coincidence, but the unit with the BME280 disabled has been running for 5 hours now.

TD-er · 2018-09-22T14:51:37Z

That's also good news.
I have a lot of those laying around, so that makes it easy for me to test.
I also have those PMSx003, but then I have to fix that plugin first ;)

micropet · 2018-09-22T14:55:25Z

Yes, the PMSx003 plugin only worked for me for a few minutes.
Then no more data comes.

The BME280 is quite important, I think. With temperature, pressure and humidity, I find no alternatives at a good price.

TD-er · 2018-09-22T14:56:56Z

It sure is and it is probably one of the more popular plugins.
So I will have a look at it, to see why it appears to take up to 1.5 seconds sometimes. At least that's what my statistics claim.

micropet · 2018-09-22T14:58:23Z

Wow, that's a lot of time.

TD-er · 2018-09-22T15:01:45Z

Just one of such lines in my stats dump:
5132780 : PluginStats P_27_Environment - BMx280 ONCE_A_SECOND Count: 30 Avg/min/max 53395.60/306/1588655 usec

So that is quite close to the (software) watchdog timeout and maybe the recent versions changed the hardware watchdog timeout to match the 2 sec.
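The Avg/min/max figures in that stats line can be reproduced with a small accumulator. The following is only a minimal sketch, not ESPEasy's actual statistics code: `TimingStats` and `nearTimeout` are hypothetical names, and the 2-second budget is taken from the comment above rather than verified against the SDK. On a device the durations would be measured around each plugin call; here they are passed in as plain numbers.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not ESPEasy's own class): collects call durations
// in microseconds and reports count, min, max and average.
struct TimingStats {
    uint32_t count = 0;
    uint64_t totalUsec = 0;
    uint32_t minUsec = std::numeric_limits<uint32_t>::max();
    uint32_t maxUsec = 0;

    // Record the duration of one call.
    void add(uint32_t usec) {
        ++count;
        totalUsec += usec;
        if (usec < minUsec) minUsec = usec;
        if (usec > maxUsec) maxUsec = usec;
    }

    // Average duration, or 0 when nothing has been recorded yet.
    double avgUsec() const {
        return count == 0 ? 0.0 : static_cast<double>(totalUsec) / count;
    }

    // True when the slowest call used more than `fraction` of the
    // watchdog budget `timeoutUsec` (assumed value, see lead-in).
    bool nearTimeout(uint32_t timeoutUsec, double fraction) const {
        return count > 0 && maxUsec > timeoutUsec * fraction;
    }
};
```

With the maximum of 1588655 usec from the line above and an assumed budget of 2000000 usec, `nearTimeout(2000000, 0.75)` would flag the plugin, since the slowest call used more than three quarters of the budget.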

micropet · 2018-09-22T15:33:09Z

That may well be possible.
My unit is now running 6 hours 27 minutes without BME280. :)

Can you perhaps adjust something in the library (filter standby ...)?

TD-er · 2018-09-22T18:38:12Z

I found the bug in the BME280 plugin.
Should be fixed in #1779

Just curious, is it still running fine with BME280 disabled?

The line I posted earlier now shows:
561184 : PluginStats P_27_Environment - BMx280 ONCE_A_SECOND Count: 30 Avg/min/max 542.13/385/3036 usec

Domosapiens · 2019-04-14T21:28:58Z

I switched Off the reporting of the pulse counter to the controller.
Last 12hr 0 reboots for the unit with switched-Off 6Hz driven pulse counter and 2#DS18B20's
Last 12hr 5x reboots for the unit with switched-On 6Hz driven pulse counter and 2#DS18B20's.
Both units report once per minute the temperature.

I don't understand your interrupt disable/enable remark ...

Is it that the interrupt routine is corrupting something or has a misaligned return address?

TD-er · 2019-04-15T06:38:07Z

Nope, what I meant is that you're comparing behavior with some version of about 2 years ago.

So it could be that the older one had a bug which resulted in HW watchdog reboots, but that bug may be fixed now and now we're looking at something else that may result in a HW watchdog reboot also.
Thus even though the result is similar, the issue may be different. That's what I meant with it.

Inspired by this remark: #1774 (comment) When looking for the use of interrupts, I found this bug, where it is possible the interrupts were not enabled anymore if there was no reading.
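The kind of bug described here (interrupts left disabled on an early exit) can be avoided with a scope guard. This is a minimal sketch of that general technique, not the actual fix from the pull request: `InterruptLock` and `readSensor` are hypothetical, and the two stub functions stand in for the Arduino `noInterrupts()` / `interrupts()` calls so that the example compiles on a desktop.

```cpp
// Stand-ins for Arduino's noInterrupts()/interrupts(), so the sketch can be
// compiled off-device. On a real board, call the Arduino functions instead.
static bool g_interruptsEnabled = true;
inline void disableInterruptsStub() { g_interruptsEnabled = false; }
inline void enableInterruptsStub()  { g_interruptsEnabled = true; }

// Scope guard: disables interrupts on construction and re-enables them when
// the scope is left, no matter which return path is taken.
class InterruptLock {
public:
    InterruptLock()  { disableInterruptsStub(); }
    ~InterruptLock() { enableInterruptsStub(); }
    InterruptLock(const InterruptLock&) = delete;
    InterruptLock& operator=(const InterruptLock&) = delete;
};

// Hypothetical timing-critical read. Returns false when the sensor does not
// answer; the early return no longer leaves interrupts disabled.
bool readSensor(bool sensorResponds, int& value) {
    InterruptLock lock;
    if (!sensorResponds) {
        return false;  // guard destructor still re-enables interrupts
    }
    value = 42;        // placeholder for the decoded reading
    return true;
}
```

Note that this simple guard always re-enables interrupts on exit, so it is not suitable for nested critical sections.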

Wiki591 · 2019-04-19T12:18:12Z

As I described above, I downgraded my most unstable devices (four Wemos D1, each with 1x DS18B20 and 1x INA2219, sending voltage every 10 sec and temperature every 30 sec to Domoticz MQTT) from mega-20190108 and mega-20190315 back to release mega-20181008, the first of them on 4th April. The graphs show the maximum uptime of this device in minutes:

TD-er · 2019-04-19T12:38:08Z

Last week I made some extensive logging on one of my own nodes.
The logs were recorded on the device, using the new (in development) Cache Controller, so all data is recorded, even when the unit crashes.

It shows the uptime and the WiFi RSSI value.
Nearly every reboot in this chart correlates with an RSSI value of +31.
Apart from 1 reboot, all had a reported RSSI value of +31 before the reboot.
So I would say not only a correlation, but also a causal relation.

Domosapiens · 2019-04-19T21:35:34Z

@TD-er Glad you found something, ... very interesting,
I am very curious ... what the effect-cause-effect relationship is.
It looks like overwritten data.
Due to data misalignment, pointer misalignment, stack residue .....?

TD-er · 2019-04-19T22:34:10Z

@Domosapiens
My best guess is that the ESP is waiting for some data that never will arrive and thus the watchdog kicks in.

See letscontrolit#1774 (comment)
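If the cause really is a wait for data that never arrives, the usual defence is to bound every such wait. The following is a minimal sketch under that assumption, not code from ESPEasy: `waitForData` is a hypothetical helper, `nowMs` stands in for `millis()`, and `idle` stands in for a call such as `yield()` that gives the system a chance to run between polls.

```cpp
#include <cstdint>
#include <functional>

// Polls dataAvailable() until it returns true or timeoutMs has elapsed.
// Returns true when data arrived, false on timeout.
// nowMs and idle are injected so the loop can be exercised off-device.
bool waitForData(const std::function<bool()>& dataAvailable,
                 const std::function<uint32_t()>& nowMs,
                 const std::function<void()>& idle,
                 uint32_t timeoutMs) {
    const uint32_t start = nowMs();
    while (!dataAvailable()) {
        // Unsigned subtraction stays correct when the millisecond
        // counter wraps around.
        if (static_cast<uint32_t>(nowMs() - start) >= timeoutMs) {
            return false;  // give up instead of blocking forever
        }
        idle();  // let other tasks run between polls
    }
    return true;
}
```

The caller then has to handle the `false` case explicitly, for example by skipping the reading or scheduling a reconnect, instead of relying on the watchdog to recover the node.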

Domosapiens · 2019-04-19T22:44:21Z

@TD-er But how to explain the RSSI value +31 ?
Can the RSSI measurement code calculate/create such value?

TD-er · 2019-04-19T22:53:17Z

The +31 value is an error state of the RSSI function.
So a simple check for RSSI < 0 (a valid reading is negative) should give a good indication of whether something is wrong.
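A minimal sketch of such a check is shown below. It is an illustration only, not the code from the commit referenced in this thread: `rssiLooksValid`, `LinkState` and `classifyLink` are hypothetical names, and on the device the RSSI argument would presumably come from the WiFi library (for example `WiFi.RSSI()` in the ESP8266 Arduino core).

```cpp
// A plausible RSSI reading is a negative dBm value. According to the
// comment above, +31 is what the RSSI function reports in its error state.
bool rssiLooksValid(int rssiDbm) {
    return rssiDbm < 0;
}

enum class LinkState { Connected, Suspect };

// Hypothetical classification: trust the reported WiFi status only when
// the RSSI reading is plausible as well.
LinkState classifyLink(bool statusSaysConnected, int rssiDbm) {
    return (statusSaysConnected && rssiLooksValid(rssiDbm))
               ? LinkState::Connected
               : LinkState::Suspect;
}
```

A node in the `Suspect` state could then start a reconnect early rather than waiting for a disconnect event.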

Domosapiens · 2019-04-20T11:16:47Z

Great ! (found it in the ESP8266 SDK API guide)
But I think it's a symptom, not the cause.
Why is the WiFi lost ???

TD-er · 2019-04-20T20:53:28Z

Well, wifi connection can be lost at any time.

- AP switches channel
- session timeout
- bad reception

But what does worry me, is that the ESP does miss beacon frames and also some other packets or events.
For example:

- ARP requests ("Who has IP xxxx?" to which a reply must be sent "I handle that address, my MAC address is xxxx")
- UDP packets (see that happen in the ESPeasy p2p messages if the CPU enters lower power mode)
- disconnect events are not fired in our part of the code, unless there is some network traffic initiated from the ESP.

Domosapiens · 2019-04-26T09:32:45Z

@TD-er ... related to RSSI
When I execute a WiFi scan I see SSID: NPOWLAN2G with Ch:4 (-75dBm) WPA2/PSK
But the (refreshed) Main page gives RSSI: | -81 dB (NPOWLAN2G)
Refresh Main page again adapts the value, but I never see the same value.
Why this huge difference? Is it significant?

Any positive result on your forked commit ... [WiFi] Use RSSI value to determine connected state ?

TD-er · 2019-04-26T12:54:32Z

@Domosapiens Using the RSSI value does seem to help a bit. One of my nodes that was rebooting about every day, is now reconnecting. But still it had a HW watchdog reboot 4 days ago.
So I guess it may help a bit, or at least not hurt it.

About the measured RSSI, it is a value that will change constantly.

This is the RSSI value of a sensor mounted at a fixed point, so it has always the same orientation to the AP.
As you can see, its value does change a lot during the day.

Domosapiens · 2019-04-26T16:29:08Z

Thanks @TD-er , I see the same fluctuations.
But I was talking about looking at WiFi scan and then directly at the Main page and vice versa.

One reboot in 4 days would already be a big improvement!
I see one reboot a day on 2 nodes, and at worst 10 a day on other nodes when heavily interrupted by the pulse counter.

Will this change be applied to the main branch?

TD-er · 2019-04-26T17:45:15Z

Yep, I will merge it soon.
But it is no magic cure for all HW watchdog reboots, since that's the result of several issues.

Domosapiens · 2019-04-26T18:39:00Z

@TD-er Ok, top

RSSI difference ....I was talking about looking at WiFi scan and then directly at the Main page and vice versa.
Is the Main page measuring RSSI in a different way than the WiFi scan page?

TD-er · 2019-04-26T19:51:44Z

Not as far as I know.

The only difference I can think of is that the scan is mainly listening (and switching channels) while during normal operations (especially when loading a web page), the ESP is also sending.
While sending, the power consumption is (quite a lot) higher, so maybe the 3V3 line will see a drop in voltage. The 3V3 line voltage is also used in the RF calibration, so maybe it is also used as some reference to relate the received signal strength.

But that's just speculation.

s0170071 · 2019-05-10T09:35:44Z

I can confirm the reboots. Since I activated the pulse counter on a previously stable unit, it restarted several times.

Wiki591 · 2019-07-29T16:10:51Z

Intermediate report of the ongoing testing of different releases.

The device mentioned above was flashed back to different releases without a stable result. Meanwhile, on 12th July, I flashed it back to the release of 2018-05-22, with the result:

TD-er · 2019-07-29T16:20:43Z

Do you have any wifi reconnects in those 17 days?

Wiki591 · 2019-07-29T16:43:58Z

Do you mean this?

Wiki591 · 2019-07-29T16:45:26Z

Or this?

TD-er · 2019-07-29T23:00:28Z

That last one.

The ConnectFailures value is being used in the wifi reconnect threshold to trigger a reboot if it wasn't able to connect to a host within set attempts.
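The threshold mechanism described above can be sketched as a small counter. This is a hypothetical illustration, not ESPEasy's implementation: the class and method names are invented, and both the reset on a successful connect and the "threshold 0 disables the reboot" behavior are assumptions made for the example.

```cpp
#include <cstdint>

// Hypothetical tracker for failed connection attempts.
class ConnectFailureTracker {
public:
    explicit ConnectFailureTracker(uint8_t threshold) : threshold_(threshold) {}

    // Count a failed attempt, saturating at 255 to avoid overflow.
    void onConnectFailed() {
        if (failures_ < 255) ++failures_;
    }

    // Assumption: a successful connect clears the failure count.
    void onConnectSucceeded() { failures_ = 0; }

    // Assumption: a threshold of 0 disables the reboot trigger.
    bool rebootRequested() const {
        return threshold_ != 0 && failures_ >= threshold_;
    }

    uint8_t failures() const { return failures_; }

private:
    uint8_t threshold_;
    uint8_t failures_ = 0;
};
```

The main loop would call `rebootRequested()` after each failed attempt and trigger a controlled restart when it returns true.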

I really have no idea why some nodes perform reconnects very well and others don't.
I also have one such node here.
It was up for over 2 weeks yesterday until I put last night's test build on it.
That one also had like 40+ reconnects while others WDT reboot after 1 - 4 reconnects.

Edit:
That one was running 20190523 if I remember correctly.

TD-er · 2019-10-27T11:54:59Z

I will mark this one as fixed, especially with the fixes I merged this morning and considering almost all of my own devices run now for double-digit days without a problem.
If it is still an issue, please let me know.

TD-er added the labels "Type: Bug" (Considered a bug) and "Category: Stabiliy" (Things that work, but not as long as desired) Sep 22, 2018

TD-er mentioned this issue Apr 15, 2019

Fix P005_DHT enabling interrupts on error reading. #2447

Merged

TD-er added a commit to TD-er/ESPEasy that referenced this issue Apr 19, 2019

[WiFi] Use RSSI value to determine connected state

77f2389

See letscontrolit#1774 (comment)

Domosapiens mentioned this issue May 8, 2019

Interrupts #2463

Open

TD-er closed this as completed Oct 27, 2019
