Manual reboot () due to hardware watchdog in release mega-20180922 #1774

Closed
micropet opened this issue Sep 22, 2018 · 199 comments
Labels
Category: Stabiliy Things that work, but not as long as desired Type: Bug Considered a bug

Comments

@micropet

The current version still has the same problem as the versions of the last few weeks.
I use ESP_Easy_mega-20180922_test_ESP8266_4096.bin

The units without sensors run only slightly longer, between one and four hours.

Units with sensors (BME280, BH1750, PIR, MH-Z19, TVOC CCS811, PMS7003) reboot after a few minutes.
The uptime is between 3 and 30 minutes. It is different every time.

@TD-er
Member

TD-er commented Sep 22, 2018

Do they have all these sensors on the same node?

@TD-er TD-er added Type: Bug Considered a bug Category: Stabiliy Things that work, but not as long as desired labels Sep 22, 2018
@micropet
Author

Yes

@TD-er
Member

TD-er commented Sep 22, 2018

I just flashed a node with the SDS011 with this firmware, to see if serial I/O may be causing these issues.
I will now flash another node with the BME280 and MH-Z19 to see if those are enough to get the same behavior.
I noticed the BME280 plugin sometimes has a long time logged in my statistics, so that may be one of the culprits. Maybe you could disable that one as a test to see if it improves stability.

@micropet
Author

micropet commented Sep 22, 2018

Because the PMS7003 still does not work, most units have an SDS021.
So:

BME280, BH1750, PIR, MH-Z19, TVOC CCS811, SDS021

So disable the BME280?
Good, I'll do that.

@micropet
Author

micropet commented Sep 22, 2018

But we still have a general problem, because even the units without connected sensors do not run long.

I believe that deactivating sensors does not help us.

First of all, the units would have to run without sensors for days or weeks.

Then you can gradually add sensors.

@TD-er
Member

TD-er commented Sep 22, 2018

On the other hand, the reboot intervals you report are way shorter than anyone else's.
So we should start to dig down somewhere.

@Grovkillen
Member

Yes, power is always a tricky issue to evaluate. I use 5V USB UPS on some of my units and they never reboot.

So could you give us info about the setup you're using?

@v-a-d-e-r

v-a-d-e-r commented Sep 22, 2018

My 16 nodes have all been running fine with all current changes for 1 day and 16 hours. Not a single reboot! :-) And I have all kinds of sensors in use, using GPIO, I2C and HW serial...

@micropet
Author

micropet commented Sep 22, 2018

OK. I will check that.
That is difficult. There are currently 15 units running on different power supplies.

Each unit has its own power supply.
Each Wemos D1 has a 1800 μF capacitor on the 3.3 V rail and a 1800 μF capacitor on the 5 V rail.

I have always bought high-quality power supplies (e.g. Aukey 2.4A with 48W power supply adapter, AUKEY USB C charger with 46W power, Volutz 60 Watt 12A 5V).

@Grovkillen
Member

Please do, but if they all are reporting reboots it may very well be a network issue as well?

@Grovkillen
Member

And just a curious question. Do they ALL have these capacitors? What if you remove those on one? As a test?

@TD-er
Member

TD-er commented Sep 22, 2018

The power sounds OK to me.
Also, those would not likely result in watchdog issues.
Maybe I could lower the core library to 1.7.3, since the number of reported watchdog reboots has increased a lot after the update to 1.8.0.
Not that they were not reported before, but there are a lot more reports of those reboots than before.

It could still be WiFi related.

@micropet
Author

I do not believe in a network problem.
Currently, 51 WLAN devices are registered on both Unifi Access Points.

The network also has about 20 Wemos D1 units running old, simple software that I wrote myself.

There are several LED drivers with 3-6 100 Watt LEDs connected to these units. I have been using this for years to switch the light in the apartment via PIR.

These units run for months without rebooting.

So it is the same hardware I use for ESPEasy.

@Grovkillen
Yes. All have these capacitors, including my own units.

@TD-er
Member

TD-er commented Sep 22, 2018

What core lib do these other nodes use?
Could be 2.3.x or older even?

And wifi related doesn't mean it is a problem in your access point. It can also be something in the core libraries.

@micropet
Author

@TD-er

No idea. The version is already several years old.

Because they work, I have not changed anything.

@TD-er
Member

TD-er commented Sep 22, 2018

I have just been looking at the uptime of my nodes.
One of them has been running for 42 days now and is running ESP_Easy_mega-20180513_normal_ESP8266_4096.bin

So I will look into what core lib that was.
It is core 2.4.1

@micropet
Author

Now we have 2.4.2?

@TD-er
Member

TD-er commented Sep 22, 2018

Yep, so simply changing to 2.4.1 in the platformio.ini could help.
I can make a build with that for you to try if you like.
What build version do you need? (normal/test and flash size)

@micropet
Author

Thank you Gijs.
That is not necessary. I can change the core version myself and compile with platformio.

@micropet
Author

It may be a coincidence, but the unit with the BME280 disabled has been running for 5 hours now.

@TD-er
Member

TD-er commented Sep 22, 2018

That's also good news.
I have a lot of those lying around, so that makes it easy for me to test.
I also have those PMSx003, but then I have to fix that plugin first ;)

@micropet
Author

Yes, the PMSx003 plugin only worked for me for a few minutes.
Then no more data arrives.

The BME280 is quite important, I think. With temperature, pressure and humidity in one sensor, I can't find any alternatives at a good price.

@TD-er
Member

TD-er commented Sep 22, 2018

It sure is and it is probably one of the more popular plugins.
So I will have a look at it, to see why it appears to take up to 1.5 seconds sometimes. At least that's what my statistics claim.

@micropet
Author

Wow, that's a lot of time.

@TD-er
Member

TD-er commented Sep 22, 2018

Just one such line from my stats dump:
5132780 : PluginStats P_27_Environment - BMx280 ONCE_A_SECOND Count: 30 Avg/min/max 53395.60/306/1588655 usec

So that is quite close to the (software) watchdog timeout and maybe the recent versions changed the hardware watchdog timeout to match the 2 sec.
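
A minimal sketch of the comparison made above, in plain C++ so it can run off-device. It pulls the `Avg/min/max` figures out of a stats line in the format shown and flags a plugin whose slowest call used a large share of a watchdog budget. The names `PluginTiming`, `parseTiming` and `nearWatchdog` are hypothetical and not part of ESPEasy. The timeout is a parameter, not a claim about the real watchdog setting; the 2 s figure used in the usage note is only the value mentioned in this comment.

```cpp
#include <cstdio>
#include <cstring>

// Timing figures from the "Avg/min/max a/b/c usec" tail of a stats line.
struct PluginTiming {
  double avgUsec;
  long   minUsec;
  long   maxUsec;
};

// Returns true and fills 'out' when the line contains a complete
// "Avg/min/max <avg>/<min>/<max>" field; returns false otherwise.
bool parseTiming(const char* line, PluginTiming& out) {
  const char* field = std::strstr(line, "Avg/min/max ");
  if (field == nullptr) {
    return false;
  }
  return std::sscanf(field, "Avg/min/max %lf/%ld/%ld",
                     &out.avgUsec, &out.minUsec, &out.maxUsec) == 3;
}

// True when the slowest recorded call used more than 'fraction' of the
// watchdog budget 'timeoutUsec' (both values are supplied by the caller).
bool nearWatchdog(const PluginTiming& t, long timeoutUsec, double fraction) {
  return static_cast<double>(t.maxUsec) >
         fraction * static_cast<double>(timeoutUsec);
}
```

With an assumed budget of 2,000,000 usec and a fraction of 0.75, the line above (max 1588655 usec) is flagged, while a line with a max of a few thousand usec is not.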

@micropet
Author

That may well be possible.
My unit is now running 6 hours 27 minutes without BME280. :)

Can you perhaps adjust something in the library (filter standby ...)?

@TD-er
Member

TD-er commented Sep 22, 2018

I found the bug in the BME280 plugin.
Should be fixed in #1779

Just curious, is it still running fine with BME280 disabled?

The line I posted earlier now shows:
561184 : PluginStats P_27_Environment - BMx280 ONCE_A_SECOND Count: 30 Avg/min/max 542.13/385/3036 usec

@Domosapiens

I switched off the reporting of the pulse counter to the controller.
Last 12 hours: 0 reboots for the unit with the 6 Hz driven pulse counter switched off and 2x DS18B20.
Last 12 hours: 5 reboots for the unit with the 6 Hz driven pulse counter switched on and 2x DS18B20.
Both units report once per minute the temperature.

I don't understand your interrupt disable/enable remark ...

Is it that the interrupt routine is corrupting something or has a misaligned return address?

@TD-er
Member

TD-er commented Apr 15, 2019

Nope, what I meant is that you're comparing behavior with some version of about 2 years ago.

So it could be that the older one had a bug which resulted in HW watchdog reboots, but that bug may be fixed now and now we're looking at something else that may result in a HW watchdog reboot also.
Thus even though the result is similar, the issue may be different. That's what I meant with it.

TD-er added a commit that referenced this issue Apr 15, 2019
Inspired by this remark: #1774 (comment)
When looking for the use of interrupts, I found this bug, where it is possible the interrupts were not enabled anymore if there was no reading.
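
The class of bug described here (an early return that skips the re-enable call) can be avoided with a scope guard. The following is only a sketch under stated assumptions, not the actual fix from the commit: `lockInterrupts()` and `unlockInterrupts()` are host-side stand-ins for the platform's disable/enable calls (on Arduino, `noInterrupts()` and `interrupts()`), and `readSensor` is a hypothetical reader.

```cpp
// Host-side stand-ins for the platform's interrupt disable/enable calls.
// Hypothetical names; on Arduino these would be noInterrupts()/interrupts().
static bool g_interruptsEnabled = true;
inline void lockInterrupts()   { g_interruptsEnabled = false; }
inline void unlockInterrupts() { g_interruptsEnabled = true; }

// Disables interrupts on construction and re-enables them when the guard
// goes out of scope, so every return path restores them.
// Note: this simple version is not meant to be nested.
class InterruptGuard {
public:
  InterruptGuard()  { lockInterrupts(); }
  ~InterruptGuard() { unlockInterrupts(); }
  InterruptGuard(const InterruptGuard&) = delete;
  InterruptGuard& operator=(const InterruptGuard&) = delete;
};

// Hypothetical reader: copies a value shared with an interrupt routine.
// Returns false when there is no reading available.
bool readSensor(bool haveReading, unsigned long sharedValue, unsigned long& out) {
  InterruptGuard guard;
  if (!haveReading) {
    return false;  // early return: the guard still re-enables interrupts
  }
  out = sharedValue;
  return true;
}
```

The point of the guard is that the "no reading" path cannot leave interrupts disabled, because the re-enable call lives in the destructor instead of at the end of the happy path.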
@Wiki591
Contributor

Wiki591 commented Apr 19, 2019

As I described above, I downgraded my most unstable devices (four Wemos D1, each with 1x DS18B20 and 1x INA2219, sending voltage every 10 sec and temperature every 30 sec to Domoticz MQTT) from mega-20190108 and mega-20190315 back to release mega-20181008, the first of them on 4th April. The graphs show the maximum uptime of this device in minutes:
image
image

@TD-er
Member

TD-er commented Apr 19, 2019

Last week I made some extensive logging on one of my own nodes.
The logs were recorded on the device, using the new (in development) Cache Controller, so all data is recorded, even when the unit crashes.

image

It shows the uptime and the WiFi RSSI value.
Almost every reboot in this chart correlates with an RSSI value of +31: apart from 1 reboot, all had a reported RSSI value of 31 just before the reboot.
So I would say this is not only a correlation, but also a causal relation.

@Domosapiens

@TD-er Glad you found something, ... very interesting,
I am very curious ... what the effect-cause-effect relationship is.
It looks like overwritten data.
Due to data misalignment, pointer misalignment, stack residue .....?

@TD-er
Member

TD-er commented Apr 19, 2019

@Domosapiens
My best guess is that the ESP is waiting for some data that will never arrive and thus the watchdog kicks in.

TD-er added a commit to TD-er/ESPEasy that referenced this issue Apr 19, 2019
@Domosapiens

Domosapiens commented Apr 19, 2019

@TD-er But how do you explain the RSSI value of +31?
Can the RSSI measurement code calculate/create such a value?

@TD-er
Member

TD-er commented Apr 19, 2019

The +31 value is an error state of the RSSI function.
So a simple check for RSSI < 0 should give a good indication of whether something is wrong.
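
A minimal sketch of that check, written as a plain function so it can be tested off-device. The name `rssiLooksConnected` is hypothetical; on a real node the argument would come from the platform's RSSI call.

```cpp
// Treats any non-negative RSSI (such as the +31 error value mentioned
// above) as "not connected". A valid reading is a negative dBm value.
inline bool rssiLooksConnected(int rssiDbm) {
  return rssiDbm < 0;
}
```

For example, `rssiLooksConnected(-75)` is true and `rssiLooksConnected(31)` is false.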

@Domosapiens

Great ! (found it in the ESP8266 SDK API guide)
But I think it's a symptom, not the cause.
Why is the WiFi lost ???

@TD-er
Member

TD-er commented Apr 20, 2019

Well, wifi connection can be lost at any time.

  • AP switches channel
  • session timeout
  • bad reception

But what does worry me is that the ESP misses beacon frames and also some other packets or events.
For example:

  • ARP requests ("Who has IP xxxx?" to which a reply must be sent "I handle that address, my MAC address is xxxx")
  • UDP packets (I see that happen with the ESPEasy p2p messages if the CPU enters a lower power mode)
  • disconnect events are not fired in our part of the code, unless there is some network traffic initiated from the ESP.

@Domosapiens

@TD-er ... related to RSSI
When I execute a WiFi scan I see SSID: NPOWLAN2G with Ch:4 (-75dBm) WPA2/PSK
But the (refreshed) Main page gives RSSI: | -81 dB (NPOWLAN2G)
Refreshing the Main page again changes the value, but I never see the same value.
Why this huge difference? Is it significant?

Any positive result on your forked commit ... [WiFi] Use RSSI value to determine connected state ?

@TD-er
Member

TD-er commented Apr 26, 2019

@Domosapiens Using the RSSI value does seem to help a bit. One of my nodes that was rebooting about every day, is now reconnecting. But still it had a HW watchdog reboot 4 days ago.
So I guess it may help a bit, or at least not hurt it.

About the measured RSSI, it is a value that will change constantly.
image
This is the RSSI value of a sensor mounted at a fixed point, so it always has the same orientation to the AP.
As you can see, its value does change a lot during the day.

@Domosapiens

Thanks @TD-er , I see the same fluctuations.
But I was talking about looking at WiFi scan and then directly at the Main page and vice versa.

A reboot once every 4 days would already be a big improvement!
I see one reboot a day on 2 nodes, and up to 10x at worst on other nodes when heavily interrupted by the pulse counter.

Will this change be applied to the main branch?

@TD-er
Member

TD-er commented Apr 26, 2019

Yep, I will merge it soon.
But it is no magic cure for all HW watchdog reboots, since that's the result of several issues.

@Domosapiens

@TD-er Ok, top

RSSI difference ....I was talking about looking at WiFi scan and then directly at the Main page and vice versa.
Is the Main page measuring RSSI in a different way than the WiFi scan page?

@TD-er
Member

TD-er commented Apr 26, 2019

Not as far as I know.

The only difference I can think of is that the scan is mainly listening (and switching channels) while during normal operations (especially when loading a web page), the ESP is also sending.
While sending, the power consumption is (quite a lot) higher, so maybe the 3V3 line will see a drop in voltage. The 3V3 line voltage is also used in the RF calibration, so maybe it is also used as some reference to relate the received signal strength.

But that's just speculation.

@s0170071
Contributor

I can confirm the reboots. Since I activated the pulse counter on a previously stable unit, it restarted several times.

@Wiki591
Contributor

Wiki591 commented Jul 29, 2019

Intermediate report of the ongoing testing of different releases.

The device mentioned above was flashed back to different releases without a stable result. Meanwhile, on 12th July, I flashed it back to the release of 2018-05-22, with this result:

image

image

image

@TD-er
Member

TD-er commented Jul 29, 2019

Do you have any wifi reconnects in those 17 days?

@Wiki591
Contributor

Wiki591 commented Jul 29, 2019

Do you mean this?

image

@Wiki591
Contributor

Wiki591 commented Jul 29, 2019

Or this?

image

@TD-er
Member

TD-er commented Jul 29, 2019

That last one.

The ConnectFailures value is compared against the wifi reconnect threshold to trigger a reboot if the node wasn't able to connect to a host within the set number of attempts.
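
A rough sketch of how such a threshold could work, in plain C++. This is not the ESPEasy implementation: the class name `ConnectFailureTracker` is hypothetical, and two behaviours are assumptions made for the example only (a successful connect clears the count, and a threshold of 0 disables the reboot).

```cpp
// Counts consecutive failed connection attempts against a user-set threshold.
class ConnectFailureTracker {
public:
  explicit ConnectFailureTracker(unsigned threshold)
      : threshold_(threshold), failures_(0) {}

  // Record the outcome of one connection attempt.
  void record(bool connected) {
    if (connected) {
      failures_ = 0;  // assumption: a success clears the failure count
    } else {
      ++failures_;
    }
  }

  // True once the number of consecutive failures reaches the threshold.
  // Assumption: a threshold of 0 means "never reboot".
  bool shouldReboot() const {
    return threshold_ != 0 && failures_ >= threshold_;
  }

  unsigned failures() const { return failures_; }

private:
  unsigned threshold_;
  unsigned failures_;
};
```

With a threshold of 3, three failed attempts in a row make `shouldReboot()` return true, and a single success in between resets the count.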

I really have no idea why some nodes perform reconnects very well and others don't.
I also have one such node here.
It was up for over 2 weeks yesterday until I put last night's test build on it.
That one also had like 40+ reconnects while others WDT reboot after 1 - 4 reconnects.

Edit:
That one was running 20190523 if I remember correctly.

@TD-er
Member

TD-er commented Oct 27, 2019

I will mark this one as fixed, especially with the fixes I merged this morning and considering almost all of my own devices now run for double-digit days without a problem.
If it is still an issue, please let me know.

@TD-er TD-er closed this as completed Oct 27, 2019