Intermittent Exception 28 Crash/Reboots on Testing Build #1643

Closed
thomastech opened this issue Aug 12, 2018 · 31 comments
Labels
Status: Needs Info (Needs more info before action can be taken); Type: Discussion (Open ended discussion, compared to specific question)

Comments

@thomastech
Contributor

thomastech commented Aug 12, 2018

Summary of the problem/feature request

The Testing build is prone to Exception 28 crash reboots. I suspect that the higher memory usage from the additional plugins is inviting memory allocation issues.

This problem was reported in issue #1625. But it is not related to the yield panic problem. So I have created a new ticket for the Exception 28 problem.

Expected behavior

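Exception 28 (LoadProhibited) is a fatal error raised on an invalid memory read, typically through a null pointer left behind by a failed memory allocation. These should never occur.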

Actual behavior

The Exception 28 crash reboots occur randomly. The device may crash a couple of times a day, or it may run for days without incident.

Steps to reproduce

  1. Install a recent build.
  2. Configure with noted plugins and controller. No need to install Nextion display.
  3. Let it run until exception 28.

System configuration

Hardware:
The ESP8266 boards are NodeMCU clones (LoLin 0.1 V3) with 4MB memory (memory chip ID 001640EF, Speed 40000000, IDE Mode DIO).

ESP Easy version: ESPEasy_mega-20180808
I self compile using Arduino 1.8.5 and ESP8266 core 2.4.1.
Arduino Settings: Board NodeMCU 1.0 (ESP-12E), Flash Size 4M (3M SPIFFS)

ESP Easy settings/screenshots:
The following plugins are being used:
P001 Switch Input
P026 SysInfo
P075 Nextion
P045 MPU6050

Controller is OpenHab MQTT. Message Interval is 500 ms. Rules and NTP are enabled. Serial log is disabled.

Rules or log data

on WASHER#ac do
  if [WASHER#ac]=1
    NEXTION,page0.va_WasherAC.val=1
  else
    NEXTION,page0.va_WasherAC.val=0
  endif
endon

on DRYERMOV#detect do
  if [DRYERMOV#detect]=1
    NEXTION,page0.va_DryerAC.val=1
  else
    NEXTION,page0.va_DryerAC.val=0
  endif
endon

on NEXTION#idx do
  if [NEXTION#idx]>=10 and [NEXTION#idx]<=30
      Publish /%sysname%/NEXTION/idx,[NEXTION#idx]
  endif
  if [NEXTION#idx]>=500  // Touch Events (not used)
      Publish /%sysname%/NEXTION/idx,[NEXTION#idx]
      Publish /%sysname%/NEXTION/value,[NEXTION#value]
  endif
endon

on NEXTION#idx=98 do
  NEXTION,page7.t_wifi_ssid.txt="SSID: %ssid%"
  NEXTION,page7.t_ip.txt="IP: %ip%"
  NEXTION,page7.t_signal.txt="RSSI: [RSSI#signal]dBm"
  NEXTION,page7.t_date.txt="Date %sysmonth%:%sysday%:%sysyears%"
  NEXTION,page7.t_time.txt="Time %syshour%:%sysmin%:%syssec% "
  NEXTION,page7.t_uptime.txt="Uptime [RUNTIME#days]days"
endon
  • Thomas
@thomastech
Contributor Author

I created a custom build that only has the plugins I need. That is to say, all unused plugins were removed from the define_plugin_sets.h file. The list of plugins allowed in my build are:

    #define USES_P001   // Switch
    #define USES_P026   // SysInfo
    #define USES_P033   // Dummy
    #define USES_P075   // Nextion
    #define USES_P045   // MPU6050

The entire flash was erased during the code upload so ESPEasy defaulted to AP mode on first boot. WiFi was configured and the previous settings were loaded from file. FreeRam increased from 14KB (normal builds) to about 20KB. This lighter weight build hasn't experienced the exception 28 crashes. The performance tests were ended after 45 hours without incident.

I flashed a second NodeMCU (identical hardware) with the same exact software and restored the configuration. Surprisingly, it only had 16KB of FreeRam. I expected it to be about 20KB. I reviewed both devices and all settings were identical. Reboots did not change the reported FreeRam.

The FreeRam memory difference seemed significant. So I clean reflashed both boards and restored the configuration. Now their FreeRam was similar (but not identical), around 16KB.

The 20KB seen earlier seems like a significant observation. Maybe an uninitialized memory pointer?

  • Thomas

@TD-er
Member

TD-er commented Aug 13, 2018

Do you also have some chart showing memory usage over time?
Those charts can be helpful to see memory leaks.

@thomastech
Contributor Author

thomastech commented Aug 13, 2018

@TD-er: FreeRam memory usage has remained relatively consistent during run time. Normal builds' FreeRam settles to about 15K a couple minutes after boot. I'll see minor reduction after a few hours, but FreeRam mostly remains above 14K.

My lightweight build that ran with 20KB FreeRam for 45 hours appears to be an anomaly. I've tried dozens of cold and soft reboots and I cannot reproduce the 20KB. All I see now is around 15-16KB.

A couple weeks ago I saw one of my boards slowly lose FreeRam over several hours. I rebooted it when it hit 9KB. The problem went away after the reboot.

My backup test board (with the lightweight build) was at 16KB when I went to bed last night. This morning it was at 6KB, but still running. Unfortunately the log was disabled. I've rebooted it and the memory has remained steady at 15KB.

Also, sometime last night my main board (with the lightweight build) rebooted. Today I caught it do a panic reboot. But I had serial log turned on, so I suspect the serial log related reboot problem is still haunting me. I've disabled serial log and cold booted it. Everything seems Ok now.

I've enabled syslog and I will post the log if the memory leak re-appears this afternoon.

Regarding my self-builds, my NodeMCU boards are 4MB. I have been selecting the 3MB SPIFFS size during compile. The other choice is 1MB SPIFFS. However, I've tried both settings and don't see any difference in operation. What is your recommendation, 1MB or 3MB SPIFFS?

  • Thomas

@TD-er
Member

TD-er commented Aug 13, 2018

You could also use a Generic plugin set to free mem and log that to your OpenHab.

About the SPIFFS size: I guess OTA updates need some space and I think it writes them to SPIFFS, but I am not sure about that. That may be an issue.

@thomastech
Contributor Author

I'm using syslog (level: info) and sending all the activity to my NAS. That way I get FreeRam at 5 sec intervals (from generic plugin) plus all the other actions.

BTW, Tools->Factory Reset does not reset the settings files. It only reboots my NodeMCUs. Not sure if that is unique to my builds or if it is affecting everyone. Maybe it is an important symptom related to SPIFFS R/W access?

  • Thomas

@TD-er
Member

TD-er commented Aug 13, 2018

I've had more reports about factory reset. If it can be reproduced by others, then it is worth an issue I guess.
It may be that the settings are reset but not written before the reboot (or it crashes before the settings are written).

@thomastech
Contributor Author

I can report the factory reset issue. I'll do it after all my existing ESPEasy drama is resolved.

  • Thomas

@TD-er
Member

TD-er commented Aug 13, 2018

> I can report the factory reset issue. I'll do it after all my existing ESPEasy drama is resolved.

Hmm, let's hope your memory of that issue will last that long ;)

@thomastech
Contributor Author

> Hmm, let's hope your memory of that issue will last that long ;)

At this point I can easily brag that my memory is better than ESPEasy's.

  • Thomas

@thomastech
Contributor Author

thomastech commented Aug 13, 2018

Some progress to report. Here's the latest status:

I'm using Arduino ESP8266 board core V2.4.1. Issue #4497 on GitHub reported that there is a WiFi-related memory leak in this release (not present in V2.4.0). Two days ago V2.4.2 was released with the leak patch and other improvements. So I've installed V2.4.2.

Details: esp8266/Arduino#4497

When I recompiled my "lightweight" build the FreeRam was 25KB! That fantastic news gave me reason to revert to the full [Testing] build. Now FreeRam on it is 18KB. Not bad, about a 4KB improvement.

I suspect this will fix my random memory leak. Fingers are crossed that it also eliminates the exception 28 reboots and/or the other memory issues I've experienced. Even if it doesn't, the increased FreeRam is nice.

  • Thomas

@TD-er
Member

TD-er commented Aug 13, 2018

They are indeed improving free RAM in the last few releases.
For ESP32 the improvement is even more impressive. From about 105 kB free (PlatformIO espressif32@0.12.0) to about 187 kB free :) (PlatformIO espressif32@1.x)

@thomastech
Contributor Author

The two stock [Testing] builds (self compile on V2.4.2) are still running (15 hours so far). No reboots, memory near 18KB but has decreased a small amount. Looks promising.

  • Thomas

@thomastech
Contributor Author

The V2.4.2 ESP8266 core hasn't solved the memory leak. I also experienced a reboot when I performed an HTTP write to one of the test systems.

> Do you also have some chart showing memory usage over time?
> Those charts can be helpful to see memory leaks.

Here is a FreeRam chart from the overnight run.
[Image: freeram chart]

Here is a csv text file with the FreeRam values:

freeram2.txt

  • Thomas

@thomastech
Contributor Author

Yesterday one of the test systems rebooted after running for 18 hours. I didn't record the log so the details are a mystery.

The main test system lost its WiFi connection after running for 21 hrs. It remained offline for about two hours, then rebooted on its own. While offline it continued to run, as confirmed by the Nextion display that was reporting run time and local time. The Syslog recording ended when the WiFi was lost, but the last entries didn't show any obvious problems.

Both systems were restarted and have been running for about 22 hours so far. The memory leak hasn't appeared yet. Random memory leaks are the work of the devil.

FWIW:
I've stopped using serial log because of the troubles it causes. So I have no way of knowing if an exception 28 crash has occurred.

Present Status:
I don't know how to solve the memory leak other than to observe and report. My suspicions are that the leak is related to WiFi. There are also a lot of String allocations in the code that concatenate. I think it would help to use the reserve() function on them to prevent string reallocation issues.

Going Forward:
After reviewing the ESPEasy code again, I noticed that the tcpCleanup() in backgroundtasks() is conditional. I've changed backgroundtasks() so that tcpCleanup() is always called. One of the test systems is now running with this code.

I've also edited the Nextion plugin and added reserve() to the Strings that concatenate. But not yet tested.

Epilog:
I definitely need some help on this. But I recognize that the maintenance developers like @TD-er have their hands full. So I'll do my best and report again if I have any useful info.

  • Thomas

@TD-er
Member

TD-er commented Aug 16, 2018

Can you also run one of the latest builds, which are based on the latest PlatformIO code (1.8.0) and use core libraries 2.4.2?
Maybe there was some issue there that is now resolved?

@thomastech
Contributor Author

Good idea. I'll try out the latest [Testing] build after the next crash/reboot appears.

  • Thomas

@TD-er
Member

TD-er commented Aug 18, 2018

Can you test this PR?
#1664

@thomastech
Contributor Author

> Can you test this PR?

Can you post/email the ESP_Easy_mega_test_ESP8266_4096.bin? This will ensure I use a build that is identical to yours.

  • Thomas

@TD-er
Member

TD-er commented Aug 19, 2018

That's a valid point :)
Builds made for 4096

@thomastech
Contributor Author

> Can you test this PR?

I've loaded #1664 on two NodeMCUs. Testing has started.

I see you added some graphics to the menu bar. Are your new icons the fix to the missing "3-bar" menu that has affected some small screen browsers?

> Can you also run one of the last builds, which is based on the last PlatformIO code (1.8.0), which is using core libraries 2.4.2. Maybe there was some issue there that is now resolved?

Results summary from Aug 17 - 19 test run:
My main ("production") system rebooted three times during the two day test.
My workshop ("test") system lasted 37 hours before it rebooted.

  • Thomas

@TD-er
Member

TD-er commented Aug 19, 2018

Yep, those are to add some recognizable pictogram to the tabs.
But I am not really content with the way they look, so if you have a better suggestion of UTF symbols, which are also supported by all browsers, please tell me.
I still find it very odd that those relatively new browsers did not work with the way it was, but I like the tabs better, if only their style was a little more matching.

I am eager to know if this fix I made here does fix at least the Exception reboots.
Not sure about the Watchdog crashes.

I still have to look at the HTTP handling, so there may still be crashes, but it would be nice if reproducibility was reduced ;)

@thomastech
Contributor Author

thomastech commented Aug 20, 2018

> I am eager to know if this fix I made here does fix at least the Exception reboots.
> Not sure about the Watchdog crashes.

Update: My workshop ("test") system rebooted after 7 hours.
System Info reports this:
Boot : Manual reboot (2), Reset Reason : Hardware Watchdog

> But I am not really content with the way they look, so if you have a better suggestion of UTF symbols, which are also supported by all browsers, please tell me.

Creating good looking GUI graphics is something I wish I could do.

  • Thomas

@thomastech
Contributor Author

My main "production" board is still running (>22 hours). But last night it lost the WiFi connection and now it only has local control.

Losing WiFi seems to be an issue that started sometime with the Aug releases and this is the third occurrence I have experienced. I never saw it with the July code releases I had been using.

Perhaps my WiFi lost connection problem is related to issue #1640. Like that installation I also use static IP. I noticed that his build used ESP8266 2.4.1 whereas all my affected firmware is on 2.4.2.

The device is about 3 meters from my TP-Link router so the WiFi RF signal is strong (-40dBm). As an experiment I rebooted my router and the ESP still remains offline. So I don't believe the issue is a DHCP versus static IP problem.

Before I reboot the ESPEasy board do you have any experiments for me to try?

  • Thomas

@TD-er
Member

TD-er commented Aug 20, 2018

Just a thought, can you change the WiFi channel of the access point?

@thomastech
Contributor Author

thomastech commented Aug 20, 2018

> Just a thought, can you change the WiFi channel of the access point?

Good idea, but does not help. Still offline.

Normally the router is set to AUTO, but for this test I manually set it to several different channels. My working ESPEasy devices (another NodeMCU, two Sonoff TH10s) followed the channel changes and reconnected within a few seconds after each new channel setting.

Anything else before I reboot?

  • Thomas

@TD-er
Member

TD-er commented Aug 20, 2018

I am afraid I am also out of ideas.
Maybe running some display to show the internal IP-address? But that's for another time and perhaps also a bit useless when it is set to being static.

I may add some check for connection errors in the controllers and if that is above some level perform a wifi reset (wifi off/on)
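A minimal sketch of that idea, under stated assumptions: the class name `ConnectionErrorMonitor` is hypothetical, it counts consecutive failures (the comment above only says "above some level"), and the actual WiFi off/on call is left to the caller. It is not the code that ended up in ESPEasy.

```cpp
#include <cstdint>

// Hypothetical helper: counts consecutive controller connection failures
// and signals when a WiFi reset (off/on) should be attempted.
class ConnectionErrorMonitor {
public:
    explicit ConnectionErrorMonitor(std::uint8_t threshold)
        : threshold_(threshold) {}

    // Record the outcome of one connection attempt.
    // Returns true when the run of failures has reached the threshold.
    // The counter is then cleared, so the next reset needs a fresh run
    // of failures. Any successful attempt also clears the counter.
    bool record(bool success) {
        if (success) {
            failures_ = 0;
            return false;
        }
        if (++failures_ < threshold_) return false;
        failures_ = 0;
        return true;
    }

    // Number of consecutive failures seen since the last success or reset.
    std::uint8_t failures() const { return failures_; }

private:
    std::uint8_t threshold_;
    std::uint8_t failures_ = 0;
};
```

The caller would invoke `record()` after each controller connection attempt and perform the WiFi off/on cycle whenever it returns true.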

@thomastech
Contributor Author

> Maybe running some display to show the internal IP-address? But that's for another time and perhaps also a bit useless when it is set to being static.

My Nextion menus include a system status page. The SSID is missing (shown as - - ) and the static IP is still present. The RSSI is reported as +31dBm, which is a default value when WiFi is disconnected.

[Image: wifi_offline]

@TD-er
Member

TD-er commented Aug 20, 2018

Maybe you can add a button on the Nextion and couple that with a digital disconnect command :)

I will also check this -31 dB value you mention, maybe it is also a good check for WiFi status.
It looks like the current one may get out of sync with reality

@thomastech
Contributor Author

> Maybe you can add a button on the nextion and couple that with digital disconnect command :)

I've thought about adding a reboot rule for when RSSI is 31dBm for more than a minute. But that's a primitive, last-resort band-aid.

> I will also check this -31 dB value you mention, maybe it is also a good check for WiFi status.

Actually, rssi is **+**31dBm (not -31) when wifi is disconnected. At least that's what I get on my NodeMCU modules.
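The "RSSI stuck at +31dBm for more than a minute" check could be sketched roughly as below. This is an illustration only: `RssiWatchdog` and `kDisconnectedRssi` are hypothetical names, the +31 sentinel is taken from the observation above rather than from ESPEasy's source, and the timestamps are assumed to come from a millisecond counter such as Arduino's `millis()`.

```cpp
#include <cstdint>

// Assumption from this thread: RSSI reads +31 dBm while WiFi is disconnected.
constexpr int kDisconnectedRssi = 31;

// Hypothetical helper: reports when the "disconnected" RSSI value has been
// seen continuously for longer than a configured timeout.
class RssiWatchdog {
public:
    explicit RssiWatchdog(std::uint32_t timeout_ms) : timeout_ms_(timeout_ms) {}

    // Feed one RSSI sample together with its timestamp in milliseconds.
    // Returns true once the sentinel value has persisted for more than
    // timeout_ms. Any other RSSI value restarts the measurement.
    bool update(int rssi_dbm, std::uint32_t now_ms) {
        if (rssi_dbm != kDisconnectedRssi) {
            tracking_ = false;
            return false;
        }
        if (!tracking_) {
            // First sentinel sample: remember when the outage started.
            tracking_ = true;
            since_ms_ = now_ms;
            return false;
        }
        // Unsigned subtraction stays correct across a counter rollover.
        return (now_ms - since_ms_) > timeout_ms_;
    }

private:
    std::uint32_t timeout_ms_;
    std::uint32_t since_ms_ = 0;
    bool tracking_ = false;
};
```

With a 60000 ms timeout, a caller could trigger a WiFi reconnect or a reboot when `update()` first returns true.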

  • Thomas

@TD-er
Member

TD-er commented Aug 20, 2018

OK, I will look into the source to see what it means :)

@Grovkillen added the Status: Needs Info and Type: Discussion labels on Aug 22, 2018
@thomastech
Contributor Author

This old open issue continues to be reported by others. Since there are now several open tickets on ESPEasy's random reboot problems I have closed this one to reduce the "noise."

  • Thomas
