8266: WLED keeps rebooting after 0.14.1 update. #3685

Open
1 task done
Trevo525 opened this issue Jan 14, 2024 · 163 comments
Labels
bug · major (This is a non-trivial major feature and will take some time to implement) · needs investigation (The bug has not yet been reproduced by me. Analysis or more details are needed.)

Comments

@Trevo525

What happened?

I have two instances of WLED running on two separate ESP-12F (I believe they are 8266 based?) modules. To be specific, it's this module (not the esp32, obviously). They are wired with different types of LEDs. One is with a WS2812B LED Strip and the other is a more generic LED string that has R|G|B|12V as the inputs, as opposed to 5V|Data|Ground that the first has. I'm not sure that will make a difference. But, I included it as it might be important to note. I just got them both running a week or two ago with WLED 0.14.0 and added them to Home Assistant. Everything worked as expected, I have been using presets and playing with the effects and colors on both. I even have a

However, I updated to 0.14.1 today, and the ESP connected to the generic LED strip started turning off whenever I changed the color. The light goes off for a split second and then switches back to the default orange color. I kept testing and it kept happening. Then I noticed that the web interface becomes unresponsive for a moment right after this happens. This leads me to believe the device is restarting.

I have been able to fix this for now by going to the update section and giving it the 0.14.0 binary. If I can help track down this issue, feel free to reach out: I will put 0.14.1 back on it if there are any logs or anything else I can provide.

To Reproduce Bug

Update to 0.14.1
Press almost any button in the interface.

Expected Behavior

I would have expected it not to crash.

Install Method

Binary from WLED.me

What version of WLED?

WLED 0.14.1

Which microcontroller/board are you seeing the problem on?

ESP8266

Relevant log/trace output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@Trevo525 Trevo525 added the bug label Jan 14, 2024
@AKHwyJunkie

In the FWIW department, I'm seeing this same behavior in Athom bulbs as well. (I'm using the recommended ESP02 image; it happens across all bulb models.) In case it helps, the issue started in 0.14.1-B3 and did not occur in 0.14.1-B2, at least in my case. I figured this might have been related to the JSON buffer lock issue, but it looks like it is not. I can trigger it by changing profiles, either via the web interface or via Home Assistant. I don't believe it's configuration related, as I tried a full factory reset in B3.

@chertvl

chertvl commented Jan 15, 2024

Same with 8266.
Continuously goes to Unavailable

Screenshot_20240115-064150_Home Assistant

@AngusMcT

Have the same problem. Just updated through Home Assistant, and have the same symptoms as OP.

@blazoncek
Collaborator

Please remove Home Assistant integration and see if the problems persist.
If they don't you may want to upgrade to ESP32 or get a special build without various features to get more free RAM on ESP8266.

BTW one way to see if WLED restarted is in Info dialog, Uptime field.
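
A minimal sketch of that check, for anyone polling the device instead of opening the Info dialog. It assumes an uptime counter in seconds is sampled twice (WLED's JSON info object is assumed here to expose one as `uptime`); the function name is hypothetical and not part of WLED.

```cpp
#include <cstdint>

// Hypothetical helper, not part of WLED.
// Given two uptime samples (in seconds) taken in order, report whether the
// device must have restarted in between: uptime only goes backwards after
// a reboot.
bool rebootedBetween(uint32_t earlierUptimeSec, uint32_t laterUptimeSec) {
  return laterUptimeSec < earlierUptimeSec;
}
```

Samples of 3600 and then 12 seconds indicate a restart, while 3600 and then 3660 do not. A restart can be missed if the second sample is taken after the new uptime has already passed the old value, so short polling intervals work best.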

@dosipod
Contributor

dosipod commented Jan 15, 2024

I do not use ESP8266 (4MB, 2MB or 1MB) in a production setup, but I do have a lot of them around to replicate such issues. If cfg.json and presets.json are provided, then we could do so.

I have flashed two ESP8266 4MB units since the first hour of the 0.14.1 release and kept them running with debug bins. I did not notice anything strange, nor have I seen a disconnection/reboot/crash in the log.

As of 1 hour ago I have added one of them to HA with a simple automation (which only sends an alert when the unit goes on/off), and I can see the unit disconnecting from WiFi (ping is lost), but I could not get it to behave the same way consistently.

I blame the HA integration but cannot confirm.

@blazoncek
Collaborator

@chertvl down-voting will not help resolving the issue.

@Doyle4

Doyle4 commented Jan 15, 2024

Running fine on ESP32 S2 mini, will test on a esp8266 device later when I can.

@chertvl

chertvl commented Jan 15, 2024

@chertvl down-voting will not help resolving the issue.

Never mind. I already downgraded to 0.14.0 and that works perfectly.

About "not help resolving issue", it's:

  • Advice to change the electronic component of the device, without considering that this is a ready-made factory device where this is impossible.
  • Blaming the integration, which was the first thing I deactivated during debugging.
  • Advice not to use usermods. There aren't any anyway: in my case this is a regular clean 0.14.1, which was updated via HA, and HA does not know how to update firmware with usermods. If I'm not mistaken...

I now have more time to describe the symptoms.
After updating an 8266-based device using HA from version 0.14.0 to 0.14.1:

  • The WLED web page takes forever to load, sometimes some elements will be drawn, but very rarely, most often the error is err_connection_refused.
  • APIs do not work, including HA integration.
  • It can be seen that the device reboots every few minutes and cannot start up normally. It seems to be running out of something, maybe memory.
  • The router reports that the device is connected, the uptime is stable, there are no reconnections.

@mxilievski

Same here, updated 3 8266-based devices. They can’t be accessed via Web.

@Doyle4

Doyle4 commented Jan 16, 2024

How many LEDs are you guys using? I flashed a couple of ESP8266s from B3 to the released 0.14.1, with no more than 100 LEDs each, and they are working fine. BUT I don't use H.A. at all, so I can't help on that side, sorry.

@photobix

Same problem on 4 instances, with between 80 and 278 LEDs on WEMOS D1 Mini (8266).
Even OTA updates no longer work reliably; I had to flash 3 instances via USB. Apparently the update runs into a timeout.

@WarC0zes

Same problem on Atom Matrix. I use Home Assistant and a RESTful command.
Since updating to version 0.14.1, I receive this error.

Logger: homeassistant.components.rest_command
Source: components/rest_command/__init__.py:166
Integration: RESTful Command ([documentation](https://www.home-assistant.io/integrations/rest_command), [issues](https://github.com/home-assistant/core/issues?q=is%3Aissue+is%3Aopen+label%3A%22integration%3A+rest_command%22))
First occurred: 06:19:45 (13 occurrences)
Last logged: 10:33:48

Client error. Url: http://192.168.1.xx/json/state. Error: Server disconnected

I reverted to version 0.14.0 and I no longer have errors.

@mxilievski

I reverted to version 0.14.0 and I no longer have errors.

How did you revert?

@WarC0zes

WarC0zes commented Jan 16, 2024

I reverted to version 0.14.0 and I no longer have errors.

How did you revert?

I downloaded the firmware (.bin) for version 0.14.0.
Then connect to the ESP through the browser.
Go to Settings / Security and update, and click on Manual OTA update.
wled update
Select the firmware file and update.

@softhack007
Collaborator

softhack007 commented Jan 16, 2024

I now have more time to describe the symptoms. After updating an 8266-based device using HA from version 0.14.0 to 0.14.1:

  • The WLED web page takes forever to load, sometimes some elements will be drawn, but very rarely, most often the error is err_connection_refused.
  • APIs do not work, including HA integration.
  • It can be seen that the device reboots every few minutes and cannot start up normally. It seems to be running out of something, maybe memory.
  • The router reports that the device is connected, the uptime is stable, there are no reconnections.

@blazoncek a few thoughts on commonalities in user reports

  • It seems to only affect 8266 ("Running fine on ESP32 S2")
  • the only real change for 0.14.1 is the modified locking mechanism for WebSocket API
  • some people said that problems disappeared with -DWLED_DISABLE_WEBSOCKETS
  • some problems include WDT reset (watchdog = potential infinite loop)
  • also web responses are sometimes affected ("takes ages")

We have to remember that WS responses are not running in Arduino context; on ESP32 they run inside the async_tcp task, and I'm not sure how it's implemented on 8266.

I think there are a few dangerous lines in the code to lock the JSON buffer

while (jsonBufferLock && millis()-now < 1000) delay(1); // wait for a second for buffer lock

  • delay() does work on esp32, however is dangerous on 8266 when not in arduino context
  • on 8266, millis() does not advance outside of arduino context

@chertvl @WarC0zes @Doyle4 if my understanding is right, it could help if you comment out the line I quoted, and replace it with

    if (jsonBufferLock) return false;

It's a temporary hack and not a proper solution, but it should help to understand if using delay() and millis() on 8266 is the problem. If this hack helps, then I'll take some time in the next few days to implement a proper solution for requestJSONBufferLock() without busy-waiting.
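
A minimal sketch of what a non-waiting lock request could look like, written as a desktop-compilable model rather than WLED's actual implementation. Only `jsonBufferLock` and `requestJSONBufferLock()` are names from this thread; `tryRequestJSONBufferLock`, `releaseLock` and the module-id convention are assumptions for illustration. The check-and-set is not atomic, so this alone does not protect against an interrupt arriving between the two steps.

```cpp
#include <cstdint>

// Stand-in for the global lock variable quoted above; 0 means "free".
static volatile uint8_t jsonBufferLock = 0;

// Hypothetical non-blocking variant: never calls delay() or millis().
// Returns false immediately if another module holds the buffer, so the
// caller has to retry later instead of busy-waiting.
bool tryRequestJSONBufferLock(uint8_t module) {
  if (jsonBufferLock) return false;          // busy: give up right away
  jsonBufferLock = module ? module : 255;    // remember who owns the lock
  return true;
}

// Hypothetical release helper: marks the buffer as free again.
void releaseLock() {
  jsonBufferLock = 0;
}
```

The caller decides what to do with a `false` result, for example answering the request with an error and letting the client retry, which keeps the wait out of interrupt-driven code.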

@softhack007
Collaborator

softhack007 commented Jan 16, 2024

🔺 On a different topic that goes to all who commented and contribute to this thread:

Please stop this thumbs-up thumbs-down BS. We are trying to analyse a problem and need you as users who must help us.
It does not really help if you just express fuzzy feelings with thumbs.

image

image

We are trying to do engineering work here, not to entertain fans in the Roman circus.

  • In case you want to add your few cents, please write a sentence in English, following basic rules of grammar.
  • If someone wants to say that he cannot even disable HA integration for a test, please write that.
  • a written "same here, too" is a lot easier to understand, instead of giving a thumbs-up to "same here".

I'm really tired of playing guessing games with emoji.

Use words, instead of throwing tags onto the wall. please.

@softhack007 softhack007 changed the title from "WLED keeps rebooting after 0.14.1 update." to "8266: WLED keeps rebooting after 0.14.1 update." Jan 16, 2024
@asolochek

I noticed this same behavior on my athom rgbw controller which is paired to home assistant.

After upgrading earlier in the afternoon everything seemed fine, but when I went to turn my lights off I noticed the wled controller wasn't responding. I tried a few times to turn them off via home assistant, and somehow got it stuck in a reboot loop that caused the leds to blink off every 30 seconds or so.

I was able to stop this by turning them off via the web UI and reverted to 0.14.0 and it's working again.

@chertvl

chertvl commented Jan 16, 2024

@chertvl @WarC0zes @Doyle4 if my understanding is right, it could help if you comment out the line I quoted, and replace it

Thanks for the detailed explanation.
I tried to compile the firmware for the first time using the instructions at
https://kno.wled.ge/advanced/compiling-wled/

I followed your steps, commented out the required line, and added a new one. It seemed like I did everything right, but, unfortunately, it didn’t help.
The web interface still cannot load properly, or does not load at all. Sometimes it’s possible to view the status via JSON. The physical button control on the board works.
The behavior has not changed.
ps: HA integration was disabled before all of these.

Below are some screenshots:

image
image
image
image
image
image

@chertvl

chertvl commented Jan 16, 2024

unfortunately, it didn’t help.

It may have gotten worse.
Now I do not have enough time to update the firmware via OTA, browser gives err_connection_refused.
Last time I miraculously succeeded, but now I don’t.

Unfortunately, my device doesn't have a UART, and I don't have one at home either. So continue the tests without me until I find a UART to restore the device...
Thanks for understanding.

@softhack007
Collaborator

softhack007 commented Jan 16, 2024

Now I do not have enough time to update the firmware via OTA, browser gives err_connection_refused.
So continue the tests without me until I find a UART to restore the device...
Thanks for understanding.

Thanks for helping as much as you could 🥇 and sorry about making it worse for you.

About the UART: if gpio 1 and 3 are accessible on your board, then a standard "USB-to-TTL" adapter is all you need. Like this one that's using a CH340G:
https://amzn.eu/d/fZChiyZ

... or this one that's specifically made for "ESP-01S"
https://amzn.eu/d/2CEAFUb

You'll also find them for cheap on ali.

@softhack007 softhack007 added the needs investigation and major labels Jan 16, 2024
@blazoncek
Collaborator

* the only real change for 0.14.1 is the modified locking mechanism for WebSocket API

There were more changes than this. And it is not for websockets but for HTTP requests.
Foremost we added PIO_FRAMEWORK_ARDUINO_MMU_CACHE16_IRAM48 to circumvent full IRAM condition. This may cause slowness in non LED display functions.
Mode blending was introduced in 0.14.1-a1. It can use a lot of memory and CPU on its own.

IMO, and my own testing showed that, new locking mechanism only improved on stability and memory corruption.

* some people said that problems disappeared with -DWLED_DISABLE_WEBSOCKETS

WebSockets need plenty of heap. Constantly. Disabling them can only improve things at the expense of a stale UI.

* some problems include WDT reset (watchdog = potential infinite loop)

I've seen WDT in non-WLED code. How to avoid it? Have no clue.
Async* stuff (web server and TCP and UDP) are interrupt driven on ESP8266.

* also web responses are sometimes affected ("takes ages")

This may be attributed to a more susceptible WiFi code in newer Arduino core we use with 0.14 (I've posted my own experience in another issue detailing the resolution).

All in all, IMO if you want to run 0.14.x on ESP8266 you need to make a few compromises. Why? Because with only 16kB of RAM available (after boot) it can get crowded rather quickly in the heap.

I am going to post my own ESP8266 configuration I use on ESP01 devices which I have plenty in daily use. Unfortunately that configuration may not work for some people as it strips quite a few features out, but produces reliable and working ESP8266 environment.

[env:esp01_4m]
extends = env:esp01_1m_full
board_build.filesystem = littlefs
board_build.ldscript = ${common.ldscript_4m1m}
board_build.f_cpu = 160000000L
build_flags = ${common.build_flags_esp8266}
  -DPIO_FRAMEWORK_ARDUINO_MMU_CACHE16_IRAM48
  -D LED_BUILTIN=2
  -D WLED_DISABLE_ALEXA
  -D WLED_DISABLE_HUESYNC
  -D WLED_DISABLE_LOXONE
  -D WLED_DISABLE_ADALIGHT
  -D WLED_DISABLE_MQTT
  -D WLED_DISABLE_2D
  -D WLED_DISABLE_PXMAGIC
  -D WLED_USE_UNREAL_MATH
  -D WLED_MAX_BUSSES=2
  -D LEDPIN=2
  -D USERMOD_PIRSWITCH
  -D PIR_SENSOR_PIN=3
  -D PIR_SENSOR_OFF_SEC=60
  -UWLED_USE_MY_CONFIG

My ESP01s use 4MB flash so they can be updated OTA.

If we explore the possibility to swap ESP8266 (in Wemos D1 mini format) with alternate (cheap) device (which I also did) I would recommend Lolin ESP32-S2 D1 mini with 4MB flash and 2MB PSRAM. I've also posted build environments for that elsewhere but the stock WLED doesn't differ much.

And for clarification, I will not pursue resolving this issue any more, since ESP8266 just does not have enough resources to smoothly run everything 0.14 offers. If anybody insists on running a fully built 0.14 with an external system like Home Assistant, Alexa or Hue and MQTT, I would urge them to reconsider and build a special version with other features stripped away.

@softhack007
Collaborator

softhack007 commented Jan 16, 2024

@blazoncek thanks for your thoughts, and I completely forgot about "Mode blending" and other additions that really increase RAM and CPU needs.

It seems my idea about requestJSONBufferLock() did not improve it. So agreed, it could be a general issue with low RAM. Even when users see free RAM, it might be heavily fragmented - I've seen examples where the largest available block was less than 10% of total free space.
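
A small sketch of how that fragmentation figure can be computed from two numbers: total free heap and the largest free block. On a device these would typically come from the ESP8266 Arduino core (`ESP.getFreeHeap()` and `ESP.getMaxFreeBlockSize()`); here they are plain parameters so the example compiles anywhere, and the function name is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: percentage of the free heap that cannot be handed
// out as one contiguous allocation. 0 means the free space is one block.
uint32_t fragmentationPercent(uint32_t freeBytes, uint32_t largestBlock) {
  if (freeBytes == 0) return 0;                            // nothing free at all
  if (largestBlock > freeBytes) largestBlock = freeBytes;  // guard against bad input
  const uint64_t usable = (static_cast<uint64_t>(largestBlock) * 100) / freeBytes;
  return 100 - static_cast<uint32_t>(usable);
}
```

With 16000 bytes free and a largest block of 1500 bytes, the function returns 91, which matches the "less than 10% of total free space" situation described above: plenty of free RAM on paper, but no room for a large JSON or web response buffer.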

I guess that we need serial monitor logs from debug builds to find out if something can be done to improve 8266 performance - or maybe nothing can be done, and we'll soon declare 8266 as "half-dead" 😉 aka deprecated....

Edit: a few more "disable" flags to try out:

  • -D WLED_DISABLE_ESPNOW
  • -D WLED_DISABLE_WEBSOCKETS
  • -D WLED_DISABLE_MODE_BLEND

.... and a simple one: go to LEDs settings, uncheck "Use global LED buffer"

@blazoncek
Collaborator

Regarding WDT resets: I have received a word from @willmmiles (whom I consider one of the most technically skilled developers that touched WLED code) that he has traced WDT resets into NeoPixelBus code consuming too much time bitbanging data out.

If you are not using GPIO1 or GPIO2 or GPIO3 for digital led output then CPU has to keep feeding LEDs. This in turn reduces performance for everything else.

If you use PWM LEDs make sure you only use GPIO4 or GPIO12 or GPIO14 or GPIO15 (as specified by Espressif technical documentation, https://www.espressif.com/sites/default/files/documentation/esp8266-technical_reference_en.pdf). Do not forget PWM signal requires NMI to be driven, hence uses CPU.
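
The pin guidance above can be summarized as a small lookup. This sketch only encodes the GPIO numbers named in this comment; the function names are hypothetical and nothing here is taken from WLED's pin manager.

```cpp
// Hypothetical helper: digital LED output on GPIO1, GPIO2 or GPIO3 does not
// need the CPU to keep feeding the LEDs; any other pin is bit-banged.
bool digitalOutputAvoidsBitBanging(int gpio) {
  return gpio == 1 || gpio == 2 || gpio == 3;
}

// Hypothetical helper: the PWM pins recommended in the comment above.
bool isRecommendedPwmPin(int gpio) {
  return gpio == 4 || gpio == 12 || gpio == 14 || gpio == 15;
}
```

For instance, a strip on GPIO2 passes the first check, while the same strip moved to GPIO5 would be bit-banged and compete with WiFi and the web server for CPU time.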

@willmmiles
Collaborator

willmmiles commented Jan 17, 2024

Regarding WDT resets: I have received a word from @willmmiles (whom I consider one of the most technically skilled developers that touched WLED code) that he has traced WDT resets into NeoPixelBus code consuming too much time bitbanging data out.

My test case here is a single strip of 110 WS2812Bs, using a 0_15 branch derived build. Bit-banging for this many LEDs can take several milliseconds with interrupts disabled, which I believe can overflow some of the wifi hardware queues, depending on the amount of traffic on the network. I'm working on hacking some of the interrupt tolerance ideas from FastLED in to NeoPixelBus to see if I can mitigate it.

If a setup has more LEDs on a bit-banging pin, or a busier network, it might trip problems sooner. Sometimes this might manifest as hard reboots like I'm seeing; it's also possible it manifests as a wifi disconnect. (I'm actually rather surprised I haven't seen that in my testing, to be honest).
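
A rough estimate of the time involved, assuming standard WS2812B timing of 800 kHz (1.25 µs per bit) and 24 bits per LED, and ignoring the reset gap between frames. The constant and function names are illustrative only.

```cpp
#include <cstdint>

// Assumptions: RGB (not RGBW) WS2812B, 24 bits per LED, 1250 ns per bit.
constexpr uint32_t kBitsPerLed = 24;
constexpr uint32_t kNanosPerBit = 1250;

// Time in microseconds needed to shift out one full frame, which is roughly
// how long interrupts stay disabled while bit-banging.
constexpr uint32_t frameTimeMicros(uint32_t ledCount) {
  return ledCount * kBitsPerLed * kNanosPerBit / 1000;
}
```

For the 110-LED strip this gives 3300 µs, about 3.3 ms per frame, which fits the "several milliseconds" above. A 278-LED strip, as reported earlier in the thread, would take about 8.3 ms.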

I will try a 0.14.1 build tonight and see if it behaves differently for me than the 0_15 development branch. It's quite possible this is a different issue than the one I've been chasing.

@afflux

afflux commented Jan 18, 2024

Regarding WDT resets: I have received a word from @willmmiles (whom I consider one of the most technically skilled developers that touched WLED code) that he has traced WDT resets into NeoPixelBus code consuming too much time bitbanging data out.

FWIW, I'm seeing occasional resets on 8266 with 0.14.1 and use LPD8806, so no bitbanging involved. (But it's way rarer than what people are reporting here, I have 48h uptime right now)

@blazoncek
Collaborator

use LPD8806, so no bitbanging involved

How do you know it is not? If you are using GPIO13 & GPIO14 then yes, it uses HW to accelerate output; otherwise you are using SW (CPU) to drive clock and data.

@Scope666

So I went downstairs for a snack and noticed the test unit on when it was supposed to be off. It crashed (build 3) after 17 days.

Going to try build 4 now.

@willmmiles
Collaborator

At this point I'm thinking none of these test builds are actually "stable" - whatever the issue is, it's lurking at a lower level; it's more that some builds/environments/configs trigger it faster than others.

I've put together some code to hook the crash handler and save the stack trace in the flash for later recovery. I've also thrown in some task and interrupt tracing logic I'd written while debugging the PWM-related crashes earlier this year. If a crash is logged, on the next boot the software will write a 'dump.txt' file with the trace to the local filesystem. The file can be retrieved with the /edit interface. Once a crash dump is logged, the system won't save another one until the 'dump.txt' file is deleted -- so even if it's crashing a lot, it won't wear out the flash.
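
The "write once until deleted" rule can be modeled as follows. This is a sketch of the behavior described above, not the code in the test build: the filesystem is replaced by an in-memory map so the example runs on a desktop, and all names are hypothetical.

```cpp
#include <map>
#include <string>

// Hypothetical in-memory stand-in for the device filesystem: path -> contents.
using FakeFs = std::map<std::string, std::string>;

// On boot, store a pending crash trace as dump.txt, but only if no earlier
// dump is still present. Returns true if the trace was written.
bool saveCrashDumpOnce(FakeFs& fs, const std::string& trace) {
  if (trace.empty()) return false;           // no crash was logged
  if (fs.count("/dump.txt")) return false;   // keep the first dump, skip the write
  fs["/dump.txt"] = trace;
  return true;
}
```

Because a second crash never overwrites an existing dump, a unit stuck in a reboot loop performs at most one dump write until someone retrieves and deletes the file.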

Unfortunately this build will not catch "hard watchdog" type crashes; the Arduino core logic for debugging those doesn't have a user code hook, and I haven't got to pulling it in and modifying it yet. If that's what's going on, it'll still dump stack to the serial port, but it won't leave a file behind.

WLED_0.15.0-b5_ESP02_test.bin.gz

Lastly: this build is also based on the latest 0.15 tip, which has some other improvements that might improve stability beyond the previous builds -- though also some new logic.

@Scope666

Scope666 commented Sep 29, 2024

@willmmiles I've installed your test build. If it crashes and it creates one, I'll share the dump.txt here.

Thanks!!!

PS ... my other 3 units that are running 14.0 have been up since the last power failure ... 31 days and counting, so it's something that changed after that point.

@softhack007
Collaborator

I've put together some code to hook the crash handler and save the stack trace in the flash for later recovery.

@willmmiles cool, sounds like something we really should have in WLED. Do you know if the dump.txt trick would also work on esp32? Some people would kill for such a feature ;-)

Unfortunately this build will not catch "hard watchdog" type crashes

Well at least you can detect the restart reason on the next boot, so that watchdog aborts would not go completely unnoticed. Example https://github.com/MoonModules/WLED/blob/63ff7205d61c4bdf7e9b952e392222e46b93e1d6/wled00/wled.cpp#L575-L577
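
A minimal sketch of classifying the restart reason on the next boot. The numeric codes are assumed to mirror the `rst_reason` values of the ESP8266 SDK; on a device the value would come from the core's reset info, which is not called here so that the example compiles on a desktop. The enum and function names are hypothetical and this is not the code behind the link above.

```cpp
#include <cstdint>

// Assumed to match the ESP8266 SDK's rst_reason numbering.
enum ResetReason : uint32_t {
  kPowerOn = 0,        // normal power-on start
  kHardwareWdt = 1,    // hardware watchdog reset
  kException = 2,      // exception / crash
  kSoftwareWdt = 3,    // software watchdog reset
  kSoftRestart = 4,    // restart requested by software
  kDeepSleepWake = 5,  // wake from deep sleep
  kExternalReset = 6   // external reset pin
};

// True if the previous run ended because a watchdog fired.
bool wasWatchdogReset(uint32_t reason) {
  return reason == kHardwareWdt || reason == kSoftwareWdt;
}

// True if the previous run ended abnormally (watchdog or exception).
bool wasAbnormalRestart(uint32_t reason) {
  return wasWatchdogReset(reason) || reason == kException;
}
```

Logging or counting the result at boot would at least make hard watchdog restarts visible, even when no stack trace could be saved.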

@willmmiles
Collaborator

I've put together some code to hook the crash handler and save the stack trace in the flash for later recovery.

@willmmiles cool, sounds like something we really should have in WLED. Do you know if the dump.txt trick would also work on esp32? Some people would kill for such a feature ;-)

I hadn't done any research on ESP32 yet. Looks like core dumps to flash are already a feature of ESP32-IDF, we'd just need to figure out how to turn them on and supply a partition for them to reside in. (For the ESP8266 code I cheated and used the OTA space).

https://docs.espressif.com/projects/esp-idf/en/stable/esp32/api-guides/core_dump.html
https://www.reddit.com/r/esp32/comments/pmefci/esp32_coredump_to_flash_with_arduino_and/
https://community.platformio.org/t/platformio-support-for-esp32-coredump/25141

Unfortunately this build will not catch "hard watchdog" type crashes

Well at least you can detect the restart reason on the next boot, so that watchdog aborts would not go completely unnoticed. Example https://github.com/MoonModules/WLED/blob/63ff7205d61c4bdf7e9b952e392222e46b93e1d6/wled00/wled.cpp#L575-L577

Oh yeah, I've got that in my debug build too. It only goes to the serial port though. I've got the HWDT stack traces enabled in this build too, but they also only go to the serial port. I do think it's possible to upgrade the HWDT debugging logic to stash the trace elsewhere, but it's a bit more work to integrate than the convenient callback hook the Arduino core folks left for the other crash cases.

@kenni

kenni commented Sep 30, 2024

WLED_0.15.0-b5_ESP02_test.bin.gz

Lastly: this build is also based on the latest 0.15 tip, which has some other improvements that might improve stability beyond the previous builds -- though also some new logic.

Thanks @willmmiles. I’ve updated one of my Athom LS-4P devices with the new test firmware, but it seems like basic WLED functionality is broken - I can’t even switch colors. Any idea on what is going on?

I updated from 0.14.0 and tried re-flashing and restarting a couple of times with no luck. Downgrading to 0.14.0 restores all functionality.

IMG_0115
IMG_0116

@willmmiles
Collaborator

Thanks @willmmiles. I’ve updated one of my Athom LS-4P devices with the new test firmware, but it seems like basic WLED functionality is broken - I can’t even switch colors. Any idea on what is going on?

@kenni Thanks for giving it a try! It sounds like the index page isn't loading completely, so elements are missing and the javascript code fails.

Can you try connecting with a desktop web browser, ideally with the "developer tools" enabled in the network panel? The index page should be 44679 bytes in size. Also please look in /edit for a dump.txt. (You can check /edit even with the old firmware, the filesystem persists across versions).

@kenni

kenni commented Oct 1, 2024

@willmmiles The index page seems to be complete to me. The HTTP response header advertises that the content-length of the file is 44679, as you expected. The transferred file has a size of "45kB" according to Chrome and if I look at the content of the file, it ends with "</html>" on the last line. So it seems complete.

When I access /edit there are only two files available: cfg.json and presets.json.

EDIT: Factory reset fixes the Javascript-issue, so my old configuration apparently isn't compatible with the new version. Restoring the configuration file on the new firmware reintroduces the Javascript error. Downgrading firmware to 0.14.0 and restoring the configuration file works perfectly.

EDIT 2: The cause of the configuration error seems to be the assignment of the LED data GPIO. The correct pin for my controller is GPIO1, and this works in 0.14.0, but selection of that pin is not allowed in the GUI in 0.15.0-b5.

EDIT 3: Ahh, seems like the stock 0.15.0-b5 doesn't reserve GPIO1... @willmmiles , are you perhaps using GPIO1 for debugging or something else in your build? Any chance you could generate a build where GPIO1 (and GPIO12 for relay) are unused? I can't physically move any wires, as I'm using a factory-made Athom LS-4P all-in-one controller.

@willmmiles
Collaborator

EDIT 3: Ahh, seems like the stock 0.15.0-b5 doesn't reserve GPIO1... @willmmiles , are you perhaps using GPIO1 for debugging or something else in your build? Any chance you could generate a build where GPIO1 (and GPIO12 for relay) are unused? I can't physically move any wires, as I'm using a factory-made Athom LS-4P all-in-one controller.

I haven't made any changes to the default pin settings for the esp8266_2m build. The test build is based on the 0_15 tip, which includes a significant change to the bus and pin management past the -b5 tag; I'll review the logic and see if I can find out why pin 1 is disallowed.

@willmmiles
Collaborator

I'll review the logic and see if I can find out why pin 1 is disallowed.

Ah, it's not a new thing at all - this build has debug messages enabled, which are sent to the serial port, for which pin 1 is the transmit pin; so WLED reserves it for that purpose.

Unfortunately we don't yet have a good solution for collecting regular debug logs internally for post-mortem storage. The new code here handles only stack traces, and even then they'll also be echoed out the serial port by the Arduino platform code. I hate to say it but your hardware might just not be suitable for software debugging with this build. :( Sorry!

@kenni

kenni commented Oct 2, 2024

Ah, it's not a new thing at all - this build has debug messages enabled, which are sent to the serial port, for which pin 1 is the transmit pin; so WLED reserves it for that purpose.

Ok, that was also my conclusion after coming across a comment in the source code mentioning GPIO1 for serial communication when doing a debug build.

Unfortunately we don't yet have a good solution for collecting regular debug logs internally for post-mortem storage. The new code here handles only stack traces, and even then they'll also be echoed out the serial port by the Arduino platform code. I hate to say it but your hardware might just not be suitable for software debugging with this build. :( Sorry!

It's too bad, I'll just cross my fingers that someone else has suitable hardware and will be able to test your builds. I would love to get the esp8266 back in a working state with latest WLED versions. Thanks for all of your time and willingness to fix this :)

@Scope666

Scope666 commented Oct 5, 2024

@willmmiles ... just reporting in that your latest test build has not crashed, still up from when I installed it 5 days ago:

image

@Scope666

Scope666 commented Oct 6, 2024

Ok, it just crashed 3 hours ago, no dump.txt file present:

image

image

For comparison, one of my other units running 14.0 ... it would be longer but we had a power failure:

image

@willmmiles
Collaborator

OK, so most likely a hard watchdog crash then, indicating we're getting trapped in an interrupt handler. I'll implement HWDT stack tracing next, I guess!

@willmmiles
Collaborator

Today's update:

  • I've expanded the stack trace postmortem to a complete core dump. Device memory is so small compared to flash sizes, might as well collect all the things.
  • I've hacked up the Arduino framework's HWDT debugging handler to also collect a core dump. I couldn't find a way to get in early enough without editing the framework code; unlike the regular crash traces, they did not include a user code hook in this layer. :(
  • However, I am observing crashes in platform wifi code when attempting to open an AP after failing to connect, and also when changing the wifi connection. It's easily repeatable (good!) but I have yet to be able to localize the offender. Unfortunately there are a lot of places that set a value of -1. :(

New build to follow when I isolate the wifi crash.

@willmmiles
Collaborator

willmmiles commented Oct 19, 2024

My crash case turned out to be that I'd corrupted the SDK nonvolatile storage with a bad build; clearing it restored my unit to working properly. The curious thing was that I could consistently trigger the crash by changing the wifi settings, or opening the AP while it was searching, but if it connected successfully it seemed stable for a long time (overnight, at least).

Has anyone upthread tried a complete erase_flash?

Here is the latest test build:

  • Captures complete RAM contents as a core.dump file on crash
  • HWDT crashes also generate crash.txt and core dumps
  • Some unrelated text processing tweaks from while I was tracing the crashes I was seeing
  • New feature: if /reset.sdk is found on the filesystem when booting, the SDK storage will be cleared (and the file entry removed). Currently this file must be created with the /edit interface; I haven't yet added this feature to the UI. This can be used to try an SDK data reset after an OTA update (ie. without needing to use esptool.py erase_flash).

WLED_0.15.0-b5_ESP02_test.bin.gz
Edit: replaced with a build without STATUS_LED
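
For readers following along, the `/reset.sdk` marker-file behaviour described above can be modelled roughly as below. This is only a host-side Python sketch of the logic, not the firmware code: `handle_sdk_reset_marker` and the `clear_sdk_storage` callback are hypothetical names, and the order of operations (clear first, then remove the marker) is an assumption.

```python
from pathlib import Path
from typing import Callable

MARKER_NAME = "reset.sdk"  # marker file name taken from the comment above


def handle_sdk_reset_marker(fs_root: Path,
                            clear_sdk_storage: Callable[[], None]) -> bool:
    """Model of the boot-time check: if the marker file exists, clear the
    SDK storage once and delete the marker so later boots are unaffected.

    `clear_sdk_storage` is a hypothetical stand-in for whatever routine the
    firmware uses to erase the SDK flash region.
    Returns True if a reset was performed on this "boot".
    """
    marker = fs_root / MARKER_NAME
    if not marker.is_file():
        return False          # normal boot, nothing to do
    clear_sdk_storage()       # assumed order: erase first ...
    marker.unlink()           # ... then remove the marker (one-shot)
    return True
```

The point of the one-shot marker is that the reset can be requested over the network and takes effect on the next boot, without a serial connection.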

@Scope666

@willmmiles ... I finally found a dump.txt, it's from the build before your one from 2 hours ago:

https://gist.github.com/Scope666/40f8d3fb0ea9ea28c9fdf712008581c8

Hope it sheds some light...

@willmmiles
Collaborator

> @willmmiles ... I finally found a dump.txt, it's from the build before your one from 2 hours ago:
>
> https://gist.github.com/Scope666/40f8d3fb0ea9ea28c9fdf712008581c8
>
> Hope it sheds some light...

Hm, interesting! Looks like the wifi subsystem got stuck. There are no wifi interrupts, but the main code keeps trying to yield() since the wifi task is flagged as "needs attention" -- and yet it seems as if it's returning without clearing the flag. Could be that it can't allocate memory for packet buffers; could be SDK corruption confusing the wifi module like I saw on my board. (This is why I added the whole core dump -- it'd still be a pain to trawl the heap data structures and see what the state is, but it'd at least be possible.)

My quick recommendation would be to try the SDK reset and see if it improves things.

@Scope666

Before I do it, will it put up an AP so I can flash it again? It's in a spot that's kind of difficult to get to. (under a cabinet, wood blocking access)

I'm running the build you posted yesterday, but I'll try the reset if you think it will shed some light.

@willmmiles
Collaborator

Oh! The "SDK reset" doesn't reset the WLED configuration -- it should boot up like normal and maintain all of your settings, including your wifi SSID and password. The SDK reset just clears whatever state the underlying Espressif platform code maintains from one boot to the next. The platform code is designed to "start clean" if that flash block is erased: it will re-measure or rebuild whatever was stored there, which just slows down the boot a little.

That said, this build only resets the "SDK" flash region, not the "wifi calibration" region, since that's what the exposed Arduino framework function does out of the box. If this doesn't improve anything, we can also try resetting the wifi calibration region next.

@Scope666

Ok, I did the reset, and it successfully deleted the reset.sdk file I created ... fingers crossed.

Svennte pushed a commit to Svennte/WLED that referenced this issue Oct 22, 2024
…ircoookie#3690 and Aircoookie#3685)

some users have reported that releases after 0.14.0 are not working reliably. So we add a few "compat" for 8266 that try to reproduce the buildenv of 0.14.0 as much as possible.

* platform and platform_packages from 0.14.0
* not using PIO_FRAMEWORK_ARDUINO_MMU_CACHE16_IRAM48
* due to smaller IRAM, we had to move some functions back from IRAM to normal flash (may cause slowdown)
@Scope666

@willmmiles, ok, had a crash, and I'm attaching both files created here:

dump.txt
wled.core.zip

@willmmiles
Collaborator

I've taken a quick look at the crash data. The actual point-of-failure is definitely a "wifi resource exhaustion" type fault; but there's no evidence of the "failure to yield()" people always blame this kind of thing on. There are a couple of interesting bits though:

  • Event trace shows flash IO activity at least 24 secs before the crash, with indications that it's doing wifi sends. Curiously there seems to be a write event somewhere in the middle of the sequence; I'm not sure what that's about. I'm assuming you're not routinely doing intentional config or preset writes, just immediate state changes.
  • There's a status flag set on the system (wifi) task that I haven't seen before in an event log. Still looking to find out what that's about. Unfortunately, the task state does not survive HWDT resets. :(

Otherwise I'd say it seems as if it's operating normally right up to the point of failure. I'm still digging deeper; maybe I can find out what the recent wifi packets were.

@Scope666

Scope666 commented Oct 29, 2024

@willmmiles ... some additional info if it helps. Preset never changes... an automation fires in the morning from HA to turn it on, and another one fires around midnight to shut it off. I know it was discussed there's some routine polling that HA does, that could be a factor. The AP the unit is connecting to is a Ubiquiti U6 Pro. The 2 pre-shared keys determine whether it connects to main or IoT VLAN:

image

Again, a reminder: the 3 units running on 0.14.0 have been up for over 61 days, so that version seems happy with the AP / settings.
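
For anyone who wants to track reboots without watching the info page, here is a minimal Python sketch that counts reboots from periodic uptime samples. It assumes the device reports an `uptime` field (seconds since boot) at WLED's `/json/info` endpoint; the function names `fetch_uptime` and `count_reboots` are made up for illustration.

```python
import json
import urllib.request
from typing import Iterable


def fetch_uptime(host: str, timeout: float = 5.0) -> int:
    """Read the uptime in seconds from a WLED device.

    Assumes the JSON info object at /json/info has an "uptime" field.
    """
    url = f"http://{host}/json/info"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return int(json.load(resp)["uptime"])


def count_reboots(uptimes: Iterable[int]) -> int:
    """Count reboots in a series of uptime samples.

    Uptime only ever grows while the device stays up, so any sample that is
    smaller than the previous one means the device restarted in between.
    """
    reboots = 0
    previous = None
    for uptime in uptimes:
        if previous is not None and uptime < previous:
            reboots += 1
        previous = uptime
    return reboots
```

Calling `fetch_uptime` from a scheduled job every minute or so and feeding the collected values to `count_reboots` would catch crashes that happen overnight. Two reboots between consecutive samples would be counted as one.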

@willmmiles
Collaborator

Still digging through the crash dump.

The actual fault was precipitated by a timer which triggered the wifi subsystem. My best guess is something timed out, which triggered a flurry of wifi activity, resulting in wifi resource exhaustion of some kind (presumably packet buffers, but this is speculation as we don't have a register map for the wifi interface); whatever it is, the wifi driver responds by hanging the system and triggering the HWDT. WLED itself seems to have been running fine, yield()ing to the wifi stack in reasonable periods of time (<200us), but this wasn't enough to keep up??

Recent IP traffic seems to be all HA polling stuff.

I'm investigating the timer code now, to see if I can backtrack to identify why the wireless code is waiting.

One thing I have observed with my test device is that when I'm running with HA disabled, the web interface reliably loads quickly. With HA enabled and polling the device, half the time I can't get the web interface to even load completely; the system behaves like there's a ton of packet loss, with connections timing out at both ends. I don't have an obvious explanation; heap metrics look OK whenever packets do make it to the processing layers (ie. room remaining for 5 or 6 packets), and I haven't had any issues other than heap exhaustion with parallel load tests in the past. So possibly there's something else going on; I might take a deeper look at the websockets code, as that's something both the web interface and HA use that I don't have test cases for - maybe there's a bug lurking there.
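
A rough way to imitate this kind of parallel polling load from a PC is sketched below in Python. It is not the test harness mentioned above; `run_load`, `timed_call` and `http_get` are hypothetical helper names, and whether HA polls the same endpoint used in the usage comment (`/json/state`) is an assumption.

```python
import http.client
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional


def http_get(url: str, timeout: float = 3.0) -> None:
    """Fetch a URL and discard the body; raises on timeout or error."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()


def timed_call(fetch: Callable[[], None]) -> Optional[float]:
    """Return the duration of one request in seconds, or None if it failed."""
    start = time.monotonic()
    try:
        fetch()
    except (OSError, http.client.HTTPException):
        # URLError, socket timeouts and connection resets are OSError subclasses
        return None
    return time.monotonic() - start


def run_load(fetch: Callable[[], None], workers: int = 4,
             requests_per_worker: int = 25) -> Dict[str, float]:
    """Issue requests from several threads at once and summarise the outcome."""
    total = workers * requests_per_worker
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda _: timed_call(fetch), range(total)))
    succeeded = [r for r in results if r is not None]
    return {
        "requests": total,
        "failed": total - len(succeeded),
        "worst_latency_s": max(succeeded) if succeeded else float("nan"),
    }

# Usage (hostname is a placeholder):
# print(run_load(lambda: http_get("http://wled.local/json/state")))
```

Comparing the `failed` count with HA polling enabled and disabled would give a number to put next to the "behaves like packet loss" observation.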

@drewcovi

Just a quick note, and thanks to Quindor for pointing out the multicast flood issue. I'm assuming that's in the mix here; it was obvious for me that a third party was at play when two instances went down within seconds of each other.

@Scope666

Scope666 commented Jan 1, 2025

Just a quick note that I tried the "compat" build of 0.15.0 final and it still has the crashing issue. 0.14.0 NEVER crashes on any of the 4 devices I have.
