8266: WLED keeps rebooting after 0.14.1 update. #3685
Comments
In the FWIW department, I'm also seeing this same behavior in Athom bulbs as well. (I'm using the recommended ESP02 image, happens across all bulb models.) In case it helps, I noticed this issue started in 0.14.1-B3 and did not occur in 0.14.1-B2, at least in my case. I figured this might have been related to the JSON buffer lock issue, but it looks like not. I can trigger it by changing profiles, either via the web interface or via Home Assistant. I don't believe it's configuration related as I tried a full factory reset in B3. |
Have the same problem. Just updated through Home Assistant, and have the same symptoms as OP. |
Please remove Home Assistant integration and see if the problems persist. BTW one way to see if WLED restarted is in Info dialog, Uptime field. |
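The Uptime check mentioned above can also be done on logged values instead of by eye. This is a minimal sketch, assuming you have already collected uptime readings (in seconds) at regular intervals, for example from the Info dialog or from the `uptime` field that WLED reports in its info JSON as far as I know. The function name and the sample values are hypothetical.

```python
def count_restarts(uptimes):
    """Count how many times the uptime counter dropped between samples."""
    restarts = 0
    for previous, current in zip(uptimes, uptimes[1:]):
        # Uptime only grows while the device keeps running,
        # so a smaller value than before means it rebooted in between.
        if current < previous:
            restarts += 1
    return restarts


# Hypothetical uptime readings in seconds, one per polling interval.
samples = [120, 150, 180, 12, 42, 72, 5]
print(count_restarts(samples))
```

Note that a reboot is missed if the polling interval is so long that the new uptime already exceeds the previous reading, so short intervals work best.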
I do not use ESP8266 (4MB, 2MB or 1MB) in a production setup, but I do have a lot of them around to replicate such issues. If cfg.json and presets.json are provided then we could do so. I flashed two ESP8266 4MB units within the first hour of the 0.14.1 release and have kept them running. As of 1 hour ago I have added one of them to HA with a simple automation (which only sends an alert if the unit is on/off), and I can see the unit disconnecting from WiFi (ping is lost), but I could not get it to behave the same way consistently. I blame the HA integration but cannot confirm it. |
@chertvl down-voting will not help resolving the issue. |
Running fine on ESP32 S2 mini, will test on an esp8266 device later when I can. |
Never mind. Already downgraded to 0.14.0 and that works perfectly. About "not help resolving issue", it's:
I now have more time to describe the symptoms.
|
Same here, updated 3 8266-based devices. They can’t be accessed via Web. |
How many LEDs are you guys using? Flashed a couple of esp8266s from B3 to the released 0.14.1, no more than 100 LEDs, working fine, BUT I don't use H.A. at all so I can't help on that side, sorry. |
Same problem on 4 instances. Between 80 and 278 LED on WEMOS D1 Mini (8266). |
Same problem on Atom Matrix. I use Home Assistant and a RESTful command.
I reverted to version 0.14.0 and I no longer have errors. |
How did you revert? |
@blazoncek a few thoughts on commonalities in user reports
We have to remember that WS responses are not running in the Arduino context; on esp32 they run inside the async_tcp task, and I'm not sure how it's implemented on 8266. I think there are a few dangerous lines in the code to lock the JSON buffer: Line 205 in a4a8e26
@chertvl @WarC0zes @Doyle4 if my understanding is right, it could help if you comment out the line I quoted and replace it with if (jsonBufferLock) return false; It's a temporary hack and not a proper solution, but it should help to understand if using |
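The idea behind the suggested hack is to fail fast when the buffer is already taken, instead of waiting for it. The following is only a Python illustration of that pattern, not WLED code: the class and method names are hypothetical, and only the "return false if already locked" behavior mirrors the quoted `if (jsonBufferLock) return false;`.

```python
class JsonBufferLock:
    """Toy model of a single shared buffer guarded by an owner flag."""

    def __init__(self):
        self.owner = 0  # 0 means free; any other value identifies the holder

    def try_acquire(self, module):
        # Fail fast: if somebody already holds the buffer, give up
        # immediately instead of waiting for it to be released.
        if self.owner:
            return False
        self.owner = module
        return True

    def release(self):
        self.owner = 0


lock = JsonBufferLock()
print(lock.try_acquire(1))  # first caller gets the buffer
print(lock.try_acquire(2))  # second caller is refused while it is held
lock.release()
print(lock.try_acquire(2))  # after release the buffer is available again
```

With this pattern a refused request is dropped rather than delayed, which is why it is only useful as a diagnostic and not as a proper fix.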
🔺 On a different topic that goes to all who commented and contribute to this thread: Please stop this thumbs-up thumbs-down BS. We are trying to analyse a problem and need you as users who must help us. We are trying to do engineering work here, not to entertain fans in the roman circus.
I'm really tired of playing guessing games with emoji. Use words, instead of throwing tags onto the wall. please. |
I noticed this same behavior on my athom rgbw controller which is paired to home assistant. After upgrading earlier in the afternoon everything seemed fine, but when I went to turn my lights off I noticed the wled controller wasn't responding. I tried a few times to turn them off via home assistant, and somehow got it stuck in a reboot loop that caused the leds to blink off every 30 seconds or so. I was able to stop this by turning them off via the web UI and reverted to 0.14.0 and it's working again. |
Thanks for the detailed explanation. I followed your steps, commented out the required line, and added a new one. It seemed like I did everything right, but, unfortunately, it didn’t help. Below are some screenshots: |
It may have gotten worse. Unfortunately, my device doesn't have a UART, and I don't have one at home either. So continue the tests without me until I find a UART to restore the device... |
Thanks for helping as much as you could 🥇 and sorry about making it worse for you. About the UART: if gpio 1 and 3 are accessible on your board, then a standard "USB-to-TTL" adapter is all you need. Like this one that's using a CH340G: ... or this one that's specifically made for "ESP-01S" You'll also find them for cheap on ali. |
There were more changes than this. And it is not for websockets but for HTTP requests. IMO, and my own testing showed that, the new locking mechanism only improved stability and reduced memory corruption.
WebSockets need plenty of heap. Constantly. Disabling them can only improve things, at the expense of a stale UI.
I've seen WDT in non-WLED code. How to avoid it? Have no clue.
This may be attributed to more susceptible WiFi code in the newer Arduino core we use with 0.14 (I've posted my own experience in another issue detailing the resolution). All in all, IMO if you want to run 0.14.x on ESP8266 you need to make a few compromises. Why? Because with only 16kB of RAM available (after boot) it can get crowded rather quickly in the heap. I am going to post my own ESP8266 configuration that I use on ESP01 devices, of which I have plenty in daily use. Unfortunately that configuration may not work for some people as it strips quite a few features out, but it produces a reliable and working ESP8266 environment.

[env:esp01_4m]
extends = env:esp01_1m_full
board_build.filesystem = littlefs
board_build.ldscript = ${common.ldscript_4m1m}
board_build.f_cpu = 160000000L
build_flags = ${common.build_flags_esp8266}
  -DPIO_FRAMEWORK_ARDUINO_MMU_CACHE16_IRAM48
  -D LED_BUILTIN=2
  -D WLED_DISABLE_ALEXA
  -D WLED_DISABLE_HUESYNC
  -D WLED_DISABLE_LOXONE
  -D WLED_DISABLE_ADALIGHT
  -D WLED_DISABLE_MQTT
  -D WLED_DISABLE_2D
  -D WLED_DISABLE_PXMAGIC
  -D WLED_USE_UNREAL_MATH
  -D WLED_MAX_BUSSES=2
  -D LEDPIN=2
  -D USERMOD_PIRSWITCH
  -D PIR_SENSOR_PIN=3
  -D PIR_SENSOR_OFF_SEC=60
  -UWLED_USE_MY_CONFIG

My ESP01s use 4MB flash so they can be updated OTA. If we explore the possibility of swapping the ESP8266 (in Wemos D1 mini format) for an alternative (cheap) device (which I also did), I would recommend the Lolin ESP32-S2 D1 mini with 4MB flash and 2MB PSRAM. I've also posted build environments for that elsewhere, but the stock WLED doesn't differ much. And for clarification, I will not pursue resolving this issue any more since ESP8266 just does not have enough resources to smoothly run everything 0.14 offers. If anybody insists on running a fully built 0.14 with an external system like Home Assistant, Alexa or Hue and MQTT, I would urge them to reconsider and build a special version with other features stripped away. |
@blazoncek thanks for your thoughts, and I completely forgot about "Mode blending" and other additions that really increase RAM and CPU needs. It seems my idea about Guess that we need serial monitor logs from debug builds, to find out if something can be done to improve 8266 performance - or maybe nothing can be done, and we'll soon declare 8266 as "half-dead" 😉 aka deprecated.... Edit: a few more "disable" flags to try out:
.... and a simple one: go to LEDs settings, uncheck "Use global LED buffer" |
Regarding WDT resets: I have received word from @willmmiles (whom I consider one of the most technically skilled developers to have touched WLED code) that he has traced WDT resets to NeoPixelBus code consuming too much time bit-banging data out. If you are not using GPIO1, GPIO2 or GPIO3 for digital LED output then the CPU has to keep feeding the LEDs. This in turn reduces performance for everything else. If you use PWM LEDs make sure you only use GPIO4, GPIO12, GPIO14 or GPIO15 (as specified by Espressif technical documentation, https://www.espressif.com/sites/default/files/documentation/esp8266-technical_reference_en.pdf). Do not forget that a PWM signal requires NMI to be driven, hence it uses CPU. |
My test case here is a single strip of 110 WS2812Bs, using a 0_15 branch derived build. Bit-banging for this many LEDs can take several milliseconds with interrupts disabled, which I believe can overflow some of the wifi hardware queues, depending on the amount of traffic on the network. I'm working on hacking some of the interrupt tolerance ideas from FastLED into NeoPixelBus to see if I can mitigate it. If a setup has more LEDs on a bit-banging pin, or a busier network, it might trip problems sooner. Sometimes this might manifest as hard reboots like I'm seeing; it's also possible it manifests as a wifi disconnect. (I'm actually rather surprised I haven't seen that in my testing, to be honest). I will try a 0.14.1 build tonight and see if it behaves differently for me than the 0_15 development branch. It's quite possible this is a different issue than the one I've been chasing. |
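To put a number on "several milliseconds": a rough estimate, assuming standard 800 kHz WS2812B timing (1.25 µs per bit, 24 bits per RGB pixel) and ignoring the reset/latch gap at the end of a frame. The function name is hypothetical, and the 80 and 278 LED counts are simply the strip sizes reported earlier in this thread.

```python
BIT_TIME_US = 1.25   # assumed: one bit at 800 kHz
BITS_PER_LED = 24    # assumed: RGB pixel; an RGBW pixel would need 32


def bitbang_time_ms(led_count):
    """Time the data line is busy for one full frame, in milliseconds."""
    return led_count * BITS_PER_LED * BIT_TIME_US / 1000.0


for leds in (80, 110, 278):
    print(f"{leds} LEDs: {bitbang_time_ms(leds):.2f} ms")
```

For 110 LEDs this works out to about 3.3 ms per frame, and the time grows linearly with strip length, which fits the observation that longer strips on a bit-banging pin may trip problems sooner.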
FWIW, I'm seeing occasional resets on 8266 with 0.14.1 and use LPD8806, so no bitbanging involved. (But it's way rarer than what people are reporting here, I have 48h uptime right now) |
How do you know it is not? If you are using GPIO13 & GPIO14 then yes, it uses HW to accelerate output; otherwise you are using SW (CPU) to drive clock and data. |
So I went downstairs for a snack and noticed the test unit on when it was supposed to be off. It crashed (build 3) after 17 days. Going to try build 4 now. |
At this point I'm thinking none of these test builds are actually "stable" - whatever the issue is, it's lurking at a lower level; it's more that some builds/environments/configs trigger it faster than others. I've put together some code to hook the crash handler and save the stack trace in the flash for later recovery. I've also thrown in some task and interrupt tracing logic I'd written while debugging the PWM-related crashes earlier this year. If a crash is logged, on the next boot the software will write a 'dump.txt' file with the trace to the local filesystem. The file can be retrieved with the Unfortunately this build will not catch "hard watchdog" type crashes; the Arduino core logic for debugging those doesn't have a user code hook, and I haven't got to pulling it in and modifying it yet. If that's what's going on, it'll still dump stack to the serial port, but it won't leave a file behind. WLED_0.15.0-b5_ESP02_test.bin.gz Lastly: this build is also based on the latest 0.15 tip, which has some other improvements that might improve stability beyond the previous builds -- though also some new logic. |
@willmmiles I've installed your test build. If it crashes and it creates one, I'll share the dump.txt here. Thanks!!! PS ... my other 3 units that are running 14.0 have been up since the last power failure ... 31 days and counting, so it's something that changed after that point. |
@willmmiles cool, sounds like something we really should have in WLED. Do you know if the
Well at least you can detect the restart reason on the next boot, so that watchdog aborts would not go completely unnoticed. Example https://github.com/MoonModules/WLED/blob/63ff7205d61c4bdf7e9b952e392222e46b93e1d6/wled00/wled.cpp#L575-L577 |
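For the ESP8266 side, classifying the restart reason could look roughly like the sketch below. It does not reproduce the linked code. It assumes you can read the numeric reset reason on the device (the Arduino core exposes it through `ESP.getResetInfoPtr()`), and uses the `rst_reason` codes from the ESP8266 non-OS SDK. The Python helper names are hypothetical.

```python
# Reset reason codes as defined by the ESP8266 non-OS SDK (enum rst_reason).
RESET_REASONS = {
    0: "REASON_DEFAULT_RST",       # normal power-on start
    1: "REASON_WDT_RST",           # hardware watchdog reset
    2: "REASON_EXCEPTION_RST",     # exception reset
    3: "REASON_SOFT_WDT_RST",      # software watchdog reset
    4: "REASON_SOFT_RESTART",      # software-requested restart
    5: "REASON_DEEP_SLEEP_AWAKE",  # wake-up from deep sleep
    6: "REASON_EXT_SYS_RST",       # external reset
}

WATCHDOG_REASONS = {1, 3}


def describe_reset(code):
    """Return the SDK name for a reset reason code."""
    return RESET_REASONS.get(code, f"unknown ({code})")


def is_watchdog_reset(code):
    """True if the last restart was caused by either watchdog."""
    return code in WATCHDOG_REASONS


for code in (0, 1, 3, 4):
    print(code, describe_reset(code), is_watchdog_reset(code))
```

Logging this once per boot would at least separate watchdog resets from exceptions and deliberate restarts, even when no stack trace is available.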
I hadn't done any research on ESP32 yet. Looks like core dumps to flash are already a feature of ESP32-IDF, we'd just need to figure out how to turn them on and supply a partition for them to reside in. (For the ESP8266 code I cheated and used the OTA space). https://docs.espressif.com/projects/esp-idf/en/stable/esp32/api-guides/core_dump.html
Oh yeah, I've got that in my debug build too. It only goes to the serial port though. I've got the HWDT stack traces enabled in this build too, but they also only go to the serial port. I do think it's possible to upgrade the HWDT debugging logic to stash the trace elsewhere, but it's a bit more work to integrate than the convenient callback hook the Arduino core folks left for the other crash cases. |
Thanks @willmmiles. I’ve updated one of my Athom LS-4P devices with the new test firmware, but it seems like basic WLED functionality is broken - I can’t even switch colors. Any idea on what is going on? I updated from 0.14.0 and tried re-flashing and restarting a couple of times with no luck. Downgrading to 0.14.0 restores all functionality. |
@kenni Thanks for giving it a try! It sounds like the index page isn't loading completely, so elements are missing and the javascript code fails. Can you try connecting with a desktop web browser, ideally with the "developer tools" enabled in the network panel? The index page should be 44679 bytes in size. Also please look in /edit for a dump.txt. (You can check /edit even with the old firmware, the filesystem persists across versions). |
@willmmiles The index page seems to be complete to me. The HTTP response header advertises that the content-length of the file is 44679, as you expected. The transferred file has a size of "45kB" according to Chrome and if I look at the content of the file, it ends with "</html>" on the last line. So it seems complete. When I access /edit there are only two files available: cfg.json and presets.json. EDIT: Factory reset fixes the Javascript-issue, so my old configuration apparently isn't compatible with the new version. Restoring the configuration file on the new firmware reintroduces the Javascript error. Downgrading firmware to 0.14.0 and restoring the configuration file works perfectly. EDIT 2: The cause of the configuration error seems to be the assignment of LED Data GPIO. The correct pin for my controller is GPIO1, and this works in 0.14.0, but selection of that pin is not allowed in the GUI in 0.15.0-b5. EDIT 3: Ahh, seems like the stock 0.15.0-b5 doesn't reserve GPIO1... @willmmiles, are you perhaps using GPIO1 for debugging or something else in your build? Any chance you could generate a build where GPIO1 (and GPIO12 for relay) are unused? I can't physically move any wires, as I'm using a factory-made Athom LS-4P all-in-one controller. |
I haven't made any changes to the default pin settings for the esp8266_2m build. The test build is based on the 0_15 tip, which includes a significant change to the bus and pin management past the -b5 tag; I'll review the logic and see if I can find out why pin 1 is disallowed. |
Ah, it's not a new thing at all - this build has debug messages enabled, which are sent to the serial port, for which pin 1 is the transmit pin; so WLED reserves it for that purpose. Unfortunately we don't yet have a good solution for collecting regular debug logs internally for post-mortem storage. The new code here handles only stack traces, and even then they'll also be echoed out the serial port by the Arduino platform code. I hate to say it but your hardware might just not be suitable for software debugging with this build. :( Sorry! |
Ok, that was also my conclusion after coming across a comment in the source code mentioning GPIO1 for serial communication when doing a debug build.
It's too bad, I'll just cross my fingers that someone else has suitable hardware and will be able to test your builds. I would love to get the esp8266 back in a working state with latest WLED versions. Thanks for all of your time and willingness to fix this :) |
@willmmiles ... just reporting in that your latest test build has not crashed, still up from when I installed it 5 days ago: |
OK, so most likely a hard watchdog crash then, indicating we're getting trapped in an interrupt handler. I'll implement HWDT stack tracing next, I guess! |
Today's update:
New build to follow when I isolate the wifi crash. |
My crash case turned out to be that I'd corrupted the SDK nonvolatile storage with a bad build; clearing it restored my unit to working properly. The curious thing was that I could consistently trigger the crash by changing the wifi settings, or opening the AP while it was searching, but if it connected successfully it seemed stable for a long time (overnight, at least). Has anyone upthread tried a complete erase_flash? Here is the latest test build:
WLED_0.15.0-b5_ESP02_test.bin.gz |
@willmmiles ... I finally found a dump.txt, it's from the build before your one from 2 hours ago: https://gist.github.com/Scope666/40f8d3fb0ea9ea28c9fdf712008581c8 Hope it sheds some light... |
Hm, interesting! Looks like the wifi subsystem got stuck. There are no wifi interrupts, but the main code keeps trying to yield() since the wifi task is flagged as "needs attention" -- and yet it seems as if it's returning without clearing the flag. Could be that it can't allocate memory for packet buffers; could be SDK corruption confusing the wifi module like I saw on my board. (This is why I added the whole core dump -- it'd still be a pain to trawl the heap data structures and see what the state is, but it'd at least be possible.) My quick recommendation would be to try the SDK reset and see if it improves things. |
Before I do it, will it put up an AP so I can flash it again? It's in a spot that's kind of difficult to get to. (under a cabinet, wood blocking access) I'm running the build you posted yesterday, but I'll try the reset if you think it will shed some light. |
Oh! The "SDK reset" doesn't reset the WLED configuration -- it should boot up like normal and maintain all of your settings, including your wifi SSID and password. The SDK reset just clears something (??) that the underlying Espressif platform code maintains from one boot to the next. The platform code is designed to "start clean": if that flash block is erased, it will re-measure or rebuild whatever was stored there -- it just slows down the boot for a little bit. That said, this build only resets the "SDK" flash region, not the "wifi calibration" region, since that's what the exposed Arduino framework function does out of the box. If this doesn't improve anything, we can also try resetting the wifi calibration region next. |
Ok, I did the reset, and it successfully deleted the reset.sdk file I created ... fingers crossed. |
…ircoookie#3690 and Aircoookie#3685) some users have reported that releases after 0.14.0 are not working reliably. So we add a few "compat" for 8266 that try to reproduce the buildenv of 0.14.0 as much as possible.
* platform and platform_packages from 0.14.0
* not using PIO_FRAMEWORK_ARDUINO_MMU_CACHE16_IRAM48
* due to smaller IRAM, we had to move some functions back from IRAM to normal flash (may cause slowdown)
@willmmiles, ok, had a crash, and I'm attaching both files created here: |
I've taken a quick look at the crash data. The actual point-of-failure is definitely a "wifi resource exhaustion" type fault; but there's no evidence of the "failure to yield()" people always blame this kind of thing on. There are a couple of interesting bits though:
Otherwise I'd say it seems as if it's operating normally right up to the point of failure. I'm still digging deeper; maybe I can find out what the recent wifi packets were. |
@willmmiles ... some additional info if it helps. Preset never changes... an automation fires in the morning from HA to turn it on, and another one fires around midnight to shut it off. I know it was discussed there's some routine polling that HA does, that could be a factor. The AP the unit is connecting to is a Ubiquiti U6 Pro. The 2 pre-shared keys determine whether it connects to main or IoT VLAN: Again, a reminder, the 3 units running on 14.0 have been up for over 61 days, that version seems happy with the AP / settings. |
Still digging through the crash dump. The actual fault was precipitated by a timer which triggered the wifi subsystem. My best guess is something timed out, which triggered a flurry of wifi activity, resulting in wifi resource exhaustion of some kind (presumably packet buffers, but this is speculation as we don't have a register map for the wifi interface); whatever it is, the wifi driver responds by hanging the system and triggering the HWDT. WLED itself seems to have been running fine, yield()ing to the wifi stack in reasonable periods of time (<200us), but this wasn't enough to keep up?? Recent IP traffic seems to be all HA polling stuff. I'm investigating the timer code now, to see if I can backtrack to identify why the wireless code is waiting. One thing I have observed with my test device is that when I'm running with HA disabled, the web interface reliably loads quickly. With HA enabled and polling the device, half the time I can't get the web interface to even load completely; the system behaves like there's a ton of packet loss, with connections timing out at both ends. I don't have an obvious explanation; heap metrics look OK whenever packets do make it to the processing layers (i.e. room remaining for 5 or 6 packets), and I haven't had any issues other than heap exhaustion with parallel load tests in the past. So possibly there's something else going on; I might take a deeper look at the websockets code, as that's something both the web interface and HA use that I don't have test cases for - maybe there's a bug lurking there. |
Just a quick note and thanks to Quindor for pointing out the multicast flood issue. Assuming that's in the mix here. But it was obvious for me that a third party was at play when two instances went down within seconds of each other. |
Just a quick note that I tried the "compat" build of 15.0 final and it still has the crashing issue. 14.0 NEVER crashes on any of the 4 devices I have. |
What happened?
I have two instances of WLED running on two separate ESP-12F (I believe they are 8266 based?) modules. To be specific, it's this module (not the esp32, obviously). They are wired with different types of LEDs. One is with a WS2812B LED Strip and the other is a more generic LED string that has R|G|B|12V as the inputs, as opposed to 5V|Data|Ground that the first has. I'm not sure that will make a difference. But, I included it as it might be important to note. I just got them both running a week or two ago with WLED 0.14.0 and added them to Home Assistant. Everything worked as expected, I have been using presets and playing with the effects and colors on both. I even have a
However, I updated to 0.14.1 today and the ESP connected to the generic LED strip started turning off when I changed the color. It does that for a split second, and then I notice that the light switches back to the default orange color. So, I kept testing and it kept happening. Then, I noticed that for a split second after this happens the web interface is unresponsive. This leads me to believe the light is restarting.
I have been able to fix this for now by going to the update section and giving it the 0.14.0 interface. But, if I can give any assistance in finding this issue feel free to reach out and I will put 0.14.1 back on it if there is any form of logs or anything I can provide.
To Reproduce Bug
Update to 0.14.1
Press most any button in the interface.
Expected Behavior
I would have expected it not to crash.
Install Method
Binary from WLED.me
What version of WLED?
WLED 0.14.1
Which microcontroller/board are you seeing the problem on?
ESP8266
Relevant log/trace output
No response
Anything else?
No response