Issues on ESP8266 with 16MB flash #77
Comments
Thank you for the effort you have put into this. Just a few words on the topic for now; I'll try to write a bit more in a few days.

The issue 7095 you mentioned: I believe it is about a 16M filesystem, not the chip size. AFAIK SPIFFS ignores all unused flash, so from the data-structure point of view the filesystem is 2M.

I have observed that behavior, and it is one of the reasons we don't sell the Wemos D1 mini Pro. Apparently boards from some manufacturers are fine and work well, while others have this problem. So I'm not sure whether this is a software or a hardware problem; I can't rule out either cause. Does your Wemos have the antenna/IPX selection jumper placed at a 45° angle to the ceramic antenna?

Regarding changing the design of how webconfig works: I tested such an approach, but since the current ESP8266WebServer is single-threaded and blocking, performance (as seen by the user) was dramatically bad. I also tested replacing it with ESPAsyncWebServer; the user experience was much better, but I had some other problems with that server. Unfortunately I can't find my notes from this project, so I can't say exactly what the problem was. I believe that async_web is this code.
No, my "pro" board has an antenna resistor at 90° (it's that weird rare variant).

After running into some issues with the standard 4MB Wemos D1 Mini (the one sold as part of the NAM kit), I'm no longer sure that SPIFFS causes all the issues (though it probably still makes them harder to debug). My NAM 0.3.3 box had been working for weeks, but today it disconnected from Wi-Fi and refused to reconnect in a stable manner (when it did connect, every endpoint was painfully slow). This was the first time I investigated the issue live, but it wasn't the first occurrence (previously it always fixed itself). After desperate attempts at re-flashing, I even hit full firmware crashes, but those were probably caused by a corrupted config.json stored on SPIFFS.

I thought it could be related to esp8266/Arduino#6007 or esp8266/Arduino#5493, but my ancient Wemos from NAM 0.2.1 exhibits the same issue with wake-ups; however, it never actually failed on me with Luftdaten/NAMF. I validated the flash by dumping it, and it works fine. I even tried lowering Wi-Fi TX power to try to keep the flash stable. I've seen similar issues with SPIFFS + multipart HTTP + ESP8266 reported in multiple places, but nothing really stood out. I even tried building an image with even more debug output and random changes to

After some tweaking, I reset TX power to max (20.5, even though the code accepts any number) and changed Wi-Fi mode to 802.11b. I never had luck with N when connecting to my Wi-Fi router, so a long time ago I assumed it was some compatibility issue and stuck to G. However, in B mode the connection is the most stable and retransmits are very rare. My current theory is that the ESP8266 running NAMF is somehow so overloaded that it struggles to run the more modern variants.

The most bizarre twist: that 16MB "pro" board started to work with 802.11b. It's relatively slow, but it renders the config page nearly 100% of the time. With 802.11b and 20.5dBm TX power on rc5, the config page renders on those boards in:
That suggests off-brand SPI flash is significantly slower, at least with SPIFFS.

@netmaniac I've seen some work done on the ESP32 variant; how is it going? Ideally, we'd need a new main board to support it, but I think that using wire spaghetti to adapt some of the available ESP32 dev boards would be worth trying, to fix performance issues in the vast ocean of non-genuine D1 Mini variants. ESP32-CAM seems the easiest board to adapt thanks to its low price, similar size and built-in external antenna connector (we could just ignore the dangling camera and take advantage of microSD for log storage).
I am beginning to wonder if off-brand SPI flash is just an indicator of a poor-quality ESP8266 dev board, rather than the root cause.

After some tweaks to my Asus Wi-Fi router (I lost track of what was causing most of the issues, but most likely Airtime Fairness has to be turned off) and sticking to

I developed a "test suite" that runs curl to

One thing that should be ruled out is my Wi-Fi setup; ideally, to get reproducible results, the tests should be re-run on some Raspberry Pi. The first thing I checked was
And this seems to make sense after reading https://arduino-esp8266.readthedocs.io/en/latest/esp8266wifi/client-class.html#setnodelay and the linked Wikipedia article about delayed ACK, especially when using slow 802.11b/g with the ESP8266 outdoors (plus the slow flash probably doesn't help).

This is an example of transmission with Nagle enabled (

And this is with Nagle disabled; chunking only happens to fit into Ethernet frames:

Again, it seems that if anything goes wrong with a single TCP packet, the poor ESP8266 has to retransmit it, deal with other packets that are now out of order, and ultimately fails to render something that makes sense in a reasonable time, forcing the client to just terminate the connection.

I'll post full test results for Wi-Fi mode x Nagle x board (4MB & 16MB from 2024 plus 4MB from 2019 that's a genuine Wemos with 99% certainty). Although from partial tests with the 16MB board switched from B to G, I already started to see the original behaviour with extremely long response times. I'll also try to add parameters for Nagle and some timeouts, especially for the Wi-Fi connection, as with N it often fails to connect within the time limit :/
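For readers unfamiliar with the option discussed above, here is a minimal client-side sketch in Python of what "Nagle disabled" means at the socket level. It is not the NAMF firmware code (on the ESP8266 the linked documentation describes `setNoDelay`); it only illustrates the standard `TCP_NODELAY` socket option, and the helper name `make_nodelay_socket` is made up for this example.

```python
import socket


def make_nodelay_socket():
    """Create a TCP socket with Nagle's algorithm disabled.

    Hypothetical helper for illustration only: with TCP_NODELAY set,
    small writes are sent immediately instead of being coalesced while
    waiting for outstanding ACKs.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A non-zero value turns TCP_NODELAY on, i.e. disables Nagle.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

The socket is not connected here; a caller would still need to `connect()` it to the device and close it afterwards.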
The mystery of N mode not working is solved: AP mode is known to work only in B/G. STA should work, but Asus (and some other vendors) are stubborn about following the 802.11n specs and silently ignore connections from certain SDK versions that do not advertise WMM (it manifests as an established link-layer connection with no connectivity, most notably no DHCP). This is best described in esp8266/Arduino#7965.

Moreover, the linked issue also mentions poor-quality crystal oscillators that may also be influencing connectivity, especially with TCP. I constantly observe random disconnects on all new (aka non-Wemos) boards when used outside. If that's confirmed, then I think there's no future for supporting ESP8266 if we can't get good-quality boards (and we can't get Wemos ones, since they no longer make them).

For me, that was the last nail in the AsusWRT coffin, so I'll soon have a Mikrotik setup to play around with and determine the true root cause of those connectivity issues on the NAM end. If nothing else, I'll finally get rid of the incoherent mess that is Wi-Fi configuration on AsusWRT...

But for now, I added
Summary
When you use an ESP8266 board that has 16MB flash instead of 4MB, system performance is non-existent, most notably the config page does not load 99% of the time.
Tests
Scenario:
- `NAMF-2020-46rc5-en.bin`
- `time curl -v http://192.168.4.1/config -o /dev/null` several times
Expected outcome (true on same board but with 4MB flash):
Actual outcome:
- `curl: (56) Malformed encoding found in chunked-encoding` in various places (i.e. it's not a single chunk)
Other observations:
- `/` redirects to `/config` and the web browser hangs on `/config`, which may lead to confusion on the user side
- `/config.json` is always working
Background on flash itself
ESP8266 usually has 4MB SPI flash. However, some boards marketed as "WeMos D1 Mini Pro" that are preferred for their external antenna are sold with 16MB SPI flash.
That flash chip can be of varying quality, but all my boards use `Zbit ZB25VQ`. Additionally, I'm quite certain that flash quality can be ruled out, as flashing and dumping speeds were consistently the same on all boards.
Moreover, I tested all combinations of the flashing parameters below (in `esptool.py`):
While all of those affect the actual runtime (NAMF reports it) and there are some minor performance differences, none of them solves the issue. The sample may be too small, but the best results were with the 16MB/QOUT/40MHz combination, where the config page loaded in 1.5-3 seconds when it didn't break, and I had a success rate somewhere between 50% and 75%.
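Success rates and load times like the ones quoted above can be collected with a small harness. The following Python sketch is only an assumption about how such a measurement could be done, not the script actually used here: the function names `fetch_config` and `measure`, the 30-second timeout and the default of 20 runs are all made up for illustration. Only the URL comes from the scenario above.

```python
import time
import urllib.request
from statistics import median


def fetch_config(url="http://192.168.4.1/config", timeout=30.0):
    """Download the config page fully.

    Raises an exception on network errors, timeouts, or a response
    that cannot be read to the end.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def measure(fetch, runs=20):
    """Call fetch() repeatedly; report success rate and median duration."""
    durations = []
    failures = 0
    for _ in range(runs):
        start = time.monotonic()
        try:
            fetch()
        except Exception:
            # Any error counts as a failed page load in this sketch.
            failures += 1
            continue
        durations.append(time.monotonic() - start)
    return {
        "runs": runs,
        "success_rate": (runs - failures) / runs,
        "median_s": median(durations) if durations else None,
    }
```

With the board's access point reachable, `measure(fetch_config)` would return a dictionary with the fraction of successful loads and the median time of the successful ones. Passing the fetcher as an argument keeps the measuring logic testable without a device.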
SPIFFS uses flash to store configuration, and it seems that it's read many times during config page rendering.
The chip I have on the 16MB board seems to be relatively rare, but some people reported similar issues on completely different platforms using ESP8266. For reference, it's `25VQ128DSJG`.
My suspicions
After some digging, I think the culprit is SPIFFS on 16MB flash in general.
NAMF uses the `4m2m` layout by default, which means 4MB total flash, 2MB for SPIFFS, leaving 2MB for the compiled program. I tried to rule out some other aspects of that profile that are implemented in Arduino, like block size etc. The motivation is that SPI flash chips actually have an internal cache, and the cache miss ratio could be high due to different sizes of data. To achieve that, I tried to build an image of NAMF with `16m2m` and `16m1m`. This yielded little to no improvement.
In other words, I tried both:
Unfortunately, there's no easy way to create a custom profile like `16m14m`, as those are compiled into the development tools and not parametrized.
esp8266/Arduino#7095 (comment) seems to confirm that SPIFFS is expected to struggle on a 16MB chip, and it doesn't matter how much of the memory is actually allocated.
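To make the layout naming concrete, here is a small Python sketch that splits names such as `4m2m` into sizes. It only follows the reading given above (first number is total flash in MB, second is the filesystem size in MB) and handles only names of that exact form; the function name `parse_layout` and the returned keys are invented for this example.

```python
import re


def parse_layout(name):
    """Split a layout name like '4m2m' into sizes in MB.

    Hypothetical helper: assumes '<total>m<filesystem>m' with both
    values given in megabytes, as described in the text above.
    """
    match = re.fullmatch(r"(\d+)m(\d+)m", name.lower())
    if match is None:
        raise ValueError(f"unsupported layout name: {name!r}")
    flash_mb = int(match.group(1))
    fs_mb = int(match.group(2))
    if fs_mb >= flash_mb:
        raise ValueError("filesystem must be smaller than total flash")
    return {
        "flash_mb": flash_mb,
        "fs_mb": fs_mb,
        "remaining_mb": flash_mb - fs_mb,
    }
```

For example, `parse_layout("16m2m")` describes 16MB of flash with a 2MB filesystem and 14MB left over, while `parse_layout("16m14m")` would leave the same 2MB outside the filesystem as the default `4m2m`.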
I also tried changing the code of the config page render process to send chunks more often and to include free memory stats in the HTML response, but it looks like that's not the core of the issue. However, after random poking around the code, it looks like memory is not explicitly freed: it's reserved on #77 and probably should be freed here: #77
Proposed solution
At the moment, we can at least put a huge warning in the docs that 16MB flash is problematic.
Ideally, SPIFFS should be replaced with LittleFS, as SPIFFS is deprecated and expected to be dropped from the ESP8266 Arduino SDK completely. On the other hand, it looks like LittleFS has its own issues and may not solve the 16MB variant issues alone.
Web UI rendering rewrite
I was trying to figure out what's the core of the problem, but this spaghetti code originating from the original project is a nightmare. A few months ago, when I was considering porting some feature from the original project (SSL for InfluxDB) I was scared by the number of places I'd need to change to add a simple option.
I hope to find time to work on my fork that would replace HTML and JSON rendering that's currently a series of functions with a more modern approach: static HTML and JS files hosted from SPIFFS and only dynamic content implemented by serving JSON (it can be easily rendered from structs).
This Web UI already uses plenty of JavaScript and XHR to load content (e.g. WiFi list), so I believe it wouldn't be a big deal (i.e. there are no hardcore users with NoScript). Moreover, all endpoints used in external integrations could be left as-is, so it'd stay compatible. Furthermore, it's a weird decision to perform heavy tasks like rendering templates and actual HTML tags on ESP8266, when client devices have several orders of magnitude more power and memory to render those on the frontend.
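As a language-neutral illustration of "dynamic content as JSON rendered from structs", here is a tiny Python sketch. It is not code from NAMF or from the fork; the firmware itself would do this in C++, and the `SensorConfig` type, its field names and `render_config_json` are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SensorConfig:
    # Hypothetical fields, chosen only for illustration.
    wifi_ssid: str
    sending_interval_s: int
    influx_ssl: bool


def render_config_json(cfg):
    """Serialize the whole config struct in one step.

    The static HTML/JS front end would fetch this JSON and build the
    form itself, so the device never renders HTML tags.
    """
    return json.dumps(asdict(cfg), sort_keys=True)
```

Adding a new option in this style means adding one field to the struct, rather than touching several hand-written HTML rendering functions.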
My fork is https://github.com/danielskowronski/namf/tree/rewrite_webui