ESP32 crash with ESP_ERR_NO_MEM in NimBLEDevice #179
Comments
Thanks for reporting this, I believe you will find the cause of the no mem error to be that I don't see how that error could affect the SPIFFS data though, that's quite odd. |
I agree that an effect on SPIFFS seems unlikely; it's just that when these crashes occur this seems to be a symptom, and the exact cause might be somewhere else that is not coming up in the logs. Overall, this and the other issue both come down to supporting BLE and WiFi concurrently. I can't say just yet if there is something inherently wrong with the device or the libraries involved. I have managed, though, to greatly improve stability by managing tasks using a state machine. |
I feel your pain lol. Looking through the IDF commit logs regarding the ble-wifi coexist stuff it's a never ending stream of patching, so they are having difficulties with this too.
This is what I do as well, wifi, ble, mqtt, webserver, etc all in separate tasks with a few semaphores spread around, then just trigger each task as it's needed. Much more manageable as it takes away some of the randomness. |
The biggest stability improvement I got was to put BLE task on Core 0 and ESPAsyncWebServer on Core 1 - whether this really matters I'm not sure but I have one controller running 21 hours without a crash. I have another controller based on same state machine but with a couple of additional tasks. It's better but still crashes with the same errors. I'm still entertaining the idea of adding a second ESP32 to my PCB and having a dedicated BLE scanner to deliver my solution. |
Yes, that does help. In fact I just added some documentation recommending this, as it helps with more than just your issue. Unfortunately I have not had time to look into the async webserver + BLE issue lately as I've been trying to push a new release. |
I know you don't want to hear it (as it's a huge change for the same functionality) but I'm 100% stable (many units with weeks of uptime) using just the regular webserver + client side javascript for parsing and pulling data. I'm not scanning constantly but I've never had an issue during scans with this setup. I'm not running the webserver lightly either as I'm using it to stream debugging output to the client (webserver does slow when scanning). I spent many weeks this summer trying to get ESPAsync to play nice with bluetooth and it was nothing but a steady +2 bugs for every 1 fixed. |
Actually I'm glad to hear of your success with the standard web server, and that you can confirm that the combination of BLE and ESPAsyncWebServer is not stable at present. I'd switch in a minute, but I depend on a framework that would take a lot of work to migrate. It's definitely on the cards though. |
Doing some reading and found me-no-dev/ESPAsyncWebServer#876 Maybe modifying async as described in that thread would work? |
Thank you! This looks very promising indeed - just tried the code changes to AsyncTCP.cpp - so far so good, I can see the response difference immediately. I'll leave it running overnight and see how it looks tomorrow. |
Let me know if that works out for you, would be great to give the author of that issue feedback as well. |
I captured this in the logs: [E][WiFiClient.cpp:392] write(): fail on fd 57, errno: 11, "No more processes" |
That's unfortunate, that backtrace looks much different than the ones discussed in #167. I'm no expert with the webserver/async stuff, but are you using SSL? If so, there are issues with that and async webserver that I have seen, and also reports of SSL over MQTT and BLE concurrency problems. I don't have any other suggestions at the moment 😞 |
I've made some progress and resolved some of the stability issues. This issue actually occurs due to running out of memory, and I have noticed a pattern. I run the BLE scan as part of a state machine using FreeRTOS, which gives each task a slot to run in. The BLE scan task runs for 5 seconds and is repeated in a 60 sec window or until a beacon is discovered. However, what I notice is that after each scan the heap is gradually consumed, to the point where it gets to less than 35k, and this is usually where the crash occurs. Could this be a memory leak, and is there something I could try to alleviate it? |
I don't think there is a memory leak here. You are printing the free heap as it is scanning and you can see it is recovered eventually. I suspect over a longer period you would see all of the heap recovered but there would need to be enough time in the idle task to do garbage collection. The amount of change you see is related to the number of devices advertising and after clearing results there is a delay in free heap recovery. |
Any chance (to prove your point) you could stop scanning if the heap gets below 33000 but keep printing heap messages to see if the garbage collection @h2zero mentions starts clearing it up? |
Good suggestion - I'll try that. |
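The experiment suggested above (pause scanning once the free heap drops below a threshold, keep logging, and watch whether the heap recovers) could be gated with a small helper along these lines. This is only a sketch under stated assumptions: `ScanGate`, its method names and the resume threshold are hypothetical and not part of NimBLE; on the device the `freeHeap` argument would presumably come from something like `ESP.getFreeHeap()`.

```cpp
#include <stdint.h>

// Hypothetical helper (not part of NimBLE): decides whether the next scan
// may start, based on the free heap value supplied by the caller.
class ScanGate {
public:
    ScanGate(uint32_t stopBelow, uint32_t resumeAbove)
        : stopBelow_(stopBelow), resumeAbove_(resumeAbove), allowed_(true) {}

    // Feed the current free heap in bytes; returns true if scanning is allowed.
    bool update(uint32_t freeHeap) {
        if (allowed_ && freeHeap < stopBelow_) {
            allowed_ = false;   // heap too low: pause scanning
        } else if (!allowed_ && freeHeap >= resumeAbove_) {
            allowed_ = true;    // heap has recovered: resume scanning
        }
        return allowed_;
    }

private:
    uint32_t stopBelow_;    // pause when free heap falls below this
    uint32_t resumeAbove_;  // resume once free heap reaches this again
    bool allowed_;
};
```

With `ScanGate gate(33000, 40000);` the scan loop would call `gate.update(...)` once per cycle, skip the scan whenever it returns false, and keep printing the heap either way. The 40000 resume value is an arbitrary assumption chosen to avoid toggling right at the 33000 threshold.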
I believe I have an answer for this. It's the scan results vector increasing in capacity as advertisers are found. The more advertisers found, the bigger the capacity, but when it is cleared it does not reduce its capacity back to 0. I modified the BLE_Scan example to test:
Here are the results:
Most notable are the I have also tried reserving 100 in the NimBLEScanResults constructor and that resulted in no reduction in heap at all except for the fact 100 pointers were reserved right at the start and are wasted ram if not used. |
Here is the log using reserve(100):
No change in I will think about this in the upcoming advertised device / scan patch that's currently being developed. If you want to test this just add
to the NimBLEScanResults class in NimBLEScan.h. |
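The capacity behaviour described in this comment can be reproduced off-device with a plain `std::vector`. The following is a minimal sketch, not the library's code: `CapacityReport` and `measureCapacity` are hypothetical names, and a vector of `int*` merely stands in for the scan results container. It also shows one common way to hand the memory back (swapping with an empty vector); whether that matches the change suggested above is not shown in the thread.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical record of the vector's capacity at three points in time.
struct CapacityReport {
    std::size_t afterFill;     // after storing one entry per advertiser
    std::size_t afterClear;    // after clear(): elements gone, storage kept
    std::size_t afterRelease;  // after swapping with an empty vector
};

// Stand-in for the scan results: a vector of pointers that grows as
// "advertisers" are found, is cleared, and is then forced to release memory.
CapacityReport measureCapacity(std::size_t advertisers) {
    std::vector<int*> results;
    for (std::size_t i = 0; i < advertisers; ++i) {
        results.push_back(nullptr);
    }

    CapacityReport report;
    report.afterFill = results.capacity();

    results.clear();                       // size becomes 0, capacity is unchanged
    report.afterClear = results.capacity();

    std::vector<int*>().swap(results);     // trade storage with an empty temporary
    report.afterRelease = results.capacity();
    return report;
}
```

Calling `measureCapacity(100)` should report the same capacity before and after `clear()`, and a smaller one after the swap, which is consistent with the heap only being returned once the vector's storage itself is released.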
Thanks so much - your example gave me the clue - if you see my code, I had the delay after the if check to break. This means that I was not allowing any chance for the heap to recover. I moved the delay to just after the clearResults() and now the heap is holding within a safe range. |
I've reached the point now, where I just don't think BLE is stable on ESP32. Probably NimBLE library is good enough for most use cases. But running even a modest number of tasks and using a web server just doesn't seem to be a good combination. At least I haven't been able to get the sort of reliability that I think should be possible. Today I got this for the first time. |
Yes, BLE is tricky to get playing nice with others, specifically others using the antenna. I can say from experience that it can be stable, it "just depends"; there are always bugs somewhere that can get triggered. That looks like you're running out of heap. |
Actually coming from a long time amateur radio operator, it's freaking magic and I have absolutely no idea how they can get it to work at all with one radio switching that fast. I'm just going to continue to point my finger at the asyncwebserver library. No issues I can't solve in an hour of fiddling since I moved away from it. |
Point taken, thank you. As I have much invested in the features built on said web server, I'll have to sacrifice BLE for now in the interests of getting a working product. This issue can be closed. |
That's understandable, sometimes it's best to drop something that doesn't work and just come back to it later. I will close this for now then, please re-open if needed. |
I have found this error which is related to #167
The NimBLEDevice code

The consequence of this error aside from the ESP32 crash is that some part of the flash memory seems to be corrupted. I am storing JSON data in the SPIFFS and often when this crash happens the data is corrupted and I have to persist it again.