-
-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Panic reset when scanning and several concurrent http requests occur #167
Comments
I can only add that I fought the exact same issue for a couple weeks with espasyncwebserver and the old Bluetooth library. I ended up dropping back to the regular webserver and using client side JavaScript to provide the functionality I needed. |
Well I cannot say for sure what is happening here but I found this interesting, in 7 this assert is triggered. This is a message from the controller that it has encountered a hardware error. Not much to work with here, except perhaps a power issue. If not that, then perhaps a bug in the controller, guessing by the previous area of the backtrace (who knows here, it's black magic). One other thing it could be that could be tested however, is the commands being sent from core 0. This is a modification made to the esp-nimble files to address delay issues when sending commands from core 1. If you revert this commit and test I'd be very curious of the results. |
Hi. I have run into the same issue running an http server using ESPAsyncWebserver and scanning BLE - my understanding is that this is hardware problem in that the WiFi and BLE share the same hardware radio. Meaning that you have a harder software problem to balance scanning BLE for devices and servicing Wifi packets as they share the data pipeline. I've not found a solution yet and may have to make a separate controller for BLE scanning as the ESP32 crashes often or the http server takes as long to respond as the shortest scanning cycle. It seems while BLE scan is in progress you can't service http requests. I'll share my findings if I manage to come up with a workable approach. |
Nimble works perfectly fine with me with all of these libraries, even during scan. It's just the espasync . Which sucks because it is a nice webserver. You can however spin off the regular webserver below on its own task and it works reasonably well. Good luck! #include <WebServer.h> |
I'm one of the unnamed co-conspirators.
I can help rule-out the power unless it's a design flaw in the dev board itself (Lolin D32 Pro.) There's three of us working on this with similar results, it seems unlikely all of us have flakey power. I'd also expect to see a reset caused by brownout, not a panic. Unless you see something I do not?
@justoke which exception decoder are you using? I've had horrible results trying to consistently use those, but that would be a nice thing to figure out. I am using the esp32_exception_decoder built in to PIO and I get this which does not show the same information:
Rolling back to use the synchronous web server is an undesirable outcome. The performance difference between the two is quite pronounced for this application. Unless I inline all of the code (undesirable from a re-use and maintenance standpoint), a single connection would be quite bad.
The web server (async specifically) is pinned to core 1 per this note. Removing that makes things quite a bit worse, I did attempt to pin NimBLE to core 1 as well with no improvement. Reverting that commit has proved challenging for me, and I'm sure it's something I am forgetting but I don't see it:
I appreciate the folks here willing to help out. If it's something we need to bring upstream we can obviously do that. It's just quite a bit more challenging to work there. |
I think this might resolve your issue on ESP32. Add this define you your code - in main.cpp if you are using Platform IO This will pin the AsyncWebserver which runs as a FreeRTOS task on CORE 1. I tried this and although the web server can still get blocked by the BLE(apparent because the web server response times are longer when the scan is in progress), it has not crashed since I tried this and often it responds in sub 100ms times which is as good as I could hope for. I implemented some other things specific to my use case - like blocking my loop() task logic while updating data from the BLE onResult() event. |
Yes, sir - already doing that. My comment about it was not horribly clear in retrospect. I've edited my post above. |
@lbussy I used this decode the exceptions https://github.com/me-no-dev/EspExceptionDecoder. Although I mainly use PIO, this is the only thing I use Arduino IDE for. |
I had replied before I saw your comment. I'm happy to try help you. I was definitely in a world of pain before I made the core assignment. Of course my use case will differ from yours, but maybe there is an approach to what I have done that might help you. |
Thanks - I really hate going back to the Arduino IDE after PIO. So here's something interesting. I have a fork of ESPAsynchWebServer I was using elsewhere and I pulled that in instead. The behavior is much improved (but not perfect). Here's the PR I merged in my fork: https://github.com/me-no-dev/ESPAsyncWebServer/pull/836/files As you can see the difference is basically cleaning up connections. It's not making perfect sense yet as I've not had my coffee - but you are 5 hours ahead so maybe it strikes a chord with you. |
I agree - and we're using that:
So aside from the PR up there having a net positive impact, if I remove the We're poking in the right place for sure. |
If the regulator is weak, which it may be on this board, the voltage might dip low enough to cause an error but not low enough to trigger the brownout detection (2.43v default in arduino). So if your regulator dropped to, say, 2.9v, that might be enough to trigger this. Just speculating here though, you would have to measure with a scope or meter to detect min levels. Hopefully your PR there will lead to a software resolution though. |
FYI, you can still work in PIO and just use the trace debugger |
So I've had 5 hrs of up time on my ESP32 and not a single panic crash. The basic app use case is as follows: To test the web api response, I fired up Postman and fired requests at 500 millisecond intervals for a few minutes. What is apparent is during the BLE scan response time goes up to 4s, otherwise typical response is sub 200 millisecond with 80 to 90 ms being fairly average. |
Stopping the BLE scans while servicing the jquery (which are JSON requests) is a possibility. That might help. Thank you for your ideas - there might be something in there we can use! I wonder if the AsyncWebServer can be paused during BLE scans - sort of the opposite. |
He has 40 open PRs going back to 2017. I'm not going to hold my breath. |
I do this - so when BLE onResult() event finds a matching device that I am after, it reads the sensor data and sets HasChanges to true. In that state I don't process web requests until the changes have been processed by the relevant observers. In practice this doesn't take long and the stability gained is worth the slightly slower request response. |
I'm using the latest version and haven't had to rely on any PRs which is preferred in my case as I don't want to have to keep track of such nuances in my projects |
Yeah, that would be fine if I was synchronous, I'm not. I hope to not have to go back to the other webserver. If I can somehow "pause" the async, that might help. |
Are you able to completely shut down the async webserver then restart it?
Last I tried, it led to a kernel panic but that may have been my error.
On Thu, Jan 7, 2021 at 1:25 PM Lee Bussy ***@***.***> wrote:
Yeah, that would be fine if I was synchronous, I'm not. I hope to *not*
have to go back to the other webserver. If I can somehow "pause" the async,
that might help.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#167 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AEEO2KFZQQDLYJRUBPP4YXLSYYDCFANCNFSM4VYJK3BA>
.
--
*----------------------*
*Anthony Doud*
|
I reverted the mentioned commit and it did not appear to make much difference as both myself and my fellow conspirators continue to experience the PANIC. Thank you for providing the suggestion. We'll continue to brainstorm ideas.
|
I do this in a different project ( |
Thanks for letting me know. Could I ask what your scan interval and window values are? I know you've tried adjusting them I'm just curious. If I get time tonight I'll load up my async server BLE project and see if I can reproduce. Also is the backtrace the same every time or does it change? |
Here's our init:
Between us, we get a slightly different backtrace, but we each keep the same one if that makes sense. Happy to share a link to this project if that will help you help us. |
Those values look pretty tight in reality if you are sharing the antenna with a webserver. You've got 20ms open for the webserver to do its job between scan windows. If your latency to the server is more than that you'll have issues. Try increasing the interval by 200, you'll get less responses on BLE but it may solve the abort. You can work your way up or down from there with those values to find the sweet spot. |
I had randomly chosen some values with (as expected) no improvement. Can you recommend what might work? For reference, we're working with up to eight devices that are BLEacon-only. |
Only values I can suggest would be setting the interval much higher and keeping the window as is. Something extreme, say 5000 interval. This is not an easy issue to solve for sure and the feedback we have is not ideal. We don't know where or why the problem occurs but we can try. If the extreme interval value does not improve the results we need to look elsewhere. Let me know what happens if you try that. |
Well, it's consistent anyway :) :
|
Thank you. I've tried to reference that branch as it does sound promising. Unfortunately there are a lot of compile errors which I don't really understand related to things like this: I'll keep trying. |
I have recently achieved very good stability and performance. I attribute the improvements to the following:
WDT errors are completely gone and I haven't been able to produce any other crashes for several days now under heavy load. I am closing this issue as I consider it resolved from my perspective. If others wish to continue the discussion, feel free to reopen. |
Glad you got it stable! |
Would you mind outlining what you did in more detail? When you say latest arduino-espressif are you adding anything specific to the platformio.ini file or you are using latest PlatformIO Espressif 2.1? I pretty much followed the last 2 of your list and got to a much more stable place, but have not been able to eliminate the crashes all together. My impression is that while scanning is in progress and a WifiUDP::parsePacket is processed - in my case as part of OTA handler(note OTA is not in progress, just the handler is polled by the loop function) - a crash often occurs. I run the scan in a Freertos task on core 0, and as part of a state machine so that tasks run according to state and not concurrently. That said I'm happy for the issue to be closed as I don't think there is anything wrong with the NimBLE library, rather it is not easy to get BLE and WiFi working in the scenario described. Running BLE on it's own is fine, I run into heap running out and not recovering when I combine everything together, even with a state machine. Thanks to @h2zero for finding this me-no-dev/ESPAsyncWebServer#876 - I modified the AsynTCP.cpp according to the suggestions there and the web server responds near instantly most of the time. I was getting responses taking up to five seconds, now it is typically under 200ms. |
I can't answer for him but you can use: It's a blessing and a curse though. Small things they change have caused issues for me that made it past my initial testing and into binaries. |
@lbussy, @pletch I have not tested this with tiltbridge but I made some significant changes to NimBLEScan and NimBLEAdvertised device that significantly increase performance and may alleviate some of the watchdog timeouts. If you're interested in testing this please checkout this branch. Some notes for your use: in the scan setup call |
We'll give it a go and let you know how it works out, thank you! |
I added this to my platformio.ini file lib_deps: But after building it I don't see the setMaxResults() method. Am I missing something? Thanks. |
I managed to get it to compile - had to clean out all the environments and used this reference: |
I've given this a try. It works fine, most notably for me is the heap seems to be almost entirely preserved. I would often see the heap go from 80k down to 35k(the danger zone from my experience). So far it has been fluctuating around only a few hundred bytes during the scan. This is also probably because I am now specifying max results to 3 as I only have 2 devices of interest. This is anecdotal and my use case is likely to differ from others, but still a good improvement and I see the payload is now complete which was another issue that seems to be rolled into this branch. @h2zero Excellent work here - this project is so useful and was much needed. Thank you to you and everyone else who is making this better and better! |
I don't have time to "dig in" and learn what may be beneficial, I slapped it in just to see what happens. Unfortunately, the heap usage went up significantly. Before:
After:
Fortunately (but unfortunately I suppose) we are using a |
@lbussy As I said, my use case will differ from others. Nonetheless I would say the heap is very healthy from my point of view. I'm pegged at around 85k. I expect that it is not unusual for it to drop, but I've experienced that once you get to the <35k it is likely to crash the ESP32. |
I should add that I have now had a stable web server and BLE running without crashes for 7hrs now - not related to this branch of NimBLE directly, but mainly around the way the BLE and web server and other tasks are orchestrated using Freertos. I had 2 weeks of pain trying to get to this stage. I need to get through 24 hrs straight to be satisfied that I have indeed cracked it and that the pattern will be fit for a MVP. |
Yikes, that shouldn't happen, I'll have to look into this. Thanks for letting me know. @justoke Awesome!, glad to hear it's working out for you. |
I have a VERY chatty BLE environment. Is that a factor? |
That's what this branch was intended to help with, if max results is set to 0 then no advertised devices will be stored unless done by copy in the callback in the user code. It's quite shocking that you are seeing the opposite of the intended effect. |
@lbussy, I'm not able to reproduce the heap use you see. This is master branch:
This is refactor-scan:
Max results set to 0:
|
Well after testing this branch for a while and spamming the about page like a mad man I have not been able to reproduce the crash originally discussed in this thread 😄 . Whatever changes you all made, good work! |
Maybe we need to get you to compile and send it to me? I can still kill it. :) Which branch are you testing with? |
I'm working with this thorrak/tiltbridge@8c2c81f using lcd_ssd1306 env I can stuff it up and lag it for sure still if I spam the page fast enough but I have not caused a panic so far. |
Ah yeah. The 1306 is quite a bit more stable for some reason. If I use |
Interesting, sounds like the display updates may be a contributing factor then. I'm not sure how they are handled though so just thinking out loud. |
We've been leaning in that direction, but we're not able to prove it (yet). |
Are you still seeing a big difference in heap consumption with the refactor-scan branch? |
It's looking good so far. |
Working with others on a project which leverages NimBLE as a client only receiving advertisements (ADV_NONCONN_IND) from up to 8 devices. We have recently converted the project over to use ESPAsyncWebserver and have started experiencing sporadic PANIC resets. A reset can be fairly reliably triggered by hard reloading one of the pages with several concurrent jQuery requests repeatedly. We suspect this relates to competition for radio time between WiFi and BLE but are unsure about this aspect. Have done some troubleshooting such as trying to adjust the Scan Interval and Scan Window but have not been able to reliably eliminate the issue and am looking for ideas. The backtrace is always the same and appears to indicate that the PANIC is being triggered in NimBLE related code.
Any guidance on whether this likely points to a problem with our implementation or whether we have uncovered a library issue is much appreciated!
Logging output:
ble_hs_hci_cmd_send: ogf=0x08 ocf=0x000c len=2
0x0c 0x20 0x02 0x00 0x00
Command complete: cmd_pkts=5 ogf=0x8 ocf=0xc status=0
GAP procedure initiated: discovery; own_addr_type=0 filter_policy=0 passive=1 limited=0 filter_duplicates=0 duration=3000ms
ble_hs_hci_cmd_send: ogf=0x08 ocf=0x000b len=7
0x0b 0x20 0x07 0x00 0x54 0x00 0x3b 0x00 0x00 0x00
Command complete: cmd_pkts=5 ogf=0x8 ocf=0xb status=0
ble_hs_hci_cmd_send: ogf=0x08 ocf=0x000c len=2
0x0c 0x20 0x02 0x01 0x00
Backtrace: 0x40091fa4:0x3ffbdc20 0x400921d5:0x3ffbdc40 0x40085a9d:0x3ffbdc60 0x40085bc9:0x3ffbdc90 0x40180377:0x3ffbdcb0 0x4017ad15:0x3ffbdf70 0x40177de9:0x3ffbdfc0 0x400ee0e2:0x3ffbdff0 0x401c75ea:0x3ffbe010 0x401bba2a:0x3ffbe030 0x4001791d:0x3ffbe050 0x401bc029:0x3ffbe070 0x401bc106:0x3ffbe0b0 0x400188f5:0x3ffbe0e0 0x400175ee:0x3ffbe120 0x40017688:0x3ffbe140 0x4008cb36:0x3ffbe170 0x4008cd0b:0x3ffbe190 0x401bb9d7:0x3ffbe1b0 0x4008cd28:0x3ffbe1d0 0x40085025:0x3ffbe1f0 0x4008583e:0x3ffb9d00 0x4008cadd:0x3ffb9d20 0x401c7789:0x3ffb9d40 0x401aa088:0x3ffb9d60 0x400edd8d:0x3ffb9d80 0x40082b53:0x3ffb9da0 0x4008e33d:0x3ffb9dc0
#0 0x40091fa4:0x3ffbdc20 in invoke_abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c:707
#1 0x400921d5:0x3ffbdc40 in abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c:707
#2 0x40085a9d:0x3ffbdc60 in lock_acquire_generic at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/newlib/locks.c:143
#3 0x40085bc9:0x3ffbdc90 in _lock_acquire_recursive at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/newlib/locks.c:171
#4 0x40180377:0x3ffbdcb0 in _vfiprintf_r at /Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdio/../../../.././newlib/libc/stdio/vfprintf.c:860 (discriminator 2)
#5 0x4017ad15:0x3ffbdf70 in fiprintf at /Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdio/../../../.././newlib/libc/stdio/fiprintf.c:50
#6 0x40177de9:0x3ffbdfc0 in __assert_func at /Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdlib/../../../.././newlib/libc/stdlib/assert.c:59 (discriminator 8)
#7 0x400ee0e2:0x3ffbdff0 in host_rcv_pkt at .pio\libdeps\lcd_ssd1306\NimBLE-Arduino@src-e248a3b4ca528fb2d5b4751077939ebd\src\esp-hci\src/esp_nimble_hci.c:365
#8 0x401c75ea:0x3ffbe010 in vhci_send at ??:?
#9 0x401bba2a:0x3ffbe030 in r_eif_send at ??:?
#10 0x4001791d:0x3ffbe050 in ?? ??:0
#11 0x401bc029:0x3ffbe070 in hci_tx_start at hci_tl.c:?
#12 0x401bc106:0x3ffbe0b0 in r_hci_tl_send at ??:?
#13 0x400188f5:0x3ffbe0e0 in ?? ??:0
#14 0x400175ee:0x3ffbe120 in ?? ??:0
#15 0x40017688:0x3ffbe140 in ?? ??:0
#16 0x4008cb36:0x3ffbe170 in vhci_recv_end at vhci.c:?
#17 0x4008cd0b:0x3ffbe190 in r_vhci_isr at ??:?
#18 0x401bb9d7:0x3ffbe1b0 in r_eif_isr at ??:?
#19 0x4008cd28:0x3ffbe1d0 in r_eif_isr_wrapper at intc.c:?
#20 0x40085025:0x3ffbe1f0 in _xt_lowint1 at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/xtensa_vectors.S:1154
#21 0x4008583e:0x3ffb9d00 in cause_sw_intr_to_core_wrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/bt/bt.c:1322
#22 0x4008cadd:0x3ffb9d20 in vhci_set_interrupt at vhci.c:?
#23 0x401c7789:0x3ffb9d40 in API_vhci_host_send_packet at ??:?
#24 0x401aa088:0x3ffb9d60 in esp_vhci_host_send_packet at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/bt/bt.c:1322
#25 0x400edd8d:0x3ffb9d80 in ble_hci_trans_hs_cmd_tx_on_core_0 at .pio\libdeps\lcd_ssd1306\NimBLE-Arduino@src-e248a3b4ca528fb2d5b4751077939ebd\src\esp-hci\src/esp_nimble_hci.c:171
#26 0x40082b53:0x3ffb9da0 in ipc_task at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/ipc.c:62
#27 0x4008e33d:0x3ffb9dc0 in vPortTaskWrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port.c:355 (discriminator 1)
The text was updated successfully, but these errors were encountered: