LL_ASSERT and 'Imprecise data bus error' in LL Controller #21107
Comments
@LeBlue Is this issue reproducible with CONFIG_LOG=n and CONFIG_SERIAL=n? Also, could you try increasing the BT_CTLR_RX_PRIO_STACK_SIZE value from 448 to something higher, say 1024? |
@cvinayak The issue is reproducible with CONFIG_LOG=n; at least the behaviour was the same. I enabled CONFIG_LOG just to find out what happens and where. CONFIG_SERIAL was enabled because it is necessary for MCUMGR. I will first try out the BT_CTRL_RX_PRIO_STACK_SIZE option and share the results. |
@cvinayak Increasing BT_CTRL_RX_PRIO_STACK_SIZE did not help. Additionally, I watermarked several thread stacks with CONFIG_INIT_STACKS=y, and added CONFIG_THREAD_NAME=y and CONFIG_MPU_STACK_GUARD=y
I think this rules out the stack overflow assumption? btw.: I had to adjust the default for BT_CTRL_RX_PRIO_STACK_SIZE, adding it to prj.conf did not work. Addresses v2.1-branch@v2.1.0:
|
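For readers unfamiliar with stack watermarking, here is a minimal sketch of the idea behind CONFIG_INIT_STACKS. It is not Zephyr's implementation: the helper names `stack_paint` and `stack_unused` are hypothetical, and the 0xAA fill byte is stated as an assumption about what Zephyr uses. The stack is painted with a known pattern before use, and the number of still-painted bytes at the low end later shows how much headroom was never touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill pattern; assumed here to match what CONFIG_INIT_STACKS writes. */
#define STACK_FILL 0xAAu

/* Paint the whole stack buffer before the thread starts running. */
static void stack_paint(uint8_t *stack, size_t size)
{
    assert(stack != NULL);
    memset(stack, STACK_FILL, size);
}

/*
 * Count untouched bytes from the low address upwards. On a descending
 * stack (as on Cortex-M) the low end is the last part to be used, so the
 * result is the remaining headroom in bytes.
 */
static size_t stack_unused(const uint8_t *stack, size_t size)
{
    size_t unused = 0;

    assert(stack != NULL);
    while (unused < size && stack[unused] == STACK_FILL) {
        unused++;
    }
    return unused;
}
```

A result that stays well above zero for every thread is what makes a plain stack overflow an unlikely explanation, although it cannot rule out corruption from other sources.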
@LeBlue Have you made any modifications to the controller code? Did you add the "le" word to the BT_INFO logging in the controller's hci.c? I see you are connecting to up to 8 devices. Does this problem exist if only one device is connected? Is there a way I can reproduce this on my end? Could you share a minimal private branch that reproduces the issue, with steps to reproduce? |
No, I did not change this part, but the
I did not see it happening with only one or even a few devices. (Does not mean it cannot happen).
I have not been able to directly provoke the crash. Sometimes it just works for hours or even days. This might be unrelated, but here is some additional information about the peripherals:
Before you try and reproduce it (instructions will follow), have a look at the gathered and annotated logs: Summary:
Thoughts/my conclusions/possible explanations on this (might be incomplete or plain wrong, please correct/share thoughts on this)
It seems unlikely but these are obvious places to check. |
How to reproduce the issue
It should be sufficient to use the plain zephyr hci_usb sample with updated
Changes to the hci_usb sample
The only changes I am using are:
These should not affect the bug, and 4 and 5 did not change the outcome.
Peripherals
For the peripherals the plain peripheral sample should do.
Probably at least 5/6 peripherals are needed. To increase the chance, maybe even more than 8.
Host system
Any Linux distro native/VM should do with bluez. Connect the peripherals and exchange some data by reading characteristics regularly and/or indications by the peripherals.
Timeframe
The issue happened in between less than 1h and 2-3 days.
Notes
I have two test systems setup:
The attached 44 crash logs are all the logs from 2 systems running for 14 days. There may have been a couple more since some weekend logs got rotated. |
@cvinayak any ideas or comments? |
Hi,
Nordic |
Update
I found a way to reliably trigger the
by more or less simultaneously reading a value from all (9) connected peripherals. Command sequence from btmon:
The assertion was hit 9 out of 10 times directly after the successful reading of all values. This applies only to the first test setup described above.
Fix
I was able to fix/circumvent the problem by increasing
Should this be the default setting for the hci_* samples? |
CONFIG_BT_CTLR_LLCP_CONN = CONFIG_BT_MAX_CONN
Yes, this has to be the default if connections are established back-to-back and/or control procedures are initiated simultaneously to all connected peers. Does this fix the original issue? #21107 (comment) |
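As a sketch of what that setting could look like in a project configuration fragment: the value 9 is an assumption taken from the number of peripherals mentioned in this thread, not a recommended default, and the only constraint stated above is that both options carry the same value.

```
# prj.conf fragment (sketch; 9 is assumed from the reported setup)
CONFIG_BT_MAX_CONN=9
# One control procedure context per connection, as suggested above
CONFIG_BT_CTLR_LLCP_CONN=9
```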
@cvinayak But there seems to be a different issue that is still occurring (1. in the first comment). This is the imprecise data bus error in the memq_enqueue function. I have tested this up until master@fbbf68d63c45 with no changes. Attached are all the recent failure logs, where I decoded the registers shown in the fault dump (pc, lr, etc). I am reluctant to close this issue and open a new one, because there is a lot of context information gathered here (mainly in #21107 (comment)), but I can do it if necessary. If I should try something, or if you have an idea how to get more information, let me know. I did not see the assert(rx) (2. in the first comment). |
@LeBlue is it possible to provide me a script or application for linux that I can use to reproduce the crash? (at best I have only used btmgmt tool). Also, could you use zephyr SDK on a linux distro to build your hci_usb? |
I am not able to reproduce this. Are you using the zephyr SDK supplied toolchain? |
No, because I run on Windows (and SDK is not supported there???). I used 3rd party GNU ARM Embedded toolchain. |
@cvinayak I will provide a python program as a simple script will not do the job. This will take approx. 1-2 days. I will share it as a GitHub repository. |
@LeBlue Thanks. Yes, a GitHub repo to reproduce the issue will be the best. Also, please check if #22946 has any influence. It's possible that the radio was left in an unknown state, which could cause memory corruption (speculating based on the possibility of leaving the radio in Rx state) and would explain the random nature of the occurrence. |
@LeBlue any update on this? |
Possibly related: recent install on Linux (v2.2.99, SDK v0.11.2) running
Edit: Connecting and selecting |
@cvinayak @jhedberg I did some more digging and found out the following (same behaviour on 2.2.0 and 2.3.0-rc1), backtraces are from v2.3.0-rc1 (fault in frame 6).
backtrace of imprecise bus fault
Frame 6
memq_enqueue (faulty value is at
The exact value is found in the
This shows, the corrupted link element that leads to the crash is not generated in this event, only the crash happens here:
The faulty value (here
backtracing the culprit
To set a breakpoint where the faulty link element is
I have commented interesting parts in the gdb log, these patterns were observed multiple times. Otherwise I was having a look around a bit, maybe it helps.
In
Check where the faulty value originates (note the difference in the last byte)
Here the
check if the alloc_peek and alloc returned the same pointer (yes)
garbage link value is also in the mem_link_rx pool (note the difference in the last byte)
random stuff
frame 2
Any ideas? |
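One way to catch this kind of corruption earlier than the bus fault is to validate a link pointer before it is dereferenced. The sketch below is only an illustration under assumptions: `struct link` is a hypothetical stand-in loosely modelled on a queue link, and `link_in_pool` is not a function in the controller. It rejects pointers outside the pool and, matching the "difference in the last byte" observation above, pointers that fall inside the pool but not on an element boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical link element; the real controller type may differ. */
struct link {
    struct link *next;
    void *mem;
};

/*
 * Return true only if p points exactly at one of the `count` elements
 * of `pool`. A pointer with a corrupted low byte usually lands inside
 * the pool range but off an element boundary, and is rejected.
 */
static bool link_in_pool(const struct link *pool, size_t count, const void *p)
{
    uintptr_t base = (uintptr_t)pool;
    uintptr_t addr = (uintptr_t)p;

    assert(pool != NULL);
    if (addr < base || addr >= base + count * sizeof(struct link)) {
        return false;
    }
    return ((addr - base) % sizeof(struct link)) == 0;
}
```

Calling such a check at enqueue time would turn a late fault into an assertion closer to the point where the bad value is first consumed, though it would still not identify the writer.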
@cvinayak Added a script that I use for reproducing the crash: https://github.com/LeBlue/zephyr-memq-bug-tools |
@LeBlue Thank for the detailed tools to reproduce the issue. I will set this up and let it run over the weekend. |
I notice that your bt_conn_loop.py only connects to the first device and reports the others as not connected. Looking at the btmon log, I see only one device connected at any time (the connection handle is always 0). I see a lot of kernel panics too. (I am using a dedicated Linux laptop; I will collect the piped logs and attach them on Monday.)
I have now switched over to using hci_uart using
I am definitely seeing differences between use of hci_usb versus hci_uart. I suspect hci_usb may be causing a buffer overflow in the controller code. Could you add
I would like you to set up a video call next week anytime between 0700 CET - 1400 CET, so we can debug together over screen sharing, if this is an option. |
@emob-nordic any chance this is related to the USB code? |
@carlescufi I will have a look. |
@LeBlue Thanks for the quick response. Could you also increase the thread stack sizes (say 4096 bytes) of the USB stack thread, the logging thread and the Bluetooth Tx Stack Size (you will need to enable it first, see my attached .config file in txt extension)? I got my nRF52833DK (I don't have a spare nRF52840DK) crashing on power up when I enabled debug logging in the Bluetooth and USB subsystems, hence the increase. My zephyr/.config file is attached herewith.
Ignore this for the moment, it seems to be specific to my physical laptop. I am now testing on a MacBook Pro with Ubuntu 20.04 in VirtualBox. 6 peripherals are connected now, periodically reading the software versions, so far for over 30 mins.... |
@LeBlue With all the thread stack increases, I hit an MPU fault. Since the instruction address is in RAM inside the USB Bluetooth thread stack region and LR is in log_core.c, I infer that the hardcoded Bluetooth rx thread stack size (for the thread calling bt_send) of 512 bytes could be insufficient. For debugging purposes, I have increased it to 2048. Could you please increase it and perform your tests? |
The hci_uart uses thread stack size of >640 to 1024 bytes, the thread that calls bt_send. |
@cvinayak I have modified the test setup and moved the peripherals on the edge of the radio reception range. Now the issue is observable (without the fix) after 10-20 mins (instead of several hours). This at least already shows that the memory corruption gets triggered by transmission errors. |
@LeBlue could you please test #25619 with your new test setup? |
@carlescufi Yes, test (hci_uart and hci_usb) for #25619 already running for ~3h. Seems to be working so far. |
Fix missing assignment of NRF_CCM->MAXPACKETSIZE register for PDU sizes smaller than 251 bytes. If there are CRC errors causing PDU length fields to be higher than the configured PDU buffer sizes in the controller, without the MAXPACKETSIZE register set to the correct PDU size, the CCM module could overrun the PDU buffer and cause memory corruption. This fix is applicable for all nRF52 Series SoCs except nRF52832 SoC. Fixes zephyrproject-rtos#21107. Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>
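To illustrate the failure mode this commit message describes, here is a software-only sketch, not the actual CCM behaviour or driver code. The function `pdu_copy_bounded` and its `max_payload` parameter are hypothetical; `max_payload` merely plays the role that a correctly set MAXPACKETSIZE is described as playing above, and the 2-byte header with the length in the second byte is an assumption of this model. A length field corrupted on air claims more payload than the receive buffer holds, and the bound keeps the write inside the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PDU_HDR_SIZE 2u /* assumed header size; length is in byte 1 */

/*
 * Copy a PDU while trusting its length field only up to max_payload.
 * Without the clamp, a corrupted length of e.g. 200 would write far
 * past a buffer sized for a 27-byte payload.
 * Returns the number of bytes written to out.
 */
static size_t pdu_copy_bounded(uint8_t *out, const uint8_t *in,
                               uint8_t max_payload)
{
    uint8_t len;

    assert(out != NULL && in != NULL);
    len = in[1]; /* may be garbage after a CRC error */
    if (len > max_payload) {
        len = max_payload;
    }
    memcpy(out, in, PDU_HDR_SIZE + len);
    return PDU_HDR_SIZE + len;
}
```

With `max_payload` set to 27 and a corrupted length of 200, only 29 bytes are written, so whatever follows the buffer in memory is left intact.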
Implemented an intermediate decrypt buffer to cover the CCM overrun under CRC error conditions. The workaround is applicable to nRF52832 SoC only, which is missing the MAXPACKETSIZE register in the NRF_CCM peripheral. Fixes zephyrproject-rtos#21107 for nRF52832 SoC. Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>
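The workaround described here can be pictured with the following sketch, which is an illustration under stated assumptions rather than the controller's code. `fake_decrypt` is a hypothetical stand-in for a decrypt step that trusts the length field without any bound, and `SCRATCH_SIZE` assumes a 2-byte header plus the largest value an 8-bit length can claim. The unbounded step is pointed at a scratch buffer large enough for the worst case, and data reaches the real receive buffer only if the CRC was good and the PDU fits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PDU_HDR_SIZE 2u
/* Worst case an 8-bit length field can claim (assumption of this sketch). */
#define SCRATCH_SIZE (PDU_HDR_SIZE + 255u)

static uint8_t scratch[SCRATCH_SIZE];

/* Stand-in for an unbounded decrypt: copies as much as the header claims. */
static void fake_decrypt(uint8_t *out, const uint8_t *in)
{
    memcpy(out, in, PDU_HDR_SIZE + in[1]);
}

/*
 * Decrypt into the scratch buffer first, then copy into rx only when the
 * CRC was good and the whole PDU fits. Returns true if rx was written.
 */
static bool rx_decrypt(uint8_t *rx, size_t rx_size, const uint8_t *in,
                       bool crc_ok)
{
    size_t total;

    assert(rx != NULL && in != NULL);
    fake_decrypt(scratch, in);
    total = PDU_HDR_SIZE + (size_t)scratch[1];
    if (!crc_ok || total > rx_size) {
        return false;
    }
    memcpy(rx, scratch, total);
    return true;
}
```

The cost of this design is one extra copy per received PDU plus the scratch memory, which is presumably why it is limited to the SoC that lacks the register.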
Fix missing assignment of NRF_CCM->MAXPACKETSIZE register for PDU sizes smaller than 251 bytes. If there is CRC errors causing PDU length fields to be higher than configured PDU buffer sizes in the controller, without the MAXPACKETSIZE register set to correct PDU size, CCM module could overrun the PDU buffer and cause memory corruption. This fix is applicable for all nRF52 Series SoCs except nRF52832 SoC. Fixes zephyrproject-rtos#21107. Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no> (cherry picked from commit cd7a73c)
…overrun Implemented an intermediate decrypt buffer to cover the CCM overrun under CRC error conditions. The workaround is applicable to nRF52832 SoC only, which is missing the MAXPACKETSIZE register in the NRF_CCM peripheral. Fixes zephyrproject-rtos#21107 for nRF52832 SoC. Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no> (cherry picked from commit ba487fe)
Describe the bug
I am using a custom board similar to the nrf52840_pca10056, running a modified version of the hci_usb sample with the addition of a watchdog, mcumgr capability and logging. The main application on a Linux host (bluez5.51) will search for and (re)connect up to 9 peripheral devices.
For reconnection the application will regularly enable and disable scans. In the setup there are only 8 devices present, to force simultaneous scanning and data exchange with the peripherals.
The FW (hci_usb) on the nrf52840 is crashing regularly (after anywhere from minutes to 4h) with one of the three following errors. Since the crashes happen in the same subsystem (bluetooth/controller/ll_sw/), all three different error logs are attached here. Every crash was seen at least twice.
To Reproduce
The crash is observed on a custom board similar to the nrf52840_pca10056, running a modified version of the hci_usb sample with the addition of a watchdog, mcumgr capability and logging:
The bug should be observable on the standard hci_usb sample for the nrf52840_pca10056.
Debugging efforts
I tried enabling debug logs for the subsystem, but could not reproduce the bug with them enabled.
Expected behavior
The crashes should not happen.
Impact
This is somewhere between a major annoyance and a showstopper.
zephyr log output
See file below for complete logs.
zephyr/subsys/bluetooth/controller/util/memq.c
Line 100 in 12948fd
zephyr/subsys/bluetooth/controller/ll_sw/ull.c
Line 1449 in 12948fd
zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_conn.c
Line 586 in 12948fd
Additional journal log of the linux host with the (full) zephyr logs included and some additional context from the linux side:
journal_ll_ctrl_assert_bug.txt
Environment (please complete the following information):
Additional context
The more traffic is created, e.g. by sending indications from, or reading/writing values of, the remote devices, the more likely it is to trigger the bug.
The full logs show some HCI command timeouts from the Linux side, which mostly revolve around enabling/disabling scanning.