LL_ASSERT and 'Imprecise data bus error' in LL Controller #21107

LeBlue · 2019-12-02T14:29:10Z

Describe the bug
I am using a custom board similar to the nrf52840_pca10056 modified version of the hci_usb sample with addtion of a watchdog, mcumgr capability and logging. The main application on a linux host (bluez5.51) will search and (re)connect up to 9 peripheral devices.

For reconnection the application will regularly enable and disable scans. In the setup there are only 8 devices present to force simultaneous scanning and data exchange with the peripherals.

The FW (hci_usb) on the nrf52840 is crashing regularly (minutes - 4h) with one of the three following errors. Since the crashes happen in the same subsystem (bluetooth/controller/ll_sw/), all three different error logs are attached here. Every crash was seen at least two times.

To Reproduce
The crash is observed on a custom board similar to the nrf52840_pca10056 modified version of the hci_usb sample with addtion of a watchdog, mcumgr capability and logging:

CONFIG_BOOTLOADER_MCUBOOT=y
CONFIG_MCUMGR=y
CONFIG_MCUMGR_SMP_UART=y
CONFIG_MCUMGR_CMD_IMG_MGMT=y
CONFIG_MCUMGR_CMD_OS_MGMT=y
CONFIG_BT_MAX_CONN=15
CONFIG_BLUETOOTH_INT_EP_MPS=64
CONFIG_LOG=y
CONFIG_UART_CONSOLE=y

The bug should be observable on the standard hci_usb sample for the nrf52840_pca10056

Debugging efforts
I tried adding debug logs of the subsystem, but could not reproduce the bug with it enabled.

Expected behavior
The crashes should not happen.

Impact
This is a major annoance to showstopper

zephyr log output

See file below for complete logs.

This happens the most of all three errors

[03:36:46.588,989] <err> os: ***** BUS FAULT *****
[03:36:46.588,989] <err> os:   Imprecise data bus error
[03:36:46.589,019] <err> os: r0/a1:  0x20001b40  r1/a2:  0x20001aac  r2/a3:  0x20002060
[03:36:46.589,019] <err> os: r3/a4:  0x20008dcc r12/ip:  0x00021339 r14/lr:  0x00018e0d
[03:36:46.589,019] <err> os:  xpsr:  0x6100021b
[03:36:46.589,019] <err> os: Faulting instruction address (r15/pc): 0x00014686
[03:36:46.589,019] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[03:36:46.589,019] <err> os: Fault during interruptling
[03:36:46.589,050] <err> os: Current thread: 0x20004fa8 (unknown)
[03:36:46.652,160] <err> os: Halting system

addr2line -e zephyr/zephyr.elf 0x00014686
zephyr/subsys/bluetooth/controller/util/memq.c:100

zephyr/subsys/bluetooth/controller/util/memq.c

Line 100 in 12948fd

*tail = link;

assert: 'rx' failed
[00:38:19.855,865] <err> os: r0/a1:  0x00000003  r1/a2:  0x0000000a  r2/a3:  0x00000001
[00:38:19.855,895] <err> os: r3/a4:  0x00011a01 r12/ip:  0x20002800 r14/lr:  0x00017d99
[00:38:19.855,895] <err> os:  xpsr:  0x6100001b
[00:38:19.855,895] <err> os: Faulting instruction address (r15/pc): 0x00017da4
[00:38:19.855,895] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:38:19.855,895] <err> os: Fault during interruptling
[00:38:19.855,926] <err> os: Current thread: 0x20004fa8 (unknown)
[00:38:19.907,470] <err> os: Halting system

addr2line -e zephyr/zephyr.elf 0x00017da4
zephyr/subsys/bluetooth/controller/ll_sw/ull.c:1449 (discriminator 1)

zephyr/subsys/bluetooth/controller/ll_sw/ull.c

Line 1449 in 12948fd

LL_ASSERT(rx);

assert: 'e' failed
[04:18:06.579,406] <err> os: r0/a1:  0x00000003  r1/a2:  0x0000000a  r2/a3:  0x00000001
[04:18:06.579,406] <err> os: r3/a4:  0x00011a01 r12/ip:  0x00000000 r14/lr:  0x0001e0b1
[04:18:06.579,406] <err> os:  xpsr:  0x61000011
[04:18:06.579,406] <err> os: Faulting instruction address (r15/pc): 0x0001e0bc
[04:18:06.579,406] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[04:18:06.579,437] <err> os: Fault during interruptling
[04:18:06.579,437] <err> os: Current thread: 0x20004fa8 (unknown)
[04:18:06.630,981] <err> os: Halting system

addr2line -e zephyr/zephyr.elf 0x0001e0bc
zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_conn.c:586 (discriminator 1)

zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_conn.c

Line 586 in 12948fd

LL_ASSERT(e);

Additional journal log of the linux host with the (full) zephyr logs included and some additional context from the linux side:
journal_ll_ctrl_assert_bug.txt

Environment (please complete the following information):

OS: MacOS (FW build), Linux (user of the external HCI controller)
Toolchain: gnuarmemb v7.3.1, updated today v9.2.1 no change
Commit SHA or Version used: v2.1.0-rc1-316-g12948fdf37d5

Additional context
The more traffic is created, eg. sending indications from/reading/writing values of the remote devices, the more likely it is to trigger the bug.

The full logs show some hci command timeouts from the linux side, which mostly revolve around enableing/disableing scanning.

The text was updated successfully, but these errors were encountered:

cvinayak · 2019-12-05T01:46:56Z

@LeBlue Is this issue reproducible with CONFIG_LOG=n and CONFIG_SERIAL=n ? Also, otherwise, could you try with updating the BT_CTLR_RX_PRIO_STACK_SIZE value from 448 to something higher say 1024 and try.

Its here https://github.com/zephyrproject-rtos/zephyr/blob/master/subsys/bluetooth/controller/Kconfig#L152

LeBlue · 2019-12-05T17:02:22Z

@cvinayak The issue is reproducible with CONFIG_LOG=n, at least the behaviour was the same. I enabled CONFIG_LOG to just find out what/where it happens. But CONFIG_SERIAL was enabled as it was necessary for MCUMRG.

I will first try out the BT_CTRL_RX_PRIO_STACK_SIZE option and share the results.

LeBlue · 2019-12-13T18:54:33Z

@cvinayak increasing BT_CTRL_RX_PRIO_STACK_SIZE did not help. Addtionaly I watermarked several thread stacks with CONFIG_INIT_STACKS=y, added CONFIG_THREAD_NAME=y and CONFIG_MPU_STACK_GUARD=y

[00:55:57.123,168] <inf> os: recv thread stack :        unused 256        usage 256 / 512 (50 %)
[00:55:58.127,593] <inf> os: z_main_stack :        unused 736        usage 288 / 1024 (28 %)
[00:55:59.127,838] <inf> os: isr_stack :        unused 1280        usage 768 / 2048 (37 %)
[00:56:00.128,387] <inf> os: sys_work_q_stack :        unused 2872        usage 200 / 3072 (6 %)
[00:56:00.832,916] <inf> bt_ctlr_hci: handle: 0x0001, rssi: -69 dB.
[00:56:01.128,540] <inf> os: z_idle_stack :        unused 268        usage 52 / 320 (16 %)
[00:56:02.128,631] <inf> os: logging_stack :        unused 240        usage 528 / 768 (68 %)
[00:56:04.877,624] <inf> os: prio recv thread stack :        unused 808        usage 216 / 1024 (21 %)
[00:56:14.413,055] <inf> os: prio recv thread stack :        unused 808        usage 216 / 1024 (21 %)
[00:56:14.718,933] <inf> os: usb hci tx :        unused 128        usage 384 / 512 (75 %)
[00:56:14.720,031] <inf> os: usb hci rx :        unused 272        usage 240 / 512 (46 %)
[00:56:14.810,729] <inf> os: recv thread stack :        unused 256        usage 256 / 512 (50 %)
[00:56:14.812,744] <err> os: ***** BUS FAULT *****
[00:56:14.812,744] <err> os:   Imprecise data bus error
[00:56:14.812,744] <err> os: r0/a1:  0x20001cd0  r1/a2:  0x20001fc8  r2/a3:  0x20002150
[00:56:14.812,744] <err> os: r3/a4:  0x40001000 r12/ip:  0xaaaaaaaa r14/lr:  0x0001adcb
[00:56:14.812,744] <err> os:  xpsr:  0x21000211
[00:56:14.812,744] <err> os: Faulting instruction address (r15/pc): 0x0001478a
[00:56:14.812,774] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:56:14.812,774] <err> os: Fault during interrupt handling
[00:56:14.812,774] <err> os: Current thread: 0x20005098 (idle)
[00:56:14.907,135] <err> os: Halting system

I think this rules out the stack overflow assumption?

btw.: I had to adjust the default for BT_CTRL_RX_PRIO_STACK_SIZE, adding it to prj.conf did not work.

Addresses v2.1-branch@v2.1.0:

addr2line -e zephyr/zephyr.elf 0x0001478a    subsys/bluetooth/controller/util/memq.c:103
addr2line -e zephyr/zephyr.elf 0x0001adcb     subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan.c:1137

cvinayak · 2019-12-17T00:54:02Z

@LeBlue Have you done any modifications to controller code? did you add the "le" word to the BT_INFO logging from the controller's hci.c ?

I see you are connecting to upto 8 devices. Does this problem exist if its only one device connected to?

Is there a way, I can reproduce this on my end, could you share a minimal private branch that can reproduce the issue with steps to reproduce?

LeBlue · 2019-12-18T13:47:28Z

@cvinayak

Have you done any modifications to controller code? did you add the "le" word to the BT_INFO logging from the controller's hci.c ?

No i did not change this part, but the le is probably just an artifact from filtering the log files.

I see you are connecting to upto 8 devices. Does this problem exist if its only one device connected to?

I did not see it happening with only one or even a few devices. (Does not mean it cannot happen).
Generally the crash does happen more frequently if:

More devices are connected
More additional BLE advertisers are around
Scanning is regularly switched on and off in intervals. (We are searching for the 9th devices, which is the targeted max. number, and that is intentionally not around/switched off)
More data is exchanged between the peripherals and dongle (reading/writing values, indications generated by the peripherals)

I have not been able to directly provoke the crash. So sometimes it just works for hours or even days.

This might be unrelated, but here some additional information about the peripherals:

connection interval 22.5 ms (18)
variable slave latency between 0 and 49 via slave initiated connection parameter update
connections are authenticated and encrypted

Before you try and reproduce it (instructions will follow), have a look at the gathered and annotated logs:

crash_logs.tar.gz

Summary:

Almost all (38) show a "imprecise bus fault" at memq_enqueue, the rest (6) is a ASSERT from data obtained by memq_dequeue (probably also enqueue failed, ignore for now)
All happened with the queues memq_ull_rx and memq_ll_rx.
All parameters to memq_enqueue seem to be from the correct mem pools and in range, based on the return address and registers r0-r3. Only as far as i can tell and I did check this only for some occurrences.
Since the bus fault is imprecise, any of the parameters to memq_enqueue dereferencing/storing might be the issue.

Thoughts/my conclusions/possible explanations on this (might be incomplete or plain wrong, please correct/share thoughts on this)

memq implementation might not work correctly and suffer from race?
memq usage is wrong? As far as i can tell, it should only be equipped to deal in parallel with one consumer and one producer? Call sites seem to be OK, but i cannot tell if preempted interrupt handlers could violate this.
mem_acquire/mem_release suffer from race? Most/All? parameters are acquired from mempools. It might be worth checking if concurrent allocations from the same pool can happen (and fail without notice).
mem_acquire error is ignored somewhere? I could not find any occurrences.

It seems unlikely but these are obvious places to check.

LeBlue · 2019-12-18T17:45:22Z

How to reproduce the issue

it should be sufficient to use the plain zephyr hci_usb sample with updated BT_MAX_CONN=15 on any board with the nrf52840 or similar with USB and UART (for logging).

Changes to the hci_usb sample

The only changes I am using are:

set the BD Address from the vendor register
a watchdog (thread) on low priority. (Bug was reproducible, even with this disabled.)
MCUMGR update capability via the UART and build binaries bootable with MCUBOOT
in the later versions i added stack watermark checks for most of the system threads in 5 to 10s intervals.
in later versions i added CONFIG_MPU_STACK_GUARD=y , thread names and upped the size of rx_prio thread stack.

These should not affect the bug, and 4 and 5 did not change the outcome.

Peripherals

For the peripherals the plain peripheral sample should do.
To match the behaviour, a call to bt_conn_le_param_update maybe helpful.

connection interval 22.5 ms (18)
variable slave latency between 0 and 49 via slave initiated connection parameter update, for reproducing, I think 0 is best. I do not think, the parameter update triggers the crash, value can be fixed.
connections are authenticated and encrypted

Probably at least 5/6 peripherals are needed. To increase the chance, maybe even more than 8.
Support for indicating any value regularly would help. Here the indication frequency of the sample should be increased (is currently at 1/s)

Host system

Any Linux distro native/VM should do with bluez.

Connect the peripherals and exchange some data by reading characteristics regularly and/or indications by the peripherals.

Timeframe

The issue happened in between less than 1h and 2-3 days.
The issue is not directly triggerable but only the chance can be increased by creating a lot of RX events. This is the most important part.

Notes

I have two test systems setup:

8 connected devices
regularly enable/disableing scanning
round robin reading a characteristic value (peripheral idle state, slave latency 49)
Additionally value indications (peripheral used/active state, slave latency 4), only manually triggered by a tester.

One setup, where the crash happens quite frequently (around 5/day, idle state), there a A LOT of advertisers around (80-130). Here the crash happens regularly (15-45 mins), if a lot of indications (>20/s) are generated by the peripherals over a longer period (used/active state).
The second setup, the crash happens but rarely (around 1 in 3 days, idle state), there are few BLE devices around (10-20). I could not trigger the crash (used/active state) in this setup.

The attached 44 crash logs are all the logs from 2 systems running for 14 days. There may have been a couple more since some weekend logs got rotated.

LeBlue · 2020-01-24T17:48:40Z

@cvinayak any ideas or comments?

cristic-devel · 2020-01-29T11:20:26Z

Hi,
I also run into this issue running peripheral sample on a nrf52_pca10040 board. The sample is compiled with FIXED_PASSKEY and this change in the main.c

bt_conn_cb_register(&conn_callbacks);

#if defined(CONFIG_BT_FIXED_PASSKEY)
bt_passkey_set(111111);
#endif /* CONFIG_BT_FIXED_PASSKEY */
bt_conn_auth_cb_register(&auth_cb_display);

Nordic
Connected
[01:12:04.213,348] os: ***** BUS FAULT *****
[01:12:04.213,348] os: Imprecise data bus error
[01:12:04.213,348] os: r0/a1: 0x20000b60 r1/a2: 0x20000aa4 r2/a3: 0x20000e30
[01:12:04.213,378] os: r3/a4: 0x00000001 r12/ip: 0xeb2a7358 r14/lr: 0x00010521
[01:12:04.213,378] os: xpsr: 0x21000011
[01:12:04.213,378] os: Faulting instruction address (r15/pc): 0x0001c0e8
[01:12:04.213,378] os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[01:12:04.213,378] os: Fault during interrupt handling
[01:12:04.213,378] os: Current thread: 0x20000250 (unknown)
[01:12:04.276,153] os: Halting system

LeBlue · 2020-02-06T09:33:52Z

Update

I found a way to reliably trigger the

zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_conn.c

Line 589 in fbbf68d

LL_ASSERT(e);

by more or less simultaenously reading a value from all (9) connected peripherals. Command sequence from btmon:

 < ACL Data TX: Handle 8 flags 0x00 dlen 7               #2313 [hci0] 140.299577
      ATT: Read Request (0x0a) len 2
        Handle: 0x0031
< ACL Data TX: Handle 4 flags 0x00 dlen 7               #2314 [hci0] 140.301017
      ATT: Read Request (0x0a) len 2
        Handle: 0x0031
< ACL Data TX: Handle 5 flags 0x00 dlen 7               #2315 [hci0] 140.302633
      ATT: Read Request (0x0a) len 2
        Handle: 0x0031
< ACL Data TX: Handle 7 flags 0x00 dlen 7               #2316 [hci0] 140.304207
      ATT: Read Request (0x0a) len 2
        Handle: 0x0031
< ACL Data TX: Handle 3 flags 0x00 dlen 7               #2317 [hci0] 140.305643
      ATT: Read Request (0x0a) len 2
        Handle: 0x0031
< ACL Data TX: Handle 1 flags 0x00 dlen 7               #2318 [hci0] 140.307114
      ATT: Read Request (0x0a) len 2
        Handle: 0x0031
< ACL Data TX: Handle 6 flags 0x00 dlen 7               #2319 [hci0] 140.308518
      ATT: Read Request (0x0a) len 2
        Handle: 0x0031
> HCI Event: Number of Completed Packets (0x13) plen 5  #2320 [hci0] 140.325971
        Num handles: 1
        Handle: 3
        Count: 1
< ACL Data TX: Handle 2 flags 0x00 dlen 7               #2321 [hci0] 140.326137
      ATT: Read Request (0x0a) len 2
        Handle: 0x0031
> HCI Event: Number of Completed Packets (0x13) plen 5  #2322 [hci0] 140.326977
        Num handles: 1
        Handle: 8
        Count: 1
< ACL Data TX: Handle 0 flags 0x00 dlen 7               #2323 [hci0] 140.327063
      ATT: Read Request (0x0a) len 2
        Handle: 0x0031
> HCI Event: Number of Completed Packets (0x13) plen 5  #2324 [hci0] 140.331973
        Num handles: 1
        Handle: 2
        Count: 1
> HCI Event: Number of Completed Packets (0x13) plen 5  #2325 [hci0] 140.337971
        Num handles: 1
        Handle: 7
        Count: 1
> HCI Event: Number of Completed Packets (0x13) plen 5  #2326 [hci0] 140.345993
        Num handles: 1
        Handle: 4
        Count: 1
> HCI Event: Number of Completed Packets (0x13) plen 5  #2327 [hci0] 140.350960
        Num handles: 1
        Handle: 6
        Count: 1
> ACL Data RX: Handle 2 flags 0x02 dlen 7               #2328 [hci0] 140.352972
      ATT: Read Response (0x0b) len 2
        Value: 42a7
> ACL Data RX: Handle 3 flags 0x02 dlen 7               #2329 [hci0] 140.368984
      ATT: Read Response (0x0b) len 2
        Value: 3fa7
> ACL Data RX: Handle 6 flags 0x02 dlen 7               #2330 [hci0] 140.372994
      ATT: Read Response (0x0b) len 2
        Value: 3ea7
> HCI Event: Number of Completed Packets (0x13) plen 5  #2331 [hci0] 140.386995
        Num handles: 1
        Handle: 0
        Count: 1
> HCI Event: Number of Completed Packets (0x13) plen 5  #2332 [hci0] 140.388978
        Num handles: 1
        Handle: 5
        Count: 1
> ACL Data RX: Handle 4 flags 0x02 dlen 7               #2333 [hci0] 140.391981
      ATT: Read Response (0x0b) len 2
        Value: 43a7
> ACL Data RX: Handle 8 flags 0x02 dlen 7               #2334 [hci0] 140.396978
      ATT: Read Response (0x0b) len 2
        Value: 42a7
> ACL Data RX: Handle 7 flags 0x02 dlen 7               #2335 [hci0] 140.406993
      ATT: Read Response (0x0b) len 2
        Value: 35a7
> HCI Event: Number of Completed Packets (0x13) plen 5  #2336 [hci0] 140.447990
        Num handles: 1
        Handle: 1
        Count: 1
> ACL Data RX: Handle 0 flags 0x02 dlen 7               #2337 [hci0] 140.455983
      ATT: Read Response (0x0b) len 2
        Value: 44a7
> ACL Data RX: Handle 5 flags 0x02 dlen 7               #2338 [hci0] 140.458996
      ATT: Read Response (0x0b) len 2
        Value: 41a7
> ACL Data RX: Handle 1 flags 0x02 dlen 7               #2339 [hci0] 140.516996
      ATT: Read Response (0x0b) len 2
        Value: 3fa7

The assertion was hit 9/10 times directly after the successful reading of all values. This applies only to the 1. test setup described above.
Tested on master@fbbf68d63c45

Fix

I was able to fix/circumvent the problem by increasing CONFIG_BT_CTLR_LLCP_CONN to the maximum allowed (== CONFIG_BT_MAX_CONN).

CONFIG_BT_MAX_CONN=15
CONFIG_BT_CTLR_ADVANCED_FEATURES=y
CONFIG_BT_CTLR_LLCP_CONN=15

Should this be the default setting for the hci_* samples?

cvinayak · 2020-02-14T12:39:23Z

@LeBlue

CONFIG_BT_CTLR_LLCP_CONN= CONFIG_BT_MAX_CONN

Yes, this has to be default if connections are established back-to-back and/or control procedures initiated simultaneously to all connected peers.

Does this fix original issue? #21107 (comment)

LeBlue · 2020-02-14T17:08:44Z

@cvinayak
It fixes one of them (assert(e), 3. in the first comment), yes. This was the crash I could reliably reproduce and resolved by setting CONFIG_BT_CTLR_LLCP_CONN .

But there seems to be a different issue that is still occuring (1. in the first comment). This is the imprecise data bus error in the memq_enqueue function. I have tested this up until master@fbbf68d63c45 with no changes.

Attached are the all recent failure logs where i decoded the registers shown in the fault dump (pc, lr, etc).
imprecise-memq_enqueue.tar.gz

I am reluctant to close ths issue an open a new one, because there is a lot of context information gathered here (mainly in #21107 (comment)), but i can do it, if necessary.

If i should try something or if you have an idea how to get more information, let me know.

I did not see the assert(rx) (2. in the first comment).

cvinayak · 2020-02-18T09:04:11Z

@LeBlue is it possible to provide me a script or application for linux that I can use to reproduce the crash? (at best I have only used btmgmt tool).

Also, could you use zephyr SDK on a linux distro to build your hci_usb?

cvinayak · 2020-02-18T10:58:15Z

Hi,
I also run into this issue running peripheral sample on a nrf52_pca10040 board. The sample is compiled with FIXED_PASSKEY and this change in the main.c
bt_conn_cb_register(&conn_callbacks);
#if defined(CONFIG_BT_FIXED_PASSKEY)
bt_passkey_set(111111);
#endif /* CONFIG_BT_FIXED_PASSKEY */
bt_conn_auth_cb_register(&auth_cb_display);

Nordic
Connected
[01:12:04.213,348] os: ***** BUS FAULT *****
[01:12:04.213,348] os: Imprecise data bus error
[01:12:04.213,348] os: r0/a1: 0x20000b60 r1/a2: 0x20000aa4 r2/a3: 0x20000e30
[01:12:04.213,378] os: r3/a4: 0x00000001 r12/ip: 0xeb2a7358 r14/lr: 0x00010521
[01:12:04.213,378] os: xpsr: 0x21000011
[01:12:04.213,378] os: Faulting instruction address (r15/pc): 0x0001c0e8
[01:12:04.213,378] os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[01:12:04.213,378] os: Fault during interrupt handling
[01:12:04.213,378] os: Current thread: 0x20000250 (unknown)
[01:12:04.276,153] os: Halting system

I am not able to reproduce this. Are you using the zephyr SDK supplied toolchain?

cristic-devel · 2020-02-18T11:00:55Z

No, because I run on Windows (and SDK is not supported there???). I used 3rd party GNU ARM Embedded toolchain.

LeBlue · 2020-02-19T08:20:39Z

@cvinayak I will provide a python program as a simple script will not do the job. This will take approx. 1-2 days. I will share it as a GitHub repository.

cvinayak · 2020-02-20T01:02:51Z

@lebue Thanks. Yes, a GitHub repo to reproduce the issue will be the best.

Also, please check if #22946 has any influence. Its possible to have left radio in unknown state, could cause memory corruption (speculating based on possibility of leaving radio in Rx state), and occurrence being of random nature.

jhedberg · 2020-03-09T11:04:06Z

@LeBlue any update on this?

mjago · 2020-04-17T19:45:51Z

Possibly related: recent install on Linux (v2.2.99, SDK v0.11.2) running hci_pwr_ctrl sample on nrf52dk_nrf52832. Reproducible occurring every build. Client Android nRF Connect. Other tried samples are working fine.

[00:01:49.009,796] <err> os: ***** BUS FAULT *****
[00:01:49.009,796] <err> os:   Imprecise data bus error
[00:01:49.009,796] <err> os: r0/a1:  0xfffffffc  r1/a2:  0x00000004  r2/a3:  0x00019043
[00:01:49.009,826] <err> os: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x00018d55
[00:01:49.009,826] <err> os:  xpsr:  0x81000000
[00:01:49.009,826] <err> os: Faulting instruction address (r15/pc): 0x00018d60
[00:01:49.009,826] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:01:49.009,826] <err> os: Current thread: 0x20001a14 (unknown)
[00:01:49.123,016] <err> os: Halting system

Edit: Connecting and selecting Heart Rate Measurement Notify seems to push over-the-edge.

LeBlue · 2020-05-20T09:52:02Z

@cvinayak @jhedberg
Sorry for the long time, I did not have time to work on this.
The flash driver patch did not change anything.

I did some more digging and found out the following (same behaviour on 2.2.0 and 2.3.0-rc1), backtraces are from v2.3.0-rc1 (fault in frame 6).

backtrace of imprecise bus fault

#0  k_sys_fatal_error_handler (reason=reason@entry=0, esf=esf@entry=0x200096f0 <z_interrupt_stacks+912>) at zephyr/include/generated/syscalls/log_ctrl.h:33
        __func__ = <optimized out>
#1  0x00025586 in z_fatal_error (reason=reason@entry=0, esf=esf@entry=0x200096f0 <z_interrupt_stacks+912>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/kernel/fatal.c:123
        key = 0
        thread = 0x200027a4 <recv_thread_data>
        __func__ = <optimized out>
#2  0x00013a6c in z_arm_fatal_error (reason=reason@entry=0, esf=esf@entry=0x200096f0 <z_interrupt_stacks+912>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/arch/arm/core/aarch32/fatal.c:47
No locals.
#3  0x00013ef8 in z_arm_fault (msp=<optimized out>, psp=<optimized out>, exc_return=<optimized out>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/arch/arm/core/aarch32/cortex_m/fault.c:969
        reason = 0
        fault = <optimized out>
        recoverable = false
        nested_exc = true
        esf = 0x20009730 <z_interrupt_stacks+976>
        esf_copy = {basic = {{a1 = 536881784, r0 = 536881784}, {a2 = 536881540, r1 = 536881540}, {a3 = 536884008, r2 = 536884008}, {a4 = 536871812, r3 = 536871812}, {ip = 3, r12 = 3}, {lr = 103401, r14 = 103401}, {pc = 83654, r15 = 83654}, xpsr = 1627390481}}
#4  0x00013b5c in z_arm_usage_fault () at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/arch/arm/core/aarch32/cortex_m/fault_s.S:85
No locals.
#5  <signal handler called>
No symbol table info available.
#6  memq_enqueue (link=link@entry=0x20002a78 <mem_link_rx+104>, mem=mem@entry=0x20002984 <mem_done+60>, tail=tail@entry=0x20003328 <memq_ull_rx+4>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/util/memq.c:100
No locals.
#7  0x000193e8 in ull_rx_put (rx=0x20002984 <mem_done+60>, link=0x20002a78 <mem_link_rx+104>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/ull.c:1087
        rx_hdr = 0x20002984 <mem_done+60>
#8  ull_event_done (param=<optimized out>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/ull.c:1173
        evdone = 0x20002984 <mem_done+60>
        link = 0x20002a78 <mem_link_rx+104>
#9  0x0001f6c0 in lll_done (param=<optimized out>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:296
        next = <optimized out>
        ull = <optimized out>
        evdone = <optimized out>
#10 0x0001f1c2 in radio_nrf5_isr_body () at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:76
No locals.
#11 radio_nrf5_isr () at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:72

Frame 6

memq_enqueue (faulty value is at r4 == (*tail). The writing to [r4, #0] generates the bus fault):

frame 6
#6  memq_enqueue (link=link@entry=0x20002a78 <mem_link_rx+104>, mem=mem@entry=0x20002984 <mem_done+60>, tail=tail@entry=0x20003328 <memq_ull_rx+4>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/util/memq.c:100
100		*tail = link;
r0             0x20002a78          536881784
r1             0x20002984          536881540
r2             0x20003328          536884008
r3             0x20000384          536871812
r4             0x29c11739          700520249
r5             0x1                 1
r6             0x0                 0
r7             0x20002a78          536881784
r8             0x0                 0
r9             0x20002708          536880904
r10            0x200028e8          536881384
r11            0x2000290c          536881420
r12            0x3                 3
sp             0x20009754          0x20009754 <z_interrupt_stacks+1012>
lr             0x193e9             103401
pc             0x146c6             0x146c6 <memq_enqueue+10>
xpsr           0x61000211          1627390481
msp            0x200096c8          536909512
psp            0x20008808          536905736

The exact value is found in the memq_ull_rx.tail and memq_ull_rx.head as well as the mem_link_rx mempool

(gdb) p memq_ull_rx
$481 = {head = 0x29c11739, tail = 0x29c11739}
(gdb) x/60xw &mem_link_rx
0x20002a10 <mem_link_rx>:	0x00000000	0x20002af0	0x20002b20	0x20000009
0x20002a20 <mem_link_rx+16>:	0x20002b20	0x20000009	0x20002b20	0x20000009
0x20002a30 <mem_link_rx+32>:	0x20002b20	0x00000000	0x20002b20	0x20000009
0x20002a40 <mem_link_rx+48>:	0x20002b20	0x20000009	0x20002b20	0x20000009
0x20002a50 <mem_link_rx+64>:	0x20002b20	0x20000009	0x20002b20	0x20000009
0x20002a60 <mem_link_rx+80>:	0x20002b20	0x20000009	0x29c11739	0x20002ef4
0x20002a70 <mem_link_rx+96>:	0x20002b20	0x20000009	0x20002b18	0x20002984
0x20002a80 <mem_link_rx+112>:	0x20002b20	0x20000009	0x20002a68	0x2000294c
(gdb) x/8xw ((void*) &memq_ull_rx) - 8
0x2000331c <memq_ll_rx>:	0x20002a68	0x20002a68	0x29c11739	0x29c11739
0x2000332c <sem_recv>:	0x200008f4	0x20003330	0x20003330	0x00000000

This shows, the corrupted link element that leads to the crash is not generated in this event, only the crash happens here:

some faulty link pointer gets memq_enqueued (writing the value to memq_ull_rx.tail
later the faulty link pointer gets memq_dequeued (writing value to memq_ull_rx.head).
Next memq_enqueue crashes (this backtrace) as it tries to dereference (*tail) and write the next element.

The faulty value (here 0x29c11739) changes for every crash, but every time the value is present in the tail, in head itself or in the head->next* chain and somewhere in the mem_link_rx pool (next pointer/element).

backtracing the culprit

To set a breakpoint where the faulty link element is memq_enqueued modifications https://github.com/LeBlue/zephyr/tree/dcb87377ae990156de34a4b0720326286d68b80b are used. Breakpoint was hit in memq_enqueue https://github.com/LeBlue/zephyr/blob/dcb87377ae990156de34a4b0720326286d68b80b/subsys/bluetooth/controller/util/memq.c#L114

I have commented interesting parts in the gdb log, these patterns were observed multiple times.

Otherwise I was having a look around a bit, maybe it helps.

(gdb) bt full
#0  memq_enqueue (link=0xaf736ff7, mem=mem@entry=0x200031f8 <mem_pdu_rx+1676>, tail=tail@entry=0x20003334 <memq_ull_rx+4>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/util/memq.c:114
No locals.
#1  0x0001933a in ull_rx_put (link=<optimized out>, rx=rx@entry=0x200031f8 <mem_pdu_rx+1676>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/ull.c:1087
        rx_hdr = 0x200031f8 <mem_pdu_rx+1676>
#2  0x00020836 in isr_rx_scan_report (rssi_ready=rssi_ready@entry=1 '\001', rl_idx=<optimized out>, dir_report=dir_report@entry=false, lll=0x20003b08 <ll_scan+28>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan.c:1151
        node_rx = 0x200031f8 <mem_pdu_rx+1676>
        pdu_adv_rx = <optimized out>
#3  0x0000f2fe in isr_rx_pdu (devmatch_ok=<optimized out>, devmatch_id=<optimized out>, irkmatch_id=<optimized out>, rssi_ready=1 '\001', rl_idx=255 '\377', irkmatch_ok=0 '\000', lll=0x20003b08 <ll_scan+28>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan.c:992
        err = <optimized out>
        node_rx = 0x200031f8 <mem_pdu_rx+1676>
        pdu_adv_rx = <optimized out>
        dir_report = false
        node_rx = <optimized out>
        pdu_adv_rx = <optimized out>
        dir_report = <optimized out>
        lll_conn = <optimized out>
        ftr = <optimized out>
        rx = <optimized out>
        pdu_tx = <optimized out>
        conn_interval_us = <optimized out>
        conn_offset_us = <optimized out>
        conn_space_us = <optimized out>
        evt = <optimized out>
        pdu_end_us = <optimized out>
        lrpa = <optimized out>
        ret = <optimized out>
        pdu_adv_rx_chan_sel = <optimized out>
        scan_interval_us = <optimized out>
        pdu_tx = <optimized out>
        lrpa = <optimized out>
        err = <optimized out>
        err = <optimized out>
#4  isr_rx (param=0x20003b08 <ll_scan+28>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan.c:414
        err = <optimized out>
        lll = 0x20003b08 <ll_scan+28>
        trx_done = <optimized out>
        crc_ok = <optimized out>
        devmatch_ok = <optimized out>
        devmatch_id = <optimized out>
        irkmatch_ok = 0 '\000'
        irkmatch_id = <optimized out>
        rssi_ready = 1 '\001'
        rl_idx = 255 '\377'
#5  0x0001f252 in radio_nrf5_isr_body () at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:76
No locals.
#6  radio_nrf5_isr () at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:72
No locals.
#7  <signal handler called>
No symbol table info available.
#8  0xaaaaaaaa in ?? ()
No symbol table info available.
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

In memq_enqueue I checked the value of tail and break on suspicious value:

(gdb) p memq_ull_rx
$97 = {head = 0x20002a84 <mem_link_rx+104>, tail = 0xaf736ff7}
(gdb) frame 4
#4  isr_rx (param=0x20003b08 <ll_scan+28>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan.c:414
414			err = isr_rx_pdu(lll, devmatch_ok, devmatch_id, irkmatch_ok,
(gdb) p lll
$98 = (struct lll_scan *) 0x20003b08 <ll_scan+28>
(gdb) set print pretty on
(gdb) p *lll
$100 = {
  hdr = {
    parent = 0x20003aec <ll_scan>,
    is_stop = 0 '\000'
  },
  conn = 0x0,
  conn_ticks_slot = 0,
  conn_win_offset_us = 0,
  conn_timeout = 750,
  state = 1 '\001',
  chan = 0 '\000',
  filter_policy = 0 '\000',
  adv_addr_type = 1 '\001',
  init_addr_type = 1 '\001',
  type = 1 '\001',
  rpa_gen = 0 '\000',
  rl_idx = 255 '\377',
  init_addr = "\235\016\267\276\021\032", # address is definitly correct
  adv_addr = "\233$\027V#\376",
  interval = 18,
  ticks_window = 0
}

(gdb) frame 3
#3  0x0000f2fe in isr_rx_pdu (devmatch_ok=<optimized out>, devmatch_id=<optimized out>, irkmatch_id=<optimized out>, rssi_ready=1 '\001', rl_idx=255 '\377', irkmatch_ok=0 '\000', lll=0x20003b08 <ll_scan+28>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan.c:992
992			err = isr_rx_scan_report(lll, rssi_ready,

Check where the faulty value originates (note the difference in the last byte)

(gdb) p node_rx
$102 = (struct node_rx_pdu *) 0x200031f8 <mem_pdu_rx+1676>
(gdb) p *node_rx
$103 = {
  hdr = {
    {
      next = 0xaf736f4b,
      link = 0xaf736f4b,
      ack_last = 75 'K'
    },
    type = NODE_RX_TYPE_REPORT,
    user_meta = 39 '\'',
    handle = 65535,
    {
      rx_ftr = {
        param = 0x0,
        extra = 0x0,
        ticks_anchor = 0,
        us_radio_end = 0,
        us_radio_rdy = 0,
        rssi = 58 ':',
        lrpa_used = 0 '\000',
        rl_idx = 255 '\377',
        direct = 0 '\000'
      }
    }
  },
  pdu = 0x20003218 <mem_pdu_rx+1708> "D\006\352C\356\375\025Y\002\001\032\002\n\030\n\377L"
}
(gdb) p/x mfifo_pdu_rx_free
$105 = {
  s = 0x4,
  n = 0x16,
  f = 0x3,
  l = 0x1,
  m = {0xbc, 0x2b, 0x0, 0x20, 0xbc, 0x2b, 0x0, 0x20, 0xf8, 0x31, 0x0, 0x20, 0x60, 0x31, 0x0, 0x20, 0x30, 0x30, 0x0, 0x20, 0xb4, 0x2e, 0x0, 0x20, 0x7c, 0x30, 0x0, 0x20, 0x54, 0x2c, 0x0, 0x20, 0x98, 0x2f, 0x0, 0x20, 0x84, 0x2d, 0x0, 0x20, 0x0, 0x2f, 0x0, 0x20, 0x68, 0x2e, 0x0, 0x20, 0x1c, 0x2e, 0x0, 0x20, 0x14, 0x31, 0x0, 0x20, 0x38, 0x2d, 0x0, 0x20, 0xc8, 0x30, 0x0, 0x20, 0xec, 0x2c, 0x0, 0x20, 0xac, 0x31, 0x0, 0x20, 0xd0, 0x2d, 0x0, 0x20, 0xe4, 0x2f, 0x0, 0x20, 0x4c, 0x2f, 0x0, 0x20, 0x70, 0x2b, 0x0, 0x20}
}
(gdb) x/4xw &mfifo_pdu_rx_free.m
0x200000cc <mfifo_pdu_rx_free+4>:	0x20002bbc	0x20002bbc	0x200031f8	0x20003160
(gdb) p/x *((void**) (mfifo_pdu_rx_free.m + mfifo_pdu_rx_free.s * mfifo_pdu_rx_free.f)) # forgot to add +1 to .f
$106 = 0x20003160
(gdb) p *rx
value has been optimized out
(gdb) frame 2
#2  0x00020836 in isr_rx_scan_report (rssi_ready=rssi_ready@entry=1 '\001', rl_idx=<optimized out>, dir_report=dir_report@entry=false, lll=0x20003b08 <ll_scan+28>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan.c:1151
1151		ull_rx_put(node_rx->hdr.link, node_rx);

Here the hdr.link is garbage (this value will later be put into memq_ull_rx via ull_rx_put (and crash the FW on the next ull_rx_put))

(gdb) p *node_rx
$107 = {
  hdr = {
    {
      next = 0xaf736f4b,
      link = 0xaf736f4b,
      ack_last = 75 'K'
    },
    type = NODE_RX_TYPE_REPORT,
    user_meta = 39 '\'',
    handle = 65535,
    {
      rx_ftr = {
        param = 0x0,
        extra = 0x0,
        ticks_anchor = 0,
        us_radio_end = 0,
        us_radio_rdy = 0,
        rssi = 58 ':',
        lrpa_used = 0 '\000',
        rl_idx = 255 '\377',
        direct = 0 '\000'
      }
    }
  },
  pdu = 0x20003218 <mem_pdu_rx+1708> "D\006\352C\356\375\025Y\002\001\032\002\n\030\n\377L"
}
(gdb) p/x *((void**) (mfifo_pdu_rx_free.m + mfifo_pdu_rx_free.s * mfifo_pdu_rx_free.f))
$108 = 0x20003160
(gdb) p node_rx
$109 = (struct node_rx_pdu *) 0x200031f8 <mem_pdu_rx+1676>

check if the alloc_peek and alloc returend the same pointer (yes)

(gdb) p/x *((void**) (mfifo_pdu_rx_free.m + mfifo_pdu_rx_free.s * (mfifo_pdu_rx_free.f - 1)))
$111 = 0x200031f8
(gdb) p/x node_rx
$112 = 0x200031f8

garbage link value is also in the mem_link_rx pool (note the difference in the last byte)

(gdb) x/32xw &mem_link_rx
0x20002a1c <mem_link_rx>:	0x00000000	0x20002b24	0x20002b24	0x2000000a
0x20002a2c <mem_link_rx+16>:	0x20002b24	0x2000000a	0x20002b24	0x00000000
0x20002a3c <mem_link_rx+32>:	0x20002b24	0x2000000a	0x20002b24	0x00000000
0x20002a4c <mem_link_rx+48>:	0x20002b24	0x2000000a	0x20002b24	0x2000000a
0x20002a5c <mem_link_rx+64>:	0x20002b24	0x2000000a	0x20002b24	0x2000000a
0x20002a6c <mem_link_rx+80>:	0x20002a84	0x20002bbc	0x20002b24	0x2000000a
0x20002a7c <mem_link_rx+96>:	0x20002b24	0x2000000a	0xaf736ff7	0x200031f8
0x20002a8c <mem_link_rx+112>:	0x20002b24	0x2000000a	0x20002b24	0x2000000a
(gdb) info symbol 0x200000D4
mfifo_pdu_rx_free + 12 in section datas
(gdb) p/x &mfifo_pdu_rx_free
$113 = 0x200000c8
(gdb) p/x &mfifo_pdu_rx_free.m
$114 = 0x200000cc
(gdb) x/16xw mfifo_pdu_rx_free.m
0x200000cc <mfifo_pdu_rx_free+4>:	0x20002bbc	0x20002bbc	0x200031f8	0x20003160
0x200000dc <mfifo_pdu_rx_free+20>:	0x20003030	0x20002eb4	0x2000307c	0x20002c54
0x200000ec <mfifo_pdu_rx_free+36>:	0x20002f98	0x20002d84	0x20002f00	0x20002e68
0x200000fc <mfifo_pdu_rx_free+52>:	0x20002e1c	0x20003114	0x20002d38	0x200030c8
(gdb) x/32xw mfifo_pdu_rx_free.m
0x200000cc <mfifo_pdu_rx_free+4>:	0x20002bbc	0x20002bbc	0x200031f8	0x20003160
0x200000dc <mfifo_pdu_rx_free+20>:	0x20003030	0x20002eb4	0x2000307c	0x20002c54
0x200000ec <mfifo_pdu_rx_free+36>:	0x20002f98	0x20002d84	0x20002f00	0x20002e68
0x200000fc <mfifo_pdu_rx_free+52>:	0x20002e1c	0x20003114	0x20002d38	0x200030c8
0x2000010c <mfifo_pdu_rx_free+68>:	0x20002cec	0x200031ac	0x20002dd0	0x20002fe4
0x2000011c <mfifo_pdu_rx_free+84>:	0x20002f4c	0x20002b70	0x06060824	0x200062fc
0x2000012c <mfifo_prep+8>:	0x2000443c	0x003bb91c	0x20003b08	0x000205ed
0x2000013c <mfifo_prep+24>:	0x000206b5	0x00020661	0x00000000	0x00000001

random stuff

(gdb) info symbol 0x20009740
z_interrupt_stacks + 960 in section noinit
(gdb) x/4xw (&z_interrupt_stacks + 960)
0x20100b80:	0x00000000	0x00000000	0x00000000	0x00000000
(gdb) x/4xw z_interrupt_stacks + 960
0x20100b80:	0x00000000	0x00000000	0x00000000	0x00000000
(gdb) x/4xw &z_interrupt_stacks
0x20009380 <z_interrupt_stacks>:	0xaaaaaaaa	0xaaaaaaaa	0xaaaaaaaa	0xaaaaaaaa
(gdb) x/4xw 0x20009740
0x20009740 <z_interrupt_stacks+960>:	0x40001000	0x200031f8	0x00000000	0x200031f8
(gdb) x/12xw 0x20009730
0x20009730 <z_interrupt_stacks+944>:	0x00000001	0x20002640	0x2000261c	0x20002650
0x20009740 <z_interrupt_stacks+960>:	0x40001000	0x200031f8	0x00000000	0x200031f8
0x20009750 <z_interrupt_stacks+976>:	0x00000000	0x20003b08	0x00000000	0x00000001
(gdb) p mfifo_pdu_rx_free.s
$117 = 4 '\004'
(gdb) p mfifo_pdu_rx_free.n
$118 = 22 '\026'
(gdb) x/22xw mfifo_pdu_rx_free.m
0x200000cc <mfifo_pdu_rx_free+4>:	0x20002bbc	0x20002bbc	0x200031f8	0x20003160
0x200000dc <mfifo_pdu_rx_free+20>:	0x20003030	0x20002eb4	0x2000307c	0x20002c54
0x200000ec <mfifo_pdu_rx_free+36>:	0x20002f98	0x20002d84	0x20002f00	0x20002e68
0x200000fc <mfifo_pdu_rx_free+52>:	0x20002e1c	0x20003114	0x20002d38	0x200030c8
0x2000010c <mfifo_pdu_rx_free+68>:	0x20002cec	0x200031ac	0x20002dd0	0x20002fe4
0x2000011c <mfifo_pdu_rx_free+84>:	0x20002f4c	0x20002b70

(gdb) frame 3
#3  0x0000f2fe in isr_rx_pdu (devmatch_ok=<optimized out>, devmatch_id=<optimized out>, irkmatch_id=<optimized out>, rssi_ready=1 '\001', rl_idx=255 '\377', irkmatch_ok=0 '\000', lll=0x20003b08 <ll_scan+28>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan.c:992
992			err = isr_rx_scan_report(lll, rssi_ready,
(gdb) p *lll
$124 = {
  hdr = {
    parent = 0x20003aec <ll_scan>,
    is_stop = 0 '\000'
  },
  conn = 0x0,
  conn_ticks_slot = 0,
  conn_win_offset_us = 0,
  conn_timeout = 750,
  state = 1 '\001',
  chan = 0 '\000',
  filter_policy = 0 '\000',
  adv_addr_type = 1 '\001',
  init_addr_type = 1 '\001',
  type = 1 '\001',
  rpa_gen = 0 '\000',
  rl_idx = 255 '\377',
  init_addr = "\235\016\267\276\021\032",
  adv_addr = "\233$\027V#\376",
  interval = 18,
  ticks_window = 0
}
(gdb) p lll->init_addr
$125 = "\235\016\267\276\021\032"
(gdb) p/x lll->init_addr
$126 = {0x9d, 0xe, 0xb7, 0xbe, 0x11, 0x1a}
(gdb) p/x lll
$127 = 0x20003b08
(gdb) p lll
$128 = (struct lll_scan *) 0x20003b08 <ll_scan+28>
(gdb) p *lll
$129 = {
  hdr = {
    parent = 0x20003aec <ll_scan>,
    is_stop = 0 '\000'
  },
  conn = 0x0,
  conn_ticks_slot = 0,
  conn_win_offset_us = 0,
  conn_timeout = 750,
  state = 1 '\001',
  chan = 0 '\000',
  filter_policy = 0 '\000',
  adv_addr_type = 1 '\001',
  init_addr_type = 1 '\001',
  type = 1 '\001',
  rpa_gen = 0 '\000',
  rl_idx = 255 '\377',
  init_addr = "\235\016\267\276\021\032",
  adv_addr = "\233$\027V#\376",
  interval = 18,
  ticks_window = 0
}

frame 2

(gdb) frame 2
#2  0x00020836 in isr_rx_scan_report (rssi_ready=rssi_ready@entry=1 '\001', rl_idx=<optimized out>, dir_report=dir_report@entry=false, lll=0x20003b08 <ll_scan+28>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan.c:1151
1151		ull_rx_put(node_rx->hdr.link, node_rx);
(gdb) p *lll
$130 = {
  hdr = {
    parent = 0x20003aec <ll_scan>,
    is_stop = 0 '\000'
  },
  conn = 0x0,
  conn_ticks_slot = 0,
  conn_win_offset_us = 0,
  conn_timeout = 750,
  state = 1 '\001',
  chan = 0 '\000',
  filter_policy = 0 '\000',
  adv_addr_type = 1 '\001',
  init_addr_type = 1 '\001',
  type = 1 '\001',
  rpa_gen = 0 '\000',
  rl_idx = 255 '\377',
  init_addr = "\235\016\267\276\021\032",
  adv_addr = "\233$\027V#\376",
  interval = 18,
  ticks_window = 0
}

Any ideas?

LeBlue · 2020-05-20T15:41:56Z

@cvinayak Added a script that I use for reproducing the crash: https://github.com/LeBlue/zephyr-memq-bug-tools

cvinayak · 2020-05-22T05:55:49Z

@LeBlue Thank for the detailed tools to reproduce the issue. I will set this up and let it run over the weekend.

cvinayak · 2020-05-22T13:09:11Z

@LeBlue

I have 6 nRF51dongle_nrf51422 with samples/bluetooth/peripheral_sc_only application, these have static random addresses as privacy support is disabled.
I have hci_usb in nrf52840dongle_nrf52840 with the attached zephyr config. Note, I have console and uart disabled.

I notice that your bt_conn_loop.py only connects to first device and reports not connected to others. Looking at the btmon log, I only see, at anytime, only one device being connected to (connection handle is always 0).

I see a lot of kernel panic too. (i am using a dedicated linux laptop, I will collect the piped logs and attach them on Monday).

I have now switched over to using hci_uart using btattach -B <uart dev> -P H4
Now I have 6 peripherals gracefully connected and my laptop with your script reading the device information characteristics. I will leave this over for the weekend.

I am definitely seeing differences between use of hci_usb versus hci_uart. I suspect hci_usb maybe causing buffer overflow in the controller code.

Could you add LL_ASSERT(len <= 27); here: https://github.com/zephyrproject-rtos/zephyr/blob/master/subsys/bluetooth/controller/hci/hci.c#L2357

I would like you to setup a video call next week anytime between 0700 CET - 1400 CET, so we can debug together over screen sharing, if this is an option.

carlescufi · 2020-05-22T13:13:07Z

@emob-nordic any chance this is related to the USB code?

eobalski · 2020-05-22T13:15:28Z

@carlescufi I will have a look.

cvinayak · 2020-05-25T13:36:45Z

@LeBlue Thanks for the quick response.

Could you also, increase thread stack sizes (say 4096 bytes) of USB stack thread, logging thread and Bluetooth Tx Stack Size (you will need to enable it first, see my attached .config file in txt extension). I got my nRF52833DK (I dont have a spare nRF52840DK) crashing on power up when I enabled debug logging in Bluetooth and USB subsystems, hence the increase.

My zephyr/.config file is attached herewith.

hci_usb_debug_config.txt

I notice that your bt_conn_loop.py only connects to first device and reports not connected to others. Looking at the btmon log, I only see, at anytime, only one device being connected to (connection handle is always 0).

Strange. I will have a look till tomorow.

Ignore this for the moment, seems its specific to my physical laptop. I am now testing on a macbook pro with ubuntu 20.04 in virtualbox. 6 peripherals connected now, and periodically reading the software versions, so far over 30 mins now....

cvinayak · 2020-05-25T14:54:12Z

@LeBlue with all the thread stack increase, I hit a MPU fault. Analysing the instruction address being in RAM inside the USB Bluetooth thread stack region and LR being in log_core.c, I infer that the hardcoded bluetooth rx thread stack size (calling bt_send) of 512 bytes could be insufficient. For debugging purpose, I have increased it to 2048.

https://github.com/zephyrproject-rtos/zephyr/blob/master/subsys/usb/class/bluetooth.c#L41

Could you please increase and perform your tests?

cvinayak · 2020-05-25T15:08:04Z

The hci_uart uses thread stack size of >640 to 1024 bytes, the thread that calls bt_send.

LeBlue · 2020-05-26T12:19:09Z

@cvinayak I have modified the test setup and moved the peripherals on the edge of the radio reception range. Now the issue is observable (without the fix) after 10-20 mins (instead of several hours). This at least already shows that the memory corruption gets triggered by transmission errors.

carlescufi · 2020-05-26T14:21:04Z

@cvinayak I have modified the test setup and moved the peripherals on the edge of the radio reception range. Now the issue is observable (without the fix) after 10-20 mins (instead of several hours). This at least already shows that the memory corruption gets triggered by transmission errors.

@LeBlue could you please test #25619 with your new test setup?

LeBlue · 2020-05-26T15:24:48Z

could you please test #25619 with your new test setup?

@carlescufi Yes, test (hci_uart and hci_usb) for #25619 already running for ~3h. Seems to be working so far.

Fix missing assignment of NRF_CCM->MAXPACKETSIZE register for PDU sizes smaller than 251 bytes. If there is CRC errors causing PDU length fields to be higher than configured PDU buffer sizes in the controller, without the MAXPACKETSIZE register set to correct PDU size, CCM module could overrun the PDU buffer and cause memory corruption. This fix is applicable for all nRF52 Series SoCs except nRF52832 SoC. Fixes zephyrproject-rtos#21107. Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>

Fix missing assignment of NRF_CCM->MAXPACKETSIZE register for PDU sizes smaller than 251 bytes. If there is CRC errors causing PDU length fields to be higher than configured PDU buffer sizes in the controller, without the MAXPACKETSIZE register set to correct PDU size, CCM module could overrun the PDU buffer and cause memory corruption. This fix is applicable for all nRF52 Series SoCs except nRF52832 SoC. Fixes #21107. Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>

Fix missing assignment of NRF_CCM->MAXPACKETSIZE register for PDU sizes smaller than 251 bytes. If there is CRC errors causing PDU length fields to be higher than configured PDU buffer sizes in the controller, without the MAXPACKETSIZE register set to correct PDU size, CCM module could overrun the PDU buffer and cause memory corruption. This fix is applicable for all nRF52 Series SoCs except nRF52832 SoC. Fixes zephyrproject-rtos#21107. Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>

Implemented an intermediate decrypt buffer to cover the CCM overrun under CRC error conditions. The workaround is applicable to nRF52832 SoC only, which is missing the MAXPACKETSIZE register in the NRF_CCM peripheral. Fixes zephyrproject-rtos#21107 for nRF52832 SoC. Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>

Implemented an intermediate decrypt buffer to cover the CCM overrun under CRC error conditions. The workaround is applicable to nRF52832 SoC only, which is missing the MAXPACKETSIZE register in the NRF_CCM peripheral. Fixes #21107 for nRF52832 SoC. Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>

Fix missing assignment of NRF_CCM->MAXPACKETSIZE register for PDU sizes smaller than 251 bytes. If there is CRC errors causing PDU length fields to be higher than configured PDU buffer sizes in the controller, without the MAXPACKETSIZE register set to correct PDU size, CCM module could overrun the PDU buffer and cause memory corruption. This fix is applicable for all nRF52 Series SoCs except nRF52832 SoC. Fixes zephyrproject-rtos#21107. Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>

Implemented an intermediate decrypt buffer to cover the CCM overrun under CRC error conditions. The workaround is applicable to nRF52832 SoC only, which is missing the MAXPACKETSIZE register in the NRF_CCM peripheral. Fixes zephyrproject-rtos#21107 for nRF52832 SoC. Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>

Fix missing assignment of NRF_CCM->MAXPACKETSIZE register for PDU sizes smaller than 251 bytes. If there is CRC errors causing PDU length fields to be higher than configured PDU buffer sizes in the controller, without the MAXPACKETSIZE register set to correct PDU size, CCM module could overrun the PDU buffer and cause memory corruption. This fix is applicable for all nRF52 Series SoCs except nRF52832 SoC. Fixes #21107. Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>

Fix missing assignment of NRF_CCM->MAXPACKETSIZE register for PDU sizes smaller than 251 bytes. If there is CRC errors causing PDU length fields to be higher than configured PDU buffer sizes in the controller, without the MAXPACKETSIZE register set to correct PDU size, CCM module could overrun the PDU buffer and cause memory corruption. This fix is applicable for all nRF52 Series SoCs except nRF52832 SoC. Fixes zephyrproject-rtos#21107. Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no> (cherry picked from commit cd7a73c)

…overrun Implemented an intermediate decrypt buffer to cover the CCM overrun under CRC error conditions. The workaround is applicable to nRF52832 SoC only, which is missing the MAXPACKETSIZE register in the NRF_CCM peripheral. Fixes zephyrproject-rtos#21107 for nRF52832 SoC. Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no> (cherry picked from commit ba487fe)

LeBlue added the bug The issue is a bug, or the PR is fixing a bug label Dec 2, 2019

carlescufi assigned cvinayak Dec 2, 2019

aescolar added the area: Bluetooth label Dec 2, 2019

dleach02 added the priority: medium Medium impact/importance bug label Dec 3, 2019

cvinayak added the Waiting for response Waiting for author's response label Mar 5, 2020

carlescufi assigned eobalski May 22, 2020

carlescufi removed the Waiting for response Waiting for author's response label May 25, 2020

cvinayak mentioned this issue May 26, 2020

USB: Bluetooth: Fix rx_thread_stack size #25606

Merged

cvinayak mentioned this issue May 26, 2020

Bluetooth: controller: split: Fix NRF_CCM overrun #25619

Merged

cvinayak mentioned this issue May 26, 2020

Bluetooth: controller: split: Workaround nRF52832 CCM overrun #25634

Merged

carlescufi closed this as completed in #25619 May 27, 2020

cvinayak mentioned this issue May 28, 2020

[backport v1.14] Bluetooth: controller: split: Fix NRF_CCM overrun #25686

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LL_ASSERT and 'Imprecise data bus error' in LL Controller #21107

LL_ASSERT and 'Imprecise data bus error' in LL Controller #21107

LeBlue commented Dec 2, 2019

cvinayak commented Dec 5, 2019

LeBlue commented Dec 5, 2019

LeBlue commented Dec 13, 2019

cvinayak commented Dec 17, 2019

LeBlue commented Dec 18, 2019 •

edited

Loading

LeBlue commented Dec 18, 2019

LeBlue commented Jan 24, 2020

cristic-devel commented Jan 29, 2020 •

edited

Loading

LeBlue commented Feb 6, 2020

cvinayak commented Feb 14, 2020 •

edited

Loading

LeBlue commented Feb 14, 2020

cvinayak commented Feb 18, 2020 •

edited

Loading

cvinayak commented Feb 18, 2020

cristic-devel commented Feb 18, 2020

LeBlue commented Feb 19, 2020

cvinayak commented Feb 20, 2020

jhedberg commented Mar 9, 2020

mjago commented Apr 17, 2020 •

edited

Loading

LeBlue commented May 20, 2020

LeBlue commented May 20, 2020

cvinayak commented May 22, 2020

cvinayak commented May 22, 2020

carlescufi commented May 22, 2020

eobalski commented May 22, 2020

cvinayak commented May 25, 2020

cvinayak commented May 25, 2020

cvinayak commented May 25, 2020

LeBlue commented May 26, 2020

carlescufi commented May 26, 2020

LeBlue commented May 26, 2020

LL_ASSERT and 'Imprecise data bus error' in LL Controller #21107

LL_ASSERT and 'Imprecise data bus error' in LL Controller #21107

Comments

LeBlue commented Dec 2, 2019

cvinayak commented Dec 5, 2019

LeBlue commented Dec 5, 2019

LeBlue commented Dec 13, 2019

cvinayak commented Dec 17, 2019

LeBlue commented Dec 18, 2019 • edited Loading

LeBlue commented Dec 18, 2019

How to reproduce the issue

Changes to the hci_usb sample

Peripherals

Host system

Timeframe

Notes

LeBlue commented Jan 24, 2020

cristic-devel commented Jan 29, 2020 • edited Loading

LeBlue commented Feb 6, 2020

Update

Fix

cvinayak commented Feb 14, 2020 • edited Loading

LeBlue commented Feb 14, 2020

cvinayak commented Feb 18, 2020 • edited Loading

cvinayak commented Feb 18, 2020

cristic-devel commented Feb 18, 2020

LeBlue commented Feb 19, 2020

cvinayak commented Feb 20, 2020

jhedberg commented Mar 9, 2020

mjago commented Apr 17, 2020 • edited Loading

LeBlue commented May 20, 2020

backtrace of imprecise bus fault

Frame 6

backtracing the culprit

LeBlue commented May 20, 2020

cvinayak commented May 22, 2020

cvinayak commented May 22, 2020

carlescufi commented May 22, 2020

eobalski commented May 22, 2020

cvinayak commented May 25, 2020

cvinayak commented May 25, 2020

cvinayak commented May 25, 2020

LeBlue commented May 26, 2020

carlescufi commented May 26, 2020

LeBlue commented May 26, 2020

LeBlue commented Dec 18, 2019 •

edited

Loading

cristic-devel commented Jan 29, 2020 •

edited

Loading

cvinayak commented Feb 14, 2020 •

edited

Loading

cvinayak commented Feb 18, 2020 •

edited

Loading

mjago commented Apr 17, 2020 •

edited

Loading