Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LL_ASSERT and 'Imprecise data bus error' in LL Controller #21107

Closed
LeBlue opened this issue Dec 2, 2019 · 32 comments · Fixed by #25619 or #25634
Closed

LL_ASSERT and 'Imprecise data bus error' in LL Controller #21107

LeBlue opened this issue Dec 2, 2019 · 32 comments · Fixed by #25619 or #25634
Assignees
Labels
area: Bluetooth bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug

Comments

@LeBlue
Copy link

LeBlue commented Dec 2, 2019

Describe the bug
I am using a custom board similar to the nrf52840_pca10056 modified version of the hci_usb sample with addtion of a watchdog, mcumgr capability and logging. The main application on a linux host (bluez5.51) will search and (re)connect up to 9 peripheral devices.

For reconnection the application will regularly enable and disable scans. In the setup there are only 8 devices present to force simultaneous scanning and data exchange with the peripherals.

The FW (hci_usb) on the nrf52840 is crashing regularly (minutes - 4h) with one of the three following errors. Since the crashes happen in the same subsystem (bluetooth/controller/ll_sw/), all three different error logs are attached here. Every crash was seen at least two times.

To Reproduce
The crash is observed on a custom board similar to the nrf52840_pca10056 modified version of the hci_usb sample with addtion of a watchdog, mcumgr capability and logging:

CONFIG_BOOTLOADER_MCUBOOT=y
CONFIG_MCUMGR=y
CONFIG_MCUMGR_SMP_UART=y
CONFIG_MCUMGR_CMD_IMG_MGMT=y
CONFIG_MCUMGR_CMD_OS_MGMT=y
CONFIG_BT_MAX_CONN=15
CONFIG_BLUETOOTH_INT_EP_MPS=64
CONFIG_LOG=y
CONFIG_UART_CONSOLE=y

The bug should be observable on the standard hci_usb sample for the nrf52840_pca10056

Debugging efforts
I tried adding debug logs of the subsystem, but could not reproduce the bug with it enabled.

Expected behavior
The crashes should not happen.

Impact
This is a major annoance to showstopper

zephyr log output

See file below for complete logs.

  1. This happens the most of all three errors
[03:36:46.588,989] <err> os: ***** BUS FAULT *****
[03:36:46.588,989] <err> os:   Imprecise data bus error
[03:36:46.589,019] <err> os: r0/a1:  0x20001b40  r1/a2:  0x20001aac  r2/a3:  0x20002060
[03:36:46.589,019] <err> os: r3/a4:  0x20008dcc r12/ip:  0x00021339 r14/lr:  0x00018e0d
[03:36:46.589,019] <err> os:  xpsr:  0x6100021b
[03:36:46.589,019] <err> os: Faulting instruction address (r15/pc): 0x00014686
[03:36:46.589,019] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[03:36:46.589,019] <err> os: Fault during interruptling
[03:36:46.589,050] <err> os: Current thread: 0x20004fa8 (unknown)
[03:36:46.652,160] <err> os: Halting system
addr2line -e zephyr/zephyr.elf 0x00014686
zephyr/subsys/bluetooth/controller/util/memq.c:100

assert: 'rx' failed
[00:38:19.855,865] <err> os: r0/a1:  0x00000003  r1/a2:  0x0000000a  r2/a3:  0x00000001
[00:38:19.855,895] <err> os: r3/a4:  0x00011a01 r12/ip:  0x20002800 r14/lr:  0x00017d99
[00:38:19.855,895] <err> os:  xpsr:  0x6100001b
[00:38:19.855,895] <err> os: Faulting instruction address (r15/pc): 0x00017da4
[00:38:19.855,895] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:38:19.855,895] <err> os: Fault during interruptling
[00:38:19.855,926] <err> os: Current thread: 0x20004fa8 (unknown)
[00:38:19.907,470] <err> os: Halting system
addr2line -e zephyr/zephyr.elf 0x00017da4
zephyr/subsys/bluetooth/controller/ll_sw/ull.c:1449 (discriminator 1)

assert: 'e' failed
[04:18:06.579,406] <err> os: r0/a1:  0x00000003  r1/a2:  0x0000000a  r2/a3:  0x00000001
[04:18:06.579,406] <err> os: r3/a4:  0x00011a01 r12/ip:  0x00000000 r14/lr:  0x0001e0b1
[04:18:06.579,406] <err> os:  xpsr:  0x61000011
[04:18:06.579,406] <err> os: Faulting instruction address (r15/pc): 0x0001e0bc
[04:18:06.579,406] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[04:18:06.579,437] <err> os: Fault during interruptling
[04:18:06.579,437] <err> os: Current thread: 0x20004fa8 (unknown)
[04:18:06.630,981] <err> os: Halting system
addr2line -e zephyr/zephyr.elf 0x0001e0bc
zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_conn.c:586 (discriminator 1)

Additional journal log of the linux host with the (full) zephyr logs included and some additional context from the linux side:
journal_ll_ctrl_assert_bug.txt

Environment (please complete the following information):

  • OS: MacOS (FW build), Linux (user of the external HCI controller)
  • Toolchain: gnuarmemb v7.3.1, updated today v9.2.1 no change
  • Commit SHA or Version used: v2.1.0-rc1-316-g12948fdf37d5

Additional context
The more traffic is created, eg. sending indications from/reading/writing values of the remote devices, the more likely it is to trigger the bug.

The full logs show some hci command timeouts from the linux side, which mostly revolve around enableing/disableing scanning.

@LeBlue LeBlue added the bug The issue is a bug, or the PR is fixing a bug label Dec 2, 2019
@dleach02 dleach02 added the priority: medium Medium impact/importance bug label Dec 3, 2019
@cvinayak
Copy link
Contributor

cvinayak commented Dec 5, 2019

@LeBlue Is this issue reproducible with CONFIG_LOG=n and CONFIG_SERIAL=n ? Also, otherwise, could you try with updating the BT_CTLR_RX_PRIO_STACK_SIZE value from 448 to something higher say 1024 and try.

Its here https://github.com/zephyrproject-rtos/zephyr/blob/master/subsys/bluetooth/controller/Kconfig#L152

@LeBlue
Copy link
Author

LeBlue commented Dec 5, 2019

@cvinayak The issue is reproducible with CONFIG_LOG=n, at least the behaviour was the same. I enabled CONFIG_LOG to just find out what/where it happens. But CONFIG_SERIAL was enabled as it was necessary for MCUMRG.

I will first try out the BT_CTRL_RX_PRIO_STACK_SIZE option and share the results.

@LeBlue
Copy link
Author

LeBlue commented Dec 13, 2019

@cvinayak increasing BT_CTRL_RX_PRIO_STACK_SIZE did not help. Addtionaly I watermarked several thread stacks with CONFIG_INIT_STACKS=y, added CONFIG_THREAD_NAME=y and CONFIG_MPU_STACK_GUARD=y

[00:55:57.123,168] <inf> os: recv thread stack :        unused 256        usage 256 / 512 (50 %)
[00:55:58.127,593] <inf> os: z_main_stack :        unused 736        usage 288 / 1024 (28 %)
[00:55:59.127,838] <inf> os: isr_stack :        unused 1280        usage 768 / 2048 (37 %)
[00:56:00.128,387] <inf> os: sys_work_q_stack :        unused 2872        usage 200 / 3072 (6 %)
[00:56:00.832,916] <inf> bt_ctlr_hci: handle: 0x0001, rssi: -69 dB.
[00:56:01.128,540] <inf> os: z_idle_stack :        unused 268        usage 52 / 320 (16 %)
[00:56:02.128,631] <inf> os: logging_stack :        unused 240        usage 528 / 768 (68 %)
[00:56:04.877,624] <inf> os: prio recv thread stack :        unused 808        usage 216 / 1024 (21 %)
[00:56:14.413,055] <inf> os: prio recv thread stack :        unused 808        usage 216 / 1024 (21 %)
[00:56:14.718,933] <inf> os: usb hci tx :        unused 128        usage 384 / 512 (75 %)
[00:56:14.720,031] <inf> os: usb hci rx :        unused 272        usage 240 / 512 (46 %)
[00:56:14.810,729] <inf> os: recv thread stack :        unused 256        usage 256 / 512 (50 %)
[00:56:14.812,744] <err> os: ***** BUS FAULT *****
[00:56:14.812,744] <err> os:   Imprecise data bus error
[00:56:14.812,744] <err> os: r0/a1:  0x20001cd0  r1/a2:  0x20001fc8  r2/a3:  0x20002150
[00:56:14.812,744] <err> os: r3/a4:  0x40001000 r12/ip:  0xaaaaaaaa r14/lr:  0x0001adcb
[00:56:14.812,744] <err> os:  xpsr:  0x21000211
[00:56:14.812,744] <err> os: Faulting instruction address (r15/pc): 0x0001478a
[00:56:14.812,774] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:56:14.812,774] <err> os: Fault during interrupt handling
[00:56:14.812,774] <err> os: Current thread: 0x20005098 (idle)
[00:56:14.907,135] <err> os: Halting system

I think this rules out the stack overflow assumption?

btw.: I had to adjust the default for BT_CTRL_RX_PRIO_STACK_SIZE, adding it to prj.conf did not work.

Addresses v2.1-branch@v2.1.0:

addr2line -e zephyr/zephyr.elf 0x0001478a    subsys/bluetooth/controller/util/memq.c:103
addr2line -e zephyr/zephyr.elf 0x0001adcb     subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan.c:1137

@cvinayak
Copy link
Contributor

@LeBlue Have you done any modifications to controller code? did you add the "le" word to the BT_INFO logging from the controller's hci.c ?

I see you are connecting to upto 8 devices. Does this problem exist if its only one device connected to?

Is there a way, I can reproduce this on my end, could you share a minimal private branch that can reproduce the issue with steps to reproduce?

@LeBlue
Copy link
Author

LeBlue commented Dec 18, 2019

@cvinayak

Have you done any modifications to controller code? did you add the "le" word to the BT_INFO logging from the controller's hci.c ?

No i did not change this part, but the le is probably just an artifact from filtering the log files.

I see you are connecting to upto 8 devices. Does this problem exist if its only one device connected to?

I did not see it happening with only one or even a few devices. (Does not mean it cannot happen).
Generally the crash does happen more frequently if:

  • More devices are connected
  • More additional BLE advertisers are around
  • Scanning is regularly switched on and off in intervals. (We are searching for the 9th devices, which is the targeted max. number, and that is intentionally not around/switched off)
  • More data is exchanged between the peripherals and dongle (reading/writing values, indications generated by the peripherals)

I have not been able to directly provoke the crash. So sometimes it just works for hours or even days.

This might be unrelated, but here some additional information about the peripherals:

  • connection interval 22.5 ms (18)
  • variable slave latency between 0 and 49 via slave initiated connection parameter update
  • connections are authenticated and encrypted

Before you try and reproduce it (instructions will follow), have a look at the gathered and annotated logs:

crash_logs.tar.gz

Summary:

  1. Almost all (38) show a "imprecise bus fault" at memq_enqueue, the rest (6) is a ASSERT from data obtained by memq_dequeue (probably also enqueue failed, ignore for now)
  2. All happened with the queues memq_ull_rx and memq_ll_rx.
  3. All parameters to memq_enqueue seem to be from the correct mem pools and in range, based on the return address and registers r0-r3. Only as far as i can tell and I did check this only for some occurrences.
  4. Since the bus fault is imprecise, any of the parameters to memq_enqueue dereferencing/storing might be the issue.

Thoughts/my conclusions/possible explanations on this (might be incomplete or plain wrong, please correct/share thoughts on this)

  1. memq implementation might not work correctly and suffer from race?
  2. memq usage is wrong? As far as i can tell, it should only be equipped to deal in parallel with one consumer and one producer? Call sites seem to be OK, but i cannot tell if preempted interrupt handlers could violate this.
  3. mem_acquire/mem_release suffer from race? Most/All? parameters are acquired from mempools. It might be worth checking if concurrent allocations from the same pool can happen (and fail without notice).
  4. mem_acquire error is ignored somewhere? I could not find any occurrences.

It seems unlikely but these are obvious places to check.

@LeBlue
Copy link
Author

LeBlue commented Dec 18, 2019

How to reproduce the issue

it should be sufficient to use the plain zephyr hci_usb sample with updated BT_MAX_CONN=15 on any board with the nrf52840 or similar with USB and UART (for logging).

Changes to the hci_usb sample

The only changes I am using are:

  1. set the BD Address from the vendor register
  2. a watchdog (thread) on low priority. (Bug was reproducible, even with this disabled.)
  3. MCUMGR update capability via the UART and build binaries bootable with MCUBOOT
  4. in the later versions i added stack watermark checks for most of the system threads in 5 to 10s intervals.
  5. in later versions i added CONFIG_MPU_STACK_GUARD=y , thread names and upped the size of rx_prio thread stack.

These should not affect the bug, and 4 and 5 did not change the outcome.

Peripherals

For the peripherals the plain peripheral sample should do.
To match the behaviour, a call to bt_conn_le_param_update maybe helpful.

  1. connection interval 22.5 ms (18)
  2. variable slave latency between 0 and 49 via slave initiated connection parameter update, for reproducing, I think 0 is best. I do not think, the parameter update triggers the crash, value can be fixed.
  3. connections are authenticated and encrypted

Probably at least 5/6 peripherals are needed. To increase the chance, maybe even more than 8.
Support for indicating any value regularly would help. Here the indication frequency of the sample should be increased (is currently at 1/s)

Host system

Any Linux distro native/VM should do with bluez.

Connect the peripherals and exchange some data by reading characteristics regularly and/or indications by the peripherals.

Timeframe

The issue happened in between less than 1h and 2-3 days.
The issue is not directly triggerable but only the chance can be increased by creating a lot of RX events. This is the most important part.

Notes

I have two test systems setup:

  • 8 connected devices
  • regularly enable/disableing scanning
  • round robin reading a characteristic value (peripheral idle state, slave latency 49)
  • Additionally value indications (peripheral used/active state, slave latency 4), only manually triggered by a tester.
  1. One setup, where the crash happens quite frequently (around 5/day, idle state), there a A LOT of advertisers around (80-130). Here the crash happens regularly (15-45 mins), if a lot of indications (>20/s) are generated by the peripherals over a longer period (used/active state).
  2. The second setup, the crash happens but rarely (around 1 in 3 days, idle state), there are few BLE devices around (10-20). I could not trigger the crash (used/active state) in this setup.

The attached 44 crash logs are all the logs from 2 systems running for 14 days. There may have been a couple more since some weekend logs got rotated.

@LeBlue
Copy link
Author

LeBlue commented Jan 24, 2020

@cvinayak any ideas or comments?

@cristic-devel
Copy link
Contributor

cristic-devel commented Jan 29, 2020

Hi,
I also run into this issue running peripheral sample on a nrf52_pca10040 board. The sample is compiled with FIXED_PASSKEY and this change in the main.c

bt_conn_cb_register(&conn_callbacks);

#if defined(CONFIG_BT_FIXED_PASSKEY)
bt_passkey_set(111111);
#endif /* CONFIG_BT_FIXED_PASSKEY */
bt_conn_auth_cb_register(&auth_cb_display);

Nordic
Connected
[01:12:04.213,348] os: ***** BUS FAULT *****
[01:12:04.213,348] os: Imprecise data bus error
[01:12:04.213,348] os: r0/a1: 0x20000b60 r1/a2: 0x20000aa4 r2/a3: 0x20000e30
[01:12:04.213,378] os: r3/a4: 0x00000001 r12/ip: 0xeb2a7358 r14/lr: 0x00010521
[01:12:04.213,378] os: xpsr: 0x21000011
[01:12:04.213,378] os: Faulting instruction address (r15/pc): 0x0001c0e8
[01:12:04.213,378] os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[01:12:04.213,378] os: Fault during interrupt handling
[01:12:04.213,378] os: Current thread: 0x20000250 (unknown)
[01:12:04.276,153] os: Halting system

@LeBlue
Copy link
Author

LeBlue commented Feb 6, 2020

Update

I found a way to reliably trigger the


by more or less simultaenously reading a value from all (9) connected peripherals. Command sequence from btmon:

 < ACL Data TX: Handle 8 flags 0x00 dlen 7               #2313 [hci0] 140.299577
      ATT: Read Request (0x0a) len 2
        Handle: 0x0031
< ACL Data TX: Handle 4 flags 0x00 dlen 7               #2314 [hci0] 140.301017
      ATT: Read Request (0x0a) len 2
        Handle: 0x0031
< ACL Data TX: Handle 5 flags 0x00 dlen 7               #2315 [hci0] 140.302633
      ATT: Read Request (0x0a) len 2
        Handle: 0x0031
< ACL Data TX: Handle 7 flags 0x00 dlen 7               #2316 [hci0] 140.304207
      ATT: Read Request (0x0a) len 2
        Handle: 0x0031
< ACL Data TX: Handle 3 flags 0x00 dlen 7               #2317 [hci0] 140.305643
      ATT: Read Request (0x0a) len 2
        Handle: 0x0031
< ACL Data TX: Handle 1 flags 0x00 dlen 7               #2318 [hci0] 140.307114
      ATT: Read Request (0x0a) len 2
        Handle: 0x0031
< ACL Data TX: Handle 6 flags 0x00 dlen 7               #2319 [hci0] 140.308518
      ATT: Read Request (0x0a) len 2
        Handle: 0x0031
> HCI Event: Number of Completed Packets (0x13) plen 5  #2320 [hci0] 140.325971
        Num handles: 1
        Handle: 3
        Count: 1
< ACL Data TX: Handle 2 flags 0x00 dlen 7               #2321 [hci0] 140.326137
      ATT: Read Request (0x0a) len 2
        Handle: 0x0031
> HCI Event: Number of Completed Packets (0x13) plen 5  #2322 [hci0] 140.326977
        Num handles: 1
        Handle: 8
        Count: 1
< ACL Data TX: Handle 0 flags 0x00 dlen 7               #2323 [hci0] 140.327063
      ATT: Read Request (0x0a) len 2
        Handle: 0x0031
> HCI Event: Number of Completed Packets (0x13) plen 5  #2324 [hci0] 140.331973
        Num handles: 1
        Handle: 2
        Count: 1
> HCI Event: Number of Completed Packets (0x13) plen 5  #2325 [hci0] 140.337971
        Num handles: 1
        Handle: 7
        Count: 1
> HCI Event: Number of Completed Packets (0x13) plen 5  #2326 [hci0] 140.345993
        Num handles: 1
        Handle: 4
        Count: 1
> HCI Event: Number of Completed Packets (0x13) plen 5  #2327 [hci0] 140.350960
        Num handles: 1
        Handle: 6
        Count: 1
> ACL Data RX: Handle 2 flags 0x02 dlen 7               #2328 [hci0] 140.352972
      ATT: Read Response (0x0b) len 2
        Value: 42a7
> ACL Data RX: Handle 3 flags 0x02 dlen 7               #2329 [hci0] 140.368984
      ATT: Read Response (0x0b) len 2
        Value: 3fa7
> ACL Data RX: Handle 6 flags 0x02 dlen 7               #2330 [hci0] 140.372994
      ATT: Read Response (0x0b) len 2
        Value: 3ea7
> HCI Event: Number of Completed Packets (0x13) plen 5  #2331 [hci0] 140.386995
        Num handles: 1
        Handle: 0
        Count: 1
> HCI Event: Number of Completed Packets (0x13) plen 5  #2332 [hci0] 140.388978
        Num handles: 1
        Handle: 5
        Count: 1
> ACL Data RX: Handle 4 flags 0x02 dlen 7               #2333 [hci0] 140.391981
      ATT: Read Response (0x0b) len 2
        Value: 43a7
> ACL Data RX: Handle 8 flags 0x02 dlen 7               #2334 [hci0] 140.396978
      ATT: Read Response (0x0b) len 2
        Value: 42a7
> ACL Data RX: Handle 7 flags 0x02 dlen 7               #2335 [hci0] 140.406993
      ATT: Read Response (0x0b) len 2
        Value: 35a7
> HCI Event: Number of Completed Packets (0x13) plen 5  #2336 [hci0] 140.447990
        Num handles: 1
        Handle: 1
        Count: 1
> ACL Data RX: Handle 0 flags 0x02 dlen 7               #2337 [hci0] 140.455983
      ATT: Read Response (0x0b) len 2
        Value: 44a7
> ACL Data RX: Handle 5 flags 0x02 dlen 7               #2338 [hci0] 140.458996
      ATT: Read Response (0x0b) len 2
        Value: 41a7
> ACL Data RX: Handle 1 flags 0x02 dlen 7               #2339 [hci0] 140.516996
      ATT: Read Response (0x0b) len 2
        Value: 3fa7

The assertion was hit 9/10 times directly after the successful reading of all values. This applies only to the 1. test setup described above.
Tested on master@fbbf68d63c45

Fix

I was able to fix/circumvent the problem by increasing CONFIG_BT_CTLR_LLCP_CONN to the maximum allowed (== CONFIG_BT_MAX_CONN).

CONFIG_BT_MAX_CONN=15
CONFIG_BT_CTLR_ADVANCED_FEATURES=y
CONFIG_BT_CTLR_LLCP_CONN=15

Should this be the default setting for the hci_* samples?

@cvinayak
Copy link
Contributor

cvinayak commented Feb 14, 2020

@LeBlue

CONFIG_BT_CTLR_LLCP_CONN= CONFIG_BT_MAX_CONN

Yes, this has to be default if connections are established back-to-back and/or control procedures initiated simultaneously to all connected peers.

Does this fix original issue? #21107 (comment)

@LeBlue
Copy link
Author

LeBlue commented Feb 14, 2020

@cvinayak
It fixes one of them (assert(e), 3. in the first comment), yes. This was the crash I could reliably reproduce and resolved by setting CONFIG_BT_CTLR_LLCP_CONN .

But there seems to be a different issue that is still occuring (1. in the first comment). This is the imprecise data bus error in the memq_enqueue function. I have tested this up until master@fbbf68d63c45 with no changes.

Attached are the all recent failure logs where i decoded the registers shown in the fault dump (pc, lr, etc).
imprecise-memq_enqueue.tar.gz

I am reluctant to close ths issue an open a new one, because there is a lot of context information gathered here (mainly in #21107 (comment)), but i can do it, if necessary.

If i should try something or if you have an idea how to get more information, let me know.

I did not see the assert(rx) (2. in the first comment).

@cvinayak
Copy link
Contributor

cvinayak commented Feb 18, 2020

@LeBlue is it possible to provide me a script or application for linux that I can use to reproduce the crash? (at best I have only used btmgmt tool).

Also, could you use zephyr SDK on a linux distro to build your hci_usb?

@cvinayak
Copy link
Contributor

Hi,
I also run into this issue running peripheral sample on a nrf52_pca10040 board. The sample is compiled with FIXED_PASSKEY and this change in the main.c

bt_conn_cb_register(&conn_callbacks);

#if defined(CONFIG_BT_FIXED_PASSKEY)
bt_passkey_set(111111);
#endif /* CONFIG_BT_FIXED_PASSKEY */
bt_conn_auth_cb_register(&auth_cb_display);

Nordic
Connected
[01:12:04.213,348] os: ***** BUS FAULT *****
[01:12:04.213,348] os: Imprecise data bus error
[01:12:04.213,348] os: r0/a1: 0x20000b60 r1/a2: 0x20000aa4 r2/a3: 0x20000e30
[01:12:04.213,378] os: r3/a4: 0x00000001 r12/ip: 0xeb2a7358 r14/lr: 0x00010521
[01:12:04.213,378] os: xpsr: 0x21000011
[01:12:04.213,378] os: Faulting instruction address (r15/pc): 0x0001c0e8
[01:12:04.213,378] os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[01:12:04.213,378] os: Fault during interrupt handling
[01:12:04.213,378] os: Current thread: 0x20000250 (unknown)
[01:12:04.276,153] os: Halting system

I am not able to reproduce this. Are you using the zephyr SDK supplied toolchain?

@cristic-devel
Copy link
Contributor

No, because I run on Windows (and SDK is not supported there???). I used 3rd party GNU ARM Embedded toolchain.

@LeBlue
Copy link
Author

LeBlue commented Feb 19, 2020

@cvinayak I will provide a python program as a simple script will not do the job. This will take approx. 1-2 days. I will share it as a GitHub repository.

@cvinayak
Copy link
Contributor

@lebue Thanks. Yes, a GitHub repo to reproduce the issue will be the best.

Also, please check if #22946 has any influence. Its possible to have left radio in unknown state, could cause memory corruption (speculating based on possibility of leaving radio in Rx state), and occurrence being of random nature.

@cvinayak cvinayak added the Waiting for response Waiting for author's response label Mar 5, 2020
@jhedberg
Copy link
Member

jhedberg commented Mar 9, 2020

@LeBlue any update on this?

@mjago
Copy link

mjago commented Apr 17, 2020

Possibly related: recent install on Linux (v2.2.99, SDK v0.11.2) running hci_pwr_ctrl sample on nrf52dk_nrf52832. Reproducible occurring every build. Client Android nRF Connect. Other tried samples are working fine.

[00:01:49.009,796] <err> os: ***** BUS FAULT *****
[00:01:49.009,796] <err> os:   Imprecise data bus error
[00:01:49.009,796] <err> os: r0/a1:  0xfffffffc  r1/a2:  0x00000004  r2/a3:  0x00019043
[00:01:49.009,826] <err> os: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x00018d55
[00:01:49.009,826] <err> os:  xpsr:  0x81000000
[00:01:49.009,826] <err> os: Faulting instruction address (r15/pc): 0x00018d60
[00:01:49.009,826] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:01:49.009,826] <err> os: Current thread: 0x20001a14 (unknown)
[00:01:49.123,016] <err> os: Halting system

Edit: Connecting and selecting Heart Rate Measurement Notify seems to push over-the-edge.

@LeBlue
Copy link
Author

LeBlue commented May 20, 2020

@cvinayak @jhedberg
Sorry for the long time, I did not have time to work on this.
The flash driver patch did not change anything.

I did some more digging and found out the following (same behaviour on 2.2.0 and 2.3.0-rc1), backtraces are from v2.3.0-rc1 (fault in frame 6).

backtrace of imprecise bus fault

#0  k_sys_fatal_error_handler (reason=reason@entry=0, esf=esf@entry=0x200096f0 <z_interrupt_stacks+912>) at zephyr/include/generated/syscalls/log_ctrl.h:33
        __func__ = <optimized out>
#1  0x00025586 in z_fatal_error (reason=reason@entry=0, esf=esf@entry=0x200096f0 <z_interrupt_stacks+912>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/kernel/fatal.c:123
        key = 0
        thread = 0x200027a4 <recv_thread_data>
        __func__ = <optimized out>
#2  0x00013a6c in z_arm_fatal_error (reason=reason@entry=0, esf=esf@entry=0x200096f0 <z_interrupt_stacks+912>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/arch/arm/core/aarch32/fatal.c:47
No locals.
#3  0x00013ef8 in z_arm_fault (msp=<optimized out>, psp=<optimized out>, exc_return=<optimized out>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/arch/arm/core/aarch32/cortex_m/fault.c:969
        reason = 0
        fault = <optimized out>
        recoverable = false
        nested_exc = true
        esf = 0x20009730 <z_interrupt_stacks+976>
        esf_copy = {basic = {{a1 = 536881784, r0 = 536881784}, {a2 = 536881540, r1 = 536881540}, {a3 = 536884008, r2 = 536884008}, {a4 = 536871812, r3 = 536871812}, {ip = 3, r12 = 3}, {lr = 103401, r14 = 103401}, {pc = 83654, r15 = 83654}, xpsr = 1627390481}}
#4  0x00013b5c in z_arm_usage_fault () at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/arch/arm/core/aarch32/cortex_m/fault_s.S:85
No locals.
#5  <signal handler called>
No symbol table info available.
#6  memq_enqueue (link=link@entry=0x20002a78 <mem_link_rx+104>, mem=mem@entry=0x20002984 <mem_done+60>, tail=tail@entry=0x20003328 <memq_ull_rx+4>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/util/memq.c:100
No locals.
#7  0x000193e8 in ull_rx_put (rx=0x20002984 <mem_done+60>, link=0x20002a78 <mem_link_rx+104>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/ull.c:1087
        rx_hdr = 0x20002984 <mem_done+60>
#8  ull_event_done (param=<optimized out>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/ull.c:1173
        evdone = 0x20002984 <mem_done+60>
        link = 0x20002a78 <mem_link_rx+104>
#9  0x0001f6c0 in lll_done (param=<optimized out>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:296
        next = <optimized out>
        ull = <optimized out>
        evdone = <optimized out>
#10 0x0001f1c2 in radio_nrf5_isr_body () at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:76
No locals.
#11 radio_nrf5_isr () at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:72

Frame 6

memq_enqueue (faulty value is at r4 == (*tail). The writing to [r4, #0] generates the bus fault):

frame 6
#6  memq_enqueue (link=link@entry=0x20002a78 <mem_link_rx+104>, mem=mem@entry=0x20002984 <mem_done+60>, tail=tail@entry=0x20003328 <memq_ull_rx+4>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/util/memq.c:100
100		*tail = link;
r0             0x20002a78          536881784
r1             0x20002984          536881540
r2             0x20003328          536884008
r3             0x20000384          536871812
r4             0x29c11739          700520249
r5             0x1                 1
r6             0x0                 0
r7             0x20002a78          536881784
r8             0x0                 0
r9             0x20002708          536880904
r10            0x200028e8          536881384
r11            0x2000290c          536881420
r12            0x3                 3
sp             0x20009754          0x20009754 <z_interrupt_stacks+1012>
lr             0x193e9             103401
pc             0x146c6             0x146c6 <memq_enqueue+10>
xpsr           0x61000211          1627390481
msp            0x200096c8          536909512
psp            0x20008808          536905736

The exact value is found in the memq_ull_rx.tail and memq_ull_rx.head as well as the mem_link_rx mempool

(gdb) p memq_ull_rx
$481 = {head = 0x29c11739, tail = 0x29c11739}
(gdb) x/60xw &mem_link_rx
0x20002a10 <mem_link_rx>:	0x00000000	0x20002af0	0x20002b20	0x20000009
0x20002a20 <mem_link_rx+16>:	0x20002b20	0x20000009	0x20002b20	0x20000009
0x20002a30 <mem_link_rx+32>:	0x20002b20	0x00000000	0x20002b20	0x20000009
0x20002a40 <mem_link_rx+48>:	0x20002b20	0x20000009	0x20002b20	0x20000009
0x20002a50 <mem_link_rx+64>:	0x20002b20	0x20000009	0x20002b20	0x20000009
0x20002a60 <mem_link_rx+80>:	0x20002b20	0x20000009	0x29c11739	0x20002ef4
0x20002a70 <mem_link_rx+96>:	0x20002b20	0x20000009	0x20002b18	0x20002984
0x20002a80 <mem_link_rx+112>:	0x20002b20	0x20000009	0x20002a68	0x2000294c
(gdb) x/8xw ((void*) &memq_ull_rx) - 8
0x2000331c <memq_ll_rx>:	0x20002a68	0x20002a68	0x29c11739	0x29c11739
0x2000332c <sem_recv>:	0x200008f4	0x20003330	0x20003330	0x00000000

This shows, the corrupted link element that leads to the crash is not generated in this event, only the crash happens here:

  1. some faulty link pointer gets memq_enqueued (writing the value to memq_ull_rx.tail
  2. later the faulty link pointer gets memq_dequeued (writing value to memq_ull_rx.head).
  3. Next memq_enqueue crashes (this backtrace) as it tries to dereference (*tail) and write the next element.

The faulty value (here 0x29c11739) changes for every crash, but every time the value is present in the tail, in head itself or in the head->next* chain and somewhere in the mem_link_rx pool (next pointer/element).

backtracing the culprit

To set a breakpoint where the faulty link element is memq_enqueued modifications https://github.com/LeBlue/zephyr/tree/dcb87377ae990156de34a4b0720326286d68b80b are used. Breakpoint was hit in memq_enqueue https://github.com/LeBlue/zephyr/blob/dcb87377ae990156de34a4b0720326286d68b80b/subsys/bluetooth/controller/util/memq.c#L114

I have commented interesting parts in the gdb log, these patterns were observed multiple times.

Otherwise I was having a look around a bit, maybe it helps.

(gdb) bt full
#0  memq_enqueue (link=0xaf736ff7, mem=mem@entry=0x200031f8 <mem_pdu_rx+1676>, tail=tail@entry=0x20003334 <memq_ull_rx+4>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/util/memq.c:114
No locals.
#1  0x0001933a in ull_rx_put (link=<optimized out>, rx=rx@entry=0x200031f8 <mem_pdu_rx+1676>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/ull.c:1087
        rx_hdr = 0x200031f8 <mem_pdu_rx+1676>
#2  0x00020836 in isr_rx_scan_report (rssi_ready=rssi_ready@entry=1 '\001', rl_idx=<optimized out>, dir_report=dir_report@entry=false, lll=0x20003b08 <ll_scan+28>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan.c:1151
        node_rx = 0x200031f8 <mem_pdu_rx+1676>
        pdu_adv_rx = <optimized out>
#3  0x0000f2fe in isr_rx_pdu (devmatch_ok=<optimized out>, devmatch_id=<optimized out>, irkmatch_id=<optimized out>, rssi_ready=1 '\001', rl_idx=255 '\377', irkmatch_ok=0 '\000', lll=0x20003b08 <ll_scan+28>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan.c:992
        err = <optimized out>
        node_rx = 0x200031f8 <mem_pdu_rx+1676>
        pdu_adv_rx = <optimized out>
        dir_report = false
        node_rx = <optimized out>
        pdu_adv_rx = <optimized out>
        dir_report = <optimized out>
        lll_conn = <optimized out>
        ftr = <optimized out>
        rx = <optimized out>
        pdu_tx = <optimized out>
        conn_interval_us = <optimized out>
        conn_offset_us = <optimized out>
        conn_space_us = <optimized out>
        evt = <optimized out>
        pdu_end_us = <optimized out>
        lrpa = <optimized out>
        ret = <optimized out>
        pdu_adv_rx_chan_sel = <optimized out>
        scan_interval_us = <optimized out>
        pdu_tx = <optimized out>
        lrpa = <optimized out>
        err = <optimized out>
        err = <optimized out>
#4  isr_rx (param=0x20003b08 <ll_scan+28>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan.c:414
        err = <optimized out>
        lll = 0x20003b08 <ll_scan+28>
        trx_done = <optimized out>
        crc_ok = <optimized out>
        devmatch_ok = <optimized out>
        devmatch_id = <optimized out>
        irkmatch_ok = 0 '\000'
        irkmatch_id = <optimized out>
        rssi_ready = 1 '\001'
        rl_idx = 255 '\377'
#5  0x0001f252 in radio_nrf5_isr_body () at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:76
No locals.
#6  radio_nrf5_isr () at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:72
No locals.
#7  <signal handler called>
No symbol table info available.
#8  0xaaaaaaaa in ?? ()
No symbol table info available.
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

In memq_enqueue I checked the value of tail and break on suspicious value:

(gdb) p memq_ull_rx
$97 = {head = 0x20002a84 <mem_link_rx+104>, tail = 0xaf736ff7}
(gdb) frame 4
#4  isr_rx (param=0x20003b08 <ll_scan+28>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan.c:414
414			err = isr_rx_pdu(lll, devmatch_ok, devmatch_id, irkmatch_ok,
(gdb) p lll
$98 = (struct lll_scan *) 0x20003b08 <ll_scan+28>
(gdb) set print pretty on
(gdb) p *lll
$100 = {
  hdr = {
    parent = 0x20003aec <ll_scan>,
    is_stop = 0 '\000'
  },
  conn = 0x0,
  conn_ticks_slot = 0,
  conn_win_offset_us = 0,
  conn_timeout = 750,
  state = 1 '\001',
  chan = 0 '\000',
  filter_policy = 0 '\000',
  adv_addr_type = 1 '\001',
  init_addr_type = 1 '\001',
  type = 1 '\001',
  rpa_gen = 0 '\000',
  rl_idx = 255 '\377',
  init_addr = "\235\016\267\276\021\032", # address is definitly correct
  adv_addr = "\233$\027V#\376",
  interval = 18,
  ticks_window = 0
}

(gdb) frame 3
#3  0x0000f2fe in isr_rx_pdu (devmatch_ok=<optimized out>, devmatch_id=<optimized out>, irkmatch_id=<optimized out>, rssi_ready=1 '\001', rl_idx=255 '\377', irkmatch_ok=0 '\000', lll=0x20003b08 <ll_scan+28>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan.c:992
992			err = isr_rx_scan_report(lll, rssi_ready,

Check where the faulty value originates (note the difference in the last byte)

(gdb) p node_rx
$102 = (struct node_rx_pdu *) 0x200031f8 <mem_pdu_rx+1676>
(gdb) p *node_rx
$103 = {
  hdr = {
    {
      next = 0xaf736f4b,
      link = 0xaf736f4b,
      ack_last = 75 'K'
    },
    type = NODE_RX_TYPE_REPORT,
    user_meta = 39 '\'',
    handle = 65535,
    {
      rx_ftr = {
        param = 0x0,
        extra = 0x0,
        ticks_anchor = 0,
        us_radio_end = 0,
        us_radio_rdy = 0,
        rssi = 58 ':',
        lrpa_used = 0 '\000',
        rl_idx = 255 '\377',
        direct = 0 '\000'
      }
    }
  },
  pdu = 0x20003218 <mem_pdu_rx+1708> "D\006\352C\356\375\025Y\002\001\032\002\n\030\n\377L"
}
(gdb) p/x mfifo_pdu_rx_free
$105 = {
  s = 0x4,
  n = 0x16,
  f = 0x3,
  l = 0x1,
  m = {0xbc, 0x2b, 0x0, 0x20, 0xbc, 0x2b, 0x0, 0x20, 0xf8, 0x31, 0x0, 0x20, 0x60, 0x31, 0x0, 0x20, 0x30, 0x30, 0x0, 0x20, 0xb4, 0x2e, 0x0, 0x20, 0x7c, 0x30, 0x0, 0x20, 0x54, 0x2c, 0x0, 0x20, 0x98, 0x2f, 0x0, 0x20, 0x84, 0x2d, 0x0, 0x20, 0x0, 0x2f, 0x0, 0x20, 0x68, 0x2e, 0x0, 0x20, 0x1c, 0x2e, 0x0, 0x20, 0x14, 0x31, 0x0, 0x20, 0x38, 0x2d, 0x0, 0x20, 0xc8, 0x30, 0x0, 0x20, 0xec, 0x2c, 0x0, 0x20, 0xac, 0x31, 0x0, 0x20, 0xd0, 0x2d, 0x0, 0x20, 0xe4, 0x2f, 0x0, 0x20, 0x4c, 0x2f, 0x0, 0x20, 0x70, 0x2b, 0x0, 0x20}
}
(gdb) x/4xw &mfifo_pdu_rx_free.m
0x200000cc <mfifo_pdu_rx_free+4>:	0x20002bbc	0x20002bbc	0x200031f8	0x20003160
(gdb) p/x *((void**) (mfifo_pdu_rx_free.m + mfifo_pdu_rx_free.s * mfifo_pdu_rx_free.f)) # forgot to add +1 to .f
$106 = 0x20003160
(gdb) p *rx
value has been optimized out
(gdb) frame 2
#2  0x00020836 in isr_rx_scan_report (rssi_ready=rssi_ready@entry=1 '\001', rl_idx=<optimized out>, dir_report=dir_report@entry=false, lll=0x20003b08 <ll_scan+28>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan.c:1151
1151		ull_rx_put(node_rx->hdr.link, node_rx);

Here the hdr.link is garbage (this value will later be put into memq_ull_rx via ull_rx_put (and crash the FW on the next ull_rx_put))

(gdb) p *node_rx
$107 = {
  hdr = {
    {
      next = 0xaf736f4b,
      link = 0xaf736f4b,
      ack_last = 75 'K'
    },
    type = NODE_RX_TYPE_REPORT,
    user_meta = 39 '\'',
    handle = 65535,
    {
      rx_ftr = {
        param = 0x0,
        extra = 0x0,
        ticks_anchor = 0,
        us_radio_end = 0,
        us_radio_rdy = 0,
        rssi = 58 ':',
        lrpa_used = 0 '\000',
        rl_idx = 255 '\377',
        direct = 0 '\000'
      }
    }
  },
  pdu = 0x20003218 <mem_pdu_rx+1708> "D\006\352C\356\375\025Y\002\001\032\002\n\030\n\377L"
}
(gdb) p/x *((void**) (mfifo_pdu_rx_free.m + mfifo_pdu_rx_free.s * mfifo_pdu_rx_free.f))
$108 = 0x20003160
(gdb) p node_rx
$109 = (struct node_rx_pdu *) 0x200031f8 <mem_pdu_rx+1676>

check if the alloc_peek and alloc returend the same pointer (yes)

(gdb) p/x *((void**) (mfifo_pdu_rx_free.m + mfifo_pdu_rx_free.s * (mfifo_pdu_rx_free.f - 1)))
$111 = 0x200031f8
(gdb) p/x node_rx
$112 = 0x200031f8

garbage link value is also in the mem_link_rx pool (note the difference in the last byte)

(gdb) x/32xw &mem_link_rx
0x20002a1c <mem_link_rx>:	0x00000000	0x20002b24	0x20002b24	0x2000000a
0x20002a2c <mem_link_rx+16>:	0x20002b24	0x2000000a	0x20002b24	0x00000000
0x20002a3c <mem_link_rx+32>:	0x20002b24	0x2000000a	0x20002b24	0x00000000
0x20002a4c <mem_link_rx+48>:	0x20002b24	0x2000000a	0x20002b24	0x2000000a
0x20002a5c <mem_link_rx+64>:	0x20002b24	0x2000000a	0x20002b24	0x2000000a
0x20002a6c <mem_link_rx+80>:	0x20002a84	0x20002bbc	0x20002b24	0x2000000a
0x20002a7c <mem_link_rx+96>:	0x20002b24	0x2000000a	0xaf736ff7	0x200031f8
0x20002a8c <mem_link_rx+112>:	0x20002b24	0x2000000a	0x20002b24	0x2000000a
(gdb) info symbol 0x200000D4
mfifo_pdu_rx_free + 12 in section datas
(gdb) p/x &mfifo_pdu_rx_free
$113 = 0x200000c8
(gdb) p/x &mfifo_pdu_rx_free.m
$114 = 0x200000cc
(gdb) x/16xw mfifo_pdu_rx_free.m
0x200000cc <mfifo_pdu_rx_free+4>:	0x20002bbc	0x20002bbc	0x200031f8	0x20003160
0x200000dc <mfifo_pdu_rx_free+20>:	0x20003030	0x20002eb4	0x2000307c	0x20002c54
0x200000ec <mfifo_pdu_rx_free+36>:	0x20002f98	0x20002d84	0x20002f00	0x20002e68
0x200000fc <mfifo_pdu_rx_free+52>:	0x20002e1c	0x20003114	0x20002d38	0x200030c8
(gdb) x/32xw mfifo_pdu_rx_free.m
0x200000cc <mfifo_pdu_rx_free+4>:	0x20002bbc	0x20002bbc	0x200031f8	0x20003160
0x200000dc <mfifo_pdu_rx_free+20>:	0x20003030	0x20002eb4	0x2000307c	0x20002c54
0x200000ec <mfifo_pdu_rx_free+36>:	0x20002f98	0x20002d84	0x20002f00	0x20002e68
0x200000fc <mfifo_pdu_rx_free+52>:	0x20002e1c	0x20003114	0x20002d38	0x200030c8
0x2000010c <mfifo_pdu_rx_free+68>:	0x20002cec	0x200031ac	0x20002dd0	0x20002fe4
0x2000011c <mfifo_pdu_rx_free+84>:	0x20002f4c	0x20002b70	0x06060824	0x200062fc
0x2000012c <mfifo_prep+8>:	0x2000443c	0x003bb91c	0x20003b08	0x000205ed
0x2000013c <mfifo_prep+24>:	0x000206b5	0x00020661	0x00000000	0x00000001

random stuff

(gdb) info symbol 0x20009740
z_interrupt_stacks + 960 in section noinit
(gdb) x/4xw (&z_interrupt_stacks + 960)
0x20100b80:	0x00000000	0x00000000	0x00000000	0x00000000
(gdb) x/4xw z_interrupt_stacks + 960
0x20100b80:	0x00000000	0x00000000	0x00000000	0x00000000
(gdb) x/4xw &z_interrupt_stacks
0x20009380 <z_interrupt_stacks>:	0xaaaaaaaa	0xaaaaaaaa	0xaaaaaaaa	0xaaaaaaaa
(gdb) x/4xw 0x20009740
0x20009740 <z_interrupt_stacks+960>:	0x40001000	0x200031f8	0x00000000	0x200031f8
(gdb) x/12xw 0x20009730
0x20009730 <z_interrupt_stacks+944>:	0x00000001	0x20002640	0x2000261c	0x20002650
0x20009740 <z_interrupt_stacks+960>:	0x40001000	0x200031f8	0x00000000	0x200031f8
0x20009750 <z_interrupt_stacks+976>:	0x00000000	0x20003b08	0x00000000	0x00000001
(gdb) p mfifo_pdu_rx_free.s
$117 = 4 '\004'
(gdb) p mfifo_pdu_rx_free.n
$118 = 22 '\026'
(gdb) x/22xw mfifo_pdu_rx_free.m
0x200000cc <mfifo_pdu_rx_free+4>:	0x20002bbc	0x20002bbc	0x200031f8	0x20003160
0x200000dc <mfifo_pdu_rx_free+20>:	0x20003030	0x20002eb4	0x2000307c	0x20002c54
0x200000ec <mfifo_pdu_rx_free+36>:	0x20002f98	0x20002d84	0x20002f00	0x20002e68
0x200000fc <mfifo_pdu_rx_free+52>:	0x20002e1c	0x20003114	0x20002d38	0x200030c8
0x2000010c <mfifo_pdu_rx_free+68>:	0x20002cec	0x200031ac	0x20002dd0	0x20002fe4
0x2000011c <mfifo_pdu_rx_free+84>:	0x20002f4c	0x20002b70

(gdb) frame 3
#3  0x0000f2fe in isr_rx_pdu (devmatch_ok=<optimized out>, devmatch_id=<optimized out>, irkmatch_id=<optimized out>, rssi_ready=1 '\001', rl_idx=255 '\377', irkmatch_ok=0 '\000', lll=0x20003b08 <ll_scan+28>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan.c:992
992			err = isr_rx_scan_report(lll, rssi_ready,
(gdb) p *lll
$124 = {
  hdr = {
    parent = 0x20003aec <ll_scan>,
    is_stop = 0 '\000'
  },
  conn = 0x0,
  conn_ticks_slot = 0,
  conn_win_offset_us = 0,
  conn_timeout = 750,
  state = 1 '\001',
  chan = 0 '\000',
  filter_policy = 0 '\000',
  adv_addr_type = 1 '\001',
  init_addr_type = 1 '\001',
  type = 1 '\001',
  rpa_gen = 0 '\000',
  rl_idx = 255 '\377',
  init_addr = "\235\016\267\276\021\032",
  adv_addr = "\233$\027V#\376",
  interval = 18,
  ticks_window = 0
}
(gdb) p lll->init_addr
$125 = "\235\016\267\276\021\032"
(gdb) p/x lll->init_addr
$126 = {0x9d, 0xe, 0xb7, 0xbe, 0x11, 0x1a}
(gdb) p/x lll
$127 = 0x20003b08
(gdb) p lll
$128 = (struct lll_scan *) 0x20003b08 <ll_scan+28>
(gdb) p *lll
$129 = {
  hdr = {
    parent = 0x20003aec <ll_scan>,
    is_stop = 0 '\000'
  },
  conn = 0x0,
  conn_ticks_slot = 0,
  conn_win_offset_us = 0,
  conn_timeout = 750,
  state = 1 '\001',
  chan = 0 '\000',
  filter_policy = 0 '\000',
  adv_addr_type = 1 '\001',
  init_addr_type = 1 '\001',
  type = 1 '\001',
  rpa_gen = 0 '\000',
  rl_idx = 255 '\377',
  init_addr = "\235\016\267\276\021\032",
  adv_addr = "\233$\027V#\376",
  interval = 18,
  ticks_window = 0
}

frame 2

(gdb) frame 2
#2  0x00020836 in isr_rx_scan_report (rssi_ready=rssi_ready@entry=1 '\001', rl_idx=<optimized out>, dir_report=dir_report@entry=false, lll=0x20003b08 <ll_scan+28>) at /Users/matthiaswauer/Projects/nordic_hci_ble/ble_fw/firmware/zephyr/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan.c:1151
1151		ull_rx_put(node_rx->hdr.link, node_rx);
(gdb) p *lll
$130 = {
  hdr = {
    parent = 0x20003aec <ll_scan>,
    is_stop = 0 '\000'
  },
  conn = 0x0,
  conn_ticks_slot = 0,
  conn_win_offset_us = 0,
  conn_timeout = 750,
  state = 1 '\001',
  chan = 0 '\000',
  filter_policy = 0 '\000',
  adv_addr_type = 1 '\001',
  init_addr_type = 1 '\001',
  type = 1 '\001',
  rpa_gen = 0 '\000',
  rl_idx = 255 '\377',
  init_addr = "\235\016\267\276\021\032",
  adv_addr = "\233$\027V#\376",
  interval = 18,
  ticks_window = 0
}

Any ideas?

@LeBlue
Copy link
Author

LeBlue commented May 20, 2020

@cvinayak Added a script that I use for reproducing the crash: https://github.com/LeBlue/zephyr-memq-bug-tools

@cvinayak
Copy link
Contributor

@LeBlue Thank for the detailed tools to reproduce the issue. I will set this up and let it run over the weekend.

@cvinayak
Copy link
Contributor

@LeBlue

  1. I have 6 nRF51dongle_nrf51422 with samples/bluetooth/peripheral_sc_only application, these have static random addresses as privacy support is disabled.
  2. I have hci_usb in nrf52840dongle_nrf52840 with the attached zephyr config. Note, I have console and uart disabled.

I notice that your bt_conn_loop.py only connects to first device and reports not connected to others. Looking at the btmon log, I only see, at anytime, only one device being connected to (connection handle is always 0).

I see a lot of kernel panic too. (i am using a dedicated linux laptop, I will collect the piped logs and attach them on Monday).

I have now switched over to using hci_uart using btattach -B <uart dev> -P H4
Now I have 6 peripherals gracefully connected and my laptop with your script reading the device information characteristics. I will leave this over for the weekend.

I am definitely seeing differences between use of hci_usb versus hci_uart. I suspect hci_usb maybe causing buffer overflow in the controller code.

Could you add LL_ASSERT(len <= 27); here: https://github.com/zephyrproject-rtos/zephyr/blob/master/subsys/bluetooth/controller/hci/hci.c#L2357

I would like you to setup a video call next week anytime between 0700 CET - 1400 CET, so we can debug together over screen sharing, if this is an option.

@carlescufi
Copy link
Member

@emob-nordic any chance this is related to the USB code?

@eobalski
Copy link
Collaborator

@carlescufi I will have a look.

@cvinayak
Copy link
Contributor

@LeBlue Thanks for the quick response.

Could you also, increase thread stack sizes (say 4096 bytes) of USB stack thread, logging thread and Bluetooth Tx Stack Size (you will need to enable it first, see my attached .config file in txt extension). I got my nRF52833DK (I dont have a spare nRF52840DK) crashing on power up when I enabled debug logging in Bluetooth and USB subsystems, hence the increase.

My zephyr/.config file is attached herewith.

hci_usb_debug_config.txt

I notice that your bt_conn_loop.py only connects to first device and reports not connected to others. Looking at the btmon log, I only see, at anytime, only one device being connected to (connection handle is always 0).

Strange. I will have a look till tomorow.

Ignore this for the moment, seems its specific to my physical laptop. I am now testing on a macbook pro with ubuntu 20.04 in virtualbox. 6 peripherals connected now, and periodically reading the software versions, so far over 30 mins now....

@carlescufi carlescufi removed the Waiting for response Waiting for author's response label May 25, 2020
@cvinayak
Copy link
Contributor

@LeBlue with all the thread stack increase, I hit a MPU fault. Analysing the instruction address being in RAM inside the USB Bluetooth thread stack region and LR being in log_core.c, I infer that the hardcoded bluetooth rx thread stack size (calling bt_send) of 512 bytes could be insufficient. For debugging purpose, I have increased it to 2048.

https://github.com/zephyrproject-rtos/zephyr/blob/master/subsys/usb/class/bluetooth.c#L41

Could you please increase and perform your tests?

@cvinayak
Copy link
Contributor

The hci_uart uses thread stack size of >640 to 1024 bytes, the thread that calls bt_send.

@LeBlue
Copy link
Author

LeBlue commented May 26, 2020

@cvinayak I have modified the test setup and moved the peripherals on the edge of the radio reception range. Now the issue is observable (without the fix) after 10-20 mins (instead of several hours). This at least already shows that the memory corruption gets triggered by transmission errors.

@carlescufi
Copy link
Member

@cvinayak I have modified the test setup and moved the peripherals on the edge of the radio reception range. Now the issue is observable (without the fix) after 10-20 mins (instead of several hours). This at least already shows that the memory corruption gets triggered by transmission errors.

@LeBlue could you please test #25619 with your new test setup?

@LeBlue
Copy link
Author

LeBlue commented May 26, 2020

could you please test #25619 with your new test setup?

@carlescufi Yes, test (hci_uart and hci_usb) for #25619 already running for ~3h. Seems to be working so far.

cvinayak added a commit to cvinayak/zephyr that referenced this issue May 26, 2020
Fix missing assignment of NRF_CCM->MAXPACKETSIZE register
for PDU sizes smaller than 251 bytes. If there is CRC errors
causing PDU length fields to be higher than configured PDU
buffer sizes in the controller, without the MAXPACKETSIZE
register set to correct PDU size, CCM module could overrun
the PDU buffer and cause memory corruption. This fix is
applicable for all nRF52 Series SoCs except nRF52832 SoC.

Fixes zephyrproject-rtos#21107.

Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>
carlescufi pushed a commit that referenced this issue May 27, 2020
Fix missing assignment of NRF_CCM->MAXPACKETSIZE register
for PDU sizes smaller than 251 bytes. If there is CRC errors
causing PDU length fields to be higher than configured PDU
buffer sizes in the controller, without the MAXPACKETSIZE
register set to correct PDU size, CCM module could overrun
the PDU buffer and cause memory corruption. This fix is
applicable for all nRF52 Series SoCs except nRF52832 SoC.

Fixes #21107.

Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>
cvinayak added a commit to cvinayak/zephyr that referenced this issue May 28, 2020
Fix missing assignment of NRF_CCM->MAXPACKETSIZE register
for PDU sizes smaller than 251 bytes. If there is CRC errors
causing PDU length fields to be higher than configured PDU
buffer sizes in the controller, without the MAXPACKETSIZE
register set to correct PDU size, CCM module could overrun
the PDU buffer and cause memory corruption. This fix is
applicable for all nRF52 Series SoCs except nRF52832 SoC.

Fixes zephyrproject-rtos#21107.

Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>
cvinayak added a commit to cvinayak/zephyr that referenced this issue May 28, 2020
Implemented an intermediate decrypt buffer to cover the CCM
overrun under CRC error conditions. The workaround is
applicable to nRF52832 SoC only, which is missing the
MAXPACKETSIZE register in the NRF_CCM peripheral.

Fixes zephyrproject-rtos#21107 for nRF52832 SoC.

Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>
carlescufi pushed a commit that referenced this issue May 29, 2020
Implemented an intermediate decrypt buffer to cover the CCM
overrun under CRC error conditions. The workaround is
applicable to nRF52832 SoC only, which is missing the
MAXPACKETSIZE register in the NRF_CCM peripheral.

Fixes #21107 for nRF52832 SoC.

Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>
krip-tip pushed a commit to krip-tip/zephyr-local that referenced this issue May 30, 2020
Fix missing assignment of NRF_CCM->MAXPACKETSIZE register
for PDU sizes smaller than 251 bytes. If there is CRC errors
causing PDU length fields to be higher than configured PDU
buffer sizes in the controller, without the MAXPACKETSIZE
register set to correct PDU size, CCM module could overrun
the PDU buffer and cause memory corruption. This fix is
applicable for all nRF52 Series SoCs except nRF52832 SoC.

Fixes zephyrproject-rtos#21107.

Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>
krip-tip pushed a commit to krip-tip/zephyr-local that referenced this issue May 30, 2020
Implemented an intermediate decrypt buffer to cover the CCM
overrun under CRC error conditions. The workaround is
applicable to nRF52832 SoC only, which is missing the
MAXPACKETSIZE register in the NRF_CCM peripheral.

Fixes zephyrproject-rtos#21107 for nRF52832 SoC.

Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>
nashif pushed a commit that referenced this issue Jun 4, 2020
Fix missing assignment of NRF_CCM->MAXPACKETSIZE register
for PDU sizes smaller than 251 bytes. If there is CRC errors
causing PDU length fields to be higher than configured PDU
buffer sizes in the controller, without the MAXPACKETSIZE
register set to correct PDU size, CCM module could overrun
the PDU buffer and cause memory corruption. This fix is
applicable for all nRF52 Series SoCs except nRF52832 SoC.

Fixes #21107.

Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>
kkasperczyk-no pushed a commit to kkasperczyk-no/zephyr that referenced this issue Jun 15, 2020
Fix missing assignment of NRF_CCM->MAXPACKETSIZE register
for PDU sizes smaller than 251 bytes. If there is CRC errors
causing PDU length fields to be higher than configured PDU
buffer sizes in the controller, without the MAXPACKETSIZE
register set to correct PDU size, CCM module could overrun
the PDU buffer and cause memory corruption. This fix is
applicable for all nRF52 Series SoCs except nRF52832 SoC.

Fixes zephyrproject-rtos#21107.

Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>
(cherry picked from commit cd7a73c)
kkasperczyk-no pushed a commit to kkasperczyk-no/zephyr that referenced this issue Jun 15, 2020
…overrun

Implemented an intermediate decrypt buffer to cover the CCM
overrun under CRC error conditions. The workaround is
applicable to nRF52832 SoC only, which is missing the
MAXPACKETSIZE register in the NRF_CCM peripheral.

Fixes zephyrproject-rtos#21107 for nRF52832 SoC.

Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>
(cherry picked from commit ba487fe)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Bluetooth bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Projects
None yet
9 participants