Sending Bluetooth L2CAP messages after reconnecting sometimes leads to crashes due to uninitalized net_buf callback #76738

EaselinkBachmann · 2024-08-06T14:24:09Z

Describe the bug

When sending Bluetooth L2CAP messages over a dynamic channel (with bt_l2cap_chan_send()), the callback closure in the net_buf userdata is never initialized (unlike with static channels). This leads to bt_conn_tx_processor() reading out an invalid callback pointer (this manifests as 0x1 on my device) which later gets called, triggering a crash in tx_notify() when calling the callback.

The TODO comment "add sdu_sent callback on last PDU" in conn.c:1039 seems relevant, since the exact problem is that this PDU has never had its callback set. I don't know why that is though, since l2cap_data_pull should add the callback for dynamic channels, but this doesn't seem to happen in all cases.

The issue goes away when I manually zero the net_buf user_data before passing the buffer to bt_l2cap_chan_send().

To Reproduce

device A connects to device B (bt_conn_le_create())
device A creates L2CAP channel with device B (bt_l2cap_chan_connect())
device A and device B send packets (sending another one as soon as the chan_ops.sent callback is called) (bt_l2cap_chan_send())
power cycle device A
go to 1 until the crash occurs on device B

I don't have a small enough sample yet that I can share here since this occurred in an internal code base, but if necessary I can try and strip it down.

Expected behavior

Zephyr should not crash.

Impact

Annoyance, leads to occasional power cycles needed

Logs and console output

These logs where done with warn log level and certain log messages patched to be WRN instead of DBG since with debug log level lead to a lot of log output being dropped due to buffer overruns.

receiver:bt:l2cap: accepting channel connection
receiver:bt:l2cap: channel connected
[00:00:58.524,719] <wrn> bt_conn: conn 0x20000fc0: not connected
[00:00:59.391,998] <wrn> bt_conn: tx_process: pop: cb 0x1 userdata 0 buf 0x20007488 (ref 1)
[00:00:59.392,028] <wrn> bt_conn: tx_process: conn 0x20000fc0 buf 0x20007488 (last)
[00:00:59.392,028] <wrn> bt_conn: send_buf: invalid callback, cb = 0x1
[00:00:59.434,844] <wrn> bt_conn: tx_notify: invalid cb: conn 0x20000fc0 tx 0x20001b1c cb 0x1 user_data 0
[00:00:59.434,875] <err> os: ***** MPU FAULT *****
[00:00:59.434,875] <err> os:   Data Access Violation
[00:00:59.434,875] <err> os:   MMFAR Address: 0xd
[00:00:59.434,906] <err> os: r0/a1:  0x00000000  r1/a2:  0xffffffff  r2/a3:  0x00000000
[00:00:59.434,906] <err> os: r3/a4:  0x00000001 r12/ip:  0x20001cac r14/lr:  0x00009b0b
[00:00:59.434,936] <err> os:  xpsr:  0x61000000
[00:00:59.434,936] <err> os: Faulting instruction address (r15/pc): 0x0000002c
[00:00:59.434,967] <err> os: >>> ZEPHYR FATAL ERROR 19: Unknown error on CPU 0
[00:00:59.434,997] <err> os: Current thread: 0x20001898 (unknown)
[00:01:00.841,003] <err> os: Halting system

The crash dump link register points to /home/ferdinand/Documents/Work/UWB/UDK1/zephyrproject/zephyr/subsys/bluetooth/host/conn.c:299, which means this crash happened why trying to call the invalid cb.

Logging patch:

diff --git a/subsys/bluetooth/host/conn.c b/subsys/bluetooth/host/conn.c
index 31d08048571..b86d65c1b56 100644
--- a/subsys/bluetooth/host/conn.c
+++ b/subsys/bluetooth/host/conn.c
@@ -278,7 +278,7 @@ static void tx_notify(struct bt_conn *conn)
                        return;
                }
 
-               LOG_DBG("tx %p cb %p user_data %p", tx, tx->cb, tx->user_data);
+               if (tx->cb == (void*)1) LOG_WRN("tx_notify: invalid cb: conn %p tx %p cb %p user_data %p", conn, tx, tx->cb, tx->user_data);
 
                /* Copy over the params */
                cb = tx->cb;
@@ -648,6 +648,7 @@ static int send_buf(struct bt_conn *conn, struct net_buf *buf,
                return -ENOMEM;
        }
 
+       if (cb == (void*)1) LOG_WRN("send_buf: invalid callback, cb = %p", cb);
        tx->cb = cb;
        tx->user_data = ud;
 
@@ -1041,10 +1042,10 @@ void bt_conn_tx_processor(void)
                 * We still allocate one TX context per-fragment though.
                 */
                conn->get_and_clear_cb(conn, buf, &cb, &ud);
-               LOG_DBG("pop: cb %p userdata %p", cb, ud);
+               if (cb == (void*)1) LOG_WRN("tx_process: pop: cb %p userdata %p buf %p (ref %d)", cb, ud, buf, buf->ref);
        }
 
-       LOG_DBG("TX process: conn %p buf %p (%s)",
+       if (cb == (void*)1) LOG_WRN("tx_process: conn %p buf %p (%s)",
                conn, buf, last_buf ? "last" : "frag");
 
        int err = send_buf(conn, buf, buf_len, cb, ud);

Environment (please complete the following information):

OS: Linux
Toolchain: arm-zephyr-eabi-gcc (Zephyr SDK 0.16.6) 12.2.0
Commit SHA or Version used: 3.7.0 (642b308)

Additional context

Target: nrf52840dk/nrf52840
Config (prj.conf):

CONFIG_GPIO=y
CONFIG_BT=y
CONFIG_BT_CENTRAL=y
CONFIG_BT_PERIPHERAL=y
CONFIG_BT_ATT_PREPARE_COUNT=1
CONFIG_BT_GATT_CLIENT=y
CONFIG_BT_CTLR_ADVANCED_FEATURES=y
CONFIG_BT_CTLR_CONN_RSSI=y
CONFIG_BT_SMP=y
CONFIG_BT_L2CAP_DYNAMIC_CHANNEL=y
CONFIG_BT_L2CAP_TX_MTU=65
CONFIG_BT_BUF_ACL_RX_SIZE=71
CONFIG_CRC=y
CONFIG_STD_C23=y
CONFIG_REBOOT=y
CONFIG_LOG=y
CONFIG_BT_LOG_LEVEL_WRN=y
CONFIG_EXCEPTION_STACK_TRACE=y
CONFIG_COMPILER_OPT="-g"

The text was updated successfully, but these errors were encountered:

github-actions · 2024-08-06T14:24:48Z

Hi @EaselinkBachmann! We appreciate you submitting your first issue for our open-source project. 🌟

Even though I'm a bot, I can assure you that the whole community is genuinely grateful for your time and effort. 🤖💙

jhedberg · 2024-08-06T16:06:15Z

@EaselinkBachmann I think it's likely that this was fixed by #76489. Could you retry with the latest main branch?

EaselinkBachmann · 2024-08-07T10:52:44Z

@EaselinkBachmann I think it's likely that this was fixed by #76489. Could you retry with the latest main branch?

Yes, with main my code correctly triggers the " bt_l2cap: Please clear user_data first" message, although I think this should be a WRN instead of a DBG.

Thanks for the fix!

EaselinkBachmann added the bug The issue is a bug, or the PR is fixing a bug label Aug 6, 2024

jhedberg added area: Bluetooth area: Bluetooth Host Bluetooth Host (excluding BR/EDR) labels Aug 6, 2024

jhedberg assigned jhedberg and jori-nordic Aug 6, 2024

jhedberg added the priority: low Low impact/importance bug label Aug 6, 2024

EaselinkBachmann closed this as completed Aug 7, 2024

jori-nordic linked a pull request Aug 7, 2024 that will close this issue

Bluetooth: L2CAP: Set NULL callback for PDUs #76489

Merged

EaselinkBachmann mentioned this issue Aug 7, 2024

Bluetooth L2CAP TX fragmentation state not cleared upon disconnect #76737

Closed

jhedberg mentioned this issue Aug 7, 2024

[Backport v3.7-branch] Bluetooth: L2CAP: Set NULL callback for PDUs #76770

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sending Bluetooth L2CAP messages after reconnecting sometimes leads to crashes due to uninitalized net_buf callback #76738

Sending Bluetooth L2CAP messages after reconnecting sometimes leads to crashes due to uninitalized net_buf callback #76738

EaselinkBachmann commented Aug 6, 2024 •

edited

Loading

github-actions bot commented Aug 6, 2024

jhedberg commented Aug 6, 2024

EaselinkBachmann commented Aug 7, 2024

Sending Bluetooth L2CAP messages after reconnecting sometimes leads to crashes due to uninitalized net_buf callback #76738

Sending Bluetooth L2CAP messages after reconnecting sometimes leads to crashes due to uninitalized net_buf callback #76738

Comments

EaselinkBachmann commented Aug 6, 2024 • edited Loading

github-actions bot commented Aug 6, 2024

jhedberg commented Aug 6, 2024

EaselinkBachmann commented Aug 7, 2024

EaselinkBachmann commented Aug 6, 2024 •

edited

Loading