Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sending Bluetooth L2CAP messages after reconnecting sometimes leads to crashes due to uninitalized net_buf callback #76738

Closed
EaselinkBachmann opened this issue Aug 6, 2024 · 3 comments · Fixed by #76489
Assignees
Labels
area: Bluetooth Host Bluetooth Host (excluding BR/EDR) area: Bluetooth bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug

Comments

@EaselinkBachmann
Copy link

EaselinkBachmann commented Aug 6, 2024

Describe the bug

When sending Bluetooth L2CAP messages over a dynamic channel (with bt_l2cap_chan_send()), the callback closure in the net_buf userdata is never initialized (unlike with static channels). This leads to bt_conn_tx_processor() reading out an invalid callback pointer (this manifests as 0x1 on my device) which later gets called, triggering a crash in tx_notify() when calling the callback.

The TODO comment "add sdu_sent callback on last PDU" in conn.c:1039 seems relevant, since the exact problem is that this PDU has never had its callback set. I don't know why that is though, since l2cap_data_pull should add the callback for dynamic channels, but this doesn't seem to happen in all cases.

The issue goes away when I manually zero the net_buf user_data before passing the buffer to bt_l2cap_chan_send().

To Reproduce

  1. device A connects to device B (bt_conn_le_create())
  2. device A creates L2CAP channel with device B (bt_l2cap_chan_connect())
  3. device A and device B send packets (sending another one as soon as the chan_ops.sent callback is called) (bt_l2cap_chan_send())
  4. power cycle device A
  5. go to 1 until the crash occurs on device B

I don't have a small enough sample yet that I can share here since this occurred in an internal code base, but if necessary I can try and strip it down.

Expected behavior

Zephyr should not crash.

Impact

Annoyance, leads to occasional power cycles needed

Logs and console output

These logs where done with warn log level and certain log messages patched to be WRN instead of DBG since with debug log level lead to a lot of log output being dropped due to buffer overruns.

receiver:bt:l2cap: accepting channel connection
receiver:bt:l2cap: channel connected
[00:00:58.524,719] <wrn> bt_conn: conn 0x20000fc0: not connected
[00:00:59.391,998] <wrn> bt_conn: tx_process: pop: cb 0x1 userdata 0 buf 0x20007488 (ref 1)
[00:00:59.392,028] <wrn> bt_conn: tx_process: conn 0x20000fc0 buf 0x20007488 (last)
[00:00:59.392,028] <wrn> bt_conn: send_buf: invalid callback, cb = 0x1
[00:00:59.434,844] <wrn> bt_conn: tx_notify: invalid cb: conn 0x20000fc0 tx 0x20001b1c cb 0x1 user_data 0
[00:00:59.434,875] <err> os: ***** MPU FAULT *****
[00:00:59.434,875] <err> os:   Data Access Violation
[00:00:59.434,875] <err> os:   MMFAR Address: 0xd
[00:00:59.434,906] <err> os: r0/a1:  0x00000000  r1/a2:  0xffffffff  r2/a3:  0x00000000
[00:00:59.434,906] <err> os: r3/a4:  0x00000001 r12/ip:  0x20001cac r14/lr:  0x00009b0b
[00:00:59.434,936] <err> os:  xpsr:  0x61000000
[00:00:59.434,936] <err> os: Faulting instruction address (r15/pc): 0x0000002c
[00:00:59.434,967] <err> os: >>> ZEPHYR FATAL ERROR 19: Unknown error on CPU 0
[00:00:59.434,997] <err> os: Current thread: 0x20001898 (unknown)
[00:01:00.841,003] <err> os: Halting system

The crash dump link register points to /home/ferdinand/Documents/Work/UWB/UDK1/zephyrproject/zephyr/subsys/bluetooth/host/conn.c:299, which means this crash happened why trying to call the invalid cb.

Logging patch:

diff --git a/subsys/bluetooth/host/conn.c b/subsys/bluetooth/host/conn.c
index 31d08048571..b86d65c1b56 100644
--- a/subsys/bluetooth/host/conn.c
+++ b/subsys/bluetooth/host/conn.c
@@ -278,7 +278,7 @@ static void tx_notify(struct bt_conn *conn)
                        return;
                }
 
-               LOG_DBG("tx %p cb %p user_data %p", tx, tx->cb, tx->user_data);
+               if (tx->cb == (void*)1) LOG_WRN("tx_notify: invalid cb: conn %p tx %p cb %p user_data %p", conn, tx, tx->cb, tx->user_data);
 
                /* Copy over the params */
                cb = tx->cb;
@@ -648,6 +648,7 @@ static int send_buf(struct bt_conn *conn, struct net_buf *buf,
                return -ENOMEM;
        }
 
+       if (cb == (void*)1) LOG_WRN("send_buf: invalid callback, cb = %p", cb);
        tx->cb = cb;
        tx->user_data = ud;
 
@@ -1041,10 +1042,10 @@ void bt_conn_tx_processor(void)
                 * We still allocate one TX context per-fragment though.
                 */
                conn->get_and_clear_cb(conn, buf, &cb, &ud);
-               LOG_DBG("pop: cb %p userdata %p", cb, ud);
+               if (cb == (void*)1) LOG_WRN("tx_process: pop: cb %p userdata %p buf %p (ref %d)", cb, ud, buf, buf->ref);
        }
 
-       LOG_DBG("TX process: conn %p buf %p (%s)",
+       if (cb == (void*)1) LOG_WRN("tx_process: conn %p buf %p (%s)",
                conn, buf, last_buf ? "last" : "frag");
 
        int err = send_buf(conn, buf, buf_len, cb, ud);

Environment (please complete the following information):

  • OS: Linux
  • Toolchain: arm-zephyr-eabi-gcc (Zephyr SDK 0.16.6) 12.2.0
  • Commit SHA or Version used: 3.7.0 (642b308)

Additional context

Target: nrf52840dk/nrf52840
Config (prj.conf):

CONFIG_GPIO=y
CONFIG_BT=y
CONFIG_BT_CENTRAL=y
CONFIG_BT_PERIPHERAL=y
CONFIG_BT_ATT_PREPARE_COUNT=1
CONFIG_BT_GATT_CLIENT=y
CONFIG_BT_CTLR_ADVANCED_FEATURES=y
CONFIG_BT_CTLR_CONN_RSSI=y
CONFIG_BT_SMP=y
CONFIG_BT_L2CAP_DYNAMIC_CHANNEL=y
CONFIG_BT_L2CAP_TX_MTU=65
CONFIG_BT_BUF_ACL_RX_SIZE=71
CONFIG_CRC=y
CONFIG_STD_C23=y
CONFIG_REBOOT=y
CONFIG_LOG=y
CONFIG_BT_LOG_LEVEL_WRN=y
CONFIG_EXCEPTION_STACK_TRACE=y
CONFIG_COMPILER_OPT="-g"
@EaselinkBachmann EaselinkBachmann added the bug The issue is a bug, or the PR is fixing a bug label Aug 6, 2024
Copy link

github-actions bot commented Aug 6, 2024

Hi @EaselinkBachmann! We appreciate you submitting your first issue for our open-source project. 🌟

Even though I'm a bot, I can assure you that the whole community is genuinely grateful for your time and effort. 🤖💙

@jhedberg
Copy link
Member

jhedberg commented Aug 6, 2024

@EaselinkBachmann I think it's likely that this was fixed by #76489. Could you retry with the latest main branch?

@jhedberg jhedberg added area: Bluetooth area: Bluetooth Host Bluetooth Host (excluding BR/EDR) labels Aug 6, 2024
@jhedberg jhedberg added the priority: low Low impact/importance bug label Aug 6, 2024
@EaselinkBachmann
Copy link
Author

@EaselinkBachmann I think it's likely that this was fixed by #76489. Could you retry with the latest main branch?

Yes, with main my code correctly triggers the " bt_l2cap: Please clear user_data first" message, although I think this should be a WRN instead of a DBG.

Thanks for the fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Bluetooth Host Bluetooth Host (excluding BR/EDR) area: Bluetooth bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants