DNM: Bluetooth: Host: Make bt rx thread never blocked #80718

LingaoM · 2024-11-01T05:36:03Z

Also look like sysworkq too.

LingaoM · 2024-11-01T05:40:56Z

@alwa-nordic I remember you mentioned before that BT RX should not be blocked, it should be processed in a streaming manner. I think this PR is trying to find a reasonable way to achieve this goal.

Also look like sysworkq too. Signed-off-by: Lingao Meng <menglingao@xiaomi.com>

alwa-nordic · 2024-11-02T08:28:15Z

Thanks! This is interesting.

Unfortunately, at least some work put on BT RX WQ uses blocking, AFAIK. We should fix that first.

But this could be useful as a deadlock prevention heuristic in callbacks. I guess this is similar to temporarily treating a thread like an ISR? We have the issue that our users expect that they can call blocking Bluetooth APIs while in callbacks on a thread vital to Bluetooth processing. If we make it impossible to block in those callbacks, it will prevent a large class of deadlocks.

May be it would be appropriate to do this change for the system work queue instead? This ties in to #80167. @JordanYates, maybe you want to comment on this? It would break many users though, so it would need a deprecation period and maybe an opt-out.

LingaoM · 2024-11-06T02:56:05Z

Not only does it require work on Bluetooth, but it also requires work on the kernel, including defining threads with special states(No Block).

Thalley · 2024-11-08T10:19:13Z

I think @PavelVPV had some issues with this PR. @PavelVPV do you want to sum up what you said in the BTWG meeting?

PavelVPV · 2024-11-12T06:58:30Z

I don’t think this proposed change should go forward for a few key reasons:

This doesn't address root cause. The change sidesteps the underlying issues with the Bluetooth host’s design that are causing deadlocks. By focusing only on kernel thread behavior to prevent blocking, we are effectively masking the problem, rather than solving it.

This keeps bad code in place. This approach leaves problematic code untouched, potentially creating long-term maintenance issues. If we continue with workarounds instead of solutions, we’ll accumulate technical debt that’s harder to address later.

This adds unpredictability and complexity. Altering the kernel thread behavior in this way will make the system’s response to certain scenarios less predictable. It may also make it challenging to track down the root cause of future issues, since the behavior of these threads would not align with the intended design.

This also makes troubleshooting harder in future. If a similar issue arises later, it’ll be difficult to pinpoint the exact cause. Fixing the underlying deadlock issues directly would make the system easier to understand and maintain in the future.

IMHO, we should focus on identifying and addressing the actual sources of deadlocks within the Bluetooth host. Fixing those issues directly will lead to a more reliable and maintainable solution in the long run.

PavelVPV

See the comment above

LingaoM · 2024-11-12T07:48:04Z

#68008 (comment)

I think the deadlocking problems the host has can really only be solved a principled way: We must stop blocking the event stream from the controller. All these priority levels and extra pools and stuff are just band-aids without technical soundness.

I mean all events must be treated as ISR. And I mean the event buffer must be made available for the next event immediately without any dependencies. If an event handler must keep data around, it must have it's own buffers (it can copy data into or give to the driver for DMA), or it can choose to drop the data. The handler cannot delay the event stream.

I agree with @alwa-nordic viewpoint that BT RX should be treated as an ISR. Furthermore, this PR aims to expose the current issues with Bluetooth hosts and intercept them in the future.

By focusing only on kernel thread behavior to prevent blocking, we are effectively masking the problem, rather than solving it.

I don't think the current solution to the deadlock is reasonable, some controversial code has already been introduced, and perhaps there will be more in the future, Such as:

More RAM overhead: bluetooth: conn: Use a separate workqueue for connection TX notify #79258
Controversial modification: Bluetooth: conn: Don't wait for buf allocation in syswq #71666 net: buf: Disallow blocking allocation in syswq #71697

LingaoM · 2024-11-12T07:50:37Z

Perhaps we should summarize all the dead locks we have encountered so far, which may also provide more theoretical support

PavelVPV · 2024-11-12T08:53:49Z

I agree with @alwa-nordic viewpoint that BT RX should be treated as an ISR. Furthermore, this PR aims to expose the current issues with Bluetooth hosts and intercept them in the future.

Trying to fix something without having a clear picture of the current state will lead to more issues in future. Having a common understanding between all stakeholders is also important so everybody understands exactly the same the issue and the proposed solution.

We are currently trying to put the current host architecture on a paper which includes threads, work items, the host API and interface to hci driver, queues, common usage. This should give more clarity on how the host works and can be used. We are planning to add this to the host's documentation. Then we can discuss proposes.

Perhaps we should summarize all the dead locks we have encountered so far, which may also provide more theoretical support

Exactly. So if you encounter any other deadlock, please create an issue so we can investigate it.

For the PRs like this I propose to create a detailed RFC first so we can discuss a problem and the proposed solution.

don't think the current solution to the deadlock is reasonable, some controversial code has already been introduced, and perhaps there will be more in the future, Such as:

With regards to #79258, @alwa-nordic has created a draft which should fix the exact change in that PR: #80606. I'm also working on a improvement of buf.h API to allow using it without K_FOREVER, which originally caused the issue on our side.

Thalley · 2025-02-07T13:16:27Z

The issue with blocking in the BT host is beyond the BT RX thread:
Even after #84282, when using the system workqueue to send notifications, which is a fairly common approach in Bluetooth, we end up calling bt_l2cap_create_pdu_timeout with K_FOREVER from bt_att_chan_create_pdu, which then triggers
the LOG_WRN from

	if (!K_TIMEOUT_EQ(timeout, K_NO_WAIT) &&
	    k_current_get() == k_work_queue_thread_get(&k_sys_work_q)) {
		LOG_WRN("Timeout discarded. No blocking in syswq.");
		timeout = K_NO_WAIT;
	}

How should we go about that issue? Should I create a separate GH issue for this?

cc @alwa-nordic @jhedberg

jhedberg · 2025-02-07T13:30:32Z

@Thalley I think that branch should likely be removed, but also the code that's calling the function should just use K_NO_WAIT directly. For the most part, I think we've gotten a bit carried away with the ability to do blocking allocations. Even though we can do that, it doesn't mean we (always) should. In a system with a plentyful heap, we'd likely have just designed all of this around malloc() which doesn't have any ability to block anyway.

Thalley · 2025-02-18T11:18:30Z

@Thalley I think that branch should likely be removed, but also the code that's calling the function should just use K_NO_WAIT directly. For the most part, I think we've gotten a bit carried away with the ability to do blocking allocations. Even though we can do that, it doesn't mean we (always) should. In a system with a plentyful heap, we'd likely have just designed all of this around malloc() which doesn't have any ability to block anyway.

The caller here would be bt_att_chan_create_pdu which has

	switch (att_op_get_type(op)) {
	case ATT_RESPONSE:
	case ATT_CONFIRMATION:
		/* Use a timeout only when responding/confirming */
		timeout = BT_ATT_TIMEOUT;
		break;
	default:
		timeout = K_FOREVER;
	}

So you are suggesting that we remove the branch removing the timeout, or modifying timeout = K_FOREVER; to timeout = K_NO_WAIT;?

We could also set timeout = K_NO_WAIT; only if called in the system workqueue thread; that way we can keep the warning and handle the case at the ATT layer.

jhedberg · 2025-02-21T07:24:23Z

@Thalley I think I'd do the timeout = K_NO_WAIT from att.c in case of system wq context. @alwa-nordic any thoughts on this? Whatever the solution, I'd like to see the warning from conn.c gone before the 4.1 release. Right now you get it pretty much immediately with any of our sample apps when connected, which is not acceptable.

jhedberg · 2025-02-21T07:55:44Z

@Thalley I think I'd do the timeout = K_NO_WAIT from att.c in case of system wq context. @alwa-nordic any thoughts on this? Whatever the solution, I'd like to see the warning from conn.c gone before the 4.1 release. Right now you get it pretty much immediately with any of our sample apps when connected, which is not acceptable.

I implemented my proposal in #86125. Feel free to object, but please do that constructively, i.e. provide an alternative solution which eliminates the warning for Zephyr 4.1.

github-actions · 2025-04-23T00:37:27Z

This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time.

zephyrbot added area: Bluetooth area: Bluetooth Host Bluetooth Host (excluding BR/EDR) area: Kernel labels Nov 1, 2024

zephyrbot requested review from alwa-nordic, andyross, ceolin, cfriedt, dcpleung, hermabe, jhedberg, nashif, npitre, peter-mitsis, rugeGerritsen, sjanc, TaiJuWu, Thalley and theob-pro November 1, 2024 05:36

zephyrbot assigned andyross and peter-mitsis Nov 1, 2024

LingaoM assigned alwa-nordic Nov 1, 2024

LingaoM force-pushed the draft_make_bt_rx_never_blocked branch 3 times, most recently from 8e6d920 to 9917aa0 Compare November 1, 2024 06:02

DNM: Bluetooth: Host: Make bt rx thread never blocked

748c38f

Also look like sysworkq too. Signed-off-by: Lingao Meng <menglingao@xiaomi.com>

LingaoM force-pushed the draft_make_bt_rx_never_blocked branch from 9917aa0 to 748c38f Compare November 1, 2024 06:08

LingaoM removed the request for review from theob-pro November 6, 2024 05:24

LingaoM requested a review from jori-nordic November 6, 2024 05:24

LingaoM mentioned this pull request Nov 6, 2024

bluetooth: Behavior change in host causing MESH to fail on sending messages #77241

Closed

PavelVPV requested changes Nov 12, 2024

View reviewed changes

LingaoM marked this pull request as draft December 25, 2024 03:00

github-actions bot added the Stale label Apr 23, 2025

github-actions bot closed this May 7, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

DNM: Bluetooth: Host: Make bt rx thread never blocked #80718

DNM: Bluetooth: Host: Make bt rx thread never blocked #80718

Uh oh!

LingaoM commented Nov 1, 2024

Uh oh!

LingaoM commented Nov 1, 2024

Uh oh!

alwa-nordic commented Nov 2, 2024 •

edited

Loading

Uh oh!

LingaoM commented Nov 6, 2024

Uh oh!

Thalley commented Nov 8, 2024

Uh oh!

PavelVPV commented Nov 12, 2024

Uh oh!

PavelVPV left a comment

Uh oh!

LingaoM commented Nov 12, 2024 •

edited

Loading

Uh oh!

LingaoM commented Nov 12, 2024

Uh oh!

PavelVPV commented Nov 12, 2024

Uh oh!

Thalley commented Feb 7, 2025

Uh oh!

jhedberg commented Feb 7, 2025

Uh oh!

Thalley commented Feb 18, 2025

Uh oh!

jhedberg commented Feb 21, 2025

Uh oh!

jhedberg commented Feb 21, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Apr 23, 2025

Uh oh!

Uh oh!

DNM: Bluetooth: Host: Make bt rx thread never blocked #80718

DNM: Bluetooth: Host: Make bt rx thread never blocked #80718

Uh oh!

Conversation

LingaoM commented Nov 1, 2024

Uh oh!

LingaoM commented Nov 1, 2024

Uh oh!

alwa-nordic commented Nov 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LingaoM commented Nov 6, 2024

Uh oh!

Thalley commented Nov 8, 2024

Uh oh!

PavelVPV commented Nov 12, 2024

Uh oh!

PavelVPV left a comment

Choose a reason for hiding this comment

Uh oh!

LingaoM commented Nov 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LingaoM commented Nov 12, 2024

Uh oh!

PavelVPV commented Nov 12, 2024

Uh oh!

Thalley commented Feb 7, 2025

Uh oh!

jhedberg commented Feb 7, 2025

Uh oh!

Thalley commented Feb 18, 2025

Uh oh!

jhedberg commented Feb 21, 2025

Uh oh!

jhedberg commented Feb 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Apr 23, 2025

Uh oh!

Uh oh!

alwa-nordic commented Nov 2, 2024 •

edited

Loading

LingaoM commented Nov 12, 2024 •

edited

Loading

jhedberg commented Feb 21, 2025 •

edited

Loading