Skip to content

Conversation

LingaoM
Copy link
Contributor

@LingaoM LingaoM commented Nov 1, 2024

Also look like sysworkq too.

@LingaoM
Copy link
Contributor Author

LingaoM commented Nov 1, 2024

@alwa-nordic I remember you mentioned before that BT RX should not be blocked, it should be processed in a streaming manner. I think this PR is trying to find a reasonable way to achieve this goal.

@LingaoM LingaoM force-pushed the draft_make_bt_rx_never_blocked branch 3 times, most recently from 8e6d920 to 9917aa0 Compare November 1, 2024 06:02
Also look like sysworkq too.

Signed-off-by: Lingao Meng <menglingao@xiaomi.com>
@LingaoM LingaoM force-pushed the draft_make_bt_rx_never_blocked branch from 9917aa0 to 748c38f Compare November 1, 2024 06:08
@alwa-nordic
Copy link
Contributor

alwa-nordic commented Nov 2, 2024

Thanks! This is interesting.

Unfortunately, at least some work put on BT RX WQ uses blocking, AFAIK. We should fix that first.

But this could be useful as a deadlock prevention heuristic in callbacks. I guess this is similar to temporarily treating a thread like an ISR? We have the issue that our users expect that they can call blocking Bluetooth APIs while in callbacks on a thread vital to Bluetooth processing. If we make it impossible to block in those callbacks, it will prevent a large class of deadlocks.

May be it would be appropriate to do this change for the system work queue instead? This ties in to #80167. @JordanYates, maybe you want to comment on this? It would break many users though, so it would need a deprecation period and maybe an opt-out.

@LingaoM
Copy link
Contributor Author

LingaoM commented Nov 6, 2024

Not only does it require work on Bluetooth, but it also requires work on the kernel, including defining threads with special states(No Block).

@LingaoM LingaoM removed the request for review from theob-pro November 6, 2024 05:24
@Thalley
Copy link
Contributor

Thalley commented Nov 8, 2024

I think @PavelVPV had some issues with this PR. @PavelVPV do you want to sum up what you said in the BTWG meeting?

@PavelVPV
Copy link
Contributor

I don’t think this proposed change should go forward for a few key reasons:

This doesn't address root cause. The change sidesteps the underlying issues with the Bluetooth host’s design that are causing deadlocks. By focusing only on kernel thread behavior to prevent blocking, we are effectively masking the problem, rather than solving it.

This keeps bad code in place. This approach leaves problematic code untouched, potentially creating long-term maintenance issues. If we continue with workarounds instead of solutions, we’ll accumulate technical debt that’s harder to address later.

This adds unpredictability and complexity. Altering the kernel thread behavior in this way will make the system’s response to certain scenarios less predictable. It may also make it challenging to track down the root cause of future issues, since the behavior of these threads would not align with the intended design.

This also makes troubleshooting harder in future. If a similar issue arises later, it’ll be difficult to pinpoint the exact cause. Fixing the underlying deadlock issues directly would make the system easier to understand and maintain in the future.

IMHO, we should focus on identifying and addressing the actual sources of deadlocks within the Bluetooth host. Fixing those issues directly will lead to a more reliable and maintainable solution in the long run.

Copy link
Contributor

@PavelVPV PavelVPV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comment above

@LingaoM
Copy link
Contributor Author

LingaoM commented Nov 12, 2024

#68008 (comment)

I think the deadlocking problems the host has can really only be solved a principled way: We must stop blocking the event stream from the controller. All these priority levels and extra pools and stuff are just band-aids without technical soundness.

I mean all events must be treated as ISR. And I mean the event buffer must be made available for the next event immediately without any dependencies. If an event handler must keep data around, it must have it's own buffers (it can copy data into or give to the driver for DMA), or it can choose to drop the data. The handler cannot delay the event stream.

I agree with @alwa-nordic viewpoint that BT RX should be treated as an ISR. Furthermore, this PR aims to expose the current issues with Bluetooth hosts and intercept them in the future.

By focusing only on kernel thread behavior to prevent blocking, we are effectively masking the problem, rather than solving it.

I don't think the current solution to the deadlock is reasonable, some controversial code has already been introduced, and perhaps there will be more in the future, Such as:

  1. More RAM overhead: bluetooth: conn: Use a separate workqueue for connection TX notify #79258
  2. Controversial modification: Bluetooth: conn: Don't wait for buf allocation in syswq #71666 net: buf: Disallow blocking allocation in syswq #71697

@LingaoM
Copy link
Contributor Author

LingaoM commented Nov 12, 2024

Perhaps we should summarize all the dead locks we have encountered so far, which may also provide more theoretical support

@PavelVPV
Copy link
Contributor

I agree with @alwa-nordic viewpoint that BT RX should be treated as an ISR. Furthermore, this PR aims to expose the current issues with Bluetooth hosts and intercept them in the future.

Trying to fix something without having a clear picture of the current state will lead to more issues in future. Having a common understanding between all stakeholders is also important so everybody understands exactly the same the issue and the proposed solution.

We are currently trying to put the current host architecture on a paper which includes threads, work items, the host API and interface to hci driver, queues, common usage. This should give more clarity on how the host works and can be used. We are planning to add this to the host's documentation. Then we can discuss proposes.

Perhaps we should summarize all the dead locks we have encountered so far, which may also provide more theoretical support

Exactly. So if you encounter any other deadlock, please create an issue so we can investigate it.

For the PRs like this I propose to create a detailed RFC first so we can discuss a problem and the proposed solution.

don't think the current solution to the deadlock is reasonable, some controversial code has already been introduced, and perhaps there will be more in the future, Such as:

With regards to #79258, @alwa-nordic has created a draft which should fix the exact change in that PR: #80606. I'm also working on a improvement of buf.h API to allow using it without K_FOREVER, which originally caused the issue on our side.

@LingaoM LingaoM marked this pull request as draft December 25, 2024 03:00
@Thalley
Copy link
Contributor

Thalley commented Feb 7, 2025

The issue with blocking in the BT host is beyond the BT RX thread:
Even after #84282, when using the system workqueue to send notifications, which is a fairly common approach in Bluetooth, we end up calling bt_l2cap_create_pdu_timeout with K_FOREVER from bt_att_chan_create_pdu, which then triggers
the LOG_WRN from

	if (!K_TIMEOUT_EQ(timeout, K_NO_WAIT) &&
	    k_current_get() == k_work_queue_thread_get(&k_sys_work_q)) {
		LOG_WRN("Timeout discarded. No blocking in syswq.");
		timeout = K_NO_WAIT;
	}

How should we go about that issue? Should I create a separate GH issue for this?

cc @alwa-nordic @jhedberg

@jhedberg
Copy link
Member

jhedberg commented Feb 7, 2025

@Thalley I think that branch should likely be removed, but also the code that's calling the function should just use K_NO_WAIT directly. For the most part, I think we've gotten a bit carried away with the ability to do blocking allocations. Even though we can do that, it doesn't mean we (always) should. In a system with a plentyful heap, we'd likely have just designed all of this around malloc() which doesn't have any ability to block anyway.

@Thalley
Copy link
Contributor

Thalley commented Feb 18, 2025

@Thalley I think that branch should likely be removed, but also the code that's calling the function should just use K_NO_WAIT directly. For the most part, I think we've gotten a bit carried away with the ability to do blocking allocations. Even though we can do that, it doesn't mean we (always) should. In a system with a plentyful heap, we'd likely have just designed all of this around malloc() which doesn't have any ability to block anyway.

The caller here would be bt_att_chan_create_pdu which has

	switch (att_op_get_type(op)) {
	case ATT_RESPONSE:
	case ATT_CONFIRMATION:
		/* Use a timeout only when responding/confirming */
		timeout = BT_ATT_TIMEOUT;
		break;
	default:
		timeout = K_FOREVER;
	}

So you are suggesting that we remove the branch removing the timeout, or modifying timeout = K_FOREVER; to timeout = K_NO_WAIT;?

We could also set timeout = K_NO_WAIT; only if called in the system workqueue thread; that way we can keep the warning and handle the case at the ATT layer.

@jhedberg
Copy link
Member

@Thalley I think I'd do the timeout = K_NO_WAIT from att.c in case of system wq context. @alwa-nordic any thoughts on this? Whatever the solution, I'd like to see the warning from conn.c gone before the 4.1 release. Right now you get it pretty much immediately with any of our sample apps when connected, which is not acceptable.

@jhedberg
Copy link
Member

jhedberg commented Feb 21, 2025

@Thalley I think I'd do the timeout = K_NO_WAIT from att.c in case of system wq context. @alwa-nordic any thoughts on this? Whatever the solution, I'd like to see the warning from conn.c gone before the 4.1 release. Right now you get it pretty much immediately with any of our sample apps when connected, which is not acceptable.

I implemented my proposal in #86125. Feel free to object, but please do that constructively, i.e. provide an alternative solution which eliminates the warning for Zephyr 4.1.

Copy link

This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time.

@github-actions github-actions bot added the Stale label Apr 23, 2025
@github-actions github-actions bot closed this May 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants