DNM: Bluetooth: Host: Make bt rx thread never blocked #80718
@alwa-nordic I remember you mentioned before that BT RX should not be blocked and should instead be processed in a streaming manner. I think this PR is trying to find a reasonable way to achieve this goal. |
Also look like sysworkq too. Signed-off-by: Lingao Meng <menglingao@xiaomi.com>
Thanks! This is interesting. Unfortunately, at least some work put on the BT RX WQ uses blocking, AFAIK. We should fix that first. But this could be useful as a deadlock prevention heuristic in callbacks. I guess this is similar to temporarily treating a thread like an ISR? We have the issue that our users expect that they can call blocking Bluetooth APIs while in callbacks on a thread vital to Bluetooth processing. If we make it impossible to block in those callbacks, it will prevent a large class of deadlocks. Maybe it would be appropriate to make this change for the system work queue instead? This ties in to #80167. @JordanYates, maybe you want to comment on this? It would break many users though, so it would need a deprecation period and maybe an opt-out. |
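The "treat the thread like an ISR" idea can be sketched in plain C, outside Zephyr. This is only a model under stated assumptions: `nb_thread`, `nb_check_wait`, and the `NB_*` timeout constants are hypothetical names, not kernel APIs, and a real implementation would have to hook the kernel's actual wait paths.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for kernel timeout values (not Zephyr's k_timeout_t). */
typedef int64_t nb_timeout_t;
#define NB_NO_WAIT ((nb_timeout_t)0)
#define NB_FOREVER ((nb_timeout_t)-1)

/* Hypothetical thread descriptor carrying a "must not block" flag. */
struct nb_thread {
	const char *name;
	bool no_block; /* true while the thread runs callbacks that must not block */
};

/*
 * Decide whether a wait requested by `current` may proceed.
 * Returns 0 when the wait is allowed, -EAGAIN when a blocking wait
 * is rejected because the thread is marked no-block.
 */
int nb_check_wait(const struct nb_thread *current, nb_timeout_t timeout)
{
	if (current->no_block && timeout != NB_NO_WAIT) {
		return -EAGAIN;
	}

	return 0;
}
```

In this model, a thread marked `no_block` can still poll with `NB_NO_WAIT`, but any wait with a non-zero timeout is refused up front rather than risking a deadlock later. Whether the real kernel should return an error or assert in that case is a design decision this sketch does not settle.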
Not only does it require work on Bluetooth, it also requires work on the kernel, including defining threads with a special state (No Block). |
I don’t think this proposed change should go forward, for a few key reasons:

- This doesn't address the root cause. The change sidesteps the underlying issues with the Bluetooth host’s design that are causing deadlocks. By focusing only on kernel thread behavior to prevent blocking, we are effectively masking the problem rather than solving it.
- This keeps bad code in place. This approach leaves problematic code untouched, potentially creating long-term maintenance issues. If we continue with workarounds instead of solutions, we’ll accumulate technical debt that’s harder to address later.
- This adds unpredictability and complexity. Altering the kernel thread behavior in this way will make the system’s response to certain scenarios less predictable. It may also make it challenging to track down the root cause of future issues, since the behavior of these threads would not align with the intended design.
- This also makes troubleshooting harder in the future. If a similar issue arises later, it’ll be difficult to pinpoint the exact cause. Fixing the underlying deadlock issues directly would make the system easier to understand and maintain.

IMHO, we should focus on identifying and addressing the actual sources of deadlocks within the Bluetooth host. Fixing those issues directly will lead to a more reliable and maintainable solution in the long run. |
See the comment above
I agree with @alwa-nordic's viewpoint that BT RX should be treated as an ISR. Furthermore, this PR aims to expose the current issues with the Bluetooth host and to intercept them in the future.
I don't think the current solution to the deadlock is reasonable. Some controversial code has already been introduced, and perhaps there will be more in the future, such as:
|
Perhaps we should summarize all the deadlocks we have encountered so far, which may also provide more theoretical support. |
Trying to fix something without having a clear picture of the current state will lead to more issues in the future. Having a common understanding between all stakeholders is also important, so that everybody has exactly the same understanding of the issue and the proposed solution. We are currently trying to put the current host architecture on paper, including threads, work items, the host API and the interface to the HCI driver, queues, and common usage. This should give more clarity on how the host works and how it can be used. We are planning to add this to the host's documentation. Then we can discuss proposals.
Exactly. So if you encounter any other deadlock, please create an issue so we can investigate it. For PRs like this, I propose creating a detailed RFC first so we can discuss the problem and the proposed solution.
With regards to #79258, @alwa-nordic has created a draft which should fix the exact change in that PR: #80606. I'm also working on an improvement of |
The issue with blocking in the BT host is beyond the BT RX thread:

    if (!K_TIMEOUT_EQ(timeout, K_NO_WAIT) &&
        k_current_get() == k_work_queue_thread_get(&k_sys_work_q)) {
            LOG_WRN("Timeout discarded. No blocking in syswq.");
            timeout = K_NO_WAIT;
    }

How should we go about that issue? Should I create a separate GH issue for this? |
@Thalley I think that branch should likely be removed, but also the code that's calling the function should just use |
The caller here would be

    switch (att_op_get_type(op)) {
    case ATT_RESPONSE:
    case ATT_CONFIRMATION:
            /* Use a timeout only when responding/confirming */
            timeout = BT_ATT_TIMEOUT;
            break;
    default:
            timeout = K_FOREVER;
    }

So you are suggesting that we remove the branch removing the timeout, or modifying We could also set |
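To make the interaction between the two snippets in this thread concrete, here is a minimal standalone C model. It is a sketch, not the host's code: every `att_model_*` and `ATT_MODEL_*` name is hypothetical, timeouts are modeled as plain millisecond values instead of `k_timeout_t`, and the 30-second value for the ATT timeout is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical millisecond timeouts standing in for k_timeout_t. */
typedef int64_t att_model_timeout_t;
#define ATT_MODEL_NO_WAIT ((att_model_timeout_t)0)
#define ATT_MODEL_FOREVER ((att_model_timeout_t)-1)
#define ATT_MODEL_TIMEOUT ((att_model_timeout_t)30000) /* assumed 30 s */

/* Hypothetical subset of ATT operation types. */
enum att_model_op_type {
	ATT_MODEL_COMMAND,
	ATT_MODEL_REQUEST,
	ATT_MODEL_RESPONSE,
	ATT_MODEL_CONFIRMATION,
};

/* Caller side: bounded wait only when responding or confirming. */
att_model_timeout_t att_model_pick_timeout(enum att_model_op_type type)
{
	switch (type) {
	case ATT_MODEL_RESPONSE:
	case ATT_MODEL_CONFIRMATION:
		return ATT_MODEL_TIMEOUT;
	default:
		return ATT_MODEL_FOREVER;
	}
}

/* Callee side: any wait is discarded when running on the system work queue. */
att_model_timeout_t att_model_effective_timeout(att_model_timeout_t requested,
						bool on_sys_workq)
{
	if (requested != ATT_MODEL_NO_WAIT && on_sys_workq) {
		return ATT_MODEL_NO_WAIT;
	}

	return requested;
}
```

In this model, every caller on the system work queue ends up with `ATT_MODEL_NO_WAIT` regardless of what it asked for, which is why the warning fires for the default branch as well as for responses and confirmations.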
@Thalley I think I'd do the |
I implemented my proposal in #86125. Feel free to object, but please do that constructively, i.e. provide an alternative solution which eliminates the warning for Zephyr 4.1. |
This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time. |