-
Notifications
You must be signed in to change notification settings - Fork 8k
bluetooth: conn: Use a separate workqueue for connection TX notify #79258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bluetooth: conn: Use a separate workqueue for connection TX notify #79258
Conversation
This feels going in a different direction that a few other recent PRs where we've move BT things to the system workqueue :o |
The idea is fine as an option if you want it. But we are not going to spend time testing this upstream. I have not reviewed the code yet. As an alternative, I propose we create a second system work queue, and explicitly state in its documentation that it must be treated isr-like and invoking any blocking functions is forbidden. @MarekPieta, does your application have RAM available for another system work queue? |
It depends on what "things" you would like to do in the system workqueue context. From what I noticed so far, the Bluetooth stack may already perform settings (non-volatile memory) operations from the system workqueue context. Application also might perform some time consuming/settings-related operations from the system workqueue context. Such operations could have negative impact on the Bluetooth stack (delaying Also there is a risk that synchronizing non-volatile memory and radio operations (required e.g. for nRF SoCs) could lead to problems if both Bluetooth and non-volatile memory operations are performed from the system workqueue context (and even Bluetooth stack performs settings operations from the system workqueue context). Blocking system workqueue during a non-volatile memory operation while waiting for Bluetooth to perform operation also in the system workqueue context could eventually even introduce a risk of deadlock related to flash synchronization mechanisms in some cases. Because of the reasons explained above, I think that relying on separate threads for Bluetooth may improve stability of the Bluetooth stack in some use-cases. Relying on system workqueue context for more complex use-cases may be problematic. |
I think that introducing an additional work queue to separate operations that are more time critical may be step in the right direction (more complex applications could benefit from it; it could also be shared among subsystems). I think that my application could use it for most of the supported boards too (we should be able to fit the additional thread in memory). Introducing the proposed system-wide solution is a bigger task and it would require more work. Could we consider using my proposed improvement until the system-wide solution is ready (using change from this PR as a temporary improvement)? |
Discussed offline with @alwa-nordic I will use a separate workqueue (available under a Kconfig option) to handle |
bab22b9
to
b662f75
Compare
tx_notify
b662f75
to
aa2ae65
Compare
subsys/bluetooth/host/Kconfig
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be tests verifying that the stack works with and without this option enabled, right?
Should there also be tests that verifies that ATT cannot time out/proves it does when the system workqueue is blocked for a long time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should just insist on having this enabled and remove the option. Doing so will remove the need for some hacks in the code like https://github.com/zephyrproject-rtos/zephyr/pull/79258/files#diff-00881f70fc7ec49b966fed79bd6a5a8ed5357599b241b5b029736bbc97683f83R275-R288
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we just remove one thread from BT and replace it with the system workqueue, and now adding a new thread to offload from the system workqueue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should just insist on having this enabled and remove the option. Doing so will remove the need for some hacks in the code like https://github.com/zephyrproject-rtos/zephyr/pull/79258/files#diff-00881f70fc7ec49b966fed79bd6a5a8ed5357599b241b5b029736bbc97683f83R275-R288
Removing the Kconfig option and making the feature enabled by default. I will also modify https://github.com/zephyrproject-rtos/zephyr/pull/79258/files#diff-00881f70fc7ec49b966fed79bd6a5a8ed5357599b241b5b029736bbc97683f83R275-R288 (I noticed that you do not like this approach)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we just remove one thread from BT and replace it with the system workqueue, and now adding a new thread to offload from the system workqueue?
It seems that reusing system workqueue causes problems here. Because of that we may need to change approach here a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rugeGerritsen I think we should definitely have some tests in BSIM using this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If needed, maybe I could open a separate PR that enables the feature by default to run all of the CI tests with the feature enabled (for additional validation of the feature)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue that we should run at least some tests in this PR, otherwise we are adding a untested Kconfig. It makes more sense to test before you merge a feature, rather than after :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's enable feature by default on a separate test PR to run CI tests on it then: #79713
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's enable feature by default on a separate test PR to run CI tests on it then: #79713
Can you explain why you are not adding tests in this PR? We need the tests regardless
9266ecf
to
b03043b
Compare
b03043b
to
1eb0c82
Compare
5b8dca3
to
680e996
Compare
Using a separate workq at least mitigates negative impact of other works that rely on the system workqueue. Because of that, it might be useful for some of the applications. E.g. if the system workqueue is known to be used for tasks that may last long - like settings operations which could trigger NVM erases. The feature is optional, so if an application does not need it, it can simply keep the feature disabled. The feature can also help with mitigating risks related to synchronizing non-volatile memory and radio operations for nRF SoCs (see #79258 (comment) for details). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the use case for this and the side effects. It's not a direction I'm happy about (basically re-adding a minimal TX thread again), but I have to assume that there's requirement that unfortunately isn't very clear here (no links to feature requests/issues).
It seems like a temporary patch to a fundamental issue, and this solution only alleviates the issue and doesn't solve it.
That being said, I won't be blocking this once my outstanding comments have been solved.
Besides the comments, we should add an entry in the release notes stating that this was added, as the removal of the TX thread was added to the 3.7 release notes.
subsys/bluetooth/host/Kconfig
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a pretty significant default stack size. Since the stack is only using this for a fairly fall function, shouldn't it default to something like 128/256/512?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially made the stack size bigger to ensure that we are on safe side (unfortunately I can validate it locally only in limited number of use-cases related to our application - e.g. we do not use CONFIG_BT_ISO_TX
).
@alwa-nordic could you advise on the stack size here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a good case for testing a stack size would be to run one of the LE Audio sample and in the bt_bap_stream_ops.sent
callback measure the stack usage.
The reason why I think this is a good way of measuring it, is that the callback goes from the work item in conn.c
to the sent
callback in iso.c
to the sent
callback in BAP and then finally to the application (possible passing through the sent
callback in cap_stream.c
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the Bluetooth WG meeting today, there was consensus to default this setting to copy the value from the system work queue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, updated.
subsys/bluetooth/host/Kconfig
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of this being a non-configurable Kconfig option, compared to just a const
value or #define
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it in the same way as for CONFIG_BT_RX_PRIO
and CONFIG_BT_DRIVER_RX_HIGH_PRIO
. Still, it would not hurt to make the priority user-configurable. Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, checked those out too, and didn't really understand why they were non-configurable Kconfig options either :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed that it's done to prevent breaking something by reconfiguring application while still allowing to change this Kconfig value by overriding the default in application's Kconfig definition file (advanced configuration done only if user knows what he is doing). That's why I initially followed the same approach here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That pretty much sounds like "We want it to be configurable but not really" :D Or "Configurable but with extra steps".
int
Kconfigs are generally just a pain to work with, as they are not as flexible as bool
s (imply
select
only works with bool
s).
Why not borrow BT RX? This is also a workq, and the stack size is large enough. |
subsys/bluetooth/host/Kconfig
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR mostly looks ok now, so I'll do a nitpick: I would like to rename this to BT_CONN_TX_NOTIFY_WQ
to avoid confusing it with tx_work
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New name sounds better. Changed
Unfortunately that might not work properly too. I already discussed this approach with @alwa-nordic. He explained to me that it might cause deadlocks because the work posted to BT WQ may be blocking - e.g. it invokes application callbacks, ATT server (which blocks on allocation), etc. |
7fcfa90
to
cdc867c
Compare
Added note to Zephyr 4.0 release notes (there seems to be no file for the 3.8 release) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No objects from me anymore, but as discussed in the BT WG, this is a half-asses solution to a problem caused by an application's usage of the system workqueue.
There are probably better solutions to this, but less trivial to implement than this.
I am OK with adding this for now, but let's mark it as experimental so we can easily remove it again
doc/releases/release-notes-4.0.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have max line length requirements for .rst files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that generally doc lines tend to be split if they are longer than 100 characters (but it's not always a case). I will align my note.
btw. Documentation build seems not to complain about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kartben Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya in theory it's 100. In practice a better rule would probably be one sentence per line, as it also makes it easier for folks to realize that a blank line is required for paragraph breaks -- when using line wrapping people tend to assume that having just one line return will do that (as it looks "ok enough" in source form) when obviously it doesn't. It also makes diffs cleaner too, since adding a word or two to a line really just impacts that line as opposed to potentially many if it "pushes" a wrapped line to the next (and so on).
Unfortunately switching to this approach would require updating (and potentially breaking) so many files that I have yet to give it a try :)
In the meantime, 100 characters per line it is, when possible :)
cdc867c
to
4b70631
Compare
Use a separate workqueue instead of system workqueue for connection TX notify processing. This makes Bluetooth stack more independent from the system workqueue. Signed-off-by: Marek Pieta <Marek.Pieta@nordicsemi.no>
Change adds a release note informing about the newly introduced Kconfig option for Bluetooth stack. Signed-off-by: Marek Pieta <Marek.Pieta@nordicsemi.no>
4b70631
to
26b7104
Compare
Out-of-PR tests are passing with this, so won't be blocking, but this PR really should include proper testing of this new option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving on the condition that @MarekPieta or someone else involved will add proper tests of this soon after.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the commit message with the following:
- The motivation behind the change
- What use case will start working when enabling this config/stopped working when the TX queue was removed
- Why the configuration is not enabled by default
- Why it is experimental
Document the plans going forward:
- How and who is going to test this?
- Are we planning to get rid of it afterwards?
That will make it easier to look at this change again some time later
(I know that I'm writing to the merged PR but for future). IMHO it would be good to try to come up with a solution without introducing extra threads to the host when fixing bugs spawned after the tx thread removal. This change adds extra 1k RAM overhead. Mesh now needs ~1k more to fix the consequences caused by the tx thread removal. This makes less and less feasible to run a reliable solution on platforms with constraint RAM, like nRF52832. For example, the nRF Connect SDK light_ctrl sample built for nRF52832 already has only ~6.5kBytes RAM left:
|
The PR was merged already so I cannot update the commit message. Agreed with @rugeGerritsen to provide the answers as a separate comment here: What use case will start working when enabling this config/stopped working when the TX queue was removed? Locally I observed problems with nRF flash synchronization mechanism (my application uses a flash synchronization mechanism that is implemented outside Zephyr). Situation where an application triggered a flash erase operation from the system workqueue context and received a GATT notification could lead to a deadlock. I guess that Zephyr SW LL may be affected by similar problems, because FLASH synchronization is required for nRF SoCs in general. @alwa-nordic mentioned that the fix introduced in scope of this PR also fixes another issue, which is not directly related to nRF flash synchronization: #78761. Seems that we cannot rely on no blocking of the system work queue. Why the configuration is not enabled by default? Why it is experimental?
|
@MarekPieta @alwa-nordic |
That's an interesting idea, though in some ways it sounds like it's intruding on the responsibility area of the scheduler. I wouldn't limit it to "the system", rather make it a general "workqueue pool" style API, and then the system work queue could potentially be extended to take advantage of it. Instead of having a 1:1 mapping between a queue and a thread that runs the work, you'd have a 1:N mapping from a queue to multiple possible threads that can execute the work. Looking at |
Use a separate workqueue instead of system workqueue for connection TX notify processing. This makes Bluetooth stack more independent from the system workqueue.