bluetooth: conn: Use a separate workqueue for connection TX notify #79258

MarekPieta · 2024-10-01T10:29:50Z

Use a separate workqueue instead of system workqueue for connection TX notify processing. This makes Bluetooth stack more independent from the system workqueue.

Thalley · 2024-10-01T10:51:22Z

This feels like it is going in a different direction than a few other recent PRs where we've moved BT things to the system workqueue :o

alwa-nordic · 2024-10-01T11:02:46Z

The idea is fine as an option if you want it. But we are not going to spend time testing this upstream. I have not reviewed the code yet.

As an alternative, I propose we create a second system work queue, and explicitly state in its documentation that it must be treated isr-like and invoking any blocking functions is forbidden. @MarekPieta, does your application have RAM available for another system work queue?

MarekPieta · 2024-10-01T13:07:31Z

This feels like it is going in a different direction than a few other recent PRs where we've moved BT things to the system workqueue :o

It depends on what "things" you would like to do in the system workqueue context. From what I have noticed so far, the Bluetooth stack may already perform settings (non-volatile memory) operations from the system workqueue context. The application might also perform time-consuming or settings-related operations from the system workqueue context. Such operations could have a negative impact on the Bluetooth stack (delaying tx_notify processing).

There is also a risk that synchronizing non-volatile memory and radio operations (required e.g. for nRF SoCs) could lead to problems if both Bluetooth and non-volatile memory operations are performed from the system workqueue context (and the Bluetooth stack itself performs settings operations from that context). Blocking the system workqueue during a non-volatile memory operation while waiting for Bluetooth to perform an operation that also runs in the system workqueue context could, in some cases, lead to a deadlock in the flash synchronization mechanisms.

For the reasons explained above, I think that relying on separate threads for Bluetooth may improve the stability of the Bluetooth stack in some use cases. Relying on the system workqueue context for more complex use cases may be problematic.
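
The delay described above can be illustrated with a toy model. This is a minimal sketch in plain C, not Zephyr code: it assumes a work queue is a single thread that runs items strictly in FIFO order, and it ignores thread priorities and preemption. The function name, the arrays, and the tick values are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of one work queue thread: items run back to back in FIFO
 * order. Durations are in arbitrary ticks. Returns the tick at which
 * the item at position idx finishes.
 */
unsigned int fifo_finish_time(const unsigned int *durations, size_t count,
			      size_t idx)
{
	unsigned int now = 0;

	assert(idx < count);
	for (size_t i = 0; i <= idx; i++) {
		now += durations[i];
	}
	return now;
}

/* Hypothetical load: a 100-tick settings write queued ahead of a 1-tick tx_notify. */
const unsigned int shared_queue[] = { 100, 1 };

/* Same tx_notify item on a queue of its own (assumed to get CPU time immediately). */
const unsigned int dedicated_queue[] = { 1 };
```

In this model, `fifo_finish_time(shared_queue, 2, 1)` is 101 ticks while `fifo_finish_time(dedicated_queue, 1, 0)` is 1 tick. The model only shows head-of-line blocking; it says nothing about the deadlock scenario, which depends on who waits for whom.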

MarekPieta · 2024-10-01T13:18:18Z

The idea is fine as an option if you want it. But we are not going to spend time testing this upstream. I have not reviewed the code yet.

As an alternative, I propose we create a second system work queue, and explicitly state in its documentation that it must be treated isr-like and invoking any blocking functions is forbidden. @MarekPieta, does your application have RAM available for another system work queue?

I think that introducing an additional work queue to separate the more time-critical operations may be a step in the right direction (more complex applications could benefit from it; it could also be shared among subsystems). I think that my application could use it for most of the supported boards too (we should be able to fit the additional thread in memory).

Introducing the proposed system-wide solution is a bigger task and it would require more work. Could we consider using my proposed improvement until the system-wide solution is ready (using change from this PR as a temporary improvement)?

MarekPieta · 2024-10-01T14:19:36Z

Discussed offline with @alwa-nordic

I will use a separate workqueue (available under a Kconfig option) to handle tx_notify, as the currently proposed solution also introduces a risk of deadlocks in some edge cases. I will update the PR once I have prepared the new solution.

rugeGerritsen · 2024-10-04T11:17:02Z

subsys/bluetooth/host/Kconfig

There should be tests verifying that the stack works with and without this option enabled, right?

Should there also be tests that verify that ATT cannot time out, or prove that it does, when the system workqueue is blocked for a long time?

I think we should just insist on having this enabled and remove the option. Doing so will remove the need for some hacks in the code like https://github.com/zephyrproject-rtos/zephyr/pull/79258/files#diff-00881f70fc7ec49b966fed79bd6a5a8ed5357599b241b5b029736bbc97683f83R275-R288

Did we just remove one thread from BT and replace it with the system workqueue, and now adding a new thread to offload from the system workqueue?

I think we should just insist on having this enabled and remove the option. Doing so will remove the need for some hacks in the code like https://github.com/zephyrproject-rtos/zephyr/pull/79258/files#diff-00881f70fc7ec49b966fed79bd6a5a8ed5357599b241b5b029736bbc97683f83R275-R288

Removing the Kconfig option and making the feature enabled by default. I will also modify https://github.com/zephyrproject-rtos/zephyr/pull/79258/files#diff-00881f70fc7ec49b966fed79bd6a5a8ed5357599b241b5b029736bbc97683f83R275-R288 (I noticed that you do not like this approach)

Did we just remove one thread from BT and replace it with the system workqueue, and now adding a new thread to offload from the system workqueue?

It seems that reusing system workqueue causes problems here. Because of that we may need to change approach here a bit.

@rugeGerritsen I think we should definitely have some tests in BSIM using this.

If needed, maybe I could open a separate PR that enables the feature by default to run all of the CI tests with the feature enabled (for additional validation of the feature)?

I would argue that we should run at least some tests in this PR, otherwise we are adding an untested Kconfig. It makes more sense to test before you merge a feature, rather than after :D

Let's enable feature by default on a separate test PR to run CI tests on it then: #79713

Let's enable feature by default on a separate test PR to run CI tests on it then: #79713

Can you explain why you are not adding tests in this PR? We need the tests regardless

MarekPieta · 2024-10-09T13:52:28Z

Applied changes from offline discussion with @alwa-nordic

Just to be clear: It is still the case that when tx_notify_process is called in the context of conn_tx_workq then it still calls the TX callback which can still be blocked by a poorly implemented application, so I'm still unclear whether this is really fixing the issue if conn_tx_workq should never be blocked.

Using a separate workq at least mitigates the negative impact of other work items that rely on the system workqueue. Because of that, it might be useful for some applications, e.g. if the system workqueue is known to be used for tasks that may take a long time, like settings operations that could trigger NVM erases. The feature is optional, so if an application does not need it, it can simply keep the feature disabled.

The feature can also help with mitigating risks related to synchronizing non-volatile memory and radio operations for nRF SoCs (see #79258 (comment) for details).

Thalley

I understand the use case for this and the side effects. It's not a direction I'm happy about (basically re-adding a minimal TX thread again), but I have to assume that there's a requirement that unfortunately isn't very clear here (no links to feature requests/issues).

It seems like a temporary patch to a fundamental issue, and this solution only alleviates the issue and doesn't solve it.

That being said, I won't be blocking this once my outstanding comments have been solved.

Besides the comments, we should add an entry in the release notes stating that this was added, as the removal of the TX thread was added to the 3.7 release notes.

Thalley · 2024-10-09T14:55:03Z

subsys/bluetooth/host/Kconfig

That is a pretty significant default stack size. Since the stack is only using this for a fairly small function, shouldn't it default to something like 128/256/512?

I initially made the stack size bigger to ensure that we are on the safe side (unfortunately I can validate it locally only in a limited number of use cases related to our application, e.g. we do not use CONFIG_BT_ISO_TX).

@alwa-nordic could you advise on the stack size here?

I think a good case for testing a stack size would be to run one of the LE Audio sample and in the bt_bap_stream_ops.sent callback measure the stack usage.

The reason why I think this is a good way of measuring it is that the callback goes from the work item in conn.c to the sent callback in iso.c to the sent callback in BAP and then finally to the application (possibly passing through the sent callback in cap_stream.c as well).

At the Bluetooth WG meeting today, there was consensus to default this setting to copy the value from the system work queue.

Ok, updated.

Thalley · 2024-10-09T14:55:54Z

subsys/bluetooth/host/Kconfig

What's the purpose of this being a non-configurable Kconfig option, compared to just a const value or #define?

I did it in the same way as for CONFIG_BT_RX_PRIO and CONFIG_BT_DRIVER_RX_HIGH_PRIO. Still, it would not hurt to make the priority user-configurable. Updated.

Yeah, checked those out too, and didn't really understand why they were non-configurable Kconfig options either :)

I assumed that it's done to prevent breaking something by reconfiguring application while still allowing to change this Kconfig value by overriding the default in application's Kconfig definition file (advanced configuration done only if user knows what he is doing). That's why I initially followed the same approach here.

That pretty much sounds like "We want it to be configurable but not really" :D Or "Configurable but with extra steps".

int Kconfigs are generally just a pain to work with, as they are not as flexible as bools (imply/select only work with bools).

subsys/bluetooth/host/conn.c

LingaoM · 2024-10-10T11:31:03Z

Why not borrow BT RX? This is also a workq, and the stack size is large enough.

alwa-nordic · 2024-10-10T13:54:45Z

subsys/bluetooth/host/Kconfig

The PR mostly looks ok now, so I'll do a nitpick: I would like to rename this to BT_CONN_TX_NOTIFY_WQ to avoid confusing it with tx_work.

New name sounds better. Changed

MarekPieta · 2024-10-11T06:47:19Z

Why not borrow BT RX? This is also a workq, and the stack size is large enough.

Unfortunately that might not work properly either. I already discussed this approach with @alwa-nordic. He explained to me that it might cause deadlocks because the work posted to BT WQ may be blocking - e.g. it invokes application callbacks, ATT server (which blocks on allocation), etc.
Because of that, the idea was eventually rejected.

MarekPieta · 2024-10-11T08:30:43Z

Besides the comments, we should add an entry in the release notes stating that this was added, as the removal of the TX thread was added to the 3.7 release notes.

Added note to Zephyr 4.0 release notes (there seems to be no file for the 3.8 release)

Thalley

No objections from me anymore, but as discussed in the BT WG, this is a half-assed solution to a problem caused by an application's usage of the system workqueue.

There are probably better solutions to this, but less trivial to implement than this.

I am OK with adding this for now, but let's mark it as experimental so we can easily remove it again

Thalley · 2024-10-11T09:08:11Z

doc/releases/release-notes-4.0.rst

Do we have max line length requirements for .rst files?

I see that generally doc lines tend to be split if they are longer than 100 characters (but it's not always the case). I will align my note.

btw. Documentation build seems not to complain about it.

@kartben Thoughts?

Ya in theory it's 100. In practice a better rule would probably be one sentence per line, as it also makes it easier for folks to realize that a blank line is required for paragraph breaks -- when using line wrapping people tend to assume that having just one line return will do that (as it looks "ok enough" in source form) when obviously it doesn't. It also makes diffs cleaner too, since adding a word or two to a line really just impacts that line as opposed to potentially many if it "pushes" a wrapped line to the next (and so on).
Unfortunately switching to this approach would require updating (and potentially breaking) so many files that I have yet to give it a try :)

In the meantime, 100 characters per line it is, when possible :)

subsys/bluetooth/host/Kconfig

Out-of-PR tests are passing with this, so won't be blocking, but this PR really should include proper testing of this new option

Thalley

Approving on the condition that @MarekPieta or someone else involved will add proper tests of this soon after.

rugeGerritsen

Can you update the commit message with the following:

The motivation behind the change
- What use case will start working when enabling this config/stopped working when the TX queue was removed
- Why the configuration is not enabled by default
- Why it is experimental

Document the plans going forward:

- How and who is going to test this?
- Are we planning to get rid of it afterwards?

That will make it easier to look at this change again some time later

PavelVPV · 2024-10-14T11:15:43Z

(I know that I'm writing to the merged PR but for future).

IMHO it would be good to try to come up with a solution that does not introduce extra threads to the host when fixing bugs that appeared after the TX thread removal. This change adds an extra 1k of RAM overhead. Mesh now needs ~1k more to fix the consequences of the TX thread removal. This makes it less and less feasible to run a reliable solution on platforms with constrained RAM, like the nRF52832. For example, the nRF Connect SDK light_ctrl sample built for nRF52832 already has only ~6.5 kB of RAM left:

RAM:       58950 B        64 KB     89.95%
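
The figures in the memory report above can be checked with a few lines of arithmetic. This is a small sketch, assuming the "64 KB" region is 65536 bytes and taking the "extra 1k" mentioned in the comment as 1024 bytes; the helper names are hypothetical.

```c
#include <assert.h>

#define KIB 1024u

/* Bytes still free in a RAM region of total_bytes with used_bytes occupied. */
unsigned int ram_free_bytes(unsigned int total_bytes, unsigned int used_bytes)
{
	assert(used_bytes <= total_bytes);
	return total_bytes - used_bytes;
}

/* Usage in hundredths of a percent, rounded down (8995 means 89.95%). */
unsigned int ram_used_centipercent(unsigned int total_bytes,
				   unsigned int used_bytes)
{
	assert(total_bytes > 0);
	return (unsigned int)(((unsigned long long)used_bytes * 10000u) /
			      total_bytes);
}
```

With these helpers, `ram_free_bytes(64u * KIB, 58950u)` gives 6586 bytes, `ram_used_centipercent(64u * KIB, 58950u)` gives 8995 (matching the reported 89.95%), and one more 1024-byte stack would leave 5562 bytes.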

MarekPieta · 2024-10-14T12:43:35Z

Can you update the commit message with the following:

The motivation behind the change

What use case will start working when enabling this config/stopped working when the TX queue was removed

Why the configuration is not enabled by default

Why it is experimental

Document the plans going forward:

How and who is going to test this?

Are we planning to get rid of it afterwards?

That will make it easier to look at this change again some time later

The PR was merged already so I cannot update the commit message. Agreed with @rugeGerritsen to provide the answers as a separate comment here:

What use case will start working when enabling this config/stopped working when the TX queue was removed?

Locally I observed problems with the nRF flash synchronization mechanism (my application uses a flash synchronization mechanism that is implemented outside Zephyr). A situation where the application triggered a flash erase operation from the system workqueue context and received a GATT notification could lead to a deadlock. I guess that Zephyr SW LL may be affected by similar problems, because flash synchronization is required for nRF SoCs in general.

@alwa-nordic mentioned that the fix introduced in the scope of this PR also fixes another issue, which is not directly related to nRF flash synchronization: #78761. It seems that we cannot rely on the system work queue never being blocked.

Why the configuration is not enabled by default?
It introduces an additional thread (that significantly increases RAM usage) which might not be necessary for many use cases (the deadlock issue might not reproduce for some applications). At this moment, I have enabled the fix in my application configuration to work around problems that I observed locally. We could consider enabling this by default in general or depending on other enabled Kconfigs (if we accept the RAM usage increase). @theob-pro, @alwa-nordic, @Thalley, what's your opinion on that?

Why it is experimental?
The newly introduced workqueue was accepted as a temporary solution. During discussion with @alwa-nordic, we considered introducing a new global workqueue that could only execute non-blocking work items (as we already noticed that the system workqueue may block in many practical use cases). Such a global non-blocking workqueue could be shared among various subsystems and the application, and it would ensure better code execution responsiveness than the system workqueue. Introducing the solution globally may require more time, so for now we rely on the workaround implemented in the Bluetooth host scope.

How and who is going to test this?
We agreed that I will introduce Babblesim tests for the feature soon (in a separate PR).

Are we planning to get rid of it afterwards?
We may consider removing the workqueue introduced here after we e.g. introduce a global non-blocking workqueue to Zephyr. See the "Why it is experimental?" section of this comment for more details.

Thalley · 2024-10-14T13:02:06Z

@MarekPieta @alwa-nordic
Arguably we should discuss this further in a GH discussion or on Discord, but just to "answer" the above: Has the idea of being able to configure multiple (identical) system workqueues with a load balancer been considered? It sounds like overkill in an embedded system, but could in theory have fixed this issue as well, and could work as a scalable solution compared to the suggested non-blocking workqueue (how would we even enforce that?).

jhedberg · 2024-10-15T08:41:19Z

Has the idea of being able to configure multiple (identical) system workqueues with a load balancer been considered?

That's an interesting idea, though in some ways it sounds like it's intruding on the responsibility area of the scheduler. I wouldn't limit it to "the system", rather make it a general "workqueue pool" style API, and then the system work queue could potentially be extended to take advantage of it. Instead of having a 1:1 mapping between a queue and a thread that runs the work, you'd have a 1:N mapping from a queue to multiple possible threads that can execute the work. Looking at kernel/work.c it seems like it might not even be that hard to implement. Most likely it's not quite as simple as starting multiple work_queue_main() threads for the same queue, however it's possible that not much more is needed. Anyway, if you want to drive this, it's something that should be proposed on a Zephyr architecture level, i.e. create an issue for it and bring it to the Architecture WG for discussion.
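
The 1:N mapping described above can be sketched as a toy scheduling model. This is plain C under stated assumptions, not code from kernel/work.c: items are taken in FIFO order by whichever worker thread becomes idle first, and blocking is modeled simply as a long duration. All names and tick values are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORKERS 8

/*
 * Toy model of one queue served by several worker threads. Returns the
 * tick at which the item at position idx finishes.
 */
unsigned int pool_finish_time(const unsigned int *durations, size_t count,
			      size_t workers, size_t idx)
{
	unsigned int free_at[MAX_WORKERS] = { 0 };
	unsigned int finish = 0;

	assert(workers > 0 && workers <= MAX_WORKERS);
	assert(idx < count);

	for (size_t i = 0; i <= idx; i++) {
		size_t next = 0;

		/* Pick the worker that becomes idle earliest. */
		for (size_t w = 1; w < workers; w++) {
			if (free_at[w] < free_at[next]) {
				next = w;
			}
		}
		free_at[next] += durations[i];
		finish = free_at[next];
	}
	return finish;
}

/* Hypothetical load: one 100-tick blocking item followed by two 1-tick items. */
const unsigned int example_load[] = { 100, 1, 1 };
```

With one worker, the second item of `example_load` finishes at tick 101; with two workers it finishes at tick 1. The model also shows the limit of the approach: if as many items block as there are workers, the next short item still waits, so a pool reduces the problem rather than ruling it out.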

MarekPieta requested review from jori-nordic, theob-pro and alwa-nordic October 1, 2024 10:29

zephyrbot added area: Bluetooth area: Bluetooth Host Bluetooth Host (excluding BR/EDR) labels Oct 1, 2024

zephyrbot requested review from hermabe, jhedberg, sjanc and Thalley October 1, 2024 10:30

zephyrbot assigned jhedberg and alwa-nordic Oct 1, 2024

alwa-nordic mentioned this pull request Oct 2, 2024

Bluetooth: Host: ATT: Zephyr host is deadlocked due to the use of the low-priority system work queue to block the high-priority BT RX thread. #78761

Closed

MarekPieta force-pushed the bluetooth_context_improvements branch from bab22b9 to b662f75 Compare October 4, 2024 10:07

MarekPieta changed the title ~~bluetooth: conn: Use Bluetooth workqueue for tx_notify~~ bluetooth: conn: Use a separate workqueue for connection TX notify Oct 4, 2024

MarekPieta force-pushed the bluetooth_context_improvements branch from b662f75 to aa2ae65 Compare October 4, 2024 11:14

rugeGerritsen reviewed Oct 4, 2024

MarekPieta force-pushed the bluetooth_context_improvements branch 2 times, most recently from 9266ecf to b03043b Compare October 7, 2024 11:54

alwa-nordic mentioned this pull request Oct 8, 2024

Bluetooth: Host: Don't invoke callbacks from tx_work #79538

Open

MarekPieta force-pushed the bluetooth_context_improvements branch from b03043b to 1eb0c82 Compare October 9, 2024 06:26

kapi-no mentioned this pull request Oct 9, 2024

applications: nrf_desktop: enable encryption for mcumgr smp service nrfconnect/sdk-nrf#17764

Merged

MarekPieta force-pushed the bluetooth_context_improvements branch 4 times, most recently from 5b8dca3 to 680e996 Compare October 9, 2024 12:36

Thalley requested changes Oct 9, 2024

alwa-nordic added the Bluetooth Review Discussion in the Bluetooth WG meeting required label Oct 10, 2024

alwa-nordic reviewed Oct 10, 2024

MarekPieta force-pushed the bluetooth_context_improvements branch from 7fcfa90 to cdc867c Compare October 11, 2024 08:29

zephyrbot added the Release Notes To be mentioned in the release notes label Oct 11, 2024

zephyrbot requested review from dkalowsk, kartben and mmahadevan108 October 11, 2024 08:30

Thalley previously requested changes Oct 11, 2024

MarekPieta force-pushed the bluetooth_context_improvements branch from cdc867c to 4b70631 Compare October 11, 2024 09:50

MarekPieta added 2 commits October 11, 2024 11:55

bluetooth: conn: Use a separate workqueue for connection TX notify

3c7ad3c

Use a separate workqueue instead of system workqueue for connection TX notify processing. This makes Bluetooth stack more independent from the system workqueue. Signed-off-by: Marek Pieta <Marek.Pieta@nordicsemi.no>

doc: releases: Add note for CONFIG_BT_CONN_TX_NOTIFY_WQ

26b7104

Change adds a release note informing about the newly introduced Kconfig option for Bluetooth stack. Signed-off-by: Marek Pieta <Marek.Pieta@nordicsemi.no>

MarekPieta force-pushed the bluetooth_context_improvements branch from 4b70631 to 26b7104 Compare October 11, 2024 09:56

alwa-nordic approved these changes Oct 11, 2024

MarekPieta mentioned this pull request Oct 11, 2024

Bluetooth context improvements default y test #79713

Closed

Thalley approved these changes Oct 11, 2024

MarekPieta requested a review from rugeGerritsen October 11, 2024 13:51

rugeGerritsen requested changes Oct 11, 2024

nashif merged commit bc5f1c8 into zephyrproject-rtos:main Oct 11, 2024
26 checks passed

MarekPieta deleted the bluetooth_context_improvements branch October 14, 2024 07:11

LingaoM mentioned this pull request Nov 12, 2024

DNM: Bluetooth: Host: Make bt rx thread never blocked #80718

Closed
