Skip to content

Conversation

MarekPieta
Copy link
Contributor

@MarekPieta MarekPieta commented Oct 1, 2024

Use a separate workqueue instead of system workqueue for connection TX notify processing. This makes Bluetooth stack more independent from the system workqueue.

@Thalley
Copy link
Contributor

Thalley commented Oct 1, 2024

This feels going in a different direction that a few other recent PRs where we've move BT things to the system workqueue :o

@alwa-nordic
Copy link
Contributor

The idea is fine as an option if you want it. But we are not going to spend time testing this upstream. I have not reviewed the code yet.

As an alternative, I propose we create a second system work queue, and explicitly state in its documentation that it must be treated isr-like and invoking any blocking functions is forbidden. @MarekPieta, does your application have RAM available for another system work queue?

@MarekPieta
Copy link
Contributor Author

MarekPieta commented Oct 1, 2024

This feels going in a different direction that a few other recent PRs where we've move BT things to the system workqueue :o

It depends on what "things" you would like to do in the system workqueue context. From what I noticed so far, the Bluetooth stack may already perform settings (non-volatile memory) operations from the system workqueue context. Application also might perform some time consuming/settings-related operations from the system workqueue context. Such operations could have negative impact on the Bluetooth stack (delaying tx_notify processing).

Also there is a risk that synchronizing non-volatile memory and radio operations (required e.g. for nRF SoCs) could lead to problems if both Bluetooth and non-volatile memory operations are performed from the system workqueue context (and even Bluetooth stack performs settings operations from the system workqueue context). Blocking system workqueue during a non-volatile memory operation while waiting for Bluetooth to perform operation also in the system workqueue context could eventually even introduce a risk of deadlock related to flash synchronization mechanisms in some cases.

Because of the reasons explained above, I think that relying on separate threads for Bluetooth may improve stability of the Bluetooth stack in some use-cases. Relying on system workqueue context for more complex use-cases may be problematic.

@MarekPieta
Copy link
Contributor Author

MarekPieta commented Oct 1, 2024

The idea is fine as an option if you want it. But we are not going to spend time testing this upstream. I have not reviewed the code yet.

As an alternative, I propose we create a second system work queue, and explicitly state in its documentation that it must be treated isr-like and invoking any blocking functions is forbidden. @MarekPieta, does your application have RAM available for another system work queue?

I think that introducing an additional work queue to separate operations that are more time critical may be step in the right direction (more complex applications could benefit from it; it could also be shared among subsystems). I think that my application could use it for most of the supported boards too (we should be able to fit the additional thread in memory).

Introducing the proposed system-wide solution is a bigger task and it would require more work. Could we consider using my proposed improvement until the system-wide solution is ready (using change from this PR as a temporary improvement)?

@MarekPieta
Copy link
Contributor Author

Discussed offline with @alwa-nordic

I will use a separate workqueue (available under a Kconfig option) to handle tx_notify as currently proposed solution also introduces risk of deadlocks in some edge cases. I will update the PR once I prepare the new solution.

@MarekPieta MarekPieta force-pushed the bluetooth_context_improvements branch from bab22b9 to b662f75 Compare October 4, 2024 10:07
@MarekPieta MarekPieta changed the title bluetooth: conn: Use Bluetooth workqueue for tx_notify bluetooth: conn: Use a separate workqueue for connection TX notify Oct 4, 2024
@MarekPieta MarekPieta force-pushed the bluetooth_context_improvements branch from b662f75 to aa2ae65 Compare October 4, 2024 11:14
Comment on lines 132 to 139
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be tests verifying that the stack works with and without this option enabled, right?

Should there also be tests that verifies that ATT cannot time out/proves it does when the system workqueue is blocked for a long time?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should just insist on having this enabled and remove the option. Doing so will remove the need for some hacks in the code like https://github.com/zephyrproject-rtos/zephyr/pull/79258/files#diff-00881f70fc7ec49b966fed79bd6a5a8ed5357599b241b5b029736bbc97683f83R275-R288

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we just remove one thread from BT and replace it with the system workqueue, and now adding a new thread to offload from the system workqueue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should just insist on having this enabled and remove the option. Doing so will remove the need for some hacks in the code like https://github.com/zephyrproject-rtos/zephyr/pull/79258/files#diff-00881f70fc7ec49b966fed79bd6a5a8ed5357599b241b5b029736bbc97683f83R275-R288

Removing the Kconfig option and making the feature enabled by default. I will also modify https://github.com/zephyrproject-rtos/zephyr/pull/79258/files#diff-00881f70fc7ec49b966fed79bd6a5a8ed5357599b241b5b029736bbc97683f83R275-R288 (I noticed that you do not like this approach)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we just remove one thread from BT and replace it with the system workqueue, and now adding a new thread to offload from the system workqueue?

It seems that reusing system workqueue causes problems here. Because of that we may need to change approach here a bit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rugeGerritsen I think we should definitely have some tests in BSIM using this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If needed, maybe I could open a separate PR that enables the feature by default to run all of the CI tests with the feature enabled (for additional validation of the feature)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that we should run at least some tests in this PR, otherwise we are adding a untested Kconfig. It makes more sense to test before you merge a feature, rather than after :D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's enable feature by default on a separate test PR to run CI tests on it then: #79713

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's enable feature by default on a separate test PR to run CI tests on it then: #79713

Can you explain why you are not adding tests in this PR? We need the tests regardless

@MarekPieta MarekPieta force-pushed the bluetooth_context_improvements branch 2 times, most recently from 9266ecf to b03043b Compare October 7, 2024 11:54
@MarekPieta MarekPieta force-pushed the bluetooth_context_improvements branch from b03043b to 1eb0c82 Compare October 9, 2024 06:26
@MarekPieta MarekPieta force-pushed the bluetooth_context_improvements branch 4 times, most recently from 5b8dca3 to 680e996 Compare October 9, 2024 12:36
@MarekPieta
Copy link
Contributor Author

Applied changes from offline discussion with @alwa-nordic

Just to be clear: It is still the case that when tx_notify_process is called in the context of conn_tx_workq then it still calls the TX callback which can still be blocked by a poorly implemented application, so I'm still unclear whether this is really fixing the issue if conn_tx_workq should never be blocked.

Using a separate workq at least mitigates negative impact of other works that rely on the system workqueue. Because of that, it might be useful for some of the applications. E.g. if the system workqueue is known to be used for tasks that may last long - like settings operations which could trigger NVM erases. The feature is optional, so if an application does not need it, it can simply keep the feature disabled.

The feature can also help with mitigating risks related to synchronizing non-volatile memory and radio operations for nRF SoCs (see #79258 (comment) for details).

Copy link
Contributor

@Thalley Thalley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the use case for this and the side effects. It's not a direction I'm happy about (basically re-adding a minimal TX thread again), but I have to assume that there's requirement that unfortunately isn't very clear here (no links to feature requests/issues).

It seems like a temporary patch to a fundamental issue, and this solution only alleviates the issue and doesn't solve it.

That being said, I won't be blocking this once my outstanding comments have been solved.

Besides the comments, we should add an entry in the release notes stating that this was added, as the removal of the TX thread was added to the 3.7 release notes.

Comment on lines 142 to 144
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a pretty significant default stack size. Since the stack is only using this for a fairly fall function, shouldn't it default to something like 128/256/512?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially made the stack size bigger to ensure that we are on safe side (unfortunately I can validate it locally only in limited number of use-cases related to our application - e.g. we do not use CONFIG_BT_ISO_TX).

@alwa-nordic could you advise on the stack size here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a good case for testing a stack size would be to run one of the LE Audio sample and in the bt_bap_stream_ops.sent callback measure the stack usage.

The reason why I think this is a good way of measuring it, is that the callback goes from the work item in conn.c to the sent callback in iso.c to the sent callback in BAP and then finally to the application (possible passing through the sent callback in cap_stream.c as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the Bluetooth WG meeting today, there was consensus to default this setting to copy the value from the system work queue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, updated.

Comment on lines 146 to 149
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this being a non-configurable Kconfig option, compared to just a const value or #define?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it in the same way as for CONFIG_BT_RX_PRIO and CONFIG_BT_DRIVER_RX_HIGH_PRIO. Still, it would not hurt to make the priority user-configurable. Updated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, checked those out too, and didn't really understand why they were non-configurable Kconfig options either :)

Copy link
Contributor Author

@MarekPieta MarekPieta Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed that it's done to prevent breaking something by reconfiguring application while still allowing to change this Kconfig value by overriding the default in application's Kconfig definition file (advanced configuration done only if user knows what he is doing). That's why I initially followed the same approach here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That pretty much sounds like "We want it to be configurable but not really" :D Or "Configurable but with extra steps".

int Kconfigs are generally just a pain to work with, as they are not as flexible as bools (imply select only works with bools).

@alwa-nordic alwa-nordic added the Bluetooth Review Discussion in the Bluetooth WG meeting required label Oct 10, 2024
@LingaoM
Copy link
Contributor

LingaoM commented Oct 10, 2024

Why not borrow BT RX? This is also a workq, and the stack size is large enough.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR mostly looks ok now, so I'll do a nitpick: I would like to rename this to BT_CONN_TX_NOTIFY_WQ to avoid confusing it with tx_work.

Copy link
Contributor Author

@MarekPieta MarekPieta Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New name sounds better. Changed

@MarekPieta
Copy link
Contributor Author

Why not borrow BT RX? This is also a workq, and the stack size is large enough.

Unfortunately that might not work properly too. I already discussed this approach with @alwa-nordic. He explained to me that it might cause deadlocks because the work posted to BT WQ may be blocking - e.g. it invokes application callbacks, ATT server (which blocks on allocation), etc.
Because of that, the idea was eventually rejected.

@MarekPieta MarekPieta force-pushed the bluetooth_context_improvements branch from 7fcfa90 to cdc867c Compare October 11, 2024 08:29
@zephyrbot zephyrbot added the Release Notes To be mentioned in the release notes label Oct 11, 2024
@MarekPieta
Copy link
Contributor Author

Besides the comments, we should add an entry in the release notes stating that this was added, as the removal of the TX thread was added to the 3.7 release notes.

Added note to Zephyr 4.0 release notes (there seems to be no file for the 3.8 release)

Copy link
Contributor

@Thalley Thalley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objects from me anymore, but as discussed in the BT WG, this is a half-asses solution to a problem caused by an application's usage of the system workqueue.

There are probably better solutions to this, but less trivial to implement than this.

I am OK with adding this for now, but let's mark it as experimental so we can easily remove it again

Comment on lines 110 to 111
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have max line length requirements for .rst files?

Copy link
Contributor Author

@MarekPieta MarekPieta Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that generally doc lines tend to be split if they are longer than 100 characters (but it's not always a case). I will align my note.

btw. Documentation build seems not to complain about it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kartben Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya in theory it's 100. In practice a better rule would probably be one sentence per line, as it also makes it easier for folks to realize that a blank line is required for paragraph breaks -- when using line wrapping people tend to assume that having just one line return will do that (as it looks "ok enough" in source form) when obviously it doesn't. It also makes diffs cleaner too, since adding a word or two to a line really just impacts that line as opposed to potentially many if it "pushes" a wrapped line to the next (and so on).
Unfortunately switching to this approach would require updating (and potentially breaking) so many files that I have yet to give it a try :)

In the meantime, 100 characters per line it is, when possible :)

@MarekPieta MarekPieta force-pushed the bluetooth_context_improvements branch from cdc867c to 4b70631 Compare October 11, 2024 09:50
Use a separate workqueue instead of system workqueue for connection TX
notify processing. This makes Bluetooth stack more independent from the
system workqueue.

Signed-off-by: Marek Pieta <Marek.Pieta@nordicsemi.no>
Change adds a release note informing about the newly introduced Kconfig
option for Bluetooth stack.

Signed-off-by: Marek Pieta <Marek.Pieta@nordicsemi.no>
@MarekPieta MarekPieta force-pushed the bluetooth_context_improvements branch from 4b70631 to 26b7104 Compare October 11, 2024 09:56
@Thalley Thalley dismissed their stale review October 11, 2024 12:48

Out-of-PR tests are passing with this, so won't be blocking, but this PR really should include proper testing of this new option

Copy link
Contributor

@Thalley Thalley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving on the condition that @MarekPieta or someone else involved will add proper tests of this soon after.

Copy link
Contributor

@rugeGerritsen rugeGerritsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the commit message with the following:

  • The motivation behind the change
    • What use case will start working when enabling this config/stopped working when the TX queue was removed
    • Why the configuration is not enabled by default
    • Why it is experimental

Document the plans going forward:

  • How and who is going to test this?
  • Are we planning to get rid of it afterwards?

That will make it easier to look at this change again some time later

@nashif nashif merged commit bc5f1c8 into zephyrproject-rtos:main Oct 11, 2024
26 checks passed
@MarekPieta MarekPieta deleted the bluetooth_context_improvements branch October 14, 2024 07:11
@PavelVPV
Copy link
Contributor

(I know that I'm writing to the merged PR but for future).

IMHO it would be good to try to come up with a solution without introducing extra threads to the host when fixing bugs spawned after the tx thread removal. This change adds extra 1k RAM overhead. Mesh now needs ~1k more to fix the consequences caused by the tx thread removal. This makes less and less feasible to run a reliable solution on platforms with constraint RAM, like nRF52832. For example, the nRF Connect SDK light_ctrl sample built for nRF52832 already has only ~6.5kBytes RAM left:

RAM:       58950 B        64 KB     89.95%

@MarekPieta
Copy link
Contributor Author

Can you update the commit message with the following:

  • The motivation behind the change

    • What use case will start working when enabling this config/stopped working when the TX queue was removed
    • Why the configuration is not enabled by default
    • Why it is experimental

Document the plans going forward:

  • How and who is going to test this?
  • Are we planning to get rid of it afterwards?

That will make it easier to look at this change again some time later

The PR was merged already so I cannot update the commit message. Agreed with @rugeGerritsen to provide the answers as a separate comment here:

What use case will start working when enabling this config/stopped working when the TX queue was removed?

Locally I observed problems with nRF flash synchronization mechanism (my application uses a flash synchronization mechanism that is implemented outside Zephyr). Situation where an application triggered a flash erase operation from the system workqueue context and received a GATT notification could lead to a deadlock. I guess that Zephyr SW LL may be affected by similar problems, because FLASH synchronization is required for nRF SoCs in general.

@alwa-nordic mentioned that the fix introduced in scope of this PR also fixes another issue, which is not directly related to nRF flash synchronization: #78761. Seems that we cannot rely on no blocking of the system work queue.

Why the configuration is not enabled by default?
It introduces an additional thread (that significantly increases RAM usage) which might not be necessary for many of the use-cases (the deadlock issue might not replicate for some applications). At this moment, I enabled the fix in my application configuration to workaround problems that I observed locally. We could consider enabling this by default in general or depending on other enabled Kconfigs (if we accept the RAM usage increase). @theob-pro, @alwa-nordic, @Thalley, what's your opinion on that?

Why it is experimental?
The newly introduced workqueue was accepted as a temporary solution. During discussion with @alwa-nordic, we were considering to introduce a new global workqueue that could only execute non-blocking works (as we already noticed that the system workqueue may block in many practical use-cases). The newly introduced global non-blocking workqueue could be shared among various subsystems and application. The non-blocking workqueue would ensure better code execution responsiveness than the system workqueue. Introducing the solution globally may require more time, because of that we currently rely on the workaround implemented in the Bluetooth host scope.

  • How and who is going to test this?
    We agreed, that I will introduce Babblesim tests for the feature soon (under a separate PR).

  • Are we planning to get rid of it afterwards?
    We may consider removing the workqueue introduced here after we e.g. introduce a global non-blocking workqueue to Zephyr. See Why it is experimental? section of this comment for more details.

@Thalley
Copy link
Contributor

Thalley commented Oct 14, 2024

@MarekPieta @alwa-nordic
Arguably we should discuss this further in a GH discussion or on Discord, but just to "answer" the above: Has the idea of being able to configure multiple (identical) system workqueues with a load balancer been considered? It sounds like overkill in an embedded system, but could in theory have fixed this issue as well, and could work as a scalable solution compared to the suggested non-blocking workqueue (how would we even enforce that?).

@jhedberg
Copy link
Member

Has the idea of being able to configure multiple (identical) system workqueues with a load balancer been considered?

That's an interesting idea, though in some ways it sounds like it's intruding on the responsibility area of the scheduler. I wouldn't limit it to "the system", rather make it a general "workqueue pool" style API, and then the system work queue could potentially be extended to take advantage of it. Instead of having a 1:1 mapping between a queue and a thread that runs the work, you'd have a 1:N mapping from a queue to multiple possible threads that can execute the work. Looking at kernel/work.c it seems like it might not even be that hard to implement. Most likely it's not quite as simple as starting multiple work_queue_main() threads for the same queue, however it's possible that not much more is needed. Anyway, if you want to drive this, it's something that should be proposed on a Zephyr architecture level, i.e. create an issue for it and bring it to the Architecture WG for discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Bluetooth Host Bluetooth Host (excluding BR/EDR) area: Bluetooth Bluetooth Review Discussion in the Bluetooth WG meeting required Release Notes To be mentioned in the release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants