Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add qos_reliablity param #329

Merged
merged 7 commits into from
Jan 28, 2025

Conversation

MrBlenny
Copy link
Contributor

@MrBlenny MrBlenny commented Dec 5, 2024

Changelog

Adds a qos_reliablity param to the ros2 bridge that allows the default reliability to be set via param. Defaults to "automatic" to retain existing behavior.

Docs

N/A

Description

The web-socket bridge runs on TCP which causes issues on flakey internet connections. There is some discussion here:
https://github.com/orgs/foxglove/discussions/15

This PR adds the ability to specify "qos_reliability" of "best_effort" which is useful when running a bridge on a different computer to the robot for example:

ROBOT COMPUTER -> DDS/UDP -> REMOTE COMPUTER -> Websocket/TCP -> Foxglove / Browser

@jtbandes
Copy link
Member

jtbandes commented Dec 5, 2024

when running a bridge on a different computer to the robot

Just curious, why are you doing this rather than running the bridge on the robot?

@MrBlenny
Copy link
Contributor Author

MrBlenny commented Dec 5, 2024

Running the bridge on the robot means the connection from the Robot to Foxglove is TCP.
When comms is poor (for example satellite internet on a field robot) the TCP connection can go down for a few seconds which will lead to a build-up of messages, potentially up to the send_buffer_limit. When connection is established there will then be a large flood of messages, some of which may no longer be relevant.

In my use-case, it is fine if messages are dropped via the DDS middleware. Adding WebRTC or WebTransport would be a nicer solution but this param provides a quick fix.

README.md Outdated
@@ -84,6 +84,7 @@ Parameters are provided to configure the behavior of the bridge. These parameter
* (ROS 2) __num_threads__: The number of threads to use for the ROS node executor. This controls the number of subscriptions that can be processed in parallel. 0 means one thread per CPU core. Defaults to `0`.
* (ROS 2) __min_qos_depth__: Minimum depth used for the QoS profile of subscriptions. Defaults to `1`. This is to set a lower limit for a subscriber's QoS depth which is computed by summing up depths of all publishers. See also [#208](https://github.com/foxglove/ros-foxglove-bridge/issues/208).
* (ROS 2) __max_qos_depth__: Maximum depth used for the QoS profile of subscriptions. Defaults to `25`.
* (ROS 2) __qos_reliability__: The default QoS reliability setting for subscriptions the bridge creates. Can be 'reliable', 'best_effort', or 'automatic'. Defaults to `automatic`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to explain what the automatic setting does under the hood.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a description of all options and another best_effort_if_volatile.
I know this is borderline but... this is a very useful default if running the bridge remote as transient_local topics are typically not things that should be thrown away.

A better option would probably be to have a regex matcher on topics that specifies which QoS foxglove should use when in subscribes.

README.md Outdated
* (ROS 2) __qos_reliability__: The default QoS reliability setting for subscriptions the bridge creates. Defaults to `automatic`.
* `reliable`: ALWAYS subscribe with a "reliable" QoS profile.
* `best_effort`: ALWAYS subscribe with a "best effort" QoS profile.
* `best_effort_if_volatile`: subscribe as "best effort" if all the participants are "volatile". If any are "transient_local", subscribe as "reliable".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it wrong to subscribe with "best effort" if any of the participants are "transient_local" ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I can fully answer this question but the following info may be helpful: https://docs.ros.org/en/rolling/Concepts/Intermediate/About-Quality-of-Service-Settings.html#qos-compatibilities

Copy link
Contributor Author

@MrBlenny MrBlenny Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a common pattern to publish a topic with "transient_local" durability whenever that value changes. For example a /robot/healthy Bool topic may be published only when the value changes from true<->false.

If the bridge subscribes to a transient_local topic with best_effort over a flakey connection. The historic transient_local message may be lost. These sorts of transient_local topics are typically not re-published so the bridge will never get the value.

best_effort_if_volatile is a useful default in this case but as mentioned, it's probably better to add a regex matcher param instead of this qos_reliability param.

i.e.
best_effort_qos_topic_whitelist: List of regular expressions (ECMAScript) for topics that should use be forced to use 'best_effort' QoS. Unmatched topics will use 'reliable' QoS if ALL publishers are 'reliable', 'best_effort' if any publishers are 'best_effort'. Defaults to ["(?!)"] (match nothing).

I might have a look at adding this today.

Copy link
Contributor Author

@MrBlenny MrBlenny Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made said change. Ready for review again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 we'll take a look

It's a common pattern to publish a topic with "transient_local" durability whenever that value changes. For example a /robot/healthy Bool topic may be published only when the value changes from true<->false.

FWIW my general advice on this is to to pick a cadence where you publish things like this even if they have not changed. It helps with data recording and debugging afterwards because you have the state information present without having to go back minutes or hours in history to lookup the topic value (which is hard in some workflows).

@defunctzombie defunctzombie requested a review from achim-k January 3, 2025 23:34
MrBlenny and others added 2 commits January 28, 2025 21:03
Co-authored-by: Hans-Joachim Krauch <achim-k@users.noreply.github.com>
Co-authored-by: Hans-Joachim Krauch <achim-k@users.noreply.github.com>
@achim-k achim-k merged commit 2fba8a3 into foxglove:main Jan 28, 2025
9 checks passed
@achim-k achim-k mentioned this pull request Feb 3, 2025
achim-k added a commit that referenced this pull request Feb 3, 2025
### Changelog
* add best_effort_qos_topic_whitelist param (#329)
* Add missing functional include in message_definition_cache.cpp (#334)
@tonynajjar
Copy link

Running the bridge on the robot means the connection from the Robot to Foxglove is TCP.
When comms is poor (for example satellite internet on a field robot) the TCP connection can go down for a few seconds which will lead to a build-up of messages, potentially up to the send_buffer_limit. When connection is established there will then be a large flood of messages, some of which may no longer be relevant.

Hey @MrBlenny, out of curiosity, did you try setting max_qos_depth to a small number? Shouldn't that theoretically achieve what you want i.e. only keep the last few messages?

@MrBlenny
Copy link
Contributor Author

The ros2 message handler is called and the messages are pushed onto the websockets queue for transmission in a non-blocking way as far as I can tell. That queue will then start to build up if the websockets connection is poor.

I've actually just been looking at this some more this week... The websocketpp lib would need to have a blocking send method of knowing which messages have been sent and which ones have not (by tracking message ID for example)

@MrBlenny
Copy link
Contributor Author

MrBlenny added a commit to Greenroom-Robotics/ros-foxglove-bridge that referenced this pull request Feb 20, 2025
* Fix "no matching function" error on yocto kirkstone (foxglove#331)

### Changelog
None
### Docs
None
### Description

foxglove#330 Fixes for Yocto issue

Fix for compilation error on yocto Kirkstone.

> error: no matching function for call to 'max(long unsigned int, size_t)

* bump to 0.8.2 (foxglove#332)

Bumping to 0.8.2 for foxglove#331

* Add missing functional include in message_definition_cache.cpp (foxglove#334)

### Changelog

message_definition_cache.cpp uses std::function, so it should include
the functional STL header

### Docs

None

### Description

message_definition_cache.cpp uses std::function, so it should include
the functional STL header

* add best_effort_qos_topic_whitelist param (foxglove#329)

### Changelog
Adds a `qos_reliablity` param to the ros2 bridge that allows the default
reliability to be set via param. Defaults to "automatic" to retain
existing behavior.

### Docs

N/A

### Description

The web-socket bridge runs on TCP which causes issues on flakey internet
connections. There is some discussion here:
https://github.com/orgs/foxglove/discussions/15

This PR adds the ability to specify "qos_reliability" of "best_effort"
which is useful when running a bridge on a different computer to the
robot for example:

**ROBOT COMPUTER** -> DDS/UDP -> **REMOTE COMPUTER** -> Websocket/TCP ->
Foxglove / Browser

---------

Co-authored-by: Hans-Joachim Krauch <achim-k@users.noreply.github.com>

* v0.8.3 (foxglove#336)

### Changelog
* add best_effort_qos_topic_whitelist param (foxglove#329)
* Add missing functional include in message_definition_cache.cpp (foxglove#334)

* feat: add send_buffer_queue param

* chore: lint fixes

---------

Co-authored-by: Graham Harison <gupyfish@gmail.com>
Co-authored-by: Jacob Bandes-Storch <jacob@foxglove.dev>
Co-authored-by: Silvio Traversaro <silvio@traversaro.it>
Co-authored-by: Hans-Joachim Krauch <achim-k@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants