
[FEA] Support distributed recording to get ultra high performance. #1548

Open

ZhenshengLee opened this issue Jan 22, 2024 · 3 comments

Labels: enhancement (New feature or request)

@ZhenshengLee
Description

The idea comes originally from the eCAL recorder architecture: https://eclipse-ecal.github.io/ecal/applications/rec/recorder_architecture.html#applications-recorder-centralized-distributed-recording

Centralized recording:
This means that you simply start the eCAL Recorder on one of your many machines. It will subscribe to all topics and record the data to your hard drive. Data from remote machines will be sent via the network.
This is the most trivial and easy-to-use mode, as you only need one application on one machine.

Distributed recording:
When having many eCAL applications, it is generally advisable to run applications that exchange huge amounts of data on the same machine, as eCAL will then use its shared memory transport mechanism, which is way faster than sending data over the network. The recorder can also take advantage of that feature and avoid network congestion while running a recording. For this mode you will have to launch the eCAL Recorder on one of your machines and the eCAL Recorder Client on all of your other machines.
Now each Recorder and Recorder Client will only record topics coming from their respective machine and record them to their local hard drive. After you have finished your test, you can then let the Recorder Clients push their files to your main PC to get a contiguous recording.
Of course, mixed configurations are possible (e.g. some machines will only record their own topics, while other machines will also record topics of some external machines).

The main motivation is performance in autonomous vehicle (AV) domain projects.

Related Issues

There are many ongoing efforts to reach the desired rosbag performance in the AV domain.

Completion Criteria

This FEA requests rosbag2 CLI support for distributed recording.

Implementation Notes / Suggestions

There should be:

  • a composable recorder node
  • a recorder service
  • a recorder client
  • a recorder CLI

Testing Notes / Suggestions

The distributed recording service should be tested in an AV domain project such as Autoware, e.g. https://github.com/autowarefoundation/autoware.universe

@fujitatomoya (Contributor)
This feature sounds interesting, but we surely need to consider more details for the implementation notes and suggestions. I believe a REP would be a good way to start a more detailed discussion. As a 1st step, I would recommend having that discussion in the ROS 2 Tooling WG.

Btw, regarding user experience:

The Client Application is only needed for distributed recordings. It is started on all machines,

I think the user does not want to be responsible for this... it does not scale if we have hundreds of devices in the ROS 2 network. The system should be responsible for spawning the necessary recorder agent processes on the appropriate hosts, based on the user's recording request.

@MichaelOrlov (Contributor) commented Jan 23, 2024

Hi @ZhenshengLee.
I agree that it would be nice to have fully fledged distributed recording/replay with a UI similar to eCAL's available in ROS 2.
A few years ago we evaluated the functionality of eCAL at Apex.AI and came to the conclusion that it would certainly be nice to have something similar. However, since then we haven't had enough resources to get close to it.
That said, we have made the essential functionality for distributed recording available via the command line interface.
Of course, it currently exists in the form of workarounds, but that is still better than nothing.

You can do the following to make a distributed recording (to a certain extent) in the latest rosbag2 version:

  1. Use the composition manager or launch files to compose recorder nodes with parameters stored in YAML files. The related feature you mentioned in the description, Composable Player and Recorder nodes #902, has recently been implemented, and we fully support the composable recorder/player from the rosbag2_transport package (see the launch sketch after this list).
  2. Configure the recorder nodes to start in paused mode with a predefined list of topics. We also support regular expressions for the topic list.
  3. Use service calls from a remote machine to pause or resume recording and to split bag files for all running remote recorder nodes:
ros2 service call /rosbag2_recorder/resume rosbag2_interfaces/srv/Resume
ros2 service call /rosbag2_recorder/pause rosbag2_interfaces/srv/Pause
ros2 service call /rosbag2_recorder/is_paused rosbag2_interfaces/srv/IsPaused
ros2 service call /rosbag2_recorder/split_bagfile rosbag2_interfaces/srv/SplitBagfile
ros2 service call /rosbag2_recorder/snapshot rosbag2_interfaces/srv/Snapshot
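
For step 1, a minimal launch sketch could look like the following. The container and node names and the my_recorder_params.yaml file are illustrative assumptions; the rosbag2_transport::Recorder component plugin is the one shipped with rosbag2_transport, but the exact parameter keys vary by ROS 2 distro, so check the rosbag2_transport documentation for your release.

# distributed_recorder.launch.py -- illustrative sketch; names and the
# parameter file are assumptions, not fixed rosbag2 conventions.
from launch import LaunchDescription
from launch_ros.actions import ComposableNodeContainer
from launch_ros.descriptions import ComposableNode


def generate_launch_description():
    container = ComposableNodeContainer(
        name='recorder_container',
        namespace='',
        package='rclcpp_components',
        executable='component_container',
        composable_node_descriptions=[
            ComposableNode(
                package='rosbag2_transport',
                plugin='rosbag2_transport::Recorder',
                # unique per machine, so remote service calls can target it
                name='recorder_machine_a',
                # YAML file with the topic list/regex and start-paused flag;
                # the exact keys depend on the rosbag2 release
                parameters=['my_recorder_params.yaml'],
            ),
        ],
        output='screen',
    )
    return LaunchDescription([container])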

Please note that by default the rosbag2 recorder creates a node with the name rosbag2_recorder. As a consequence, any of the service requests above will trigger changes in all running recorder nodes, since they all share the same default node name.
However, the node name can be remapped via the ROS argument -r __node:=<new node name> or via the rosbag2 --node-name <node name> CLI parameter.
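
To illustrate how a remote machine could target one specific recorder through these services, here is a minimal rclpy sketch of such a "recorder client"; the node name recorder_machine_a is an assumption carried over from the launch sketch above.

# recorder_client.py -- minimal sketch; assumes a recorder node that was
# started under the remapped name 'recorder_machine_a'.
import rclpy
from rclpy.node import Node
from rosbag2_interfaces.srv import Pause, Resume


class RecorderClient(Node):
    def __init__(self, recorder_node_name):
        super().__init__('recorder_client')
        # the recorder exposes its services under its own node name
        self.pause_cli = self.create_client(Pause, f'/{recorder_node_name}/pause')
        self.resume_cli = self.create_client(Resume, f'/{recorder_node_name}/resume')

    def call(self, client):
        client.wait_for_service()
        future = client.call_async(client.srv_type.Request())
        rclpy.spin_until_future_complete(self, future)
        return future.result()


def main():
    rclpy.init()
    client = RecorderClient('recorder_machine_a')
    client.call(client.pause_cli)   # pauses only machine A's recorder
    client.call(client.resume_cli)  # resumes it again
    client.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    main()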

Therefore, almost all the requests from your description are already satisfied to a certain extent:

  • a composable recorder node
  • a recorder service
  • a recorder client
  • a recorder CLI

The only missing part is that there is currently no way for a remote client to find out which topics the recorders have already subscribed to.
Also, there is some room for improvement in the recorder.
For instance, it would be nice to have the ability to initiate stop and record operations via service requests from a remote client, as well as the ability to provide the list of topics the recorder should subscribe to via a service request.

As regards the optimizations in the data recording path mentioned in my article at https://www.apex.ai/post/improvements-in-data-recording-path: unfortunately, I can't disclose those improvements or make them publicly available. They are trade secrets of Apex.AI; they are part of how the Apex.AI framework differentiates itself from regular ROS 2, and they are ultimately what our customers pay for.
I can only say that the changes were made outside of rosbag2.
In theory, similar performance improvements could be achieved with the ROS 2 intra-process communication mechanism, by using composable recorder nodes and running all nodes in one process. But in this case there is no guarantee for rosbag2's performance, since it would be impossible to guarantee that the rosbag2 threads have enough CPU resources at all times.
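
As a sketch of that "all nodes in one process" idea, the launch description below composes a stand-in data producer (the composition::Talker demo node) next to the recorder in one multithreaded container and requests intra-process communication. Whether rosbag2's generic (serialized) subscriptions can actually take the intra-process path depends on the ROS 2 distro, so treat this purely as an illustration; recorder parameters (storage URI, topic list) are omitted and would be supplied as in the launch sketch further up.

# one_process.launch.py -- sketch only, see the assumptions above.
from launch import LaunchDescription
from launch_ros.actions import ComposableNodeContainer
from launch_ros.descriptions import ComposableNode


def generate_launch_description():
    return LaunchDescription([
        ComposableNodeContainer(
            name='one_process_container',
            namespace='',
            package='rclcpp_components',
            executable='component_container_mt',  # multithreaded container
            composable_node_descriptions=[
                ComposableNode(
                    package='composition',
                    plugin='composition::Talker',  # stand-in data producer
                    extra_arguments=[{'use_intra_process_comms': True}],
                ),
                ComposableNode(
                    package='rosbag2_transport',
                    plugin='rosbag2_transport::Recorder',
                    extra_arguments=[{'use_intra_process_comms': True}],
                ),
            ],
            output='screen',
        ),
    ])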

Overall, we are already moving closer to having fully fledged distributed recording and replay, at least the backend part.
For contrast, 1 or 2 years ago we didn't even have stop or pause/resume functionality in the recorder.

@ZhenshengLee (Author)
Overall, we are already moving closer to having fully fledged distributed recording and replay, at least the backend part.
For contrast, 1 or 2 years ago we didn't even have stop or pause/resume functionality in the recorder.

Thank you for your reply! @MichaelOrlov
Your info is clear enough for my feature request.

I'll leave this issue open for more discussion.
