High CPU usage when using DDS intra-process communication #1642
@mauropasse thanks for the detailed information, really interesting. what about using LoanedMessage instead? the following seems to support zero copy via …
CC: @MiguelCompany @eboasson would you care to share thoughts?
Yes, if it's possible we should try to reuse the logic for that. The rclcpp code shouldn't really care about whether the message comes from the same process or from a different process via a shared-memory layer, as long as the underlying DDS supports optimized intra-process communication.
Hi @fujitatomoya https://github.com/eclipse-cyclonedds/cyclonedds/blob/iceoryx/src/core/ddsc/src/dds_data_allocator.c#L130 returns …
thanks for sharing. as far as i know about rmw_fastrtps, the data structure has to be a bounded data type. sorry, i am not sure about cyclonedds. (actually we've tried zero copy via rmw_fastrtps PRs, which works okay.)
i was hoping that could be one of them 😄
The bounded-size data type restriction is also in Cyclone, and besides that it needs a few more changes before it will be willing to do this in the absence of shared memory (that's a bookkeeping detail). Frankly, I'm shocked that there are 2 …
@mauropasse If your type is plain, could you check with this commit? I should mention that ros2/rosidl_typesupport_fastrtps#67 is also required.
Great, with @MiguelCompany's commit, using ros2/rosidl_typesupport_fastrtps#67, and commenting out this line, we can use loaned messages on intra-process publishers and subscribers. I'll do some profiling to check what the CPU usage looks like with these changes.
@eboasson yes, I've observed the same and also plan to look at this; from what I've quickly seen, it's due to standard C++ behaviour.
Benchmark results are looking good with …
thanks for sharing the result!
@mauropasse do the results in #1642 (comment) come from cyclone or fast-dds?
@fujitatomoya those results are with Fast-DDS, using the instructions provided by @MiguelCompany.
I did further testing on LoanedMessages, trying to understand the reason for the performance improvements compared to RCLCPP intra-process shown in #1642 (comment). The CPU improvement was also because the memory was not initialized in the LoanedMessages case. Now I made sure every byte is initialized in all tests, to have all memory shown in RSS. As a conclusion for now, LoanedMessages perform similarly to RCLCPP intra-process, but with bigger latency. Now I'm trying to force ROS2 to use the DDS IPC (instead of the …)
@mauropasse The … On its travel from the user sample on the publishing side to the user sample on the subscribing side, the following copies may be performed: …
I hope this long explanation is clear enough.
this means that … maps all physical pages in the process space; i think this is what the application does.
do you happen to have any idea why the latency is bigger than rclcpp intra-process? i am not sure, but maybe a page fault to map physical pages into the process space on the subscription side when using rmw intra-process? (meaning the same physical page but a different virtual address, which costs system time.) at least i think rclcpp intra-process uses the exact same virtual address to access the data if possible. i can be wrong on this... thanks in advance.
@fujitatomoya I did some quick profiling; looks like when using …
thanks 👍 but i am not sure how we can tell the latency difference from this graph, since it only shows the ratio of CPU consumption...
i think the time-consumption ratio is almost the same? just checking if i am missing something.
The problem is that this line is zero-initializing the message, right? rclcpp/rclcpp/src/rclcpp/executor.cpp Line 637 in 7d8b269
(make sure you use permalinks, if not the link will be pointing to another part of the code when updated)

The difference between zero and default initialization in cpp is extremely subtle, but it's also possible to achieve the second one, e.g.:

```cpp
auto data = new T;    // default-initialized: there won't be any memset for POD, it starts "uninitialized"
auto data = new T();  // this one is zero-initialized :)
```

we're using …, which is equivalent to zero-initialization (uses …).

```cpp
// allocate raw storage for one MessageT (allocate() takes an element count, not a byte size)
auto message_ptr = std::allocator_traits<MessageAlloc>::allocate(*message_allocator_.get(), 1);
try {
  // placement new, using default initialization (no memset for POD messages)
  message_ptr = new (message_ptr) MessageT;
  return std::shared_ptr<MessageT>(
    message_ptr,
    [message_allocator_](MessageT * p) {
      p->~MessageT();
      std::allocator_traits<MessageAlloc>::deallocate(*message_allocator_.get(), p, 1);
    },
    *message_allocator_.get());
} catch (...) {
  // deallocate if construction throws
  std::allocator_traits<MessageAlloc>::deallocate(*message_allocator_.get(), message_ptr, 1);
  throw;
}
```

should do it. Another thing to check is the default constructors of the messages we're generating.
This can be avoided by forcing synchronous publishing. Using …
Posting here the latest tests on Loaned Messages, now also including CycloneDDS. Notes about tests:

tl;dr conclusions:

- Latency: CycloneDDS — both graphs are the same, just at different scales to better appreciate the difference with small messages.
- Latency: FastDDS — both graphs are the same, just at different scales to better appreciate the difference with small messages.
- Memory RSS: FastDDS & CycloneDDS
- Memory VSZ: FastDDS & CycloneDDS
- Defining future works
Appreciate the update, this is informative 👍 What does not make sense to me is the latency between … I guess this is related to the event notification mechanism, not to sharing the data. Any opinion?
Thanks for the update @mauropasse. Which exact version of ROS 2 did you use, and which HW? There are also differences between the initial Galactic release and the first patch release.
Interesting results, thanks for posting them. I wonder about the hybrid approach we had in the past, where the pointers and actual data were handled in rclcpp, but a message was passed via rmw (therefore via the DDS vendor) to notify the subscription. That might achieve good results with large messages, where currently the rclcpp IPC (of today) outperforms the loaned-message approach. Obviously we should try to improve the loaned-message support first, but just a thought.
This issue has been mentioned on ROS Discourse. There might be relevant details there:
I did some CPU profiling on different architectures (x86, MIPS 500 MHz, ARM quad-core 1.2 GHz) around LoanedMessages, publishing/subscribing a single-byte message at a high frequency, to identify areas where we can save some CPU time. The use of such a small message is to highlight the work done by the LoanedMessage infrastructure and reduce noise from message creation and handling. Note: below, when I say IPC I mean intra-process communication in the RCLCPP layer, not the DDS intra-process, which is not exercised here.

FastDDS 2.3.x: From the flamegraphs we can see that either the use of LoanedMsg or IPC OFF leads to almost the same path of APIs, and they seem to take similar CPU time. The differences: with IPC OFF we have the serialize operation (there's no deserialize if data sharing is ON), while with LoanedMsg we have the (expensive) construction/destruction of LoanedMessage instances. The take operation also seems to take more time with LoanedMessage (rmw_take_loaned_message_with_info vs rmw_take_with_info). Same with returning the message, which is more expensive using LoanedMsg (rmw_return_loaned_message_from_subscription vs return_message). So in short, for small messages using LoanedMsg is more expensive than IPC OFF. For bigger messages, the increase in serialize/deserialize durations is what makes the difference in favour of LoanedMsg. Based on the flamegraphs of LoanedMsgs, I identified expensive operations and tried to reduce or remove them directly, like:
A flamegraph called loaned-optimised.svg included in FastDDS-Flamegraphs.zip shows the resulting CPU usage after removal of those APIs. Here are some comparison graphs from running all the mentioned alternatives on 2 different embedded systems.

CycloneDDS: There seems to be quite a big overhead from IOX, the manager of the shared memory (see the violet area in the flamegraph), which totally explains the previous results in #1642 (comment) about the big latency using Loaned Messages on CycloneDDS compared to IPC OFF.

Lots of information here; I tried to squeeze it as much as I could. Let me know if something is not clear.
@mauropasse @budrus Looking at the Flamegraphs and your conclusions, I agree. I only consider the CycloneDDS/iceoryx communication path here.
Those are separate issues that can be resolved independently.

Loaned Message overhead
Ideally LoanedMessage construction should not need to zero any memory nor construct any payload data; this must be done by the user later via an emplace API. Construction should only acquire memory for the payload and initialize some metadata of constant size. The emplace API called by the user would take constructor arguments and populate the payload memory via placement new.

iox Listener notification overhead
We need some kind of mechanism that informs us that there is data. However, there might be redundancy, since the Executor will also look for data and perform other actions. So we are actually listening with two threads for the same data. Both of them at some point use semaphores/futexes as a signal mechanism, which are not exactly cheap (but there are no real alternatives, I think). It might be possible to improve the performance of the …

All of this explains the data pretty well, I think.
Reducing the …
CPU profiling of a ROS2 application shows that using the DDS intra-process communication takes almost twice the CPU time of the rclcpp IPC (handled by the intra-process manager).
This happens because the executor makes a copy of the message when a subscription is ready to take it:
rclcpp/rclcpp/src/rclcpp/executor.cpp
Line 637 in 7d8b269
The copy is made regardless of the use of IPC by the DDS middleware, where the message might be passed from the publisher to the subscriber without actually copying it.
The following plot shows the total CPU time spent to publish a 100 KB message and receive it on the subscription, using the StaticSingleThreadedExecutor on a single-core platform (RPi 1) at 500 MHz:
Publisher → Msg: 100 KB ← Subscriber
![image](https://user-images.githubusercontent.com/16389257/116142021-51761180-a6d1-11eb-818d-149c827b0d35.png)
The graph below shows a FlameGraph (CPU profiling) of a similar system, using the DDS intra-process (rclcpp IPC disabled)
Publisher → Msg: 8mb at 10Hz ← Subscriber
![image](https://user-images.githubusercontent.com/16389257/116142255-98fc9d80-a6d1-11eb-9423-0980ca3d7215.png)
We can see memmove/memset operations called twice: once by the publisher when creating the message, and once by the executor when taking the subscription message in execute_subscription(), where a new message is created.

We're planning to fix this, maybe reusing some of the logic used to take loaned messages, but first we'd like to know whether this is a known issue and whether there is already a plan for a fix, or a proposal for a possible implementation?