Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Loaned messages memory handling #2624

Draft
wants to merge 3 commits into
base: rolling
Choose a base branch
from

Conversation

mauropasse
Copy link
Collaborator

@mauropasse mauropasse commented Sep 11, 2024

Related to #2401

Loaned messages can be held simultaneously by different entities:

  1. User (storing messages beyond the subscription callback).
  2. Subscription (storing messages until it is executed).
  3. Publisher (storing messages for late joiner subscriptions).

The last reference to a loaned message has to return the loan to the RMW when the message goes out of scope. This allows for memory to be reused for future loaned messages.


The case when the user is the last holding references to the message, wasn't working as expected because the loans were returned after executing the subscription, regardless of whether the user had made copies of the message.

This PR improves the situation by introducing a custom message deleter, which calls the return loan API upon the destruction of the last reference.
Note: In this PR, the new message pointer passed to a subscription is not const, so the user can modify messages contents. This is likely not desired, but its useful for demonstration purposes here.

On FastDDS, memory is correctly handled when using on the FASTRTPS_DEFAULT_PROFILES_FILE:

<historyMemoryPolicy>DYNAMIC_REUSABLE</historyMemoryPolicy>

A small fix is still required in rmw_cyclonedds_cpp to avoid finalizing an allocator multiple times: don't call
dds_data_allocator_fini here.


In a different scenario, CycloneDDS suffers from a "double delivery issue," in which a late joiner subscription receives both interprocess and loaned transient local messages (published before its creation), before segfaulting. FastDDS does not experience this problem.


In this PR I add a unit test for testing different scenarios:

- Verify that loaned messages are returned when expected, and confirm memory reuse after return:
    - When the user is the last owner
    - When the subscriber is the last owner
    - When the publisher is the last owner

- Validate that memory is indeed shared:
    - When the user overrides messages:
      - Other subscriptions that have not yet executed (with messages queued) should receive the modified version of the messages.
      - Publisher's transient local messages should be impacted: A late joiner subscription should receive the modified messages.

- Tests to demonstrate current behavior (which may not be the expected):
  - Test the maximum publisher depth setting (as loaned messages are not used beyond this limit)
  - Determine the maximum number of loans that can be requested without publishing the messages (thus not returning the loans)
  - Highlight the double delivery issue on transient local pub/sub for CycloneDDS

Mauro Passerino added 2 commits September 5, 2024 14:54
Instead of returning the loaned after executing
the subscription, return it when the last ref
of the message goes out of scope.

 - ros2#2401

Signed-off-by: Mauro Passerino <mpasserino@irobot.com>
Signed-off-by: Mauro Passerino <mpasserino@irobot.com>
@clalancette
Copy link
Contributor

While this approach fixes the situation for CycloneDDS, FastDDS still has problems; it seems it has no way to track multiple references to the same loaned message, so it might reuse and overwrite memory that is still in use, without issuing warnings or errors.

Since Fast-DDS is our default RMW, I don't think we can move forward without fixing this somehow. So I'm going to mark this as a "draft" for now.

@clalancette clalancette marked this pull request as draft September 12, 2024 18:42
@mauropasse
Copy link
Collaborator Author

@clalancette by changing the history memory policy on the FastDDS profile, the tests pass! That is, memory is not reused until all entities have returned the loans (as expected).

<     <historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
---
>     <historyMemoryPolicy>DYNAMIC_REUSABLE</historyMemoryPolicy>

I'll update the PR description.

Signed-off-by: Mauro Passerino <mpasserino@irobot.com>
EXPECT_TRUE(loaned_addresses_match(pub_sub_loaned, sub_loaned));
}

// FastDDS: Show that the publisher gets loans from memory still in use:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there are some tests that are specific to some RMW implementation.
I'm also not sure if they are all valid unit-tests or rather "examples" showing the limitations of some RMWs.

In any case, we should run these tests only for the RMWs they matter, for example with

  if (std::string(rmw_get_implementation_identifier()).find("rmw_connextdds") == 0)
  {
    GTEST_SKIP();
  }

@Micbetter
Copy link

Hi @mauropasse , thanks for your great job. I have tried this improvement on ros2-humble with cyclonedds in our project. But things didn't go as well.

Below are some code snippets. I tried to copy the loaned_msg_ptr to a dataset. The initial return of loaned_msg_ptr to iceoryx was successful. But the later can not be retruned correctly.

using MSG = std_msgs::msg::Uint16;

int main(){
    // some ros2 initialization

    std::vector<MSG::SharedPtr> dataset; 
    auto sub = node->create_subscription<MSG>(
        topic_name, 10,
        [&data_set](MSG::SharedPtr loaned_msg_ptr) {
          data_set.push_back(loaned_msg_ptr);
    });
    // sleep for a while to wait
    std::this_thread::sleep_for(std::chrono::seconds(10));
    // after at least two subscription callback, release loaned_msg_ptr
    dataset.clear();
}

This happens only when the second subscription callback come before the first loaned message is returned. Because every rmw_take_loan_int is invoked , for one subscription handle , rmw initializes the same dds allocator and the dds allocator will be finished every time a loaned message is returned. dds_data_allocator_free can not free a loaned message when data allocator is finished because d->kind = DDS_IOX_ALLOCATOR_KIND_FINI after finish.

// https://github.com/eclipse-cyclonedds/cyclonedds/blob/releases/0.10.x/src/core/ddsc/src/dds_data_allocator.c#L146
dds_iox_allocator_t *d = (dds_iox_allocator_t *)data_allocator->opaque.bytes;
switch (d->kind) {
  case DDS_IOX_ALLOCATOR_KIND_FINI:
    ret = DDS_RETCODE_PRECONDITION_NOT_MET;
    break;
  case DDS_IOX_ALLOCATOR_KIND_NONE:
    ddsrt_free(ptr);
    break;
  case DDS_IOX_ALLOCATOR_KIND_SUBSCRIBER:
    if (ptr != NULL) {
      ddsrt_mutex_lock(&d->mutex);
      shm_lock_iox_sub(d->ref.sub);
      iox_sub_release_chunk(d->ref.sub, ptr);
      shm_unlock_iox_sub(d->ref.sub);
      ddsrt_mutex_unlock(&d->mutex);
    }
    break;

I think this issue also exists in ROS2 Rolling, right?
Some fixes have been added to rmw_cyclonedds/rmw_node.cpp to make loan message safe in multi-threads but it is still in testing. I'm pleasure to share my codes if it works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants