Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

random exception during node startup with use_sim_time:=true when Iceoryx is used #469

Closed
ksuszka opened this issue Sep 18, 2023 · 10 comments

Comments

@ksuszka
Copy link

ksuszka commented Sep 18, 2023

Bug report

Required Info:

  • Operating System:
    • official ros:humble docker image
  • Installation type:
    • official ros:humble docker image
  • Version or commit hash:
    • ros:humble hash sha256:12cc6355d9492ebd485370791a6aee424c17bf5e92ecccd178e058c50e2d3577
  • DDS implementation:
    • rmw_cyclonedds_cpp
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

  1. Create following Dockerfile:
FROM ros:humble

RUN apt-get update -y && apt-get install -y ros-$ROS_DISTRO-rmw-cyclonedds-cpp

ENV RMW_IMPLEMENTATION=rmw_cyclonedds_cpp

RUN cat <<'EOF' > /run-test.sh

export CYCLONEDDS_URI='<Domain id="any"><SharedMemory><Enable>true</></></>'

# Running multiple ros nodes to increase chance of error
while true; do
    ( ( ros2 run rclcpp_components component_container --ros-args -p use_sim_time:=true &
        sleep 1
        pkill -INT -P $! ) & )
    sleep 0.2
done

EOF
  1. Build the Dockerfile:
docker build -f Dockerfile . -t test
  1. Open three terminal windows. In the first terminal window run:
docker run -it --shm-size 1G --rm --name test --init test iox-roudi
  1. In the second terminal window run:
docker exec -it test bash ./ros_entrypoint.sh ros2 topic pub /clock rosgraph_msgs/msg/Clock "{}" -r 10000 -p 10000
  1. In the third terminal window run:
docker exec -it test bash ./ros_entrypoint.sh bash /run-test.sh 

Expected behavior

Each node should start without errors.

Actual behavior

Random nodes fail startup with the following exception:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cannot store a negative time point in rclcpp::Time

Additional information

This happens only in ros:humble. It doesn't happen in ros:iron and ros:rolling. I'm not sure why it happens only after enabling Iceoryx as it doesn't seems to be related directly to Iceoryx.

We tried to find the culprit and it seems to be this line https://github.com/ros2/rclcpp/blob/724b4588ecbab3d2bb1ce4fbf800defa0e5a842e/rclcpp/src/rclcpp/time_source.cpp#L166, which was accidentally (as far as commit's comment says) fixed in this commit in rolling: ros2/rclcpp@7d660ac

Basically, it seems that with Iceoryx enabled the "message loaning" mechanism is used for receiving /clock messages, and in the subscription the shared pointer to this message is stored as a field of an object and message is destroyed at the end of the callback and than stored shared pointer becomes invalid.

I'm not sure if this is an error in the usage of the shared pointer, or is this an error in message loaning mechanism.


Feature request

Feature description

Implementation considerations

@clalancette
Copy link
Contributor

We tried to find the culprit and it seems to be this line https://github.com/ros2/rclcpp/blob/724b4588ecbab3d2bb1ce4fbf800defa0e5a842e/rclcpp/src/rclcpp/time_source.cpp#L166, which was accidentally (as far as commit's comment says) fixed in this commit in rolling: ros2/rclcpp@7d660ac

That was no accident; the original bug report that that commit fixes mentions the same problem: ros2/rclcpp#2088

We could consider backporting ros2/rclcpp#2092 to Humble, since it doesn't break API or ABI. Can you confirm that it fixes the issue for you?

@zygfrydw
Copy link

Yes, it does fix the issue for us.

We use the term "accident" because the commit message is about avoiding unintended modification of shared message while the main reason of this error that the memory kept by this shared_ptr is freed in the background.

We are affraid that lot of other code can be affected with with issue, because we are suprising to see Use-After-Free error despite std::shared_ptr is beeing used to keep last_time_msg_ in ClocksState.

We have not seen any ROS documentation saying that one shold not keep shared_ptr to message from subcription callback.

I have analyzed this error thoroughly and the reason this error happen can be seen in the following code snippets:

// rclcpp/rclcpp/src/rclcpp/executor.cpp:627
void Executor::execute_subscription(rclcpp::SubscriptionBase::SharedPtr subscription) {
  [...]
  if (subscription->can_loan_messages()) {
      void * loaned_msg = nullptr;
      [...]
      rcl_ret_t ret = rcl_take_loaned_message(
        subscription->get_subscription_handle().get(),
        &loaned_msg,
        &message_info.get_rmw_message_info(),
        nullptr);
      [...]
      subscription->handle_loaned_message(loaned_msg, message_info);
      [...]
      rcl_ret_t ret = rcl_return_loaned_message_from_subscription(
        subscription->get_subscription_handle().get(),
        loaned_msg);
      loaned_msg = nullptr;
  
// rclcpp/rclcpp/include/rclcpp/subscription.hpp:361
void  handle_loaned_message(void * loaned_message, const rclcpp::MessageInfo & message_info) override
  {
    [...]
    auto typed_message = static_cast<ROSMessageType *>(loaned_message);
    // message is loaned, so we have to make sure that the deleter does not deallocate the message
    auto sptr = std::shared_ptr<ROSMessageType>(typed_message, [](ROSMessageType * msg) {(void) msg;});
   [...]
    any_callback_.dispatch(sptr, message_info);

// rclcpp/rclcpp/src/rclcpp/time_source.cpp:164
// Cache the last clock message received
void ClocksState::cache_last_msg(std::shared_ptr<const rosgraph_msgs::msg::Clock> msg)
{
  last_msg_set_ = msg;
}

In the above code the excecutor creates dummy shared_ptr with empty destructor and loads the loaned message, than calls the subscription callbacks (cache_last_msg) and frees the memory, altought the shared_ptr is still kept by ClockState.

I have confirmed it also with Valgrind:
==17578== Invalid read of size 8
==17578==    at 0x53EB245: void __gnu_cxx::new_allocator<builtin_interfaces::msg::Time_<std::allocator<void> > >::construct<builtin_interfaces::msg::Time_<std::allocator<void> >, builtin_interfaces::msg::Time_<std::allocator<void> > const&>(builtin_interfaces::msg::Time_<std::allocator<void> >*, builtin_interfaces::msg::Time_<std::allocator<void> > const&) (new_allocator.h:162)
==17578==    by 0x53E2765: void std::allocator_traits<std::allocator<builtin_interfaces::msg::Time_<std::allocator<void> > > >::construct<builtin_interfaces::msg::Time_<std::allocator<void> >, builtin_interfaces::msg::Time_<std::allocator<void> > const&>(std::allocator<builtin_interfaces::msg::Time_<std::allocator<void> > >&, builtin_interfaces::msg::Time_<std::allocator<void> >*, builtin_interfaces::msg::Time_<std::allocator<void> > const&) (alloc_traits.h:516)
==17578==    by 0x53D91BB: std::_Sp_counted_ptr_inplace<builtin_interfaces::msg::Time_<std::allocator<void> >, std::allocator<builtin_interfaces::msg::Time_<std::allocator<void> > >, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<builtin_interfaces::msg::Time_<std::allocator<void> > const&>(std::allocator<builtin_interfaces::msg::Time_<std::allocator<void> > >, builtin_interfaces::msg::Time_<std::allocator<void> > const&) (shared_ptr_base.h:519)
==17578==    by 0x53D3647: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<builtin_interfaces::msg::Time_<std::allocator<void> >, std::allocator<builtin_interfaces::msg::Time_<std::allocator<void> > >, builtin_interfaces::msg::Time_<std::allocator<void> > const&>(builtin_interfaces::msg::Time_<std::allocator<void> >*&, std::_Sp_alloc_shared_tag<std::allocator<builtin_interfaces::msg::Time_<std::allocator<void> > > >, builtin_interfaces::msg::Time_<std::allocator<void> > const&) (shared_ptr_base.h:650)
==17578==    by 0x53D03B7: std::__shared_ptr<builtin_interfaces::msg::Time_<std::allocator<void> >, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<builtin_interfaces::msg::Time_<std::allocator<void> > >, builtin_interfaces::msg::Time_<std::allocator<void> > const&>(std::_Sp_alloc_shared_tag<std::allocator<builtin_interfaces::msg::Time_<std::allocator<void> > > >, builtin_interfaces::msg::Time_<std::allocator<void> > const&) (shared_ptr_base.h:1342)
==17578==    by 0x53CC038: std::shared_ptr<builtin_interfaces::msg::Time_<std::allocator<void> > >::shared_ptr<std::allocator<builtin_interfaces::msg::Time_<std::allocator<void> > >, builtin_interfaces::msg::Time_<std::allocator<void> > const&>(std::_Sp_alloc_shared_tag<std::allocator<builtin_interfaces::msg::Time_<std::allocator<void> > > >, builtin_interfaces::msg::Time_<std::allocator<void> > const&) (shared_ptr.h:409)
==17578==    by 0x53C98F1: std::shared_ptr<builtin_interfaces::msg::Time_<std::allocator<void> > > std::allocate_shared<builtin_interfaces::msg::Time_<std::allocator<void> >, std::allocator<builtin_interfaces::msg::Time_<std::allocator<void> > >, builtin_interfaces::msg::Time_<std::allocator<void> > const&>(std::allocator<builtin_interfaces::msg::Time_<std::allocator<void> > > const&, builtin_interfaces::msg::Time_<std::allocator<void> > const&) (shared_ptr.h:863)
==17578==    by 0x53C7754: std::shared_ptr<builtin_interfaces::msg::Time_<std::allocator<void> > > std::make_shared<builtin_interfaces::msg::Time_<std::allocator<void> >, builtin_interfaces::msg::Time_<std::allocator<void> > const&>(builtin_interfaces::msg::Time_<std::allocator<void> > const&) (shared_ptr.h:879)
==17578==    by 0x53C3AE8: rclcpp::ClocksState::attachClock(std::shared_ptr<rclcpp::Clock>) (time_source.cpp:103)
==17578==    by 0x53C5810: rclcpp::TimeSource::NodeState::attachClock(std::shared_ptr<rclcpp::Clock>) (time_source.cpp:300)
==17578==    by 0x53C2693: rclcpp::TimeSource::attachClock(std::shared_ptr<rclcpp::Clock>) (time_source.cpp:527)
==17578==    by 0x52DA77F: rclcpp::node_interfaces::NodeTimeSource::NodeTimeSource(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>, std::shared_ptr<rclcpp::node_interfaces::NodeTopicsInterface>, std::shared_ptr<rclcpp::node_interfaces::NodeGraphInterface>, std::shared_ptr<rclcpp::node_interfaces::NodeServicesInterface>, std::shared_ptr<rclcpp::node_interfaces::NodeLoggingInterface>, std::shared_ptr<rclcpp::node_interfaces::NodeClockInterface>, std::shared_ptr<rclcpp::node_interfaces::NodeParametersInterface>, rclcpp::QoS const&, bool) (node_time_source.cpp:49)
==17578==  Address 0x61084f0 is 0 bytes inside a block of size 8 free'd
==17578==    at 0x484B27F: free (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==17578==    by 0x628A4D7: dds_data_allocator_free (dds_data_allocator.c:150)
==17578==    by 0x616573B: UnknownInlinedFun (rmw_node.cpp:1392)
==17578==    by 0x616573B: UnknownInlinedFun (rmw_node.cpp:3554)
==17578==    by 0x616573B: rmw_return_loaned_message_from_subscription (rmw_node.cpp:3573)
==17578==    by 0x5AC4213: rcl_return_loaned_message_from_subscription (subscription.c:656)
==17578==    by 0x523DE31: rclcpp::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>) (executor.cpp:629)
==17578==    by 0x523CEA6: rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&) (executor.cpp:524)
==17578==    by 0x523C91C: rclcpp::Executor::spin_once_impl(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) (executor.cpp:472)
==17578==    by 0x53C8642: rclcpp::FutureReturnCode rclcpp::Executor::spin_until_future_complete<std::future<void>, long, std::ratio<1l, 1000l> >(std::future<void> const&, std::chrono::duration<long, std::ratio<1l, 1000l> >) (executor.hpp:362)
==17578==    by 0x53C5C17: rclcpp::TimeSource::NodeState::create_clock_sub()::{lambda()#1}::operator()() const (time_source.cpp:387)
==17578==    by 0x5444DEF: void std::__invoke_impl<void, rclcpp::TimeSource::NodeState::create_clock_sub()::{lambda()#1}>(std::__invoke_other, rclcpp::TimeSource::NodeState::create_clock_sub()::{lambda()#1}&&) (invoke.h:61)
==17578==    by 0x54409F7: std::__invoke_result<rclcpp::TimeSource::NodeState::create_clock_sub()::{lambda()#1}>::type std::__invoke<rclcpp::TimeSource::NodeState::create_clock_sub()::{lambda()#1}>(rclcpp::TimeSource::NodeState::create_clock_sub()::{lambda()#1}&&) (invoke.h:96)
==17578==    by 0x5435691: void std::thread::_Invoker<std::tuple<rclcpp::TimeSource::NodeState::create_clock_sub()::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (std_thread.h:259)
==17578==  Block was alloc'd at
==17578==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==17578==    by 0x629212C: ddsrt_malloc (heap.c:26)
==17578==    by 0x616B146: UnknownInlinedFun (rmw_node.cpp:1375)
==17578==    by 0x616B146: rmw_take_loan_int(rmw_subscription_s const*, void**, bool*, rmw_message_info_s*) (rmw_node.cpp:3427)
==17578==    by 0x5AC3ED4: rcl_take_loaned_message (subscription.c:632)
==17578==    by 0x523D970: rclcpp::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>)::{lambda()#3}::operator()() const (executor.cpp:615)
==17578==    by 0x524357B: bool std::__invoke_impl<bool, rclcpp::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>)::{lambda()#3}&>(std::__invoke_other, rclcpp::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>)::{lambda()#3}&) (invoke.h:61)
==17578==    by 0x52425A3: std::enable_if<is_invocable_r_v<bool, rclcpp::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>)::{lambda()#3}&>, bool>::type std::__invoke_r<bool, rclcpp::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>)::{lambda()#3}&>(rclcpp::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>)::{lambda()#3}&) (invoke.h:114)
==17578==    by 0x52412E5: std::_Function_handler<bool (), rclcpp::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>)::{lambda()#3}>::_M_invoke(std::_Any_data const&) (std_function.h:290)
==17578==    by 0x5245C67: std::function<bool ()>::operator()() const (std_function.h:590)
==17578==    by 0x523D22B: take_and_do_error_handling(char const*, char const*, std::function<bool ()>, std::function<void ()>) (executor.cpp:558)
==17578==    by 0x523DDBE: rclcpp::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>) (executor.cpp:610)
==17578==    by 0x523CEA6: rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&) (executor.cpp:524) 

In my opinion this is against the rules of using smart pointers and is very misleading. According to my knowledge, the intention of using smart pointers is that one should not manualy control the life time of memory.

I have not found any ROS documentation saying that when using a subcription callback with shared_ptr you MAY NOT keep the message pointer beyond callback scope. Oposite, based on the std::shared_ptr properties I have assumed that you may safely keep the pointer to message as class member for later use just by increasing the use count for shared_ptr.

My take from this issue is that subcription callbacks with shared_ptr should be forbiden compleatly and replaced with const Msg& as there is no difference, just additionaly wrapping of shared_ptr.

I understand that impact of this claim is tremendous and it is not feasible, therfore in my opinion as resolution of this issue ROS documentation and tutorials should be extented with a big statement that one MUST NOT extend message lifetime - the message should be explicitly copied (instead of just increasing shared_ptr use count).

My other proposition is to check the use_count() of the message after callback and if the use count is increased at least log warning that access violation may occure. The best woud be to terminate such an application because as shown in this issue, the application may happily run reading forbidden memory without any error causing undefined behaviour.

This issue does not affect only loaned messages but all other messages types too (serialized message, other message [rclcpp/rclcpp/src/rclcpp/executor.cpp:585]).

In case of other message types the message pool is used, but the algorithm is the same: get message from pool, call subscription callbacks, return the message to pool. Therfore if the subcriber extends the life of message, the executor will keep modyfing the message. This will cause bugs which are extreamly hard to find, but will not crush the application because the memory still belongs to application.

@clalancette
Copy link
Contributor

I have not found any ROS documentation saying that when using a subcription callback with shared_ptr you MAY NOT keep the message pointer beyond callback scope. Oposite, based on the std::shared_ptr properties I have assumed that you may safely keep the pointer to message as class member for later use just by increasing the use count for shared_ptr.

My take from this issue is that subcription callbacks with shared_ptr should be forbiden compleatly and replaced with const Msg& as there is no difference, just additionaly wrapping of shared_ptr.

The whole point of the shared_ptr signature is exactly so that user code can hold onto the pointer without copying. That is an explicit goal. And in point of fact, if you use the non-loaned APIs, it works exactly as you expect; the object can live beyond the scope of the callback, assuming the callback takes a reference. So we should not forbid the shared_ptr callbacks completely, because they do work in that (common) scenario.

That said, I see exactly what you mean about the loaned APIs. They do not work like that; once the memory is returned to the underlying middleware, they should be considered invalid. So maybe we should forbid the shared_ptr version just for loaned messages. That said, I wasn't involved in the development of the loaned message API originally, so I'm not sure. @wjwwood do you have some idea on how this was supposed to work, and maybe what we should do here?

@zygfrydw
Copy link

zygfrydw commented Sep 19, 2023

The whole point of the shared_ptr signature is exactly so that user code can hold onto the pointer without copying. That is an explicit goal. And in point of fact, if you use the non-loaned APIs, it works exactly as you expect; the object can live beyond the scope of the callback, assuming the callback takes a reference. So we should not forbid the shared_ptr callbacks completely, because they do work in that (common) scenario.

Based on my analysis of rclcpp code this works in a way you described only in very specific configuration - default memory strategy and not loaned messages (serialized or "normal?").

This is the code from rclcpp/src/rclcpp/executor.cpp. I have simplified it for the sake of this discussion.

class Executor{
  [...]
  void execute_subscription(rclcpp::SubscriptionBase::SharedPtr subscription)
  {
    rclcpp::MessageInfo message_info;
    message_info.get_rmw_message_info().from_intra_process = false;
  
    if (subscription->is_serialized()) {
      std::shared_ptr<SerializedMessage> serialized_msg = subscription->create_serialized_message();
      subscription->take_serialized(*serialized_msg.get(), message_info);};
      subscription->handle_serialized_message(serialized_msg, message_info);
      subscription->return_serialized_message(serialized_msg);
    } else if (subscription->can_loan_messages()) {
      void * loaned_msg = nullptr;
      auto sub_handle = subscription->get_subscription_handle().get()
      rcl_take_loaned_message(sub_handle, &loaned_msg, [...]);
      subscription->handle_loaned_message(loaned_msg, message_info);});
      rcl_return_loaned_message_from_subscription(sub_handle, loaned_msg);
    } else {
      std::shared_ptr<void> message = subscription->create_message();
      return subscription->take_type_erased(message.get(), message_info);
      subscription->handle_message(message, message_info);
      subscription->return_message(message);
    }
  }
  [...]
}

class Subscription{
  [...]
  std::shared_ptr<void> create_message() override
  {
    return message_memory_strategy_->borrow_message();
  }
  std::shared_ptr<rclcpp::SerializedMessage> create_serialized_message() override
  {
    return message_memory_strategy_->borrow_serialized_message();
  }
  [...]
}

The function names clearly indicate that messages are borrowed, filled with data, passed to callbacks, and messages are returned after callbacks are complete.
When other memory strategies are used the messages will be reused despite it is kept by the subscriber. In fact, it is exactly what happens when you use MessagePoolMemoryStrategy. The message is returned to pools without checking if someone captured the shared pointer:

class MessagePoolMemoryStrategy{
  [...]
  void return_message(std::shared_ptr<MessageT> & msg)
  {
    for (size_t i = 0; i < Size; ++i) {
      if (pool_[i].msg_ptr_ == msg) {
        pool_[i].used = false;
        return;
      }
    }
    throw std::runtime_error("Unrecognized message ptr in return_message.");
  }
  [...]
}

I have confirmed the described behavior with the sample publisher and subscriber.

#include <algorithm>
#include <rclcpp/rclcpp.hpp>
#include <rclcpp/strategies/message_pool_memory_strategy.hpp>
#include <sstream>
#include <std_msgs/msg/int64.hpp>
#include <vector>

class Subscriber : public rclcpp::Node {
public:
  Subscriber() : Node("Subscriber") {
    sub_ = this->create_subscription<std_msgs::msg::Int64>(
        "my_topic", 10,
        std::bind(&Subscriber::callback, this, std::placeholders::_1),
        rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void>>(),
        std::make_shared<
            rclcpp::strategies::message_pool_memory_strategy::
                MessagePoolMemoryStrategy<std_msgs::msg::Int64, 10>>());
  }

private:
  void callback(std_msgs::msg::Int64::ConstSharedPtr msg) {
    RCLCPP_INFO(this->get_logger(), "Received: '%ld'", msg->data);
    std::stringstream ss;
    std::for_each(msg_history_.begin(), msg_history_.end(),
                  [&ss](const auto &msg) { ss << msg->data << "; "; });
    RCLCPP_INFO(this->get_logger(), "History: [%s]", ss.str().c_str());
    msg_history_.push_back(msg);
  }
  std::vector<std_msgs::msg::Int64::ConstSharedPtr> msg_history_;
  rclcpp::Subscription<std_msgs::msg::Int64>::SharedPtr sub_;
};

class Publisher : public rclcpp::Node {
public:
  Publisher() : Node("Publisher"), count_(0) {
    pub_ = this->create_publisher<std_msgs::msg::Int64>("my_topic", 10);
    timer_ =
        this->create_wall_timer(std::chrono::milliseconds(500),
                                std::bind(&Publisher::timer_callback, this));
  }

private:
  void timer_callback() {
    auto msg = std_msgs::msg::Int64();
    msg.data = count_++;
    RCLCPP_INFO(this->get_logger(), "Publishing: '%ld'", msg.data);
    pub_->publish(msg);
  }
  rclcpp::Publisher<std_msgs::msg::Int64>::SharedPtr pub_;
  rclcpp::TimerBase::SharedPtr timer_;
  int64_t count_;
};

int main(int argc, char **argv) {
  rclcpp::init(argc, argv);
  std::shared_ptr<rclcpp::Node> node;
  if (argc > 1 && std::string(argv[1]) == "sub") {
    node = std::make_shared<Subscriber>();
  } else {
    node = std::make_shared<Publisher>();
  }
  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}

The output is

[INFO] [1695114185.643947679] [Subscriber]: Received: '13'
[INFO] [1695114185.644047798] [Subscriber]: History: [10; 11; 12; 13; 4; 5; 6; 7; 8; 9; 10; 11; 12; ]

You can see that messages in history are being modified despite being kept as std::shared_ptr<const std_msgs::msg::Int64> which I would expect to not change over time.

For me, it is arguable if it is a correct behavior or not. I do not like the idea that the handling of messages changes with subscription configuration. In my opinion, it violates shared pointer principles that even if I hold the pointer, it can be released in the background.

I can agree that in case of changing memory strategy to a memory pool, one should expect that if captures more messages that are available in the pool, he should expect to run into troubles. However, it is not well documented and requires good knowledge of the underlying rclcpp implementation, as it is not clear from API (shared_ptr gives a false promise of ownership).
If the current behavior is correct, I would add an additional check to the memory pool and print a big warning that the message is kept by someone before returning the message to the pool. Such a situation most likely will destabilize the application so maybe even an abort should be considered.

EDIT:
After a discussion with my team, we came to the conclusion that when using a memory pool, we would expect that if you keep the pointer to the message, the memory is returned to the pool until is freed. If the pool runs out of memory it should raise an error or allocate more memory to the pool like std::vector, depending on the strategy.

I would say that this issue is less urgent with memory pools, as changing memory strategy requires a code modification inside Node and the developer should verify the impact of such a change on related callback.
The switch between loaned or not loaned messages does not require any code modification and can be done only by changing the RMW implementation. The user changing just RMW would not expect that he should revise his subscription callbacks.
Based on this discussion we will have to verify if nowhere in our code base we store the pointer to received message.

@ralwing
Copy link

ralwing commented Sep 19, 2023

So maybe we should forbid the shared_ptr version just for loaned messages.

The ROS framework design should not allow invalid usage. So the library shouldn't disable particular functionality depending on the configuration, because developer who implements code doesn't know the settings where the software will be used, and in fact he shouldn't know.
So IMHO we should drop support for shared_ptr, and pass messages as a const Msg& as @zygfrydw suggested.
Since this would have a huge impact of all depeneded libraries, we should change this code, so that the shared_ptr destructor will release the borrowed memory.

 void
  handle_loaned_message(
    void * loaned_message,
    const rclcpp::MessageInfo & message_info) override
  {
    auto typed_message = static_cast<CallbackMessageT *>(loaned_message);
    // message is loaned, so we have to make sure that the deleter does not deallocate the message
    auto sptr = std::shared_ptr<CallbackMessageT>(
      typed_message, [](CallbackMessageT * msg) {(void) msg;});
    any_callback_.dispatch(sptr, message_info);
  }

Although, this may end-up with "memory leaks", since the dependent code may hold the shared_ptr forever.

@clalancette
Copy link
Contributor

The ROS framework design should not allow invalid usage. So the library shouldn't disable particular functionality depending on the configuration, because developer who implements code doesn't know the settings where the software will be used, and in fact he shouldn't know. So IMHO we should drop support for shared_ptr, and pass messages as a const Msg& as @zygfrydw suggested.

I don't think this is a valid way forward. The shared_ptr signature is very important for performance reasons, so that users don't have to copy potentially large messages to keep a reference.

Since this would have a huge impact of all depeneded libraries, we should change this code, so that the shared_ptr destructor will release the borrowed memory.

This I totally agree with. What I think we should do here is change it so that the custom deleter for the shared_ptr that is created in handle_loaned_message is the one that releases the memory (i.e. move the rcl_return_loaned_message_from_subscription into the custom deleter). We can do the same for the shared_ptr that is created for normal messages as well, that way we'll be sure not to free things erroneously if not using the default message allocation strategy.

With those fixes, I think that the surprising result from using a shared_ptr will be fixed, and that we can still keep the shared_ptr subscription signature. You are right that users can potentially cause a "leak" by never releasing that memory, but that is always a problem in C++ and the users will have to be careful to eventually free it.

Are you willing to submit a pull request making those changes?

@zygfrydw
Copy link

I will try to prepare a pull request with releasing memory in shared_ptr destructor but I have an additional concern that this may result in additional problem with multithreding.
If the shared_ptr is passed to user code the destructor may be called from different thread than it was created - it may be a problem for some libraries, I have not analyzed the rcl code - rcl_take_loaned_message, rcl_return_loaned_message_from_subscription may be called from different threads. Do you have a knowlage if this functions are thread safe?
For sure there will be a problem with MessagePoolMemoryStrategy as I see that it is not thread safe and it does not handle situation when elements in pool are released in random order.

std::shared_ptr<MessageT> MessagePoolMemoryStrategy::borrow_message()
{
  size_t current_index = next_array_index_;
  next_array_index_ = (next_array_index_ + 1) % Size;
  if (pool_[current_index].used) {
    throw std::runtime_error("Tried to access message that was still in use! Abort.");
  }
  [...]
}

You can see in the above snipet that borrow_messages assues that next message in pool will be avaliable. If the user can retain shared_ptr, this function should search the message pool for avaliable message, as we do not know in which order messages are released.

@clalancette
Copy link
Contributor

rcl_take_loaned_message, rcl_return_loaned_message_from_subscription may be called from different threads. Do you have a knowlage if this functions are thread safe?

They are not thread-safe. However, we don't do locking at the rcl layer, only at the rclcpp layer, so we'll need to add appropriate locks.

You can see in the above snipet that borrow_messages assues that next message in pool will be avaliable. If the user can retain shared_ptr, this function should search the message pool for avaliable message, as we do not know in which order messages are released.

Yes, very good point. We'll have to fix that as well.

@clalancette
Copy link
Contributor

You can see in the above snipet that borrow_messages assues that next message in pool will be avaliable. If the user can retain shared_ptr, this function should search the message pool for avaliable message, as we do not know in which order messages are released.

ros2/rclcpp#2336 should fix this particular part of the problem.

We are also looking at disabling loaned messages by default in ros2/rclcpp#2335 and ros2/rcl#1110 .

That should at least make things safe to use. Later on we'll likely do more work to re-enable loaning with shared_ptrs by doing a custom deleter.

@clalancette
Copy link
Contributor

We've done some work that fixes this:

  1. We disabled loans by default via Disable the loaned messages inside the executor. rclcpp#2335 and Set disable loan to on by default. rcl#1110
  2. We backported the above to Humble and Iron.
  3. We fixed the MessageMemoryPool to honor shared_ptr references in Support users holding onto shared pointers in the message memory pool rclcpp#2336

The above works around the issue. There is one remaining issue here, which is to re-enable loaned messages along with shared_ptrs. I've opened up ros2/rclcpp#2401 to track that.

With all of that, then, I'm going to close this one out. Thanks for the report, it led to a number of good fixes here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants