
Get exception "Resource deadlock avoided" while running TestTimeSource.clock_sleep_until_with_ros_time_basic #1861

Closed
Barry-Xu-2018 opened this issue Jan 11, 2022 · 5 comments · Fixed by #1867

Comments

@Barry-Xu-2018
Collaborator

Barry-Xu-2018 commented Jan 11, 2022

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04
  • Installation type:
    • source
  • Version or commit hash:
  • DDS implementation:
    • cyclonedds
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

This issue was found in CI execution: https://build.ros2.org/job/Rpr__rclcpp__ubuntu_focal_amd64/1636/testReport/projectroot.test/rclcpp/test_time_source/

The steps below can also reproduce the problem:

$ source install/setup.bash
$ build/rclcpp/test/rclcpp/test_time_source --gtest_filter=*clock_sleep_until_with_ros_time_basic --gtest_repeat=10000

...
Repeating all tests (iteration 9304) . . .

Note: Google Test filter = *clock_sleep_until_with_ros_time_basic*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestTimeSource
[ RUN      ] TestTimeSource.clock_sleep_until_with_ros_time_basic
terminate called after throwing an instance of 'std::system_error'
  what():  Resource deadlock avoided
Aborted (core dumped)

Expected behavior

No exception

Actual behavior

An unexpected exception occurs and the test aborts

@fujitatomoya
Collaborator

Does this happen with rmw_fastrtps on mainline?

@Barry-Xu-2018
Collaborator Author

Does this happen with rmw_fastrtps on mainline?

Yes, after checking, it also happens with rmw_fastrtps on mainline.

@Barry-Xu-2018
Collaborator Author

Backtrace info based on the coredump file:
gdb_coredump.log

@Barry-Xu-2018
Collaborator Author

Barry-Xu-2018 commented Jan 14, 2022

After checking the code, I found the possible cause.

In the test code:

auto clock = std::make_shared<rclcpp::Clock>(RCL_ROS_TIME);
rclcpp::TimeSource time_source;
time_source.attachNode(node);
time_source.attachClock(clock);

NodeState creates a thread (with a new executor) to run a subscriber that receives clock messages:

clock_subscription_ = rclcpp::create_subscription<rosgraph_msgs::msg::Clock>(
  node_parameters_,
  node_topics_,
  "/clock",
  qos_,
  [state = std::weak_ptr<NodeState>(this->shared_from_this())](
    std::shared_ptr<const rosgraph_msgs::msg::Clock> msg) {
    if (auto state_ptr = state.lock()) {
      state_ptr->clock_cb(msg);
    }
    // Do nothing if the pointer could not be locked because it means the TimeSource is now
    // without an attached node
  },
  options
);

While the TimeSource (and with it the internal NodeState) is being destroyed, the callback above may be running concurrently. The callback creates a new shared_ptr to NodeState from the weak_ptr (auto state_ptr = state.lock()), so NodeState is not destroyed while the TimeSource destructor runs; the callback temporarily holds the last reference.

According to the coredump info, the problem occurs in the callback:

#17 0x00007f0f319ff6f6 in std::shared_ptr<rclcpp::TimeSource::NodeState>::~shared_ptr (this=0x7f0f167f9e30, __in_chrg=<optimized out>)
    at /usr/include/c++/9/bits/shared_ptr.h:103
#18 0x00007f0f31a008b2 in rclcpp::TimeSource::NodeState::create_clock_sub()::{lambda(std::shared_ptr<rosgraph_msgs::msg::Clock_<std::allocator<void> > const>)#2}::operator()(std::shared_ptr<rosgraph_msgs::msg::Clock_<std::allocator<void> > const>) const (__closure=0x55ea11be9bc0, msg=std::shared_ptr<const struct rosgraph_msgs::msg::Clock_<std::allocator<void> >> (use count 4, weak count 0) = {...}) at /tmp/ros2-test/src/ros2/rclcpp/rclcpp/src/rclcpp/time_source.cpp:426

When the callback returns and that shared_ptr is released, NodeState is actually destroyed at that point, on the clock executor thread itself. During destruction, the function below is called; it tries to join the very thread it is running on, which triggers the resource deadlock error:

void destroy_clock_sub()
{
  std::lock_guard<std::mutex> guard(clock_sub_lock_);
  if (clock_executor_thread_.joinable()) {
    cancel_clock_executor_promise_.set_value();
    clock_executor_->cancel();
    clock_executor_thread_.join();
    clock_executor_->remove_callback_group(clock_callback_group_);
  }
  clock_subscription_.reset();
}
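
For illustration, here is a minimal, self-contained sketch of this failure mode outside rclcpp (all names below are hypothetical, not from the rclcpp sources): a worker thread's callback locks a weak_ptr and thereby becomes the last owner of an object whose destructor joins that same worker thread.

#include <chrono>
#include <memory>
#include <thread>

struct State
{
  std::thread worker_;

  ~State()
  {
    // Mirrors destroy_clock_sub(): wait for the worker to finish.
    // When ~State() itself runs on the worker thread, join() is a
    // self-join and throws std::system_error ("Resource deadlock avoided").
    if (worker_.joinable()) {
      worker_.join();
    }
  }
};

int main()
{
  auto state = std::make_shared<State>();
  std::weak_ptr<State> weak_state = state;

  state->worker_ = std::thread([weak_state]() {
    // Simulates the clock callback: lock() makes this thread a
    // temporary co-owner of State while the "callback" runs.
    if (auto locked = weak_state.lock()) {
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    // If the main thread dropped its shared_ptr in the meantime, locked
    // was the last owner, so ~State() just ran on this thread and tried
    // to join itself, terminating with "Resource deadlock avoided".
  });

  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  state.reset();  // drop the main thread's ownership mid-"callback"
  std::this_thread::sleep_for(std::chrono::milliseconds(500));
  return 0;
}

Built with a plain g++ -pthread, this typically reproduces the same abort as the test output above.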

To avoid this issue, we should make sure the clock subscriber thread has terminated before the TimeSource is destroyed. I modified the code as below; this change may not be the best way.

TimeSource::~TimeSource()
{
  // Make sure clock subscriber thread is terminated at this time
  node_state_->detachNode();
}

#1865

After the modification, I re-ran the test many times (> 60000 iterations) and the problem no longer occurs.
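
As a side note, a generic defensive pattern for destructors that might run on their own worker thread is to detect the self-join before it throws. This is only an illustrative sketch with hypothetical names (not part of #1865 or the refactor), and it merely trades the exception for a detached thread that must no longer touch the object, so terminating the thread up front, as the modification above does, remains the safer cure.

#include <thread>

class Worker
{
public:
  ~Worker()
  {
    if (thread_.joinable()) {
      if (thread_.get_id() == std::this_thread::get_id()) {
        // The destructor is running on the worker thread itself;
        // join() would throw "Resource deadlock avoided", so detach
        // instead. The detached thread must not access this object
        // afterwards.
        thread_.detach();
      } else {
        thread_.join();
      }
    }
  }

private:
  std::thread thread_;
};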

@Barry-Xu-2018
Collaborator Author

@clalancette is refactoring the relevant code:
https://github.com/clalancette/rclcpp/tree/clalancette/time-source-cleanups

It fixes this issue and is a better solution, so I will close my fix #1865.
