
ActionClient Crash: Taking data from action client but nothing is ready #2242

Closed
jmachowinski opened this issue Jul 20, 2023 · 29 comments

@jmachowinski
Contributor

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04
  • Installation type:
    • binaries
  • Version or commit hash:
    • iron
  • DDS implementation:
    • Fast-RTPS
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

Start action server

  • server shall accept goal, but not terminate it

Start action client

  • send goal

wait (a rough sketch of this setup is shown below)
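A rough sketch of the kind of setup described (not the original code and not a guaranteed reproducer; it assumes the Fibonacci action from example_interfaces — any long-running action behaves the same — and puts both ends into one process for brevity, whereas the report runs server and client as separate processes):

#include <memory>
#include <rclcpp/rclcpp.hpp>
#include <rclcpp_action/rclcpp_action.hpp>
#include <example_interfaces/action/fibonacci.hpp>

using Fibonacci = example_interfaces::action::Fibonacci;
using GoalHandle = rclcpp_action::ServerGoalHandle<Fibonacci>;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto server_node = rclcpp::Node::make_shared("repro_server");
  auto client_node = rclcpp::Node::make_shared("repro_client");

  // Server: accept every goal and every cancel request, but never terminate the
  // goal, so it stays in EXECUTING indefinitely.
  auto server = rclcpp_action::create_server<Fibonacci>(
    server_node, "fibonacci",
    [](const rclcpp_action::GoalUUID &, std::shared_ptr<const Fibonacci::Goal>) {
      return rclcpp_action::GoalResponse::ACCEPT_AND_EXECUTE;
    },
    [](std::shared_ptr<GoalHandle>) {return rclcpp_action::CancelResponse::ACCEPT;},
    [](std::shared_ptr<GoalHandle>) { /* intentionally never call succeed()/abort() */ });

  // Client: send one goal using the node's default (mutually exclusive) callback group.
  auto client = rclcpp_action::create_client<Fibonacci>(client_node, "fibonacci");
  client->wait_for_action_server();
  client->async_send_goal(Fibonacci::Goal());

  // Multi-threaded spinner, as in the report; then wait.
  rclcpp::executors::MultiThreadedExecutor executor;
  executor.add_node(server_node);
  executor.add_node(client_node);
  executor.spin();

  rclcpp::shutdown();
  return 0;
}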

Expected behavior

no crash

Actual behavior

After some time we get the following exception:
terminate called after throwing an instance of 'std::runtime_error' what(): Taking data from action client but nothing is ready

Additional information

Using a multi-threaded spinner, but the default callback group, which I believe is mutually exclusive.

@Yadunund
Member

Could you prepare a minimal example that one could build and run to reproduce the error? It would also provide clarity on how various callbacks are defined.

@jmachowinski
Contributor Author

I tried to create a minimal example and figured out that the cause of the original issue was most likely a deadlock in the client or the server. I cannot reproduce the issue as originally described with the minimal example.

However, I managed to reproduce the error with a slightly different setup.

https://github.com/jmachowinski/rclcppActionDeadlock branch nothingReady

start server
start the client
Ctrl+C the client
directly restart the client

Server:

terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
  what():  goal_handle attempted invalid transition from state EXECUTING with event CANCELED, at ./src/rcl_action/goal_handle.c:95
Aborted (core dumped)

client:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Taking data from action client but nothing is ready
Aborted (core dumped)

start the client
start the server

client just runs

server directly crashes with

 ./server 
terminate called after throwing an instance of 'std::runtime_error'
  what():  Taking data from action server but nothing is ready
Aborted (core dumped)

Note: you need to do this multiple times; the error triggers roughly 1 time out of 10.

@jmachowinski
Contributor Author

Okay, I had a look at the code now, and it is pretty obvious that a race condition is occurring.
In order to fix this, I would need some background information on

rcl_action_server_wait_set_get_entities_ready

or, respectively,

rcl_wait

From the documentation of this function it is unclear to me whether it may report multiple events as ready at once.
@wjwwood You seem to have worked a lot on this code, do you know the answer?

@jmachowinski
Contributor Author

I dug further into this issue, and from my tests I can see that rcl_action_server_wait_set_get_entities_ready may report multiple events as ready at one time.

Now I am a bit more lost on how to correctly map this to the waitable interface, as is_ready cannot signal that multiple events are ready. The only way I currently see is to keep track of incoming events versus the events reported through the waitable interface, which is somewhat cumbersome. Any ideas?
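For reference, the part of the waitable interface in question looks roughly like this (paraphrased and simplified from rclcpp's waitable.hpp around the Iron timeframe; not a verbatim copy):

class Waitable
{
public:
  // Queried while the executor decides what to run next; a single bool cannot
  // express that several of an action server's/client's entities are ready at once.
  virtual bool is_ready(rcl_wait_set_t * wait_set) = 0;

  // Expected to pull the ready data out of the middleware.
  virtual std::shared_ptr<void> take_data() = 0;

  // Later handed the data returned by take_data(), possibly on another thread.
  virtual void execute(std::shared_ptr<void> & data) = 0;
};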

@fujitatomoya
Collaborator

I can reproduce the following exception on server side.

terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
  what():  goal_handle attempted invalid transition from state EXECUTING with event CANCELED, at ./src/rcl_action/goal_handle.c:95
Aborted (core dumped)

i believe this is because canceled is called on the goal_handle in the handle_cancel_ user callback. (the cancel request has not been accepted yet, meaning the goal is NOT in canceling state yet.) after all, rcl_action detects this as an invalid transition, as designed: https://design.ros2.org/articles/actions.html.

i think what is expected here is that the user callback should only decide whether the cancel is accepted or not, then the application processes the canceling sequence and calls canceled on that goal_handle.
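In code, that split looks roughly like this (illustration only; SomeAction is the action type from the linked reproduction repository, and the execute loop is assumed to run outside the cancel callback, e.g. triggered from handle_accepted):

#include <memory>
#include <rclcpp/rclcpp.hpp>
#include <rclcpp_action/rclcpp_action.hpp>
// ... plus the generated header for deadlock::action::SomeAction from the linked repo ...

using SomeAction = deadlock::action::SomeAction;
using GoalHandle = rclcpp_action::ServerGoalHandle<SomeAction>;

rclcpp_action::CancelResponse handle_cancel(std::shared_ptr<GoalHandle> /*goal_handle*/)
{
  // Only decide whether the cancel request is accepted.
  // Do NOT call canceled() here -- the goal is not in CANCELING state yet.
  return rclcpp_action::CancelResponse::ACCEPT;
}

void execute(std::shared_ptr<GoalHandle> goal_handle)
{
  auto result = std::make_shared<SomeAction::Result>();
  while (rclcpp::ok()) {
    if (goal_handle->is_canceling()) {
      // By now the goal has transitioned to CANCELING, so this is a valid transition.
      goal_handle->canceled(result);
      return;
    }
    // ... do the actual work, publish feedback, eventually call succeed(result) ...
  }
}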

btw, the following patch is required to build with rolling.

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 59c6de8..4c509bc 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -25,5 +25,9 @@ add_executable(client src/Client.cpp)
 ament_target_dependencies(client PUBLIC rclcpp rclcpp_action)
 target_link_libraries(client PRIVATE ${cpp_typesupport_target})
 
+install(TARGETS
+  server
+  client
+  DESTINATION lib/${PROJECT_NAME})
 
 ament_package()
diff --git a/package.xml b/package.xml
index 6b65995..f865ad0 100644
--- a/package.xml
+++ b/package.xml
@@ -9,6 +9,9 @@
 
   <buildtool_depend>ament_cmake</buildtool_depend>
 
+  <depend>rclcpp</depend>
+  <depend>rclcpp_action</depend>
+
   <test_depend>ament_lint_auto</test_depend>
   <test_depend>ament_lint_common</test_depend>
   <buildtool_depend>rosidl_default_generators</buildtool_depend>
gdb core stacktrace
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140540045682240) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140540045682240) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140540045682240, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007fd22dc85476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007fd22dc6b7f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007fd22df2dbbe in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007fd22df3924c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007fd22df392b7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007fd22df3923e in std::rethrow_exception(std::__exception_ptr::exception_ptr) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007fd22ec7a8d9 in rclcpp::exceptions::throw_from_rcl_error (ret=2301, prefix="", error_state=0x0, 
    reset_error=0x7fd22e0ce794 <rcutils_reset_error>) at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp/src/rclcpp/exceptions/exceptions.cpp:90
#10 0x00007fd22f1cbc8a in rclcpp_action::ServerGoalHandleBase::_canceled (this=0x7fd1ec001130)
    at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp_action/src/server_goal_handle.cpp:96
#11 0x00005646ab15d534 in rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction>::canceled (this=0x7fd1ec001130, 
    result_msg=std::shared_ptr<deadlock::action::SomeAction_Result_<std::allocator<void> >> (use count 2, weak count 0) = {...})
    at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp_action/include/rclcpp_action/server_goal_handle.hpp:210
#12 0x00005646ab15b683 in ServerNode::cancelGoal (this=0x5646ac6fef60, 
    goalHandle=std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction>> (use count 3, weak count 1) = {...})
    at /root/ros2_ws/colcon_ws/src/ros2/rclcppActionDeadlock/src/Server.cpp:59
#13 0x00005646ab1663a5 in std::__invoke_impl<rclcpp_action::CancelResponse, rclcpp_action::CancelResponse (ServerNode::*&)(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >), ServerNode*&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> > > (
    __f=@0x5646aca6aca0: (rclcpp_action::CancelResponse (ServerNode::*)(ServerNode * const, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)) 0x5646ab15b3da <ServerNode::cancelGoal(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>, 
    __t=@0x5646aca6acb0: 0x5646ac6fef60) at /usr/include/c++/11/bits/invoke.h:74
#14 0x00005646ab165043 in std::__invoke<rclcpp_action::CancelResponse (ServerNode::*&)(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >), ServerNode*&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> > > (
    __fn=@0x5646aca6aca0: (rclcpp_action::CancelResponse (ServerNode::*)(ServerNode * const, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)) 0x5646ab15b3da <ServerNode::cancelGoal(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>)
    at /usr/include/c++/11/bits/invoke.h:96
#15 0x00005646ab163d10 in std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>::__call<rclcpp_action::CancelResponse, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >&&, 0ul, 1ul>(std::tuple<std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >&&>&&, std::_Index_tuple<0ul, 1ul>) (
    this=0x5646aca6aca0, __args=...) at /usr/include/c++/11/functional:420
#16 0x00005646ab162fa0 in std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>::operator()<std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >, rclcpp_action::CancelResponse>(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >&&) (this=0x5646aca6aca0)
    at /usr/include/c++/11/functional:503
#17 0x00005646ab162353 in std::__invoke_impl<rclcpp_action::CancelResponse, std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> > >(std::__invoke_other, std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >&&) (
    __f=...) at /usr/include/c++/11/bits/invoke.h:61
#18 0x00005646ab160929 in std::__invoke_r<rclcpp_action::CancelResponse, std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> > >(std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >&&) (__fn=...)
    at /usr/include/c++/11/bits/invoke.h:114
#19 0x00005646ab15e724 in std::_Function_handler<rclcpp_action::CancelResponse (std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >), std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)> >::_M_invoke(std::_Any_data const&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >&&) (
    __functor=..., __args#0=...) at /usr/include/c++/11/bits/std_function.h:290
#20 0x00005646ab16b335 in std::function<rclcpp_action::CancelResponse (std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>::operator()(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >) const (this=0x5646ac84ba60, 
    __args#0=std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction>> (empty) = {...})
    at /usr/include/c++/11/bits/std_function.h:590
#21 0x00005646ab169b66 in rclcpp_action::Server<deadlock::action::SomeAction>::call_handle_cancel_callback (this=0x5646ac84b9b0, uuid=...)
    at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp_action/include/rclcpp_action/server.hpp:439
#22 0x00007fd22f1b6fde in rclcpp_action::ServerBase::execute_cancel_request_received (this=0x5646ac84b9b0, 
    data=std::shared_ptr<void> (use count 2, weak count 0) = {...}) at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp_action/src/server.cpp:459
#23 0x00007fd22f1b5be4 in rclcpp_action::ServerBase::execute (this=0x5646ac84b9b0, data=std::shared_ptr<void> (use count 2, weak count 0) = {...})
    at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp_action/src/server.cpp:299
#24 0x00007fd22ec84aec in rclcpp::Executor::execute_any_executable (this=0x7fff63e3e250, any_exec=...)
    at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:546
#25 0x00007fd22ecafc51 in rclcpp::executors::MultiThreadedExecutor::run (this=0x7fff63e3e250, this_thread_number=9)
    at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp:100
#26 0x00007fd22ecb1cb7 in std::__invoke_impl<void, void (rclcpp::executors::MultiThreadedExecutor::*&)(unsigned long), rclcpp::executors::MultiThreadedExecutor*&, unsigned long&> (
    __f=@0x5646aca6da98: (void (rclcpp::executors::MultiThreadedExecutor::*)(rclcpp::executors::MultiThreadedExecutor * const, unsigned long)) 0x7fd22ecafad4 <rclcpp::executors::MultiThreadedExecutor::run(unsigned long)>, __t=@0x5646aca6dab0: 0x7fff63e3e250) at /usr/include/c++/11/bits/invoke.h:74
#27 0x00007fd22ecb1bc9 in std::__invoke<void (rclcpp::executors::MultiThreadedExecutor::*&)(unsigned long), rclcpp::executors::MultiThreadedExecutor*&, unsigned long&> (
    __fn=@0x5646aca6da98: (void (rclcpp::executors::MultiThreadedExecutor::*)(rclcpp::executors::MultiThreadedExecutor * const, unsigned long)) 0x7fd22ecafad4 <rclcpp::executors::MultiThreadedExecutor::run(unsigned long)>) at /usr/include/c++/11/bits/invoke.h:96
#28 0x00007fd22ecb1ad2 in std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>::__call<void, , 0ul, 1ul>(std::tuple<>&&, std::_Index_tuple<0ul, 1ul>) (this=0x5646aca6da98, __args=...) at /usr/include/c++/11/functional:420
#29 0x00007fd22ecb1a15 in std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>::operator()<, void>() (this=0x5646aca6da98) at /usr/include/c++/11/functional:503
#30 0x00007fd22ecb19bc in std::__invoke_impl<void, std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>>(std::__invoke_other, std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>&&) (__f=...) at /usr/include/c++/11/bits/invoke.h:61
#31 0x00007fd22ecb1965 in std::__invoke<std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>>(std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>&&) (__fn=...) at /usr/include/c++/11/bits/invoke.h:96
#32 0x00007fd22ecb1906 in std::thread::_Invoker<std::tuple<std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)> > >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x5646aca6da98) at /usr/include/c++/11/bits/std_thread.h:253
#33 0x00007fd22ecb18d6 in std::thread::_Invoker<std::tuple<std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)> > >::operator()() (this=0x5646aca6da98) at /usr/include/c++/11/bits/std_thread.h:260
#34 0x00007fd22ecb18b6 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)> > > >::_M_run() (this=0x5646aca6da90) at /usr/include/c++/11/bits/std_thread.h:211
#35 0x00007fd22df672b3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#36 0x00007fd22dcd7b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#37 0x00007fd22dd69a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

@fujitatomoya
Collaborator

@jmachowinski i think the doc section is not correct, see #2266

@jmachowinski
Contributor Author

@fujitatomoya I agree.
Can someone review #2250?
This fixes the 'Taking data from client' issues for us.

@fujitatomoya
Collaborator

I think the goal_handle attempted invalid transition error can be resolved with application code and the doc fix #2266.

This fixes the 'Taking data from client' issues for us.

on this one, i cannot reproduce it. can you elaborate on the root cause of this before starting review? that would be really appreciated to understand the problem first.

@jmachowinski
Contributor Author

jmachowinski commented Aug 22, 2023

This problem is hard to reproduce; you need a certain workload and actions running with a multi-threaded spinner.

As for the explanation of what is going on, this is a long one...

First we need to look at the multithreaded executor:

MultiThreadedExecutor::run(size_t this_thread_number)
{
  (void)this_thread_number;
  while (rclcpp::ok(this->context_) && spinning.load()) {
    rclcpp::AnyExecutable any_exec;
    {
      std::lock_guard wait_lock{wait_mutex_};
      if (!rclcpp::ok(this->context_) || !spinning.load()) {
        return;
      }
      if (!get_next_executable(any_exec, next_exec_timeout_)) {
        continue;
      }
    }
    if (yield_before_execute_) {
      std::this_thread::yield();
    }
    execute_any_executable(any_exec);

The important point here is that get_next_executable is called with a lock and execute_any_executable(any_exec) is called without the lock.

Now we need to look at the function

ClientBase::is_ready(rcl_wait_set_t * wait_set)
{
  rcl_ret_t ret = rcl_action_client_wait_set_get_entities_ready(
    wait_set, pimpl_->client_handle.get(),
    &pimpl_->is_feedback_ready,
    &pimpl_->is_status_ready,
    &pimpl_->is_goal_response_ready,
    &pimpl_->is_cancel_response_ready,
    &pimpl_->is_result_response_ready);

This function will be called by get_next_executable deep under the hood.
The important point here is that the kind of the next executable is stored in the pimpl object, which is shared between all executor threads.

Next is the function

ClientBase::execute(std::shared_ptr<void> & data)

This function is called by execute_any_executable. As pointed out above, this call is no longer protected by the lock, and any other thread may have called ClientBase::is_ready in between.

As one can see, the execute function uses the values in the shared pimpl object to determine the kind of action data that needs processing. This is the point where the race occurs: the pimpl may have been altered by another thread in the meantime, leading to the "Executing action client but nothing is ready" exception.

  if (pimpl_->is_feedback_ready) {
  } else if (pimpl_->is_status_ready) {
  } else if (pimpl_->is_goal_response_ready) {
  } else if (pimpl_->is_result_response_ready) {
  } else if (pimpl_->is_cancel_response_ready) {
  } else {
    throw std::runtime_error("Executing action client but nothing is ready");
  }

Note: I also saw races between is_ready and take_data, which are a mystery to me, as these should be protected by the mutex in the spinner. The patch also addresses these.
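To make the failure mode concrete, here is a distilled, self-contained illustration of the pattern (plain C++, not the actual rclcpp_action code; the struct and function names are made up):

#include <stdexcept>

struct SharedFlags            // stands in for ClientBase's pimpl
{
  bool is_feedback_ready = false;
  bool is_status_ready = false;
};

SharedFlags flags;            // shared by all executor threads

void is_ready_like()          // runs under the executor's wait_mutex_
{
  // A thread waiting for work can set or overwrite these flags at any time,
  // even while another thread is still between take_data() and execute().
  flags.is_status_ready = true;
}

void execute_like()           // runs WITHOUT the wait_mutex_
{
  if (flags.is_feedback_ready) {
    flags.is_feedback_ready = false;   // consume "feedback"
  } else if (flags.is_status_ready) {
    flags.is_status_ready = false;     // consume "status"
  } else {
    // If another thread's execute_like() already consumed the flag this thread's
    // take_data() was based on, we fall through to here:
    throw std::runtime_error("Executing action client but nothing is ready");
  }
}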

@jmachowinski
Contributor Author

can you elaborate on the root cause of this before starting review? that would be really appreciated to understand the problem first.

@fujitatomoya is the issue now clear to you?

@doisyg

doisyg commented Sep 3, 2023

Seeing a similar error in code that was running fine under Humble.

@jmachowinski
Contributor Author

@doisyg does PR #2250 fix the issue for you?

@tommy-erko

Hi @jmachowinski, thank you for your digging and elaboration on the issue. We have been encountering it in our project since April, randomly in about 1 out of 10 cases and seemingly dependent on computer utilization. I am now testing our application with your fix and will let you know once I am able to determine whether it helps.
I hope this issue gets more attention here, dear ROS2 guys; it would really help to speed up the review of the related PR :)

@medvedevigorek

@jmachowinski we ran into the same issue and it's kind of a blocker for us due to its quite consistent reproduction. Your explanation here was very useful to bootstrap the investigation we have started as well, however there is one thing we cannot puzzle out: originally you talked about a runtime exception with the message "Taking data from action client but nothing is ready" (and we get exactly the same error), but your explanation leads to "Executing action client but nothing is ready", which seems to be different (at least we haven't seen that one yet). Does it mean that once you fixed the race between is_ready() and take_data(), you started to see the execute() call throwing that runtime exception?

@jmachowinski
Contributor Author

@medvedevigorek It's a race between the three functions. is_ready() sets class members which represent which data must be taken. In take_data() these members are actually not cleared, as they are 'misused' to store the type of data that was taken. Later, in execute, the class members are used to figure out what kind of data was taken, and then the class members are cleared. This leads to three possible failures:

  • is_ready might be called in between take_data and execute, which leads to a wrong cast of the taken data -> segfault
  • is_ready signals data to be taken, execute interrupts and clears the member, take_data finds no member set and throws "Taking data from action client but nothing is ready".
  • is_ready is called and a message is ready, take_data is called, then a thread change happens before execute is called. The second thread jumps in, calls is_ready, a second message is ready, and take_data is called. Execute is called (still in the second thread) and the class member for the message is cleared. Thread 1 then calls execute, does not find any class member set, and throws "Executing action client but nothing is ready".

Note that #2250 is a working fix.
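The direction such a fix can take is sketched below (paraphrased from this discussion; the enum and struct names are illustrative and not taken from the actual #2250/#2495 patches): take_data() bundles the taken message together with the entity it came from, so execute() no longer has to consult shared class members that another thread may have modified in the meantime.

#include <memory>

enum class ClientEntity { Feedback, Status, GoalResponse, CancelResponse, ResultResponse };

// Everything execute() needs travels with the data itself.
struct TakenData
{
  ClientEntity entity;            // which entity the message was taken from
  std::shared_ptr<void> message;  // the message that was taken
};

std::shared_ptr<void> take_data_sketch(ClientEntity ready_entity, std::shared_ptr<void> msg)
{
  return std::make_shared<TakenData>(TakenData{ready_entity, std::move(msg)});
}

void execute_sketch(const std::shared_ptr<void> & data)
{
  auto taken = std::static_pointer_cast<TakenData>(data);
  switch (taken->entity) {
    case ClientEntity::Feedback:
      // handle the feedback message
      break;
    // ... the remaining entities are handled analogously ...
    default:
      break;
  }
}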

@medvedevigorek

Thank you, @jmachowinski, for the additional details. After getting the race condition caused by execute() fixed, I continued to experience the "Taking data from action client.." runtime error, and at this point it could not be caused by a race condition anymore, as is_ready() and take_data() could not be called by different threads of the multi-threaded executor. So I kept poking around and found the actual cause: take_data() might get called multiple times for a single message in the queue.

Quick overview of involved functions:

  • get_next_executable() first checks whether there is a ready executable by calling get_next_ready_executable(); if there is none, it calls wait_for_work() and runs get_next_ready_executable() again.
  • get_next_ready_executable() gets the next waitable from the memory strategy and calls take_data() on it. The memory strategy pops the next waitable from the triggered waitables list whose callback group has a free execution slot - in the case of a mutually exclusive group, no other callback of that group is being executed at the moment.
  • wait_for_work() clears all handles in the memory strategy, collects all handles from the registered nodes (this is when our action client is added), then updates the wait set and finally calls remove_null_handles() on the memory strategy. The latter checks which waitables are ready by calling is_ready() on each of them and moves the ready ones to the triggered waitables list.

Now, what sequence leads to the multiple calls of take_data():

  1. The first call to get_next_ready_executable() returns nothing and wait_for_work() is called.
  2. It finds that an action client, aka a waitable, has some data to process: its is_ready() returned true, so it was moved to the triggered waitables list.
  3. The second call to get_next_ready_executable() doesn't return that waitable, because something else is running in its callback group (which is set to mutually exclusive mode) or because some other entity was returned first.
  4. The loop continues, either in the same thread or a different thread - it doesn't matter.
  5. Again the first get_next_ready_executable() is called and it returns nothing - our waitable is still blocked by the callback group and is kept in the triggered list.
  6. wait_for_work() is called again and finds that our waitable is still ready, so it adds it to the triggered list again. Now we have two records for the same waitable.
  7. get_next_ready_executable() finally returns the first record of the waitable from the triggered waitables list and take_data() is invoked on it.
  8. In the same or another thread, step 5 or step 7 will pick up the second record of the waitable from the triggered list, but this time take_data() will fail as it doesn't know what message type to read.

So the actual problem must be in AllocatorMemoryStrategy, as it blindly adds ready waitables to the triggered list:

for (size_t i = 0; i < waitable_handles_.size(); ++i) {
  if (waitable_handles_[i]->is_ready(wait_set)) {
    waitable_triggered_handles_.emplace_back(std::move(waitable_handles_[i]));
  }
}

I ended up with a simple change that adds a waitable only if it's not already present in the triggered list, and it fixed the problem for me. This is in addition to a change in the execute() method so that it does not depend on class members for the message type.
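One possible shape of that duplicate check, sketched against the loop quoted above (an illustration, not the exact change; it assumes the handles are shared_ptrs so std::find can compare them directly, and it additionally requires <algorithm>):

for (size_t i = 0; i < waitable_handles_.size(); ++i) {
  if (waitable_handles_[i]->is_ready(wait_set)) {
    // Only queue the waitable if it is not already in the triggered list,
    // so take_data() cannot be invoked twice for the same ready event.
    const bool already_triggered =
      std::find(
        waitable_triggered_handles_.begin(), waitable_triggered_handles_.end(),
        waitable_handles_[i]) != waitable_triggered_handles_.end();
    if (!already_triggered) {
      waitable_triggered_handles_.emplace_back(std::move(waitable_handles_[i]));
    }
  }
}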

I also tried your fix and it works as well, since it makes is_ready() read the data in place rather than in take_data(), hence no duplicates in the triggered list are possible.

I am not sure what else might be impacted by the flow described above besides the action client base - I don't know what else uses the Waitable interface in rclcpp.

@jmachowinski
Contributor Author

@medvedevigorek Nice debugging. I agree, this is a second bug we are hitting here.
I wonder, wouldn't just clearing waitable_triggered_handles_ at the start of remove_null_handles fix this issue?

@medvedevigorek

@jmachowinski, I was thinking about clearing waitable_triggered_handles_ along with the other handles in clear_handles(), as it's called by the executor at the beginning of the wait_for_work() routine. However, I am not sure about the contract here, because the following is potentially possible: collect_entities() adds a waitable to the list, the waitable is then moved to waitable_triggered_handles_ but not executed, and on the next iteration the same waitable is not added into the list as it's not in a group anymore, hence it is never executed even though it was marked ready for execution on the previous iteration. I am not that familiar with the internals, so I cannot say for sure whether such a scenario is even valid. Something tells me it's not, but I didn't want to cause more issues, so I went with a check before adding a waitable instead of clearing the list. If someone familiar with the matter can confirm that, then clearing the triggered list might be the better approach.

@wjwwood
Member

wjwwood commented Nov 21, 2023

on the next iteration the same waitable is not added into the list as it's not in a group anymore

@medvedevigorek when you say this, do you mean it's not in a group (I assumed you mean callback group) anymore because it was deleted? Or for some other reason?

@jmachowinski
Contributor Author

@wjwwood It's not added by collect_entities(), as the waitable is in execution at this point in time and therefore filtered out of the wait set.
See

if (!group || !group->can_be_taken_from().load()) {
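For context, the surrounding logic looks roughly like this (paraphrased, not the exact rclcpp source; callback_groups stands for however the memory strategy iterates the registered groups):

for (const auto & weak_group : callback_groups) {
  auto group = weak_group.lock();
  if (!group || !group->can_be_taken_from().load()) {
    // A mutually exclusive group with a callback in flight is skipped here,
    // so its waitables are not re-collected on this iteration.
    continue;
  }
  // ... otherwise the group's subscriptions, timers, services, clients and
  //     waitables are added to the handles for the next wait set ...
}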

@medvedevigorek

@wjwwood yes, I meant callback group. collect_entities collects waitables from callback groups, and later a waitable can be moved to the waitable_triggered_handles list. The question was: if collect_entities is called again, is it possible that the same waitable is not collected again? If that is not possible, then clearing waitable_triggered_handles might be okay.

@jmachowinski
Contributor Author

Fixed by the merge of #2495 in Jazzy.

@alexmillane

Thank you for your efforts in fixing this. Is there a possibility to backport it to humble? 🙏

@fujitatomoya
Collaborator

@alexmillane unfortunately i do not think we can do that, since #2495 is an API/ABI breaking change.

@jmachowinski
Contributor Author

It's pretty straightforward to modify the patch so that it is not ABI/API breaking. @alexmillane feel free to do it ;-)

@dsandber

dsandber commented May 7, 2024

Looking at #2250 it seems this was applied to rolling/jazzy. Has anyone tried to apply it to iron? Does the patch apply cleanly? Thanks!

PS. Or can the (patched) rolling rclcpp package be used with an iron distro?

@firesurfer

@jmachowinski Could you elaborate how? I am willing to put in the effort to backport this patch to iron, as this bug basically makes it impossible to run any ros2_control application for a longer period when using any controller with an action interface.

@jmachowinski
Contributor Author

jmachowinski commented May 13, 2024

@jmachowinski Could you elaborate how? I am willing to put in the effort to backport this patch to iron, as this bug basically makes it impossible to run any ros2_control application for a longer period when using any controller with an action interface.

Use the patch from #2495 and change it so that the signatures of

  execute_goal_request_received(const std::shared_ptr<void> & data);
  execute_cancel_request_received(const std::shared_ptr<void> & data);
  execute_result_request_received(const std::shared_ptr<void> & data);

and the enum

EntityType

are not changed. Should be straightforward.

@edgarcamilocamacho

edgarcamilocamacho commented Sep 30, 2024

Hello @alexmillane and @doisyg, I just backported the fix from the iron branch to humble and created PR #2635.
