This repository has been archived by the owner on Feb 4, 2021. It is now read-only.

test_action_client_server__Fibonacci fails on aarch64 repeated #152

Closed
clalancette opened this issue Dec 7, 2018 · 7 comments

Last night, one of the test failures in the repeated aarch64 nightly job was test_action_client_server__Fibonacci: https://ci.ros2.org/view/nightly/job/nightly_linux-aarch64_repeated/624/testReport/junit/test_communication.build.test_communication/test_action_client_server__Fibonacci__rclcpp__rmw_fastrtps_cpp_/test_action_client_server/. The output was:

def test_action_client_server():
        namespace = '/test_time_%s' % time.strftime('%H_%M_%S', time.gmtime())
    
        ld = LaunchDescriptor()
        action_client_cmd = ['/home/rosbuild/ci_scripts/ws/build/test_communication/test_action_client_cpp', 'Fibonacci', namespace]
        action_server_cmd = ['/home/rosbuild/ci_scripts/ws/build/test_communication/test_action_server_cpp', 'Fibonacci', namespace]
    
        action_server_env = dict(os.environ)
        action_client_env = dict(os.environ)
    
        if 'rclcpp' == 'rclpy':
            action_server_cmd.insert(0, sys.executable)
            action_server_env['PYTHONUNBUFFERED'] = '1'
        if 'rclcpp' == 'rclpy':
            action_client_cmd.insert(0, sys.executable)
            action_client_env['PYTHONUNBUFFERED'] = '1'
    
        action_server_env['RCL_ASSERT_RMW_ID_MATCHES'] = 'rmw_fastrtps_cpp'
        action_server_env['RMW_IMPLEMENTATION'] = 'rmw_fastrtps_cpp'
        ld.add_process(
            cmd=action_server_cmd,
            name='test_action_server',
            env=action_server_env,
        )
    
        action_client_env['RCL_ASSERT_RMW_ID_MATCHES'] = 'rmw_fastrtps_cpp'
        action_client_env['RMW_IMPLEMENTATION'] = 'rmw_fastrtps_cpp'
        ld.add_process(
            cmd=action_client_cmd,
            name='test_action_client',
            env=action_client_env,
            exit_handler=primary_exit_handler,
        )
    
        launcher = DefaultLauncher()
        launcher.add_launch_descriptor(ld)
        rc = launcher.launch()
    
>       assert rc == 0, \
            "The launch file failed with exit code '" + str(rc) + "'. " \
            'May be the action client did not receive any completed goals from the action server?'
E       AssertionError: The launch file failed with exit code '-15'. May be the action client did not receive any completed goals from the action server?
E       assert -15 == 0

../../../../build/test_communication/test_action_client_server__Fibonacci__rclcpp__rmw_fastrtps_cpp_.py:50: AssertionError

@jacobperron @sloretz FYI

sloretz self-assigned this Dec 7, 2018

sloretz commented Dec 7, 2018

Still looking into this. My first impression is that the action server did not shut down after receiving SIGINT and was then sent SIGTERM at the 10-second test timeout. I can reproduce it on amd64 bionic roughly once every 10 to 30 runs.
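
For reference on the exit code: the launch tooling here is Python, and Python's subprocess machinery reports a child killed by signal N as a return code of -N, so the rc of -15 above means the server process was killed by SIGTERM (signal 15). A purely illustrative sketch of the underlying POSIX mechanism (not the test harness itself):

// Illustrative only: shows why a process killed at the timeout surfaces as -15.
// At the POSIX level the same information comes from waitpid().
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <csignal>
#include <cstdio>

int main()
{
  pid_t child = fork();
  if (child == 0) {
    pause();                 // child just waits until it is terminated
    _exit(0);
  }
  kill(child, SIGTERM);      // what the launcher does when the test times out
  int status = 0;
  waitpid(child, &status, 0);
  if (WIFSIGNALED(status)) {
    // Prints 15 (SIGTERM); a Python-based launcher surfaces this as -15.
    std::printf("child killed by signal %d\n", WTERMSIG(status));
  }
  return 0;
}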

sloretz assigned hidmic and unassigned sloretz Dec 7, 2018

hidmic commented Dec 7, 2018

Alright, it seems we have a race condition when triggering guard conditions to wake up wait sets when a signal comes in. Here's the backtrace:

Thread 1 "test_action_ser" hit Breakpoint 6, GuardCondition::trigger (this=0x141d3b0)
    at /home/michel/Workspaces/ros2_ws/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/types/guard_condition.hpp:38
38            std::unique_lock<std::mutex> clock(*conditionMutex_);
(gdb) bt
#0  GuardCondition::trigger (this=0x141d3b0) at /home/michel/Workspaces/ros2_ws/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/types/guard_condition.hpp:38
#1  0x00007ffff339af9a in rmw_fastrtps_shared_cpp::__rmw_trigger_guard_condition (identifier=0x7ffff39fb5b2 "rmw_fastrtps_cpp", guard_condition_handle=0x1389820)
    at /home/michel/Workspaces/ros2_ws/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_trigger_guard_condition.cpp:38
#2  0x00007ffff39f96c2 in rmw_trigger_guard_condition (guard_condition_handle=0x1389820)
    at /home/michel/Workspaces/ros2_ws/src/ros2/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_trigger_guard_condition.cpp:28
#3  0x00007ffff5de6142 in rmw_trigger_guard_condition (v1=0x1389820)
    at /home/michel/Workspaces/ros2_ws/src/ros2/rmw_implementation/rmw_implementation/src/functions.cpp:361
#4  0x00007ffff6205f0f in rcl_trigger_guard_condition (guard_condition=0x141d428) at /home/michel/Workspaces/ros2_ws/src/ros2/rcl/rcl/src/rcl/guard_condition.c:159
#5  0x00007ffff78902ea in rclcpp::notify_all () at /home/michel/Workspaces/ros2_ws/src/ros2/rclcpp/rclcpp/src/rclcpp/utilities.cpp:393
#6  0x00007ffff788e84d in signal_handler (signal_value=2, siginfo=0x7fffffff47b0, context=0x7fffffff4680)
    at /home/michel/Workspaces/ros2_ws/src/ros2/rclcpp/rclcpp/src/rclcpp/utilities.cpp:153
#7  <signal handler called>
#8  pthread_mutex_lock (mutex=0x76b630) at forward.c:192
#9  0x000000000040b8b2 in __gthread_mutex_lock (__mutex=0x76b630) at /usr/include/x86_64-linux-gnu/c++/5/bits/gthr-default.h:748
#10 0x000000000040db72 in std::mutex::lock (this=0x76b630) at /usr/include/c++/5/mutex:135
#11 0x000000000040e664 in std::lock_guard<std::mutex>::lock_guard (this=0x7fffffff4dd0, __m=...) at /usr/include/c++/5/mutex:386
#12 0x00007ffff339d286 in GuardCondition::attachCondition (this=0x76b630, conditionMutex=0x1389f90, conditionVariable=0x1389f60)
    at /home/michel/Workspaces/ros2_ws/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/types/guard_condition.hpp:53
#13 0x00007ffff339b71c in rmw_fastrtps_shared_cpp::__rmw_wait (subscriptions=0x150abb8, guard_conditions=0x150abd0, services=0x150ac00, clients=0x150abe8, 
    wait_set=0x141d380, wait_timeout=0x7fffffff5090) at /home/michel/Workspaces/ros2_ws/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:137
#14 0x00007ffff39f970a in rmw_wait (subscriptions=0x150abb8, guard_conditions=0x150abd0, services=0x150ac00, clients=0x150abe8, wait_set=0x141d380, 
    wait_timeout=0x7fffffff5090) at /home/michel/Workspaces/ros2_ws/src/ros2/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_wait.cpp:32
#15 0x00007ffff5de626f in rmw_wait (v6=0x150abb8, v5=0x150abd0, v4=0x150ac00, v3=0x150abe8, v2=0x141d380, v1=0x7fffffff5090)
    at /home/michel/Workspaces/ros2_ws/src/ros2/rmw_implementation/rmw_implementation/src/functions.cpp:373
#16 0x00007ffff6217495 in rcl_wait (wait_set=0x7fffffff9f10, timeout=-1) at /home/michel/Workspaces/ros2_ws/src/ros2/rcl/rcl/src/rcl/wait.c:553
#17 0x00007ffff779f520 in rclcpp::executor::Executor::wait_for_work (this=0x7fffffff9ef0, timeout=...)
    at /home/michel/Workspaces/ros2_ws/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:450
#18 0x00007ffff77a0358 in rclcpp::executor::Executor::get_next_executable (this=0x7fffffff9ef0, any_executable=..., timeout=...)
    at /home/michel/Workspaces/ros2_ws/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:578
#19 0x00007ffff77aa436 in rclcpp::executors::SingleThreadedExecutor::spin (this=0x7fffffff9ef0)
    at /home/michel/Workspaces/ros2_ws/src/ros2/rclcpp/rclcpp/src/rclcpp/executors/single_threaded_executor.cpp:35
#20 0x00007ffff77a4028 in rclcpp::spin (node_ptr=std::shared_ptr (count 5, weak 1) 0x76b800)
    at /home/michel/Workspaces/ros2_ws/src/ros2/rclcpp/rclcpp/src/rclcpp/executors.cpp:35
#21 0x00007ffff77a4146 in rclcpp::spin (node_ptr=std::shared_ptr (count 2, weak 1) 0x76b740)
    at /home/michel/Workspaces/ros2_ws/src/ros2/rclcpp/rclcpp/src/rclcpp/executors.cpp:42
#22 0x000000000040c7f5 in main (argc=3, argv=0x7fffffffa278)
    at /home/michel/Workspaces/ros2_ws/src/ros2/system_tests/test_communication/test/test_action_server.cpp:191

There you can see that while we're about to wait on a wait set (with this mutex locked), a signal comes in and the handler attempts to lock that same mutex from the same thread. @wjwwood and I have been discussing a solution; it will likely involve deferring some of the work the signal handler currently does to a separate thread.
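
For context, the usual way to avoid this class of deadlock is to restrict the signal handler to async-signal-safe calls and hand the mutex-taking work (such as triggering guard conditions) off to a dedicated thread. A rough sketch of that pattern, with illustrative names (handle_signal, deferred_worker, g_signal_sem) rather than the actual rclcpp implementation:

#include <atomic>
#include <csignal>
#include <cstdio>
#include <semaphore.h>   // sem_post() is async-signal-safe per POSIX
#include <thread>

static sem_t g_signal_sem;                  // posted from the signal handler
static std::atomic<bool> g_shutdown{false};

extern "C" void handle_signal(int)
{
  // Only async-signal-safe work here: no mutexes, no guard conditions.
  g_shutdown.store(true);
  sem_post(&g_signal_sem);
}

// Dedicated thread: safe to take mutexes and trigger guard conditions here.
void deferred_worker()
{
  while (sem_wait(&g_signal_sem) == -1) {}  // retry if interrupted
  if (g_shutdown.load()) {
    // In rclcpp, the guard-condition triggering would happen here instead
    // of inside the handler itself.
    std::puts("running deferred shutdown work outside the signal handler");
  }
}

int main()
{
  sem_init(&g_signal_sem, 0, 0);
  std::thread worker(deferred_worker);
  std::signal(SIGINT, handle_signal);
  std::raise(SIGINT);                       // simulate Ctrl-C for the demo
  worker.join();
  sem_destroy(&g_signal_sem);
  return 0;
}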


sloretz commented Dec 11, 2018

@hidmic any other places that could be blocking? I can still reproduce this using ros2/rclcpp#605 and ros2/rclcpp#608 ros2/rclcpp#607 with

colcon test --retest-until-fail 100 --event-handlers console_direct+ --packages-select test_communication --ctest-args -R test_action_client_server__Fibonacci__rclcpp__rmw_fastrtps_cpp\$

It hangs within a minute or two; I'm not sure how many tests are run on average before that point.

Edit: oops, I didn't look closely enough at the test output


hidmic commented Dec 11, 2018

It blocks with both? I can imagine why it does for ros2/rclcpp#605, but I'd have to look into ros2/rclcpp#608 ros2/rclcpp#607. I'd actually be surprised if it blocks for the same reason.


tfoote commented Dec 12, 2018

@sloretz did you mean ros2/rclcpp#607 instead of 608 here too?


sloretz commented Dec 12, 2018

> @hidmic any other places that could be blocking? I can still reproduce this using ros2/rclcpp#605 and ros2/rclcpp#608 ros2/rclcpp#607 with

Oops, I think I didn't look closely enough here. With ros2/rclcpp#605 I haven't seen this test failure where the server hangs. Instead, I see a different, less frequent issue where the client fails to find the server in wait_for_action_server().


sloretz commented Dec 20, 2018

I don't see this test failure in the latest repeated aarch64 nightly job. Since ros2/rclcpp#605 has been merged, I'm going to close this as fixed by that PR.

sloretz closed this as completed Dec 20, 2018