Reduce arbitrator retry attempts to keep the operation at 1hz #2415
@MishkaMN do we have any information about the nature of these ROS call failures? Maybe there are node configurations we can set to reduce the number of failed calls and so mitigate the impact of retry attempts. I have had similar issues with SNMP calls that would fail too frequently. I was able to reduce the failures by increasing the SNMP client timeout for responses, since almost all of them were related to client timeouts. Just curious whether we have looked into ROS node configurations related to these calls to see if we can reduce the frequency of failing calls.
If we have time, we can probably investigate node configurations further, specifically Quality of Service settings. Moreover, the current failure in the arbitrator is mostly a symptom rather than an issue in itself at this point (it is rarely the mysterious ROS error that this retry logic attempts to resolve). This PR merely reduces the occurrence of the symptom so carma doesn't fail as often. There are other issues that can cause this arbitrator failure, such as some nodes not being correctly ported to ROS2 but enabled anyway, not having enough time to initialize carma so some nodes are not activated, or deactivated nodes advertising services as available. These are open issues that will be resolved in time.
Arbitrarily modifying this value does not seem like a long-term solution for failing ROS2 system calls. It seems arbitrary to me because we initially set the value to 10. I am not sure what setting it to 2 does other than fail earlier. We may want to determine the root cause of this failure and try to mitigate it.
Yes, we should plan to figure out the root cause by getting some data to work with. Currently this safeguard is in place based on personal experience over the years. 10 was an arbitrary retry count that is plainly wrong, since it breaks the 1 Hz operation we intend (10 attempts at 500 ms each take up to 5 s, which comes out to 0.2 Hz) and does significantly more harm than good. This PR only fixes that for the moment, and the 2nd attempt with a 500 ms timeout is not arbitrary because it aims to keep strategic planning close to the anticipated 1 Hz.
PR Details
Description
This PR aims to fix an issue where the arbitrator freezes due to too many repeated failed service calls.
With a 500 ms service call timeout, the arbitrator can afford at most 2 attempts while still satisfying 1 Hz operation (the cycle will run slightly longer than the 1 s period because the other, successful calls also take time to return, but that is acceptable).
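The timing budget above can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the arbitrator: the function names are hypothetical, and it assumes that attempts run sequentially and that each failed attempt blocks for the full timeout.

```python
def worst_case_block_ms(attempts: int, timeout_ms: int) -> int:
    """Upper bound on blocking time when every sequential attempt times out."""
    return attempts * timeout_ms


def max_attempts_within(period_ms: int, timeout_ms: int) -> int:
    """Largest number of full-timeout attempts that fits in one planning period."""
    return period_ms // timeout_ms


def effective_rate_hz(attempts: int, timeout_ms: int) -> float:
    """Planning rate if each cycle is dominated by timed-out attempts."""
    return 1000 / worst_case_block_ms(attempts, timeout_ms)


# Old setting: 10 attempts at 500 ms each.
print(worst_case_block_ms(10, 500), "ms,", effective_rate_hz(10, 500), "Hz")
# New setting: 2 attempts at 500 ms each.
print(worst_case_block_ms(2, 500), "ms,", effective_rate_hz(2, 500), "Hz")
```

Under these assumptions, 10 attempts can block for 5000 ms (0.2 Hz), while 2 attempts stay within the 1000 ms period.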
The arbitrator should still make a 2nd attempt, because a ROS service call can occasionally fail due to a ROS error.
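A bounded retry of this kind might look like the following sketch. It is illustrative only and not the arbitrator's actual implementation: `call_with_retry` is a hypothetical helper, and the convention that a failed or timed-out call returns `None` is an assumption.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(call: Callable[[], Optional[T]], max_attempts: int = 2) -> Optional[T]:
    """Invoke `call` up to `max_attempts` times and return the first non-None result.

    `call` is assumed to enforce its own timeout and return None on failure.
    """
    for _ in range(max_attempts):
        result = call()
        if result is not None:
            return result
    # Every attempt failed; the caller should skip this service for the current cycle.
    return None


# Usage: a stand-in service call that fails once, then succeeds.
responses = iter([None, "plan"])
print(call_with_retry(lambda: next(responses)))
```

Capping `max_attempts` at 2 tolerates a single transient failure while keeping the worst-case wait bounded.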
Without a retry, the next planning cycle would only happen 1 second later, which is too late for some planning tasks such as checking for a red light in lci_strategic_plugin.
This may not be a full fix (we should consider threaded calls so that the 500 ms waits do not run sequentially), but it will significantly reduce the issues that keep carma-platform from running out of the box.
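The threaded-call idea can be illustrated with a short sketch. It is not part of this PR, and the names `query_all_concurrently` and `make_unresponsive_call` are hypothetical. It assumes each call enforces its own timeout and returns `None` on failure, so that waiting on several unresponsive services costs roughly one timeout instead of the sum of all of them.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def query_all_concurrently(calls: List[Callable[[], Optional[str]]]) -> List[Optional[str]]:
    """Run every call in its own worker thread and collect results in input order."""
    if not calls:
        return []
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        futures = [pool.submit(call) for call in calls]
        return [future.result() for future in futures]


def make_unresponsive_call(timeout_s: float) -> Callable[[], Optional[str]]:
    """Stand-in for a service call that blocks for its full timeout, then fails."""
    def call() -> Optional[str]:
        time.sleep(timeout_s)
        return None
    return call


# Four unresponsive services with a (shortened) 50 ms timeout each.
start = time.monotonic()
results = query_all_concurrently([make_unresponsive_call(0.05) for _ in range(4)])
print(results, f"{time.monotonic() - start:.2f}s")
```

Run sequentially, the four calls would block for about four timeouts; run concurrently, the total wait is close to a single timeout.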
Related GitHub Issue
#2385
Related Jira Key
https://usdot-carma.atlassian.net/browse/CAR-6039
Motivation and Context
Demos after 4.5.0 frequently encounter this issue by default because some nodes were converted to ROS2 without being well integration tested. As a result, some nodes fail to activate, sometimes simply because of the host machine's performance, and repeated calls to such failed nodes freeze the arbitrator for up to 5 s with 10 retry attempts.
How Has This Been Tested?
Integration tested locally.
Types of changes
Checklist: