
Reduce arbitrator retry attempts to keep the operation at 1hz #2415

Merged: 1 commit merged into develop on Jul 16, 2024

Conversation

@MishkaMN (Contributor) commented Jul 8, 2024

PR Details

Description

This PR fixes an issue where the arbitrator freezes due to too many repeated failed service calls.
With a 500 ms service call timeout, the arbitrator can only afford a maximum of 2 retry attempts to maintain 1 Hz operation (the cycle will run slightly longer than the 1 s period because successful calls also take time to return, but that is acceptable).
The arbitrator should still retry at least a second time, because a ROS service call can occasionally fail due to a transient ROS error.
Without retrying, the next planning cycle would not start for another second, which is too late for time-sensitive planning such as checking a red light in lci_strategic_plugin.

This may not be the full fix (we should consider threaded calls so that each 500 ms wait is not sequential), but it significantly reduces the issues encountered when running carma-platform out of the box.
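
For illustration, here is a minimal sketch of the kind of bounded retry loop described above, assuming a plain rclcpp client and a placeholder std_srvs/Trigger service; the actual arbitrator code lives in capabilities_interface.tpp and uses its own service types and parameter names.

```cpp
#include <chrono>
#include <cstddef>
#include <memory>

#include <rclcpp/rclcpp.hpp>
#include <std_srvs/srv/trigger.hpp>

// Illustrative only: the real arbitrator calls its planning services through
// capabilities_interface.tpp; the service type and helper name here are placeholders.
std_srvs::srv::Trigger::Response::SharedPtr call_with_retries(
  rclcpp::Node::SharedPtr node,
  rclcpp::Client<std_srvs::srv::Trigger>::SharedPtr client,
  std::size_t max_attempts = 2)  // 2 attempts x 500 ms stays within the ~1 s planning budget
{
  const auto timeout = std::chrono::milliseconds(500);
  auto request = std::make_shared<std_srvs::srv::Trigger::Request>();

  for (std::size_t attempt = 1; attempt <= max_attempts; ++attempt) {
    auto future = client->async_send_request(request);
    // Wait at most 500 ms for this attempt before giving up and retrying.
    if (rclcpp::spin_until_future_complete(node, future, timeout) ==
        rclcpp::FutureReturnCode::SUCCESS)
    {
      return future.get();
    }
    RCLCPP_WARN(node->get_logger(),
                "Service call attempt %zu failed or timed out", attempt);
  }
  return nullptr;  // caller skips this plugin for the current planning cycle
}
```

With 2 attempts at a 500 ms timeout each, a completely unresponsive service costs at most about 1 s, which is what keeps the planning loop near 1 Hz.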

Related GitHub Issue

#2385

Related Jira Key

https://usdot-carma.atlassian.net/browse/CAR-6039

Motivation and Context

Demos after 4.5.0 frequently encounter this issue by default because some ROS2 nodes were converted without thorough integration testing. As a result, some nodes fail to activate (or fail simply due to the host machine's performance), and repeated calls to those failed nodes freeze the arbitrator for up to 5 seconds with the previous 10 retry attempts.

How Has This Been Tested?

Integration tested locally.

Types of changes

  • Defect fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that causes existing functionality to change)

Checklist:

  • I have added any new packages to the sonar-scanner.properties file
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@MishkaMN MishkaMN added the "anomaly (Something isn't working)" label Jul 8, 2024
@MishkaMN MishkaMN requested a review from JonSmet July 8, 2024 19:08
@MishkaMN MishkaMN self-assigned this Jul 8, 2024

sonarqubecloud bot commented Jul 8, 2024

@paulbourelly999 (Contributor) commented:

@MishkaMN do we have any information about the nature of these ROS call failures? Maybe there are node configurations we can set to reduce the number of failed calls and mitigate the impact of retry attempts. I have had similar issues with SNMP calls that would fail too frequently. I was able to reduce the failures by increasing the SNMP client timeout for responses, since almost all the failures were related to client timeouts. Just curious whether we have looked into ROS node configurations related to these calls to see if we can reduce the frequency of failing calls.

@MishkaMN (Contributor, Author) commented Jul 9, 2024

> @MishkaMN do we have any information about the nature of these ROS call failures? Maybe there are node configurations we can set to reduce the number of failed calls and mitigate the impact of retry attempts. [...]

If we have time, we can probably investigate node configurations further, specifically Quality of Service settings.
#2245 is the original issue that triggered this retry logic during VRU. Although I believe that specific issue was resolved, I have seen over the years that a ROS service call can occasionally fail without much logging. Unfortunately I don't have documented logs, but it looked similar to what is described in the linked issue.

Moreover, at this point the current failure in the arbitrator is mostly a symptom rather than an issue in itself (it is rarely the mysterious ROS error that this retry logic attempts to resolve). This PR merely reduces the occurrence of the symptom so carma doesn't fail as often.

There are other issues that can cause this arbitrator failure, such as some nodes not being correctly ported to ROS2 but enabled anyway, carma not having enough time to initialize so some nodes are never activated, or deactivated nodes advertising their services as available. These are open issues that will be resolved in time.

@paulbourelly999 (Contributor) left a review comment

Arbitrarily modifying this value does not seem like a long-term solution for failing ROS2 service calls. The reason this seems arbitrary to me is that we initially set the value to 10; I am not sure what setting it to 2 does other than fail earlier. We may want to determine the root cause of this failure and try to mitigate it.

@MishkaMN (Contributor, Author) commented

> Arbitrarily modifying this value does not seem like a long-term solution for failing ROS2 service calls. The reason this seems arbitrary to me is that we initially set the value to 10. [...]

Yes, we should plan to figure out the root cause by gathering some data to work with. This safeguard was put in place based on personal experience over the years. 10 was an arbitrary retry count that is plainly wrong, since it exceeds the 1 Hz operation we intend (10 retries at 500 ms works out to roughly 0.2 Hz) and does significantly more harm than good. This PR only fixes that for now; the choice of 2 attempts with a 500 ms timeout is not arbitrary because it keeps strategic planning close to the anticipated 1 Hz.
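
For reference, the worst-case arithmetic behind that reasoning, assuming every retried call burns its full 500 ms timeout (the constant names below are illustrative, not the actual configuration keys):

```cpp
#include <chrono>

// Worst case: every retried call times out at the full 500 ms before failing.
constexpr std::chrono::milliseconds service_timeout{500};
constexpr int old_max_attempts = 10;  // previous setting: up to 5 s blocked (~0.2 Hz planning)
constexpr int new_max_attempts = 2;   // this PR: up to 1 s blocked (~1 Hz planning)

static_assert(old_max_attempts * service_timeout == std::chrono::seconds{5});
static_assert(new_max_attempts * service_timeout == std::chrono::seconds{1});
```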

@MishkaMN MishkaMN merged commit a92c44e into develop Jul 16, 2024
4 checks passed
@MishkaMN MishkaMN changed the title from "Update capabilities_interface.tpp" to "Reduce arbitrator retry attempts to keep the operation at 1hz" Jul 16, 2024
@MishkaMN MishkaMN deleted the fix/arbitrator-freezing-tuning branch July 16, 2024 13:32