Reduce arbitrator retry attempts to keep the operation at 1hz #2415

MishkaMN · 2024-07-08T19:08:23Z

PR Details

Description

This PR aims to fix an issue where arbitrator is getting frozen due to too many repeated failed calls.
With 500ms service call timeout, the arbitrator can only afford maximum 2 retry attempts to satisfy 1hz operation (it will be slightly later than 1s period due to other successful calls returning, but that is okay).
Arbitrator should at least retry 2nd time again because sometimes ROS service call can fail due to ROS error.
Without retrying, the next planning will be after 1 second, which is too late for some planning such as checking red light in lci_strategic_plugin.

This may not be full fix (we should consider threaded calls, so that each 500ms wait is not sequential), but this will significantly reduce any issues we encounter for carma-platform to run out of the box.

Related GitHub Issue

#2385

Related Jira Key

https://usdot-carma.atlassian.net/browse/CAR-6039

Motivation and Context

Demos after 4.5.0 frequently encounter this issue by default because some nodes in ROS2 are not integration tested well but converted. This makes some of the nodes fail to activate or just due to host machine's performance, and repeated calls to such failed nodes freeze the arbitrator for up to 5 sec with 10 retry attempts.

How Has This Been Tested?

integration tested locally

Types of changes

Defect fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that cause existing functionality to change)

Checklist:

I have added any new packages to the sonar-scanner.properties file
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING document.
I have added tests to cover my changes.
All new and existing tests passed.

sonarqubecloud · 2024-07-08T20:03:09Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

paulbourelly999 · 2024-07-09T13:11:45Z

@MishkaMN do we have any information about the nature of these ROS call failures. Maybe there are node configurations we can set to reduce the amount of failed calls to mitigate the impact of retry attempts. I have had similar issues with SNMP calls that would fail to frequently. I was able to reduce failed by increasing the SNMP Client timeout for responses since almost all the failures were related to client timeouts. Just curious if we have looked into ROS node configurations relate to these calls to see if we can reduce the frequency of failing calls.

MishkaMN · 2024-07-09T20:31:32Z

@MishkaMN do we have any information about the nature of these ROS call failures. Maybe there are node configurations we can set to reduce the amount of failed calls to mitigate the impact of retry attempts. I have had similar issues with SNMP calls that would fail to frequently. I was able to reduce failed by increasing the SNMP Client timeout for responses since almost all the failures were related to client timeouts. Just curious if we have looked into ROS node configurations relate to these calls to see if we can reduce the frequency of failing calls.

if we have time, we can probably investigate more about node configurations, specifically Quality of Service settings.
#2245 is the original issue that triggered this code implementation during VRU. And although I believe that specific issue was resolved, I have seen over the years that ROS service call can just fail without much logs. I don't have documented logs unfortunately but it looks somewhat similar like the link.

Moreover, the current failure in arbitrator is mostly just symptom rather than an issue itself at this point (rarely the mysterious ROS error that this retry logic attempts to resolve). This PR is merely reducing the occurrence of the symptom so carma doesn't fail as often.

There are other issues that can cause this arbitrator failure such as some nodes not correctly ported to ROS2 but enabled anyways, or not having enough time to initialize carma so some nodes are not activated, or deactuvated nodes publishing services as available etc. They are open issues that will be resolved in time.
@paulbourelly999

paulbourelly999

Arbitrarily modifying this value does not seem like a long term solution for failing ROS2 system calls. The reason this seems arbitrary to me is that we initially set the value to 10. Not sure what setting it to 2 does other than fail earlier. We may want to determine the root cause of this failure and try to mitigate it.

MishkaMN · 2024-07-16T13:32:05Z

Arbitrarily modifying this value does not seem like a long term solution for failing ROS2 system calls. The reason this seems arbitrary to me is that we initially set the value to 10. Not sure what setting it to 2 does other than fail earlier. We may want to determine the root cause of this failure and try to mitigate it.

Yes we should plan to figure out the root cause by getting some data to work with. Currently this safeguard was put out of personal experience over the years. 10 was arbitrary number retry number that is plain out wrong since it exceeds the 1hz operation we intend (10 comes out to 0.2Hz) and doing significantly more harm than good. This PR is only to fix that at the moment, and this 2nd attempt with 500ms timeout is not arbitrary because it is aiming to keep strategic planning close to the 1hz that is anticipated.

Update capabilities_interface.tpp

de9901a

MishkaMN added the anomaly Something isn't working label Jul 8, 2024

MishkaMN requested a review from JonSmet July 8, 2024 19:08

MishkaMN self-assigned this Jul 8, 2024

MishkaMN requested review from adev4a and paulbourelly999 July 8, 2024 19:15

paulbourelly999 approved these changes Jul 16, 2024

View reviewed changes

MishkaMN merged commit a92c44e into develop Jul 16, 2024
4 checks passed

MishkaMN changed the title ~~Update capabilities_interface.tpp~~ Reduce arbitrator retry attempts to keep the operation at 1hz Jul 16, 2024

MishkaMN deleted the fix/arbitrator-freezing-tuning branch July 16, 2024 13:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce arbitrator retry attempts to keep the operation at 1hz #2415

Reduce arbitrator retry attempts to keep the operation at 1hz #2415

MishkaMN commented Jul 8, 2024 •

edited

Loading

sonarqubecloud bot commented Jul 8, 2024

paulbourelly999 commented Jul 9, 2024

MishkaMN commented Jul 9, 2024

paulbourelly999 left a comment

MishkaMN commented Jul 16, 2024

Reduce arbitrator retry attempts to keep the operation at 1hz #2415

Reduce arbitrator retry attempts to keep the operation at 1hz #2415

Conversation

MishkaMN commented Jul 8, 2024 • edited Loading

PR Details

Description

Related GitHub Issue

Related Jira Key

Motivation and Context

How Has This Been Tested?

Types of changes

Checklist:

sonarqubecloud bot commented Jul 8, 2024

Quality Gate passed

paulbourelly999 commented Jul 9, 2024

MishkaMN commented Jul 9, 2024

paulbourelly999 left a comment

Choose a reason for hiding this comment

MishkaMN commented Jul 16, 2024

MishkaMN commented Jul 8, 2024 •

edited

Loading