-
Notifications
You must be signed in to change notification settings - Fork 695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make use of a retry mechanism in the CommandRunner for retrying when an exception occurs #5881
Conversation
… exception occurs, to retry the function
Since this PR was opened after 5pm yesterday and merged before 8am today, I didn't get a chance to review. @jebriede Could you share how you tested this change to find any scenario where the retry works? This test failed even with the retry, from what I can tell: Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=724244&view=results |
@donnie-msft The retry mechanism doesn't guarantee the tests will pass. The usage in this case is to only handle Timeout exceptions and retry one time. If there's a second timeout, the test will fail. I'm not seeing the build you linked as having encountered a timeout exception, though. Since we're specifically only retrying for timeout exceptions, any other exceptions will not cause the operation to be retried and still cause the test to fail, by design. The retry mechanism I built is flexible, however, and we can consider using it to retry for other exception types, but the task for now is to handle retrying only when encountering TimeoutException. How it was tested: See the unit tests for the retry mechanism that validate the operation it wraps will be rerun when the specified exception is encountered. If you see that the retry mechanism is failing to retry the specified number of times when the operation encounters a TimeoutException, then please file a bug against it. Please be sure to verify that the exception is a timeout exception and there's no retry in the logs. Note that a timeout from the second run will still cause the test to fail. We can play with the retry count as needed. |
Your screenshot shows that the test duration was 2 minutes, and given the error message says that the timeout was 60 seconds, this implies that it successfully retried once, but the second attempt also timed out. Although it's suspicious that the retry message wasn't written to the output. It appears that If there are any dotnet.integration.tests that took over 60 seconds, but less than 2 minutes, those are probably examples of tests that would have previously failed, but now will succeed. |
Yes, this is what I'm asking for as well: Where is an example showing this change actually made an impact on this flakey behavior?
The retry does occur, it's not in the screenshot, you have to open the logs. What I'm trying to understand is whether immediately retrying is giving an opportunity for the issue to resolve, because in 2 builds (including the example I linked), it still fails on the 2nd try. |
Just so happens, I think I've found some that meet those criteria!
|
Bug
Fixes: https://github.com/NuGet/Client.Engineering/issues/2866
Regression? Last working version: N/A
Description
Added a retry mechanism that can rerun a function a given number of times if the function throws a specific exception. To retry when any exception is thrown, the caller can simply use Exception as a generic parameter. Added tests for the retry runner.
PR Checklist
PR has a meaningful title
PR has a linked issue.
Described changes
Tests
Documentation