Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make use of a retry mechanism in the CommandRunner for retrying when an exception occurs #5881

Merged
merged 3 commits into from
Jun 28, 2024

Conversation

jebriede
Copy link
Contributor

Bug

Fixes: https://github.com/NuGet/Client.Engineering/issues/2866

Regression? Last working version: N/A

Description

Added a retry mechanism that can rerun a function a given number of times if the function throws a specific exception. To retry when any exception is thrown, the caller can simply use Exception as a generic parameter. Added tests for the retry runner.

PR Checklist

  • PR has a meaningful title

  • PR has a linked issue.

  • Described changes

  • Tests

    • Automated tests added
    • OR
    • Test exception
    • OR
    • N/A
  • Documentation

    • Documentation PR or issue filled
    • OR
    • N/A

@jebriede jebriede requested a review from a team as a code owner June 28, 2024 00:47
@jebriede jebriede merged commit cc433c1 into dev Jun 28, 2024
28 checks passed
@jebriede jebriede deleted the dev-jebriede-RetryRunner branch June 28, 2024 14:39
@donnie-msft
Copy link
Contributor

Since this PR was opened after 5pm yesterday and merged before 8am today, I didn't get a chance to review.
Looking at it now, I'm seeing a build for this PR failed, despite the retry.

@jebriede Could you share how you tested this change to find any scenario where the retry works?
Should we reopen the issue and look for more approaches?

This test failed even with the retry, from what I can tell: Dotnet.Integration.Test.PackCommandTests.PackCommand_SuppressDependencies_DoesNotContainAnyDependency(frameworkToSuppress: "net45", expectedInFramework: "netstandard1.3")

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=724244&view=results

@jebriede
Copy link
Contributor Author

jebriede commented Jun 28, 2024

Since this PR was opened after 5pm yesterday and merged before 8am today, I didn't get a chance to review. Looking at it now, I'm seeing a build for this PR failed, despite the retry.

@jebriede Could you share how you tested this change to find any scenario where the retry works? Should we reopen the issue and look for more approaches?

This test failed even with the retry, from what I can tell: Dotnet.Integration.Test.PackCommandTests.PackCommand_SuppressDependencies_DoesNotContainAnyDependency(frameworkToSuppress: "net45", expectedInFramework: "netstandard1.3")

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=724244&view=results

@donnie-msft The retry mechanism doesn't guarantee the tests will pass. The usage in this case is to only handle Timeout exceptions and retry one time. If there's a second timeout, the test will fail. I'm not seeing the build you linked as having encountered a timeout exception, though. Since we're specifically only retrying for timeout exceptions, any other exceptions will not cause the operation to be retried and still cause the test to fail, by design. The retry mechanism I built is flexible, however, and we can consider using it to retry for other exception types, but the task for now is to handle retrying only when encountering TimeoutException.

How it was tested: See the unit tests for the retry mechanism that validate the operation it wraps will be rerun when the specified exception is encountered.

If you see that the retry mechanism is failing to retry the specified number of times when the operation encounters a TimeoutException, then please file a bug against it. Please be sure to verify that the exception is a timeout exception and there's no retry in the logs. Note that a timeout from the second run will still cause the test to fail. We can play with the retry count as needed.

@donnie-msft
Copy link
Contributor

donnie-msft commented Jun 28, 2024

How it was tested: See the unit tests for the retry mechanism that validate the operation it wraps will be rerun when the specified exception is encountered.

Your unit tests test your code, but I'm asking if you tested whether this retry mechanism actually prevented a flakey CI run for any of the command tests?

I'm not seeing the build you linked as having encountered a timeout exception, though.

I see TimeoutException from the pipeline failure I linked:

System.TimeoutException : D:\a_work\1\s.test\work\215d696f\de07e040\dotnet.exe restore ClassLibrary1.csproj -nodereuse:false timed out after 60.00 seconds
image

@zivkan
Copy link
Member

zivkan commented Jun 28, 2024

Your screenshot shows that the test duration was 2 minutes, and given the error message says that the timeout was 60 seconds, this implies that it successfully retried once, but the second attempt also timed out. Although it's suspicious that the retry message wasn't written to the output. It appears that ITestOutputHelper isn't being passed through, at least for this test.

If there are any dotnet.integration.tests that took over 60 seconds, but less than 2 minutes, those are probably examples of tests that would have previously failed, but now will succeed.

@donnie-msft
Copy link
Contributor

donnie-msft commented Jun 29, 2024

If there are any dotnet.integration.tests that took over 60 seconds, but less than 2 minutes, those are probably examples of tests that would have previously failed, but now will succeed.

Yes, this is what I'm asking for as well: Where is an example showing this change actually made an impact on this flakey behavior?

Although it's suspicious that the retry message wasn't written to the output.

The retry does occur, it's not in the screenshot, you have to open the logs. What I'm trying to understand is whether immediately retrying is giving an opportunity for the issue to resolve, because in 2 builds (including the example I linked), it still fails on the 2nd try.

@donnie-msft
Copy link
Contributor

If there are any dotnet.integration.tests that took over 60 seconds, but less than 2 minutes, those are probably examples of tests that would have previously failed, but now will succeed.

Just so happens, I think I've found some that meet those criteria!
As you say, we'll have to assume this executed the retry. I still see some failures after the retry, so we'll still have to investigate these command tests at some point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants