Make use of a retry mechanism in the CommandRunner for retrying when an exception occurs #5881

jebriede · 2024-06-28T00:47:02Z

Bug

Fixes: https://github.com/NuGet/Client.Engineering/issues/2866

Regression? Last working version: N/A

Description

Added a retry mechanism that can rerun a function a given number of times if the function throws a specific exception. To retry when any exception is thrown, the caller can simply use Exception as a generic parameter. Added tests for the retry runner.

PR Checklist

… exception occurs, to retry the function

donnie-msft · 2024-06-28T22:46:08Z

Since this PR was opened after 5pm yesterday and merged before 8am today, I didn't get a chance to review.
Looking at it now, I'm seeing a build for this PR failed, despite the retry.

@jebriede Could you share how you tested this change to find any scenario where the retry works?
Should we reopen the issue and look for more approaches?

This test failed even with the retry, from what I can tell: Dotnet.Integration.Test.PackCommandTests.PackCommand_SuppressDependencies_DoesNotContainAnyDependency(frameworkToSuppress: "net45", expectedInFramework: "netstandard1.3")

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=724244&view=results

jebriede · 2024-06-28T23:08:21Z

Since this PR was opened after 5pm yesterday and merged before 8am today, I didn't get a chance to review. Looking at it now, I'm seeing a build for this PR failed, despite the retry.

@jebriede Could you share how you tested this change to find any scenario where the retry works? Should we reopen the issue and look for more approaches?

This test failed even with the retry, from what I can tell: Dotnet.Integration.Test.PackCommandTests.PackCommand_SuppressDependencies_DoesNotContainAnyDependency(frameworkToSuppress: "net45", expectedInFramework: "netstandard1.3")

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=724244&view=results

@donnie-msft The retry mechanism doesn't guarantee the tests will pass. The usage in this case is to only handle Timeout exceptions and retry one time. If there's a second timeout, the test will fail. I'm not seeing the build you linked as having encountered a timeout exception, though. Since we're specifically only retrying for timeout exceptions, any other exceptions will not cause the operation to be retried and still cause the test to fail, by design. The retry mechanism I built is flexible, however, and we can consider using it to retry for other exception types, but the task for now is to handle retrying only when encountering TimeoutException.

How it was tested: See the unit tests for the retry mechanism that validate the operation it wraps will be rerun when the specified exception is encountered.

If you see that the retry mechanism is failing to retry the specified number of times when the operation encounters a TimeoutException, then please file a bug against it. Please be sure to verify that the exception is a timeout exception and there's no retry in the logs. Note that a timeout from the second run will still cause the test to fail. We can play with the retry count as needed.

donnie-msft · 2024-06-28T23:49:48Z

How it was tested: See the unit tests for the retry mechanism that validate the operation it wraps will be rerun when the specified exception is encountered.

Your unit tests test your code, but I'm asking if you tested whether this retry mechanism actually prevented a flakey CI run for any of the command tests?

I'm not seeing the build you linked as having encountered a timeout exception, though.

I see TimeoutException from the pipeline failure I linked:

System.TimeoutException : D:\a_work\1\s.test\work\215d696f\de07e040\dotnet.exe restore ClassLibrary1.csproj -nodereuse:false timed out after 60.00 seconds

zivkan · 2024-06-28T23:54:24Z

Your screenshot shows that the test duration was 2 minutes, and given the error message says that the timeout was 60 seconds, this implies that it successfully retried once, but the second attempt also timed out. Although it's suspicious that the retry message wasn't written to the output. It appears that ITestOutputHelper isn't being passed through, at least for this test.

If there are any dotnet.integration.tests that took over 60 seconds, but less than 2 minutes, those are probably examples of tests that would have previously failed, but now will succeed.

donnie-msft · 2024-06-29T00:17:25Z

If there are any dotnet.integration.tests that took over 60 seconds, but less than 2 minutes, those are probably examples of tests that would have previously failed, but now will succeed.

Yes, this is what I'm asking for as well: Where is an example showing this change actually made an impact on this flakey behavior?

Although it's suspicious that the retry message wasn't written to the output.

The retry does occur, it's not in the screenshot, you have to open the logs. What I'm trying to understand is whether immediately retrying is giving an opportunity for the issue to resolve, because in 2 builds (including the example I linked), it still fails on the 2nd try.

donnie-msft · 2024-06-29T00:37:09Z

If there are any dotnet.integration.tests that took over 60 seconds, but less than 2 minutes, those are probably examples of tests that would have previously failed, but now will succeed.

Just so happens, I think I've found some that meet those criteria!
As you say, we'll have to assume this executed the retry. I still see some failures after the retry, so we'll still have to investigate these command tests at some point.

jebriede added 2 commits June 27, 2024 16:56

WIP: Add a retry mechanism that can be used by test utilities when an…

af40d54

… exception occurs, to retry the function

Add test to verify only specified exception retries operation

ff48e64

jebriede requested a review from a team as a code owner June 28, 2024 00:47

Added summary comment and renamed parameter

3acd5fb

jeffkl approved these changes Jun 28, 2024

View reviewed changes

jebriede merged commit cc433c1 into dev Jun 28, 2024
28 checks passed

jebriede deleted the dev-jebriede-RetryRunner branch June 28, 2024 14:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Make use of a retry mechanism in the CommandRunner for retrying when an exception occurs #5881

Make use of a retry mechanism in the CommandRunner for retrying when an exception occurs #5881

Uh oh!

jebriede commented Jun 28, 2024

Uh oh!

Uh oh!

donnie-msft commented Jun 28, 2024

Uh oh!

jebriede commented Jun 28, 2024 •

edited

Loading

Uh oh!

donnie-msft commented Jun 28, 2024 •

edited

Loading

Uh oh!

zivkan commented Jun 28, 2024

Uh oh!

donnie-msft commented Jun 29, 2024 •

edited

Loading

Uh oh!

donnie-msft commented Jun 29, 2024

Uh oh!

Uh oh!

Make use of a retry mechanism in the CommandRunner for retrying when an exception occurs #5881

Make use of a retry mechanism in the CommandRunner for retrying when an exception occurs #5881

Uh oh!

Conversation

jebriede commented Jun 28, 2024

Bug

Description

PR Checklist

Uh oh!

Uh oh!

donnie-msft commented Jun 28, 2024

Uh oh!

jebriede commented Jun 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

donnie-msft commented Jun 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zivkan commented Jun 28, 2024

Uh oh!

donnie-msft commented Jun 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

donnie-msft commented Jun 29, 2024

Uh oh!

Uh oh!

jebriede commented Jun 28, 2024 •

edited

Loading

donnie-msft commented Jun 28, 2024 •

edited

Loading

donnie-msft commented Jun 29, 2024 •

edited

Loading