Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue 9195] Evict silos from cluster if they remain in the Joining or Created state for longer than MaxJoinAttemptTime #9201

Conversation

Chris-Eckhardt
Copy link
Contributor

@Chris-Eckhardt Chris-Eckhardt commented Oct 23, 2024

Fixes #9195

In this PR:

  • Adds the flag ClusterMembershipOptions.EvictWhenMaxJoinAttemptTimeExceeded (default to true)
  • Checks MembershipTableSnapshot for silos in a state of Created or Joined for longer than ClusterMembershipOptions.MaxJoinAttemptTime
Microsoft Reviewers: Open in CodeFlow

@Chris-Eckhardt
Copy link
Contributor Author

@dotnet-policy-service agree

@Chris-Eckhardt
Copy link
Contributor Author

Currently looking at what tests I can modify to cover this or If I need to add a new one.

@ReubenBond
Copy link
Member

Currently looking at what tests I can modify to cover this or If I need to add a new one.

ClusterHealthMonitorTests.cs may be a good fit. The test case there is large (too large), but adding a new test case or modifying the existing one is ok

@Chris-Eckhardt
Copy link
Contributor Author

I added a test with numVotesForDeathDeclaration == 1. If I set it to anything higher the test will fail, I need to address that.

@Chris-Eckhardt
Copy link
Contributor Author

I ended up refactoring the tests in ClusterHealthMonitorTests.cs to be able to cover multiple scenarios.

Tests added:

  • EvictWhenMaxJoinAttemptTimeExceeded is disabled
  • NumVotesForDeathDeclaration == 1
  • NumVotesForDeathDeclaration == 2
  • NumVotesForDeathDeclaration == 3

Changes Related to the refactor:

  • Created private ClusterHealthMonitorTestRig class
  • Both ClusterHealthMonitor_BasicScenario_Runner and the new ClusterHealthMonitor_StaleJoinOrCreatedSilos_Runner use this internally now.
        private class ClusterHealthMonitorTestRig(
            MembershipTableManager manager,
            IClusterMembershipService membershipService,
            IOptionsMonitor<ClusterMembershipOptions> optionsMonitor,
            ClusterHealthMonitor.ITestAccessor testAccessor)
        {
            public readonly MembershipTableManager Manager = manager;
            public readonly IClusterMembershipService MembershipService = membershipService;
            public readonly IOptionsMonitor<ClusterMembershipOptions> OptionsMonitor = optionsMonitor;
            public readonly ClusterHealthMonitor.ITestAccessor TestAccessor = testAccessor;
        }

Copy link
Member

@benjaminpetit benjaminpetit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that would help a lot when recovering a failed silo!

@ReubenBond ReubenBond merged commit f086d1b into dotnet:main Nov 7, 2024
16 checks passed
@Chris-Eckhardt Chris-Eckhardt deleted the chris-eckhardt/issue-9195/evict-max-join-time branch November 7, 2024 21:22
@github-actions github-actions bot locked and limited conversation to collaborators Dec 8, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Evict silos from cluster if they remain in the Joining or Created state for longer than MaxJoinAttemptTime
3 participants