
[BUG][Segment Replication] testClusterGreenAfterPartialRelocationNoPreferenceShardMovementPrimaryFirstEnabled test suite timeout #9178

Closed
Poojita-Raj opened this issue Aug 8, 2023 · 3 comments · Fixed by #9420
Labels: bug, flaky-test, Indexing:Replication, v2.10.0

Comments

@Poojita-Raj (Contributor)

Describe the bug
org.opensearch.cluster.routing.ShardMovementStrategyTests.testClusterGreenAfterPartialRelocationNoPreferenceShardMovementPrimaryFirstEnabled is occasionally running into a test suite timeout.

java.lang.Exception: Test abandoned because suite timeout was reached.

The timeout also causes a related thread-leak failure in this test class:

org.opensearch.cluster.routing.allocation.BalancedSingleShardTests.classMethod 

Expected behavior
The test should pass within the suite timeout.


@Poojita-Raj (Contributor, Author)

The timeout in this test is caused by a latch countdown that is never released.
The latch is released only once all of the expected primaries have relocated to the new zone's nodes and are in STARTED state.

In this test case, the deprecated setting SHARD_MOVE_PRIMARY_FIRST_SETTING is enabled, while the new SHARD_MOVEMENT_STRATEGY_SETTING is left at its default, NO_PREFERENCE, which does not force primaries or replicas to move in any particular order.

Under this combination, we respect SHARD_MOVE_PRIMARY_FIRST_SETTING and expect the primary shards to move to the new zone first; a settings sketch follows below.
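
For reference, here is a minimal sketch of the settings combination this test variant exercises, assuming the standard keys behind these two setting constants; the builder snippet is illustrative, not the test's actual setup code:

```java
import org.opensearch.common.settings.Settings;

// Deprecated setting explicitly enabled; the new strategy setting is left at
// its default, NO_PREFERENCE (spelled out here only for clarity).
Settings settings = Settings.builder()
    .put("cluster.routing.allocation.move.primary_first", true)
    .put("cluster.routing.allocation.shard_movement_strategy", "NO_PREFERENCE")
    .build();
```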

However, the latch countdown never happens because some shard is closed prematurely when it is expected to be STARTED on the new zone. The test then times out waiting on the latch and leaks threads into subsequent tests; the wait pattern is sketched below.
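
Roughly, the wait pattern looks like the sketch below; `clusterService`, `isNewZoneNode`, and `expectedPrimaryCount` are hypothetical stand-ins for the test's actual fixtures, not its real code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.opensearch.cluster.routing.ShardRouting;

final CountDownLatch primariesStarted = new CountDownLatch(1);
clusterService.addListener(event -> {
    // Count primaries that have finished relocating and are STARTED on the new zone.
    long startedOnNewZone = event.state().routingTable().allShards().stream()
        .filter(ShardRouting::primary)
        .filter(ShardRouting::started)
        .filter(shard -> isNewZoneNode(shard.currentNodeId())) // hypothetical helper
        .count();
    if (startedOnNewZone == expectedPrimaryCount) {
        primariesStarted.countDown(); // never reached when a shard is closed prematurely
    }
});
// The countdown never fires, so this await blocks until the suite timeout is hit:
assertTrue(primariesStarted.await(60, TimeUnit.SECONDS));
```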

@Poojita-Raj (Contributor, Author)

When applyClusterState is called for a clusterChangedEvent, it calls removeShards to remove any local shards that don't match what the cluster-manager expects.
This deletes a shard that is not allocated: the shard is in RELOCATING state to the target node, but it is no longer present in the source node's routing table, so it is removed from the IndexService, as sketched below.
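
In pseudocode, that removal path looks roughly like the following; this is a paraphrase of the IndicesClusterStateService logic, not the exact OpenSearch source:

```java
// Close any local shard the cluster-manager no longer assigns to this node.
// localRoutingNode is this node's entry in the new cluster state's routing table.
for (final IndexService indexService : indicesService) {
    for (final IndexShard shard : indexService) {
        final ShardId shardId = shard.shardId();
        final ShardRouting newRouting =
            localRoutingNode == null ? null : localRoutingNode.getByShardId(shardId);
        if (newRouting == null) {
            // The shard is RELOCATING to a target node and is already gone from the
            // source node's routing entry, so it is closed locally even though the
            // relocation has not completed.
            indexService.removeShard(shardId.id(), "removing shard (not allocated)");
        }
    }
}
```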

@Poojita-Raj (Contributor, Author)

Issue: throttling of primaries wasn't being recognized when shardMovementStrategy is set to NO_PREFERENCE and movePrimaryFirst is set to true.
Ensuring that replicas don't relocate while primaries are being throttled in this case will fix the issue (sketched below).
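
A hedged sketch of the kind of guard that change implies; see #9420 for the actual diff. `shardsToMove`, `isThrottled`, `movePrimaryFirst`, and `strategy` are hypothetical stand-ins for the balancer's real state:

```java
boolean primariesThrottled = false;
for (final ShardRouting shard : shardsToMove) {
    if (shard.primary()) {
        // Remember whether any primary relocation is currently being throttled
        // (isThrottled is a hypothetical stand-in for the allocation decision).
        if (isThrottled(shard)) {
            primariesThrottled = true;
        }
        // ... attempt the primary relocation ...
    } else if (movePrimaryFirst
        && strategy == ShardMovementStrategy.NO_PREFERENCE
        && primariesThrottled) {
        // Respect move_primary_first: defer replica relocations while any primary
        // is still throttled, so primaries reach the new zone first.
        continue;
    } else {
        // ... relocate the replica ...
    }
}
```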
