
[BUG][Segment Replication] testClusterGreenAfterPartialRelocationNoPreferenceShardMovementPrimaryFirstEnabled test suite timeout #9178

Closed
Poojita-Raj opened this issue Aug 8, 2023 · 3 comments · Fixed by #9420
Labels: bug, flaky-test, Indexing:Replication, v2.10.0

Comments

@Poojita-Raj (Contributor)

Describe the bug
org.opensearch.cluster.routing.ShardMovementStrategyTests.testClusterGreenAfterPartialRelocationNoPreferenceShardMovementPrimaryFirstEnabled is occasionally running into a test suite timeout.

java.lang.Exception: Test abandoned because suite timeout was reached.

The timeout also causes a related thread-leak failure in this test class:

org.opensearch.cluster.routing.allocation.BalancedSingleShardTests.classMethod 

Expected behavior
The test should pass within the suite timeout.


@Poojita-Raj (Contributor, Author)

The timeout in this test is caused by a latch countdown that is never released.
The latch is released only once all of the expected primaries have relocated to the new zone's nodes and are in STARTED state.

In this test case, the deprecated setting SHARD_MOVE_PRIMARY_FIRST_SETTING is enabled, while the new SHARD_MOVEMENT_STRATEGY_SETTING is left at its default, NO_PREFERENCE, which does not force primaries or replicas to move in any particular order.

Under this combination, we respect SHARD_MOVE_PRIMARY_FIRST_SETTING and expect the primary shards to move to the new zone first; a settings sketch follows below.
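
For reference, here is a minimal sketch of the settings combination this test variant exercises, assuming the standard keys behind these two setting constants; the builder snippet is illustrative, not the test's actual setup code:

```java
import org.opensearch.common.settings.Settings;

// Deprecated setting explicitly enabled; the new strategy setting is left at
// its default, NO_PREFERENCE (spelled out here only for clarity).
Settings settings = Settings.builder()
    .put("cluster.routing.allocation.move.primary_first", true)
    .put("cluster.routing.allocation.shard_movement_strategy", "NO_PREFERENCE")
    .build();
```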

However, the latch countdown never happens because some shard is closed prematurely when it is expected to be STARTED on the new zone. The test then times out waiting on the latch and leaks threads into subsequent tests; the wait pattern is sketched below.
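
Roughly, the wait pattern looks like the sketch below; `clusterService`, `isNewZoneNode`, and `expectedPrimaryCount` are hypothetical stand-ins for the test's actual fixtures, not its real code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.opensearch.cluster.routing.ShardRouting;

final CountDownLatch primariesStarted = new CountDownLatch(1);
clusterService.addListener(event -> {
    // Count primaries that have finished relocating and are STARTED on the new zone.
    long startedOnNewZone = event.state().routingTable().allShards().stream()
        .filter(ShardRouting::primary)
        .filter(ShardRouting::started)
        .filter(shard -> isNewZoneNode(shard.currentNodeId())) // hypothetical helper
        .count();
    if (startedOnNewZone == expectedPrimaryCount) {
        primariesStarted.countDown(); // never reached when a shard is closed prematurely
    }
});
// The countdown never fires, so this await blocks until the suite timeout is hit:
assertTrue(primariesStarted.await(60, TimeUnit.SECONDS));
```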

@Poojita-Raj (Contributor, Author)

When applyClusterState is called for a clusterChangedEvent, it calls removeShards to remove any local shards that don't match what the cluster-manager expects.
This deletes a shard that is not allocated: the shard is in RELOCATING state to the target node, but it is no longer present in the source node's routing table, so it is removed from the IndexService, as sketched below.
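
In pseudocode, that removal path looks roughly like the following; this is a paraphrase of the IndicesClusterStateService logic, not the exact OpenSearch source:

```java
// Close any local shard the cluster-manager no longer assigns to this node.
// localRoutingNode is this node's entry in the new cluster state's routing table.
for (final IndexService indexService : indicesService) {
    for (final IndexShard shard : indexService) {
        final ShardId shardId = shard.shardId();
        final ShardRouting newRouting =
            localRoutingNode == null ? null : localRoutingNode.getByShardId(shardId);
        if (newRouting == null) {
            // The shard is RELOCATING to a target node and is already gone from the
            // source node's routing entry, so it is closed locally even though the
            // relocation has not completed.
            indexService.removeShard(shardId.id(), "removing shard (not allocated)");
        }
    }
}
```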

@Poojita-Raj (Contributor, Author)

Issue: throttling of primaries wasn't being recognized when shardMovementStrategy is set to NO_PREFERENCE and movePrimaryFirst is set to true.
Ensuring that replicas don't relocate while primaries are being throttled in this case will fix the issue (sketched below).
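
A hedged sketch of the kind of guard that change implies; see #9420 for the actual diff. `shardsToMove`, `isThrottled`, `movePrimaryFirst`, and `strategy` are hypothetical stand-ins for the balancer's real state:

```java
boolean primariesThrottled = false;
for (final ShardRouting shard : shardsToMove) {
    if (shard.primary()) {
        // Remember whether any primary relocation is currently being throttled
        // (isThrottled is a hypothetical stand-in for the allocation decision).
        if (isThrottled(shard)) {
            primariesThrottled = true;
        }
        // ... attempt the primary relocation ...
    } else if (movePrimaryFirst
        && strategy == ShardMovementStrategy.NO_PREFERENCE
        && primariesThrottled) {
        // Respect move_primary_first: defer replica relocations while any primary
        // is still throttled, so primaries reach the new zone first.
        continue;
    } else {
        // ... relocate the replica ...
    }
}
```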
