[BUG] Flaky test org.opensearch.discovery.ClusterManagerDisruptionIT.testIsolateClusterManagerAndVerifyClusterStateConsensus #12095

Closed
Poojita-Raj opened this issue Jan 30, 2024 · 5 comments
Labels
bug Something isn't working Cluster Manager flaky-test Random test failure that succeeds on second run

Comments

@Poojita-Raj
Contributor

Poojita-Raj commented Jan 30, 2024

Describe the bug

Found this flaky test on gradle check:
org.opensearch.discovery.ClusterManagerDisruptionIT.testIsolateClusterManagerAndVerifyClusterStateConsensus

https://build.ci.opensearch.org/job/gradle-check/32905/

Related component

Cluster Manager

To Reproduce

May show up when running gradle check

Expected behavior

Expect test to pass

Additional Details

Additional context

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.discovery.ClusterManagerDisruptionIT.testIsolateClusterManagerAndVerifyClusterStateConsensus" -Dtests.seed=CA2671C1619AE45E -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=da-DK -Dtests.timezone=America/Lima -Druntime.java=21

org.opensearch.discovery.ClusterManagerDisruptionIT > testIsolateClusterManagerAndVerifyClusterStateConsensus FAILED
    java.lang.AssertionError
        at __randomizedtesting.SeedInfo.seed([CA2671C1619AE45E:7556C7CBC3B6BC4F]:0)
        at org.junit.Assert.fail(Assert.java:87)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.junit.Assert.assertTrue(Assert.java:53)
        at org.opensearch.discovery.ClusterManagerDisruptionIT.lambda$testIsolateClusterManagerAndVerifyClusterStateConsensus$0(ClusterManagerDisruptionIT.java:204)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1089)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1062)
        at org.opensearch.discovery.ClusterManagerDisruptionIT.testIsolateClusterManagerAndVerifyClusterStateConsensus(ClusterManagerDisruptionIT.java:167)
@Poojita-Raj Poojita-Raj added bug Something isn't working untriaged labels Jan 30, 2024
@peternied peternied added flaky-test Random test failure that succeeds on second run Build Build Tasks/Gradle Plugin, groovy scripts, build tools, Javadoc enforcement. labels Jan 31, 2024
@peternied
Member

[Triage - attendees 1 2 3 4 5 6 7 8]
@Poojita-Raj thanks for filing, we'd gladly review a pull request

@shwetathareja
Member

shwetathareja commented Mar 27, 2024

I am able to consistently reproduce the test failure with:

./gradlew ':server:internalClusterTest' --tests "org.opensearch.discovery.ClusterManagerDisruptionIT.testIsolateClusterManagerAndVerifyClusterStateConsensus" -Dtests.seed=CA2671C1619AE45E -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=da-DK -Dtests.timezone=America/Lima

The issue is in this part of the test:

ClusterStateStats clusterStateStats = internalCluster().clusterService().getClusterManagerService().getClusterStateStats();
assertTrue(clusterStateStats.getUpdateFailed() > 0);

Because no specific node is passed to internalCluster().clusterService(), the call picks a random node, which is what makes the result nondeterministic.

isolatedNode should be passed here, as internalCluster().clusterService(isolatedNode), so that the test deterministically checks that node's failed cluster state update count. With this change the test passes consistently locally. I will raise a PR.
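
For readers following along, here is a minimal, self-contained sketch of why the random pick makes the assertion flaky. It is a toy model in plain Java, not OpenSearch code: ToyCluster, updateFailedOnRandomNode, and updateFailedOn are hypothetical stand-ins for internalCluster().clusterService() and internalCluster().clusterService(isolatedNode), and the node names and counts are invented for illustration. It assumes only the isolated node has recorded a failed update.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Toy model (assumption, not OpenSearch code): every node tracks its own
// count of failed cluster state updates.
public class IsolatedNodeStatsSketch {

    // Hypothetical stand-in for a test cluster keyed by node name.
    static final class ToyCluster {
        private final Map<String, Long> updateFailedByNode = new LinkedHashMap<>();
        private final Random random;

        ToyCluster(long seed) {
            this.random = new Random(seed);
        }

        void addNode(String name, long updateFailed) {
            updateFailedByNode.put(name, updateFailed);
        }

        // Mirrors the lookup without a node name: any node may be picked.
        long updateFailedOnRandomNode() {
            List<String> names = List.copyOf(updateFailedByNode.keySet());
            String picked = names.get(random.nextInt(names.size()));
            return updateFailedByNode.get(picked);
        }

        // Mirrors the lookup with an explicit node name.
        long updateFailedOn(String node) {
            Long count = updateFailedByNode.get(node);
            if (count == null) {
                throw new IllegalArgumentException("unknown node: " + node);
            }
            return count;
        }
    }

    // Builds a three-node cluster where only the isolated node saw a failure.
    static ToyCluster clusterWithIsolatedNode(long seed) {
        ToyCluster cluster = new ToyCluster(seed);
        cluster.addNode("node-a", 0L);
        cluster.addNode("node-b", 0L);
        cluster.addNode("isolated-node", 1L);
        return cluster;
    }

    public static void main(String[] args) {
        int misses = 0;
        int runs = 100;
        for (long seed = 0; seed < runs; seed++) {
            ToyCluster cluster = clusterWithIsolatedNode(seed);
            // "updateFailed > 0" on a random node depends on which node was picked.
            if (cluster.updateFailedOnRandomNode() == 0) {
                misses++;
            }
            // The same check against the named node holds for every seed.
            if (cluster.updateFailedOn("isolated-node") <= 0) {
                throw new AssertionError("targeted lookup should always see the failure");
            }
        }
        System.out.println("random lookup missed the isolated node in " + misses + " of " + runs + " runs");
    }
}
```

In this model the targeted lookup is independent of the seed, while the random lookup succeeds only when the seed happens to select the isolated node, which matches the seed-dependent failure reported above.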

@shwetathareja shwetathareja self-assigned this Mar 27, 2024
@peternied peternied removed the Build Build Tasks/Gradle Plugin, groovy scripts, build tools, Javadoc enforcement. label Jul 30, 2024
@rajiv-kv rajiv-kv moved this from 🆕 New to Later (6 months plus) in Cluster Manager Project Board Dec 19, 2024
@rajiv-kv
Contributor

[Triage Attendees - 1, 2, 3]
The test fails when verifying the failed-update stats on the isolated node.
Next Steps

  • Verify if the cluster state is guaranteed to be updated after the isolation

@jaideep-m

Q: Verify if the cluster state is guaranteed to be updated after the isolation

Yes, it's guaranteed that the cluster state will be updated after the isolation.

I tried reproducing the issue by running the test 1000 times. It's not reproducing now.

@github-project-automation github-project-automation bot moved this from Next (Next Quarter) to ✅ Done in Cluster Manager Project Board Jan 21, 2025
@rajiv-kv
Contributor

Thanks @jaideep-m - this looks fixed now that isolatedNode is passed as a parameter.

Projects
Status: ✅ Done
Development

No branches or pull requests

5 participants