[AUTOCUT] Gradle Check Flaky Test Report for MinimumClusterManagerNodesIT #14289

Open
opensearch-ci-bot opened this issue Jun 13, 2024 · 4 comments
Labels: autocut, Cluster Manager, flaky-test, Storage:Remote, >test-failure

Comments

@opensearch-ci-bot (Collaborator) commented Jun 13, 2024

Flaky Test Report for MinimumClusterManagerNodesIT

Noticed that MinimumClusterManagerNodesIT has flaky tests that failed during post-merge actions.

Details

| Git Reference | Merged Pull Request | Build Details | Test Name |
|---|---|---|---|
| 6049587 | 14040 | 40080 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock<br>org.opensearch.cluster.MinimumClusterManagerNodesIT.classMethod |
| 9675c4f | 14465 | 41398 | org.opensearch.cluster.MinimumClusterManagerNodesIT.classMethod<br>org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 0d780b6 | 15121 | 44058 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 2eb148c | 15677 | 47308 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 3fa710b | 15648 | 46708 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 50f411e | 15582 | 46459 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 725ed36 | 15783 | 47574 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 9cd2635 | 15483 | 45625 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| a05d6d1 | 15905 | 47730 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| b35690c | 14795 | 42953 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| c801270 | 15660 | 46762 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| d56d8c8 | 14489 | 41572 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| fc1bf2c | 15759 | 47451 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 0fc94ca | 13799 | 39263 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 1305002 | 14851 | 43091 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 56d0b76 | 14401 | 41153 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| acc4631 | 14587 | 42139 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| afa479b | 14748 | 42455 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| c89a17c | 13888 | 40047 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 017f7d4 | 15704 | 47021 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 107f0ce | 15867 | 47672 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 1386a9b | 13930 | 39885 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 2e13e9c | 14107 | 40782 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 3a38a6c | 14365 | 41154 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 5919409 | 13945 | 39654 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 67eceaa | 15617 | 46561 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 9014894 | 15401 | 45300 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| dbdc151 | 15589 | 46329 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| f85a58f | 14684 | 43162 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| fabf9bd | 15293 | 44791 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 01c5e56 | 15750 | 47339 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 03b1306 | 15019 | 43633 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 06698dd | 14922 | 43325 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 0c2ff03 | 14230 | 41005 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 11f8d79 | 14716 | 42338 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 1562100 | 15400 | 45193 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 1bee506 | 15227 | 44685 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 201c673 | 14458 | 41410 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 3a1be63 | 14639 | 41976 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 3ddb199 | 15586 | 46999 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 4038a3c | 14074 | 40854 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 468f120 | 15724 | 47162 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 4c7d94c | 14839 | 42902 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 57a597f | 15494 | 45869 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 5bb2e28 | 15200 | 44343 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 64383dd | 14561 | 41702 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 71d122b | 15554 | 46063 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 7650e64 | 14345 | 41056 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 7e7e775 | 14864 | 43002 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 802f2e6 | 14424 | 41253 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 887698d | 15132 | 44111 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 8e32ed7 | 14394 | 41336 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 903784b | 14414 | 41239 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| 913013b | 13948 | 39666 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| a0a7098 | 14884 | 43148 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| a17aea5 | 14710 | 42244 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| a968790 | 15932 | 47815 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| afeddc2 | 14037 | 40910 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| b8c7819 | 12782 | 41391 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| bb013da | 13717 | 39669 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| bde48a7 | 14133 | 40532 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| bf43678 | 13809 | 39614 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| c49eca4 | 13721 | 40576 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| cad81b0 | 15216 | 45855 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| d1cd7a2 | 15512 | 45861 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| d2bc9fc | 15656 | 46780 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| e1a632f | 14340 | 40973 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| e67ced7 | 13784 | 39156 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| edcbfd4 | 14923 | 43290 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| f9d15df | 15715 | 47146 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |
| fef2003 | 15181 | 44308 | org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock |

The other pull requests, besides those involved in post-merge actions, that contain failing tests from the MinimumClusterManagerNodesIT class are:

For more details on the failed tests, refer to the OpenSearch Gradle Check Metrics dashboard.

@andrross (Member) commented:
Adding the Storage:Remote label to this one because I believe it has been traced back to a commit related to that feature. From the original issue:

I believe I have traced this back to the commit that introduced the flakiness: 9119b6d (#9105)

The following command will reliably reproduce the failure for me:

./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock" -Dtests.iters=100

If I check out the commit immediately preceding 9119b6d, the failure does not reproduce.

This is a bit concerning: the commit in question is related to the remote store feature, but MinimumClusterManagerNodesIT does not exercise remote store at all, so it is possible there is a significant regression here.

@dbwiddis (Member) commented:
After mostly just adding some debug logging statements, here is what I'm seeing:

  1. We start out with three nodes: [node_t0, node_t1, node_t2].
  2. We find the set of non-cluster-manager (non-CM) nodes: [node_t1, node_t0].
  3. We shut down the non-CM nodes, leaving [node_t2].
  4. We reuse the local data paths of the two stopped nodes to start new nodes, which keep the same UUIDs.
  5. Most of the time, when the test passes, the new nodes are renamed: node_t0 -> node_t3 and node_t1 -> node_t4.
  6. When the test fails, it is consistently because the (formerly CM) node still thinks it is in a cluster with node_t0 and node_t1, and its cluster state version is two versions behind the other two nodes.
  7. The other two (new) nodes think that node_t2 is the cluster manager, but it hasn't caught up yet.
  8. The second cluster state update is likely the cluster manager assignment, so the root cause is probably the first cluster state update, which is failing on the (formerly cluster manager) node with this assertion:
java.lang.AssertionError: a started primary with non-pending operation term must be in primary mode [test][1], node[ZdcgPV1JSmut1DojEIhCEw], [P], s[STARTED], a[id=Yl7dClDeQ0Ox4vlafvVO_A]
        at __randomizedtesting.SeedInfo.seed([D54CD0A4D377FB88]:0)
        at org.opensearch.index.shard.IndexShard.updateShardState(IndexShard.java:840)
        at org.opensearch.indices.cluster.IndicesClusterStateService.updateShard(IndicesClusterStateService.java:712)
        at org.opensearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:651)
        at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:294)
        at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:626)
        at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:612)
        at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:580)
        at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:503)
        at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:205)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:923)
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283)
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
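
For readers unfamiliar with that assertion, here is a minimal, self-contained sketch of the invariant it appears to enforce, based only on the assertion message above. The class, field, and parameter names are hypothetical; this is not the actual org.opensearch.index.shard.IndexShard code.

```java
// Hypothetical, simplified model of the invariant behind the assertion message
// "a started primary with non-pending operation term must be in primary mode".
// Run with `java -ea` to enable the assertion.
public class PrimaryModeInvariantSketch {

    enum ShardLifecycleState { INITIALIZING, STARTED }

    static void checkInvariant(boolean isPrimaryRouting,
                               ShardLifecycleState state,
                               long pendingPrimaryTerm,
                               long operationPrimaryTerm,
                               boolean engineInPrimaryMode) {
        // "Non-pending operation term" is read here as: the primary term the shard
        // was told to move to has already been applied to operations.
        boolean termApplied = pendingPrimaryTerm == operationPrimaryTerm;

        // If routing says this copy is the primary, it is STARTED, and its term is
        // applied, then the replication machinery must already be in primary mode.
        assert !(isPrimaryRouting && state == ShardLifecycleState.STARTED && termApplied)
                || engineInPrimaryMode
            : "a started primary with non-pending operation term must be in primary mode";
    }

    public static void main(String[] args) {
        // Passing case: started primary, term applied, engine in primary mode.
        checkInvariant(true, ShardLifecycleState.STARTED, 2L, 2L, true);

        // The failure in this issue corresponds to the case below: a started primary
        // whose term is applied but whose engine never switched to primary mode.
        checkInvariant(true, ShardLifecycleState.STARTED, 2L, 2L, false); // AssertionError
    }
}
```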

@dbwiddis (Member) commented:
Adding a flush after the two nodes are randomly dropped seems effective in preventing the flakiness, but it also takes a long time:

client().admin().indices().prepareFlush().execute().actionGet();

Adding a refresh() at this point fails, because there is no cluster manager.

@dbwiddis (Member) commented:
Placing a refresh() between the two node terminations seems to reduce, but not eliminate, the flakiness.

I'm about at the limit of what debug logging can tell me, but I'd suggest someone with knowledge of the linked PR investigate the interaction of that code with the cluster state.
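
To make the placement described in the last two comments concrete, here is a hedged sketch showing where each call would go. The two workarounds were tried separately, not together, and the node-shutdown helpers below are hypothetical placeholders for whatever MinimumClusterManagerNodesIT actually does to stop the two non-cluster-manager nodes; only the refresh/flush calls come from the comments above.

```java
import org.opensearch.test.OpenSearchIntegTestCase;

// Illustrative only: shows WHERE the refresh/flush workarounds would sit relative
// to the two node terminations. This is not the real test code.
public class WorkaroundPlacementSketchIT extends OpenSearchIntegTestCase {

    public void testWorkaroundPlacement() throws Exception {
        stopFirstNonClusterManagerNode();   // hypothetical placeholder

        // Option 1 (this comment): a refresh between the two terminations
        // reduces, but does not eliminate, the flakiness.
        client().admin().indices().prepareRefresh().execute().actionGet();

        stopSecondNonClusterManagerNode();  // hypothetical placeholder

        // Option 2 (previous comment): a flush after both nodes are down
        // prevented the failure in my runs, at the cost of extra test time.
        // A refresh here fails instead, because no cluster manager is elected.
        client().admin().indices().prepareFlush().execute().actionGet();
    }

    // Hypothetical helpers, not part of the real test or the test framework.
    private void stopFirstNonClusterManagerNode() { /* placeholder */ }
    private void stopSecondNonClusterManagerNode() { /* placeholder */ }
}
```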
