Fixing flaky test testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness #3646

imRishN · 2022-06-22T07:08:01Z

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

Description

Caused by #3563

org.opensearch.cluster.allocation.AwarenessAllocationIT > testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness FAILED
    java.lang.AssertionError: unexpected
        at org.opensearch.test.InternalTestCluster.removeExclusions(InternalTestCluster.java:1912)
        at org.opensearch.test.InternalTestCluster.stopNodesAndClients(InternalTestCluster.java:1777)
        at org.opensearch.test.InternalTestCluster.stopNodesAndClient(InternalTestCluster.java:1764)
        at org.opensearch.test.InternalTestCluster.stopRandomNode(InternalTestCluster.java:1672)
        at org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness(AwarenessAllocationIT.java:425)

        Caused by:
        java.util.concurrent.ExecutionException: MasterNotDiscoveredException[null]
            at org.opensearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:286)
            at org.opensearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:273)
            at org.opensearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:104)
            at org.opensearch.test.InternalTestCluster.removeExclusions(InternalTestCluster.java:1910)
            ... 4 more

            Caused by:
            MasterNotDiscoveredException[null]
                at app//org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:282)
                at app//org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394)
                at app//org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294)
                at app//org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:697)
                at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:739)
                at java.base@17.0.3/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
                at java.base@17.0.3/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
                at java.base@17.0.3/java.lang.Thread.run(Thread.java:833)

    MasterNotDiscoveredException[null]
        at app//org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:282)
        at app//org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394)
        at app//org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294)
        at app//org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:697)
        at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:739)
        at java.base@17.0.3/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base@17.0.3/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base@17.0.3/java.lang.Thread.run(Thread.java:833)

Issues Resolved

#3603

Check List

- New functionality includes testing.
  - All tests pass
- New functionality has been documented.
  - New functionality has javadoc added
- Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


opensearch-ci-bot · 2022-06-22T07:44:46Z

✅ Gradle Check success 9b30f1a711d9a06dfa96996513124b21a60f263c
Log 6210

Reports 6210

opensearch-ci-bot · 2022-06-22T07:54:11Z

❌ Gradle Check failure 06b2acd
Log 6211

Reports 6211


opensearch-ci-bot · 2022-06-22T15:38:24Z

❌ Gradle Check failure 7095265
Log 6218

Reports 6218

imRishN · 2022-06-22T16:38:25Z

Ran the test 100 times now. Succeeds every time.

for i in {1..100}
do
    echo "Task $i"
    ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness" -Dtests.seed=CD3B9289D31206B8 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=nl -Dtests.timezone=Asia/Katmandu -Druntime.java=17 --rerun-tasks
    response_code=$?
    echo "Response code $response_code"
    # Treat any non-zero exit code as a failure, not only 1.
    if [[ $response_code -ne 0 ]]; then
        echo "Test $i failed"
        break
    else
        echo "Test $i passed. Sleeping 5 seconds"
        sleep 5
    fi
done

kartg · 2022-06-23T21:05:00Z

Seems like both gradle check failures are from a flaky test - #3650

Refiring.

kartg · 2022-06-23T21:05:25Z

start gradle check

opensearch-ci-bot · 2022-06-23T21:42:22Z

❌ Gradle Check failure 7095265
Log 6267

Reports 6267

dreamer-89 · 2022-06-25T18:38:34Z

> Ran the test 100 times now. Succeeds every time.

Thank you @imRishN for this PR. I appreciate you taking the time to fix this flaky test.

It has previously been observed that a flaky test rarely fails when run in isolation as a single test. I suspect the test would still pass without your fix. Running the entire gradle check will give a better picture, since that is what runs on CI today. Can you give it a try?
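A minimal sketch of the kind of repeated full run suggested here, assuming a hypothetical helper named count_failures that is not part of this PR. It runs any command a fixed number of times and counts the non-zero exits instead of stopping at the first failure, which gives a rough failure rate rather than a single pass/fail answer. The `./gradlew check` invocation in the trailing comment is an assumption; substitute whatever task CI actually runs.

```shell
# Hypothetical helper (not from the PR): run a command N times and print
# how many of those runs exited with a non-zero status.
# Usage: count_failures <runs> <command> [args...]
count_failures() {
    runs=$1
    shift
    failures=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # Send the command's stdout to stderr so that stdout carries
        # only the final failure count.
        if ! "$@" 1>&2; then
            failures=$((failures + 1))
            echo "Run $i failed" >&2
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# Example (assumed invocation):
# count_failures 5 ./gradlew check
```

Counting failures rather than breaking on the first one makes it easier to compare a branch with and without the fix, at the cost of a longer total run time.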

dreamer-89 · 2022-06-25T18:47:31Z

...er/src/internalClusterTest/java/org/opensearch/cluster/allocation/AwarenessAllocationIT.java

@@ -364,18 +364,22 @@ public void testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness() throws E
            .put("cluster.routing.allocation.awareness.force.zone.values", "a,b,c")
            .put("cluster.routing.allocation.load_awareness.skew_factor", "0.0")
            .put("cluster.routing.allocation.load_awareness.provisioned_capacity", Integer.toString(nodeCountPerAZ * 3))
+            .put("cluster.routing.allocation.allow_rebalance", "indices_primaries_active")


@imRishN From the test failure trace (MasterNotDiscoveredException), it is not clear whether this is an actual issue or a flaky one. Can you explain how the existing test was identified as flaky and how the changes here fix it?

imRishN · 2022-06-26

The test now adds a dedicated cluster manager node, whereas previously there was no dedicated cluster manager setup and the test randomly killed half the nodes in a particular zone. I assume MasterNotDiscoveredException was thrown when a stopped node happened to be the active master at that time, which is why the failure only occurred sometimes.

dreamer-89 · 2022-06-27T01:56:25Z

Thanks @imRishN for the clarification.

imRishN · 2022-06-26T18:58:19Z

> Running the entire gradle check will give a better picture, since that is what runs on CI today. Can you give it a try?

The build passes locally.

Bukhtawar · 2022-06-26T19:32:06Z

start gradle check

opensearch-ci-bot · 2022-06-26T20:04:15Z

❌ Gradle Check failure 7095265
Log 6348

Reports 6348

dreamer-89 · 2022-06-27T01:33:29Z

Test (flaky) failure. Tracked in #3579

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testHighWatermarkNotExceeded" -Dtests.seed=3CBC6279C41EB13E -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=cs-CZ -Dtests.timezone=Asia/Dili -Druntime.java=17

org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT > testHighWatermarkNotExceeded FAILED
    java.lang.AssertionError: Mismatching shard routings: []
    Expected: a collection with size <1>
         but: collection size was <0>
        at __randomizedtesting.SeedInfo.seed([3CBC6279C41EB13E:D59D83CB44D878D0]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:964)
        at org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.lambda$assertBusyWithDiskUsageRefresh$5(DiskThresholdDeciderIT.java:362)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1049)
        at org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.assertBusyWithDiskUsageRefresh(DiskThresholdDeciderIT.java:355)
        at org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testHighWatermarkNotExceeded(DiskThresholdDeciderIT.java:188)

dreamer-89 · 2022-06-27T01:33:42Z

start gradle check

opensearch-ci-bot · 2022-06-27T02:04:52Z

✅ Gradle Check success 7095265
Log 6352

Reports 6352

dreamer-89

LGTM!

dreamer-89 · 2022-06-27T01:57:32Z

...er/src/internalClusterTest/java/org/opensearch/cluster/allocation/AwarenessAllocationIT.java

            nodeCountPerAZ,
            Settings.builder().put(commonSettings).put("node.attr.zone", "a").build()
        );
-        List<String> nodes_in_zone_b = internalCluster().startNodes(
+        List<String> nodes_in_zone_b = internalCluster().startDataOnlyNodes(


nit: Looks like nodes_in_zone_b is not used after declaration. In that case, it can be removed.

dreamer-89 · 2022-06-27T01:57:52Z

...er/src/internalClusterTest/java/org/opensearch/cluster/allocation/AwarenessAllocationIT.java

            nodeCountPerAZ,
            Settings.builder().put(commonSettings).put("node.attr.zone", "b").build()
        );
-        List<String> nodes_in_zone_c = internalCluster().startNodes(
+        List<String> nodes_in_zone_c = internalCluster().startDataOnlyNodes(


Same as above

…reness (#3646) * Fixing flaky test org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness by adding dedicated cluster manager node Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

imRishN requested review from a team and reta as code owners June 22, 2022 07:08

Fixing flaky test org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness by adding dedicated cluster manager node

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

06b2acd

imRishN force-pushed the main branch from 9b30f1a to 06b2acd Compare June 22, 2022 07:19

Updating rebalance setting for testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

7095265

imRishN force-pushed the main branch from 4ebb41b to 7095265 Compare June 22, 2022 15:07

dreamer-89 reviewed Jun 25, 2022


dreamer-89 mentioned this pull request Jun 27, 2022

[BUG] Test Failure org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testHighWatermarkNotExceeded #3579

Closed

dreamer-89 approved these changes Jun 27, 2022


Bukhtawar approved these changes Jun 27, 2022


Bukhtawar merged commit 22b42e4 into opensearch-project:main Jun 27, 2022

Poojita-Raj mentioned this pull request Nov 15, 2022

[Meta] Fix random test failures #1715

Closed

37 tasks
