Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI]flaky test faiure - org.opensearch.remotestore.RemoteIndexRecoveryIT.testRerouteRecovery #9580

Closed
Gaganjuneja opened this issue Aug 28, 2023 · 9 comments · Fixed by #11918
Assignees
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run Indexing:Replication Issues and PRs related to core replication framework eg segrep Storage Issues and PRs relating to data and metadata storage :test Adding or fixing a test >test-failure Test failure from CI, local build, etc.

Comments

@Gaganjuneja
Copy link
Contributor

Gaganjuneja commented Aug 28, 2023

Seen in #9506 (comment), build No.23062

> Task :server:internalClusterTest

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.remotestore.RemoteIndexRecoveryIT.testRerouteRecovery" -Dtests.seed=8D8D1A98A3C03275 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=zh-Hans-CN -Dtests.timezone=Africa/Nairobi -Druntime.java=20

org.opensearch.remotestore.RemoteIndexRecoveryIT > testRerouteRecovery FAILED
    java.lang.AssertionError: 
    Expected: <1>
         but: was <0>
        at __randomizedtesting.SeedInfo.seed([8D8D1A98A3C03275:2026A9CC96985037]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:964)
        at org.junit.Assert.assertThat(Assert.java:930)
        at org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:528)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1085)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1058)
        at org.opensearch.indices.recovery.IndexRecoveryIT.testRerouteRecovery(IndexRecoveryIT.java:526)
...
@tlfeng
Copy link
Collaborator

tlfeng commented Aug 29, 2023

There is another kind of error message:
Test failure in build No.23515:

> Task :server:internalClusterTest

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.remotestore.RemoteIndexRecoveryIT.testRerouteRecovery" -Dtests.seed=B167119956043DFE -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=cs -Dtests.timezone=America/Porto_Velho -Druntime.java=17

org.opensearch.remotestore.RemoteIndexRecoveryIT > testRerouteRecovery FAILED
    [test-idx-1/3V73pL2FSM-78qiSkgX-vg] IndexNotFoundException[no such index [test-idx-1]]
        at __randomizedtesting.SeedInfo.seed([B167119956043DFE:1CCCA2CD635C5FBC]:0)
        at app//org.opensearch.indices.IndicesService.indexServiceSafe(IndicesService.java:641)
        at app//org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:528)
        at app//org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1085)
        at app//org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1058)
        at app//org.opensearch.indices.recovery.IndexRecoveryIT.testRerouteRecovery(IndexRecoveryIT.java:526)

Another same error message in build No.23672:

> Task :server:internalClusterTest

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.remotestore.RemoteIndexRecoveryIT.testRerouteRecovery" -Dtests.seed=8EFA50BA43FF46D0 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-MX -Dtests.timezone=America/Recife -Druntime.java=17

org.opensearch.remotestore.RemoteIndexRecoveryIT > testRerouteRecovery FAILED
    [test-idx-1/2mnl7-UnTEOqHtM5AL59BA] IndexNotFoundException[no such index [test-idx-1]]
        at __randomizedtesting.SeedInfo.seed([8EFA50BA43FF46D0:2351E3EE76A72492]:0)
        at app//org.opensearch.indices.IndicesService.indexServiceSafe(IndicesService.java:691)
        at app//org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:528)
        at app//org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1085)
        at app//org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1058)
        at app//org.opensearch.indices.recovery.IndexRecoveryIT.testRerouteRecovery(IndexRecoveryIT.java:526)

The test failure occurred very frequently since first found in #9448 (comment), build No.23397 on Aug.24th 5am PDT

@anasalkouz
Copy link
Member

@sachinpkale anyone looking into this?

@anasalkouz anasalkouz added Storage Issues and PRs relating to data and metadata storage and removed untriaged labels Aug 31, 2023
@tlfeng
Copy link
Collaborator

tlfeng commented Aug 31, 2023

Hi @anasalkouz, I noticed there is already a PR created to fix it #9637

@linuxpi
Copy link
Collaborator

linuxpi commented Sep 1, 2023

We started seeing this test fail due to a recent merge on main. I am working on fixing this.

@anasalkouz

@ashking94
Copy link
Member

This is easily reproducible after 20-25 iterations while running the test on loop. I have also been able to root cause the current issue that is leading to the test failure. I am suspecting that there might be some more flakiness later the current assertion failure. I will run the test for multiple iterations to ascertain the same. I will try to share the root cause and raise PR accordingly.

ashking94 added a commit to ashking94/OpenSearch that referenced this issue Jan 18, 2024
@ashking94
Copy link
Member

The issue is happening due to 2 reasons as we have seen the stack trace -

  1. Index not found
org.opensearch.remotestore.RemoteIndexRecoveryIT > testRerouteRecovery FAILED
    [test-idx-1/gUdo_ZgeQSeTKz75x2BNLA] IndexNotFoundException[no such index [test-idx-1]]
        at __randomizedtesting.SeedInfo.seed([216A80F8C219BD32:8CC133ACF741DF70]:0)
        at app//org.opensearch.indices.IndicesService.indexServiceSafe(IndicesService.java:689)
        at app//org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:528)
  1. Assertion failure
java.lang.AssertionError: 
Expected: <1>
     but: was <0>
	at __randomizedtesting.SeedInfo.seed([A66685B0FBDCA162:BCD36E4CE84C320]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.junit.Assert.assertThat(Assert.java:964)
	at org.junit.Assert.assertThat(Assert.java:930)
	at org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:528)

The first issue happens due to cluster state publication cleaning up the index shard on the old primary node between 2 assertion check interval. The current assertBusy has exponential backoff between 2 assertion checks and between these 2 checks, the condition of assertion becomes true and the index shard itself gets cleared from the old node. However, due to high interval, it misses hitting the assertion true condition. The fix for this issue is to have the assertion checks done at fixed interval.

The second issue occurs due to the same reason as above but in this case the peer recovery has completed and changed the state of recovery state of the index shard.

ashking94 added a commit to ashking94/OpenSearch that referenced this issue Jan 18, 2024
gbbafna pushed a commit that referenced this issue Jan 31, 2024
Signed-off-by: Ashish Singh <ssashish@amazon.com>
opensearch-trigger-bot bot pushed a commit that referenced this issue Feb 9, 2024
Signed-off-by: Ashish Singh <ssashish@amazon.com>
(cherry picked from commit c6cebc7)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
sachinpkale pushed a commit that referenced this issue Feb 10, 2024
… (#12275)

(cherry picked from commit c6cebc7)

Signed-off-by: Ashish Singh <ssashish@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
peteralfonsi pushed a commit to peteralfonsi/OpenSearch that referenced this issue Mar 1, 2024
rayshrey pushed a commit to rayshrey/OpenSearch that referenced this issue Mar 18, 2024
@shwetathareja
Copy link
Member

Looks like this testcase was not fixed fully. The same test failed here - #8992 (comment)
Stack trace

java.lang.AssertionError: 
Expected: <1>
     but: was <0>
	at __randomizedtesting.SeedInfo.seed([7F10E4C3CE05B7C8:D2BB5797FB5DD58A]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.junit.Assert.assertThat(Assert.java:964)
	at org.junit.Assert.assertThat(Assert.java:930)
	at org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:527)
	at org.opensearch.test.OpenSearchTestCase.assertBusyWithFixedSleepTime(OpenSearchTestCase.java:1150)
	at org.opensearch.indices.recovery.IndexRecoveryIT.testRerouteRecovery(IndexRecoveryIT.java:525)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at org.opensearch.test.OpenSearchTestClusterRule$1.evaluate(OpenSearchTestClusterRule.java:369)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.base/java.lang.Thread.run(Thread.java:1583)
	Suppressed: java.lang.AssertionError: 
Expected: <1>
     but: was <0>
		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
		at org.junit.Assert.assertThat(Assert.java:964)
		at org.junit.Assert.assertThat(Assert.java:930)
		at org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:527)
		at org.opensearch.test.OpenSearchTestCase.assertBusyWithFixedSleepTime(OpenSearchTestCase.java:1141)
		... 40 more
	Suppressed: java.lang.AssertionError: 
Expected: <1>
     but: was <0>
		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
		at org.junit.Assert.assertThat(Assert.java:964)
		at org.junit.Assert.assertThat(Assert.java:930)
		at org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:527)
		at org.opensearch.test.OpenSearchTestCase.assertBusyWithFixedSleepTime(OpenSearchTestCase.java:1141)
		... 40 more
	Suppressed: java.lang.AssertionError: 
Expected: <1>
     but: was <0>
		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
		at org.junit.Assert.assertThat(Assert.java:964)
		at org.junit.Assert.assertThat(Assert.java:930)
		at org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:527)
		at org.opensearch.test.OpenSearchTestCase.assertBusyWithFixedSleepTime(OpenSearchTestCase.java:1141)
		... 40 more
	Suppressed: java.lang.AssertionError: 
Expected: <1>
     but: was <0>
		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
		at org.junit.Assert.assertThat(Assert.java:964)
		at org.junit.Assert.assertThat(Assert.java:930)
		at org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:527)
		at org.opensearch.test.OpenSearchTestCase.assertBusyWithFixedSleepTime(OpenSearchTestCase.java:1141)
		... 40 more
	Suppressed: java.lang.AssertionError: 
Expected: <1>
     but: was <0>
		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
		at org.junit.Assert.assertThat(Assert.java:964)
		at org.junit.Assert.assertThat(Assert.java:930)
		at org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:527)
		at org.opensearch.test.OpenSearchTestCase.assertBusyWithFixedSleepTime(OpenSearchTestCase.java:1141)
		... 40 more
	Suppressed: java.lang.AssertionError: 
Expected: <1>
     but: was <0>
		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
		at org.junit.Assert.assertThat(Assert.java:964)
		at org.junit.Assert.assertThat(Assert.java:930)
		at org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:527)
		at org.opensearch.test.OpenSearchTestCase.assertBusyWithFixedSleepTime(OpenSearchTestCase.java:1141)
		... 40 more
	Suppressed: java.lang.AssertionError: 
Expected: <1>
     but: was <0>
		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
		at org.junit.Assert.assertThat(Assert.java:964)
		at org.junit.Assert.assertThat(Assert.java:930)
		at org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:527)
		at org.opensearch.test.OpenSearchTestCase.assertBusyWithFixedSleepTime(OpenSearchTestCase.java:1141)
		... 40 more
	Suppressed: java.lang.AssertionError: 
Expected: <1>
     but: was <0>
		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
		at org.junit.Assert.assertThat(Assert.java:964)
		at org.junit.Assert.assertThat(Assert.java:930)
		at org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:527)
		at org.opensearch.test.OpenSearchTestCase.assertBusyWithFixedSleepTime(OpenSearchTestCase.java:1141)
		... 40 more
	Suppressed: java.lang.AssertionError: 
Expected: <1>
     but: was <0>
		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
		at org.junit.Assert.assertThat(Assert.java:964)
		at org.junit.Assert.assertThat(Assert.java:930)
		at org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:527)
		at org.opensearch.test.OpenSearchTestCase.assertBusyWithFixedSleepTime(OpenSearchTestCase.java:1141)
		... 40 more
	Suppressed: java.lang.AssertionError: 
Expected: <1>
     but: was <0>
		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
		at org.junit.Assert.assertThat(Assert.java:964)
		at org.junit.Assert.assertThat(Assert.java:930)
		at org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:527)
		at org.opensearch.test.OpenSearchTestCase.assertBusyWithFixedSleepTime(OpenSearchTestCase.java:1141)
		... 40 more
	Suppressed: java.lang.AssertionError: 
Expected: <1>
     but: was <0>
		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
		at org.junit.Assert.assertThat(Assert.java:964)
		at org.junit.Assert.assertThat(Assert.java:930)
		at org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:527)
		at org.opensearch.test.OpenSearchTestCase.assertBusyWithFixedSleepTime(OpenSearchTestCase.java:1141)
		... 40 more
	Suppressed: java.lang.AssertionError: 
Expected: <1>
     but: was <0>
		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
		at org.junit.Assert.assertThat(Assert.java:964)
		at org.junit.Assert.assertThat(Assert.java:930)
		at org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:527)
		at org.opensearch.test.OpenSearchTestCase.assertBusyWithFixedSleepTime(OpenSearchTestCase.java:1141)
		... 40 more
	Suppressed: java.lang.AssertionError: 
Expected: <1>
     but: was <0>
		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
		at org.junit.Assert.assertThat(Assert.java:964)
		at org.junit.Assert.assertThat(Assert.java:930)
		at org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:527)
		at org.opensearch.test.OpenSearchTestCase.assertBusyWithFixedSleepTime(OpenSearchTestCase.java:1141)
		... 40 more
	Suppressed: java.lang.AssertionError: 
Expected: <1>
     but: was <0>
		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
		at org.junit.Assert.assertThat(Assert.java:964)
		at org.junit.Assert.assertThat(Assert.java:930)
		at org.opensearch.indices.recovery.IndexRecoveryIT.lambda$testRerouteRecovery$1(IndexRecoveryIT.java:527)
		at org.opensearch.test.OpenSearchTestCase.assertBusyWithFixedSleepTime(OpenSearchTestCase.java:1141)
		... 40 more

@shwetathareja
Copy link
Member

@ashking94 it looks like your fix just reduced the likelihood (though significantly) but the test can still fail. As RecoveryStats are just point in time and not cumulative. So race condition can still happen betwen fixed interval

@gbbafna gbbafna removed the untriaged label Apr 4, 2024
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this issue Apr 25, 2024
…roject#9580 (opensearch-project#11918)

Signed-off-by: Ashish Singh <ssashish@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
@reta
Copy link
Collaborator

reta commented Jun 19, 2024

Closing in favour of #14323

@reta reta closed this as completed Jun 19, 2024
@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Storage Project Board Jun 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run Indexing:Replication Issues and PRs related to core replication framework eg segrep Storage Issues and PRs relating to data and metadata storage :test Adding or fixing a test >test-failure Test failure from CI, local build, etc.
Projects
Status: ✅ Done
Development

Successfully merging a pull request may close this issue.

10 participants