Skip to content

Conversation

@guojialiang92
Copy link
Contributor

@guojialiang92 guojialiang92 commented Aug 1, 2025

Description

I found this flakey test in #18890's gradle test.
I investigated the cause and fixed it.

java.lang.AssertionError: unexpected exception type thrown; expected:<java.io.IOException> but was:<java.lang.AssertionError>
	at __randomizedtesting.SeedInfo.seed([6A2AFF70E0EFA082:D601FF6FAC77A86]:0)
	at org.junit.Assert.assertThrows(Assert.java:1020)
	at org.junit.Assert.assertThrows(Assert.java:981)
	at org.opensearch.index.store.remote.utils.TransferManagerTestCase.testOverflowDisabledAsynchronous(TransferManagerTestCase.java:205)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:565)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.base/java.lang.Thread.run(Thread.java:1447)
Caused by: java.lang.AssertionError
	at org.junit.Assert.fail(Assert.java:87)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.junit.Assert.assertNotNull(Assert.java:713)
	at org.junit.Assert.assertNotNull(Assert.java:723)
	at org.opensearch.index.store.remote.utils.TransferManagerTestCase.asyncFetchBlobWithName(TransferManagerTestCase.java:350)
	at org.opensearch.index.store.remote.utils.TransferManagerTestCase.lambda$testOverflowDisabledAsynchronous$0(TransferManagerTestCase.java:205)
	at org.junit.Assert.assertThrows(Assert.java:1001)
	... 40 more

Reproduce

The issue can be reproduced by running the following command, and it occurs approximately once every 3000 runs.

./gradlew ':server:test' --tests 'org.opensearch.index.store.remote.utils.TransferManagerRemoteDirectoryReaderTests.testOverflowDisabledAsynchronous' -Dtests.seed=6A2AFF70E0EFA082 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sk-Latn-SK -Dtests.timezone=America/Indiana/Petersburg -Druntime.java=24

Analysis

The reason is that TransferManager#fetchBlobAsync is an asynchronous operation, and if TransferManager#fetchBlobAsync completes before the assertNotNull, problem will be reproduced.

If a Thread.sleep(1000) is added before the assertNotNull, this issue can be reproduced stably.

private IndexInput asyncFetchBlobWithName(String blobname) throws IOException {
    List<BlobFetchRequest.BlobPart> blobParts = new ArrayList<>();
    blobParts.add(new BlobFetchRequest.BlobPart("blob", 0, EIGHT_MB));
    BlobFetchRequest blobFetchRequest = BlobFetchRequest.builder().fileName(blobname).directory(directory).blobParts(blobParts).build();
    // It will trigger async load
    transferManager.fetchBlobAsync(blobFetchRequest);
    // To reproduce the problem stably, we can add Thread.sleep(1000) before the assertNotNull.
    try {
        Thread.sleep(1000);
    } catch (Exception e) {}
    assertNotNull(fileCache.get(blobFetchRequest.getFilePath()));
    fileCache.decRef(blobFetchRequest.getFilePath());
    // Making the read call to fetch from file cache
    return transferManager.fetchBlob(blobFetchRequest);
}

In summary we should remove assertNotNull.

Solution

After discussing with @andrross , we made the following modifications:

  1. To avoid using Thread.sleep in the test, we modified the return value type of TransferManager#fetchBlobAsync to CompletableFuture<IndexInput>.
  2. Removed redundant logic from test code TransferManagerTestCase#asyncFetchBlobWithName.

Meanwhile, we would like to invite the author (@ajaymovva) to take a look and see if the above understanding is correct.

Related Issues

Resolves #[18850]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 1, 2025

❌ Gradle check result for ff212d3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@guojialiang92 guojialiang92 force-pushed the dev/fix_test_testOverflowDisabledAsynchronous branch from ff212d3 to 3a0dc16 Compare August 1, 2025 13:33
@guojialiang92 guojialiang92 marked this pull request as draft August 1, 2025 14:05
@github-actions
Copy link
Contributor

github-actions bot commented Aug 1, 2025

❌ Gradle check result for 3a0dc16: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@github-actions
Copy link
Contributor

github-actions bot commented Aug 1, 2025

❌ Gradle check result for a7f6100: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@andrross
Copy link
Member

andrross commented Aug 1, 2025

Thanks @guojialiang92. It looks like there's a similar problem here:

transferManager.fetchBlobAsync(blobFetchRequest);
Thread.sleep(2000);
assertEquals(Optional.of(0), Optional.of(fileCache.getRef(blobFetchRequest.getFilePath())));
assertNotNull(fileCache.get(blobFetchRequest.getFilePath()));

That fixed sleep is bad and does not guarantee the async operation completes

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@guojialiang92
Copy link
Contributor Author

@andrross

Thanks @guojialiang92. It looks like there's a similar problem here:

transferManager.fetchBlobAsync(blobFetchRequest);
Thread.sleep(2000);
assertEquals(Optional.of(0), Optional.of(fileCache.getRef(blobFetchRequest.getFilePath())));
assertNotNull(fileCache.get(blobFetchRequest.getFilePath()));

That fixed sleep is bad and does not guarantee the async operation completes

Yes, you are right. We should avoid calls like Thread.sleep, and I have also modified the logic you mentioned.

@andrross
Copy link
Member

andrross commented Aug 1, 2025

@guojialiang92 The other option here is to revert commit 36844f5 until the author can come back and properly fix these test cases. It was only committed a few days ago and should revert cleanly. What do you think?

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@guojialiang92
Copy link
Contributor Author

@andrross

The other option here is to revert commit 36844f5 until the author can come back and properly fix these test cases. It was only committed a few days ago and should revert cleanly. What do you think?

I changed the return type of TransferManager#fetchBlobAsync to CompletableFuture<IndexInput>, and it seems that the test no longer needs to rely on the sleep operation. Do you think this modification is acceptable?

@guojialiang92 guojialiang92 marked this pull request as ready for review August 1, 2025 16:32
@github-actions
Copy link
Contributor

github-actions bot commented Aug 1, 2025

✅ Gradle check result for 51dd56c: SUCCESS

@codecov
Copy link

codecov bot commented Aug 1, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.77%. Comparing base (9b22c9b) to head (89cb38e).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18891      +/-   ##
============================================
- Coverage     72.77%   72.77%   -0.01%     
- Complexity    68690    68724      +34     
============================================
  Files          5582     5587       +5     
  Lines        315456   315716     +260     
  Branches      45778    45817      +39     
============================================
+ Hits         229568   229749     +181     
- Misses        67290    67358      +68     
- Partials      18598    18609      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@github-actions
Copy link
Contributor

github-actions bot commented Aug 2, 2025

✅ Gradle check result for 89cb38e: SUCCESS

@andrross andrross merged commit e22b412 into opensearch-project:main Aug 2, 2025
36 of 37 checks passed
sunqijun1 pushed a commit to sunqijun1/OpenSearch that referenced this pull request Aug 4, 2025
…18891)

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
Signed-off-by: sunqijun.jun <sunqijun.jun@bytedance.com>
tandonks pushed a commit to tandonks/OpenSearch that referenced this pull request Aug 5, 2025
…18891)

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
vinaykpud pushed a commit to vinaykpud/OpenSearch that referenced this pull request Sep 26, 2025
…18891)

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants