[BUG] testCoordinatingPrimaryThreadedUpdateToShardLimitsAndRejections is flaky #5157

Closed
dblock opened this issue Nov 8, 2022 · 5 comments
Labels
bug (Something isn't working), flaky-test (Random test failure that succeeds on second run)

Comments

@dblock
Member

dblock commented Nov 8, 2022

Describe the bug

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testCoordinatingPrimaryThreadedUpdateToShardLimitsAndRejections" -Dtests.seed=3AA0904AC11EAEA5 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=uk-UA -Dtests.timezone=America/Argentina/Jujuy -Druntime.java=19

org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests > testCoordinatingPrimaryThreadedUpdateToShardLimitsAndRejections FAILED
    java.lang.AssertionError: expected null, but was:<org.opensearch.index.stats.IndexingPressurePerShardStats@3a0179b7>
        at __randomizedtesting.SeedInfo.seed([3AA0904AC11EAEA5:BC95A9DD94E04AA8]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotNull(Assert.java:756)
        at org.junit.Assert.assertNull(Assert.java:738)
        at org.junit.Assert.assertNull(Assert.java:748)
        at org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testCoordinatingPrimaryThreadedUpdateToShardLimitsAndRejections(ShardIndexingPressureConcurrentExecutionTests.java:274)

#5143

@dblock added the bug and flaky-test labels on Nov 8, 2022
@andrross
Member

andrross commented Nov 9, 2022

This test and its failure are very similar to those in #4212, and I suspect the same fix will apply to both. Here are the notes I've collected so far:

The test expects shardIndexingPressure.shardStats().getIndexingPressureShardStats(shardId1) to return null.

In theory, this close call should clean up everything being tracked, which would make that assertion true. My only suspect is that the close cleans up the tracked stats conditionally. There may be a race that prevents the values tested in that condition from reaching zero, though I haven't figured out how that could happen.

I can get this to fail locally somewhat reliably. However, if it fails, it always seems to fail on the first attempt. If I run it many times the retries never seem to fail. This suggests to me it is some sort of race condition.
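
To make the suspicion above concrete, here is a minimal sketch of conditional cleanup. It is not the OpenSearch code: `ShardTracker`, `ShardStats`, `markStarted`, and the `inFlightBytes` counter are hypothetical names chosen for illustration, and the real close logic may check different values. The sketch only shows that a stats entry survives a close whenever the tracked counter has not reached zero at the moment of the check.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; NOT the OpenSearch ShardIndexingPressure implementation.
class ConditionalCleanupSketch {

    /** Per-shard counter standing in for the tracked stats object. */
    static final class ShardStats {
        final AtomicLong inFlightBytes = new AtomicLong();
    }

    /** Tracks stats per shard and drops an entry only once its counter is zero. */
    static final class ShardTracker {
        private final ConcurrentHashMap<String, ShardStats> store = new ConcurrentHashMap<>();

        /** Records a started operation; the returned Runnable plays the role of close(). */
        Runnable markStarted(String shardId, long bytes) {
            ShardStats stats = store.computeIfAbsent(shardId, id -> new ShardStats());
            stats.inFlightBytes.addAndGet(bytes);
            return () -> {
                long remaining = stats.inFlightBytes.addAndGet(-bytes);
                // Conditional cleanup: remove the entry only if nothing is in flight.
                if (remaining == 0) {
                    store.remove(shardId, stats);
                }
            };
        }

        /** Returns the tracked stats, or null once the entry has been cleaned up. */
        ShardStats stats(String shardId) {
            return store.get(shardId);
        }
    }

    public static void main(String[] args) {
        ShardTracker tracker = new ShardTracker();
        Runnable closeA = tracker.markStarted("shard-1", 100);
        Runnable closeB = tracker.markStarted("shard-1", 50);

        closeA.run();
        // 50 bytes are still in flight, so the entry is kept.
        System.out.println(tracker.stats("shard-1") != null); // prints "true"

        closeB.run();
        // The counter reached zero, so the entry was removed.
        System.out.println(tracker.stats("shard-1") == null); // prints "true"
    }
}
```

The example runs single-threaded on purpose, so it demonstrates the condition rather than the race. With many threads, any interleaving that leaves the checked value non-zero when the last close runs its check would keep the entry alive, and an `assertNull` on the stats would then fail in the way shown in the stack trace.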

@Rishikesh1159
Member

Similar to the flaky test in #4212, this test only fails when rejectionCount is equal to NUM_THREADS. A possible fix/workaround would be similar to this comment. I need @getsaurabh02's opinion on this to move forward with a fix.

@getsaurabh02
Member

Thanks @Rishikesh1159. Could you please check #4212 (comment)?

@Rishikesh1159
Member

Closing this as #5439 is merged, which fixes the issue. Please feel free to reopen if this test fails again.
