
[BUG] org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testReplicaAlreadyAtCheckpoint is flaky #11255

Closed

kasundra07 opened this issue Nov 17, 2023 · 2 comments

Labels: bug (Something isn't working), flaky-test (Random test failure that succeeds on second run)

@kasundra07 (Contributor) commented Nov 17, 2023

Describe the bug
The test case org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testReplicaAlreadyAtCheckpoint is flaky:

java.lang.AssertionError:  inconsistent generation 
	at __randomizedtesting.SeedInfo.seed([138297ACFE85FB99]:0)
	at org.opensearch.index.translog.transfer.TranslogCheckpointTransferSnapshot$Builder.build(TranslogCheckpointTransferSnapshot.java:180)
	at org.opensearch.index.translog.RemoteFsTranslog.upload(RemoteFsTranslog.java:338)
	at org.opensearch.index.translog.RemoteFsTranslog.prepareAndUpload(RemoteFsTranslog.java:310)
	at org.opensearch.index.translog.RemoteFsTranslog.sync(RemoteFsTranslog.java:365)
	at org.opensearch.index.translog.InternalTranslogManager.syncTranslog(InternalTranslogManager.java:196)
	at org.opensearch.index.engine.InternalEngine.syncTranslog(InternalEngine.java:610)
	at org.opensearch.index.shard.IndexShard.postActivatePrimaryMode(IndexShard.java:3449)
	at org.opensearch.index.shard.IndexShard.lambda$updateShardState$4(IndexShard.java:727)
	at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4052)
	at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4022)
	at org.opensearch.index.shard.IndexShard.lambda$asyncBlockOperations$37(IndexShard.java:3973)
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
	at org.opensearch.index.shard.IndexShardOperationPermits$1.doRun(IndexShardOperationPermits.java:157)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
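
For context on what "inconsistent generation" means here, the following is a minimal illustrative sketch, not the actual OpenSearch source: it assumes the transfer-snapshot builder records the translog generation when the sync starts and asserts that the files it collected still agree with it when build() runs. All class, field, and value names below are hypothetical.

```java
// Illustrative sketch only -- not the OpenSearch implementation.
// Assumption: the snapshot builder captures an expected generation up front and
// later checks that the translog/checkpoint files it gathered still match it.
import java.util.List;

class TranslogTransferSnapshotSketch {
    private final long expectedGeneration;          // hypothetical: generation captured at sync time
    private final List<Long> collectedGenerations;  // hypothetical: generations of files added to the snapshot

    TranslogTransferSnapshotSketch(long expectedGeneration, List<Long> collectedGenerations) {
        this.expectedGeneration = expectedGeneration;
        this.collectedGenerations = collectedGenerations;
    }

    void build() {
        // If a concurrent roll/trim advances the translog generation between snapshot
        // creation and build(), the recorded values diverge and the assertion trips
        // with the "inconsistent generation" message seen in the stack trace above.
        long highest = collectedGenerations.stream()
            .mapToLong(Long::longValue)
            .max()
            .orElse(expectedGeneration);
        assert highest == expectedGeneration : "inconsistent generation";
    }

    public static void main(String[] args) {
        // Example: a file from generation 6 sneaks into a snapshot expected at generation 5.
        new TranslogTransferSnapshotSketch(5L, List.of(5L, 6L)).build(); // fails when run with -ea
    }
}
```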

To Reproduce

./gradlew ':server:internalClusterTest' --tests "org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testReplicaAlreadyAtCheckpoint" -Dtests.seed=138297ACFE85FB99

Expected behavior
The test must always pass

Additional context
https://build.ci.opensearch.org/job/gradle-check/30101/testReport/junit/org.opensearch.remotestore/SegmentReplicationUsingRemoteStoreIT/testReplicaAlreadyAtCheckpoint/

kasundra07 added the bug (Something isn't working) and untriaged labels on Nov 17, 2023
@mch2 (Member) commented Nov 17, 2023

This cause looks identical to #10902. @kasundra07, was this build on 2.x or main?

The fix for this is in main but had not yet landed in 2.x: #11168.

Edit: just merged the fix to 2.x.

@kartg (Member) commented Dec 27, 2023

It appears that the fix referenced above has resolved the flakiness here. I was unable to repro the inconsistent generation issue even after repeatedly running the repro command from this run several thousand times:

./gradlew ':server:internalClusterTest' --tests "org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testReplicaAlreadyAtCheckpoint" -Dtests.seed=138297ACFE85FB99 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=lt -Dtests.timezone=Europe/Warsaw -Druntime.java=21
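
For reference, a run of that shape can also be repeated in a single invocation with the randomized-testing -Dtests.iters parameter mentioned below; the iteration count here is only an illustrative choice:

./gradlew ':server:internalClusterTest' --tests "org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testReplicaAlreadyAtCheckpoint" -Dtests.iters=100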

Also, no open items reference this flaky failure. Note that when using the -Dtests.iters parameter to repeatedly run the test, I did run into other failures, though those are unrelated to the flakiness tracked in this issue.

I'm opting to close this issue as a fixed flaky test. Please reopen if this signature resurfaces.
