[BUG] org.opensearch.remotestore.SegmentReplicationRemoteStoreIT.testReplicaHasDiffFilesThanPrimary is flaky #7643

Closed
sachinpkale opened this issue May 19, 2023 · 5 comments · Fixed by #8863 or #8912
Assignees
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run Storage:Durability Issues and PRs related to the durability framework >test-failure Test failure from CI, local build, etc.

Comments

@sachinpkale
Member

Describe the bug

org.opensearch.remotestore.SegmentReplicationRemoteStoreIT.testReplicaHasDiffFilesThanPrimary is flaky

  2> java.lang.AssertionError: timed out waiting for green state
        at org.junit.Assert.fail(Assert.java:89)
        at org.opensearch.test.OpenSearchIntegTestCase.ensureColor(OpenSearchIntegTestCase.java:1002)
        at org.opensearch.test.OpenSearchIntegTestCase.ensureGreen(OpenSearchIntegTestCase.java:933)
        at org.opensearch.test.OpenSearchIntegTestCase.ensureGreen(OpenSearchIntegTestCase.java:922)
        at org.opensearch.indices.replication.SegmentReplicationIT.testReplicaHasDiffFilesThanPrimary(SegmentReplicationIT.java:780)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
        at java.base/java.lang.reflect.Method.invoke(Method.java:578)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
.
.
.
    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=84, name=opensearch[node_t3][generic][T#1], state=RUNNABLE, group=TGRP-SegmentReplicationRemoteStoreIT]

        Caused by:
        java.lang.AssertionError: file (name [segment_infos_snapshot_filename__3], reused [false], length [395], recovered [477])
            at __randomizedtesting.SeedInfo.seed([E05CF3CB304D9302]:0)
            at org.opensearch.index.shard.StoreRecovery$StatsDirectoryWrapper.copyFrom(StoreRecovery.java:316)
            at org.opensearch.index.shard.IndexShard.syncSegmentsFromRemoteSegmentStore(IndexShard.java:4520)
            at org.opensearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:248)
            at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:604)
            at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806)
            at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
            at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
            at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
            at java.base/java.lang.Thread.run(Thread.java:1589)

To Reproduce

./gradlew ':server:internalClusterTest' --tests "org.opensearch.remotestore.SegmentReplicationRemoteStoreIT.testReplicaHasDiffFilesThanPrimary" -Dtests.seed=E05CF3CB304D9302 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en-CA -Dtests.timezone=Asia/Kolkata -Druntime.java=19

Additional Context

https://build.ci.opensearch.org/job/gradle-check/15752/consoleFull

@sachinpkale sachinpkale added bug Something isn't working untriaged Storage:Durability Issues and PRs related to the durability framework flaky-test Random test failure that succeeds on second run v2.8.0 'Issues and PRs related to version v2.8.0' and removed untriaged labels May 19, 2023
@sachinpkale sachinpkale self-assigned this May 19, 2023
@sachinpkale sachinpkale added v2.9.0 'Issues and PRs related to version v2.9.0' and removed v2.8.0 'Issues and PRs related to version v2.8.0' labels Jun 8, 2023
@BhumikaSaini-Amazon
Contributor

Hi @sachinpkale,
The org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testDeleteOperations and org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testUpdateOperations tests are failing in this build. The error is:

java.lang.AssertionError: Expected no missing or different segments between primary and replica but diff was missing: [name [_1_1_Lucene90_0.dvm], length [160], checksum [webb4a], writtenBy [9.7.0], name [_2.si], length [325], checksum [1auy4un], writtenBy [9.7.0], name [_2.cfe], length [479], checksum [1rwi2tu], writtenBy [9.7.0], name [_2.cfs], length [5447], checksum [cjpdu8], writtenBy [9.7.0], name [_1_1_Lucene90_0.dvd], length [87], checksum [1cgqi8w], writtenBy [9.7.0], name [_1_1.fnm], length [1216], checksum [1xrgj1t], writtenBy [9.7.0]] Different: [] Primary Replication Checkpoint : ReplicationCheckpoint{shardId=[test-idx-1][0], primaryTerm=1, segmentsGen=3, version=14, size=60012, codec=Lucene95} Replica Replication Checkpoint: ReplicationCheckpoint{shardId=[test-idx-1][0], primaryTerm=1, segmentsGen=5, version=11, size=52528, codec=Lucene95}

I am wondering if these 2 failures and this issue are related. They appear to map to the same base (org.opensearch.indices.replication.SegmentReplicationBaseIT#verifyStoreContent):

final Store.RecoveryDiff recoveryDiff = Store.segmentReplicationDiff(
    primarySegmentMetadata,
    replicaShard.getSegmentMetadataMap()
);
if (recoveryDiff.missing.isEmpty() == false || recoveryDiff.different.isEmpty() == false) {
    fail(
        "Expected no missing or different segments between primary and replica but diff was missing: "
            + recoveryDiff.missing
            + " Different: "
            + recoveryDiff.different
            + " Primary Replication Checkpoint : "
            + primaryShard.getLatestReplicationCheckpoint()
            + " Replica Replication Checkpoint: "
            + replicaShard.getLatestReplicationCheckpoint()
    );
}
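
For readers unfamiliar with the check, here is a minimal, self-contained sketch of the kind of comparison that produces the "missing" and "different" lists in the message above. It is not the OpenSearch implementation: `SegmentDiffSketch`, `FileMeta`, and `Diff` are hypothetical names, and the sketch assumes files are matched by name and then compared by length and checksum. The real `Store.segmentReplicationDiff` may apply different rules.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, simplified stand-in for the primary/replica segment comparison. */
final class SegmentDiffSketch {

    /** Illustrative per-file metadata (assumption: only length and checksum matter). */
    static final class FileMeta {
        final long length;
        final String checksum;

        FileMeta(long length, String checksum) {
            this.length = length;
            this.checksum = checksum;
        }

        boolean sameAs(FileMeta other) {
            return length == other.length && checksum.equals(other.checksum);
        }
    }

    /** Holds the two lists that the failure message prints. */
    static final class Diff {
        final List<String> missing = new ArrayList<>();
        final List<String> different = new ArrayList<>();
    }

    /** Reports primary files that are absent on the replica or whose metadata differs. */
    static Diff diff(Map<String, FileMeta> primary, Map<String, FileMeta> replica) {
        Diff result = new Diff();
        // TreeMap only makes the output order deterministic.
        for (Map.Entry<String, FileMeta> entry : new TreeMap<>(primary).entrySet()) {
            FileMeta onReplica = replica.get(entry.getKey());
            if (onReplica == null) {
                result.missing.add(entry.getKey());
            } else if (!onReplica.sameAs(entry.getValue())) {
                result.different.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // File names, lengths, and checksums are taken from the failure message above.
        Map<String, FileMeta> primary = new HashMap<>();
        primary.put("_2.si", new FileMeta(325, "1auy4un"));
        primary.put("_2.cfe", new FileMeta(479, "1rwi2tu"));
        Map<String, FileMeta> replica = new HashMap<>();
        replica.put("_2.cfe", new FileMeta(479, "1rwi2tu"));

        Diff d = diff(primary, replica);
        System.out.println("missing=" + d.missing + " different=" + d.different);
    }
}
```

In the reported failure the "Different" list is empty and only "missing" is populated, which under this model means the replica simply had not received some of the primary's files when the check ran.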

@kotwanikunal
Member

@sachinpkale Are you still looking into this issue?

@sachinpkale
Member Author

Yes, this should be fixed by the recent fixes (#8134). I will run this test multiple times locally and post an update.

@mch2
Member

mch2 commented Jul 17, 2023

Another failure here https://build.ci.opensearch.org/job/gradle-check/20268/.

Reproducible seed on main without this PR's changes: -Dtests.seed=A1F5B109BF8CE77B

@mch2 mch2 reopened this Jul 17, 2023
@github-project-automation github-project-automation bot moved this from Done to In Progress in Segment Replication Jul 17, 2023
@dreamer-89 dreamer-89 added >test-failure Test failure from CI, local build, etc. and removed untriaged v2.9.0 'Issues and PRs related to version v2.9.0' labels Jul 17, 2023
@mch2
Member

mch2 commented Jul 25, 2023

This is failing in two ways, with a division by zero during refresh:

[2023-07-25T01:06:07,896][WARN ][o.o.i.e.Engine           ] [node_t2] [test-idx-1][0] failed engine [refresh failed source[api]]
java.lang.ArithmeticException: / by zero
	at org.opensearch.index.shard.RemoteStoreRefreshListener.updateFinalStatusInSegmentTracker(RemoteStoreRefreshListener.java:464) ~[classes/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.syncSegments(RemoteStoreRefreshListener.java:264) ~[classes/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.performAfterRefresh(RemoteStoreRefreshListener.java:171) ~[classes/:?]
	at org.opensearch.index.shard.CloseableRetryableRefreshListener.afterRefresh(CloseableRetryableRefreshListener.java:57) ~[classes/:?]
	at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:432) ~[classes/:?]
	at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:412) ~[classes/:?]
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1752) ~[classes/:?]
	at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1729) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard.refresh(IndexShard.java:1273) ~[classes/:?]
	at org.opensearch.action.admin.indices.refresh.TransportShardRefreshAction.lambda$shardOperationOnPrimary$0(TransportShardRefreshAction.java:101) ~[classes/:?]
	at org.opensearch.action.ActionListener.completeWith(ActionListener.java:342) ~[classes/:?]
	at org.opensearch.action.admin.indices.refresh.TransportShardRefreshAction.shardOperationOnPrimary(TransportShardRefreshAction.java:100) ~[classes/:?]
	at org.opensearch.action.admin.indices.refresh.TransportShardRefreshAction.shardOperationOnPrimary(TransportShardRefreshAction.java:57) ~[classes/:?]
	at org.opensearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:1260) ~[classes/:?]
	at org.opensearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:158) ~[classes/:?]
	at org.opensearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:581) ~[classes/:?]
	at org.opensearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:480) ~[classes/:?]
	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$34(IndexShard.java:3801) ~[classes/:?]
	at org.opensearch.action.ActionListener$3.onResponse(ActionListener.java:130) ~[classes/:?]
	at org.opensearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:310) ~[classes/:?]
	at org.opensearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:255) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:3772) ~[classes/:?]
	at org.opensearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:1189) ~[classes/:?]
	at org.opensearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:477) ~[classes/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[classes/:?]
	at org.opensearch.action.support.replication.TransportReplicationAction.handlePrimaryRequest(TransportReplicationAction.java:416) ~[classes/:?]
	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) [classes/:?]
	at org.opensearch.transport.TransportService$8.doRun(TransportService.java:1063) [classes/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [classes/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [classes/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.lang.Thread.run(Thread.java:1589) [?:?]

and with a FileAlreadyExistsException, because existing local segment files are not always deleted when syncing with the remote store and there is a checksum mismatch.
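
The second failure mode can be illustrated outside OpenSearch with plain `java.nio.file`. This is a sketch under stated assumptions, not the actual fix: `ChecksumAwareCopySketch` and `syncFile` are hypothetical names, CRC32 stands in for the Lucene file checksum, and ordinary paths stand in for the remote and local directories. It shows why a stale local copy has to be removed before the copy: `Files.copy` without `REPLACE_EXISTING` throws `FileAlreadyExistsException` when the target already exists.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

/** Hypothetical sketch: copy a file from a "remote" path only when the local copy is absent or stale. */
final class ChecksumAwareCopySketch {

    /** CRC32 of the whole file; a stand-in for the real segment file checksum. */
    static long crc32(Path file) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(Files.readAllBytes(file));
        return crc.getValue();
    }

    /**
     * Brings the local file in line with the remote one.
     * Returns true if a copy was performed, false if the local file already matched.
     */
    static boolean syncFile(Path remote, Path local) throws IOException {
        if (Files.exists(local)) {
            if (crc32(local) == crc32(remote)) {
                return false; // same content, nothing to do
            }
            // Checksum mismatch: delete the stale file first. Skipping this step
            // makes the copy below fail with FileAlreadyExistsException.
            Files.delete(local);
        }
        Files.copy(remote, local);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("sync-sketch");
        Path remote = dir.resolve("remote_segment");
        Path local = dir.resolve("local_segment");
        Files.write(remote, new byte[] { 1, 2, 3, 4 });
        Files.write(local, new byte[] { 9, 9 }); // stale local copy with different content
        System.out.println("copied: " + syncFile(remote, local));
    }
}
```

Deleting only on mismatch keeps files that are already identical in place, so nothing is transferred again unnecessarily.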

Projects
Status: Done
5 participants