[BUG] Test SegmentReplicationUsingRemoteStoreDisruptionIT.testCancelReplicationWhileFetchingMetadata is flaky #10902

mch2 · 2023-10-25T04:02:52Z

Describe the bug
This test is flaky - It doesn't look like its failing for any download/replication step, rather it is failing post failover with remote translog logic.

java.lang.AssertionError:  inconsistent generation 
	at __randomizedtesting.SeedInfo.seed([EB6B5954B3E7827C]:0)
	at org.opensearch.index.translog.transfer.TranslogCheckpointTransferSnapshot$Builder.build(TranslogCheckpointTransferSnapshot.java:180)
	at org.opensearch.index.translog.RemoteFsTranslog.upload(RemoteFsTranslog.java:333)
	at org.opensearch.index.translog.RemoteFsTranslog.prepareAndUpload(RemoteFsTranslog.java:305)
	at org.opensearch.index.translog.RemoteFsTranslog.sync(RemoteFsTranslog.java:360)
	at org.opensearch.index.translog.InternalTranslogManager.syncTranslog(InternalTranslogManager.java:197)
	at org.opensearch.index.shard.IndexShard.postActivatePrimaryMode(IndexShard.java:3404)
	at org.opensearch.index.shard.IndexShard.lambda$updateShardState$4(IndexShard.java:714)
	at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4007)
	at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:3977)
	at org.opensearch.index.shard.IndexShard.lambda$asyncBlockOperations$36(IndexShard.java:3928)
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
	at org.opensearch.index.shard.IndexShardOperationPermits$1.doRun(IndexShardOperationPermits.java:157)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1623)

Seed:

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT.testCancelReplicationWhileFetchingMetadata" -Dtests.seed=EB6B5954B3E7827C -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-MX -Dtests.timezone=Asia/Qostanay -Druntime.java=20

Expected behavior
No flaky

amkhar · 2023-10-28T08:17:21Z

#10964 (comment)

ashking94 · 2023-10-30T08:48:05Z

I was able to reproduce this issue after 2.5K iterations. I have identified the fix and raised the PR - #10985. It appears to be a very corner case where during test cleanup, index deletion and activate primary mode are happening simultaneously in a such a way that the translog writer gets rotated while closing the current writer due to translog being open, but gets closed at the time of upload. This is race condition which is very rare and non-problematic in prod mode.

ashking94 · 2023-11-01T19:16:31Z

Ran with the fix for 16K iterations and there have been no failures yet.

jed326 · 2023-11-10T23:36:51Z

Ran with the fix for 16K iterations and there have been no failures yet.

@ashking94 are you running all 16k iterations on the same test seed or on different test seeds?

ashking94 · 2023-11-11T07:35:04Z

Ran with the fix for 16K iterations and there have been no failures yet.

@ashking94 are you running all 16k iterations on the same test seed or on different test seeds?

This is with different seed and with the prospective fix. I have not been able to get reviews on #10985 which fixes this issue.
@gbbafna @sachinpkale @mch2 can one of you help take this to closure?

reta · 2024-01-17T20:03:34Z

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=2071, name=opensearch[node_t2][remote_refresh_retry][T#1], state=RUNNABLE, group=TGRP-SegmentReplicationUsingRemoteStoreDisruptionIT]
	at __randomizedtesting.SeedInfo.seed([A4AA445A8D01A380:44FE23D23262BE4A]:0)
Caused by: org.opensearch.core.concurrency.OpenSearchRejectedExecutionException: rejected execution of java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@5487c53f[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@471f16c3[Wrapped task = [threaded] org.opensearch.index.shard.ReleasableRetryableRefreshListener$$Lambda/0x00007f2708b52a48@5c16d5af]] on org.opensearch.threadpool.Scheduler$SafeScheduledThreadPoolExecutor@718da3f9[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 4]
	at __randomizedtesting.SeedInfo.seed([A4AA445A8D01A380]:0)
	at app//org.opensearch.common.util.concurrent.OpenSearchAbortPolicy.rejectedExecution(OpenSearchAbortPolicy.java:67)
	at java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:841)
	at java.base@21.0.1/java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340)
	at java.base@21.0.1/java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562)
	at app//org.opensearch.threadpool.ThreadPool.schedule(ThreadPool.java:468)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.scheduleRetry(ReleasableRetryableRefreshListener.java:125)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.scheduleRetry(ReleasableRetryableRefreshListener.java:178)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:167)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.lambda$scheduleRetry$2(ReleasableRetryableRefreshListener.java:126)
	at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854)
	at java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583)

andrross · 2024-03-13T15:46:37Z

[Triage - attendees 1 2 3]
@mch2 Thanks for filing this issue, looks like there is still a problem with this test.

mch2 · 2024-03-13T18:42:16Z

Taking a look

shourya035 · 2024-04-24T06:05:14Z

Ran this for 1000 iterations and didn't observe any failures.

From the first comment after this issue was reopened, it seems like the exception is related to an OpenSearchRejectedExecutionException coming up from the ReleasableRetryableRefreshListener when it tries to re-schedule a remote refresh. The rejection is coming up since the executor has been shutdown.

Caused by: org.opensearch.core.concurrency.OpenSearchRejectedExecutionException: rejected execution of java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@5487c53f[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@471f16c3[Wrapped task = [threaded] org.opensearch.index.shard.ReleasableRetryableRefreshListener$$Lambda/0x00007f2708b52a48@5c16d5af]] on org.opensearch.threadpool.Scheduler$SafeScheduledThreadPoolExecutor@718da3f9[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 4]

Checking the code for ScheduledThreadPoolExecutor it looks like it rejects the execution if the isShutdown() assertion of the executor returns true.

https://github.com/openjdk/jdk/blob/890adb6410dab4606a4f26a942aed02fb2f55387/src/java.base/share/classes/java/util/concurrent/ScheduledThreadPoolExecutor.java#L338-L348

We have a similar implementation under ThreadPool which utilizes the executor.isShutdown method to skip scheduling tasks:

OpenSearch/server/src/main/java/org/opensearch/threadpool/ThreadPool.java

Lines 472 to 490 in 9e62ccf

    
           public void scheduleUnlessShuttingDown(TimeValue delay, String executor, Runnable command) { 
        
               try { 
        
                   schedule(command, delay, executor); 
        
               } catch (OpenSearchRejectedExecutionException e) { 
        
                   if (e.isExecutorShutdown()) { 
        
                       logger.debug( 
        
                           new ParameterizedMessage( 
        
                               "could not schedule execution of [{}] after [{}] on [{}] as executor is shut down", 
        
                               command, 
        
                               delay, 
        
                               executor 
        
                           ), 
        
                           e 
        
                       ); 
        
                   } else { 
        
                       throw e; 
        
                   } 
        
               } 
        
           }

We should use this method instead of the usual schedule here:

OpenSearch/server/src/main/java/org/opensearch/index/shard/ReleasableRetryableRefreshListener.java

Lines 125 to 132 in 9e62ccf

    
           try { 
        
               this.threadPool.schedule( 
        
                   () -> runAfterRefreshWithPermit(didRefresh, () -> retryScheduled.set(false)), 
        
                   interval, 
        
                   retryThreadPoolName 
        
               ); 
        
               scheduled = true; 
        
               getLogger().info("Scheduled retry with didRefresh={}", didRefresh);

shourya035 · 2024-04-24T06:22:23Z

Closing since fixes have been made through #13301

mch2 added bug Something isn't working untriaged :test Adding or fixing a test flaky-test Random test failure that succeeds on second run labels Oct 25, 2023

mch2 mentioned this issue Oct 25, 2023

Fix flaky test testSendCorruptBytesToReplica #10897

Merged

8 tasks

neetikasinghal mentioned this issue Oct 25, 2023

Update the indexRandom function to create more segments #10247

Merged

6 tasks

mch2 mentioned this issue Oct 25, 2023

[Backport 2.x] Fix flaky testClusterStateBatchedUpdates test #10930

Merged

reta mentioned this issue Oct 27, 2023

[Backport 2.x] Add tracing instrumentation for indexing paths #10962

Merged

8 tasks

bharath-techie mentioned this issue Oct 28, 2023

Fix flaky tests 'testDeleteAllAndListAllPits' and 'testDeleteWhileSearch' #10946

Merged

8 tasks

amkhar mentioned this issue Oct 28, 2023

Add statsName field on stream while constructing PersistedStateStats #10964

Merged

5 tasks

ashking94 mentioned this issue Oct 30, 2023

[Remote Translog] Fix translog inconsistent generation assertion failure in ITs #10985

Merged

8 tasks

ashking94 mentioned this issue Oct 30, 2023

[Remote Translog] Fix translog inconsistent generation assertion failure in ITs #10986

Closed

8 tasks

reta mentioned this issue Oct 30, 2023

[Backport 2.x] Disable sort optimization for HALF_FLOAT #11004

Merged

This was referenced Nov 1, 2023

[BUG] RemoteStoreRestoreIT.testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown test is flaky #10972

Closed

[BUG] Assertion failure (inconsistent generation) during translog upload #9724

Closed

reta mentioned this issue Nov 9, 2023

[AUTOCUT] Gradle Check Failure on push to 2.x #11146

Closed

jed326 mentioned this issue Nov 10, 2023

Add indexRandomForConcurrentSearch method to tests: #11118

Merged

8 tasks

mch2 closed this as completed in #10985 Nov 12, 2023

mch2 mentioned this issue Nov 17, 2023

[BUG] org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testReplicaAlreadyAtCheckpoint is flaky #11255

Closed

reta reopened this Jan 17, 2024

reta mentioned this issue Jan 17, 2024

Extract cluster management for integration tests into JUnit test rule out of OpenSearchIntegTestCase #11877

Merged

8 tasks

vikasvb90 added Indexing:Replication Issues and PRs related to core replication framework eg segrep Storage:Remote and removed Indexing:Replication Issues and PRs related to core replication framework eg segrep labels Jan 30, 2024

Bukhtawar added this to Storage Project Board Feb 15, 2024

github-project-automation bot moved this to 🆕 New in Storage Project Board Feb 15, 2024

andrross removed the untriaged label Mar 13, 2024

BhumikaSaini-Amazon assigned shourya035 Mar 18, 2024

reta mentioned this issue Mar 21, 2024

Switch 2.x to 2.14.0 #12839

Merged

8 tasks

shourya035 mentioned this issue Apr 24, 2024

Using scheduleUnlessShuttingDown instead of schedule in retrying remote refresh calls #13363

Closed

8 tasks

shourya035 closed this as completed Apr 24, 2024

github-project-automation bot moved this from 🆕 New to ✅ Done in Storage Project Board Apr 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Test SegmentReplicationUsingRemoteStoreDisruptionIT.testCancelReplicationWhileFetchingMetadata is flaky #10902

[BUG] Test SegmentReplicationUsingRemoteStoreDisruptionIT.testCancelReplicationWhileFetchingMetadata is flaky #10902

mch2 commented Oct 25, 2023

amkhar commented Oct 28, 2023

ashking94 commented Oct 30, 2023

ashking94 commented Nov 1, 2023

jed326 commented Nov 10, 2023

ashking94 commented Nov 11, 2023

reta commented Jan 17, 2024

andrross commented Mar 13, 2024

mch2 commented Mar 13, 2024

shourya035 commented Apr 24, 2024 •

edited

Loading

shourya035 commented Apr 24, 2024

[BUG] Test SegmentReplicationUsingRemoteStoreDisruptionIT.testCancelReplicationWhileFetchingMetadata is flaky #10902

[BUG] Test SegmentReplicationUsingRemoteStoreDisruptionIT.testCancelReplicationWhileFetchingMetadata is flaky #10902

Comments

mch2 commented Oct 25, 2023

amkhar commented Oct 28, 2023

ashking94 commented Oct 30, 2023

ashking94 commented Nov 1, 2023

jed326 commented Nov 10, 2023

ashking94 commented Nov 11, 2023

reta commented Jan 17, 2024

andrross commented Mar 13, 2024

mch2 commented Mar 13, 2024

shourya035 commented Apr 24, 2024 • edited Loading

shourya035 commented Apr 24, 2024

shourya035 commented Apr 24, 2024 •

edited

Loading