
[BUG] Test SegmentReplicationUsingRemoteStoreDisruptionIT.testCancelReplicationWhileFetchingMetadata is flaky #10902

Closed
mch2 opened this issue Oct 25, 2023 · 10 comments · Fixed by #10985
Labels: bug (Something isn't working), flaky-test (Random test failure that succeeds on second run), Storage:Remote, :test (Adding or fixing a test)

@mch2 (Member) commented Oct 25, 2023

Describe the bug
This test is flaky. It doesn't look like it's failing during any download/replication step; rather, it is failing post-failover in the remote translog logic.

```
java.lang.AssertionError:  inconsistent generation 
	at __randomizedtesting.SeedInfo.seed([EB6B5954B3E7827C]:0)
	at org.opensearch.index.translog.transfer.TranslogCheckpointTransferSnapshot$Builder.build(TranslogCheckpointTransferSnapshot.java:180)
	at org.opensearch.index.translog.RemoteFsTranslog.upload(RemoteFsTranslog.java:333)
	at org.opensearch.index.translog.RemoteFsTranslog.prepareAndUpload(RemoteFsTranslog.java:305)
	at org.opensearch.index.translog.RemoteFsTranslog.sync(RemoteFsTranslog.java:360)
	at org.opensearch.index.translog.InternalTranslogManager.syncTranslog(InternalTranslogManager.java:197)
	at org.opensearch.index.shard.IndexShard.postActivatePrimaryMode(IndexShard.java:3404)
	at org.opensearch.index.shard.IndexShard.lambda$updateShardState$4(IndexShard.java:714)
	at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4007)
	at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:3977)
	at org.opensearch.index.shard.IndexShard.lambda$asyncBlockOperations$36(IndexShard.java:3928)
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
	at org.opensearch.index.shard.IndexShardOperationPermits$1.doRun(IndexShardOperationPermits.java:157)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1623)
```

Seed:

```
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT.testCancelReplicationWhileFetchingMetadata" -Dtests.seed=EB6B5954B3E7827C -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-MX -Dtests.timezone=Asia/Qostanay -Druntime.java=20
```

Expected behavior
The test should not be flaky.

@amkhar (Contributor) commented Oct 28, 2023

#10964 (comment)

@ashking94 (Member) commented

I was able to reproduce this issue after 2.5K iterations. I have identified the fix and raised the PR: #10985. It appears to be a very narrow corner case: during test cleanup, index deletion and primary-mode activation happen concurrently in such a way that the translog writer gets rotated while the current writer is being closed (since the translog is still open), but has already been closed by the time of the upload. This race condition is very rare and not problematic in production.
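To illustrate the shape of this race, here is a minimal, hypothetical sketch (the class and field names are invented for illustration and are not actual OpenSearch code): the upload path captures the current translog generation, a concurrent rotation changes it before the snapshot builder's consistency check runs, and the "inconsistent generation" assertion trips.

```java
// Hypothetical sketch of the generation race; not actual OpenSearch code.
public class TranslogGenerationRaceSketch {
    static int currentGeneration = 5;

    // Simulates building a transfer snapshot. If the writer is rotated
    // concurrently, the captured generation no longer matches and the
    // "inconsistent generation" check fails.
    static boolean buildSnapshot(boolean rotatedDuringBuild) {
        int capturedGeneration = currentGeneration; // captured when the upload starts
        if (rotatedDuringBuild) {
            currentGeneration++; // a concurrent close rotates the writer
        }
        return capturedGeneration == currentGeneration; // the consistency check
    }

    public static void main(String[] args) {
        System.out.println(buildSnapshot(false)); // true: no rotation, generations match
        System.out.println(buildSnapshot(true));  // false: race detected
    }
}
```

In the real code the interleaving involves two threads, so the failure only shows up once in thousands of iterations; the sketch replays the bad interleaving deterministically.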

@ashking94 (Member) commented

Ran with the fix for 16K iterations and there have been no failures yet.

@jed326 (Collaborator) commented Nov 10, 2023

> Ran with the fix for 16K iterations and there have been no failures yet.

@ashking94 are you running all 16k iterations on the same test seed or on different test seeds?

@ashking94 (Member) commented

> > Ran with the fix for 16K iterations and there have been no failures yet.
>
> @ashking94 are you running all 16k iterations on the same test seed or on different test seeds?

This is with different seeds and with the prospective fix. I have not been able to get reviews on #10985, which fixes this issue.
@gbbafna @sachinpkale @mch2, can one of you help take this to closure?

@reta (Collaborator) commented Jan 17, 2024

```
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=2071, name=opensearch[node_t2][remote_refresh_retry][T#1], state=RUNNABLE, group=TGRP-SegmentReplicationUsingRemoteStoreDisruptionIT]
	at __randomizedtesting.SeedInfo.seed([A4AA445A8D01A380:44FE23D23262BE4A]:0)
Caused by: org.opensearch.core.concurrency.OpenSearchRejectedExecutionException: rejected execution of java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@5487c53f[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@471f16c3[Wrapped task = [threaded] org.opensearch.index.shard.ReleasableRetryableRefreshListener$$Lambda/0x00007f2708b52a48@5c16d5af]] on org.opensearch.threadpool.Scheduler$SafeScheduledThreadPoolExecutor@718da3f9[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 4]
	at __randomizedtesting.SeedInfo.seed([A4AA445A8D01A380]:0)
	at app//org.opensearch.common.util.concurrent.OpenSearchAbortPolicy.rejectedExecution(OpenSearchAbortPolicy.java:67)
	at java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:841)
	at java.base@21.0.1/java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340)
	at java.base@21.0.1/java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562)
	at app//org.opensearch.threadpool.ThreadPool.schedule(ThreadPool.java:468)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.scheduleRetry(ReleasableRetryableRefreshListener.java:125)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.scheduleRetry(ReleasableRetryableRefreshListener.java:178)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:167)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.lambda$scheduleRetry$2(ReleasableRetryableRefreshListener.java:126)
	at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854)
	at java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583)
```

@vikasvb90 vikasvb90 added Indexing:Replication Issues and PRs related to core replication framework eg segrep Storage:Remote and removed Indexing:Replication Issues and PRs related to core replication framework eg segrep labels Jan 30, 2024
@andrross (Member) commented

[Triage]
@mch2 Thanks for filing this issue; it looks like there is still a problem with this test.

@mch2 (Member, Author) commented Mar 13, 2024

Taking a look

@shourya035 (Member) commented Apr 24, 2024

Ran this for 1000 iterations and didn't observe any failures.

From the first comment after this issue was reopened, the failure is an OpenSearchRejectedExecutionException raised from ReleasableRetryableRefreshListener when it tries to re-schedule a remote refresh. The rejection occurs because the executor has already been shut down.

```
Caused by: org.opensearch.core.concurrency.OpenSearchRejectedExecutionException: rejected execution of java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@5487c53f[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@471f16c3[Wrapped task = [threaded] org.opensearch.index.shard.ReleasableRetryableRefreshListener$$Lambda/0x00007f2708b52a48@5c16d5af]] on org.opensearch.threadpool.Scheduler$SafeScheduledThreadPoolExecutor@718da3f9[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 4]
```

Checking the code for ScheduledThreadPoolExecutor, it rejects the execution when the executor's isShutdown() check returns true.

https://github.com/openjdk/jdk/blob/890adb6410dab4606a4f26a942aed02fb2f55387/src/java.base/share/classes/java/util/concurrent/ScheduledThreadPoolExecutor.java#L338-L348
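This behavior is easy to demonstrate standalone with plain JDK classes (no OpenSearch code involved): once a ScheduledThreadPoolExecutor has been shut down, any further schedule() call is rejected by the default AbortPolicy.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ShutdownRejectionDemo {
    // Returns true if scheduling on a shut-down executor is rejected.
    static boolean rejectsAfterShutdown() {
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);
        executor.shutdown();
        try {
            executor.schedule(() -> {}, 10, TimeUnit.MILLISECONDS);
            return false; // not expected: the task was accepted
        } catch (RejectedExecutionException e) {
            return true; // isShutdown() was true, so delayedExecute() rejected the task
        }
    }

    public static void main(String[] args) {
        System.out.println(rejectsAfterShutdown()); // prints "true"
    }
}
```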

We have a similar implementation under ThreadPool that uses the executor-shutdown check to skip scheduling tasks:

```java
public void scheduleUnlessShuttingDown(TimeValue delay, String executor, Runnable command) {
    try {
        schedule(command, delay, executor);
    } catch (OpenSearchRejectedExecutionException e) {
        if (e.isExecutorShutdown()) {
            logger.debug(
                new ParameterizedMessage(
                    "could not schedule execution of [{}] after [{}] on [{}] as executor is shut down",
                    command,
                    delay,
                    executor
                ),
                e
            );
        } else {
            throw e;
        }
    }
}
```

We should use this method instead of the usual schedule here:

```java
try {
    this.threadPool.schedule(
        () -> runAfterRefreshWithPermit(didRefresh, () -> retryScheduled.set(false)),
        interval,
        retryThreadPoolName
    );
    scheduled = true;
    getLogger().info("Scheduled retry with didRefresh={}", didRefresh);
```
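The gist of the proposed change can be sketched with plain JDK types (a hedged illustration only; the real fix would go through ThreadPool.scheduleUnlessShuttingDown and OpenSearchRejectedExecutionException, and the class name here is invented): catch the rejection, and swallow it only when the executor is shutting down.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ScheduleUnlessShuttingDownSketch {
    // Mirrors the scheduleUnlessShuttingDown pattern: returns true if the
    // task was scheduled, false if it was skipped because the executor is
    // shutting down; any other rejection is rethrown.
    static boolean scheduleUnlessShuttingDown(ScheduledExecutorService executor, Runnable command, long delayMs) {
        try {
            executor.schedule(command, delayMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (RejectedExecutionException e) {
            if (executor.isShutdown()) {
                return false; // executor is shut down; skip the retry quietly
            }
            throw e; // unexpected rejection: propagate
        }
    }

    public static void main(String[] args) throws Exception {
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);
        System.out.println(scheduleUnlessShuttingDown(executor, () -> {}, 1)); // true: scheduled
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.SECONDS);
        System.out.println(scheduleUnlessShuttingDown(executor, () -> {}, 1)); // false: skipped
    }
}
```

This would turn the node-shutdown race into a silently skipped retry instead of an uncaught exception that fails the test.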

@shourya035 (Member) commented

Closing since fixes have been made through #13301

@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Storage Project Board Apr 24, 2024