[BUG] RemoteIndexShardTests.testSegRepSucceedsOnPreviousCopiedFiles is flaky - suite timeout was reached #10885
Comments
Impacted - #10800 (comment) |
@dreamer-89 Can you please take a look at this? |
@dreamer-89 - I think what's happening here is that the test is throwing an error from getSegmentFiles outside of the actual remote fetch:
super.getSegmentFiles(replicationId, checkpoint, filesToFetch, indexShard, (fileName, bytesRecovered) -> {}, listener);
runAfterGetFiles[index.getAndIncrement()].run();
From the trace, it looks like the indexInput is still open because the first fetch has not yet issued a cancel on the source: fail first notifies listeners and only then invokes cancel on the target:
public void fail(ReplicationFailedException e, boolean sendShardFailure) {
if (finished.compareAndSet(false, true)) {
try {
logger.debug("marking target " + description() + " as failed", e);
notifyListener(e, sendShardFailure);
} finally {
try {
cancellableThreads.cancel("failed" + description() + "[" + ExceptionsHelper.stackTrace(e) + "]");
} finally {
// release the initial reference. replication files will be cleaned as soon as ref count goes to zero, potentially now
decRef();
}
}
}
}
I haven't been able to repro this race, but I think we can avoid it by adding an assertBusy that checks the first target has a refcount of 0, similar to this test. |
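To make the suggested wait concrete, here is a minimal, self-contained sketch of the idea. It does not use OpenSearch classes: FakeTarget, waitUntil, and runScenario are hypothetical names invented for illustration, and waitUntil is only a stand-in for the assertBusy helper mentioned above. The sketch assumes the same ordering as fail(): the listener is notified first and the initial reference is released afterwards, so a test that proceeds on notification alone can observe a non-zero refcount.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class RefCountWaitSketch {

    // Hypothetical stand-in for a replication target that holds one initial reference.
    static final class FakeTarget {
        private final AtomicInteger refCount = new AtomicInteger(1);
        private volatile boolean listenerNotified = false;

        // Same ordering as fail() in the thread: notify first, release the reference last.
        void fail() {
            try {
                listenerNotified = true;     // a listener could start the next round here
            } finally {
                sleepQuietly(50);            // simulated cancel and cleanup work
                refCount.decrementAndGet();  // release the initial reference
            }
        }

        int refCount() {
            return refCount.get();
        }

        boolean listenerNotified() {
            return listenerNotified;
        }
    }

    // Polls until the condition holds or the timeout elapses; returns whether it held.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (System.nanoTime() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            sleepQuietly(5);
        }
        return condition.getAsBoolean();
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Fails the first target on another thread, then waits for its refcount to reach 0
    // before the caller would start a second round. Returns the refcount after the wait.
    static int runScenario() {
        FakeTarget first = new FakeTarget();
        Thread failer = new Thread(first::fail);
        failer.start();

        // Notification alone does not imply that the reference has been released.
        waitUntil(first::listenerNotified, 1_000);

        // The proposed guard: block until the first target is fully released.
        boolean released = waitUntil(() -> first.refCount() == 0, 5_000);
        if (!released) {
            throw new AssertionError("first target still holds references");
        }
        return first.refCount();
    }

    public static void main(String[] args) {
        System.out.println("ref count after wait: " + runScenario());
    }
}
```

In the real test the equivalent check would go between the first (failing) replication round and the second one, so that the second round only starts once the first target has released its files.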
Looking into it. |
From build failures 28991 and 28624, the failure happens because of an unclosed IndexOutput on the _0.si segment file (stack trace below).
|
The usual suspect here is an incomplete download of files in the first round of segment replication. The incomplete download results in open
|
Describe the bug
org.opensearch.index.shard.RemoteIndexShardTests.testSegRepSucceedsOnPreviousCopiedFiles test is flaky.
Stacktrace
To Reproduce
CI - https://build.ci.opensearch.org/job/gradle-check/28624/testReport/org.opensearch.index.shard/RemoteIndexShardTests/testSegRepSucceedsOnPreviousCopiedFiles/
Expected behavior
Test should always pass