Skip to content

Conversation

@Gagan6164
Copy link
Contributor

@Gagan6164 Gagan6164 commented Nov 28, 2025

Description

Making the Composite directory sync method Noop to fix NoSuchFileException.

Related Issues

Resolves #19658

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Refactor
    • Changed synchronization behavior to skip active file sync operations and emit a status log, making sync a no-op at runtime.
  • Tests
    • Removed a test that validated multiple sync scenarios and exception handling, reducing coverage for sync-related behaviors.

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: Gagan Singh Saini <gagasa@amazon.com>
@coderabbitai
Copy link

coderabbitai bot commented Nov 28, 2025

Walkthrough

CompositeDirectory.sync was simplified to no longer compute or perform file synchronization; it now logs that the sync was skipped. The corresponding test method exercising sync behavior (testSync) was removed.

Changes

Cohort / File(s) Summary
Sync logic removed
server/src/main/java/org/opensearch/index/store/CompositeDirectory.java
Removed logic that computed remote/local file sets and invoked localDirectory.sync(...); replaced with a log and early return so sync becomes a no-op.
Test removed
server/src/test/java/org/opensearch/index/store/CompositeDirectoryTests.java
Deleted public void testSync() throws IOException, which validated sync behavior, deletion/race scenarios, and NoSuchFileException paths.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review focus:
    • CompositeDirectory.sync — confirm intentional removal of sync and assess callers that depend on fsync semantics.
    • Tests — ensure missing coverage is addressed or intentional.
    • Remote-store/durability flows (race scenarios like linked issue #19658) — verify skipping sync does not introduce correctness or durability regressions.

Poem

🐇 I hopped through index lanes at night,
Found a jittery race and dimmed the light.
I paused the sync, I held my paws,
Quiet files hummed without a cause.
🥕✨

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: making CompositeDirectory.sync a no-op to fix the NoSuchFileException issue.
Description check ✅ Passed The description includes required sections (Description and Related Issues) and references issue #19658, but the checklist items for functionality testing are unchecked, which may indicate incomplete testing confirmation.
Linked Issues check ✅ Passed The code changes directly address the root cause described in issue #19658 by making CompositeDirectory.sync a no-op, preventing NoSuchFileException from being thrown during sync operations on files that may have been evicted locally.
Out of Scope Changes check ✅ Passed All changes are in-scope: the sync method modification and test removal directly address the linked issue's requirements with no extraneous modifications.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between db4facb and 27a6673.

📒 Files selected for processing (1)
  • server/src/test/java/org/opensearch/index/store/CompositeDirectoryTests.java (0 hunks)
💤 Files with no reviewable changes (1)
  • server/src/test/java/org/opensearch/index/store/CompositeDirectoryTests.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: gradle-check
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: detect-breaking-change
  • GitHub Check: Analyze (java)

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1ee30dc and 81c0cd0.

📒 Files selected for processing (1)
  • server/src/main/java/org/opensearch/index/store/CompositeDirectory.java (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: gradle-check
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: Analyze (java)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: detect-breaking-change

Comment on lines 252 to 255
@Override
public void sync(Collection<String> names) throws IOException {
ensureOpen();
logger.trace("Composite Directory[{}]: sync() called {}", this::toString, () -> names);
Set<String> remoteFiles = Set.of(getRemoteFiles());
Set<String> localFilesHavingBlocks = Arrays.stream(listLocalFiles())
.filter(FileTypeUtils::isBlockFile)
.map(file -> file.substring(0, file.indexOf(BLOCK_FILE_IDENTIFIER)))
.collect(Collectors.toSet());
Collection<String> fullFilesToSync = names.stream()
.filter(name -> (remoteFiles.contains(name) == false) && (localFilesHavingBlocks.contains(name) == false))
.collect(Collectors.toList());
logger.trace("Composite Directory[{}]: Synced files : {}", this::toString, () -> fullFilesToSync);
localDirectory.sync(fullFilesToSync);
logger.trace("Composite Directory[{}]: sync() called {}; Skipping sync", this::toString, () -> names);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for RemoteSegmentStoreDirectory and its upload flow
echo "=== Finding RemoteSegmentStoreDirectory ==="
fd -t f "RemoteSegmentStoreDirectory" --type f

echo ""
echo "=== Checking RemoteSegmentStoreDirectory for upload and sync operations ==="
rg -n "uploadBlob|upload|sync" --type=java -g '*RemoteSegmentStoreDirectory*' -A 2 -B 2

Repository: opensearch-project/OpenSearch

Length of output: 202


🏁 Script executed:

#!/bin/bash
# Search for afterSyncToRemote and related methods
echo "=== Finding afterSyncToRemote ==="
rg -n "afterSyncToRemote" --type=java -A 5 -B 2

echo ""
echo "=== Finding file eviction and unpinning logic ==="
rg -n "evicted|unpin|eviction" --type=java -g '*CompositeDirectory*' -A 3 -B 3

Repository: opensearch-project/OpenSearch

Length of output: 202


🏁 Script executed:

#!/bin/bash
# Search for sync() method calls to understand durability expectations
echo "=== Finding sync method calls in IndexWriter/Engine ==="
rg -n "\.sync\(" --type=java -A 2 -B 2 | head -50

echo ""
echo "=== Looking for exception handling around sync in similar contexts ==="
rg -n "NoSuchFileException|sync.*catch" --type=java -A 3 -B 3 | head -40

Repository: opensearch-project/OpenSearch

Length of output: 202


🏁 Script executed:

#!/bin/bash
# Read the CompositeDirectory file to understand the full sync method context
echo "=== Reading CompositeDirectory around sync method ==="
wc -l server/src/main/java/org/opensearch/index/store/CompositeDirectory.java

Repository: opensearch-project/OpenSearch

Length of output: 202


🌐 Web query:

OpenSearch remote segment store durability guarantees fsync sync

💡 Result:

Short summary of OpenSearch remote segment-store durability and fsync/sync semantics:

  • Remote segment store uploads segment files from the primary after refresh/flush/commit; those uploads are asynchronous (segments are sent after segments are created). This gives "refresh-level" durability (segments uploaded after each refresh) but the uploads themselves are asynchronous. [1][2]

  • Translog durability is separate. For full request-level durability you must enable remote translog (translog uploads) or set index.translog.durability=request for local fsync behavior. With remote-backed storage you can enable translog uploads so the translog is uploaded to the remote store before acknowledging the request (request-level durability). Without remote translog, acknowledged writes may be lost if not fsynced locally. [1][2][5]

  • fsync behavior: Lucene commit/flush uses fsync to make a commit point durable on local disk. OpenSearch still relies on fsync for commit stability; design notes state segments are fsynced on commit to maintain durability and replicas rely on commit points. Remote segment uploads do not replace the Lucene fsync semantics on the primary — remote uploads are an additional step to copy segment files to remote storage. [3][5]

  • Practical consequence: remote segment store provides durable copies of segment files in remote storage (useful for recovery/replicas) but unless you also enable remote translog or use request-level translog durability, an acknowledged write might not be safe until its translog is durably persisted (either local fsync or remote translog upload). In other words, for no-loss guarantees you must combine remote segment store with request-level translog durability (or local fsync semantics). [1][2][5]

Sources:

  • OpenSearch docs, Remote-backed storage / remote store (includes refresh-level vs request-level durability, translog.durability). [1][2]
  • OpenSearch design/issue discussions on fsync and remote translog/segment proposals (design rationale: fsync on commit, remote translog for request-level durability). [3][5]

🌐 Web query:

OpenSearch issue 19658 NoSuchFileException sync evicted files

💡 Result:

I couldn't find an OpenSearch issue #19658 or any matching report about "NoSuchFileException sync evicted files" in the OpenSearch repos or issues. Could you confirm:

  • which repository/org (opensearch-project/OpenSearch or another) and
  • a link, or the exact error log / stacktrace?

If you want, I can run a targeted search of GitHub issues, PRs and mailing lists for that repo — confirm and I'll search and summarize results with links.


Data durability is compromised by making sync() a complete no-op, which violates Lucene's fsync semantics.

The web search confirms a critical issue with this change: remote segment uploads are asynchronous and do not replace Lucene's fsync requirements for commit durability. OpenSearch documentation explicitly states that "fsync behavior: Lucene commit/flush uses fsync to make a commit point durable on local disk. OpenSearch still relies on fsync for commit stability... remote segment uploads do not replace the Lucene fsync semantics on the primary."

Making sync() a no-op means:

  • Files evicted before being uploaded will never be fsynced
  • Uncommitted translog entries may be lost if the primary crashes before remote upload completes
  • Violates Lucene's Directory contract that sync() ensures durability via fsync
  • Remote segment store provides additional redundancy but is not a durability substitute

Safer alternatives that maintain durability:

  1. Catch and skip only evicted files (minimal change, preserves sync for existing files):

    • Check if file exists before attempting fsync
    • Skip only files that were already evicted, sync all others
  2. Sync before marking for eviction (safest approach):

    • Call localDirectory.sync() for a file before it is unpinned/marked for eviction in afterSyncToRemote()
    • Prevents the race condition entirely by guaranteeing fsync before eviction is possible

The current no-op approach trades off a runtime exception against data loss risk, which is not an acceptable tradeoff for a durability operation.

@github-actions
Copy link
Contributor

❌ Gradle check result for 81c0cd0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Gagan Singh Saini <gagasa@amazon.com>
@github-actions
Copy link
Contributor

❌ Gradle check result for db4facb: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Gagan Singh Saini <gagasa@amazon.com>
@github-actions
Copy link
Contributor

❌ Gradle check result for 27a6673: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working Storage:Remote

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[BUG] NoSuchFileException in CompositeDirectory.sync

1 participant