Making Composite Directory Sync Noop to fix NoSuchFileException issue #20115
Conversation
Signed-off-by: Gagan Singh Saini <gagasa@amazon.com>
Walkthrough
CompositeDirectory.sync was simplified to no longer compute or perform file synchronization; it now logs that the sync was skipped. The corresponding test method exercising sync behavior (testSync) was removed.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (5 passed)
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/main/java/org/opensearch/index/store/CompositeDirectory.java (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: gradle-check
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: Analyze (java)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: detect-breaking-change
```diff
 @Override
 public void sync(Collection<String> names) throws IOException {
     ensureOpen();
-    logger.trace("Composite Directory[{}]: sync() called {}", this::toString, () -> names);
-    Set<String> remoteFiles = Set.of(getRemoteFiles());
-    Set<String> localFilesHavingBlocks = Arrays.stream(listLocalFiles())
-        .filter(FileTypeUtils::isBlockFile)
-        .map(file -> file.substring(0, file.indexOf(BLOCK_FILE_IDENTIFIER)))
-        .collect(Collectors.toSet());
-    Collection<String> fullFilesToSync = names.stream()
-        .filter(name -> (remoteFiles.contains(name) == false) && (localFilesHavingBlocks.contains(name) == false))
-        .collect(Collectors.toList());
-    logger.trace("Composite Directory[{}]: Synced files : {}", this::toString, () -> fullFilesToSync);
-    localDirectory.sync(fullFilesToSync);
+    logger.trace("Composite Directory[{}]: sync() called {}; Skipping sync", this::toString, () -> names);
 }
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for RemoteSegmentStoreDirectory and its upload flow
echo "=== Finding RemoteSegmentStoreDirectory ==="
fd -t f "RemoteSegmentStoreDirectory" --type f
echo ""
echo "=== Checking RemoteSegmentStoreDirectory for upload and sync operations ==="
rg -n "uploadBlob|upload|sync" --type=java -g '*RemoteSegmentStoreDirectory*' -A 2 -B 2Repository: opensearch-project/OpenSearch
Length of output: 202
🏁 Script executed:
#!/bin/bash
# Search for afterSyncToRemote and related methods
echo "=== Finding afterSyncToRemote ==="
rg -n "afterSyncToRemote" --type=java -A 5 -B 2
echo ""
echo "=== Finding file eviction and unpinning logic ==="
rg -n "evicted|unpin|eviction" --type=java -g '*CompositeDirectory*' -A 3 -B 3Repository: opensearch-project/OpenSearch
Length of output: 202
🏁 Script executed:
#!/bin/bash
# Search for sync() method calls to understand durability expectations
echo "=== Finding sync method calls in IndexWriter/Engine ==="
rg -n "\.sync\(" --type=java -A 2 -B 2 | head -50
echo ""
echo "=== Looking for exception handling around sync in similar contexts ==="
rg -n "NoSuchFileException|sync.*catch" --type=java -A 3 -B 3 | head -40Repository: opensearch-project/OpenSearch
Length of output: 202
🏁 Script executed:
#!/bin/bash
# Read the CompositeDirectory file to understand the full sync method context
echo "=== Reading CompositeDirectory around sync method ==="
wc -l server/src/main/java/org/opensearch/index/store/CompositeDirectory.java

Repository: opensearch-project/OpenSearch
Length of output: 202
🌐 Web query:
OpenSearch remote segment store durability guarantees fsync sync
💡 Result:
Short summary of OpenSearch remote segment-store durability and fsync/sync semantics:
- Remote segment store uploads segment files from the primary after refresh/flush/commit; those uploads are asynchronous (segments are sent after segments are created). This gives "refresh-level" durability (segments uploaded after each refresh), but the uploads themselves are asynchronous. [1][2]
- Translog durability is separate. For full request-level durability you must enable remote translog (translog uploads) or set index.translog.durability=request for local fsync behavior. With remote-backed storage you can enable translog uploads so the translog is uploaded to the remote store before acknowledging the request (request-level durability). Without remote translog, acknowledged writes may be lost if not fsynced locally. [1][2][5] (A settings sketch follows the sources below.)
- fsync behavior: Lucene commit/flush uses fsync to make a commit point durable on local disk. OpenSearch still relies on fsync for commit stability; design notes state segments are fsynced on commit to maintain durability and replicas rely on commit points. Remote segment uploads do not replace the Lucene fsync semantics on the primary; remote uploads are an additional step to copy segment files to remote storage. [3][5]
- Practical consequence: remote segment store provides durable copies of segment files in remote storage (useful for recovery/replicas), but unless you also enable remote translog or use request-level translog durability, an acknowledged write might not be safe until its translog is durably persisted (either local fsync or remote translog upload). In other words, for no-loss guarantees you must combine remote segment store with request-level translog durability (or local fsync semantics). [1][2][5]
Sources:
- OpenSearch docs, Remote-backed storage / remote store (includes refresh-level vs request-level durability, translog.durability). [1][2]
- OpenSearch design/issue discussions on fsync and remote translog/segment proposals (design rationale: fsync on commit, remote translog for request-level durability). [3][5]
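As a concrete illustration of the request-level translog durability mentioned in the result above, here is a minimal sketch using OpenSearch's `Settings` builder. The class name and the way the settings would be applied to an index are assumptions for illustration only; this PR does not touch translog settings.

```java
import org.opensearch.common.settings.Settings;

// Illustrative only: builds the index-level setting that switches the translog
// to request-level durability, so an acknowledged write is fsynced locally (or
// uploaded to the remote translog) before the request returns.
public final class TranslogDurabilityExample {

    public static Settings requestDurabilitySettings() {
        return Settings.builder()
            .put("index.translog.durability", "request")
            .build();
    }
}
```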
🌐 Web query:
OpenSearch issue 19658 NoSuchFileException sync evicted files
💡 Result:
I couldn't find an OpenSearch issue #19658 or any matching report about "NoSuchFileException sync evicted files" in the OpenSearch repos or issues. Could you confirm:
- which repository/org (opensearch-project/OpenSearch or another) and
- a link, or the exact error log / stacktrace?
If you want, I can run a targeted search of GitHub issues, PRs and mailing lists for that repo — confirm and I'll search and summarize results with links.
Data durability is compromised by making sync() a complete no-op, which violates Lucene's fsync semantics.
The web search confirms a critical issue with this change: remote segment uploads are asynchronous and do not replace Lucene's fsync requirements for commit durability. OpenSearch documentation explicitly states that "fsync behavior: Lucene commit/flush uses fsync to make a commit point durable on local disk. OpenSearch still relies on fsync for commit stability... remote segment uploads do not replace the Lucene fsync semantics on the primary."
Making sync() a no-op means:
- Files evicted before being uploaded will never be fsynced
- Uncommitted translog entries may be lost if the primary crashes before remote upload completes
- Violates Lucene's Directory contract that sync() ensures durability via fsync
- Remote segment store provides additional redundancy but is not a durability substitute
Safer alternatives that maintain durability (a sketch follows this list):

- Catch and skip only evicted files (minimal change, preserves sync for existing files):
  - Check if the file exists before attempting fsync
  - Skip only files that were already evicted, sync all others
- Sync before marking for eviction (safest approach):
  - Call localDirectory.sync() for a file before it is unpinned/marked for eviction in afterSyncToRemote()
  - Prevents the race condition entirely by guaranteeing fsync before eviction is possible
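A minimal sketch of the first alternative, reusing the fields and helpers already visible in the diff above (ensureOpen(), listLocalFiles(), localDirectory, logger); this illustrates the suggestion and is not the implementation taken by this PR:

```java
@Override
public void sync(Collection<String> names) throws IOException {
    ensureOpen();
    // Only fsync files that still exist locally; files already evicted to the
    // remote store are skipped instead of letting localDirectory.sync() fail
    // with NoSuchFileException.
    Set<String> localFiles = Set.of(listLocalFiles());
    Collection<String> presentLocally = names.stream()
        .filter(localFiles::contains)
        .collect(Collectors.toList());
    logger.trace("Composite Directory[{}]: syncing locally present files {}", this::toString, () -> presentLocally);
    localDirectory.sync(presentLocally);
}
```

The second alternative would leave sync(names) intact and instead call localDirectory.sync(Collections.singleton(fileName)) inside afterSyncToRemote() before the file is unpinned, so a file can never become evictable while it is still un-fsynced.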
The current no-op approach trades off a runtime exception against data loss risk, which is not an acceptable tradeoff for a durability operation.
❌ Gradle check result for 81c0cd0: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Gagan Singh Saini <gagasa@amazon.com>
❌ Gradle check result for db4facb: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Gagan Singh Saini <gagasa@amazon.com>
❌ Gradle check result for 27a6673: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Description
Making the Composite directory sync method a no-op to fix NoSuchFileException.

Related Issues
Resolves #19658
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.