Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Backport 2.x] Fix race with eviction when reading from FileCache #6630

Merged
merged 1 commit into from
Mar 11, 2023

Conversation

opensearch-trigger-bot[bot]
Copy link
Contributor

Backport d139ebc from #6592.

The previous implementation had an inherent race condition where a
zero-reference count IndexInput read from the cache could be evicted
before the IndexInput was cloned (and therefore had its reference count
incremented). Since the IndexInputs are stateful this is very bad. The
least-recently-used semantics meant that in a properly-configured system
this would be unlikely since accessing a zero-reference count item would
move it to be most-recently used and therefore least likely to be
evicted. However, there was still a latent bug that was possible to
encounter (see issue #6295).

The only way to fix this, as far as I can see, is to change the cache
behavior so that fetching an item from the cache atomically
increments its reference count. This also led to a change to
TransferManager to ensure that all requests for an item ultimately read
through the cache to eliminate any possibility of a race. I have
implement some concurrent unit tests that put the cache into a
worst-case thrashing scenario to ensure that concurrent access never
closes an IndexInput while it is still being used.

Signed-off-by: Andrew Ross <andrross@amazon.com>
(cherry picked from commit d139ebc)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationAllocationIT.testAllocationWithDisruption
      1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=pit/10_basic/Delete all}

@codecov-commenter
Copy link

Codecov Report

Merging #6630 (1797548) into 2.x (b392d2a) will increase coverage by 0.04%.
The diff coverage is 90.41%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@             Coverage Diff              @@
##                2.x    #6630      +/-   ##
============================================
+ Coverage     70.37%   70.41%   +0.04%     
- Complexity    59292    59336      +44     
============================================
  Files          4800     4799       -1     
  Lines        284964   284914      -50     
  Branches      41414    41405       -9     
============================================
+ Hits         200530   200621      +91     
+ Misses        67720    67532     -188     
- Partials      16714    16761      +47     
Impacted Files Coverage Δ
...arch/index/store/remote/utils/TransferManager.java 78.12% <61.11%> (+3.12%) ⬆️
...e/remote/file/OnDemandBlockSnapshotIndexInput.java 77.27% <100.00%> (+6.43%) ⬆️
...search/index/store/remote/filecache/FileCache.java 82.85% <100.00%> (-1.76%) ⬇️
...earch/index/store/remote/utils/cache/LRUCache.java 92.13% <100.00%> (+10.26%) ⬆️
...index/store/remote/utils/cache/SegmentedCache.java 88.00% <100.00%> (-0.68%) ⬇️

... and 510 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants