
Fix race with eviction when reading from FileCache #6592

Merged
merged 1 commit into opensearch-project:main from concurrent-cache-bug on Mar 11, 2023

Conversation

@andrross (Member) commented Mar 9, 2023

The previous implementation had an inherent race condition where a zero-reference-count IndexInput read from the cache could be evicted before the IndexInput was cloned (and therefore had its reference count incremented). Since the IndexInputs are stateful, this is very bad. The least-recently-used semantics meant that in a properly configured system this was unlikely, since accessing a zero-reference-count item moves it to most-recently used and therefore makes it the least likely to be evicted. However, there was still a latent bug that was possible to encounter (see issue #6295).

The only way to fix this, as far as I can see, is to change the cache behavior so that fetching an item from the cache atomically increments its reference count. This also led to a change to TransferManager to ensure that all requests for an item ultimately read through the cache, eliminating any possibility of a race. I have implemented some concurrent unit tests that put the cache into a worst-case thrashing scenario to ensure that concurrent access never closes an IndexInput while it is still being used.
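The fix described above can be sketched as a minimal reference-counted map. This is a hypothetical illustration, not the actual FileCache implementation (class and method names are invented); the key point is that fetch-and-pin happens inside a single atomic compute step, so an evictor that only removes zero-reference entries can never race with a reader that has just fetched one.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a cache whose get operation atomically pins the entry.
class RefCountedCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final AtomicInteger refCount = new AtomicInteger(0);
        Entry(V value) { this.value = value; }
    }

    private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();

    void put(K key, V value) {
        map.put(key, new Entry<>(value));
    }

    // Fetching an entry increments its reference count inside the map's atomic
    // compute, so no window exists between reading the entry and pinning it.
    V acquire(K key) {
        Entry<V> e = map.computeIfPresent(key, (k, entry) -> {
            entry.refCount.incrementAndGet();
            return entry;
        });
        return e == null ? null : e.value;
    }

    void release(K key) {
        Entry<V> e = map.get(key);
        if (e != null) {
            e.refCount.decrementAndGet();
        }
    }

    // Eviction is also routed through compute: the mapping is removed
    // (by returning null) only while no reader holds a reference.
    boolean tryEvict(K key) {
        final boolean[] evicted = { false };
        map.computeIfPresent(key, (k, entry) -> {
            if (entry.refCount.get() == 0) {
                evicted[0] = true;
                return null; // remove mapping atomically
            }
            return entry; // keep: still referenced
        });
        return evicted[0];
    }
}
```

Because both `acquire` and `tryEvict` run inside the map's per-key atomic compute, the broken interleaving from the bug report (read entry, evict entry, clone entry) cannot occur.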

Issues Resolved

Closes #6536

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@codecov-commenter commented Mar 9, 2023

Codecov Report

Merging #6592 (a177172) into main (bdb4f7a) will increase coverage by 0.03%.
The diff coverage is 69.38%.


@@             Coverage Diff              @@
##               main    #6592      +/-   ##
============================================
+ Coverage     70.77%   70.80%   +0.03%     
+ Complexity    59156    59136      -20     
============================================
  Files          4804     4803       -1     
  Lines        283102   283118      +16     
  Branches      40813    40811       -2     
============================================
+ Hits         200361   200463     +102     
+ Misses        66305    66186     -119     
- Partials      16436    16469      +33     
Impacted Files Coverage Δ
...pensearch/common/logging/OpenSearchJsonLayout.java 0.00% <0.00%> (ø)
...in/java/org/opensearch/index/shard/IndexShard.java 69.73% <0.00%> (-0.52%) ⬇️
...s/replication/SegmentReplicationTargetService.java 49.03% <0.00%> (-0.97%) ⬇️
...x/store/remote/filecache/FileCachedIndexInput.java 54.16% <33.33%> (-1.39%) ⬇️
...g/opensearch/common/settings/WriteableSetting.java 69.58% <50.00%> (-2.00%) ⬇️
...arch/index/store/remote/utils/TransferManager.java 78.12% <61.11%> (+3.12%) ⬆️
...search/index/store/remote/filecache/FileCache.java 82.85% <76.47%> (-8.45%) ⬇️
...n/java/org/opensearch/common/settings/Setting.java 86.32% <100.00%> (+0.19%) ⬆️
.../main/java/org/opensearch/env/NodeEnvironment.java 76.64% <100.00%> (+0.25%) ⬆️
...e/remote/file/OnDemandBlockSnapshotIndexInput.java 77.27% <100.00%> (+6.43%) ⬆️
... and 2 more

... and 460 files with indirect coverage changes



// refcount = 0 at the beginning
FileCachedIndexInput newOrigin = new FileCachedIndexInput(fileCache, blobFetchRequest.getFilePath(), downloaded);
if (origin == null) {
Collaborator commented:
I am wondering why we don't use computeIfAbsent?

IndexInput origin = fileCache.computeIfAbsent(key, (key) -> downloadBlockLocally(...));

The fileCache.computeIfPresent would not be needed in this case since origin should not be null (or should be re-downloaded again).

Member replied:
We do not keep the IndexInput open at all times. In the cache-restore case, we insert a ClosedIndexInput instance that just holds the length of the index input to keep track of capacity. We do need a way to track the isClosed state when the file hasn't been completely removed from local disk.

Collaborator replied:
We do need a way to track the isClosed state when the file hasn't been completely removed from local disk.

Correct, those should be removed, right? And computeIfAbsent would work as expected

Member (Author) replied:
I think all of this can be replaced with a single V compute(K key, BiFunction<K, V, V> remappingFunction) call. It can handle all three cases:

  • exists and is open: return it as-is
  • exists and is closed: remap it to an open IndexInput
  • does not exist: download and create a new one
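A minimal sketch of that three-way compute() call (types and names below are hypothetical stand-ins, not the actual TransferManager code):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class ThreeWayCompute {
    // Stand-in for a cached IndexInput; "closed" models the placeholder
    // entries restored on startup. The "origin" field is only for illustration.
    record CachedInput(boolean closed, String origin) {}

    static final ConcurrentHashMap<String, CachedInput> cache = new ConcurrentHashMap<>();

    // All three cases are decided inside one atomic compute invocation.
    static CachedInput fetch(String key, Supplier<CachedInput> download) {
        return cache.compute(key, (k, cached) -> {
            if (cached == null) {
                return download.get();                      // absent: download and create
            } else if (cached.closed()) {
                return new CachedInput(false, "reopened");  // closed placeholder: reopen from disk
            } else {
                return cached;                              // open: return as-is
            }
        });
    }
}
```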

Member replied:
Is your suggestion along the lines of:

// Do download/network fetches with this block
IndexInput origin = fileCache.computeIfAbsent(key, (key) -> downloadBlockLocally(...));
// Once we have the block, either downloaded or on disk, do the following?
if (cachedIndexInput.isClosed()) {
}

Member (Author) replied:
We do need a way to track the isClosed state when the file hasn't been completely removed from local disk.

Correct, those should be removed, right? And computeIfAbsent would work as expected

@reta We still need to handle the closed-IndexInput case. On startup, we populate the cache with placeholder IndexInput entries and defer creating the real IndexInput, which actually opens a handle to the file on disk, until it is actually needed.

// Another thread is downloading the same resource. Wait for it
// to complete then make a recursive call to fetch it from the
// cache.
existingLatch.await();
Member commented:
I am wondering if we need to add a termination mechanism here. It's an edge case, but if the downloads never complete we will block quite a few threads as requests keep coming in, given that we no longer use a dedicated threadpool.
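One possible termination mechanism is a bounded wait. The sketch below is a hypothetical illustration, not code from this PR; the method name and the idea of surfacing a TimeoutException are invented for the example:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedWait {
    // Wait for another thread's in-flight download, but give up after a
    // deadline instead of blocking the calling thread indefinitely.
    static void awaitDownload(CountDownLatch existingLatch, long timeout, TimeUnit unit)
            throws InterruptedException, TimeoutException {
        if (!existingLatch.await(timeout, unit)) {
            throw new TimeoutException("timed out waiting for concurrent download");
        }
    }
}
```

The caller could then fail the read (or retry the download itself) rather than remain parked forever on a stalled transfer.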

IndexInput origin = fileCache.computeIfPresent(blobFetchRequest.getFilePath(), (path, cachedIndexInput) -> {
if (cachedIndexInput.isClosed()) {
// if it's already in the file cache, but closed, open it and replace the original one
final IndexInput origin = fileCache.compute(key, (path, cachedIndexInput) -> {
Collaborator commented:
Looks much cleaner to me, thanks @andrross!

Member (Author) replied:
Thank you! This approach seems kind of obvious in hindsight. This is why code reviews are good :)

@andrross andrross force-pushed the concurrent-cache-bug branch 2 times, most recently from d7245ab to 1ad82a7 Compare March 10, 2023 21:47
@andrross (Member, Author) commented:
Another failure of #6531

@andrross (Member, Author) commented:
More SegmentReplicationRelocationIT failures...

Signed-off-by: Andrew Ross <andrross@amazon.com>
@kotwanikunal (Member) commented:
Whitesource issue is unrelated to the PR changes. Raised a PR for the fix: #6629

@kotwanikunal kotwanikunal merged commit d139ebc into opensearch-project:main Mar 11, 2023
@kotwanikunal kotwanikunal added the backport 2.x Backport to 2.x branch label Mar 11, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Mar 11, 2023
Signed-off-by: Andrew Ross <andrross@amazon.com>
(cherry picked from commit d139ebc)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
kotwanikunal pushed a commit that referenced this pull request Mar 11, 2023

(cherry picked from commit d139ebc)

Signed-off-by: Andrew Ross <andrross@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
mingshl pushed a commit to mingshl/OpenSearch-Mingshl that referenced this pull request Mar 24, 2023
Fix race with eviction when reading from FileCache (opensearch-project#6592)

Signed-off-by: Andrew Ross <andrross@amazon.com>
Signed-off-by: Mingshi Liu <mingshl@amazon.com>
@andrross andrross deleted the concurrent-cache-bug branch April 6, 2023 16:38
Labels
backport 2.x Backport to 2.x branch skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] [Searchable Snapshot] Index input clone races with close on cache eviction
4 participants