
Use per-key latch to wait on file downloads #7015

Closed

Conversation

@andrross (Member) commented Apr 5, 2023

The lock that guards `FileCache.compute` is per-cache-segment, which means unrelated keys can get stuck waiting on one another. This change refactors the code to perform the download outside of the cache operation and uses a per-key latch mechanism to ensure that only requests for the exact same blob block on each other.

See this issue for details about the cache implementation. I think it is possible to rework the cache so that locking would be much more precise and this change would not be necessary. However, that is a bigger change, potentially with other tradeoffs, so I think this fix is a reasonable thing to do now.
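The per-key latch mechanism described above can be sketched roughly as follows. This is an illustrative simplification, not the PR's actual code; the class and method names (`PerKeyLatch`, `acquire`, `release`) are made up for the example:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Sketch of a per-key latch: the first caller to register a key wins and
// performs the download; concurrent callers for the same key wait on that
// key's latch, while callers for different keys proceed independently.
class PerKeyLatch<K> {
    private final Map<K, CountDownLatch> latches = new ConcurrentHashMap<>();

    /** Returns true if the caller won the race and must perform the download. */
    boolean acquire(K key) {
        final CountDownLatch mine = new CountDownLatch(1);
        final CountDownLatch existing = latches.putIfAbsent(key, mine);
        if (existing == null) {
            return true; // this thread is responsible for the download
        }
        try {
            existing.await(); // blocks only on the exact same key
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
        }
        return false;
    }

    /** Called by the winning thread once its download has completed. */
    void release(K key) {
        final CountDownLatch latch = latches.remove(key);
        if (latch != null) {
            latch.countDown(); // wake every waiter for this key
        }
    }
}
```

Requests for different keys never touch each other's latches, which is the point of the change: only the exact same blob serializes.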

Issues Resolved

Closes #7031

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

The lock that guards `FileCache.compute` is per-cache-segment, and
therefore means unrelated keys can get stuck waiting for one another.
This refactors the code to do the download outside of the cache
operation, and uses a per-key latch mechanism to ensure that only
requests for the exact same blob will block on each other.

See [this issue][1] for details about the cache implementation. I think
it is possible to re-work the cache so that locking would be much more
precise and this change would not be necessary. However, that is a
bigger change potentially with other tradeoffs, so I think this fix is a
reasonable thing to do now.

[1]: opensearch-project#6225 (comment)

Signed-off-by: Andrew Ross <andrross@amazon.com>
@andrross (Member, Author) commented Apr 5, 2023

@reta I'm interested in your thoughts here. There are performance implications that I hadn't considered in this discussion on a previous PR.

@github-actions (bot) commented Apr 5, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions (bot) commented Apr 5, 2023

Gradle Check (Jenkins) Run Completed with:

@reta (Collaborator) commented Apr 5, 2023

> @reta I'm interested in your thoughts here. There are performance implications that I hadn't considered in this discussion on a previous PR.

Ah I see, we would indeed block on the segment, not the key. I honestly don't feel great about latches (in this particular case), for basically three reasons:

  • we need to keep state in two different places (the latches map and the file cache)
  • we would probably still have a race when the latch is removed and another thread puts it back, even if the input has already been fetched
  • it is very difficult to reason about the control flow (the `await` and `countDown` are in different places)

It seems like we need a reliable primitive to access the value behind a particular key, since its computation takes time and we don't want to lock the whole segment. The idea I wanted to try out is to use a `CompletableFuture` instead of latches to fence all operations over a `FileCachedIndexInput`: the first thread puts the `CompletableFuture` and kicks off the download; other threads wait on `CompletableFuture::join` to get the value when it is ready.

I have not tried to prototype it yet, but wdyt?
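A minimal sketch of that `CompletableFuture` idea might look like the following. The class name `PerKeyFutureCache` and the failure handling are assumptions for illustration, not something proposed in this thread:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch: the first thread installs a CompletableFuture for the key and runs
// the download; later threads find the existing future and join it. State
// lives in a single map, and the value is published exactly once.
class PerKeyFutureCache<K, V> {
    private final Map<K, CompletableFuture<V>> futures = new ConcurrentHashMap<>();

    V computeIfAbsent(K key, Supplier<V> download) {
        final CompletableFuture<V> mine = new CompletableFuture<>();
        final CompletableFuture<V> existing = futures.putIfAbsent(key, mine);
        if (existing != null) {
            return existing.join(); // wait only for the exact same key
        }
        try {
            final V value = download.get(); // e.g. fetch the blob
            mine.complete(value);           // completed future stays cached
            return value;
        } catch (RuntimeException e) {
            mine.completeExceptionally(e);
            futures.remove(key, mine); // allow a retry after a failed download
            throw e;
        }
    }
}
```

Compared with latches, the waiting and the hand-off of the value happen through the same object, which addresses the control-flow concern above.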

@peterzhuamazon (Member) commented:

Gradle check is fixed now; rerun at your will.

@andrross (Member, Author) commented Apr 5, 2023

> It seems like we need a reliable primitive to access the value behind the particular key since its computation takes time

@reta Yeah, this is exactly right. I'm thinking the cached object itself should not be an IndexInput at all, but rather something capable of creating (or returning a previously created) IndexInput. It would need to know the length of the file (for the cache weigher function to work) and then provide a thread-safe mechanism for getting or creating the IndexInput (perhaps using a CompletableFuture). The signature changes might have a bit of a ripple effect, but the overall change shouldn't be too complex. Does that sound right?
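A rough illustration of that shape of cache value: it knows its length up front so the weigher works without waiting for a download, and defers creation of the actual value behind a future so concurrent readers share one download. The class name `CachedFileEntry` and all details are hypothetical, not the eventual implementation:

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical cache value: exposes the file length immediately (for the
// cache's weigher function) and lazily creates the underlying value exactly
// once, behind a CompletableFuture shared by all callers.
class CachedFileEntry<T> {
    private final long length;
    private final Supplier<T> factory;
    private volatile CompletableFuture<T> future;

    CachedFileEntry(long length, Supplier<T> factory) {
        this.length = length;
        this.factory = factory;
    }

    long length() {
        return length; // available without triggering any download
    }

    T get() {
        CompletableFuture<T> f = future;
        if (f == null) {
            synchronized (this) {
                if (future == null) {
                    future = CompletableFuture.supplyAsync(factory);
                }
                f = future;
            }
        }
        return f.join(); // callers block only on this entry's own creation
    }
}
```

In the real code the value would be an IndexInput rather than a generic `T`; the generic keeps the sketch self-contained.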

@reta (Collaborator) commented Apr 5, 2023

> Does that sound right?

Absolutely, it does

@andrross andrross marked this pull request as draft April 6, 2023 16:58
@andrross (Member, Author) commented Apr 6, 2023

Moved to draft until the idea in #7015 (comment) is implemented

@andrross andrross closed this Apr 12, 2023
@andrross andrross deleted the per-key-latch branch May 7, 2024 21:54
Successfully merging this pull request may close these issues.

[Searchable Snapshots] Downloads should block only on downloads to the exact same key