Conversation

Contributor

@sgup432 sgup432 commented Aug 8, 2025

Description

Within the tiered cache, when a query runs into an exception (for example a timeout, or the parent task cancelling it), an NPE is thrown. Although the NPE is eventually swallowed, it internally causes the query/key to not be removed from the temporary map that is responsible for handling concurrent requests for the same key.

This leaves the query/key stuck in that map with the exception as its value. When the user runs the same query again, the stored exception is returned from the map instead of the value being recomputed, even though a recomputation might well succeed. As a result, the user sees the same exception over and over for the same query/key, until the next refresh/invalidation changes the key itself and forces a recompute.
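The failure mode can be reproduced with a minimal, self-contained sketch. This is not the actual `TieredSpilloverCache` code; the map and method names here (`inFlight`, `compute`) are hypothetical stand-ins for the temporary per-key future map described above:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the bug: a future that completed exceptionally is left behind in
// the de-duplication map, so every later request for the same key replays the
// old exception instead of recomputing the value.
public class StuckFutureSketch {
    static final Map<String, CompletableFuture<String>> inFlight = new ConcurrentHashMap<>();

    static String compute(String key, boolean fail) {
        CompletableFuture<String> future = inFlight.computeIfAbsent(key, k -> {
            CompletableFuture<String> f = new CompletableFuture<>();
            try {
                if (fail) throw new RuntimeException("timeout");
                f.complete("value-for-" + k);
            } catch (RuntimeException e) {
                f.completeExceptionally(e);
                // BUG: without inFlight.remove(k) on this path, the failed
                // future stays in the map and poisons all later lookups.
            }
            return f;
        });
        try {
            return future.join();
        } catch (CompletionException e) {
            return "error: " + e.getCause().getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(compute("query1", true));   // first attempt fails
        System.out.println(compute("query1", false));  // retry would succeed, but the stale entry wins
    }
}
```

Both calls return `error: timeout`: the second request never gets a chance to recompute because the exceptionally completed future is still sitting in the map. The fix is to remove the key from the map on the exception path as well.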

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

sgup432 added 2 commits August 8, 2025 14:08
Signed-off-by: Sagar Upadhyaya <sagar.upadhyaya.121@gmail.com>
Signed-off-by: Sagar Upadhyaya <sagar.upadhyaya.121@gmail.com>
sgup432 added 2 commits August 8, 2025 14:22
Signed-off-by: Sagar Upadhyaya <sagar.upadhyaya.121@gmail.com>
Signed-off-by: Sagar Upadhyaya <sagar.upadhyaya.121@gmail.com>
Contributor

@jainankitk jainankitk left a comment

LGTM!

@github-actions
Contributor

github-actions bot commented Aug 8, 2025

✅ Gradle check result for 802babf: SUCCESS

@codecov

codecov bot commented Aug 8, 2025

Codecov Report

❌ Patch coverage is 88.23529% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.92%. Comparing base (9f13e37) to head (802babf).
⚠️ Report is 35 commits behind head on main.

Files with missing lines Patch % Lines
...search/cache/common/tier/TieredSpilloverCache.java 88.23% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #19000      +/-   ##
============================================
+ Coverage     72.88%   72.92%   +0.04%     
- Complexity    69327    69380      +53     
============================================
  Files          5643     5645       +2     
  Lines        318720   318787      +67     
  Branches      46113    46125      +12     
============================================
+ Hits         232294   232479     +185     
+ Misses        67595    67496      -99     
+ Partials      18831    18812      -19     

@jainankitk jainankitk merged commit f967a72 into opensearch-project:main Aug 8, 2025
31 checks passed
@kkhatua
Member

kkhatua commented Aug 11, 2025

this internally causes the query/key to not be removed from the temporary map here which is responsible to handle concurrent requests for the same key.

@sgup432 ... good find! However, this looks like a design flaw: an incorrect entry is made into the cache first, and then we need a check to remove it. Is there a performance benefit to writing to the cache and then cleaning up later?

@sgup432
Contributor Author

sgup432 commented Aug 11, 2025

However, this looks like a design flaw in that an incorrect entry has been made into the cache first and then we need to perform a check to remove. Is there a performance benefit to writing to the cache, and then cleaning up later?

Yeah, this is needed. This logic handles concurrent requests for the same key.
Let's say there is a query1, and 5 users send the same query1 at the same time, all landing on the request cache.

Here query1 was not in the cache already, so we need to compute the value and then cache the response (which is how the request cache works).

In this case, with 5 duplicate requests for the same query1, we want to avoid computing the value for all 5 requests, as that is redundant in terms of CPU/JVM. We use this map to handle that scenario. In a nutshell, only 1 request succeeds in putting a future into the map, and it computes the value. The other 4 requests see (in the map) that the same query1 is already running, so they all wait (via the future) and reuse the response, avoiding the extra CPU/JVM cost.
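The scheme described above is a "single-flight" pattern. A minimal sketch follows; the names (`inFlight`, `get`) are hypothetical and this is not the actual `TieredSpilloverCache` implementation. The key point, and the fix in this PR, is that the entry must be removed from the map on both the success and the failure path:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Single-flight sketch: among concurrent requests for the same key, one
// thread wins the putIfAbsent race and computes; the others wait on the
// winner's future and reuse its result.
public class SingleFlightSketch {
    static final Map<String, CompletableFuture<String>> inFlight = new ConcurrentHashMap<>();
    static final AtomicInteger computations = new AtomicInteger();

    static String get(String key) throws Exception {
        CompletableFuture<String> mine = new CompletableFuture<>();
        CompletableFuture<String> theirs = inFlight.putIfAbsent(key, mine);
        if (theirs != null) {
            // Another thread is already computing this key: wait and reuse.
            return theirs.join();
        }
        try {
            computations.incrementAndGet();
            Thread.sleep(50); // simulate an expensive computation
            String value = "value-for-" + key;
            mine.complete(value);
            return value;
        } finally {
            // Remove on success AND on failure, so a stale entry (and a stale
            // exception) can never be replayed to later requests.
            inFlight.remove(key);
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(5);
        List<Future<String>> results = pool.invokeAll(
            Collections.nCopies(5, (Callable<String>) () -> get("query1")));
        for (Future<String> r : results) System.out.println(r.get());
        pool.shutdown();
        System.out.println("computations: " + computations.get());
    }
}
```

All 5 callers get the same `value-for-query1`; typically only one of them pays the computation cost. The bug this PR fixes corresponds to skipping the `finally` cleanup when the computation throws.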

@sgup432 sgup432 deleted the tiered_cache_handle_exception branch August 12, 2025 18:20
RajatGupta02 pushed a commit to RajatGupta02/OpenSearch that referenced this pull request Aug 18, 2025
…#19000)

Signed-off-by: Sagar Upadhyaya <sagar.upadhyaya.121@gmail.com>
kh3ra pushed a commit to kh3ra/OpenSearch that referenced this pull request Sep 5, 2025
…#19000)

Signed-off-by: Sagar Upadhyaya <sagar.upadhyaya.121@gmail.com>
vinaykpud pushed a commit to vinaykpud/OpenSearch that referenced this pull request Sep 26, 2025
…#19000)

Signed-off-by: Sagar Upadhyaya <sagar.upadhyaya.121@gmail.com>
