-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Add circuit breaking logic for GET _mappings to estimate the size of the mappings before executing a request #19857
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…aking Signed-off-by: Craig Perkins <cwperx@amazon.com>
Signed-off-by: Craig Perkins <cwperx@amazon.com>
|
Note to self: GET _mappings is handled by the cluster manager but the circuit breaking logic is checked on the coordinator that receives the request. |
|
❌ Gradle check result for c14a50a: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
❌ Gradle check result for f04c2f1: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Craig Perkins <cwperx@amazon.com>
|
❌ Gradle check result for fc36b7d: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Craig Perkins <cwperx@amazon.com>
|
❌ Gradle check result for 117c434: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
@cwperks Thanks for making this change but the mentioned issues are about dev-console firing too many cluster manager requests which are leading to timeout based health check failures rather OOMs. CBs primarily designed to avoid OOMs in OpenSearch |
|
❌ Gradle check result for c7443c6: null Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
@kaushalmahi12 capturing some of the knowledge here for anyone reading. Since 2.13, OpenSearch has an optimization for Cluster manager read operations that allows them to run on any node to alleviate pressure on the cluster manager. The read operations are run on the node that receives the request and are run async. Cluster manager write operations must be handled by the active cluster manager and are processed serially. This is the relevant code that dispatches the cluster manager operation: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/action/support/clustermanager/TransportClusterManagerNodeAction.java#L475-L504 |
| ensureGreen(index); | ||
|
|
||
| // Now call GET _mappings and expect the REQUEST breaker to trip. | ||
| final GetMappingsRequest req = new GetMappingsRequest().indices(index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we write a test firing multiple GET mappings call to ensure that operation execution is parallel and triggering the CB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a new IT suite that runs requests concurrently. Each individual request is below the CB limit, but in aggregate it shows that it trips the parent CB.
|
❌ Gradle check result for c7443c6: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Craig Perkins <cwperx@amazon.com>
|
❌ Gradle check result for 6b861b2: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
Although it is a great start, but I feel throughput based rate limiter on cluster state read APIs would be more effective since the Request CB is shared across all the requests. What do you think @cwperks ? |
Yea I agree. Esp if the same caller is requesting the same data multiple times w/o change. Another thing I am looking into is reduce the in-memory size of the cluster state wrt mappings. If 2 indices share the same mappings then I think we should only store that in-memory once whereas we are currently storing the same map in memory for each concrete index that has the mappings. |
Description
This is a proposal to solve a longstanding stability issue that comes from using Dev Tools in OpenSearch Dashboards. As part of autocomplete, Dev Tools will issue a
GET _mappingscall which is an expensive call on clusters with a large number of indices and mappings for those indices.I propose to enhance the circuit breaker to eagerly reject these requests if there is not enough heap to handle the request. Setting this to Draft to solicit comments on if this is the right direction for a fix.
Related Issues
Attempt to fix:
<index>/_mappingsrather than_mappingsAPI OpenSearch-Dashboards#10550Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.