Skip to content

Conversation

@cwperks
Copy link
Member

@cwperks cwperks commented Oct 31, 2025

Description

This is a proposal to solve a longstanding stability issue that comes from using Dev Tools in OpenSearch Dashboards. As part of autocomplete, Dev Tools will issue a GET _mappings call which is an expensive call on clusters with a large number of indices and mappings for those indices.

I propose to enhance the circuit breaker to eagerly reject these requests if there is not enough heap to handle the request. Setting this to Draft to solicit comments on if this is the right direction for a fix.

Related Issues

Attempt to fix:

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…aking

Signed-off-by: Craig Perkins <cwperx@amazon.com>
Signed-off-by: Craig Perkins <cwperx@amazon.com>
Signed-off-by: Craig Perkins <cwperx@amazon.com>
@cwperks
Copy link
Member Author

cwperks commented Oct 31, 2025

Note to self: GET _mappings is handled by the cluster manager but the circuit breaking logic is checked on the coordinator that receives the request.

@github-actions
Copy link
Contributor

❌ Gradle check result for c14a50a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

This reverts commit c14a50a.

Signed-off-by: Craig Perkins <cwperx@amazon.com>
This reverts commit eff861a.

Signed-off-by: Craig Perkins <cwperx@amazon.com>
…cuit breaking"

This reverts commit beb3f7f.

Signed-off-by: Craig Perkins <cwperx@amazon.com>
Signed-off-by: Craig Perkins <cwperx@amazon.com>
@github-actions
Copy link
Contributor

github-actions bot commented Nov 4, 2025

❌ Gradle check result for f04c2f1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Craig Perkins <cwperx@amazon.com>
@github-actions
Copy link
Contributor

github-actions bot commented Nov 5, 2025

❌ Gradle check result for fc36b7d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Craig Perkins <cwperx@amazon.com>
@github-actions
Copy link
Contributor

github-actions bot commented Nov 5, 2025

❌ Gradle check result for 117c434: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Craig Perkins <cwperx@amazon.com>
@kaushalmahi12
Copy link
Contributor

@cwperks Thanks for making this change but the mentioned issues are about dev-console firing too many cluster manager requests which are leading to timeout based health check failures rather OOMs. CBs primarily designed to avoid OOMs in OpenSearch

@github-actions
Copy link
Contributor

github-actions bot commented Nov 6, 2025

❌ Gradle check result for c7443c6: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@cwperks cwperks changed the title Fix circuit breaking logic for GET _mappings to estimate the size of the mappings before executing a request Add circuit breaking logic for GET _mappings to estimate the size of the mappings before executing a request Nov 6, 2025
@cwperks
Copy link
Member Author

cwperks commented Nov 6, 2025

@kaushalmahi12 capturing some of the knowledge here for anyone reading.

Since 2.13, OpenSearch has an optimization for Cluster manager read operations that allows them to run on any node to alleviate pressure on the cluster manager.

The read operations are run on the node that receives the request and are run async. Cluster manager write operations must be handled by the active cluster manager and are processed serially.

This is the relevant code that dispatches the cluster manager operation: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/action/support/clustermanager/TransportClusterManagerNodeAction.java#L475-L504

ensureGreen(index);

// Now call GET _mappings and expect the REQUEST breaker to trip.
final GetMappingsRequest req = new GetMappingsRequest().indices(index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we write a test firing multiple GET mappings call to ensure that operation execution is parallel and triggering the CB

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new IT suite that runs requests concurrently. Each individual request is below the CB limit, but in aggregate it shows that it trips the parent CB.

@github-actions
Copy link
Contributor

github-actions bot commented Nov 6, 2025

❌ Gradle check result for c7443c6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Craig Perkins <cwperx@amazon.com>
@github-actions
Copy link
Contributor

github-actions bot commented Nov 6, 2025

❌ Gradle check result for 6b861b2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@kaushalmahi12
Copy link
Contributor

Although it is a great start, but I feel throughput based rate limiter on cluster state read APIs would be more effective since the Request CB is shared across all the requests. What do you think @cwperks ?

@cwperks
Copy link
Member Author

cwperks commented Nov 7, 2025

Although it is a great start, but I feel throughput based rate limiter on cluster state read APIs would be more effective since the Request CB is shared across all the requests. What do you think @cwperks ?

Yea I agree. Esp if the same caller is requesting the same data multiple times w/o change.

Another thing I am looking into is reduce the in-memory size of the cluster state wrt mappings. If 2 indices share the same mappings then I think we should only store that in-memory once whereas we are currently storing the same map in memory for each concrete index that has the mappings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants