Add circuit breaking logic for GET _mappings to estimate the size of the mappings before executing a request #19857

cwperks · 2025-10-31T20:37:29Z

Description

This is a proposal to solve a longstanding stability issue that comes from using Dev Tools in OpenSearch Dashboards. As part of autocomplete, Dev Tools issues a GET _mappings call, which is expensive on clusters with a large number of indices and mappings for those indices.

I propose enhancing the circuit breaker to eagerly reject these requests if there is not enough heap to handle them. Setting this to Draft to solicit comments on whether this is the right direction for a fix.
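A minimal sketch of the estimate-then-reserve idea, assuming the size of each index's mappings can be estimated before the response is built. `SketchBreaker`, `MappingsSizeGuard`, and the per-index size table are hypothetical names used only for illustration; this is not the code in this PR and not OpenSearch's own circuit breaker implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical, simplified stand-in for a memory circuit breaker. */
final class SketchBreaker {
    private final long limitBytes;
    private final AtomicLong usedBytes = new AtomicLong();

    SketchBreaker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Reserves the bytes, or rolls back and throws if the limit would be exceeded. */
    void reserveOrBreak(long bytes, String label) {
        long newUsed = usedBytes.addAndGet(bytes);
        if (newUsed > limitBytes) {
            usedBytes.addAndGet(-bytes); // undo the reservation before rejecting
            throw new IllegalStateException(
                "[" + label + "] would use " + newUsed + " bytes, limit is " + limitBytes);
        }
    }

    void release(long bytes) {
        usedBytes.addAndGet(-bytes);
    }

    long used() {
        return usedBytes.get();
    }
}

/** Hypothetical guard that estimates the response size before doing the work. */
final class MappingsSizeGuard {
    /** Sums the (assumed known) mapping sizes of the requested indices. */
    static long estimateBytes(Map<String, Long> mappingBytesByIndex, List<String> indices) {
        long total = 0;
        for (String index : indices) {
            total += mappingBytesByIndex.getOrDefault(index, 0L);
        }
        return total;
    }

    /** Reserves the estimate, runs the action, and always releases the reservation. */
    static <T> T runGuarded(SketchBreaker breaker, long estimateBytes, Supplier<T> action) {
        breaker.reserveOrBreak(estimateBytes, "get_mappings");
        try {
            return action.get();
        } finally {
            breaker.release(estimateBytes);
        }
    }
}
```

The point of the pattern is that the rejection happens before any response is materialized, so an oversized request fails fast instead of allocating heap first.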

Related Issues

Attempt to fix:

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

cwperks · 2025-10-31T21:51:55Z

Note to self: GET _mappings is handled by the cluster manager but the circuit breaking logic is checked on the coordinator that receives the request.

github-actions · 2025-10-31T21:57:12Z

❌ Gradle check result for c14a50a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-11-04T03:31:35Z

❌ Gradle check result for f04c2f1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-11-05T22:19:14Z

❌ Gradle check result for fc36b7d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-11-05T23:08:22Z

❌ Gradle check result for 117c434: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

kaushalmahi12 · 2025-11-06T04:44:29Z

@cwperks Thanks for making this change, but the mentioned issues are about dev-console firing too many cluster manager requests, which lead to timeout-based health check failures rather than OOMs. CBs are primarily designed to avoid OOMs in OpenSearch.

github-actions · 2025-11-06T04:47:11Z

❌ Gradle check result for c7443c6: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

cwperks · 2025-11-06T19:29:32Z

@kaushalmahi12 capturing some of the knowledge here for anyone reading.

Since 2.13, OpenSearch has an optimization for Cluster manager read operations that allows them to run on any node to alleviate pressure on the cluster manager.

The read operations are run on the node that receives the request and are run async. Cluster manager write operations must be handled by the active cluster manager and are processed serially.

This is the relevant code that dispatches the cluster manager operation: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/action/support/clustermanager/TransportClusterManagerNodeAction.java#L475-L504

kaushalmahi12 · 2025-11-06T19:30:08Z

...alClusterTest/java/org/opensearch/action/admin/indices/mapping/get/GetMappingsBreakerIT.java

+        ensureGreen(index);
+
+        // Now call GET _mappings and expect the REQUEST breaker to trip.
+        final GetMappingsRequest req = new GetMappingsRequest().indices(index);


Can we write a test that fires multiple GET mappings calls, to ensure that the operations execute in parallel and trigger the CB?

Added a new IT suite that runs requests concurrently. Each individual request is below the CB limit, but in aggregate it shows that it trips the parent CB.
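A rough sketch of why concurrent requests can trip a shared breaker even when each one is under the limit. It assumes a single shared byte counter, which is a simplification of how a parent breaker accounts for usage; `SharedBreakerDemo` and its nested `Breaker` are hypothetical and are not the integration test added in this PR.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

final class SharedBreakerDemo {
    /** Hypothetical shared breaker: one counter for all in-flight requests. */
    static final class Breaker {
        private final long limitBytes;
        private final AtomicLong usedBytes = new AtomicLong();

        Breaker(long limitBytes) {
            this.limitBytes = limitBytes;
        }

        boolean tryReserve(long bytes) {
            long newUsed = usedBytes.addAndGet(bytes);
            if (newUsed > limitBytes) {
                usedBytes.addAndGet(-bytes); // roll back the failed reservation
                return false;
            }
            return true;
        }

        void release(long bytes) {
            usedBytes.addAndGet(-bytes);
        }
    }

    /** Runs the requests concurrently and returns how many were rejected. */
    static int countTripped(long limitBytes, long bytesPerRequest, int concurrentRequests) {
        Breaker breaker = new Breaker(limitBytes);
        CountDownLatch allAttempted = new CountDownLatch(concurrentRequests);
        AtomicInteger tripped = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(concurrentRequests);
        for (int i = 0; i < concurrentRequests; i++) {
            pool.execute(() -> {
                boolean reserved = breaker.tryReserve(bytesPerRequest);
                if (!reserved) {
                    tripped.incrementAndGet();
                }
                allAttempted.countDown();
                try {
                    // Hold the reservation until every request has attempted to reserve,
                    // so that all successful reservations overlap in time.
                    allAttempted.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    if (reserved) {
                        breaker.release(bytesPerRequest);
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("requests did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return tripped.get();
    }
}
```

With a 1000-byte limit and 300 bytes per request, three overlapping requests fit and a fourth is rejected, even though 300 bytes alone is well under the limit.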

github-actions · 2025-11-06T19:54:21Z

❌ Gradle check result for c7443c6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-11-06T22:03:21Z

❌ Gradle check result for 6b861b2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

kaushalmahi12 · 2025-11-07T16:08:13Z

Although it is a great start, I feel a throughput-based rate limiter on cluster state read APIs would be more effective, since the Request CB is shared across all requests. What do you think, @cwperks?

cwperks · 2025-11-07T16:49:50Z

> Although it is a great start, I feel a throughput-based rate limiter on cluster state read APIs would be more effective, since the Request CB is shared across all requests. What do you think, @cwperks?

Yeah, I agree, especially if the same caller is requesting the same data multiple times without any change.
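One possible shape for such a limiter is a token bucket per caller. The sketch below is only an illustration under assumptions: `ReadApiRateLimiter`, `TokenBucket`, the caller-ID key, and the injected nanosecond clock are all hypothetical and do not correspond to an existing OpenSearch API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical token bucket: allows short bursts, then a steady refill rate. */
final class TokenBucket {
    private final long capacity;
    private final long nanosPerToken;
    private final LongSupplier nanoClock;
    private long tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, long nanosPerToken, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.nanosPerToken = nanosPerToken;
        this.nanoClock = nanoClock;
        this.tokens = capacity; // start full so an initial burst is allowed
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    synchronized boolean tryAcquire() {
        long elapsed = nanoClock.getAsLong() - lastRefillNanos;
        long newTokens = elapsed / nanosPerToken;
        if (newTokens > 0) {
            tokens = Math.min(capacity, tokens + newTokens);
            // Advance only by whole tokens so partial progress is not lost.
            lastRefillNanos += newTokens * nanosPerToken;
        }
        if (tokens > 0) {
            tokens--;
            return true;
        }
        return false;
    }
}

/** Hypothetical limiter that keeps one bucket per caller of a read API. */
final class ReadApiRateLimiter {
    private final ConcurrentHashMap<String, TokenBucket> buckets = new ConcurrentHashMap<>();
    private final long burst;
    private final long nanosPerToken;
    private final LongSupplier nanoClock;

    ReadApiRateLimiter(long burst, long nanosPerToken, LongSupplier nanoClock) {
        this.burst = burst;
        this.nanosPerToken = nanosPerToken;
        this.nanoClock = nanoClock;
    }

    /** Returns true if the caller may proceed, false if the request should be rejected. */
    boolean tryAcquire(String callerId) {
        return buckets
            .computeIfAbsent(callerId, id -> new TokenBucket(burst, nanosPerToken, nanoClock))
            .tryAcquire();
    }
}
```

Keying the buckets by caller means one chatty client is throttled without affecting others; in production code the clock would typically be `System::nanoTime`.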

Another thing I am looking into is reducing the in-memory size of the cluster state with respect to mappings. If two indices share the same mappings, I think we should store them in memory only once, whereas we currently store the same map in memory for each concrete index that has those mappings.
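The deduplication idea can be sketched as interning: equal mappings resolve to one canonical instance that every index then references. `MappingInterner` is a hypothetical name, and representing a mapping as a `Map<String, Object>` is an assumption made for illustration, not how the cluster state stores mappings.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical interner: equal mappings share a single in-memory instance. */
final class MappingInterner {
    private final Map<Map<String, Object>, Map<String, Object>> canonical = new HashMap<>();

    /** Returns the canonical instance for the given mapping content. */
    synchronized Map<String, Object> intern(Map<String, Object> mapping) {
        // Take an immutable copy so the key cannot change after insertion.
        // Note: the copy is shallow, so nested values are assumed to be immutable.
        Map<String, Object> frozen = Map.copyOf(mapping);
        Map<String, Object> existing = canonical.putIfAbsent(frozen, frozen);
        return existing != null ? existing : frozen;
    }

    synchronized int distinctMappings() {
        return canonical.size();
    }
}
```

With this approach, a thousand indices created from the same template would hold a thousand references to one map rather than a thousand copies. A real implementation would also need to drop canonical entries that no index references any more.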

cwperks added 3 commits October 31, 2025 10:56

WIP for estimating size in bytes for GET _mapping API for circuit bre…

beb3f7f

…aking Signed-off-by: Craig Perkins <cwperx@amazon.com>

Fix tests

eff861a

Signed-off-by: Craig Perkins <cwperx@amazon.com>

Attempt to calculate size for circuit breaker

c14a50a

Signed-off-by: Craig Perkins <cwperx@amazon.com>

This was referenced Nov 1, 2025

[AUTOCUT] Gradle Check Flaky Test Report for PruneFileCacheIT #19724

Open

[AUTOCUT] Gradle Check Flaky Test Report for ShardsLimitAllocationDeciderIT #19726

Open

cwperks added 4 commits November 3, 2025 21:50

Revert "Attempt to calculate size for circuit breaker"

a980a9a

This reverts commit c14a50a. Signed-off-by: Craig Perkins <cwperx@amazon.com>

Revert "Fix tests"

62336f1

This reverts commit eff861a. Signed-off-by: Craig Perkins <cwperx@amazon.com>

Revert "WIP for estimating size in bytes for GET _mapping API for cir…

64cbe24

…cuit breaking" This reverts commit beb3f7f. Signed-off-by: Craig Perkins <cwperx@amazon.com>

Move circuit breaker logic inside transport action

f04c2f1

Signed-off-by: Craig Perkins <cwperx@amazon.com>

Add unit tests

fc36b7d

Signed-off-by: Craig Perkins <cwperx@amazon.com>

Add integration test

117c434

Signed-off-by: Craig Perkins <cwperx@amazon.com>

Fix test

c7443c6

Signed-off-by: Craig Perkins <cwperx@amazon.com>

cwperks changed the title ~~Fix circuit breaking logic for GET _mappings to estimate the size of the mappings before executing a request~~ Add circuit breaking logic for GET _mappings to estimate the size of the mappings before executing a request Nov 6, 2025

kaushalmahi12 reviewed Nov 6, 2025

Add test with concurrent requests

6b861b2

Signed-off-by: Craig Perkins <cwperx@amazon.com>

kaushalmahi12 approved these changes Nov 6, 2025

This was referenced Nov 7, 2025

[AUTOCUT] Gradle Check Flaky Test Report for SearchRestCancellationIT #14311

Open

[AUTOCUT] Gradle Check Flaky Test Report for WarmIndexSegmentReplicationIT #18157

Open

cwperks mentioned this pull request Nov 7, 2025

Optimize getFieldFilter to only return a predicate when index has FLS restrictions for user opensearch-project/security#5777

Open

5 tasks

cwperks mentioned this pull request Nov 7, 2025

Only store single reference to mapping for indices that share the same mappings in the cluster state #19929

Draft

3 tasks
