
[BUG] Circuit Breaker Triggered by Reverse Operation on Large Datasets #3925

@selsong

Description


What is the bug?
The reverse operation fails when applied to datasets larger than 10,000 rows.
When Calcite fallback is enabled:

  • Datasets between 12,000 and 26,000 rows consistently fail with timeouts.
  • Datasets above 26,000 rows trigger a fielddata circuit breaker due to excessive memory usage on the _id field.

When Calcite fallback is disabled:

  • All datasets larger than 10,000 rows, including those above 26,000, fail with timeouts, not circuit breaker exceptions.
  • The maximum row count that consistently succeeds under both modes is 10,000.

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Use a source with a large number of rows (e.g., source=big5)
  2. Apply a head operation with a row limit greater than 10,000
  3. Apply the reverse operation
  4. Execute the query
Query: source=big5 | head 10000 | reverse
Iterations: 3, Timeout: 180s
  Run 1: 993ms
  Run 2: 766ms
  Run 3: 766ms
  Percentiles: P90=993ms | P95=993ms
Query: source=big5 | head 12000 | reverse
  Run 1: FAILED (3982ms)
  Run 2: FAILED (3933ms)
  Run 3: FAILED (3944ms)
Query: source=big5 | head 30000 | reverse
  Run 1: FAILED (14189ms)
  Run 2: FAILED (3876ms)
  Run 3: FAILED (3912ms)
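The benchmarked queries all follow one pattern; a small Python sketch (the helper name is illustrative, not part of the SQL plugin) that generates the three pipelines used in the runs above:

```python
# Illustrative helper (not part of the SQL plugin): build the PPL
# pipelines used in the benchmark runs above.
def reverse_query(source: str, limit: int) -> str:
    """PPL pipeline: take the first `limit` rows, then reverse their order."""
    return f"source={source} | head {limit} | reverse"

# The three row counts benchmarked in the report:
for rows in (10_000, 12_000, 30_000):
    print(reverse_query("big5", rows))
```

Each generated string can be submitted as the query body to the SQL plugin's PPL REST endpoint to reproduce the corresponding run.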

What is the expected behavior?
The reverse operation should complete successfully on datasets larger than 10,000 rows. Instead, when Calcite fallback is enabled, it fails with a CircuitBreakingException once the fielddata for _id exceeds the configured breaker limit; when Calcite fallback is disabled, it fails with a timeout error.

CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12027099143/11.2gb], which is larger than the limit of [12025908428/11.1gb]]

Full error stack trace:

[2025-07-25T22:33:28,014][WARN ][o.o.i.b.fielddata        ] [ip-172-31-35-163] [fielddata] New used memory 12027099143 [11.2gb] for data of [_id] would be larger than configured breaker: 12025908428 [11.1gb], breaking
[2025-07-25T22:33:28,015][ERROR][o.o.s.p.r.RestPPLQueryAction] [ip-172-31-35-163] Error happened during query handling
java.lang.RuntimeException: java.sql.SQLException: exception while executing query: all shards failed
        at org.opensearch.sql.opensearch.executor.OpenSearchExecutionEngine.lambda$execute$6(OpenSearchExecutionEngine.java:203) ~[?:?]
        at java.base/java.security.AccessController.doPrivileged(AccessController.java:319) ~[?:?]
        ...
Caused by: org.opensearch.core.common.breaker.CircuitBreakingException: [fielddata] Data too large, data for [_id] would be [12027099143/11.2gb], which is larger than the limit of [12025908428/11.1gb]
        at org.opensearch.common.breaker.ChildMemoryCircuitBreaker.circuitBreak(ChildMemoryCircuitBreaker.java:104) ~[opensearch-3.0.0.jar:3.0.0]
        ...
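The breaker figures in the WARN line are worth a quick arithmetic check. A short Python sketch (constants copied verbatim from the log above):

```python
# Breaker figures copied from the WARN log line above.
GIB = 1024 ** 3
new_used = 12_027_099_143  # "New used memory ... [11.2gb]"
limit = 12_025_908_428     # "configured breaker: ... [11.1gb]"

overshoot = new_used - limit
print(f"requested: {new_used / GIB:.3f} GiB")
print(f"limit:     {limit / GIB:.3f} GiB")
print(f"overshoot: {overshoot} bytes (~{overshoot / 1024 ** 2:.2f} MiB)")
```

The request exceeds the limit by only about 1.1 MiB, which is consistent with the observation below that the breaker trips at the same memory threshold on every run.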

What is your host/environment?

  • OpenSearch Version: 3.0.0
  • Plugins: SQL plugin with the Calcite engine
  • Java Version: 17+
  • AWS EC2 instance, Ubuntu
  • Memory Configuration: Circuit breaker limit configured at 11.1GB for fielddata

Do you have any screenshots?
(Screenshot attached in the original issue.)

Do you have any additional context?

  • The issue appears to be specifically related to the _id field's memory usage during the reverse operation
  • The threshold between success and failure is between 10,000 and 12,000 rows
  • The circuit breaker is triggered very consistently at the same memory threshold (11.2GB vs 11.1GB limit)
  • This may affect other operations that need to sort large datasets in reverse order
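Since the memory pressure comes from sorting on _id server-side, one possible stopgap (an assumption on our part, not something the report proposes) is to drop the reverse step from the pipeline and flip the row order on the client:

```python
# Hypothetical client-side workaround (assumption, not from the report):
# run `source=big5 | head N` without `reverse`, then flip the row order
# locally instead of sorting on _id server-side.
def reverse_rows(rows: list) -> list:
    """Return the rows in reverse order without mutating the input."""
    return rows[::-1]

rows = [{"id": i} for i in range(5)]  # stand-in for the query response
print(reverse_rows(rows))
```

This only helps when the head-limited result set fits in client memory; it does not fix the underlying fielddata load.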

Metadata

Labels: PPL (Piped processing language), bug (Something isn't working), calcite (Calcite migration related)

Status: In progress
