Skip to content

Conversation

@LantaoJin
Copy link
Member

@LantaoJin LantaoJin commented Aug 5, 2025

Description

In v2, ResourceMonitor gets the memory usage by using total memory - free memory of runtime.
But in v3, Calcite will produce lots of recyclable memory garbage during its planning and implementation process. For Clickbench q30, it produces around 70 mb in optimizing + 30 mb in implementation on local test, which can all be GC, ref #3971 (comment)

This PR changes the behaviour of ResourceMonitor for v3:
ResourceMonitor only checks the memory health by calculating the memory usage after GC of old gen.

Related Issues

Also can improve/resolve the issue of #3750

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ry usage after GC of old gen

Signed-off-by: Lantao Jin <ltjin@amazon.com>
@LantaoJin LantaoJin added enhancement New feature or request performance Make it fast! calcite calcite migration releated labels Aug 5, 2025
Comment on lines +115 to +120
if (t instanceof Exception) {
listener.onFailure((Exception) t);
} else if (t instanceof VirtualMachineError) {
// throw and fast fail the VM errors such as OOM (same with v2).
throw t;
} else {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To align with behaviour of v2, OOM error will fail the whole JVM. We caught OOM error in previous implementation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It think it's OK to keep align with v2. But will it be better to trigger GC and then check memory usage again before making the service crash? Anyway, the primary purpose of this PR should not be blocked by this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But will it be better to trigger GC and then check memory usage again before making the service crash?

Maybe more discussion is required and track in a follow-up issue.

Comment on lines +45 to +47
if (!this.monitor.isHealthy()) {
throw new NonFallbackCalciteException("insufficient resources to run the query, quit.");
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Healthy checking is missed in access OpenSearchSystemIndex.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we discussed this previously. Can we do this in non-intrusive way? Maybe AOP/wrap on operator level?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we discussed this previously. Can we do this in non-intrusive way? Maybe AOP/wrap on operator level?

Yes, technically we could wrap it on operator level. But it will introduce

  1. A new logical node LogicalMonitor
  2. A ConverterRule to convert a logical node to physical node
  3. A new physical node such as EnumerableMonitor similar to ResourceMonitorPlan in v2.
  4. Optimization Skipping codegen and compile for Scan only plan #3853 has to find another way to workaround because the root.rel (physical plan) won't be a Scannable any more.

Not quite sure it's still worth to do the AOP way now.

Copy link
Collaborator

@dai-chen dai-chen Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, probably in future we need this for other operators that possibly generate more records, such as cross/lateral join, right?

Copy link
Collaborator

@dai-chen dai-chen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want to understand the idea more deeply: the assumption is that most objects in Young Gen—such as Calcite’s internal structures—are short-lived, meaning they die after GC and do not survive into the Old Gen, right? As a result, the memory usage calculated from this approach reflects a lower bound of the actual memory pressure?

Comment on lines +45 to +47
if (!this.monitor.isHealthy()) {
throw new NonFallbackCalciteException("insufficient resources to run the query, quit.");
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we discussed this previously. Can we do this in non-intrusive way? Maybe AOP/wrap on operator level?

@LantaoJin
Copy link
Member Author

I just want to understand the idea more deeply: the assumption is that most objects in Young Gen—such as Calcite’s internal structures—are short-lived, meaning they die after GC and do not survive into the Old Gen, right? As a result, the memory usage calculated from this approach reflects a lower bound of the actual memory pressure?

Not actually, the objects can survive into the Old Gen but can be easily GCed by concurrent GC (not YGC).

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
@LantaoJin
Copy link
Member Author

The IT could be still flaky in GitHub CI sometimes. But I cannot reproduce the resource limit issue in local with current implementation. But without this patch, resource limit issue fails easily in CalcitePPLClickBenchIT.test.

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
qianheng-aws
qianheng-aws previously approved these changes Aug 11, 2025
qianheng-aws
qianheng-aws previously approved these changes Aug 14, 2025
@LantaoJin
Copy link
Member Author

LantaoJin commented Aug 15, 2025

The IT could be still flaky in GitHub CI sometimes. But I cannot reproduce the resource limit issue in local with current implementation. But without this patch, resource limit issue fails easily in CalcitePPLClickBenchIT.test.

Not see it again in latest two runs of CI. @dai-chen @penghuo can you check this PR again?

return isCalciteEnabled(settings)
? GCedMemoryUsage.getInstance()
: RuntimeMemoryUsage.getInstance();
} catch (MemoryUsageException e) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fallback to RuntimeMemoryUsage will not trigger; A runtime exception thrown during static field initialization is wrapped in ExceptionInInitializerError.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Signed-off-by: Lantao Jin <ltjin@amazon.com>
@qianheng-aws qianheng-aws merged commit 0398f68 into opensearch-project:main Aug 22, 2025
23 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.19-dev failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/sql/backport-2.19-dev 2.19-dev
# Navigate to the new working tree
pushd ../.worktrees/sql/backport-2.19-dev
# Create a new branch
git switch --create backport/backport-3983-to-2.19-dev
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 0398f6852bd3aed3c79b88c2885934cf18e6f71b
# Push it to GitHub
git push --set-upstream origin backport/backport-3983-to-2.19-dev
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/sql/backport-2.19-dev

Then, create a pull request where the base branch is 2.19-dev and the compare/head branch is backport/backport-3983-to-2.19-dev.

LantaoJin added a commit to LantaoJin/search-plugins-sql that referenced this pull request Aug 22, 2025
…ry usage after GC of old gen in v3 (opensearch-project#3983)

* ResourceMonitor only checks the memory health by calculating the memory usage after GC of old gen

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* address comments

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* address comments

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* Add memory usage fallback

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* Fix IT

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* Generate script code lazily to avoid unnecessary usage in Calcite

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* typo

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* Address the comments of 3929 followup

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* Catch throwable instead of MemoryUsageException

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* Fix unit test when the code in script query expression generated lazily

Signed-off-by: Lantao Jin <ltjin@amazon.com>

---------

Signed-off-by: Lantao Jin <ltjin@amazon.com>
(cherry picked from commit 0398f68)
@LantaoJin LantaoJin added the backport-manually Filed a PR to backport manually. label Aug 22, 2025
penghuo pushed a commit that referenced this pull request Aug 25, 2025
…calculating the memory usage after GC of old gen in v3 (#3983) (#4105)

* ResourceMonitor only checks the memory health by calculating the memory usage after GC of old gen in v3 (#3983)

* ResourceMonitor only checks the memory health by calculating the memory usage after GC of old gen

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* address comments

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* address comments

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* Add memory usage fallback

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* Fix IT

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* Generate script code lazily to avoid unnecessary usage in Calcite

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* typo

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* Address the comments of 3929 followup

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* Catch throwable instead of MemoryUsageException

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* Fix unit test when the code in script query expression generated lazily

Signed-off-by: Lantao Jin <ltjin@amazon.com>

---------

Signed-off-by: Lantao Jin <ltjin@amazon.com>
(cherry picked from commit 0398f68)

* Fix IT: disable GCedMemoryUsage if no concurrent GC MXBean

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* Fix IT

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* Ignore q30 when GCedMemoryUsage not initialized

Signed-off-by: Lantao Jin <ltjin@amazon.com>

---------

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport 2.19-dev backport-failed backport-manually Filed a PR to backport manually. calcite calcite migration releated enhancement New feature or request performance Make it fast!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants