-
Notifications
You must be signed in to change notification settings - Fork 181
ResourceMonitor only checks the memory health by calculating the memory usage after GC of old gen in v3 #3983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ResourceMonitor only checks the memory health by calculating the memory usage after GC of old gen in v3 #3983
Conversation
…ry usage after GC of old gen Signed-off-by: Lantao Jin <ltjin@amazon.com>
| if (t instanceof Exception) { | ||
| listener.onFailure((Exception) t); | ||
| } else if (t instanceof VirtualMachineError) { | ||
| // throw and fast fail the VM errors such as OOM (same with v2). | ||
| throw t; | ||
| } else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To align with behaviour of v2, OOM error will fail the whole JVM. We caught OOM error in previous implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It think it's OK to keep align with v2. But will it be better to trigger GC and then check memory usage again before making the service crash? Anyway, the primary purpose of this PR should not be blocked by this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But will it be better to trigger GC and then check memory usage again before making the service crash?
Maybe more discussion is required and track in a follow-up issue.
| if (!this.monitor.isHealthy()) { | ||
| throw new NonFallbackCalciteException("insufficient resources to run the query, quit."); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Healthy checking is missed in access OpenSearchSystemIndex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we discussed this previously. Can we do this in non-intrusive way? Maybe AOP/wrap on operator level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we discussed this previously. Can we do this in non-intrusive way? Maybe AOP/wrap on operator level?
Yes, technically we could wrap it on operator level. But it will introduce
- A new logical node
LogicalMonitor - A
ConverterRuleto convert a logical node to physical node - A new physical node such as
EnumerableMonitorsimilar toResourceMonitorPlanin v2. - Optimization Skipping codegen and compile for Scan only plan #3853 has to find another way to workaround because the
root.rel(physical plan) won't be aScannableany more.
Not quite sure it's still worth to do the AOP way now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, probably in future we need this for other operators that possibly generate more records, such as cross/lateral join, right?
opensearch/src/main/java/org/opensearch/sql/opensearch/monitor/GCedMemoryUsage.java
Outdated
Show resolved
Hide resolved
opensearch/src/main/java/org/opensearch/sql/opensearch/monitor/OpenSearchMemoryHealthy.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just want to understand the idea more deeply: the assumption is that most objects in Young Gen—such as Calcite’s internal structures—are short-lived, meaning they die after GC and do not survive into the Old Gen, right? As a result, the memory usage calculated from this approach reflects a lower bound of the actual memory pressure?
opensearch/src/main/java/org/opensearch/sql/opensearch/monitor/MemoryUsage.java
Show resolved
Hide resolved
opensearch/src/main/java/org/opensearch/sql/opensearch/monitor/GCedMemoryUsage.java
Outdated
Show resolved
Hide resolved
| if (!this.monitor.isHealthy()) { | ||
| throw new NonFallbackCalciteException("insufficient resources to run the query, quit."); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we discussed this previously. Can we do this in non-intrusive way? Maybe AOP/wrap on operator level?
Not actually, the objects can survive into the Old Gen but can be easily GCed by concurrent GC (not YGC). |
Signed-off-by: Lantao Jin <ltjin@amazon.com>
|
The IT could be still flaky in GitHub CI sometimes. But I cannot reproduce the resource limit issue in local with current implementation. But without this patch, resource limit issue fails easily in CalcitePPLClickBenchIT.test. |
Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
Not see it again in latest two runs of CI. @dai-chen @penghuo can you check this PR again? |
| return isCalciteEnabled(settings) | ||
| ? GCedMemoryUsage.getInstance() | ||
| : RuntimeMemoryUsage.getInstance(); | ||
| } catch (MemoryUsageException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fallback to RuntimeMemoryUsage will not trigger; A runtime exception thrown during static field initialization is wrapped in ExceptionInInitializerError.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
|
The backport to To backport manually, run these commands in your terminal: # Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/sql/backport-2.19-dev 2.19-dev
# Navigate to the new working tree
pushd ../.worktrees/sql/backport-2.19-dev
# Create a new branch
git switch --create backport/backport-3983-to-2.19-dev
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 0398f6852bd3aed3c79b88c2885934cf18e6f71b
# Push it to GitHub
git push --set-upstream origin backport/backport-3983-to-2.19-dev
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/sql/backport-2.19-devThen, create a pull request where the |
…ry usage after GC of old gen in v3 (opensearch-project#3983) * ResourceMonitor only checks the memory health by calculating the memory usage after GC of old gen Signed-off-by: Lantao Jin <ltjin@amazon.com> * address comments Signed-off-by: Lantao Jin <ltjin@amazon.com> * address comments Signed-off-by: Lantao Jin <ltjin@amazon.com> * Add memory usage fallback Signed-off-by: Lantao Jin <ltjin@amazon.com> * Fix IT Signed-off-by: Lantao Jin <ltjin@amazon.com> * Generate script code lazily to avoid unnecessary usage in Calcite Signed-off-by: Lantao Jin <ltjin@amazon.com> * typo Signed-off-by: Lantao Jin <ltjin@amazon.com> * Address the comments of 3929 followup Signed-off-by: Lantao Jin <ltjin@amazon.com> * Catch throwable instead of MemoryUsageException Signed-off-by: Lantao Jin <ltjin@amazon.com> * Fix unit test when the code in script query expression generated lazily Signed-off-by: Lantao Jin <ltjin@amazon.com> --------- Signed-off-by: Lantao Jin <ltjin@amazon.com> (cherry picked from commit 0398f68)
…calculating the memory usage after GC of old gen in v3 (#3983) (#4105) * ResourceMonitor only checks the memory health by calculating the memory usage after GC of old gen in v3 (#3983) * ResourceMonitor only checks the memory health by calculating the memory usage after GC of old gen Signed-off-by: Lantao Jin <ltjin@amazon.com> * address comments Signed-off-by: Lantao Jin <ltjin@amazon.com> * address comments Signed-off-by: Lantao Jin <ltjin@amazon.com> * Add memory usage fallback Signed-off-by: Lantao Jin <ltjin@amazon.com> * Fix IT Signed-off-by: Lantao Jin <ltjin@amazon.com> * Generate script code lazily to avoid unnecessary usage in Calcite Signed-off-by: Lantao Jin <ltjin@amazon.com> * typo Signed-off-by: Lantao Jin <ltjin@amazon.com> * Address the comments of 3929 followup Signed-off-by: Lantao Jin <ltjin@amazon.com> * Catch throwable instead of MemoryUsageException Signed-off-by: Lantao Jin <ltjin@amazon.com> * Fix unit test when the code in script query expression generated lazily Signed-off-by: Lantao Jin <ltjin@amazon.com> --------- Signed-off-by: Lantao Jin <ltjin@amazon.com> (cherry picked from commit 0398f68) * Fix IT: disable GCedMemoryUsage if no concurrent GC MXBean Signed-off-by: Lantao Jin <ltjin@amazon.com> * Fix IT Signed-off-by: Lantao Jin <ltjin@amazon.com> * Ignore q30 when GCedMemoryUsage not initialized Signed-off-by: Lantao Jin <ltjin@amazon.com> --------- Signed-off-by: Lantao Jin <ltjin@amazon.com>
Description
In v2,
ResourceMonitorgets the memory usage by using total memory - free memory of runtime.But in v3, Calcite will produce lots of recyclable memory garbage during its planning and implementation process. For Clickbench q30, it produces around 70 mb in optimizing + 30 mb in implementation on local test, which can all be GC, ref #3971 (comment)
This PR changes the behaviour of
ResourceMonitorfor v3:ResourceMonitor only checks the memory health by calculating the memory usage after GC of old gen.
Related Issues
Also can improve/resolve the issue of #3750
Check List
--signoff.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.