Track resource consumption for query and fetch phases #1575

Closed

Conversation

malpani
Contributor

@malpani malpani commented Nov 18, 2021

Description

Part 1: This is the first phase toward providing visibility into the most memory- and compute-heavy queries.

This change measures the memory and CPU time used during the query and fetch phases. It leverages the single-threaded execution model to track the resources consumed by the thread executing the query or fetch phase.

Signed-off-by: Ankit Malpani <ankit.malpani@gmail.com>

Issues Resolved

Search memory tracking

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@malpani malpani requested a review from a team as a code owner November 18, 2021 00:13
@opensearch-ci-bot
Collaborator

Can one of the admins verify this patch?

@opensearch-ci-bot
Collaborator

✅   Gradle Wrapper Validation success ba2cbb300f45f50ece128fdfc76f7253b82b02be

@opensearch-ci-bot
Collaborator

❌   Gradle Check failure ba2cbb300f45f50ece128fdfc76f7253b82b02be
Log 1117

Reports 1117

@malpani malpani changed the title Track memory and compute resource consumption for query and fetch phases Track resource consumption for query and fetch phases Nov 18, 2021
@opensearch-ci-bot
Collaborator

❌   Gradle Precommit failure ba2cbb300f45f50ece128fdfc76f7253b82b02be
Log 1577

@opensearch-ci-bot
Collaborator

✅   Gradle Wrapper Validation success 2c03ff6469825633b79ac7b68e2f5120eb6c66f8

@opensearch-ci-bot
Collaborator

❌   Gradle Precommit failure 2c03ff6469825633b79ac7b68e2f5120eb6c66f8
Log 1579

@opensearch-ci-bot
Collaborator

❌   Gradle Check failure 2c03ff6469825633b79ac7b68e2f5120eb6c66f8
Log 1119

Reports 1119

@opensearch-ci-bot
Collaborator

✅   Gradle Wrapper Validation success 5091c1f446e6f7593f95c27cdbaaabc8410640b6

@opensearch-ci-bot
Collaborator

❌   Gradle Precommit failure 5091c1f446e6f7593f95c27cdbaaabc8410640b6
Log 1580

@opensearch-ci-bot
Collaborator

❌   Gradle Check failure 5091c1f446e6f7593f95c27cdbaaabc8410640b6
Log 1122

Reports 1122

Part 1: This is the first phase toward providing visibility into the most memory- and compute-heavy queries.

This change measures the memory and CPU time used during the query and fetch phases.
It leverages the single-threaded execution model to track the resources consumed by the thread executing the query or fetch phase.

Signed-off-by: Ankit Malpani <ankit.malpani@gmail.com>
@malpani malpani force-pushed the query-resource-tracking branch from 5091c1f to e0b2678 Compare November 18, 2021 01:38
@opensearch-ci-bot
Collaborator

✅   Gradle Wrapper Validation success e0b2678

@opensearch-ci-bot
Collaborator

✅   Gradle Precommit success e0b2678

@opensearch-ci-bot
Collaborator

✅   Gradle Check success e0b2678
Log 1123

Reports 1123

*/
public void reset() {
    this.startingCPUTime = threadMXBean.getCurrentThreadCpuTime();
    this.startingAllocatedBytes = threadMXBean.getThreadAllocatedBytes(Thread.currentThread().getId());
Collaborator

I think it would be safer to check whether thread memory allocation measurement is supported (otherwise the call throws UnsupportedOperationException) and enabled [1]:

if (threadMXBean.isThreadAllocatedMemorySupported() && threadMXBean.isThreadAllocatedMemoryEnabled()) {
    this.startingAllocatedBytes = threadMXBean.getThreadAllocatedBytes(Thread.currentThread().getId());
}

[1] https://docs.oracle.com/javase/8/docs/jre/api/management/extension/com/sun/management/ThreadMXBean.html
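
As an illustration of the guard suggested above, the following minimal sketch checks support and enablement before reading either metric, and captures the baseline in the constructor. The class name `GuardedThreadUsage` and the `-1` sentinel are assumptions made for illustration and are not part of this PR; the sketch also assumes a JDK that ships the `com.sun.management` extension at compile time.

```java
import java.lang.management.ManagementFactory;

/** Hypothetical helper (not from this PR): per-thread usage deltas with capability checks. */
final class GuardedThreadUsage {
    /** Sentinel returned when the JVM cannot report a metric (an assumption of this sketch). */
    static final long UNSUPPORTED = -1L;

    private static final java.lang.management.ThreadMXBean BASE = ManagementFactory.getThreadMXBean();
    // Null on JVMs whose thread bean does not implement the com.sun.management extension.
    private static final com.sun.management.ThreadMXBean EXT =
        BASE instanceof com.sun.management.ThreadMXBean ? (com.sun.management.ThreadMXBean) BASE : null;

    private final long startCpuNanos;
    private final long startAllocatedBytes;

    /** Captures the baseline on the calling thread; deltas must be read on the same thread. */
    GuardedThreadUsage() {
        this.startCpuNanos = currentCpuNanos();
        this.startAllocatedBytes = currentAllocatedBytes();
    }

    static long currentCpuNanos() {
        if (BASE.isCurrentThreadCpuTimeSupported() && BASE.isThreadCpuTimeEnabled()) {
            return BASE.getCurrentThreadCpuTime();
        }
        return UNSUPPORTED;
    }

    static long currentAllocatedBytes() {
        if (EXT != null && EXT.isThreadAllocatedMemorySupported() && EXT.isThreadAllocatedMemoryEnabled()) {
            return EXT.getThreadAllocatedBytes(Thread.currentThread().getId());
        }
        return UNSUPPORTED;
    }

    long cpuNanosSinceStart() {
        return delta(startCpuNanos, currentCpuNanos());
    }

    long allocatedBytesSinceStart() {
        return delta(startAllocatedBytes, currentAllocatedBytes());
    }

    private static long delta(long start, long now) {
        // Either reading may be negative if the metric is unsupported or disabled.
        return (start < 0 || now < 0) ? UNSUPPORTED : now - start;
    }
}
```

Returning a sentinel instead of throwing keeps the caller's code path identical on JVMs that lack the capability; whether that is preferable to failing loudly is a design choice this sketch does not settle.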

@reta
Collaborator

reta commented Nov 19, 2021

@malpani I believe there is some overlap with #1555 (please correct me if I am wrong). It seems like you are focusing on search only, but by and large the query and fetch phases are just variations of SearchShardTask, so maybe it would be better to report these stats at the task level instead? What do you think?

@malpani
Contributor Author

malpani commented Nov 19, 2021

@reta - yes, they are related. @sruti1312 and I work on the same team and are collaborating to roll resource tracking out. To keep diffs small, we have split this into multiple atomic changes instead of raising one giant merged PR that would be hard to review:

  1. (Consumer - task framework) Extending the task framework to track resource consumption by tasks - Add resource stats to task framework #1555
  2. (Producer - query resource tracking) This PR - measuring resources for search-related tasks
  3. Merging - the merged code will finally look like https://github.com/malpani/OpenSearch/tree/stat-task-framework
  4. (Consumer - develop sinks) Logging every task will have a perf hit, so we will provide options, e.g. a default sink based on the top N resource consumers
  5. (Producer - measure more frequently) Instead of just measuring at the end of the task as this PR does, we will track resource usage on a per-phase basis within a task, e.g. aggregation/reduce etc.
  6. (Consumer - system index sink) This is optional: instead of having just a log as a sink, this will write docs into a system index

Coming back to your question of whether this can be done purely from the task side: the answer is no. For example, bulk tasks work very differently from search, so it will be up to the different task executors to define how they track their resource usage. Indexing usage is easy to estimate as a function of the incoming payload, whereas search is where most of the problems arise, because a small payload can cause extensive usage. The focus is therefore currently on query resource observability. Hope this helps clarify where we are headed with this.
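
To make item 4 of the list above concrete, here is a minimal sketch of a sink that retains only the top N consumers instead of logging every task. It is an illustration under stated assumptions, not code from this PR: the names `TopNUsageSink` and `Usage` are hypothetical, and ranking by allocated bytes is an arbitrary choice (CPU time would work the same way).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sink (not from this PR) that keeps only the N largest consumers seen so far. */
final class TopNUsageSink {
    /** Hypothetical record of one finished task's resource usage. */
    static final class Usage {
        final String taskId;
        final long cpuNanos;
        final long allocatedBytes;

        Usage(String taskId, long cpuNanos, long allocatedBytes) {
            this.taskId = taskId;
            this.cpuNanos = cpuNanos;
            this.allocatedBytes = allocatedBytes;
        }
    }

    private static final Comparator<Usage> BY_BYTES = Comparator.comparingLong((Usage u) -> u.allocatedBytes);

    private final int capacity;
    // Min-heap: the smallest retained entry sits at the head and is evicted first.
    private final PriorityQueue<Usage> heap = new PriorityQueue<>(BY_BYTES);

    TopNUsageSink(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Offers one usage record; O(log N) per call, so the sink stays cheap even under heavy traffic. */
    synchronized void offer(Usage usage) {
        if (heap.size() < capacity) {
            heap.add(usage);
        } else if (usage.allocatedBytes > heap.peek().allocatedBytes) {
            heap.poll();
            heap.add(usage);
        }
    }

    /** Returns the retained records, largest consumer first. */
    synchronized List<Usage> snapshotDescending() {
        List<Usage> out = new ArrayList<>(heap);
        out.sort(BY_BYTES.reversed());
        return out;
    }
}
```

A bounded min-heap keeps memory proportional to N rather than to the number of tasks, which is the property that makes this kind of sink cheaper than logging every task.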

@reta
Collaborator

reta commented Nov 19, 2021

Thanks @malpani, that makes sense. Just to highlight one gap: it may not be possible to capture memory and CPU consumption accurately, since additional executors may be involved along the way. For example, we are exploring the experimental Apache Lucene support for concurrent segment search (#1500), which uses a different executor behind the scenes. The resources consumed by those pooled threads won't be captured during the query and fetch phases. Hope that sounds reasonable, thank you.

@malpani
Contributor Author

malpani commented Nov 20, 2021

@reta - thanks for the heads up. I skimmed through the PR, and when concurrent segment search is enabled, the current approach of tracking the resource consumption of the local thread will not work, because multiple threads will be involved. My initial thought for making this work in that world is a wrapping ResourceTrackingQuery whose createWeight could trigger instantiation of the resource collector objects, whose results would eventually need to be summed up. I implemented something similar for ultrawarm. However, that can be explored further once the experimental tag for concurrent segment search is removed.
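
The "summed up" idea can be sketched independently of Lucene: each pooled thread measures its own CPU-time delta around its slice of work and adds it to a shared accumulator. This is only an illustration under stated assumptions; `SummedCpuUsage`, `track`, and `runSlices` are hypothetical names, a plain `ExecutorService` stands in for the search executor, and the wrapped `ResourceTrackingQuery` mentioned above is not modeled here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical accumulator (not from this PR) that sums CPU time across worker threads. */
final class SummedCpuUsage {
    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    private final LongAdder cpuNanos = new LongAdder();
    private final LongAdder slices = new LongAdder();

    /** Wraps a unit of work so that the executing thread reports its own CPU-time delta. */
    Runnable track(Runnable work) {
        return () -> {
            // getCurrentThreadCpuTime() returns -1 when measurement is disabled.
            long start = THREADS.isCurrentThreadCpuTimeSupported() ? THREADS.getCurrentThreadCpuTime() : -1L;
            try {
                work.run();
            } finally {
                if (start >= 0) {
                    long end = THREADS.getCurrentThreadCpuTime();
                    if (end >= start) {
                        cpuNanos.add(end - start);
                    }
                }
                slices.increment();
            }
        };
    }

    long totalCpuNanos() {
        return cpuNanos.sum();
    }

    long trackedSlices() {
        return slices.sum();
    }

    /** Runs sliceCount units of synthetic work on a pool and returns the summed usage. */
    static SummedCpuUsage runSlices(int sliceCount, int threads) {
        SummedCpuUsage usage = new SummedCpuUsage();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < sliceCount; i++) {
                futures.add(pool.submit(usage.track(SummedCpuUsage::busyWork)));
            }
            for (Future<?> future : futures) {
                future.get(); // wait so that every delta has been added before reading the sum
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return usage;
    }

    private static void busyWork() {
        long acc = 0;
        for (int i = 0; i < 200_000; i++) {
            acc += i * 31L;
        }
        if (acc == 42) {
            System.out.println(acc); // keeps the loop from being optimized away
        }
    }
}
```

Measuring inside the wrapper, on the thread that actually runs the work, is what lets the total include pooled threads that a single tracker on the calling thread would miss.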

@dblock
Member

dblock commented Nov 22, 2021

I'm happy to merge if @reta is A-OK with the change, LMK?

@reta
Collaborator

reta commented Nov 22, 2021

@dblock yes, I am A-OK, thanks for asking!

@dblock
Member

dblock commented Nov 22, 2021

@zelinh can you please help with the whitesource failure in this PR?

*/
@SuppressForbidden(reason = "ThreadMXBean enables tracking resource consumption by a thread. "
+ "It is platform dependent and i am not aware of an alternate mechanism to extract this info")
public class ResourceTracker {
Collaborator

I think instead of having ResourceTracker in SearchContext, we can move it as an interface to the Task class itself. This interface can then expose methods like:

  1. setup/initialize where it can initialize the current values from MXBean.
  2. recordStats() where it can record the diff

Then, once setTask is called on the Context object, it can initialize the collector/tracker. updateResourceTracking can then call context.getResourceTracker().recordStats(). This avoids the need for reset, since each task updates its own internal stats. The Context object will be set with a new task for the query and fetch phases separately.

Also how about renaming ResourceTracker as StatsCollector or MetricsCollector ?

Contributor Author

moving ResourceTracker to Task

  • The reason I kept it in SearchContext is that ResourceTracker, with its thread-level stats, is unique to tracking resource consumption for queries. Currently, I don't see other consumers. We can move it up if other Tasks need similar logic
  • Also, if concurrent segment search becomes the default in the future, we might need multiple ResourceTrackers per search task
  • I do see your point about avoiding/optimizing reset, and I need to look at the scenarios where the context could be reused. Either way, I will check whether memory/CPU is nonzero as a condition for pre-empting reset
  • What are your thoughts?

Also how about renaming ResourceTracker as StatsCollector or MetricsCollector ?

  • Stats has a special meaning: stats is a request param that allows grouping stats on a tag
  • Collector has a special meaning too - Lucene collector
  • What are your thoughts on
    • MetricsTracker - tracking metrics - resource consumption, other future metrics
    • ResourceTracker - tracking resource consumption

Collaborator

moving ResourceTracker to Task

  • The reason for proposing to move ResourceTracker to Task is that these metrics/stats belong to the task. Putting it within the context creates the need for reset, since the context is used by multiple tasks. Given that Task will be tracking metrics, an interface in Task that provides the API to update/get stats-related information fits well. I see that not all tasks currently support metrics collection; however, that can be handled by implementing update/get only in the specific search-related tasks. The base task can be a no-op for these calls.
  • For concurrent segment search, there will still be a single task and context for the search, but segment-level operations will be done in parallel by multiple threads. So each of these threads will need to update the stats for the same task.

Also how about renaming ResourceTracker as StatsCollector or MetricsCollector ?

I don't think the stats/collector definitions in the examples you shared make them reserved words for only those modules. They are general terms that can be used in multiple places, for example SearchExecutionStatsCollector. But I am fine with MetricsTracker as well, since it will cover other categories such as latencies too, not just resources.
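
The interface-on-Task proposal discussed in this thread, with a no-op default on the base task, could look roughly like the sketch below. All type names (`TaskMetricsTracker`, `ThreadCpuMetricsTracker`, `SketchTask`, `SketchSearchTask`) are hypothetical stand-ins and do not correspond to actual OpenSearch classes; only the method roles (`initialize`, `recordStats`) follow the review comment. For brevity the sketch tracks CPU time only and assumes both calls happen on the same thread.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Hypothetical interface; the method roles follow the review comment, not actual OpenSearch code. */
interface TaskMetricsTracker {
    /** Shared no-op used by tasks that do not collect metrics. */
    TaskMetricsTracker NOOP = new TaskMetricsTracker() {
        @Override public void initialize() {}
        @Override public void recordStats() {}
        @Override public long cpuNanos() { return 0L; }
    };

    /** Captures the baseline values from the MXBean. */
    void initialize();

    /** Records the difference between the current values and the baseline. */
    void recordStats();

    long cpuNanos();
}

/** Tracks CPU time of the thread that calls initialize() and recordStats(). */
final class ThreadCpuMetricsTracker implements TaskMetricsTracker {
    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    private long startNanos = -1L;
    private long cpuNanos;

    @Override
    public void initialize() {
        startNanos = THREADS.isCurrentThreadCpuTimeSupported() ? THREADS.getCurrentThreadCpuTime() : -1L;
    }

    @Override
    public void recordStats() {
        if (startNanos < 0) {
            return; // never initialized, or measurement unsupported/disabled
        }
        long now = THREADS.getCurrentThreadCpuTime();
        if (now >= startNanos) {
            cpuNanos = now - startNanos;
        }
    }

    @Override
    public long cpuNanos() {
        return cpuNanos;
    }
}

/** Stand-in for a base task: metrics calls are no-ops by default. */
class SketchTask {
    TaskMetricsTracker metricsTracker() {
        return TaskMetricsTracker.NOOP;
    }
}

/** Stand-in for a search-related task that opts in to tracking. */
final class SketchSearchTask extends SketchTask {
    private final TaskMetricsTracker tracker = new ThreadCpuMetricsTracker();

    @Override
    TaskMetricsTracker metricsTracker() {
        return tracker;
    }
}
```

Because each task owns its tracker, no reset is needed when a context is handed a new task, which is the point made in the comment above. Handling updates from several threads for the same task would need a thread-safe accumulator, which this sketch leaves out.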

Contributor Author

I have pushed a new revision that incorporates this feedback:

  1. Got rid of the reset calls as SearchContext is not reused anymore
  2. Renamed to MetricsTracker

@zelinh
Member

zelinh commented Nov 24, 2021

Referring to #1593. We may ignore the WhiteSource check for OpenSearch repo for now. There is an issue on WhiteSource side and they are working to fix it.

Signed-off-by: Ankit Malpani <ankit.malpani@gmail.com>
@opensearch-ci-bot
Collaborator

✅   Gradle Wrapper Validation success 03146e8

@opensearch-ci-bot
Collaborator

✅   Gradle Precommit success 03146e8

@opensearch-ci-bot
Collaborator

❌   Gradle Check failure 03146e8
Log 1295

Reports 1295

Comment on lines +26 to +39
static ThreadMXBean threadMXBean = (ThreadMXBean) ManagementFactory.getThreadMXBean();
long startingAllocatedBytes;
long startingCPUTime;
long memoryAllocated;
long cpuTime;

/**
 * Takes current snapshot of resource usage by thread since the creation of this object
 */
public void updateMetrics() {
    this.memoryAllocated = threadMXBean.getThreadAllocatedBytes(Thread.currentThread().getId()) - startingAllocatedBytes;
    this.cpuTime = threadMXBean.getCurrentThreadCpuTime() - startingCPUTime;
}

Collaborator

The most optimal way to achieve this is shown below. The same approach is used in a similar PR, #1643.

+    static {
+        threadMXBean = ManagementFactory.getThreadMXBean();
+        Method getBytes;
+        try {
+            getBytes = threadMXBean.getClass()
+                    .getMethod("getThreadAllocatedBytes", long[].class);
+            getBytes.setAccessible(true);
+            failMessages = Collections.emptyList();
+        } catch (NoSuchMethodException e) {
+            getBytes = null;
+        }
+        getThreadAllocatedBytes = getBytes;
+    }

Contributor Author

Is this for safely handling JVMs that may not support tracking, or are there other reasons to switch to reflection? As this is currently behind a cluster setting that is disabled by default, I was originally planning to add this and test on different platforms at a later point, but I can add it right away.

For safety purposes, I was wondering about leveraging isThreadAllocatedMemorySupported and isCurrentThreadCpuTimeSupported instead. Are there any other reasons why the reflection route could be better?

Collaborator

We would need to confirm if those checks are good enough. See https://bugs.openjdk.java.net/browse/JDK-8152859
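
For comparison with the capability checks, the reflection route discussed in this thread can be sketched as follows. This is a hedged variant rather than the snippet posted above: it resolves the single-thread `getThreadAllocatedBytes(long)` overload on the public `com.sun.management.ThreadMXBean` interface instead of on the implementation class, on the assumption that the implementation class may not be accessible under the Java module system. The class name `ReflectiveAllocatedBytes` and the `-1` fallback are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.lang.reflect.Method;

/** Hypothetical helper (not from this PR): reflective access to per-thread allocation counters. */
final class ReflectiveAllocatedBytes {
    private static final ThreadMXBean BEAN = ManagementFactory.getThreadMXBean();
    // Null when the extension interface or the method is missing on this JVM.
    private static final Method GET_BYTES = lookup();

    private static Method lookup() {
        try {
            Class<?> extension = Class.forName("com.sun.management.ThreadMXBean");
            if (!extension.isInstance(BEAN)) {
                return null;
            }
            return extension.getMethod("getThreadAllocatedBytes", long.class);
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return null;
        }
    }

    static boolean available() {
        return GET_BYTES != null;
    }

    /** Returns the bytes allocated by the given thread, or -1 if the value cannot be read. */
    static long allocatedBytes(long threadId) {
        if (GET_BYTES == null) {
            return -1L;
        }
        try {
            return (Long) GET_BYTES.invoke(BEAN, threadId);
        } catch (ReflectiveOperationException | RuntimeException e) {
            // Covers UnsupportedOperationException wrapped in InvocationTargetException.
            return -1L;
        }
    }
}
```

The main thing reflection buys over the `isThreadAllocatedMemorySupported` check is that the code still compiles and loads on JVMs that do not ship `com.sun.management` at all; whether that matters depends on which JVMs need to be supported.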

@dblock
Member

dblock commented Dec 3, 2021

@malpani should I merge it as is or are you making changes suggested by @Bukhtawar?

@dblock
Member

dblock commented Dec 3, 2021

start gradle check

Comment on lines +27 to +28
long startingAllocatedBytes;
long startingCPUTime;
Member

It looks like these values are never assigned. Did you intend to initialize these to the initial measurements upon instance creation?

Contributor Author

Thanks for catching this. The constructor was removed during the recent refactor/reset removal! Fixing it.

@opensearch-ci-bot
Collaborator

❌   Gradle Check failure 03146e8
Log 1322

Reports 1322

@malpani
Contributor Author

malpani commented Dec 6, 2021

@dblock I will have one more revision to incorporate feedback from @andrross and @Bukhtawar. Also, we now have two similar PRs for resource tracking: one is for #1042 and the current one is for query observability. Both contain similar logic in that they rely on resource usage deltas via ThreadMXBean, and while both can be merged independently in the short term, I do think we can consolidate things.

@malpani
Copy link
Contributor Author

malpani commented Dec 10, 2021

closing this out in favor of a merged approach with #1042

@malpani malpani closed this Dec 10, 2021
8 participants