Track resource consumption for query and fetch phases #1575

malpani · 2021-11-18T00:13:43Z

Description

Part 1: This is the first phase towards providing visibility into the most memory/compute heavy queries.

This change measures memory and cpu-time used during query and fetch phases. I am leveraging the single threaded execution model to track resources consumed by the thread executing the query/fetch phase.

Signed-off-by: Ankit Malpani ankit.malpani@gmail.com

Issues Resolved

Search memory tracking

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

opensearch-ci-bot · 2021-11-18T00:14:03Z

Can one of the admins verify this patch?

opensearch-ci-bot · 2021-11-18T00:16:22Z

✅ Gradle Wrapper Validation success ba2cbb300f45f50ece128fdfc76f7253b82b02be

opensearch-ci-bot · 2021-11-18T00:17:12Z

❌ Gradle Check failure ba2cbb300f45f50ece128fdfc76f7253b82b02be
Log 1117

Reports 1117

opensearch-ci-bot · 2021-11-18T00:20:01Z

❌ Gradle Precommit failure ba2cbb300f45f50ece128fdfc76f7253b82b02be
Log 1577

opensearch-ci-bot · 2021-11-18T00:33:54Z

✅ Gradle Wrapper Validation success 2c03ff6469825633b79ac7b68e2f5120eb6c66f8

opensearch-ci-bot · 2021-11-18T00:36:20Z

❌ Gradle Precommit failure 2c03ff6469825633b79ac7b68e2f5120eb6c66f8
Log 1579

opensearch-ci-bot · 2021-11-18T01:11:43Z

❌ Gradle Check failure 2c03ff6469825633b79ac7b68e2f5120eb6c66f8
Log 1119

Reports 1119

opensearch-ci-bot · 2021-11-18T01:15:11Z

✅ Gradle Wrapper Validation success 5091c1f446e6f7593f95c27cdbaaabc8410640b6

opensearch-ci-bot · 2021-11-18T01:23:40Z

❌ Gradle Precommit failure 5091c1f446e6f7593f95c27cdbaaabc8410640b6
Log 1580

opensearch-ci-bot · 2021-11-18T01:25:04Z

❌ Gradle Check failure 5091c1f446e6f7593f95c27cdbaaabc8410640b6
Log 1122

Reports 1122

Part 1: This is first phase towards providing visibility into the most memory/compute heavy queries

This change measures memory and cpu-time used during query and fetch phases. This is leveraging the single threaded execution model to track resources consumed by the thread executing the query/fetch phase.

Signed-off-by: Ankit Malpani <ankit.malpani@gmail.com>

opensearch-ci-bot · 2021-11-18T01:39:55Z

✅ Gradle Wrapper Validation success e0b2678

opensearch-ci-bot · 2021-11-18T01:42:59Z

✅ Gradle Precommit success e0b2678

opensearch-ci-bot · 2021-11-18T02:20:35Z

✅ Gradle Check success e0b2678
Log 1123

Reports 1123

server/src/main/java/org/opensearch/common/metrics/ResourceTracker.java

reta · 2021-11-19T18:34:34Z

server/src/main/java/org/opensearch/common/metrics/ResourceTracker.java

+     */
+    public void reset() {
+        this.startingCPUTime = threadMXBean.getCurrentThreadCpuTime();
+        this.startingAllocatedBytes = threadMXBean.getThreadAllocatedBytes(Thread.currentThread().getId());


I think it would be safer to check whether thread memory allocation measurement is supported (otherwise an UnsupportedOperationException is thrown) and enabled [1]:

if (threadMXBean.isThreadAllocatedMemorySupported() && threadMXBean.isThreadAllocatedMemoryEnabled()) {
    this.startingAllocatedBytes = threadMXBean.getThreadAllocatedBytes(Thread.currentThread().getId());
}

[1] https://docs.oracle.com/javase/8/docs/jre/api/management/extension/com/sun/management/ThreadMXBean.html
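
A minimal sketch of the guarded read suggested above. The class name `AllocatedBytesProbe` and the `UNAVAILABLE` sentinel are hypothetical and not part of this PR; the sketch also assumes a JVM that exposes `com.sun.management.ThreadMXBean` and falls back to the sentinel when it does not.

```java
import java.lang.management.ManagementFactory;

// Hypothetical helper (not the PR's code): reads per-thread allocated bytes
// only when the JVM reports the feature as both supported and enabled.
final class AllocatedBytesProbe {
    // Assumed sentinel returned when the measurement is unavailable.
    static final long UNAVAILABLE = -1L;

    private static final com.sun.management.ThreadMXBean SUN_BEAN = lookup();

    private static com.sun.management.ThreadMXBean lookup() {
        java.lang.management.ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Not every JVM returns the com.sun.management extension of the bean.
        return (bean instanceof com.sun.management.ThreadMXBean)
            ? (com.sun.management.ThreadMXBean) bean
            : null;
    }

    static boolean isAvailable() {
        // Check "supported" first: the "enabled" query throws
        // UnsupportedOperationException when the feature is not supported.
        return SUN_BEAN != null
            && SUN_BEAN.isThreadAllocatedMemorySupported()
            && SUN_BEAN.isThreadAllocatedMemoryEnabled();
    }

    static long currentThreadAllocatedBytes() {
        if (!isAvailable()) {
            return UNAVAILABLE;
        }
        return SUN_BEAN.getThreadAllocatedBytes(Thread.currentThread().getId());
    }
}
```

A caller could then skip memory tracking whenever `isAvailable()` is false instead of failing the search request.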

reta · 2021-11-19T18:46:11Z

@malpani I believe there are some overlaps with #1555 (please correct me if I am wrong). It seems like you are focusing on search only, but by and large, the query and fetch phases are just variations of SearchShardTask, so maybe it is better to report these stats at the task level instead? What do you think?

malpani · 2021-11-19T19:28:09Z

@reta - yes they are related, @sruti1312 and I work on the same team and are collaborating to roll resource tracking out. To keep diffs small, we have split this out into multiple atomic changes instead of raising a giant merged PR that will be hard to review -

(Consumer - Task framework) Extending task framework to track resource consumption by tasks - Add resource stats to task framework #1555
(Producer - query resource tracking) This PR - measuring resources for search related tasks
Merging - merged code will finally look like https://github.com/malpani/OpenSearch/tree/stat-task-framework
(Consumer - develop sinks) Logging every task will have a perf hit, we will be providing options eg. a default top N resource consumers based sink
(Producer - measure more frequently) - Instead of this PR just measuring at the end of the task, we will track resource usage on a per phase basis within a task eg. aggregation/reduce etc.
(Consumer - system index sink) - this is optional and instead of having just log as a sink, this will write docs into a system index

Coming back to your question - can this be done purely from the task side - the answer is no. E.g. bulk tasks work very differently from search, so it will be up to the different task executors to define how to track their resource usage. Indexing usage is very easy to determine as a function of the incoming payload (as against search, where most of the problems arise and a small payload can cause extensive usage), so the focus is currently on query resource observability. Hope this helps clarify where we are headed with this.

reta · 2021-11-19T19:58:14Z

Thanks @malpani, it makes sense. Just to highlight some gaps: it may not be possible to capture memory and CPU consumption accurately since additional executors may be involved along the way. For example, we are exploring the experimental Apache Lucene support for concurrent segment search (#1500), and it uses a different executor behind the scenes to do that. The resources consumed by those pooled threads won't be captured during the query / fetch phases. Hope it sounds reasonable, thank you.

malpani · 2021-11-20T02:54:24Z

@reta - thanks for the heads up. I skimmed through the PR, and when concurrent segment search is enabled, the current approach of tracking resource consumption of the local thread will not work, as multiple threads will be involved. My initial thought for making this work in that world is a wrapped ResourceTrackingQuery whose createWeight could trigger instantiation of the resource collector objects, which would eventually need to be summed up; I implemented something similar for ultrawarm. However, that can be explored further when the experimental tag for concurrent segment search gets removed.
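
The "summed up" part of that idea can be illustrated independently of Lucene. The sketch below is only an assumption about how per-thread measurements might be combined: `SharedCpuUsage` and `tracked` are hypothetical names, the ResourceTrackingQuery/createWeight wrapper itself is not shown, and only CPU time is tracked to keep it short.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: one accumulator per search task, updated by every
// thread that executes a slice of the work (e.g. one group of segments).
final class SharedCpuUsage {
    private static final ThreadMXBean BEAN = ManagementFactory.getThreadMXBean();

    // LongAdder tolerates concurrent updates from several worker threads.
    private final LongAdder cpuNanos = new LongAdder();

    // Wraps a unit of work so that the CPU time of whichever thread runs it
    // is added to the shared total once the work finishes.
    Runnable tracked(Runnable work) {
        return () -> {
            boolean supported = BEAN.isCurrentThreadCpuTimeSupported()
                && BEAN.isThreadCpuTimeEnabled();
            long start = supported ? BEAN.getCurrentThreadCpuTime() : 0L;
            try {
                work.run();
            } finally {
                if (supported) {
                    cpuNanos.add(BEAN.getCurrentThreadCpuTime() - start);
                }
            }
        };
    }

    long totalCpuNanos() {
        return cpuNanos.sum();
    }
}
```

Each slice submitted to the concurrent search executor would be passed through `tracked(...)`, and the task would read `totalCpuNanos()` once all slices complete.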

dblock · 2021-11-22T17:46:22Z

I'm happy to merge if @reta is A-OK with the change, LMK?

reta · 2021-11-22T17:51:11Z

@dblock yes, I am A-OK, thanks for asking!

dblock · 2021-11-22T17:56:54Z

@zelinh can you please help with the whitesource failure in this PR?

sohami · 2021-11-24T23:34:49Z

server/src/main/java/org/opensearch/common/metrics/ResourceTracker.java

+*/
+@SuppressForbidden(reason = "ThreadMXBean enables tracking resource consumption by a thread. "
+    + "It is platform dependent and i am not aware of an alternate mechanism to extract this info")
+public class ResourceTracker {


I think instead of having ResourceTracker in SearchContext, we can move it as interface to Task class itself. This interface can then expose the methods like:

setup/initialize where it can initialize the current values from MXBean.

recordStats() where it can record the diff

Then once setTask is called on Context object, it can initialize the collector/tracker. In updateResourceTracking it can call the context.getResourceTracker().recordStats(). This will avoid the need for reset as each task will be updating its own internal stats. Context object will be set with new task for query and fetch phase separately.

Also how about renaming ResourceTracker as StatsCollector or MetricsCollector ?
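
The Task-level interface proposed above could be sketched roughly as follows. Every name here (`TaskMetricsTracker`, `ThreadCpuTracker`, `SketchTask`, `SketchSearchTask`) is hypothetical and stands in for the real Task classes; only CPU time is tracked to keep the example short, and the no-op default for tasks that track nothing is an assumption of this sketch.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical interface: each task owns its tracker, so nothing needs a reset.
interface TaskMetricsTracker {
    void initialize();    // capture the starting values from the MXBean
    void recordStats();   // record the difference since initialize()
    long cpuTimeNanos();

    // Default for tasks that do not track anything.
    TaskMetricsTracker NO_OP = new TaskMetricsTracker() {
        @Override public void initialize() {}
        @Override public void recordStats() {}
        @Override public long cpuTimeNanos() { return 0L; }
    };
}

// Tracks CPU time of the thread that calls initialize() and recordStats().
final class ThreadCpuTracker implements TaskMetricsTracker {
    private static final ThreadMXBean BEAN = ManagementFactory.getThreadMXBean();
    private long startCpu;
    private long cpuTime;

    @Override public void initialize() { startCpu = read(); }
    @Override public void recordStats() { cpuTime = read() - startCpu; }
    @Override public long cpuTimeNanos() { return cpuTime; }

    private static long read() {
        boolean ok = BEAN.isCurrentThreadCpuTimeSupported() && BEAN.isThreadCpuTimeEnabled();
        return ok ? BEAN.getCurrentThreadCpuTime() : 0L;
    }
}

// Stand-in for the base Task: tracking is a no-op unless a subclass opts in.
class SketchTask {
    TaskMetricsTracker metricsTracker() { return TaskMetricsTracker.NO_OP; }
}

// Stand-in for a search-related task that opts in to tracking.
class SketchSearchTask extends SketchTask {
    private final TaskMetricsTracker tracker = new ThreadCpuTracker();
    @Override TaskMetricsTracker metricsTracker() { return tracker; }
}
```

With this shape, setTask on the context would call `initialize()` and updateResourceTracking would call `recordStats()`, as described in the comment.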

moving ResourceTracker to Task

The reason I kept it in SearchContext is, ResourceTracker with its thread level stats is unique to tracking resource consumption for queries only. Currently, I am not seeing other consumers. We can move it upwards if there are other Task that would need similar logic

Also, if the concurrent segment search becomes the default in future, we might need multiple ResourceTracker per search task

I do see your point to avoid/optimize reset - and need to see the scenarios where the context could be reused. Either way will check if memory/cpu is non zero as a condition on pre-empting reset

what are your thoughts?

Also how about renaming ResourceTracker as StatsCollector or MetricsCollector ?

Stats has a special meaning, as stats is a request param that allows grouping stats on a tag

Collector has a special meaning too - Lucene collector

What are your thoughts on

MetricsTracker - tracking metrics - resource consumption, other future metrics

ResourceTracker - tracking resource consumption

moving ResourceTracker to Task

The reason for proposing to move ResourceTracker to Task is that these metrics/stats belong to the task. Putting it within the context creates the need for reset, since the context is used by multiple tasks. Considering that Task will be tracking metrics, having an interface in Task which provides the API to update/get stats related information will fit well. I see that not all tasks currently support metrics collection; however, that can be achieved by implementing update/get only in specific search related tasks. The base task can be a no-op for these calls.

For concurrent segment search, there still will be single task and context for search but segment level operation will be done in parallel by multiple threads. So each of these threads will now need to update the stats for same task.

Also how about renaming ResourceTracker as StatsCollector or MetricsCollector ?

I don't think stats/collector definition in the example you shared makes it reserved word for only those modules. It is a general term and can be used in multiple places, for example: SearchExecutionStatsCollector. But I am fine with MetricsTracker as well since that will cover other categories like latencies too and not just resources.

have pushed a new revision which incorporates this feedback:

Got rid of the reset calls as SearchContext is not reused anymore

Renamed to MetricsTracker

zelinh · 2021-11-24T23:37:25Z

Referring to #1593. We may ignore the WhiteSource check for OpenSearch repo for now. There is an issue on WhiteSource side and they are working to fix it.

Signed-off-by: Ankit Malpani <ankit.malpani@gmail.com>

opensearch-ci-bot · 2021-12-01T23:43:23Z

✅ Gradle Wrapper Validation success 03146e8

opensearch-ci-bot · 2021-12-01T23:50:24Z

✅ Gradle Precommit success 03146e8

opensearch-ci-bot · 2021-12-02T00:05:10Z

❌ Gradle Check failure 03146e8
Log 1295

Reports 1295

Bukhtawar · 2021-12-02T16:43:12Z

server/src/main/java/org/opensearch/common/metrics/MetricsTracker.java

+    static ThreadMXBean threadMXBean = (ThreadMXBean) ManagementFactory.getThreadMXBean();
+    long startingAllocatedBytes;
+    long startingCPUTime;
+    long memoryAllocated;
+    long cpuTime;
+
+    /**
+     * Takes current snapshot of resource usage by thread since the creation of this object
+     */
+    public void updateMetrics() {
+        this.memoryAllocated = threadMXBean.getThreadAllocatedBytes(Thread.currentThread().getId()) - startingAllocatedBytes;
+        this.cpuTime = threadMXBean.getCurrentThreadCpuTime() - startingCPUTime;
+    }
+


Most optimal way to achieve it is below. Same is being used in a similar PR #1643

+    static {
+        threadMXBean = ManagementFactory.getThreadMXBean();
+        Method getBytes;
+        try {
+            getBytes = threadMXBean.getClass()
+                .getMethod("getThreadAllocatedBytes", long[].class);
+            getBytes.setAccessible(true);
+            failMessages = Collections.emptyList();
+        } catch (NoSuchMethodException e) {
+            getBytes = null;
+        }
+        getThreadAllocatedBytes = getBytes;
+    }

Is this for safely handling JVMs that may not support tracking, or are there other reasons to switch to reflection? Since this is currently behind a cluster setting that is disabled by default, I was originally planning to add and test this on different platforms at a later point, but I can add it right away.

For safety purposes, I was wondering about leveraging isThreadAllocatedMemorySupported and isCurrentThreadCpuTimeSupported instead. Are there any other reasons why the reflection route could be better?

We would need to confirm if those checks are good enough. See https://bugs.openjdk.java.net/browse/JDK-8152859
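
For comparison, here is a sketch of a reflection-based lookup that degrades gracefully. It is a variant of the snippet above, not a copy: it resolves the single-thread `getThreadAllocatedBytes(long)` overload on the public `com.sun.management.ThreadMXBean` interface rather than on the implementation class, so no `setAccessible` call is needed (on recent JDKs, making a method of an internal implementation class accessible may be rejected by module encapsulation). The class name `ReflectiveAllocationReader` and the `-1` fallback are assumptions.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.lang.reflect.Method;

// Hypothetical helper: avoids a compile-time dependency on com.sun.management
// and returns -1 whenever the measurement cannot be obtained.
final class ReflectiveAllocationReader {
    private static final ThreadMXBean BEAN = ManagementFactory.getThreadMXBean();
    private static final Method GET_ALLOCATED_BYTES = lookup();

    private static Method lookup() {
        try {
            Class<?> sunBean = Class.forName("com.sun.management.ThreadMXBean");
            if (!sunBean.isInstance(BEAN)) {
                return null; // this JVM's bean does not implement the extension
            }
            return sunBean.getMethod("getThreadAllocatedBytes", long.class);
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return null; // extension interface or method is missing
        }
    }

    static long allocatedBytes(long threadId) {
        if (GET_ALLOCATED_BYTES == null) {
            return -1L;
        }
        try {
            return (Long) GET_ALLOCATED_BYTES.invoke(BEAN, threadId);
        } catch (ReflectiveOperationException | RuntimeException e) {
            // Covers an UnsupportedOperationException wrapped by invoke().
            return -1L;
        }
    }
}
```

Whether this is preferable to the `is...Supported()` checks depends on the concern raised in the linked JDK issue; the sketch only shows that both routes can fail soft.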

dblock · 2021-12-03T20:49:16Z

@malpani should I merge it as is or are you making changes suggested by @Bukhtawar?

dblock · 2021-12-03T20:49:23Z

start gradle check

andrross · 2021-12-03T21:17:51Z

server/src/main/java/org/opensearch/common/metrics/MetricsTracker.java

+    long startingAllocatedBytes;
+    long startingCPUTime;


It looks like these values are never assigned. Did you intend to initialize these to the initial measurements upon instance creation?

thanks for catching this. The constructor got removed after the recent refactor/reset removal! fixing it.
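
A sketch of what restoring the constructor might look like, so that the starting values are captured when the tracker is created. The class name `MetricsTrackerSketch`, the getters, and the fallback to 0 when a measurement is unavailable are assumptions and not the PR's final code.

```java
import java.lang.management.ManagementFactory;

// Hypothetical reconstruction: the starting values are assigned exactly once,
// in the constructor, on the thread that will execute the phase.
final class MetricsTrackerSketch {
    private static final java.lang.management.ThreadMXBean BEAN =
        ManagementFactory.getThreadMXBean();

    private final long startingAllocatedBytes;
    private final long startingCPUTime;
    private long memoryAllocated;
    private long cpuTime;

    MetricsTrackerSketch() {
        this.startingAllocatedBytes = allocatedBytes();
        this.startingCPUTime = cpuTimeNanos();
    }

    // Snapshot of resource usage by this thread since the object was created.
    void updateMetrics() {
        this.memoryAllocated = allocatedBytes() - startingAllocatedBytes;
        this.cpuTime = cpuTimeNanos() - startingCPUTime;
    }

    long getMemoryAllocated() { return memoryAllocated; }
    long getCpuTime() { return cpuTime; }

    private static long allocatedBytes() {
        if (BEAN instanceof com.sun.management.ThreadMXBean) {
            com.sun.management.ThreadMXBean sun = (com.sun.management.ThreadMXBean) BEAN;
            if (sun.isThreadAllocatedMemorySupported() && sun.isThreadAllocatedMemoryEnabled()) {
                return sun.getThreadAllocatedBytes(Thread.currentThread().getId());
            }
        }
        return 0L; // assumed fallback when allocation tracking is unavailable
    }

    private static long cpuTimeNanos() {
        boolean ok = BEAN.isCurrentThreadCpuTimeSupported() && BEAN.isThreadCpuTimeEnabled();
        return ok ? BEAN.getCurrentThreadCpuTime() : 0L;
    }
}
```

Making the starting fields `final` lets the compiler flag the "never assigned" case that was caught in this review.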

opensearch-ci-bot · 2021-12-03T21:40:56Z

❌ Gradle Check failure 03146e8
Log 1322

Reports 1322

malpani · 2021-12-06T05:17:08Z

@dblock I will have one more revision to incorporate feedback from @andrross and @Bukhtawar. Also, we now have 2 similar PRs for resource tracking - one is for #1042 and the current one is for query observability. There is similar logic in both in terms of relying on resource usage deltas via ThreadMXBean, and while both can get merged independently in the short term, I do think we can consolidate things.

malpani · 2021-12-10T19:44:08Z

closing this out in favor of a merged approach with #1042

malpani requested a review from a team as a code owner November 18, 2021 00:13

malpani changed the title ~~Track memory and compute resource consumption for query and fetch phases~~ Track resource consumption for query and fetch phases Nov 18, 2021

malpani force-pushed the query-resource-tracking branch from 5091c1f to e0b2678 Compare November 18, 2021 01:38

andrross reviewed Nov 18, 2021

reta reviewed Nov 19, 2021

sohami reviewed Nov 24, 2021

Rename ResourceTracker and get rid of reset methods

03146e8

Signed-off-by: Ankit Malpani <ankit.malpani@gmail.com>

Bukhtawar reviewed Dec 2, 2021

andrross reviewed Dec 3, 2021

malpani closed this Dec 10, 2021
