Improve profile API #298

kaituo · 2020-10-29T19:04:01Z

Issue #, if available:

Description of changes:
This PR makes several improvements to the profile API:
First, the PR fixes a hang. Previously, when users ran _profile/state or _profile on a multi-entity detector, the request hung. The cause was an incorrect maxResponseCount passed to MultiResponsesDelegateActionListener.
Second, the PR fixes the wrong state reported for multi-entity detectors. Previously, we could show the init state after an anomaly had already shown up. This can happen because we read the most active entity's init progress from the cache as the detector's init_progress, but the entity that already produced an anomaly may have been evicted from the cache. This PR fixes the issue by checking the result index for a non-zero RCF score before reporting the init state for a multi-entity detector. If there is any non-zero RCF score, we report the running state instead of the init state.
Third, this PR adds more information to the entity-level profile, including last_active_timestamp, last_sample_timestamp, init_progress, model, and state.
Fourth, this PR adds models and total_size_in_bytes to the multi-entity detector-level profile.
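
The state fix described in the second item can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class, the method resolveState, and its parameters are hypothetical, and the RCF scores are passed in as a plain list instead of being searched from the result index.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these names come from the plugin. */
final class DetectorStateSketch {
    enum State { INIT, RUNNING }

    /**
     * @param cachedInitProgress init progress read from the cache, in [0, 1]
     * @param rcfScoresInResultIndex RCF scores found in the result index for the detector
     */
    static State resolveState(double cachedInitProgress, List<Double> rcfScoresInResultIndex) {
        if (cachedInitProgress >= 1.0) {
            return State.RUNNING;
        }
        // The cache alone can under-report: an entity that already produced an
        // anomaly may have been evicted. Any positive RCF score in the result
        // index means at least one model has finished initializing.
        boolean anyNonZeroScore = rcfScoresInResultIndex.stream().anyMatch(score -> score > 0.0);
        return anyNonZeroScore ? State.RUNNING : State.INIT;
    }

    public static void main(String[] args) {
        System.out.println(resolveState(0.4, Arrays.asList(0.0, 1.7)));
        System.out.println(resolveState(0.4, Arrays.asList(0.0)));
    }
}
```

The point of the sketch is only the ordering of the checks: the cache is consulted first, and the result index acts as a second opinion before the init state is reported.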

This PR also fixes several "fail to return" issues in the REST API related transport actions. We did not return after sending a channel response, so when the channel was later used to send a response again, we got "java.lang.IllegalStateException: Channel is already closed."
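
The "fail to return" pattern can be reproduced with a small stand-in. This is a sketch under assumptions: OneShotChannel and both handler methods are hypothetical and only mimic a channel that accepts exactly one response; they are not the Elasticsearch REST channel classes.

```java
/** Hypothetical stand-in for a channel that accepts exactly one response. */
final class ReturnAfterResponseSketch {
    static final class OneShotChannel {
        private boolean closed = false;
        String lastResponse;

        void sendResponse(String response) {
            if (closed) {
                throw new IllegalStateException("Channel is already closed.");
            }
            lastResponse = response;
            closed = true;
        }
    }

    // Buggy shape: the error branch does not return, so a second send follows.
    static void handleWithoutReturn(OneShotChannel channel, boolean failed) {
        if (failed) {
            channel.sendResponse("error");
        }
        channel.sendResponse("ok");
    }

    // Fixed shape: return right after the response has been sent.
    static void handleWithReturn(OneShotChannel channel, boolean failed) {
        if (failed) {
            channel.sendResponse("error");
            return;
        }
        channel.sendResponse("ok");
    }

    public static void main(String[] args) {
        OneShotChannel channel = new OneShotChannel();
        handleWithReturn(channel, true);
        System.out.println(channel.lastResponse);
    }
}
```

Calling handleWithoutReturn with failed set to true throws the IllegalStateException on the second sendResponse call, which is the same shape of failure the description mentions.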

Testing done:

Manual testing passes.
Added unit tests.

After the change, we have the following output for multi-entity detectors:

http://localhost:9200/_opendistro/_anomaly_detection/detectors/T4c3dXUBj-2IZN7itix_/_profile?_all=true&pretty

{
	"state": "RUNNING",
	"models": [{
			"model_id": "T4c3dXUBj-2IZN7itix__entity_app_4",
			"model_size_in_bytes": 712480,
			"node_id": "g6pmr547QR-CfpEvO67M4g"
		},
		{
			"model_id": "T4c3dXUBj-2IZN7itix__entity_app_5",
			"model_size_in_bytes": 712480,
			"node_id": "g6pmr547QR-CfpEvO67M4g"
		},
		{
			"model_id": "T4c3dXUBj-2IZN7itix__entity_app_6",
			"model_size_in_bytes": 712480,
			"node_id": "g6pmr547QR-CfpEvO67M4g"
		},
		{
			"model_id": "T4c3dXUBj-2IZN7itix__entity_app_0",
			"model_size_in_bytes": 712480,
			"node_id": "g6pmr547QR-CfpEvO67M4g"
		},
		{
			"model_id": "T4c3dXUBj-2IZN7itix__entity_app_1",
			"model_size_in_bytes": 712480,
			"node_id": "g6pmr547QR-CfpEvO67M4g"
		},
		{
			"model_id": "T4c3dXUBj-2IZN7itix__entity_app_2",
			"model_size_in_bytes": 712480,
			"node_id": "g6pmr547QR-CfpEvO67M4g"
		},
		{
			"model_id": "T4c3dXUBj-2IZN7itix__entity_app_3",
			"model_size_in_bytes": 712480,
			"node_id": "g6pmr547QR-CfpEvO67M4g"
		}
	],
	"total_size_in_bytes": 4987360,
	"init_progress": {
		"percentage": "100%"
	},
	"total_entities": 7,
	"active_entities": 7
}

http://localhost:9200/_opendistro/_anomaly_detection/detectors/T4c3dXUBj-2IZN7itix_/_profile?_all=true&entity=app_6

{
    "category_field": "service",
    "value": "app_6",
    "is_active": true,
    "last_active_timestamp": 1604026394879,
    "last_sample_timestamp": 1604026394879,
    "init_progress": {
        "percentage": "100%"
    },
    "model": {
        "model_id": "TFUdd3UBBwIAGQeRh5IS_entity_app_6",
        "model_size_in_bytes": 712480,
        "node_id": "MQ-bTBW3Q2uU_2zX3pyEQg"
    },
    "state": "RUNNING"
}

http://localhost:9200/_opendistro/_anomaly_detection/detectors/T4c3dXUBj-2IZN7itix_/_profile/entity_info,init_progress?entity=app_0

{
    "category_field": "service",
    "value": "app_0",
    "is_active": true,
    "last_active_timestamp": 1604020861321,
    "last_sample_timestamp": 1604020861321,
    "init_progress": {
        "percentage": "100%"
    }
}

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

yizheliu-amazon · 2020-10-30T03:53:26Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorProfileRunner.java

+                        if (profilesToCollect.contains(DetectorProfileName.TOTAL_ENTITIES)) {
+                            totalResponsesToWait++;
+                        }
+                        if (profilesToCollect.contains(DetectorProfileName.COORDINATING_NODE)
+                            || profilesToCollect.contains(DetectorProfileName.SHINGLE_SIZE)
+                            || profilesToCollect.contains(DetectorProfileName.TOTAL_SIZE_IN_BYTES)
+                            || profilesToCollect.contains(DetectorProfileName.MODELS)
+                            || profilesToCollect.contains(DetectorProfileName.ACTIVE_ENTITIES)
+                            || profilesToCollect.contains(DetectorProfileName.INIT_PROGRESS)
+                            || profilesToCollect.contains(DetectorProfileName.STATE)) {
+                            totalResponsesToWait++;
+                        }


minor: can we combine these 2 ifs into a single one?

I separated them on purpose. Each group costs MultiResponsesDelegateActionListener one response.
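
To illustrate why the two groups are counted separately, here is a simplified sketch. The enum values mirror the profile names in the diff above, but ResponseCountSketch, responsesToWait, and CountingListener are hypothetical and much simpler than MultiResponsesDelegateActionListener. The assumption is that every profile in the second group is answered by one shared request, so the group adds one expected response no matter how many of its members are requested.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch of per-group response counting. */
final class ResponseCountSketch {
    enum Profile {
        TOTAL_ENTITIES, COORDINATING_NODE, SHINGLE_SIZE, TOTAL_SIZE_IN_BYTES,
        MODELS, ACTIVE_ENTITIES, INIT_PROGRESS, STATE
    }

    // Assumed to be served together by a single request, hence one response.
    static final Set<Profile> SHARED_GROUP = EnumSet.of(
        Profile.COORDINATING_NODE, Profile.SHINGLE_SIZE, Profile.TOTAL_SIZE_IN_BYTES,
        Profile.MODELS, Profile.ACTIVE_ENTITIES, Profile.INIT_PROGRESS, Profile.STATE);

    static int responsesToWait(Set<Profile> requested) {
        int total = 0;
        if (requested.contains(Profile.TOTAL_ENTITIES)) {
            total++; // its own request, its own response
        }
        if (!Collections.disjoint(requested, SHARED_GROUP)) {
            total++; // one response for the whole group
        }
        return total;
    }

    /** Completes only once the expected number of responses has arrived. */
    static final class CountingListener {
        private final int expected;
        private int received = 0;

        CountingListener(int expected) {
            this.expected = expected;
        }

        void onResponse() {
            received++;
        }

        boolean isDone() {
            return received >= expected;
        }
    }

    public static void main(String[] args) {
        Set<Profile> requested = EnumSet.of(Profile.STATE, Profile.MODELS);
        CountingListener listener = new CountingListener(responsesToWait(requested));
        listener.onResponse();
        System.out.println(listener.isDone());
    }
}
```

If the expected count is set higher than the number of responses that will ever arrive, isDone never becomes true, which is the kind of hang the PR description attributes to an incorrect maxResponseCount.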

yizheliu-amazon · 2020-10-30T04:03:45Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/caching/EntityCache.java

+     * @return if the entity is in the cache, return the timestamp in epoch
+     * milliseconds when the entity's state is lastly used.  Otherwise, return -1.
+     */
+    long getLastActiveMs(String detectorId, String entityModelId);


just name it getLastActiveModels

For "Ms", I meant milliseconds. Please see https://en.wikipedia.org/wiki/Millisecond.

yizheliu-amazon

thanks for the change.

codecov · 2020-10-30T04:47:49Z

Codecov Report

Merging #298 into master will increase coverage by 0.76%.
The diff coverage is 63.89%.

@@             Coverage Diff              @@
##             master     #298      +/-   ##
============================================
+ Coverage     71.25%   72.01%   +0.76%     
- Complexity     1869     1967      +98     
============================================
  Files           194      199       +5     
  Lines          9024     9466     +442     
  Branches        766      844      +78     
============================================
+ Hits           6430     6817     +387     
- Misses         2231     2236       +5     
- Partials        363      413      +50

Flag	Coverage Δ	Complexity Δ
#plugin	`71.41% <63.89%> (+0.85%)`	`1967.00 <98.00> (+98.00)`

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ	Complexity Δ
...n/opendistroforelasticsearch/ad/MemoryTracker.java	`77.02% <0.00%> (-1.06%)`	`21.00 <0.00> (ø)`
...distroforelasticsearch/ad/constant/CommonName.java	`66.66% <ø> (ø)`	`1.00 <0.00> (ø)`
...opendistroforelasticsearch/ad/ml/ModelManager.java	`90.60% <0.00%> (ø)`	`110.00 <0.00> (ø)`
...stroforelasticsearch/ad/model/DetectorProfile.java	`29.41% <0.00%> (-1.61%)`	`14.00 <0.00> (-2.00)`
...ransport/DeleteAnomalyDetectorTransportAction.java	`58.88% <0.00%> (+5.47%)`	`16.00 <0.00> (+1.00)`
.../ad/util/MultiResponsesDelegateActionListener.java	`78.04% <ø> (-5.68%)`	`13.00 <0.00> (-2.00)`
...stroforelasticsearch/ad/caching/PriorityCache.java	`83.12% <24.00%> (-6.38%)`	`59.00 <0.00> (ø)`
...distroforelasticsearch/ad/model/EntityProfile.java	`35.37% <36.93%> (+35.37%)`	`5.00 <3.00> (+5.00)`
...distroforelasticsearch/ad/caching/CacheBuffer.java	`79.05% <45.45%> (-3.22%)`	`38.00 <1.00> (ø)`
...stroforelasticsearch/ad/model/AnomalyDetector.java	`65.51% <50.00%> (+0.66%)`	`60.00 <2.00> (+4.00)`
... and 37 more

ylwu-amzn · 2020-10-30T05:39:12Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorProfileRunner.java

+            if (hits.getTotalHits().value == 0L) {
+                processInitResponse(detector, profilesToCollect, totalUpdates, false, profileBuilder, listener);
+            } else {
+                createRunningStateAndInitProgress(profilesToCollect, profileBuilder);


If a running detector is stopped and then restarted but has not passed initialization yet, we can still find anomaly results with anomaly score > 0 because the detector was running before. In this case we can't tell for sure that the detector is in the running state.

I am searching records older than the job's enabled time. Does that cover the issue you mentioned?

Cool, makes sense.

kaituo requested review from ylwu-amzn and yizheliu-amazon October 29, 2020 19:05

kaituo added the enhancement and bug labels Oct 29, 2020

yizheliu-amazon reviewed Oct 30, 2020

yizheliu-amazon approved these changes Oct 30, 2020

kaituo force-pushed the profile branch from 52400f8 to b43af49 Compare October 30, 2020 04:46

ylwu-amzn reviewed Oct 30, 2020

ylwu-amzn approved these changes Oct 30, 2020

kaituo merged commit c2a5f4e into opendistro-for-elasticsearch:master Oct 30, 2020

weicongs-amazon removed the bug label Nov 17, 2020
