Skip to content
This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Improve profile API #298

Merged
merged 1 commit into from
Oct 30, 2020
Merged

Conversation

kaituo
Copy link
Member

@kaituo kaituo commented Oct 29, 2020

Issue #, if available:

Description of changes:
This PR did various things to improve profile API:
First, the PR fixed the hang issue. Previously, when users run _profile/state or _profile on the multi-entity detector, the request hangs. The problem is due to incorrect maxResponseCount passed to MultiResponsesDelegateActionListener.
Second, the PR fixes the multi-entity detector's wrong state issue. Previously, we can show the init state after an anomaly has shown up. We may have the problem because we read the most active entity's init progress in the cache for a detector's init_progress. But the entity already produced anomaly has been evicted out of the cache. This PR fixes the issue by double-checking the result index's non-zero RCF score for a multi-entity detector before reporting the init state. If there is any non-zero RCF score, we say running state instead of the initing state.
Third, this PR adds more information to the entity level profile, including last_active_timestamp, last_sample_timestamp, init_progress, model, and state.
Fourth, this PR adds models and total_size_in_bytes to the multi-entity detector level profile.

This PR also fixes various "fail to return" issues in the rest API related transport action. We didn't return after sending channel responses. Later, when we use the channel to send back responses again, we get " java.lang.IllegalStateException: Channel is already closed."

Testing done:

  1. manual testing passes.
  2. added unit tests

After the change, we have the following output for multi-entity detectors:

http://localhost:9200/_opendistro/_anomaly_detection/detectors/T4c3dXUBj-2IZN7itix_/_profile?_all=true&pretty

{
	"state": "RUNNING",
	"models": [{
			"model_id": "T4c3dXUBj-2IZN7itix__entity_app_4",
			"model_size_in_bytes": 712480,
			"node_id": "g6pmr547QR-CfpEvO67M4g"
		},
		{
			"model_id": "T4c3dXUBj-2IZN7itix__entity_app_5",
			"model_size_in_bytes": 712480,
			"node_id": "g6pmr547QR-CfpEvO67M4g"
		},
		{
			"model_id": "T4c3dXUBj-2IZN7itix__entity_app_6",
			"model_size_in_bytes": 712480,
			"node_id": "g6pmr547QR-CfpEvO67M4g"
		},
		{
			"model_id": "T4c3dXUBj-2IZN7itix__entity_app_0",
			"model_size_in_bytes": 712480,
			"node_id": "g6pmr547QR-CfpEvO67M4g"
		},
		{
			"model_id": "T4c3dXUBj-2IZN7itix__entity_app_1",
			"model_size_in_bytes": 712480,
			"node_id": "g6pmr547QR-CfpEvO67M4g"
		},
		{
			"model_id": "T4c3dXUBj-2IZN7itix__entity_app_2",
			"model_size_in_bytes": 712480,
			"node_id": "g6pmr547QR-CfpEvO67M4g"
		},
		{
			"model_id": "T4c3dXUBj-2IZN7itix__entity_app_3",
			"model_size_in_bytes": 712480,
			"node_id": "g6pmr547QR-CfpEvO67M4g"
		}
	],
	"total_size_in_bytes": 4987360,
	"init_progress": {
		"percentage": "100%"
	},
	"total_entities": 7,
	"active_entities": 7
}

http://localhost:9200/_opendistro/_anomaly_detection/detectors/T4c3dXUBj-2IZN7itix_/_profile?_all=true&entity=app_6
{
    "category_field": "service",
    "value": "app_6",
    "is_active": true,
    "last_active_timestamp": 1604026394879,
    "last_sample_timestamp": 1604026394879,
    "init_progress": {
        "percentage": "100%"
    },
    "model": {
        "model_id": "TFUdd3UBBwIAGQeRh5IS_entity_app_6",
        "model_size_in_bytes": 712480,
        "node_id": "MQ-bTBW3Q2uU_2zX3pyEQg"
    },
    "state": "RUNNING"
}

http://localhost:9200/_opendistro/_anomaly_detection/detectors/T4c3dXUBj-2IZN7itix_/_profile/entity_info,init_progress?entity=app_0

{
    "category_field": "service",
    "value": "app_0",
    "is_active": true,
    "last_active_timestamp": 1604020861321,
    "last_sample_timestamp": 1604020861321,
    "init_progress": {
        "percentage": "100%"
    }
}

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@kaituo kaituo added enhancement New feature or request bug Something isn't working labels Oct 29, 2020
Comment on lines +131 to +143
if (profilesToCollect.contains(DetectorProfileName.TOTAL_ENTITIES)) {
totalResponsesToWait++;
}
if (profilesToCollect.contains(DetectorProfileName.COORDINATING_NODE)
|| profilesToCollect.contains(DetectorProfileName.SHINGLE_SIZE)
|| profilesToCollect.contains(DetectorProfileName.TOTAL_SIZE_IN_BYTES)
|| profilesToCollect.contains(DetectorProfileName.MODELS)
|| profilesToCollect.contains(DetectorProfileName.ACTIVE_ENTITIES)
|| profilesToCollect.contains(DetectorProfileName.INIT_PROGRESS)
|| profilesToCollect.contains(DetectorProfileName.STATE)) {
totalResponsesToWait++;
}
Copy link
Contributor

@yizheliu-amazon yizheliu-amazon Oct 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: can we combine these 2 if into single one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I separate them on purpose. Each group will cost MultiResponsesDelegateActionListener one response.

* @return if the entity is in the cache, return the timestamp in epoch
* milliseconds when the entity's state is lastly used. Otherwise, return -1.
*/
long getLastActiveMs(String detectorId, String entityModelId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just name it getLastActiveModels

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For "Ms", I meant milliseconds. Please see https://en.wikipedia.org/wiki/Millisecond.

Copy link
Contributor

@yizheliu-amazon yizheliu-amazon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the change.

This PR did various things to improve profile API:
First, the PR fixed the hang issue. Previously, when users run _profile/state or _profile on the multi-entity detector, the request hangs. The problem is due to incorrect maxResponseCount passed to MultiResponsesDelegateActionListener.
Second, the PR fixes the multi-entity detector's wrong state issue. Previously, we can show the init state after an anomaly has shown up. We may have the problem because we read the most active entity's init progress in the cache for a detector's init_progress. But the entity already produced anomaly has been evicted out of the cache. This PR fixes the issue by double-checking the result index's non-zero RCF score for a multi-entity detector before reporting the init state. If there is any non-zero RCF score, we say running state instead of the initing state.
Third, this PR adds more information to the entity level profile, including last_active_timestamp, last_sample_timestamp, init_progress, model, and state.
Fourth, this PR adds models and total_size_in_bytes to the multi-entity detector level profile.

This PR also fixes various "fail to return" issues in the rest API related transport action. We didn't return after sending channel responses. Later, when we use the channel to send back responses again, we get " java.lang.IllegalStateException: Channel is already closed."

Testing done:
1. manual testing passes.
2. actively adding unit tests
@codecov
Copy link

codecov bot commented Oct 30, 2020

Codecov Report

Merging #298 into master will increase coverage by 0.76%.
The diff coverage is 63.89%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #298      +/-   ##
============================================
+ Coverage     71.25%   72.01%   +0.76%     
- Complexity     1869     1967      +98     
============================================
  Files           194      199       +5     
  Lines          9024     9466     +442     
  Branches        766      844      +78     
============================================
+ Hits           6430     6817     +387     
- Misses         2231     2236       +5     
- Partials        363      413      +50     
Flag Coverage Δ Complexity Δ
#plugin 71.41% <63.89%> (+0.85%) 1967.00 <98.00> (+98.00)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ Complexity Δ
...n/opendistroforelasticsearch/ad/MemoryTracker.java 77.02% <0.00%> (-1.06%) 21.00 <0.00> (ø)
...distroforelasticsearch/ad/constant/CommonName.java 66.66% <ø> (ø) 1.00 <0.00> (ø)
...opendistroforelasticsearch/ad/ml/ModelManager.java 90.60% <0.00%> (ø) 110.00 <0.00> (ø)
...stroforelasticsearch/ad/model/DetectorProfile.java 29.41% <0.00%> (-1.61%) 14.00 <0.00> (-2.00)
...ransport/DeleteAnomalyDetectorTransportAction.java 58.88% <0.00%> (+5.47%) 16.00 <0.00> (+1.00)
.../ad/util/MultiResponsesDelegateActionListener.java 78.04% <ø> (-5.68%) 13.00 <0.00> (-2.00)
...stroforelasticsearch/ad/caching/PriorityCache.java 83.12% <24.00%> (-6.38%) 59.00 <0.00> (ø)
...distroforelasticsearch/ad/model/EntityProfile.java 35.37% <36.93%> (+35.37%) 5.00 <3.00> (+5.00)
...distroforelasticsearch/ad/caching/CacheBuffer.java 79.05% <45.45%> (-3.22%) 38.00 <1.00> (ø)
...stroforelasticsearch/ad/model/AnomalyDetector.java 65.51% <50.00%> (+0.66%) 60.00 <2.00> (+4.00)
... and 37 more

if (hits.getTotalHits().value == 0L) {
processInitResponse(detector, profilesToCollect, totalUpdates, false, profileBuilder, listener);
} else {
createRunningStateAndInitProgress(profilesToCollect, profileBuilder);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a running detector stopped, then restart but not pass initialization yet. We can find anomaly results with anomaly score > 0 as the detector was running before. We can't tell the detector is at running status exactly for this case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am searching records older than the job's enabled time. Does that cover the issue you mentioned?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, make sense.

@kaituo kaituo merged commit c2a5f4e into opendistro-for-elasticsearch:master Oct 30, 2020
@weicongs-amazon weicongs-amazon removed the bug Something isn't working label Nov 17, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants