Add shingle size, total model size, and model's hash ring to profile API #113

kaituo · 2020-05-06T22:19:26Z

Issue #, if available:
#111

Description of changes:

The hash ring helps identify that node X runs the AD job for detector Y, with models on nodes 1, 2, and 3. This helps on-call engineers locate logs. Total model size gives transparency into current memory usage. In addition, shingle size helps answer the question "why does my detector not report anything?"

This PR adds the above information to the profile API via a broadcast call that consults ModelManager and FeatureManager about the current state pertaining to a detector. These states are then consolidated into a human-readable response.

This PR also queries all AD result indices instead of only the current result index, so that we can still fetch a stopped detector's error after the result index containing the error has been rotated.

Testing done:

1. Added unit tests for the newly added code.
2. Ran end-to-end testing to verify that the new profiles make sense both while a detector is running and after it stops.

Example:

% curl -X GET "localhost:9200/_opendistro/_anomaly_detection/detectors/cneh7HEBHPICjJIdXdrR/_profile?_all=true&pretty"
{
  "state" : "RUNNING",
  "models" : [
    {
      "model_id" : "cneh7HEBHPICjJIdXdrR_model_rcf_2",
      "model_size_in_bytes" : 4456448,
      "node_id" : "VS29z70PSzOdHiEw4SoV9Q"
    },
    {
      "model_id" : "cneh7HEBHPICjJIdXdrR_model_rcf_1",
      "model_size_in_bytes" : 4456448,
      "node_id" : "VS29z70PSzOdHiEw4SoV9Q"
    },
    {
      "model_id" : "cneh7HEBHPICjJIdXdrR_model_threshold",
      "node_id" : "Og23iUroTdKrkwS-y89zLw"
    },
    {
      "model_id" : "cneh7HEBHPICjJIdXdrR_model_rcf_0",
      "model_size_in_bytes" : 4456448,
      "node_id" : "Og23iUroTdKrkwS-y89zLw"
    }
  ],
  "shingle_size" : 8,
  "coordinating_node" : "Og23iUroTdKrkwS-y89zLw",
  "total_size_in_bytes" : 13369344
}
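
The relationship between the per-model sizes and total_size_in_bytes in the response above can be sketched as follows. This is a minimal illustration, not the plugin's code: ModelSizeEntry and ProfileTotals are hypothetical names, and treating an entry without model_size_in_bytes as contributing nothing is an assumption that is consistent with the example (3 x 4456448 = 13369344).

```java
import java.util.List;

// Hypothetical stand-in for one entry of the "models" array in the response.
final class ModelSizeEntry {
    final String modelId;
    final String nodeId;
    // null when the entry carries no size, as with the threshold model in the example
    final Long sizeInBytes;

    ModelSizeEntry(String modelId, String nodeId, Long sizeInBytes) {
        this.modelId = modelId;
        this.nodeId = nodeId;
        this.sizeInBytes = sizeInBytes;
    }
}

// Hypothetical helper that consolidates per-model sizes into one total.
final class ProfileTotals {
    private ProfileTotals() {}

    // Sums the sizes that are present; entries without a size are skipped (assumption).
    static long totalSizeInBytes(List<ModelSizeEntry> models) {
        long total = 0L;
        for (ModelSizeEntry model : models) {
            if (model.sizeInBytes != null) {
                total += model.sizeInBytes;
            }
        }
        return total;
    }
}
```

Feeding the four entries from the example into ProfileTotals.totalSizeInBytes gives the reported total, which is a quick way to sanity-check a profile response by hand.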

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

yizheliu-amazon · 2020-05-06T23:05:20Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorProfileRunner.java

@@ -280,8 +312,42 @@ private SearchRequest createLatestAnomalyResultRequest(String detectorId, long e

        SearchSourceBuilder source = new SearchSourceBuilder().query(filterQuery).size(1).sort(sortQuery);

-        SearchRequest request = new SearchRequest(AnomalyResult.ANOMALY_RESULT_INDEX);
+        SearchRequest request = new SearchRequest(AnomalyDetectionIndices.ALL_AD_RESULTS_INDEX_PATTERN);


cool. I guess this can fix this issue as well: #111

jmazanec15

A couple questions: (1) Does Documentation need to be updated? (2) "This PR also queries all AD result indices instead of only current result index so that we can fetch a stopped detector's error" Could this call eventually get very expensive if many result indices are queried?

jmazanec15 · 2020-05-06T22:29:17Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorProfileRunner.java

@@ -70,9 +78,26 @@ public void profile(String detectorId, ActionListener<DetectorProfile> listener,
            return;
        }

+        int totalListener = 0;


Nit: Add a comment describing what this means for those not familiar with MultiResponseDelegateActionListener
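
For readers unfamiliar with the pattern, here is a rough sketch of why a listener count is needed. It assumes that totalListener is the number of asynchronous responses that must arrive before the merged profile is returned; CountingMergeListener is a hypothetical, simplified class and not the plugin's MultiResponseDelegateActionListener.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical illustration: collects partial results and fires once all have arrived.
final class CountingMergeListener {
    private final int expectedResponses;
    private final AtomicInteger received = new AtomicInteger();
    private final List<String> parts = Collections.synchronizedList(new ArrayList<>());
    private final Consumer<List<String>> onAllDone;

    CountingMergeListener(int expectedResponses, Consumer<List<String>> onAllDone) {
        this.expectedResponses = expectedResponses;
        this.onAllDone = onAllDone;
    }

    // Called once per partial response; the final call triggers the downstream callback.
    void onResponse(String part) {
        parts.add(part);
        if (received.incrementAndGet() == expectedResponses) {
            onAllDone.accept(new ArrayList<>(parts));
        }
    }
}
```

If the expected count is too low, the callback fires before every part has arrived; if it is too high, the callback never fires. That is why the counter deserves an explanatory comment.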

jmazanec15 · 2020-05-06T22:42:18Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/model/ModelProfile.java

+import org.elasticsearch.common.xcontent.XContentBuilder;
+
+public class ModelProfile implements Writeable, ToXContent {
+    // field name in toXContent


jmazanec15 · 2020-05-06T22:44:23Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/model/ProfileName.java

    public static ProfileName getName(String name) {
        switch (name) {
            case "state":
                return STATE;
            case "error":
                return ERROR;
+            case "coordinating_node":


Why hardcode these strings?

changed to "case CommonName.COORDINATING_NODE:".
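
A minimal sketch of what that change might look like, assuming the shared names are compile-time String constants (which is what allows them to be used as case labels). CommonNameSketch and ProfileNameSketch are hypothetical stand-ins; the real CommonName and ProfileName classes may define different members. The field names are taken from the example response in the PR description.

```java
// Hypothetical holder for shared field names; the real CommonName class may differ.
final class CommonNameSketch {
    static final String STATE = "state";
    static final String ERROR = "error";
    static final String COORDINATING_NODE = "coordinating_node";
    static final String SHINGLE_SIZE = "shingle_size";
    static final String TOTAL_SIZE_IN_BYTES = "total_size_in_bytes";
    static final String MODELS = "models";

    private CommonNameSketch() {}
}

// Hypothetical enum mirroring the switch in the diff, with constants instead of literals.
enum ProfileNameSketch {
    STATE(CommonNameSketch.STATE),
    ERROR(CommonNameSketch.ERROR),
    COORDINATING_NODE(CommonNameSketch.COORDINATING_NODE),
    SHINGLE_SIZE(CommonNameSketch.SHINGLE_SIZE),
    TOTAL_SIZE_IN_BYTES(CommonNameSketch.TOTAL_SIZE_IN_BYTES),
    MODELS(CommonNameSketch.MODELS);

    private final String name;

    ProfileNameSketch(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    // Case labels must be constant expressions; static final String fields qualify.
    static ProfileNameSketch getName(String name) {
        switch (name) {
            case CommonNameSketch.STATE:
                return STATE;
            case CommonNameSketch.ERROR:
                return ERROR;
            case CommonNameSketch.COORDINATING_NODE:
                return COORDINATING_NODE;
            case CommonNameSketch.SHINGLE_SIZE:
                return SHINGLE_SIZE;
            case CommonNameSketch.TOTAL_SIZE_IN_BYTES:
                return TOTAL_SIZE_IN_BYTES;
            case CommonNameSketch.MODELS:
                return MODELS;
            default:
                throw new IllegalArgumentException("Unsupported profile type: " + name);
        }
    }
}
```

Keeping each string in one place means the REST parameter parsing and the response serialization cannot drift apart.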

jmazanec15 · 2020-05-06T22:47:09Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/rest/RestGetAnomalyDetectorAction.java

        String typesStr = request.param(TYPE);
        String rawPath = request.rawPath();
        if (!Strings.isEmpty(typesStr) || rawPath.endsWith(PROFILE) || rawPath.endsWith(PROFILE + "/")) {
+            boolean all = request.paramAsBoolean("all", false);


Use _all instead of all?

jmazanec15 · 2020-05-06T22:53:51Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/transport/ProfileResponse.java

+ */
+public class ProfileResponse extends BaseNodesResponse<ProfileNodeResponse> implements ToXContentFragment {
+    // field name in toXContent
+    static final String COORDINATING_NODE = "coordinating_node";


These strings are already defined aren't they?

Moved these strings to CommonName and use CommonName

yizheliu-amazon · 2020-05-06T23:22:30Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/transport/ProfileRequest.java

+    /**
+     * Key indicating all profiles should be retrieved
+     */
+    public static final String ALL_PROFILE_KEY = "_all";


is it used?

good catch. Removed.

yizheliu-amazon · 2020-05-06T23:27:26Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/transport/ProfileResponse.java

+    static final String COORDINATING_NODE = "coordinating_node";
+    static final String SHINGLE_SIZE = "shingle_size";
+    static final String TOTAL_SIZE = "total_size";
+    static final String MODELS = "models";


use ProfileName enum to replace these?

Moved these strings to CommonName and use CommonName

yizheliu-amazon · 2020-05-06T23:32:29Z

src/test/java/test/com/amazon/opendistroforelasticsearch/ad/util/JsonDeserializer.java

     * @param paths      path fragments
-     * @return list of double


why removing this?

added back with changed contents.

yizheliu-amazon · 2020-05-06T23:34:48Z

src/test/java/com/amazon/opendistroforelasticsearch/ad/rest/AnomalyDetectorRestApiIT.java

+
+        Response profileResponse = getDetectorProfile(detector.getDetectorId(), true);
+        assertEquals("Incorrect profile status", RestStatus.OK, restStatus(profileResponse));
+    }


Is there a test case for customized profiles?

yizheliu-amazon

a few minor issues. no blocker

kaituo · 2020-05-07T01:15:49Z

> A couple questions: (1) Does Documentation need to be updated? (2) "This PR also queries all AD result indices instead of only current result index so that we can fetch a stopped detector's error" Could this call eventually get very expensive if many result indices are queried?

(1) Yes, I will contact our doc writer.
(2) It can be. It's a tradeoff between accuracy and performance. I didn't include all AD result indices because I am afraid of performance issues. Let's see.

* Add shingle size, total model size, and model's hash ring to profile API (#113)

* Fix bug in profile API (#115)

DetectorProfile's merge does not include the newly added fields. This PR fixes that.

Testing done:
* Manually verified that the profile API works as expected

* Improve profile API's error fetching efficiency (#117)

Previously, the profile API scanned all anomaly result indices to get a detector's most recent error, which can cause a performance bottleneck with large anomaly result indices. This PR improves this in three ways. First, when a detector is running, we only need to scan the current index, not the rolled-over ones, since we are interested in the latest error. Second, when a detector is disabled, we only need to scan the latest anomaly result indices created before the detector's enable time. Third, setting track total hits to false lets ES terminate the search early: ES will not try to count the number of documents and can end the query as soon as N documents have been collected per segment.

Testing done:
1. Patched a cluster with 1,000 detectors and 2GB anomaly result indices. Without the PR, scanning anomaly result indices 1,000 times would time out after 30 seconds. After the PR, we no longer see the timeout.
2. A detector's error message can be on a rotated index. Added a test case to make sure we get error info from a .opendistro-anomaly-results index that has been rolled over.

kaituo requested review from jmazanec15 and yizheliu-amazon May 6, 2020 22:19

yizheliu-amazon reviewed May 6, 2020

jmazanec15 reviewed May 6, 2020

yizheliu-amazon reviewed May 6, 2020

yizheliu-amazon approved these changes May 7, 2020

kaituo closed this May 7, 2020

kaituo reopened this May 7, 2020

jmazanec15 approved these changes May 7, 2020

kaituo force-pushed the allProfile branch from 198676b to cab19f9 Compare May 7, 2020 03:00

kaituo merged commit e06cf6f into opendistro-for-elasticsearch:opendistro-1.4 May 7, 2020
