Improve profile API's error fetching efficiency #117

kaituo · 2020-05-08T20:42:20Z

Issue #, if available:
#111

Description of changes:

Previously, the profile API scanned all anomaly result indices to get a detector's most recent error, which can become a performance bottleneck when those indices are large. This PR reduces that cost in three ways.

First, when a detector is running, we only need to scan the current index, not the rolled-over ones, since we are interested in the latest error.
Second, when a detector is disabled, we only need to scan the latest anomaly result indices created before the detector's disabled time.
Third, setting track_total_hits to false lets ES terminate the search early. ES will not try to count the number of matching documents and can end the query as soon as N documents have been collected per segment.
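
To make the third point concrete, here is a minimal sketch of what a "latest error" search body could look like. It is not the code in this PR: the class and method names are hypothetical, the field names (detector_id, error, data_end_time) are assumptions about the anomaly result mapping, and the body is built as a plain JSON string only to keep the sketch free of dependencies. The one piece taken from the description above is track_total_hits set to false.

```java
public class LatestErrorQuery {

    /**
     * Builds a search body that asks for the single most recent result
     * document of one detector that carries an error.
     * Assumption: detectorId contains no characters that need JSON escaping.
     */
    public static String latestErrorSearchBody(String detectorId) {
        return "{"
            // we only want the newest matching document
            + "\"size\":1,"
            // do not count all matches, so the search can stop early
            + "\"track_total_hits\":false,"
            // newest first (field name is an assumption)
            + "\"sort\":[{\"data_end_time\":{\"order\":\"desc\"}}],"
            + "\"query\":{\"bool\":{\"filter\":["
            + "{\"term\":{\"detector_id\":\"" + detectorId + "\"}},"
            + "{\"exists\":{\"field\":\"error\"}}"
            + "]}}"
            + "}";
    }

    public static void main(String[] args) {
        // Print the body so it can be pasted into a _search request.
        System.out.println(latestErrorSearchBody("my-detector-id"));
    }
}
```

With size 1 and no total hit count requested, the response still contains the newest matching document, which is all the profile API needs to report the latest error.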

Testing done:

1. Patched a cluster with 1,000 detectors and 2GB of anomaly result indices. Without the PR, scanning the anomaly result indices 1,000 times timed out after 30 seconds. With the PR, the timeout no longer occurs.
2. A detector's error message can be on a rotated index. Added a test case to make sure we get the error info from a .opendistro-anomaly-results index that has been rolled over.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

yizheliu-amazon · 2020-05-08T22:07:57Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorProfileRunner.java

+                    int date = Integer.parseInt(m.group(3));
+                    // month starts with 0
+                    calendar.clear();
+                    calendar.set(year, month - 1, date);


Will January cause any issue? In the case of January the parsed month is 1, so month - 1 is 0, and I am not sure whether that can cause a problem.

I would suggest initializing the calendar with the current year/month/date and calling calendar.add(Calendar.MONTH, -1) instead.

Follow-up: I just checked the Javadoc. The month for January is 0 here, so this is fine.
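
For readers following the January question, the sketch below shows the zero-based month handling in isolation. The class name, method name, and regular expression are hypothetical. The date format is inferred from the index name quoted later in this thread (.opendistro-anomaly-results-history-2020.05.07-001), and using UTC is an assumption made to keep the result deterministic.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IndexDateParser {

    // Assumed naming scheme: a yyyy.MM.dd date embedded in the index name.
    private static final Pattern DATE_IN_NAME =
        Pattern.compile("(\\d{4})\\.(\\d{2})\\.(\\d{2})");

    /**
     * Returns the epoch millis of midnight (UTC) for the date embedded in
     * the index name, or -1 if the name contains no date.
     */
    public static long indexDateMillis(String indexName) {
        Matcher m = DATE_IN_NAME.matcher(indexName);
        if (!m.find()) {
            return -1L;
        }
        int year = Integer.parseInt(m.group(1));
        int month = Integer.parseInt(m.group(2)); // 1-based in the index name
        int date = Integer.parseInt(m.group(3));

        Calendar calendar = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        calendar.clear(); // drop the current time so only year/month/date remain
        // Calendar months are 0-based: Calendar.JANUARY == 0, so "01" maps to 0.
        calendar.set(year, month - 1, date);
        return calendar.getTimeInMillis();
    }

    public static void main(String[] args) {
        String january = ".opendistro-anomaly-results-history-2020.01.15-001";
        System.out.println(indexDateMillis(january));
    }
}
```

Because Calendar.JANUARY is 0, subtracting 1 from the parsed month is the correct conversion for every month, including January.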

ylwu-amzn · 2020-05-08T22:36:11Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorProfileRunner.java

+                    // time
+                    if (timestamp <= disabledTimeMillis && maxTimestamp <= timestamp) {
+                        maxTimestamp = timestamp;
+                        // we can have two rotations on the same day and we don't know which one has our data, so we keep all


One edge case: suppose the detector interval is 1 minute.
1. The detector's last run is at 2020-05-07 11:59:50 PM, and the AD result index is then rolled over as .opendistro-anomaly-results-history-2020.05.07-001.
2. The detector's next run would be at 2020-05-08 00:00:50 AM. If the user stops the detector at 2020-05-08 00:00:10 AM, the detector will not have any AD result on 2020-05-08.
So this code change will check the latest AD result index on 2020-05-08, since 2020-05-08 <= 2020-05-08 00:00:10 AM (disabledTime), but we can't find any AD result for this detector on 2020-05-08. How about checking the last two days' AD result indices to make sure we can always get an AD result? This is similar to setting the monitor interval to 2*detector_interval.

Good point. Changed.
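
One way to read the suggestion above is sketched below. This is an illustration of the idea, not the change that was actually pushed: the class and method names are hypothetical, the date format is inferred from the index name in the comment, and dates are compared in UTC as an assumption. The sketch keeps every index from the two most recent dates at or before the disabled time, so several rotations on the same day are all retained.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DisabledDetectorIndexSelector {

    // Assumed naming scheme: a yyyy.MM.dd date embedded in the index name.
    private static final Pattern DATE_IN_NAME =
        Pattern.compile("(\\d{4})\\.(\\d{2})\\.(\\d{2})");

    /**
     * Returns the indices to scan for a disabled detector: all indices whose
     * embedded date is one of the two most recent dates that are not after
     * the disabled time. Names without a date are ignored.
     */
    public static List<String> candidateIndices(Collection<String> indexNames,
                                                long disabledTimeMillis) {
        LocalDate disabledDate = Instant.ofEpochMilli(disabledTimeMillis)
            .atZone(ZoneOffset.UTC)
            .toLocalDate();

        // Group index names by their embedded date, sorted by date.
        TreeMap<LocalDate, List<String>> byDate = new TreeMap<>();
        for (String name : indexNames) {
            Matcher m = DATE_IN_NAME.matcher(name);
            if (!m.find()) {
                continue;
            }
            LocalDate indexDate = LocalDate.of(
                Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)));
            if (indexDate.isAfter(disabledDate)) {
                continue; // created after the detector was disabled
            }
            byDate.computeIfAbsent(indexDate, k -> new ArrayList<>()).add(name);
        }

        // Walk dates newest first and keep the two most recent ones.
        List<String> result = new ArrayList<>();
        int datesTaken = 0;
        for (List<String> names : byDate.descendingMap().values()) {
            if (datesTaken == 2) {
                break;
            }
            result.addAll(names);
            datesTaken++;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> indices = List.of(
            ".opendistro-anomaly-results-history-2020.05.07-001",
            ".opendistro-anomaly-results-history-2020.05.08-001");
        long disabledAt = Instant.parse("2020-05-08T00:00:10Z").toEpochMilli();
        System.out.println(candidateIndices(indices, disabledAt));
    }
}
```

In the scenario from the comment, both the 2020.05.08 index and the 2020.05.07 index are scanned, so the result written at 11:59:50 PM on 2020-05-07 is still found.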

ohltyler

Great, LGTM

ylwu-amzn

LGTM. Thanks for the change.

* Add shingle size, total model size, and model's hash ring to profile API (#113)
  The hash ring helps identify that node X runs the AD job for a detector Y with models on nodes 1, 2, 3. This helps on-calls locate logs. Total model size gives transparency into the current memory usage. What's more, shingle size helps answer the question "why does my detector not report anything?" This PR adds the above info to the profile API via a broadcast call that consults ModelManager and FeatureManager about the current state pertaining to a detector. These states are then consolidated into information humans can parse. This PR also queries all AD result indices instead of only the current result index so that we can fetch a stopped detector's error after the result index with the error is rotated.
  Testing done: 1. Added unit tests for the newly added code. 2. Ran end-to-end testing to verify that the new profiles make sense both when a detector stops running and when it is running.

* Fix bug in profile API (#115)
  DetectorProfile's merge does not include the newly added fields. This PR fixes that.
  Testing done: manually verified that the profile API works as expected.

* Improve profile API's error fetching efficiency (#117)
  Same message as the description of this pull request.

kaituo requested review from ylwu-amzn, yizheliu-amazon and ohltyler May 8, 2020 20:42

yizheliu-amazon reviewed May 8, 2020

ylwu-amzn reviewed May 8, 2020

ohltyler approved these changes May 8, 2020

yizheliu-amazon approved these changes May 8, 2020

Scan one more index to address edge case

b1f597b

ylwu-amzn approved these changes May 9, 2020

kaituo merged commit a40ccf6 into opendistro-for-elasticsearch:opendistro-1.4 May 9, 2020

kaituo mentioned this pull request Jun 29, 2020

Adds initialization progress to profile API #164

Merged
