Use callbacks and bug fix #83

kaituo · 2020-04-09T18:20:11Z

Issue #, if available:
#78

Description of changes:
This PR includes the following changes:

1. Remove classes from jacocoExclusions that no longer need to be excluded, since we now have enough coverage for them.
2. Use ClientUtil instead of Elasticsearch’s client in the AD job runner.
3. Use one function to get the number of partitioned forests. Previously, we had redundant code in both ModelManager and ADStateManager.
4. Change ADStateManager.getAnomalyDetector to use a callback.
5. Change AnomalyResultTransportAction to use a callback to get features.
6. Add handling in AnomalyResultTransportAction for the cases where all features have been disabled or the user's index does not exist.
7. Change the methods that get RCF and threshold results to use callbacks, and add handling of IndexNotFoundException as a result of that change. Previously, these methods did not throw IndexNotFoundException.
8. Remove unused fields in StopDetectorTransportAction and AnomalyResultTransportAction.
9. Unwrap EsRejectedExecutionException, as it can be nested inside RemoteTransportException. Previously, we would not recognize the nested EsRejectedExecutionException and thus would not retry writing anomaly results.
10. Add an error field to the anomaly result schema.
11. Fix tests broken by these changes.
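
The unwrapping change for EsRejectedExecutionException can be illustrated with a minimal sketch. This is not the plugin's code: `RemoteWrapperException` and `ExceptionUnwrapper` are hypothetical names, and the JDK's `RejectedExecutionException` stands in for `EsRejectedExecutionException`. The idea is to walk the cause chain instead of testing only the outermost exception.

```java
import java.util.concurrent.RejectedExecutionException;

// Hypothetical stand-in for a transport-layer wrapper such as RemoteTransportException.
class RemoteWrapperException extends RuntimeException {
    RemoteWrapperException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class ExceptionUnwrapper {
    private ExceptionUnwrapper() {}

    // Walks the cause chain and returns the first throwable of the requested type, or null.
    // The depth bound guards against a cyclic cause chain.
    static <T extends Throwable> T findCause(Throwable top, Class<T> type) {
        int depth = 0;
        for (Throwable t = top; t != null && depth < 10; t = t.getCause(), depth++) {
            if (type.isInstance(t)) {
                return type.cast(t);
            }
        }
        return null;
    }

    // Assumption for this sketch: a write is retried when a rejection appears anywhere in the chain.
    static boolean shouldRetry(Throwable failure) {
        return findCause(failure, RejectedExecutionException.class) != null;
    }
}
```

With a check like `failure instanceof RejectedExecutionException` alone, a rejection wrapped by the remote layer would be missed; `shouldRetry` finds it because it inspects every cause.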

Testing done:

1. Unit and integration tests pass.
2. Ran end-to-end tests to confirm the fix achieves its purpose:
- The timeout issue is gone.
- When all features have been disabled or the index does not exist, we retry a few more times and then disable the AD job.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


ylwu-amzn · 2020-04-10T05:29:32Z

...in/java/com/amazon/opendistroforelasticsearch/ad/transport/AnomalyResultTransportAction.java

            if (!detector.isPresent()) {
-                listener.onFailure(new EndRunException(adID, "AnomalyDetector is not available.", true));
+                listener.onFailure(new EndRunException(adID, "AnomalyDetector is not available.", false));


Why not end run immediately if we can't find detector?

Changed it back to end run immediately.

ylwu-amzn · 2020-04-10T05:34:59Z

...in/java/com/amazon/opendistroforelasticsearch/ad/transport/AnomalyResultTransportAction.java

+            } else if (exception instanceof IllegalArgumentException) {
+                listener
+                    .onFailure(
+                        new EndRunException(adID, "Having trouble querying data. Maybe all of your features have been disabled.", false)


How about we check feature list and give user a definite answer? If that change takes time, please add some todo here.

good suggestion. Done.
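
A minimal sketch of the suggested check, assuming a simplified feature model. `FeatureSketch` and `FeatureCheck` are hypothetical names, not the plugin's classes. It inspects the feature list first, so the user gets a definite message instead of a guess.

```java
import java.util.List;

// Hypothetical, simplified feature model; the plugin's real feature class may differ.
class FeatureSketch {
    final String name;
    final boolean enabled;

    FeatureSketch(String name, boolean enabled) {
        this.name = name;
        this.enabled = enabled;
    }
}

final class FeatureCheck {
    private FeatureCheck() {}

    // True when no feature is enabled. An empty list also counts as "all disabled".
    static boolean allDisabled(List<FeatureSketch> features) {
        return features.stream().noneMatch(f -> f.enabled);
    }

    // Gives a definite answer when the feature list explains the failure,
    // and falls back to the exception message otherwise.
    static String describeFailure(List<FeatureSketch> features, Exception cause) {
        if (allDisabled(features)) {
            return "All features have been disabled.";
        }
        return "Having trouble querying data: " + cause.getMessage();
    }
}
```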

ylwu-amzn · 2020-04-10T05:36:50Z

...in/java/com/amazon/opendistroforelasticsearch/ad/transport/AnomalyResultTransportAction.java

+        }, exception -> {
+            LOG.warn(exception);
+            if (exception instanceof IndexNotFoundException) {
+                listener.onFailure(new EndRunException(adID, "Having trouble querying data: " + exception.getMessage(), false));


How about we change to "Can't find index XXX"? So user can know clearly the trouble is missing index, rather than others like wrong query or network latency.

exception.getMessage would return such information.
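
The callback style discussed in this thread can be sketched as follows. This is an illustration under stated assumptions, not the plugin's implementation: `ResultListener` is a minimal stand-in for a listener with success and failure paths, `MissingIndexException` stands in for IndexNotFoundException, and `FeatureFetcher` with its hard-coded feature values is hypothetical.

```java
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical minimal listener with a success path and a failure path.
interface ResultListener<T> {
    void onResponse(T result);

    void onFailure(Exception e);

    // Builds a listener from two lambdas.
    static <T> ResultListener<T> wrap(Consumer<T> responseHandler, Consumer<Exception> failureHandler) {
        return new ResultListener<T>() {
            @Override
            public void onResponse(T result) {
                responseHandler.accept(result);
            }

            @Override
            public void onFailure(Exception e) {
                failureHandler.accept(e);
            }
        };
    }
}

// Hypothetical stand-in for IndexNotFoundException; the message names the missing index.
class MissingIndexException extends RuntimeException {
    MissingIndexException(String index) {
        super("no such index [" + index + "]");
    }
}

final class FeatureFetcher {
    private final Set<String> existingIndices;

    FeatureFetcher(Set<String> existingIndices) {
        this.existingIndices = existingIndices;
    }

    // Reports the outcome through the listener instead of returning a value or throwing.
    void getCurrentFeatures(String index, ResultListener<double[]> listener) {
        if (!existingIndices.contains(index)) {
            listener.onFailure(new MissingIndexException(index));
            return;
        }
        listener.onResponse(new double[] { 1.0, 2.0 }); // placeholder feature values
    }

    // Mirrors the failure branch above: the exception message already names the index.
    static String toUserMessage(Exception e) {
        if (e instanceof MissingIndexException) {
            return "Having trouble querying data: " + e.getMessage();
        }
        return "Unexpected failure: " + e.getMessage();
    }
}
```

Because the failure reaches the caller as an exception object, the handler can branch on its type and reuse its message, which is the point made in the reply above.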

wnbts

it would be good to separate changes unrelated to callbacks into other prs to speed up delivery

wnbts · 2020-04-10T17:25:10Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/ml/ModelManager.java

+     * @return a pair of number of partitions and size of a partition (number of trees)
+     * @throws LimitExceededException when there is no sufficient resource available
+     */
+    public Entry<Integer, Integer> getPartitionedForestSizes(String detectorId, int rcfNumFeatures) {


if we are going to refactor this method, I suggest the new api just takes a detector object, which contains all the needed info and simpler to use.

We only use the detector id as part of the error message. We don't need other detector information.

if model manager takes a detector, it can compute the feature dimensions and partitioning so that will be only input needed and that will save client the work to provide a second rcfNumFeatures input. that's why i suggest doing that.

make sense. done

please take a look at the recent commit: d8ea9cf

wnbts · 2020-04-10T17:29:51Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/transport/ADStateManager.java

     * @return the number of RCF model's partition number for adID
     * @throws InterruptedException when we cannot get anomaly detector object for adID before timeout
     * @throws LimitExceededException when there is no sufficient resource available
     */
-    public int getPartitionNumber(String adID) throws InterruptedException {
+    public int getPartitionNumber(String adID, Optional<AnomalyDetector> detector) throws InterruptedException {


Minor. Why not validate detector first and just pass a detector afterwards? saving all the repetitive and unlikely handling of a non-existent detector.

good point. Done.
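
The suggestion can be sketched as validating the Optional once at the boundary and passing a plain object afterwards. `DetectorSketch` and `DetectorHandling` are hypothetical names used only for illustration, and the downstream method is a placeholder rather than the plugin's partitioning logic.

```java
import java.util.Optional;

// Hypothetical, simplified detector; the real AnomalyDetector carries much more information.
class DetectorSketch {
    final String id;
    final int featureCount;

    DetectorSketch(String id, int featureCount) {
        this.id = id;
        this.featureCount = featureCount;
    }
}

final class DetectorHandling {
    private DetectorHandling() {}

    // The only place that deals with a possibly missing detector.
    static int handleRequest(String adID, Optional<DetectorSketch> maybeDetector) {
        DetectorSketch detector = maybeDetector
            .orElseThrow(() -> new IllegalStateException("AnomalyDetector is not available: " + adID));
        return useDetector(detector);
    }

    // Downstream code takes the unwrapped detector and needs no absence handling.
    static int useDetector(DetectorSketch detector) {
        return detector.featureCount; // placeholder for real work
    }
}
```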

wnbts · 2020-04-10T17:33:15Z

src/main/resources/mappings/anomaly-results.json

@@ -45,6 +45,9 @@
    "execution_end_time": {
      "type": "date",
      "format": "strict_date_time||epoch_millis"
+    },
+    "error": {
+      "type": "text"


question. does error message need to be searched for?

Yes. The error message is not standardized, so it could be any string. It is better to make it searchable for easier operations.

… to null (opendistro-for-elasticsearch#77)

1. Change the default value of lastUpdateTime from the current timestamp to null. Before the change, creating a detector returns one lastUpdateTime, while getting a detector returns a different lastUpdateTime. The difference is confusing to the user, and they may wonder what has happened between the creating and getting detector calls. After the change, creating a detector returns no last update time, while getting a detector returns a last update time.
2. Replace the mocked threadpool in 2 tests with a real threadpool object.

Testing done:
1. verified lastUpdateTime change in a cluster
2. gradle build

1. remove unnecessary optional passing and return end run exception after making sure all features are disabled 2. End run immediately when anomaly detector is not available 3. make getPartitionedForestSizes take detector only

wnbts · 2020-04-13T20:35:16Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/transport/ADStateManager.java

     * @return the number of RCF model's partition number for adID
     * @throws InterruptedException when we cannot get anomaly detector object for adID before timeout
     * @throws LimitExceededException when there is no sufficient resource available
     */
-    public int getPartitionNumber(String adID) throws InterruptedException {
+    public int getPartitionNumber(String adID, AnomalyDetector detector) throws InterruptedException {


Minor. Is this exception still possible?

no, it is impossible now. Will fix.

wnbts · 2020-04-13T20:44:10Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/ml/ModelManager.java

+    *                 onFailure is called IllegalArgumentException when training data is invalid
+    *                 onFailure is called LimitExceededException when a limit for training is exceeded
+    */
+    public void trainModel(AnomalyDetector anomalyDetector, double[][] dataPoints, ActionListener<Void> listener) {


suggestion. it's clearer to rebase and force push to avoid showing pulled changes that are checked in.

yes, will try next time


Author: Kaituo Li <kaituo@amazon.com>
Date: Wed Apr 15 15:45:13 2020 -0700

Add state and error to profile API (opendistro-for-elasticsearch#84)

We want to make it easy for customers and oncalls to identify a detector’s state and error if any. This PR adds such information to our new profile API. We expect three kinds of states:
- Disabled: if get ad job api says the job is disabled;
- Init: if anomaly score after the last update time of the detector is larger than 0
- Running: if neither of the above applies and no exceptions.

Error is populated if error of the latest anomaly result is not empty.

Testing done:
- manual testing during a detector’s life cycle: not created, created but not started, started, during initialization, after initialization, stopped, restarted
- added unit tests to cover above scenario

commit 0c33050
Author: Kaituo Li <kaituo@amazon.com>
Date: Tue Apr 14 11:52:20 2020 -0700

Use callbacks and bug fix (opendistro-for-elasticsearch#83)

kaituo requested review from ylwu-amzn and yizheliu-amazon April 9, 2020 18:20

ylwu-amzn reviewed Apr 10, 2020

wnbts reviewed Apr 10, 2020

kaituo and others added 4 commits April 13, 2020 11:35

Add async getColdStartData (opendistro-for-elasticsearch#80)

b2fd4cd

add async trainModel (opendistro-for-elasticsearch#81)

d023166

Various

d8ea9cf

1. remove unnecessary optional passing and return end run exception after making sure all features are disabled 2. End run immediately when anomaly detector is not available 3. make getPartitionedForestSizes take detector only

wnbts approved these changes Apr 13, 2020

Remove impossible exception throwing

c554a4f

ylwu-amzn approved these changes Apr 14, 2020

kaituo merged commit 0c33050 into opendistro-for-elasticsearch:development Apr 14, 2020
