
record error and execution start/end time in AD result; handle except… #59

Merged

Conversation

Contributor

@ylwu-amzn ylwu-amzn commented Mar 10, 2020

Description of changes:

  • record error and execution start/end time in AD result
  • handle exception from AD result action.
    If the exception is an EndRunException: 1) if endNow=true, stop the AD job run; 2) if endNow=false, retry the AD job; if all retries fail or there is not enough time left to retry, stop the AD job (see the sketch after this description).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
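For illustration, here is a minimal sketch of the handling described above. It is not the PR's actual code: EndRunException is simplified, and stopAdJob/scheduleRetry are hypothetical hooks.

    import java.util.Iterator;

    // Sketch only: illustrates the EndRunException handling described in the PR description.
    class EndRunHandlingSketch {

        // Simplified stand-in for the plugin's EndRunException with its endNow flag.
        static class EndRunException extends RuntimeException {
            private final boolean endNow;
            EndRunException(String message, boolean endNow) { super(message); this.endNow = endNow; }
            boolean isEndNow() { return endNow; }
        }

        void handle(String detectorId, Exception e, Iterator<Long> backoffDelaysMillis,
                    long timeLeftMillis, long intervalMillis) {
            if (!(e instanceof EndRunException)) {
                return; // other exceptions are handled elsewhere
            }
            if (((EndRunException) e).isEndNow()) {
                stopAdJob(detectorId);                                 // 1) endNow=true: stop the AD job run
            } else if (backoffDelaysMillis.hasNext() && timeLeftMillis > intervalMillis / 2) {
                scheduleRetry(detectorId, backoffDelaysMillis.next()); // 2) endNow=false: retry with backoff
            } else {
                stopAdJob(detectorId);                                 // retries exhausted or not enough time left
            }
        }

        void stopAdJob(String detectorId) { /* stop the job and record the error in the AD result */ }
        void scheduleRetry(String detectorId, long delayMillis) { /* re-run anomaly detection after the delay */ }
    }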

(EndRunException) exception
);
} else {
if (backoff.hasNext()) {
Member

For retryable EndRunException, I didn't mean to retry right away. I meant to let the AD job trigger the retry in the next few runs. After that, if things do not improve, stop the AD job.

The backoff setting for EndRunException retry here is too fast. The formula of backoff delay is:
start + 10 * ((int) Math.exp(0.8d * (currentlyConsumed)) - 1)
In your current setting,
start = 1 sec
currentlyConsumed is in [0, 2] since we are going to retry 3 times.
So the retries would happen after 1 sec, 11 sec, and 31 sec.

See the definition of ExponentialBackoffIterator in Elasticsearch for further details.
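To make the delays concrete, here is a small standalone sketch that reproduces the quoted formula (treating start and the result as seconds, as in the numbers above); it is an illustration, not the Elasticsearch implementation.

    // Sketch: reproduces the quoted backoff formula to show the resulting delays.
    public class BackoffDelaySketch {
        public static void main(String[] args) {
            int start = 1; // start = 1 sec, as in the setting above
            for (int currentlyConsumed = 0; currentlyConsumed < 3; currentlyConsumed++) {
                int delaySeconds = start + 10 * ((int) Math.exp(0.8d * currentlyConsumed) - 1);
                System.out.println("retry " + (currentlyConsumed + 1) + ": " + delaySeconds + " sec");
            }
            // prints: retry 1: 1 sec, retry 2: 11 sec, retry 3: 31 sec
        }
    }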

Some issues, like training data not being available, cannot be fixed right away. We need to wait several more AD job intervals (retries * AD job interval).

Also, would AD job run and the backoff retry happen at the same time? If yes, would there be any problem?

Contributor Author
@ylwu-amzn ylwu-amzn Mar 12, 2020

I think we need to retry several times to avoid skipping anomaly results due to some transient exception. I have worked on a change to stop the job if several job runs fail consecutively. Will publish the revision soon.

If training data is not available, why throw EndRunException rather than ResourceNotFoundException? The user may ingest data soon.

Also, would AD job run and the backoff retry happen at the same time? If yes, would there be any problem?

Will not retry if the time left is less than half the interval minus a buffer (see the sketch below). Any job run may happen while the last run is still ongoing; hanguang's cancel request change will handle this.
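A rough sketch of that time check (the names and the 10-second buffer value are assumptions for illustration):

    // Sketch: skip the backoff retry when the time left before the next job run
    // is less than half the detection interval minus a safety buffer.
    class RetryTimeCheckSketch {
        private static final long BUFFER_MILLIS = 10_000L; // assumed 10-second buffer

        boolean hasEnoughTimeToRetry(long nextRunStartMillis, long detectionIntervalMillis) {
            long timeLeftMillis = nextRunStartMillis - System.currentTimeMillis();
            return timeLeftMillis > detectionIntervalMillis / 2 - BUFFER_MILLIS;
        }
    }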

}
});
// if AD job fails consecutively due to EndRunException and the failure count exceeds the upper limit, stop the AD job
if (detectorEndRunExceptionCount.get(detectorId) > maxRetryForEndRunException) {
Member

After much thought, I am still inclined to remove the backoff retry because we don't have a clear use case where a quick retry is needed, and there are complications with the backoff retry. These are the 3 cases where EndRunException is thrown with endNow being false:

  • training data for cold start not available
  • cold start cannot succeed
  • unknown prediction error

Two of the causes are related to cold start. Let's discuss them one by one:

  • training data for cold start not available: the situation won't improve within seconds.
  • cold start cannot succeed and unknown prediction error: these indicate bugs on our side or heavy system load. A quick retry does not help.

Some complications of retry quickly:

  • Cold start is expensive as it runs 24 queries, initializes models, and saves checkpoints. Retrying cold start quickly can impose performance pressure.
  • A 10-second buffer might not be enough to prevent multiple cold starts (started by the backoff retry and the AD job) at the same time, since cold start cannot finish within a few seconds; Lai once said it takes about 20~30 seconds. The process can take longer if customers' feature queries are complex and we have more features. Hanguang's cancel would only apply if prediction queries for the current window are running. Cancel won't happen once the current window's queries finish, but we are still running 24 queries for cold start, initializing models, and saving checkpoints.
  • We are using our own threadpool, which is limited to 1/4 of the cores. Retry uses this threadpool and may slow down other jobs.

Contributor Author

Thanks for your analysis. Here the backoff retry is meant to resolve transient exceptions.

Cold start training data not being available is not a transient exception. We need to build finer-grained exceptions later to distinguish non-retryable and retryable exceptions. If we can't know which exceptions are transient and retryable in AnomalyResultTransportAction, I'm OK to remove the backoff retry now to avoid the performance issue. But that's a tradeoff: without retrying, some transient exception will cause the current job run to fail, and if there is an anomaly, the user will miss it and will not get an alerting notification. Sometimes missing the anomaly & notification is not acceptable. For example, if the current detection interval is 1 hour and there should be an anomaly in the current interval, some transient exception may fail the current AD job, so no anomaly is found and the user never knows about it. Then we start the next AD job; maybe there is no anomaly in the next hour, so the user will never know something wrong happened. In one word, this is a tradeoff between protecting our performance, user experience, and what we can do currently.

So, can you help confirm whether we can know which exceptions are retryable in AnomalyResultTransportAction? If we can't, I will remove this backoff retry.

Contributor Author

As we discussed offline, we can define some exceptions, like failing to get an RCF/threshold model result, as retryable exceptions. Such exceptions are transient and may be resolved by a backoff retry. Will add a TODO now and we can change it later. A rough sketch of the idea follows.
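For illustration only (class names are hypothetical; the plugin's actual exception hierarchy may differ):

    // Sketch: mark transient failures, e.g. failing to get an RCF/threshold model
    // result, as retryable so the caller can decide whether a backoff retry makes sense.
    class AnomalyDetectionException extends RuntimeException {
        AnomalyDetectionException(String message) { super(message); }
    }

    class RetryableException extends AnomalyDetectionException {
        RetryableException(String message) { super(message); }
    }

    class RetryDecisionSketch {
        // Only transient, retryable failures should trigger the backoff retry.
        boolean shouldRetry(Exception e) {
            return e instanceof RetryableException;
        }
    }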

}
));
} catch (Exception e) {
indexAnomalyResultException(jobParameter, lockService, lock, startTime, executionStartTime, e, true);
Member

Would we have a double lock release, since your code in the try block can also release the lock? How about adding an isLockReleasedOrExpired check before release? (A rough sketch of such a guard is below.)
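For illustration, one way to guard against a double release (the types and the release hook are hypothetical stand-ins, not the job scheduler's confirmed API):

    import java.util.concurrent.atomic.AtomicBoolean;

    // Sketch: ensure the lock is released at most once even if both the try block
    // and the catch block attempt to release it.
    class LockReleaseGuardSketch {
        private final AtomicBoolean released = new AtomicBoolean(false);

        void releaseOnce(Object lock) {
            // compareAndSet guarantees only the first caller actually releases the lock
            if (released.compareAndSet(false, true)) {
                doRelease(lock);
            }
        }

        void doRelease(Object lock) { /* delegate to LockService.release(...) */ }
    }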

Contributor Author
@ylwu-amzn ylwu-amzn Mar 13, 2020

Rechecked and found no double lock release. LockService already handles the exception. The lock expiration time equals the detection interval, so the lock will not expire during a job run.

Contributor
@yizheliu-amazon yizheliu-amazon left a comment

Please address some minor comments before merging.

@ylwu-amzn ylwu-amzn merged commit b4a8dec into opendistro-for-elasticsearch:development Mar 17, 2020