
record error and execution start/end time in AD result; handle except… #59

Merged

Conversation

Contributor

@ylwu-amzn ylwu-amzn commented Mar 10, 2020

Description of changes:

  • record error and execution start/end time in AD result
  • handle exception from AD result action.
    If the exception is an EndRunException: 1) if endNow=true, stop the AD job run; 2) if endNow=false, retry the AD job; if all retries fail or there is not enough time left to retry, stop the AD job (see the sketch after this description).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
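For illustration, here is a minimal sketch of the handling described above. It is not the PR's actual code: EndRunException is simplified, and stopAdJob/scheduleRetry are hypothetical hooks.

    import java.util.Iterator;

    // Sketch only: illustrates the EndRunException handling described in the PR description.
    class EndRunHandlingSketch {

        // Simplified stand-in for the plugin's EndRunException with its endNow flag.
        static class EndRunException extends RuntimeException {
            private final boolean endNow;
            EndRunException(String message, boolean endNow) { super(message); this.endNow = endNow; }
            boolean isEndNow() { return endNow; }
        }

        void handle(String detectorId, Exception e, Iterator<Long> backoffDelaysMillis,
                    long timeLeftMillis, long intervalMillis) {
            if (!(e instanceof EndRunException)) {
                return; // other exceptions are handled elsewhere
            }
            if (((EndRunException) e).isEndNow()) {
                stopAdJob(detectorId);                                 // 1) endNow=true: stop the AD job run
            } else if (backoffDelaysMillis.hasNext() && timeLeftMillis > intervalMillis / 2) {
                scheduleRetry(detectorId, backoffDelaysMillis.next()); // 2) endNow=false: retry with backoff
            } else {
                stopAdJob(detectorId);                                 // retries exhausted or not enough time left
            }
        }

        void stopAdJob(String detectorId) { /* stop the job and record the error in the AD result */ }
        void scheduleRetry(String detectorId, long delayMillis) { /* re-run anomaly detection after the delay */ }
    }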

(EndRunException) exception
);
} else {
if (backoff.hasNext()) {
Member

For retryable EndRunException, I didn't mean to retry right away. I meant to let the AD job trigger the retry in the next few runs. After that, if things do not improve, stop the AD job.

The backoff setting for EndRunException retry here is too fast. The formula of backoff delay is:
start + 10 * ((int) Math.exp(0.8d * (currentlyConsumed)) - 1)
In your current setting,
start = 1 sec
currentlyConsumed is in [0, 2] since we are going to retry 3 times.
So the retries would happen after 1 sec, 11 sec, and 31 sec.

See the definition of ExponentialBackoffIterator in Elasticsearch for further details.
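To make the delays concrete, here is a small standalone sketch that reproduces the quoted formula (treating start and the result as seconds, as in the numbers above); it is an illustration, not the Elasticsearch implementation.

    // Sketch: reproduces the quoted backoff formula to show the resulting delays.
    public class BackoffDelaySketch {
        public static void main(String[] args) {
            int start = 1; // start = 1 sec, as in the setting above
            for (int currentlyConsumed = 0; currentlyConsumed < 3; currentlyConsumed++) {
                int delaySeconds = start + 10 * ((int) Math.exp(0.8d * currentlyConsumed) - 1);
                System.out.println("retry " + (currentlyConsumed + 1) + ": " + delaySeconds + " sec");
            }
            // prints: retry 1: 1 sec, retry 2: 11 sec, retry 3: 31 sec
        }
    }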

Some issues, like training data not being available, cannot be fixed right away. We need to wait several more AD job intervals (retries * AD job interval).

Also, would AD job run and the backoff retry happen at the same time? If yes, would there be any problem?

Contributor Author
@ylwu-amzn ylwu-amzn Mar 12, 2020

I think we need to retry several times to avoid skipping anomaly results due to some transient exception. I have worked on a change to stop the job if several job runs fail consecutively. Will publish the revision soon.

If training data is not available, why throw EndRunException rather than ResourceNotFoundException? The user may ingest data soon.

Also, would AD job run and the backoff retry happen at the same time? If yes, would there be any problem?

Will not retry if the time left is less than half the interval minus a buffer (see the sketch below). Any job run may happen while the last run is still ongoing; hanguang's cancel request change will handle this.
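A rough sketch of that time check (the names and the 10-second buffer value are assumptions for illustration):

    // Sketch: skip the backoff retry when the time left before the next job run
    // is less than half the detection interval minus a safety buffer.
    class RetryTimeCheckSketch {
        private static final long BUFFER_MILLIS = 10_000L; // assumed 10-second buffer

        boolean hasEnoughTimeToRetry(long nextRunStartMillis, long detectionIntervalMillis) {
            long timeLeftMillis = nextRunStartMillis - System.currentTimeMillis();
            return timeLeftMillis > detectionIntervalMillis / 2 - BUFFER_MILLIS;
        }
    }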

}
});
// if AD job fails consecutively due to EndRunException and the failure count exceeds the upper limit, stop the AD job
if (detectorEndRunExceptionCount.get(detectorId) > maxRetryForEndRunException) {
Member

After much thought, I am still inclined to remove the backoff retry because we don't have a clear use case where a quick retry is needed, and there are complications with the backoff retry. These are the 3 cases where EndRunException is thrown with endNow being false:

  • training data for cold start not available
  • cold start cannot succeed
  • unknown prediction error

Two of the causes are related to cold start. Let's discuss them one by one:

  • training data for cold start not available: the situation won't improve within seconds.
  • cold start cannot succeed and unknown prediction error: these indicate bugs on our side or heavy system load. A quick retry does not help.

Some complications of retry quickly:

  • Cold start is expensive as it runs 24 queries, initializes models, and saves checkpoints. Retrying cold start quickly can impose performance pressure.
  • A 10-second buffer might not be enough to prevent multiple cold starts (started by the backoff retry and the AD job) at the same time, since cold start cannot finish within a few seconds; Lai once said it takes about 20~30 seconds. The process can take longer if customers' feature queries are complex and we have more features. Hanguang's cancel would only apply if prediction queries for the current window are running. Cancel won't happen once the current window's queries finish, but we are still running 24 queries for cold start, initializing models, and saving checkpoints.
  • We are using our own threadpool, which is limited to 1/4 of the cores. Retry uses this threadpool and may slow down other jobs.

Contributor Author

Thanks for your analysis. Here the backoff retry is meant to resolve transient exceptions.

Cold start training data not being available is not a transient exception. We need to build finer-grained exceptions later to distinguish non-retryable and retryable exceptions. If we can't know which exceptions are transient and retryable in AnomalyResultTransportAction, I'm OK to remove the backoff retry now to avoid the performance issue. But that's a tradeoff: without retrying, some transient exception will cause the current job run to fail, and if there is an anomaly, the user will miss it and will not get an alerting notification. Sometimes missing the anomaly & notification is not acceptable. For example, if the current detection interval is 1 hour and there should be an anomaly in the current interval, some transient exception may fail the current AD job, so no anomaly is found and the user never knows about it. Then we start the next AD job; maybe there is no anomaly in the next hour, so the user will never know something wrong happened. In one word, this is a tradeoff between protecting our performance, user experience, and what we can do currently.

So, can you help confirm whether we can know which exceptions are retryable in AnomalyResultTransportAction? If we can't, I will remove this backoff retry.

Contributor Author

As we discussed offline, we can define some exceptions, like failing to get an RCF/threshold model result, as retryable exceptions. Such exceptions are transient and may be resolved by a backoff retry. Will add a TODO now and we can change it later. A rough sketch of the idea follows.
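For illustration only (class names are hypothetical; the plugin's actual exception hierarchy may differ):

    // Sketch: mark transient failures, e.g. failing to get an RCF/threshold model
    // result, as retryable so the caller can decide whether a backoff retry makes sense.
    class AnomalyDetectionException extends RuntimeException {
        AnomalyDetectionException(String message) { super(message); }
    }

    class RetryableException extends AnomalyDetectionException {
        RetryableException(String message) { super(message); }
    }

    class RetryDecisionSketch {
        // Only transient, retryable failures should trigger the backoff retry.
        boolean shouldRetry(Exception e) {
            return e instanceof RetryableException;
        }
    }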

}
));
} catch (Exception e) {
indexAnomalyResultException(jobParameter, lockService, lock, startTime, executionStartTime, e, true);
Member

Would we have a double lock release, since your code in the try block can also release the lock? How about adding an isLockReleasedOrExpired check before release? (A rough sketch of such a guard is below.)
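For illustration, one way to guard against a double release (the types and the release hook are hypothetical stand-ins, not the job scheduler's confirmed API):

    import java.util.concurrent.atomic.AtomicBoolean;

    // Sketch: ensure the lock is released at most once even if both the try block
    // and the catch block attempt to release it.
    class LockReleaseGuardSketch {
        private final AtomicBoolean released = new AtomicBoolean(false);

        void releaseOnce(Object lock) {
            // compareAndSet guarantees only the first caller actually releases the lock
            if (released.compareAndSet(false, true)) {
                doRelease(lock);
            }
        }

        void doRelease(Object lock) { /* delegate to LockService.release(...) */ }
    }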

Contributor Author
@ylwu-amzn ylwu-amzn Mar 13, 2020

Rechecked and found no double lock release. LockService already handles the exception. The lock expiration time equals the detection interval, so the lock will not expire during a job run.

Contributor
@yizheliu-amazon yizheliu-amazon left a comment

Please address some minor comments before merging.

@ylwu-amzn ylwu-amzn merged commit b4a8dec into opendistro-for-elasticsearch:development Mar 17, 2020