Adds page about recovering from a failed job to anomaly detection docs
(elastic#1667) (elastic#1673)

Co-authored-by: Lisa Cawley <lcawley@elastic.co>
szabosteve and lcawl authored May 25, 2021
1 parent e66885c commit efcbae2
Showing 3 changed files with 50 additions and 0 deletions.
2 changes: 2 additions & 0 deletions docs/en/stack/ml/anomaly-detection/index.asciidoc
@@ -16,6 +16,8 @@ include::create-jobs.asciidoc[leveloffset=+2]
include::job-tips.asciidoc[leveloffset=+3]
include::stopping-ml.asciidoc[leveloffset=+2]

include::ml-restart-failed-jobs.asciidoc[leveloffset=+2]

include::anomaly-detection-scale.asciidoc[leveloffset=+2]

include::ml-api-quickref.asciidoc[leveloffset=+1]
1 change: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ you visualize and explore the results.

* <<create-jobs>>
* <<stopping-ml>>
* <<ml-restart-failed-jobs>>

After you learn how to create and stop {anomaly-detect} jobs, you can check the
<<anomaly-examples>> for more advanced settings and scenarios.
47 changes: 47 additions & 0 deletions docs/en/stack/ml/anomaly-detection/ml-restart-failed-jobs.asciidoc
@@ -0,0 +1,47 @@
[role="xpack"]
[[ml-restart-failed-jobs]]
= Restart failed {anomaly-jobs}

If an {anomaly-job} fails, try to restart it by following the procedure
described below. If the restarted job runs as expected, the problem that
caused the failure was transient and needs no further investigation. If the
job fails again soon after the restart, the problem is persistent and needs
further investigation. In this case, find out which node the failed job was
running on by checking the job stats on the **Job management** pane in {kib}.
Then get the logs for that node and look for exceptions and errors whose
messages contain the ID of the {anomaly-job} to better understand the issue.
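
For example, a minimal sketch that retrieves the stats for the `my_job`
{anomaly-job} used in the examples below, by using the
{ref}/ml-get-job-stats.html[get {anomaly-job} statistics API]; where
available, the `node` section of the response identifies the node the job is
or was assigned to:

[source,console]
--------------------------------------------------
GET _ml/anomaly_detectors/my_job/_stats
--------------------------------------------------
// TEST[skip]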

If an {anomaly-job} has failed, do the following to recover from the `failed`
state:

. _Force_ stop the corresponding {dfeed} by using the
{ref}/ml-stop-datafeed.html[Stop {dfeed} API] with the `force` parameter set
to `true`. For example, the following request force stops the `my_datafeed`
{dfeed}:
+
--
[source,console]
--------------------------------------------------
POST _ml/datafeeds/my_datafeed/_stop
{
  "force": true
}
--------------------------------------------------
// TEST[skip]
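
Alternatively, you can pass `force` as a query parameter instead of in the
request body; a minimal sketch of the equivalent request:

[source,console]
--------------------------------------------------
POST _ml/datafeeds/my_datafeed/_stop?force=true
--------------------------------------------------
// TEST[skip]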
--

. _Force_ close the {anomaly-job} by using the
{ref}/ml-close-job.html[Close {anomaly-job} API] with the `force` parameter
set to `true`. For example, the following request force closes the `my_job`
{anomaly-job}:
+
--
[source,console]
--------------------------------------------------
POST _ml/anomaly_detectors/my_job/_close?force=true
--------------------------------------------------
// TEST[skip]
--

. Restart the {anomaly-job} on the **Job management** pane in {kib}.
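
If you prefer to use the APIs instead of {kib}, the following is a minimal
sketch that reopens the job and restarts its {dfeed} by using the
{ref}/ml-open-job.html[open {anomaly-jobs} API] and the
{ref}/ml-start-datafeed.html[start {dfeeds} API], reusing the example names
`my_job` and `my_datafeed` from the steps above:

[source,console]
--------------------------------------------------
POST _ml/anomaly_detectors/my_job/_open

POST _ml/datafeeds/my_datafeed/_start
--------------------------------------------------
// TEST[skip]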
