Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Improve resuming a data frame analytics job stopped during inference #67623

Conversation

dimitris-athanasiou
Copy link
Contributor

If a DFA job is stopped while in the inference phase, after
resuming we should start inference immediately. However, this
is currently not the case. Inference is tied in AnalyticsProcessManager
and thus we start a process, load data, restore state, etc., until
we get to start inference.

This commit gets rid of this unnecessary delay by factoring inference
out as an independent step and ensuring we can resume straight from
that phase upon restarting a job.

If a DFA job is stopped while in the inference phase, after
resuming we should start inference immediately. However, this
is currently not the case. Inference is tied in `AnalyticsProcessManager`
and thus we start a process, load data, restore state, etc., until
we get to start inference.

This commit gets rid of this unnecessary delay by factoring inference
out as an independent step and ensuring we can resume straight from
that phase upon restarting a job.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

Copy link
Contributor

@przemekwitek przemekwitek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

* Updates the progress tracker with potentially new in-between phases
* that were introduced in a later version while making sure progress indicators
* are correct.
* @param analysisPhases the new analysis phases
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me the phrase "new analysis phases" suggests that analysisPhases is a diff compared to some previous state whereas IIUC this is the full new list of phases.
Do you think it could be rephrased?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@dimitris-athanasiou
Copy link
Contributor Author

@elasticmachine update branch

@dimitris-athanasiou dimitris-athanasiou merged commit f449b8f into elastic:master Jan 18, 2021
@dimitris-athanasiou dimitris-athanasiou deleted the improve-resume-dfa-job-in-inference-phase branch January 18, 2021 16:19
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Jan 18, 2021
…c#67623)

If a DFA job is stopped while in the inference phase, after
resuming we should start inference immediately. However, this
is currently not the case. Inference is tied in `AnalyticsProcessManager`
and thus we start a process, load data, restore state, etc., until
we get to start inference.

This commit gets rid of this unnecessary delay by factoring inference
out as an independent step and ensuring we can resume straight from
that phase upon restarting a job.

Backport of elastic#67623
dimitris-athanasiou added a commit that referenced this pull request Jan 18, 2021
#67669)

If a DFA job is stopped while in the inference phase, after
resuming we should start inference immediately. However, this
is currently not the case. Inference is tied in `AnalyticsProcessManager`
and thus we start a process, load data, restore state, etc., until
we get to start inference.

This commit gets rid of this unnecessary delay by factoring inference
out as an independent step and ensuring we can resume straight from
that phase upon restarting a job.

Backport of #67623
@droberts195 droberts195 changed the title [ML] Improve resuming a DFA job stopped during inference [ML] Improve resuming a data frame analytics job stopped during inference Jan 18, 2021
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Jan 19, 2021
In elastic#67623 I moved persisting the data counts at the end of a
data frame analytics job into a `FinalStep` class. However,
I forgot to execute the index request with ML origin resulting
in authentication problems if the user that runs the DFA job
does not have read privileges in the ML stats index.

This commit fixes this by executing that index request with ML
origin.
dimitris-athanasiou added a commit that referenced this pull request Jan 19, 2021
In #67623 I moved persisting the data counts at the end of a
data frame analytics job into a `FinalStep` class. However,
I forgot to execute the index request with ML origin resulting
in authentication problems if the user that runs the DFA job
does not have read privileges in the ML stats index.

This commit fixes this by executing that index request with ML
origin.
dimitris-athanasiou added a commit that referenced this pull request Jan 19, 2021
… (#67683)

In #67623 I moved persisting the data counts at the end of a
data frame analytics job into a `FinalStep` class. However,
I forgot to execute the index request with ML origin resulting
in authentication problems if the user that runs the DFA job
does not have read privileges in the ML stats index.

This commit fixes this by executing that index request with ML
origin.

Backport of #67674
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Jan 25, 2021
Now that data frame analytics jobs can be resumed straight into
the inference phase, we need to ensure data counts are persisted
at the end of the analysis step and restored when the job is
started again.

This commit removes the need for storing the progress on start
as a task parameter. Instead, when the task gets assigned we now
restore all stats by making a call to the get stats API. Additionally,
we now ensure that an allocated task that hasn't had its `StatsHolder`
restored yet is treated as a stopped task from the get stats API, which
means we will report the stored stats.

Relates elastic#67623
dimitris-athanasiou added a commit that referenced this pull request Jan 26, 2021
Now that data frame analytics jobs can be resumed straight into
the inference phase, we need to ensure data counts are persisted
at the end of the analysis step and restored when the job is
started again.

This commit removes the need for storing the progress on start
as a task parameter. Instead, when the task gets assigned we now
restore all stats by making a call to the get stats API. Additionally,
we now ensure that an allocated task that hasn't had its `StatsHolder`
restored yet is treated as a stopped task from the get stats API, which
means we will report the stored stats.

Relates #67623
dimitris-athanasiou added a commit that referenced this pull request Jan 26, 2021
… (#67979)

Now that data frame analytics jobs can be resumed straight into
the inference phase, we need to ensure data counts are persisted
at the end of the analysis step and restored when the job is
started again.

This commit removes the need for storing the progress on start
as a task parameter. Instead, when the task gets assigned we now
restore all stats by making a call to the get stats API. Additionally,
we now ensure that an allocated task that hasn't had its `StatsHolder`
restored yet is treated as a stopped task from the get stats API, which
means we will report the stored stats.

Relates #67623

Backport of #67937
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants