[ML] Improve resuming a data frame analytics job stopped during inference #67623
If a DFA job is stopped while in the inference phase, it should start inference immediately after resuming. However, this is currently not the case. Inference is tied to `AnalyticsProcessManager`, so we start a process, load data, restore state, etc., before we get to start inference. This commit gets rid of this unnecessary delay by factoring inference out as an independent step and ensuring we can resume straight from that phase upon restarting a job.
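To illustrate the idea of resuming straight from a phase, here is a minimal sketch, not the actual implementation. The `Step` names, the progress parameters, and the `stepsToRun` helper are all hypothetical; the PR itself only confirms that inference becomes an independent step and that a job can resume from it.

```java
import java.util.ArrayList;
import java.util.List;

class ResumeSketch {

    // Hypothetical step names, chosen for illustration only.
    enum Step { REINDEXING, ANALYSIS, INFERENCE, FINAL }

    /** Picks the first step whose stored progress is below 100 percent. */
    static Step startingStep(int reindexing, int analysis, int inference) {
        if (reindexing < 100) return Step.REINDEXING;
        if (analysis < 100) return Step.ANALYSIS;
        if (inference < 100) return Step.INFERENCE;
        return Step.FINAL;
    }

    /** Returns the steps that still need to run, in order, skipping completed ones. */
    static List<Step> stepsToRun(int reindexing, int analysis, int inference) {
        Step start = startingStep(reindexing, analysis, inference);
        List<Step> steps = new ArrayList<>();
        for (Step step : Step.values()) {
            if (step.ordinal() >= start.ordinal()) {
                steps.add(step);
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        // Job stopped mid-inference: the analysis step (process start, data load,
        // state restore) is skipped entirely.
        System.out.println(stepsToRun(100, 100, 40)); // prints [INFERENCE, FINAL]
    }
}
```

With inference as its own step, a job stopped at 40% inference only has the inference and final steps left, which is the delay this change removes.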
Pinging @elastic/ml-core (:ml)
LGTM
* Updates the progress tracker with potentially new in-between phases
* that were introduced in a later version while making sure progress indicators
* are correct.
* @param analysisPhases the new analysis phases
To me the phrase "new analysis phases" suggests that analysisPhases
is a diff compared to some previous state whereas IIUC this is the full new list of phases.
Do you think it could be rephrased?
sure
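For context on the Javadoc discussed in this thread, a rough sketch of what adjusting progress against the full list of phases could look like. This is not the code under review: the `adjust` helper and the phase names are hypothetical, and the rule that a newly introduced phase counts as complete when a later phase already has progress is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ProgressPhasesSketch {

    /**
     * Rebuilds progress for the full, ordered list of phases of the current version.
     * Known phases keep their stored progress. Unknown (newly introduced) phases start
     * at 0, unless a later phase has already made progress, in which case they are
     * marked as complete (assumption of this sketch).
     */
    static Map<String, Integer> adjust(Map<String, Integer> stored, List<String> allPhases) {
        int[] progress = new int[allPhases.size()];
        boolean laterPhaseStarted = false;
        // Walk backwards so we know whether any later phase has progressed.
        for (int i = allPhases.size() - 1; i >= 0; i--) {
            Integer known = stored.get(allPhases.get(i));
            progress[i] = known != null ? known : (laterPhaseStarted ? 100 : 0);
            laterPhaseStarted |= progress[i] > 0;
        }
        // Emit the result in the order of the full phase list.
        Map<String, Integer> adjusted = new LinkedHashMap<>();
        for (int i = 0; i < progress.length; i++) {
            adjusted.put(allPhases.get(i), progress[i]);
        }
        return adjusted;
    }

    public static void main(String[] args) {
        // Stored progress predates a hypothetical "feature_selection" phase.
        Map<String, Integer> stored = Map.of("loading_data", 100, "writing_results", 20);
        List<String> allPhases = List.of("loading_data", "feature_selection", "writing_results");
        System.out.println(adjust(stored, allPhases));
    }
}
```

Note that `allPhases` here is the full list of phases, not a diff against the stored ones, which is the distinction the comment above asks the Javadoc to make clear.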
@elasticmachine update branch
#67669) If a DFA job is stopped while in the inference phase, after resuming we should start inference immediately. However, this is currently not the case. Inference is tied in `AnalyticsProcessManager` and thus we start a process, load data, restore state, etc., until we get to start inference. This commit gets rid of this unnecessary delay by factoring inference out as an independent step and ensuring we can resume straight from that phase upon restarting a job. Backport of #67623
In elastic#67623 I moved persisting the data counts at the end of a data frame analytics job into a `FinalStep` class. However, I forgot to execute the index request with ML origin resulting in authentication problems if the user that runs the DFA job does not have read privileges in the ML stats index. This commit fixes this by executing that index request with ML origin.
… (#67683) In #67623 I moved persisting the data counts at the end of a data frame analytics job into a `FinalStep` class. However, I forgot to execute the index request with ML origin resulting in authentication problems if the user that runs the DFA job does not have read privileges in the ML stats index. This commit fixes this by executing that index request with ML origin. Backport of #67674
Now that data frame analytics jobs can be resumed straight into the inference phase, we need to ensure data counts are persisted at the end of the analysis step and restored when the job is started again. This commit removes the need for storing the progress on start as a task parameter. Instead, when the task gets assigned we now restore all stats by making a call to the get stats API. Additionally, we now ensure that an allocated task that hasn't had its `StatsHolder` restored yet is treated as a stopped task from the get stats API, which means we will report the stored stats. Relates elastic#67623
… (#67979) Now that data frame analytics jobs can be resumed straight into the inference phase, we need to ensure data counts are persisted at the end of the analysis step and restored when the job is started again. This commit removes the need for storing the progress on start as a task parameter. Instead, when the task gets assigned we now restore all stats by making a call to the get stats API. Additionally, we now ensure that an allocated task that hasn't had its `StatsHolder` restored yet is treated as a stopped task from the get stats API, which means we will report the stored stats. Relates #67623 Backport of #67937
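As a rough illustration of the get stats behaviour described above, here is a minimal sketch under stated assumptions. The `Stats` and `Task` classes and the `restore` and `getStats` methods are hypothetical stand-ins, not the real `StatsHolder` or task classes.

```java
import java.util.Optional;

class StatsRestoreSketch {

    /** Hypothetical stand-in for what a job reports: progress and data counts. */
    static final class Stats {
        final int inferenceProgress;
        final long trainingDocsCount;

        Stats(int inferenceProgress, long trainingDocsCount) {
            this.inferenceProgress = inferenceProgress;
            this.trainingDocsCount = trainingDocsCount;
        }
    }

    /** Hypothetical task whose stats stay unset until they have been restored. */
    static final class Task {
        private volatile Stats stats; // null until the restore has completed

        /** Called once the stored stats have been fetched after assignment. */
        void restore(Stats restored) {
            this.stats = restored;
        }

        Optional<Stats> liveStats() {
            return Optional.ofNullable(stats);
        }
    }

    /**
     * Reports the task's stats only if the task exists and has restored them.
     * A missing task (stopped job) and an allocated task that has not restored
     * its stats yet are treated the same way: the stored stats are reported.
     */
    static Stats getStats(Task taskOrNull, Stats storedStats) {
        if (taskOrNull == null) {
            return storedStats;
        }
        return taskOrNull.liveStats().orElse(storedStats);
    }

    public static void main(String[] args) {
        Stats stored = new Stats(40, 1000L);
        Task task = new Task();
        // Allocated but not restored yet: falls back to the stored stats.
        System.out.println(getStats(task, stored).inferenceProgress);
        task.restore(new Stats(55, 1000L));
        // After the restore, the task's own stats are reported.
        System.out.println(getStats(task, stored).inferenceProgress);
    }
}
```

Treating a not-yet-restored task as stopped avoids reporting zeroed progress or data counts during the window between task assignment and the stats being restored.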