
[ML] After a node restart FAILED and CLOSING jobs count as OPENING #31794

Closed
droberts195 opened this issue Jul 4, 2018 · 1 comment

@droberts195 (Contributor)

When a node is restarted, ML job persistent tasks that were allocated to that node are reallocated to other nodes unless they are in the FAILED or CLOSING state. (Prior to 6.2.4 there was another bug that meant we'd reallocate such jobs to other nodes and set them back to OPENED.)

However, the logic that limits how many jobs may be OPENING on a particular node treats every job allocated to that node whose persistent task is stale as OPENING. This means that if you restart a node that has two or more FAILED or CLOSING jobs, it cannot open any more jobs after the restart (because we only allow two OPENING jobs per node at any time).

To prevent this problem, FAILED and CLOSING jobs should not be considered OPENING if their persistent task status is stale.
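As a minimal sketch of that counting rule (illustrative Java only, not the actual Elasticsearch implementation; `JobTask`, `JobState` and the `staleAllocation` flag are simplified stand-ins for the real persistent-task classes), a stale task would only count towards the per-node OPENING limit when it is not FAILED or CLOSING:

```java
import java.util.List;

// Simplified sketch of the proposed counting rule; not the real
// Elasticsearch code. The types below are illustrative stand-ins.
public class OpeningJobCounter {

    enum JobState { OPENING, OPENED, FAILED, CLOSING, CLOSED }

    record JobTask(String jobId, JobState state, boolean staleAllocation) {}

    /**
     * Counts the jobs on a node that should be treated as OPENING for the
     * "max concurrently opening jobs" limit. A task with a stale allocation
     * ID is normally assumed to be (re)opening, but FAILED and CLOSING jobs
     * are excluded because they will never reach OPENED and would otherwise
     * block the node after a restart.
     */
    static long countOpeningJobs(List<JobTask> tasksOnNode) {
        return tasksOnNode.stream()
            .filter(task -> task.state() == JobState.OPENING
                || (task.staleAllocation()
                    && task.state() != JobState.FAILED
                    && task.state() != JobState.CLOSING))
            .count();
    }

    public static void main(String[] args) {
        List<JobTask> tasks = List.of(
            new JobTask("job-1", JobState.FAILED, true),   // stale after restart
            new JobTask("job-2", JobState.CLOSING, true),  // stale after restart
            new JobTask("job-3", JobState.OPENING, false));
        // Prints 1: only job-3 counts against the per-node OPENING limit.
        System.out.println(countOpeningJobs(tasks));
    }
}
```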

/cc @jgowdyelastic

@elasticmachine (Collaborator)

Pinging @elastic/ml-core

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jul 4, 2018
Job persistent tasks with stale allocation IDs used to always be
considered as OPENING jobs in the ML job node allocation decision.
However, FAILED jobs are not relocated to other nodes, which leads
to them blocking up the nodes they failed on after node restarts.
FAILED jobs should not restrict how many other jobs can open on a
node, regardless of whether they are stale or not.

Closes elastic#31794
droberts195 added a commit that referenced this issue Jul 5, 2018
Job persistent tasks with stale allocation IDs used to always be
considered as OPENING jobs in the ML job node allocation decision.
However, FAILED jobs are not relocated to other nodes, which leads
to them blocking up the nodes they failed on after node restarts.
FAILED jobs should not restrict how many other jobs can open on a
node, regardless of whether they are stale or not.

Closes #31794
droberts195 added two further commits with the same message that referenced this issue Jul 5, 2018