[ML] After a node restart FAILED and CLOSING jobs count as OPENING #31794
Comments
Pinging @elastic/ml-core
droberts195 added a commit to droberts195/elasticsearch that referenced this issue on Jul 4, 2018:
Job persistent tasks with stale allocation IDs used to always be considered as OPENING jobs in the ML job node allocation decision. However, FAILED jobs are not relocated to other nodes, which leads to them blocking up the nodes they failed on after node restarts. FAILED jobs should not restrict how many other jobs can open on a node, regardless of whether they are stale or not. Closes elastic#31794
droberts195 added three further commits that referenced this issue on Jul 5, 2018, each with the same commit message:
Job persistent tasks with stale allocation IDs used to always be considered as OPENING jobs in the ML job node allocation decision. However, FAILED jobs are not relocated to other nodes, which leads to them blocking up the nodes they failed on after node restarts. FAILED jobs should not restrict how many other jobs can open on a node, regardless of whether they are stale or not. Closes #31794
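The rule change described in these commit messages can be sketched as follows. This is a minimal illustration, not the actual Elasticsearch implementation: `JobTask`, `OpeningJobCounter`, and the field names are hypothetical, and the exclusion of `CLOSING` follows the proposal in this issue rather than the commit message, which only mentions `FAILED`.

```java
// Minimal, hypothetical sketch of the counting rule; not the real Elasticsearch code.
import java.util.List;

enum JobState { OPENING, OPENED, CLOSING, CLOSED, FAILED }

class JobTask {
    final JobState state;
    final boolean staleAllocationId; // allocation ID predates the node restart
    JobTask(JobState state, boolean staleAllocationId) {
        this.state = state;
        this.staleAllocationId = staleAllocationId;
    }
}

class OpeningJobCounter {
    // Old behaviour: every task with a stale allocation ID counts as OPENING,
    // so FAILED/CLOSING jobs left over from before a restart block new allocations.
    static long countOpeningBeforeFix(List<JobTask> tasksOnNode) {
        return tasksOnNode.stream()
                .filter(t -> t.state == JobState.OPENING || t.staleAllocationId)
                .count();
    }

    // Changed behaviour: a stale task is only treated as OPENING if the job is
    // not FAILED or CLOSING, so such jobs no longer consume an "opening" slot.
    static long countOpeningAfterFix(List<JobTask> tasksOnNode) {
        return tasksOnNode.stream()
                .filter(t -> t.state != JobState.FAILED && t.state != JobState.CLOSING)
                .filter(t -> t.state == JobState.OPENING || t.staleAllocationId)
                .count();
    }
}
```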
When a node is restarted, ML job persistent tasks that were allocated to that node are reallocated to other nodes unless they are in the `FAILED` or `CLOSING` state. (Prior to 6.2.4 there was another bug that meant we'd reallocate such jobs to other nodes and set them back to `OPENED`.)

However, the logic that limits how many jobs are `OPENING` on a particular node considers all jobs allocated to the node that have stale persistent tasks to be `OPENING`. This means that if you restart a node that has 2 or more `FAILED` or `CLOSING` jobs then it cannot open any more jobs after being restarted (because we only allow two `OPENING` jobs per node at any time).

To prevent this problem, `FAILED` and `CLOSING` jobs should not be considered `OPENING` if their persistent task status is stale.

/cc @jgowdyelastic
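To make the failure mode concrete, here is a small usage example built on the hypothetical `JobTask`/`OpeningJobCounter` sketch shown after the commit messages above. The limit of two concurrently `OPENING` jobs per node comes from the description in this issue; everything else is illustrative.

```java
// Hypothetical scenario: a node is restarted while holding two FAILED jobs,
// leaving two persistent tasks with stale allocation IDs on that node.
import java.util.List;

class RestartedNodeScenario {
    static final int MAX_CONCURRENT_OPENING_JOBS = 2; // limit mentioned in this issue

    public static void main(String[] args) {
        List<JobTask> tasksOnNode = List.of(
                new JobTask(JobState.FAILED, true),
                new JobTask(JobState.FAILED, true));

        long before = OpeningJobCounter.countOpeningBeforeFix(tasksOnNode);
        long after = OpeningJobCounter.countOpeningAfterFix(tasksOnNode);

        // Before the fix the node looks like it already has 2 OPENING jobs, so no
        // further job can be opened there; after the fix it looks like it has 0.
        System.out.println("Can open another job before fix: " + (before < MAX_CONCURRENT_OPENING_JOBS));
        System.out.println("Can open another job after fix:  " + (after < MAX_CONCURRENT_OPENING_JOBS));
    }
}
```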