Description
When a node is restarted, ML job persistent tasks that were allocated to that node are reallocated to other nodes unless they are in the FAILED or CLOSING state. (Prior to 6.2.4 there was another bug that meant we'd reallocate such jobs to other nodes and set them back to OPENED.)
However, the logic that limits how many jobs are OPENING on a particular node considers all jobs allocated to the node that have stale persistent tasks to be OPENING. This means that if you restart a node that has two or more FAILED or CLOSING jobs, it cannot open any more jobs after being restarted (because we only allow two OPENING jobs per node at any time).
To prevent this problem, FAILED and CLOSING jobs should not be considered OPENING if their persistent task status is stale.
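
To make the counting problem concrete, here is a minimal sketch in Python rather than the actual Elasticsearch Java code. The names `JobTask`, `count_opening_current`, `count_opening_proposed` and `can_open_another` are hypothetical, and the `stale` flag is an assumed simplification of "the persistent task status is stale". It only models the behaviour described above: a limit of two OPENING jobs per node, and stale tasks being counted as OPENING.

```python
from dataclasses import dataclass

# Limit described in this issue: at most two OPENING jobs per node.
MAX_OPENING_JOBS_PER_NODE = 2


@dataclass
class JobTask:
    """Hypothetical stand-in for an ML job persistent task on one node."""
    job_id: str
    state: str   # one of "OPENING", "OPENED", "CLOSING", "FAILED"
    stale: bool  # assumed flag: True if the task status is stale


def count_opening_current(tasks):
    """Behaviour described in the issue: every stale task counts as OPENING."""
    return sum(1 for t in tasks if t.stale or t.state == "OPENING")


def count_opening_proposed(tasks):
    """Proposed behaviour: stale FAILED/CLOSING tasks are not counted."""
    count = 0
    for t in tasks:
        if t.stale and t.state in ("FAILED", "CLOSING"):
            continue  # these jobs are not going to open, so skip them
        if t.stale or t.state == "OPENING":
            count += 1
    return count


def can_open_another(tasks, counter):
    """True if the node is below the OPENING limit under the given counter."""
    return counter(tasks) < MAX_OPENING_JOBS_PER_NODE


# A restarted node that still holds two stale FAILED/CLOSING jobs.
tasks_on_node = [
    JobTask("job-a", "FAILED", stale=True),
    JobTask("job-b", "CLOSING", stale=True),
]

blocked_before = not can_open_another(tasks_on_node, count_opening_current)
allowed_after = can_open_another(tasks_on_node, count_opening_proposed)
```

In this sketch the node is blocked under the current counting and free to open jobs under the proposed counting. Stale tasks in other states still count as OPENING in both versions, since the issue only asks for FAILED and CLOSING to be excluded.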
/cc @jgowdyelastic