[ML] After a node restart FAILED and CLOSING jobs count as OPENING

When a node is restarted, ML job persistent tasks that were allocated to that node are reallocated to other nodes _unless_ they are in the `FAILED` or `CLOSING` state. (Prior to 6.2.4 there was another bug that meant we'd reallocate such jobs to other nodes and set them back to `OPENED`.)

However, the logic that limits how many jobs are `OPENING` on a particular node treats every job allocated to the node whose persistent task status is stale as `OPENING`. This means that if you restart a node that has two or more `FAILED` or `CLOSING` jobs, it cannot open any more jobs after being restarted (because we only allow two `OPENING` jobs per node at any time).

To prevent this problem, `FAILED` and `CLOSING` jobs should not be considered `OPENING` if their persistent task status is stale.
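
As a rough illustration of the proposed change, here is a minimal Python model of the counting logic. It is not the Elasticsearch Java code: `JobTask`, `counts_as_opening_buggy`, `counts_as_opening_fixed` and `can_open_another` are hypothetical names, and the `stale` flag simply stands in for "the persistent task status is stale". The limit of two is the one quoted above.

```python
from dataclasses import dataclass
from enum import Enum

# Per-node limit on concurrently OPENING jobs, as stated in this issue.
MAX_CONCURRENT_OPENING = 2


class JobState(Enum):
    OPENING = "opening"
    OPENED = "opened"
    CLOSING = "closing"
    FAILED = "failed"


@dataclass
class JobTask:
    # Hypothetical stand-in for a job's persistent task on one node.
    job_id: str
    state: JobState
    stale: bool  # assumption: True when the task status is stale


def counts_as_opening_buggy(task: JobTask) -> bool:
    # Behaviour described in the issue: any stale status is read as OPENING.
    return task.stale or task.state is JobState.OPENING


def counts_as_opening_fixed(task: JobTask) -> bool:
    # Proposed behaviour: FAILED and CLOSING jobs never count as OPENING,
    # even when their status is stale.
    if task.state in (JobState.FAILED, JobState.CLOSING):
        return False
    return task.stale or task.state is JobState.OPENING


def can_open_another(tasks, counts_as_opening) -> bool:
    # A node may accept a new job only while it is below the OPENING limit.
    opening = sum(1 for task in tasks if counts_as_opening(task))
    return opening < MAX_CONCURRENT_OPENING


# A restarted node that still holds one FAILED and one CLOSING job.
restarted_node = [
    JobTask("job-a", JobState.FAILED, stale=True),
    JobTask("job-b", JobState.CLOSING, stale=True),
]
blocked_before_fix = not can_open_another(restarted_node, counts_as_opening_buggy)
allowed_after_fix = can_open_another(restarted_node, counts_as_opening_fixed)
```

In this model the restarted node is blocked under the current rule and accepts new jobs under the proposed one, while a stale job in any other state still counts towards the limit.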

/cc @jgowdyelastic

[ML] After a node restart FAILED and CLOSING jobs count as OPENING #31794
