[ML] After a node restart FAILED and CLOSING jobs count as OPENING #31794

Closed
@droberts195

Description

When a node is restarted, ML job persistent tasks that were allocated to that node are reallocated to other nodes unless they are in the FAILED or CLOSING state. (Prior to 6.2.4, a separate bug meant we'd reallocate such jobs to other nodes and set them back to OPENED.)

However, the logic that limits how many jobs are OPENING on a particular node treats every job allocated to the node whose persistent task status is stale as OPENING. This means that if you restart a node that has two or more FAILED or CLOSING jobs, it cannot open any more jobs after being restarted (because we only allow two OPENING jobs per node at any time).

To prevent this problem, FAILED and CLOSING jobs should not be considered OPENING if their persistent task status is stale.
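
To illustrate the proposed change, here is a minimal Python model of the counting logic. It is a sketch only, not the actual Elasticsearch code: `JobTask`, `counts_as_opening_buggy`, `counts_as_opening_fixed`, and `can_open_another_job` are hypothetical names, and the limit of two is taken from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class JobState(Enum):
    OPENING = "opening"
    OPENED = "opened"
    CLOSING = "closing"
    FAILED = "failed"


@dataclass(frozen=True)
class JobTask:
    """Hypothetical stand-in for an ML job persistent task."""
    job_id: str
    node_id: str
    state: JobState          # last state recorded in the task status
    status_is_stale: bool    # True if the status predates the current allocation


# Assumed from the issue text: at most two OPENING jobs per node at any time.
MAX_OPENING_PER_NODE = 2


def counts_as_opening_buggy(task: JobTask) -> bool:
    # Behaviour described in the issue: any stale status is treated as OPENING,
    # regardless of the last recorded state.
    return task.status_is_stale or task.state is JobState.OPENING


def counts_as_opening_fixed(task: JobTask) -> bool:
    # Proposed behaviour: FAILED and CLOSING jobs never count as OPENING,
    # even when their persistent task status is stale.
    if task.state in (JobState.FAILED, JobState.CLOSING):
        return False
    return task.status_is_stale or task.state is JobState.OPENING


def can_open_another_job(
    tasks: Iterable[JobTask],
    node_id: str,
    counts_as_opening: Callable[[JobTask], bool],
) -> bool:
    # Count the tasks on this node that the given rule treats as OPENING.
    opening = sum(1 for t in tasks if t.node_id == node_id and counts_as_opening(t))
    return opening < MAX_OPENING_PER_NODE


# A restarted node that still has one FAILED and one CLOSING job with stale statuses.
tasks_after_restart = [
    JobTask("job-a", "node-1", JobState.FAILED, status_is_stale=True),
    JobTask("job-b", "node-1", JobState.CLOSING, status_is_stale=True),
]

print(can_open_another_job(tasks_after_restart, "node-1", counts_as_opening_buggy))
print(can_open_another_job(tasks_after_restart, "node-1", counts_as_opening_fixed))
```

Under the first rule the two stale tasks use up the node's entire OPENING allowance, so no further job can be opened there; under the second rule they are ignored by the limit.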

/cc @jgowdyelastic
