The current AdaptDL controller is a mediator between what the allocator wants and what the k8s default scheduler has or can do. It tries to reconcile job states so that both parties agree in a dynamic environment: it takes the proposed allocations from the allocator and applies them to the cluster through the default scheduler. In doing so it has to handle (pod) failures, some of which are spurious while others can be fatal. A spurious failure occurs, for example, when the controller tries to schedule a pod of job A on a node where a pod of job B is still exiting (jobs can move between states in arbitrary order depending on the allocator's decisions and how the controller performs the reconciliation). In that case the controller has to ignore the failure and retry. Other failures, like a real pod failure, can be fatal. Currently the controller relies entirely on the default scheduler for hints (in the form of pod states) about what actually happened when it created or deleted a pod. This makes it hard for the controller to know confidently which failures are ephemeral and which cannot be ignored.
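To make the ambiguity concrete, here is a minimal sketch of the kind of guesswork the controller is forced into today. All names are hypothetical; `pod_phase` and `node_has_terminating_peer` stand in for whatever can be inferred from the pod states reported back by the default scheduler, and are not AdaptDL's actual API.

```python
from enum import Enum, auto

class FailureKind(Enum):
    SPURIOUS = auto()  # transient; ignore and retry in the next reconcile pass
    FATAL = auto()     # real pod failure; must be surfaced to the job

def classify_failure(pod_phase: str, node_has_terminating_peer: bool) -> FailureKind:
    """Guess what a failed placement means from pod states alone.

    Hypothetical sketch: `node_has_terminating_peer` is True when a pod
    of another job is still exiting on the target node.
    """
    # Pod stuck Pending while the previous allocation's pod is still
    # terminating on the same node: almost certainly spurious, retry.
    if pod_phase == "Pending" and node_has_terminating_peer:
        return FailureKind.SPURIOUS
    # Pod actually ran and failed: cannot be ignored.
    if pod_phase == "Failed":
        return FailureKind.FATAL
    # Everything else is a guess, because the controller does not own
    # the scheduling decision that led here.
    return FailureKind.SPURIOUS
```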
Currently there is also a reliance on the allocator to "fix" allocations based on newly available information about cluster resources. For example, if the allocator decides to give node X to job A and a non-AdaptDL pod takes the node before the controller has a chance to use it, the result is a spurious failure that the allocator handles indirectly by proposing a new allocation in the next cycle that avoids node X. Job A itself does not fail. This mechanism does not work for non-preemptible jobs, where we usually won't allow a reallocation. The controller has to be smart enough to recognize that the pod failure in this case still qualifies for a reallocation, because the non-preemptible job hasn't started executing yet. Currently it cannot do that.
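A sketch of the eligibility check the controller would need. `JobState` and its fields are illustrative stand-ins for state AdaptDL tracks on its job objects, not the actual API.

```python
from dataclasses import dataclass

@dataclass
class JobState:
    # Illustrative stand-ins for fields tracked on the AdaptDL job object.
    preemptible: bool
    started_execution: bool  # has any replica of the job begun running?

def qualifies_for_reallocation(job: JobState) -> bool:
    """Decide whether a lost node may trigger a new allocation."""
    if job.preemptible:
        # Preemptible jobs can always be re-placed by the allocator
        # in its next cycle.
        return True
    # A non-preemptible job may only be re-placed if it never started
    # executing, e.g. an external pod grabbed the node before ours ran.
    return not job.started_execution
```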
What we want from the custom scheduler is to be in control of the states of all pods for all jobs, so that we know exactly what each (pod) failure means and can easily handle interference from the default k8s scheduler or any other scheduler contending for the same resources. The custom scheduler will be responsible for spawning and destroying pods for a job, will handle interference from external schedulers better by identifying it clearly, and will better handle failures of both preemptible and non-preemptible jobs. This reduces the controller's job to simply creating and deleting pods per the allocator's requests.
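As one possible shape, a custom scheduler that owns placement can bind pods to nodes directly by setting `spec.nodeName`, which bypasses kube-scheduler entirely. The sketch below uses the official `kubernetes` Python client; the names, labels, and image are illustrative, not AdaptDL's actual conventions. Because every pod a job owns would be created this way, any pod that later fails or disappears has an unambiguous cause the scheduler can act on.

```python
from kubernetes import client, config

def place_pod(namespace: str, job_name: str, replica: int,
              node: str, image: str) -> client.V1Pod:
    """Create a pod bound directly to `node`, bypassing kube-scheduler."""
    config.load_incluster_config()  # or load_kube_config() outside the cluster
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(
            name=f"{job_name}-{replica}",
            labels={"adaptdl/job": job_name},  # illustrative label
        ),
        spec=client.V1PodSpec(
            node_name=node,          # direct binding; no default scheduling
            restart_policy="Never",  # the scheduler owns retry semantics
            containers=[client.V1Container(name="main", image=image)],
        ),
    )
    return client.CoreV1Api().create_namespaced_pod(namespace, pod)
```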