Implement a custom (pod) scheduler for AdaptDL #42

Open
odp opened this issue Oct 6, 2020 · 0 comments
Labels
enhancement New feature or request

odp commented Oct 6, 2020

The current AdaptDL controller mediates between what the allocator wants and what the k8s default scheduler has done or can do. It tries to reconcile job states so that both parties agree in a dynamic environment: it receives proposed allocations from the allocator and applies them to the cluster through the default scheduler, handling (pod) failures along the way. Some of these failures are spurious; others can be fatal. A spurious failure occurs, for example, when the controller tries to schedule a pod from job A on a node where a pod from job B is still exiting (jobs can move between states in arbitrary order depending on the allocator's decisions and how the controller performs the reconciliation); in this case the controller has to ignore the failure and retry. Other failures, like a real pod failure, can be fatal. Currently the controller relies entirely on the default scheduler to give it hints (in terms of pod states) about what actually happened when it created or deleted a pod. This makes it hard for the controller to know with confidence which failures are ephemeral and which cannot be ignored.
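A minimal sketch of the kind of failure classification the controller cannot do confidently today. All names here (`FailureKind`, `classify_pod_failure`, the `reason` strings) are hypothetical illustrations, not part of the AdaptDL codebase:

```python
from enum import Enum, auto

class FailureKind(Enum):
    EPHEMERAL = auto()  # safe to ignore and retry on the next cycle
    FATAL = auto()      # must be surfaced as a real job failure

def classify_pod_failure(reason: str, node_has_exiting_pod: bool) -> FailureKind:
    """Classify a pod failure given a condensed failure reason.

    With only the default scheduler's pod states as hints, the
    controller cannot reliably tell these cases apart; a custom
    scheduler that owns the pod states could.
    """
    if reason == "Unschedulable" and node_has_exiting_pod:
        # A pod from another job is still exiting on the target node:
        # spurious, retry once the node frees up.
        return FailureKind.EPHEMERAL
    if reason in ("Error", "OOMKilled"):
        # The pod actually ran and crashed: fatal.
        return FailureKind.FATAL
    # Without better information, default conservatively to fatal.
    return FailureKind.FATAL
```

The key point is the second argument: classifying correctly requires knowledge the default scheduler does not expose to the controller.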

There is also a reliance on the allocator to "fix" allocations based on newly available information about cluster resources. For example, if the allocator decides to give node X to job A and a non-AdaptDL pod takes the node before the controller has a chance to use it, the result is a spurious failure that the allocator handles indirectly by proposing a new allocation in the next cycle that avoids node X; job A does not fail. This mechanism does not work with non-preemptible jobs, for which we usually do not allow a reallocation. The controller has to be smart enough to understand that the pod failure in this case still qualifies for a reallocation, because the non-preemptible job has not started execution yet. Currently it cannot do that.
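The rule described above could be sketched as follows. This is a hypothetical helper (`Job`, `reallocation_allowed` are illustrative names, not AdaptDL APIs):

```python
from dataclasses import dataclass

@dataclass
class Job:
    preemptible: bool
    started: bool  # has the job begun execution?

def reallocation_allowed(job: Job, pod_failed_before_start: bool) -> bool:
    """Decide whether the allocator may move this job to different nodes.

    Preemptible jobs can always be reallocated. A non-preemptible job
    qualifies only if its pod failed before the job started executing,
    e.g. because a non-AdaptDL pod took the node first.
    """
    if job.preemptible:
        return True
    return pod_failed_before_start and not job.started
```

For instance, `reallocation_allowed(Job(preemptible=False, started=False), pod_failed_before_start=True)` returns `True`: the node was lost before anything ran, so moving the job is safe.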

What we want from the custom scheduler is to be in control of the states of all pods for all jobs, so that we know exactly what each (pod) failure means and can easily handle interference from the default k8s scheduler or any other scheduler contending for the same resources. The custom scheduler will be responsible for spawning and destroying the pods of a job, identifying interference by external schedulers clearly, and handling failures of preemptible and non-preemptible jobs appropriately. This simplifies the controller's job to just creating and deleting pods per the allocator's requests.
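A toy sketch of the ownership model proposed above: a scheduler that tracks the state of every pod it creates itself, so interference by an external scheduler is directly observable instead of inferred. The class and its methods are hypothetical, not a proposed AdaptDL API:

```python
class PodScheduler:
    """Owns the lifecycle and state of every pod it creates."""

    def __init__(self):
        self.pods = {}  # pod name -> state we assigned ourselves

    def spawn(self, job: str, node: str, free_nodes: set) -> bool:
        """Place a pod for `job` on `node`, returning True on success."""
        name = f"{job}-pod"
        if node not in free_nodes:
            # We checked node availability ourselves, so this is clearly
            # interference by another scheduler, not a job failure.
            self.pods[name] = "interfered"
            return False
        self.pods[name] = "running"
        free_nodes.discard(node)
        return True

    def destroy(self, name: str) -> None:
        """Terminate a pod we own; no guessing about why it went away."""
        self.pods[name] = "terminated"
```

Because the scheduler records its own view of each pod, the controller no longer has to reverse-engineer the meaning of a failure from the default scheduler's pod phases.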
