The current AdaptDL controller is a mediator between what the allocator wants and what the k8s default scheduler has or can do. It tries to reconcile job states so that both parties agree in a dynamic environment: it takes the proposed allocations from the allocator and applies them to the cluster through the default scheduler. In doing so it has to handle (pod) failures, some of which are spurious while others can be fatal. A spurious failure occurs, for example, when the controller tries to schedule a pod of job A on a node where a pod of job B is still exiting (jobs can move between states in arbitrary order depending on the allocator's decisions and how the controller performs the reconciliation). In that case the controller has to ignore the failure and retry. Other failures, like a real pod failure, can be fatal. Currently the controller relies entirely on the default scheduler for hints (in the form of pod states) about what actually happened when it created or deleted a pod. This makes it hard for the controller to know confidently which failures are ephemeral and which cannot be ignored.
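To make the ambiguity concrete, here is a minimal sketch of the kind of guesswork the controller is forced into today. All names are hypothetical; `pod_phase` and `node_has_terminating_peer` stand in for whatever can be inferred from the pod states reported back by the default scheduler, and are not AdaptDL's actual API.

```python
from enum import Enum, auto

class FailureKind(Enum):
    SPURIOUS = auto()  # transient; ignore and retry in the next reconcile pass
    FATAL = auto()     # real pod failure; must be surfaced to the job

def classify_failure(pod_phase: str, node_has_terminating_peer: bool) -> FailureKind:
    """Guess what a failed placement means from pod states alone.

    Hypothetical sketch: `node_has_terminating_peer` is True when a pod
    of another job is still exiting on the target node.
    """
    # Pod stuck Pending while the previous allocation's pod is still
    # terminating on the same node: almost certainly spurious, retry.
    if pod_phase == "Pending" and node_has_terminating_peer:
        return FailureKind.SPURIOUS
    # Pod actually ran and failed: cannot be ignored.
    if pod_phase == "Failed":
        return FailureKind.FATAL
    # Everything else is a guess, because the controller does not own
    # the scheduling decision that led here.
    return FailureKind.SPURIOUS
```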
Currently there is also a reliance on the allocator to "fix" allocations based on newly available information about cluster resources. For example, if the allocator decides to give node X to job A and a non-AdaptDL pod takes the node before the controller has a chance to use it, the result is a spurious failure that the allocator handles indirectly by proposing a new allocation in the next cycle that avoids node X. Job A itself does not fail. This mechanism does not work for non-preemptible jobs, where we usually won't allow a reallocation. The controller has to be smart enough to recognize that the pod failure in this case still qualifies for a reallocation, because the non-preemptible job hasn't started executing yet. Currently it cannot do that.
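A sketch of the eligibility check the controller would need. `JobState` and its fields are illustrative stand-ins for state AdaptDL tracks on its job objects, not the actual API.

```python
from dataclasses import dataclass

@dataclass
class JobState:
    # Illustrative stand-ins for fields tracked on the AdaptDL job object.
    preemptible: bool
    started_execution: bool  # has any replica of the job begun running?

def qualifies_for_reallocation(job: JobState) -> bool:
    """Decide whether a lost node may trigger a new allocation."""
    if job.preemptible:
        # Preemptible jobs can always be re-placed by the allocator
        # in its next cycle.
        return True
    # A non-preemptible job may only be re-placed if it never started
    # executing, e.g. an external pod grabbed the node before ours ran.
    return not job.started_execution
```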
What we want from the custom scheduler is to be in control of the states of all pods for all jobs, so that we know exactly what each (pod) failure means and can easily handle interference from the default k8s scheduler or any other scheduler contending for the same resources. The custom scheduler will be responsible for spawning and destroying pods for a job, will handle interference from external schedulers better by identifying it clearly, and will better handle failures of both preemptible and non-preemptible jobs. This reduces the controller's job to simply creating and deleting pods per the allocator's requests.
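As one possible shape, a custom scheduler that owns placement can bind pods to nodes directly by setting `spec.nodeName`, which bypasses kube-scheduler entirely. The sketch below uses the official `kubernetes` Python client; the names, labels, and image are illustrative, not AdaptDL's actual conventions. Because every pod a job owns would be created this way, any pod that later fails or disappears has an unambiguous cause the scheduler can act on.

```python
from kubernetes import client, config

def place_pod(namespace: str, job_name: str, replica: int,
              node: str, image: str) -> client.V1Pod:
    """Create a pod bound directly to `node`, bypassing kube-scheduler."""
    config.load_incluster_config()  # or load_kube_config() outside the cluster
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(
            name=f"{job_name}-{replica}",
            labels={"adaptdl/job": job_name},  # illustrative label
        ),
        spec=client.V1PodSpec(
            node_name=node,          # direct binding; no default scheduling
            restart_policy="Never",  # the scheduler owns retry semantics
            containers=[client.V1Container(name="main", image=image)],
        ),
    )
    return client.CoreV1Api().create_namespaced_pod(namespace, pod)
```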