Support the case user down scale replicas #58
Users may change replicas in a manifest update, and it would be better to reconcile these changes even though not all frameworks support elastic training. Remove resources whose index is out of range in the reconcile logic.
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com>
One thing I am not sure about is whether we want to delete that pod in
/lgtm
LGTM currently. I think we can open an issue to keep track of the discussion about the problem.
By the way, which frameworks support this in an elastic manner?
@johnugeorge https://github.com/pytorch/elastic allows changing the number of workers dynamically.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: terrytangyuan
Thanks @Jeffwan! Feel free to open separate issues to track the remaining work.
* Set torch cuda versions * Yaml indentation * Yaml indentation Co-authored-by: Paul Angerer <dabauxi@users.noreply.github.com>
Resolve issue #59
Users may change replicas in a manifest update, and it would be better to reconcile these changes even though not all frameworks support elastic training.
Remove resources whose index is out of range in the reconcile logic.
The current problem is that `GetPodSlices` swallows these cases, so `ReconcilePod` and `ReconcileService` cannot get the information.
The `GetPodSlices` logic has been changed a little: instead of creating a slice of size `replicas`, we send the full snapshot back to the caller. `calculatePodSliceSize` returns `math.Max(replica, maxIndex + 1)`.
For example, let's assume we have pods with replica-index 0, 1, 2:
If replica is 4, return a slice of size 4: [[0],[1],[2],[]]. A pod with replica-index 3 will be created. (The code already supports this case, because we always used replica as the size of the slice.)
If replica is 1, return a slice of size 3: [[0],[1],[2]]. The pods with replica-index 1 and 2 are out of range and will be deleted. (This PR mainly focuses on this part.)
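The two cases above can be sketched in a few lines of Go. This is only an illustration under simplifying assumptions, not the code in this PR: the `pod` struct, the lowercase `getPodSlices`, and the `plan` helper are hypothetical stand-ins, and real pods carry the replica index as a label rather than as a struct field.

```go
package main

import "fmt"

// pod is a hypothetical, simplified stand-in for a Kubernetes Pod.
// Only the replica index matters for this sketch.
type pod struct {
	name  string
	index int
}

// calculatePodSliceSize returns max(replicas, maxIndex+1), so pods whose
// index lies beyond the desired replica count still get a slot.
func calculatePodSliceSize(pods []pod, replicas int) int {
	size := replicas
	for _, p := range pods {
		if p.index+1 > size {
			size = p.index + 1
		}
	}
	return size
}

// getPodSlices groups pods by replica index and returns the full snapshot.
func getPodSlices(pods []pod, replicas int) [][]pod {
	slices := make([][]pod, calculatePodSliceSize(pods, replicas))
	for _, p := range pods {
		if p.index < 0 {
			continue // ignore pods with an invalid index
		}
		slices[p.index] = append(slices[p.index], p)
	}
	return slices
}

// plan (hypothetical helper) reports which indices need a new pod and
// which existing pods are out of range and should be deleted.
func plan(pods []pod, replicas int) (toCreate []int, toDelete []string) {
	for index, group := range getPodSlices(pods, replicas) {
		switch {
		case index >= replicas:
			for _, p := range group {
				toDelete = append(toDelete, p.name)
			}
		case len(group) == 0:
			toCreate = append(toCreate, index)
		}
	}
	return toCreate, toDelete
}

func main() {
	pods := []pod{{"worker-0", 0}, {"worker-1", 1}, {"worker-2", 2}}
	create, _ := plan(pods, 4) // scale up: index 3 is missing
	_, remove := plan(pods, 1) // scale down: indices 1 and 2 are out of range
	fmt.Println(create, remove) // prints: [3] [worker-1 worker-2]
}
```

Sizing the snapshot by the larger of the two values is what lets the caller see the out-of-range pods at all. With a slice of size `replicas`, the pods at index 1 and 2 would have no slot and would be silently dropped.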