This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

Support the case where the user scales down replicas #58

Merged
2 commits merged into kubeflow:master from the optional_port branch on Mar 26, 2020

Conversation

@Jeffwan (Member) commented Mar 25, 2020

Resolve issue #59

A user may change replicas in a manifest update, and it would be better to reconcile these changes even though not all frameworks support elastic training.

Remove resources whose index is out of range in the reconcile logic.

The current problem is that GetPodSlices swallows these cases, so ReconcilePod and ReconcileService cannot get the information.

The GetPodSlices logic has been changed a little: instead of creating a slice of size replicas, we send a full snapshot back to the caller. calculatePodSliceSize returns math.Max(replicas, maxIndex + 1).

For example, let's assume we have pods with replica-index 0, 1, and 2:

If replicas is 4, return a slice of size 4: [[0],[1],[2],[]]. A pod with replica-index 3 will be created. (We already had this behavior because we always used replicas as the slice size.)

If replicas is 1, return a slice of size 3: [[0],[1],[2]]. The pods with replica-index 1 and 2 are out of range and will be deleted. (This is the part this PR mainly focuses on; see the sketch below.)
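Below is a minimal Go sketch of the sizing and grouping described above. It uses a stand-in pod type rather than the controller's real pod object and label handling; the function names follow the PR description, but the signatures and types here are illustrative only.

```go
package main

import (
	"fmt"
	"strconv"
)

// pod is a stand-in for the controller's pod object; only the replica-index matters here.
type pod struct {
	name  string
	index string // value of the replica-index label
}

// calculatePodSliceSize returns max(replicas, maxIndex+1), so pods whose index is
// beyond the requested replica count still appear in the snapshot and can be
// cleaned up by the caller.
func calculatePodSliceSize(pods []pod, replicas int) int {
	size := replicas
	for _, p := range pods {
		if idx, err := strconv.Atoi(p.index); err == nil && idx+1 > size {
			size = idx + 1
		}
	}
	return size
}

// getPodSlices groups pods by replica index into a full snapshot of the computed size.
func getPodSlices(pods []pod, replicas int) [][]pod {
	slices := make([][]pod, calculatePodSliceSize(pods, replicas))
	for _, p := range pods {
		idx, err := strconv.Atoi(p.index)
		if err != nil || idx < 0 || idx >= len(slices) {
			continue // skip pods with an unparseable or out-of-bounds index
		}
		slices[idx] = append(slices[idx], p)
	}
	return slices
}

func main() {
	pods := []pod{{"worker-0", "0"}, {"worker-1", "1"}, {"worker-2", "2"}}
	// Scale down to replicas = 1: the snapshot still has size 3, so the caller
	// can see indices 1 and 2 and delete those pods.
	for i, s := range getPodSlices(pods, 1) {
		fmt.Printf("index %d: %d pod(s)\n", i, len(s))
	}
}
```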

Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com>

@Jeffwan (Member, Author) commented Mar 26, 2020

One thing I am not sure about is whether we want to delete that pod in `GetSlicesPods` directly or pass it to the caller to handle (a rough sketch of the caller-side option follows this list). To me,

  1. `GetSlicesPods` should only build the slice, rather than apply other logic to Kubernetes resources.

  2. If we delete the resource directly in `GetSlicesPods`, that would minimize the change.
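For illustration, here is a rough sketch of the caller-side option, continuing the stand-in pod type from the sketch in the description; deletePod and createPod are hypothetical helpers standing in for the controller's real create/delete calls, not functions from this repository.

```go
// reconcilePods walks the snapshot returned by getPodSlices, removes pods whose
// replica index is no longer within the desired replica count, and creates any
// replicas that are missing. deletePod and createPod are hypothetical stand-ins.
func reconcilePods(slices [][]pod, replicas int,
	createPod func(index int) error, deletePod func(p pod) error) error {
	for index, podSlice := range slices {
		if index >= replicas {
			// Index out of range after a scale-down: remove these pods.
			for _, p := range podSlice {
				if err := deletePod(p); err != nil {
					return err
				}
			}
			continue
		}
		if len(podSlice) == 0 {
			// Missing replica within range: create it.
			if err := createPod(index); err != nil {
				return err
			}
		}
	}
	return nil
}
```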

@Jeffwan (Member, Author) commented Mar 26, 2020

/cc @gaocegege @richardsliu

@gaocegege (Member) left a comment

/lgtm

@gaocegege (Member) commented

One thing I am not sure about is whether we want to delete that pod in `GetSlicesPods` directly or pass it to the caller to handle. To me,

1. `GetSlicesPods` should only build the slice, rather than apply other logic to Kubernetes resources.

2. If we delete the resource directly in `GetSlicesPods`, that would minimize the change.

LGTM currently. I think we can open an issue to keep track of the discussion about the problem.

@johnugeorge (Member) commented

Btw, which frameworks support this in an elastic manner?

@Jeffwan (Member, Author) commented Mar 26, 2020

@johnugeorge https://github.com/pytorch/elastic allows changing the number of workers dynamically.

@terrytangyuan (Member) left a comment

/lgtm
/approve

@k8s-ci-robot commented
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@terrytangyuan (Member) commented

Thanks @Jeffwan! Feel free to open separate issues to track the remaining work.

@k8s-ci-robot k8s-ci-robot merged commit 7e0d50e into kubeflow:master Mar 26, 2020
@Jeffwan Jeffwan deleted the optional_port branch March 26, 2020 21:47