-
Notifications
You must be signed in to change notification settings - Fork 700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(pytorch): Support elastic training #1453
Conversation
Skipping CI for Draft Pull Request. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gaocegege The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
df2c1ed
to
cdfd1be
Compare
Pull Request Test Coverage Report for Build 1505952988
💛 - Coveralls |
2bfbb3d
to
46eeff9
Compare
46eeff9 implemented the feature. Overall coverage increased (+7.1%) to 15.252%, PyTorch related test coverage is increased from 0% to 80% PTAL |
Signed-off-by: Ce Gao <ce.gao@outlook.com>
Signed-off-by: Ce Gao <ce.gao@outlook.com>
Signed-off-by: Ce Gao <ce.gao@outlook.com>
Signed-off-by: Ce Gao <ce.gao@outlook.com>
46eeff9
to
b963cbe
Compare
Signed-off-by: Ce Gao <ce.gao@outlook.com>
Signed-off-by: Ce Gao <ce.gao@outlook.com>
Signed-off-by: Ce Gao <ce.gao@outlook.com>
3763152
to
00a5c89
Compare
Signed-off-by: Ce Gao <ce.gao@outlook.com>
The PR is ready to review. PTAL /assign @kubeflow/wg-training-leads |
/assign @zw0610 |
I suppose it's ready to review, right? |
�Yeah it is ready to review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/cc @zw0610
Signed-off-by: Ce Gao <ce.gao@outlook.com>
Close #1483 |
--- | ||
|
||
apiVersion: v1 | ||
kind: Pod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- with only 1 pod for etcd instead of 3?
- is this etcd instance deployed per (elastic) pytorch job or all elastic pytorch job will shall the etcd instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can share one. And u do not need one if you are using c10d
@@ -6916,6 +6916,53 @@ spec: | |||
description: The number of pods which reached phase Failed. | |||
format: int32 | |||
type: integer | |||
labelSelector: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while I do understand this change will definitely be introduced after we update the kubeflow/common apis definition, may be we should let other apis/controller to accept this changes in another pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If other operators does not support scale subresource, then it is not needed. Do you mean we should support the subresource for other operators?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the update to kubeflow/common, every time we run make generates
in this repo, it shall always add labelSelector
to other APIs. I would suggest to ignore such changes from APIs except PyTorchJob
in this pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is hard to ignore it since the code generator reads the API definition and generate the corresponding code automatically. And the field labelSelector
is a pointer thus it is optional. It does not affect existing CRDs
a69a7a6
to
2fc6c57
Compare
Signed-off-by: Ce Gao <ce.gao@outlook.com>
2fc6c57
to
2940450
Compare
The comments are addressed, PTAL @zw0610 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gaocegege, zw0610 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
PoC for kubeflow/community#522
This PR:
This PR does not: