-
Notifications
You must be signed in to change notification settings - Fork 706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Python SDK for Kubeflow Training Operator #1420
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Hi @alembiewski. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Thanks for the contribution. I will take some time to review it today. With this one, we don't need this PR anymore https://github.com/kubeflow/tf-operator/pull/1389/files |
@@ -0,0 +1,533 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this example
PYTORCH_LOGLEVEL = os.environ.get('PYTORCHJOB_LOGLEVEL', 'INFO').upper() | ||
|
||
# PyTorchJob Label Names | ||
PYTORCHJOB_CONTROLLER_LABEL = 'controller-name' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other framework's constants are not populated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is added by ourselves for e2e test cases? I notice mxnet and xgboost are missing. If we don't have enough time to write test case for other frameworks, it's totally fine. Let's leave a TODO there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also used in client methods:
https://github.com/mesosphere/tf-operator/blob/75734f854a5b35715f068327d69de6185396b5d0/sdk/python/kubeflow/training/api/tf_job_client.py#L122-L125
I will add a comment, sure. The reason why I didn't add similar constants for mxnet
and xgboost
is that there are no clients for these frameworks currently implemented. Adding support for two more frameworks to the SDK is out of scope for this PR and should be addressed separately IMO.
@alexlatchford Really appreciate your work! Please check above comments |
/cc @kubeflow/wg-training-leads |
/ok-to-test |
@Jeffwan,
It seems like the job has been completed, but it wasn't able to find a pod to fetch the logs.
|
@Jeffwan, I updated the tooling for SDK generation and address the comments. Could you please take a look at the changes once again? |
Let me have a check on the failure. If you can run it successfully in your local env. it could be a flaky one. /retest |
The PR looks good to me. Please double check it. /cc @andreyvelich @kubeflow/wg-training-leads |
@Jeffwan, all tests are green now - I improved attribute checks in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for updating this @alembiewski!
I left few comments.
docs/development/developer_guide.md
Outdated
|
||
To generate Python SDK for the operator, run: | ||
``` | ||
.hack/python-sdk/gen-sdk.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.hack/python-sdk/gen-sdk.sh | |
./hack/python-sdk/gen-sdk.sh |
"from kubeflow.training import V1PyTorchJob\n", | ||
"from kubeflow.training import V1PyTorchJobSpec\n", | ||
"from kubeflow.training import V1RunPolicy\n", | ||
"from kubeflow.training.api.py_torch_job_client import PyTorchJobClient" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alembiewski @Jeffwan Do we want to add import of this client to __init__.py
, similar to this Katib __init__.py
?
Then, users can just run from kubeflow.training import PyTorchJobClient
or ``from kubeflow.training import TFJobClient` ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have strong options on this. WDYT @alembiewski ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable, updated post-gen.py
script to add imports automatically
@@ -46,14 +46,36 @@ Class | Method | Description | |||
[TFJobClient](docs/TFJobClient.md) | [is_job_succeeded](docs/TFJobClient.md#is_job_succeeded) | Check if the TFJob status is Succeeded | | |||
[TFJobClient](docs/TFJobClient.md) | [get_pod_names](docs/TFJobClient.md#get_pod_names) | Get pod names of TFJob | | |||
[TFJobClient](docs/TFJobClient.md) | [get_logs](docs/TFJobClient.md#get_logs) | Get training logs of the TFJob | | |||
[PyTorchJobClient](docs/PyTorchJobClient.md) | [create](docs/PyTorchJobClient.md#create) | Create PyTorchJob| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the Client docs were deleted by generator. Should we modify our script to not delete PyTorchJobClient.md
and docs/TFJobClient.md
during SDK generator run ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching that, added client docs back
:return: True or False | ||
""" | ||
pytorchjob_status = self.get_job_status(name, namespace=namespace) | ||
return pytorchjob_status.lower() == "succeeded" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this status to constants ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
tbl(tfjob_name, status, update_time) | ||
|
||
if name == tfjob_name: | ||
if status == 'Succeeded' or status == 'Failed': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment about status.
@@ -0,0 +1,60 @@ | |||
# Copyright 2020 The Kubeflow Authors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Copyright 2020 The Kubeflow Authors. | |
# Copyright 2021 The Kubeflow Authors. |
description="TFJob Python SDK", | ||
long_description="TFJob Python SDK", | ||
description="Training Operator Python SDK", | ||
long_description="Training Operator Python SDK", | ||
packages=setuptools.find_packages( | ||
include=("kubeflow*")), | ||
package_data={}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we drop support for Python < 3 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@Jeffwan, @andreyvelich, thanks for the review! I addressed all comments and suggestions, PTAL |
Thank you for updating this @alembiewski! /cc @kubeflow/wg-training-leads |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Jeffwan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Resolves #1380.
Depends on #1389.
Drafts the initial proposal for Python SDK for Kubeflow Training Operator by combining the existing SDK for
TFJob
andPytorchJob
into a single SDK and using updated model classes produced by OpenAPI generator.Changes summary:
PyTorchJobClient
has been copied from thekubeflow/pytorch-operator
repohack/python-sdk/post_gen.py
to fix imports in the generated model test cases, which can be extended for other post-generation modificationsutils
andconstants
modulesPytorchJob
notebook example has been copied tosdk/python/examples
, example notebooks have been updated to reflect the changes in API and package namesNote for the reviewers
The following files are autogenerated and could be skipped during the review:
Observations & Questions
K8sIoApimachineryPkgApisMetaV1ObjectMeta
,K8sIoApimachineryPkgApisMetaV1ListMeta
etc. Is this something that could be fixed by updating the generator configuration?