Add Python SDK for Kubeflow Training Operator #1420

alembiewski · 2021-09-22T23:08:21Z

Resolves #1380.
Depends on #1389.

Drafts the initial proposal for a Python SDK for the Kubeflow Training Operator by combining the existing SDKs for TFJob and PyTorchJob into a single SDK and using updated model classes produced by the OpenAPI generator.

Changes summary:

- The Python SDK has been generated with the updated tooling from "Update scripts to generate sdk for all frameworks" (#1389).
- PyTorchJobClient has been copied from the kubeflow/pytorch-operator repo.
- Introduces a new tool, hack/python-sdk/post_gen.py, to fix imports in the generated model test cases; it can be extended for other post-generation modifications.
- Reduces code duplication by merging the utils and constants modules.
- Adds model tests (auto-generated) and updates e2e tests.
- Adds support for the latest Python Kubernetes client.
- The PyTorchJob notebook example has been copied to sdk/python/examples; example notebooks have been updated to reflect the changes in API and package names.
- Updates docs.

Note for the reviewers

The following files are autogenerated and could be skipped during the review:

- https://github.com/mesosphere/tf-operator/tree/update-sdk/sdk/python/docs
- Files in the test root: https://github.com/mesosphere/tf-operator/tree/update-sdk/sdk/python/test (model imports have been updated to make tests work, code is here)
- https://github.com/mesosphere/tf-operator/tree/update-sdk/sdk/python/kubeflow/training/models

Observations & Questions

- External model attributes are not generated properly (e.g. those from the Kubernetes Python client): K8sIoApimachineryPkgApisMetaV1ObjectMeta, K8sIoApimachineryPkgApisMetaV1ListMeta, etc. Is this something that could be fixed by updating the generator configuration?
- How are e2e tests and unit tests executed for the SDK?
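
One possible workaround for the first question, sketched here rather than taken from the PR: a post-generation pass that rewrites the long generated names to the class names the Kubernetes Python client already provides (`V1ObjectMeta`, `V1ListMeta`). The mapping table and the function name `rewrite_external_types` are assumptions for illustration; whether the generator configuration can do this directly remains an open question.

```python
import re

# Assumed mapping from generated names to Kubernetes Python client class names.
EXTERNAL_TYPES = {
    "K8sIoApimachineryPkgApisMetaV1ObjectMeta": "V1ObjectMeta",
    "K8sIoApimachineryPkgApisMetaV1ListMeta": "V1ListMeta",
}

# One pattern that matches any of the generated names as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, EXTERNAL_TYPES)) + r")\b")


def rewrite_external_types(source: str) -> str:
    """Replace generated external type names with their short equivalents."""
    return _PATTERN.sub(lambda match: EXTERNAL_TYPES[match.group(1)], source)


# Hypothetical line from a generated model file.
generated = "'metadata': 'K8sIoApimachineryPkgApisMetaV1ObjectMeta',"
print(rewrite_external_types(generated))
```

A pass like this would run after every regeneration, so it only helps if the mapping is kept next to the generation scripts.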

review-notebook-app · 2021-09-22T23:08:25Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

aws-kf-ci-bot · 2021-09-22T23:08:34Z

Hi @alembiewski. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Jeffwan · 2021-09-23T02:06:19Z

Thanks for the contribution. I will take some time to review it today. With this one, we don't need this PR anymore https://github.com/kubeflow/tf-operator/pull/1389/files

alembiewski · 2021-09-23T05:59:33Z

@Jeffwan, this PR doesn't include the tooling updates for the OpenAPI generator. I used #1389 to generate the Python SDK, so I think the tooling updates should be merged too; that way the scripts are up to date and we can regenerate the SDK when the API changes.

sdk/python/README.md

sdk/python/docs/V1TFJobList.md

Jeffwan · 2021-09-23T07:18:27Z

sdk/python/examples/kubeflow-pytorchjob-sdk.ipynb

@@ -0,0 +1,533 @@
+{


Thanks for adding this example

sdk/python/kubeflow/training/__init__.py

Jeffwan · 2021-09-23T07:25:02Z

sdk/python/kubeflow/training/constants/constants.py

+PYTORCH_LOGLEVEL = os.environ.get('PYTORCHJOB_LOGLEVEL', 'INFO').upper()
+
+# PyTorchJob Label Names
+PYTORCHJOB_CONTROLLER_LABEL = 'controller-name'


Other frameworks' constants are not populated?

Was this file added by us for the e2e test cases? I notice mxnet and xgboost are missing. If we don't have enough time to write test cases for the other frameworks, that's totally fine. Let's leave a TODO there?

It's also used in client methods:
https://github.com/mesosphere/tf-operator/blob/75734f854a5b35715f068327d69de6185396b5d0/sdk/python/kubeflow/training/api/tf_job_client.py#L122-L125

I will add a comment, sure. The reason I didn't add similar constants for mxnet and xgboost is that no clients are currently implemented for these frameworks. Adding support for two more frameworks to the SDK is out of scope for this PR and should be addressed separately, IMO.
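
To illustrate how label constants like these typically end up in client methods, here is a minimal sketch that builds a Kubernetes equality-based label selector string. Only `PYTORCHJOB_CONTROLLER_LABEL` and the job name come from this thread; the helper `build_label_selector`, the `job-name` key, and the label values are assumptions, not the SDK's actual code.

```python
# Label name taken from the constants diff above.
PYTORCHJOB_CONTROLLER_LABEL = 'controller-name'


def build_label_selector(labels: dict) -> str:
    """Join label pairs into the 'key=value,key=value' selector syntax."""
    return ",".join(
        "{}={}".format(key, value) for key, value in sorted(labels.items()))


labels = {
    PYTORCHJOB_CONTROLLER_LABEL: "pytorch-operator",  # assumed label value
    "job-name": "pytorchjob-mnist-ci-test",           # assumed label key
}
print(build_label_selector(labels))
```

Keeping the label names in one constants module means a client for another framework would only need its own values, not its own selector logic.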

hack/python-sdk/post_gen.py

Jeffwan · 2021-09-23T07:33:33Z

@alexlatchford Really appreciate your work! Please check above comments

Jeffwan · 2021-09-23T07:33:55Z

/cc @kubeflow/wg-training-leads

Jeffwan · 2021-09-29T01:05:48Z

/ok-to-test

alembiewski · 2021-09-29T15:05:02Z

@Jeffwan, test.e2e.test_e2e_pytorchjob: test_sdk_e2e test is failing with the following error, could you help to figure out why? Not sure how I can troubleshoot that on the test cluster.

>           raise RuntimeError("Not found Pods of the PyTorchJob {} "
                               "in namespace {}".format(name, namespace))
E           RuntimeError: Not found Pods of the PyTorchJob pytorchjob-mnist-ci-test in namespace default

sdk/python/kubeflow/training/api/py_torch_job_client.py:384: RuntimeError

It seems the job completed, but the client wasn't able to find a pod to fetch the logs from.
I also ran the tests on my own cluster, and they all pass:

(.venv) ➜  tf-operator git:(update-sdk) ✗ pytest sdk/python/test                                                                                                                            
======================================================================================================== test session starts =========================================================================================================
platform darwin -- Python 3.8.9, pytest-4.6.11, py-1.10.0, pluggy-0.13.1
Using --randomly-seed=1632920749
rootdir: /Users/.../go/src/github.com/mesosphere/tf-operator/sdk/python
plugins: cov-2.12.1, randomly-1.2.3
collected 20 items

sdk/python/test/test_v1_run_policy.py .                                                                                                                                                                                        [  5%]
sdk/python/test/e2e/test_e2e_tfjob.py .                                                                                                                                                                                        [ 10%]
sdk/python/test/test_v1_xg_boost_job.py .                                                                                                                                                                                      [ 15%]
sdk/python/test/test_v1_replica_status.py .                                                                                                                                                                                    [ 20%]
sdk/python/test/test_v1_py_torch_job.py .                                                                                                                                                                                      [ 25%]
sdk/python/test/test_v1_replica_spec.py .                                                                                                                                                                                      [ 30%]
sdk/python/test/test_v1_py_torch_job_list.py .                                                                                                                                                                                 [ 35%]
sdk/python/test/test_v1_scheduling_policy.py .                                                                                                                                                                                 [ 40%]
sdk/python/test/test_v1_tf_job_list.py .                                                                                                                                                                                       [ 45%]
sdk/python/test/test_v1_py_torch_job_spec.py .                                                                                                                                                                                 [ 50%]
sdk/python/test/test_v1_xg_boost_job_list.py .                                                                                                                                                                                 [ 55%]
sdk/python/test/e2e/test_e2e_pytorchjob.py .                                                                                                                                                                                   [ 60%]
sdk/python/test/test_v1_tf_job.py .                                                                                                                                                                                            [ 65%]
sdk/python/test/test_v1_xg_boost_job_spec.py .                                                                                                                                                                                 [ 70%]
sdk/python/test/test_v1_mx_job_spec.py .                                                                                                                                                                                       [ 75%]
sdk/python/test/test_v1_job_condition.py .                                                                                                                                                                                     [ 80%]
sdk/python/test/test_v1_tf_job_spec.py .                                                                                                                                                                                       [ 85%]
sdk/python/test/test_v1_mx_job.py .                                                                                                                                                                                            [ 90%]
sdk/python/test/test_v1_mx_job_list.py .                                                                                                                                                                                       [ 95%]
sdk/python/test/test_v1_job_status.py .                                                                                                                                                                                        [100%]

========================================================================================================== warnings summary ==========================================================================================================
sdk/python/.venv/lib/python3.8/site-packages/table_logger/table_logger.py:80
  /Users/.../.../tf-operator/sdk/python/.venv/lib/python3.8/site-packages/table_logger/table_logger.py:80: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
  Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    np.float: fmt.FloatFormatter,

sdk/python/.venv/lib/python3.8/site-packages/table_logger/table_logger.py:86
   /Users/.../go/src/github.com/mesosphere/tf-operator/sdk/python/.venv/lib/python3.8/site-packages/table_logger/table_logger.py:86: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
  Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    np.int: fmt.IntegerFormatter,

-- Docs: https://docs.pytest.org/en/latest/warnings.html
============================================================================================== 20 passed, 2 warnings in 308.06 seconds ===============================================================================================
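
If the CI failure above is a timing race between job completion and pod lookup (an assumption; the thread does not confirm the cause), one common mitigation is to poll for pods for a short period before raising. The sketch below is not the SDK's implementation: `wait_for_pod_names` and its `list_pods` callable are hypothetical, and the pod name in the demo is made up.

```python
import time
from typing import Callable, List


def wait_for_pod_names(list_pods: Callable[[], List[str]],
                       timeout: float = 30.0,
                       interval: float = 1.0) -> List[str]:
    """Poll list_pods() until it returns a non-empty list or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        names = list_pods()
        if names:
            return names
        if time.monotonic() >= deadline:
            raise RuntimeError(
                "Not found Pods within {} seconds".format(timeout))
        time.sleep(interval)


# Stand-in for a real pod lookup: empty twice, then one (made-up) pod name.
responses = iter([[], [], ["pytorchjob-mnist-ci-test-master-0"]])
print(wait_for_pod_names(lambda: next(responses), timeout=5.0, interval=0.01))
```

Polling would not help if the pods had already been deleted by a clean-up policy, so this only covers the case where the lookup runs too early.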

alembiewski · 2021-09-29T15:46:01Z

@Jeffwan, I updated the tooling for SDK generation and addressed the comments. Could you please take a look at the changes once again?

Jeffwan · 2021-09-29T17:04:11Z

> @Jeffwan, test.e2e.test_e2e_pytorchjob: test_sdk_e2e test is failing with the following error, could you help to figure out why? Not sure how I can troubleshoot that on the test cluster.
>
> >           raise RuntimeError("Not found Pods of the PyTorchJob {} "
>                                "in namespace {}".format(name, namespace))
> E           RuntimeError: Not found Pods of the PyTorchJob pytorchjob-mnist-ci-test in namespace default
>
> sdk/python/kubeflow/training/api/py_torch_job_client.py:384: RuntimeError
>
> It seems like the job has been completed, but it wasn't able to find a pod to fetch the logs. I also tried to run the test on my cluster, all pass:

Let me check the failure. If you can run it successfully in your local environment, it could be a flaky test.

/retest

Jeffwan · 2021-09-30T01:59:32Z

The SDK test case passes; the clean-up pod policy test is a flaky one.

/test kubeflow-tf-operator-presubmit

docs/development/developer_guide.md

hack/python-sdk/swagger.json

Jeffwan · 2021-09-30T02:47:27Z

The PR looks good to me. Please double check it.

/cc @andreyvelich @kubeflow/wg-training-leads

alembiewski · 2021-09-30T08:38:57Z

@Jeffwan, all tests are green now - I improved attribute checks in k8s_util.py, hoping this will make the test less flaky.
After this PR is merged, we should probably think about publishing it to PyPI, maybe as a part of the 1.3.0 release?
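
The kind of attribute check mentioned above could look roughly like the following. This is a sketch, not the code in k8s_util.py: the helper `get_nested_attr` is hypothetical, and `SimpleNamespace` objects stand in for Kubernetes client models whose optional fields may be missing or `None`.

```python
from types import SimpleNamespace


def get_nested_attr(obj, path, default=None):
    """Walk a dotted attribute path, returning default on any missing link."""
    current = obj
    for name in path.split("."):
        if current is None or not hasattr(current, name):
            return default
        current = getattr(current, name)
    return default if current is None else current


# Stand-ins for pod objects returned by a Kubernetes client.
running_pod = SimpleNamespace(status=SimpleNamespace(phase="Running"))
pending_pod = SimpleNamespace(status=None)

print(get_nested_attr(running_pod, "status.phase"))
print(get_nested_attr(pending_pod, "status.phase", default="Unknown"))
```

Returning a default instead of raising `AttributeError` lets a polling test treat a half-populated object as "not ready yet" rather than as a failure.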

andreyvelich

Thanks a lot for updating this @alembiewski!
I left a few comments.

andreyvelich · 2021-09-30T12:46:47Z

docs/development/developer_guide.md

+
+To generate Python SDK for the operator, run:
+```
+.hack/python-sdk/gen-sdk.sh


Suggested change

.hack/python-sdk/gen-sdk.sh

./hack/python-sdk/gen-sdk.sh

andreyvelich · 2021-09-30T12:55:51Z

sdk/python/examples/kubeflow-pytorchjob-sdk.ipynb

+    "from kubeflow.training import V1PyTorchJob\n",
+    "from kubeflow.training import V1PyTorchJobSpec\n",
+    "from kubeflow.training import V1RunPolicy\n",
+    "from kubeflow.training.api.py_torch_job_client import PyTorchJobClient"


@alembiewski @Jeffwan Do we want to add the import of this client to __init__.py, similar to the Katib __init__.py?

Then users can just run `from kubeflow.training import PyTorchJobClient` or `from kubeflow.training import TFJobClient`.

I don't have a strong opinion on this. WDYT @alembiewski?

Sounds reasonable, updated the post_gen.py script to add the imports automatically.
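
A post-generation step that adds client imports to a regenerated `__init__.py` might look like this. It is a sketch under assumptions, not the actual post_gen.py: the function name `add_client_imports` is hypothetical, while the two module paths follow the file names mentioned in this thread. The step is idempotent so it can run after every regeneration.

```python
import tempfile
from pathlib import Path

# Import lines to append; module paths follow the file names in this PR.
CLIENT_IMPORTS = [
    "from kubeflow.training.api.tf_job_client import TFJobClient",
    "from kubeflow.training.api.py_torch_job_client import PyTorchJobClient",
]


def add_client_imports(init_file: Path, imports=CLIENT_IMPORTS) -> bool:
    """Append missing import lines; return True if the file was changed."""
    text = init_file.read_text()
    existing = set(text.splitlines())
    missing = [line for line in imports if line not in existing]
    if not missing:
        return False
    if text and not text.endswith("\n"):
        text += "\n"
    init_file.write_text(text + "\n".join(missing) + "\n")
    return True


# Demo on a temporary file standing in for the generated __init__.py.
with tempfile.TemporaryDirectory() as tmp_dir:
    init_path = Path(tmp_dir) / "__init__.py"
    init_path.write_text("# generated package init\n")
    print(add_client_imports(init_path))  # first run modifies the file
    print(add_client_imports(init_path))  # second run is a no-op
```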

andreyvelich · 2021-09-30T13:00:44Z

sdk/python/README.md

@@ -46,14 +46,36 @@ Class | Method | Description
 [TFJobClient](docs/TFJobClient.md) | [is_job_succeeded](docs/TFJobClient.md#is_job_succeeded) | Check if the TFJob status is Succeeded |
 [TFJobClient](docs/TFJobClient.md) | [get_pod_names](docs/TFJobClient.md#get_pod_names) | Get pod names of TFJob |
 [TFJobClient](docs/TFJobClient.md) | [get_logs](docs/TFJobClient.md#get_logs) | Get training logs of the TFJob |
+[PyTorchJobClient](docs/PyTorchJobClient.md) | [create](docs/PyTorchJobClient.md#create) | Create PyTorchJob|


I believe the client docs were deleted by the generator. Should we modify our script so that it does not delete docs/PyTorchJobClient.md and docs/TFJobClient.md during the SDK generator run?

Thanks for catching that, added client docs back

sdk/python/kubeflow/training/api/py_torch_job_client.py

andreyvelich · 2021-09-30T13:04:02Z

sdk/python/kubeflow/training/api/py_torch_job_client.py

+        :return: True or False
+        """
+        pytorchjob_status = self.get_job_status(name, namespace=namespace)
+        return pytorchjob_status.lower() == "succeeded"


Move this status to constants?
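
A minimal sketch of what that suggestion could look like. The constant name `JOB_STATUS_SUCCEEDED` and the standalone function are assumptions for illustration; in the SDK the check lives in a client method that first fetches the status.

```python
# Hypothetical constant; in the SDK it would live in the constants module.
JOB_STATUS_SUCCEEDED = "Succeeded"


def is_succeeded(status: str) -> bool:
    """Case-insensitive comparison against the shared status constant."""
    return status.lower() == JOB_STATUS_SUCCEEDED.lower()


print(is_succeeded("succeeded"))
print(is_succeeded("Running"))
```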

andreyvelich · 2021-09-30T13:06:23Z

sdk/python/kubeflow/training/api/tf_job_watch.py

+            tbl(tfjob_name, status, update_time)
+
+            if name == tfjob_name:
+                if status == 'Succeeded' or status == 'Failed':


Similar comment about status.
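
For the watch loop, the same idea could be expressed as a set of terminal statuses. This is a sketch only: `TERMINAL_STATUSES` and `first_terminal_status` are hypothetical names, and plain `(job_name, status)` tuples stand in for the events the real watch stream yields.

```python
# Hypothetical constant grouping the statuses that end a watch.
TERMINAL_STATUSES = frozenset({"Succeeded", "Failed"})


def first_terminal_status(events, name):
    """Return the first terminal status seen for the named job, else None."""
    for job_name, status in events:
        if job_name == name and status in TERMINAL_STATUSES:
            return status
    return None


# Stand-in event stream; a real one would come from the Kubernetes watch API.
events = [
    ("other-job", "Failed"),
    ("mnist", "Created"),
    ("mnist", "Running"),
    ("mnist", "Succeeded"),
]
print(first_terminal_status(events, "mnist"))
```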

sdk/python/kubeflow/training/api/tf_job_client.py

andreyvelich · 2021-09-30T13:07:36Z

sdk/python/kubeflow/training/api/py_torch_job_watch.py

@@ -0,0 +1,60 @@
+# Copyright 2020 The Kubeflow Authors.


Suggested change

# Copyright 2020 The Kubeflow Authors.

# Copyright 2021 The Kubeflow Authors.

sdk/python/kubeflow/training/utils/utils.py

andreyvelich · 2021-09-30T13:11:12Z

sdk/python/setup.py

-  description="TFJob Python SDK",
-  long_description="TFJob Python SDK",
+  description="Training Operator Python SDK",
+  long_description="Training Operator Python SDK",
  packages=setuptools.find_packages(
    include=("kubeflow*")),
  package_data={},


Should we drop support for Python < 3?

alembiewski · 2021-10-02T20:36:21Z

@Jeffwan, @andreyvelich, thanks for the review! I addressed all comments and suggestions, PTAL

andreyvelich · 2021-10-03T20:21:54Z

Thank you for updating this @alembiewski!
/lgtm
I think we should also publish this SDK to PyPI once we are ready.

/cc @kubeflow/wg-training-leads

Jeffwan · 2021-10-03T22:46:04Z

/approve

google-oss-robot · 2021-10-03T22:46:33Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jeffwan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [Jeffwan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

alembiewski added 2 commits September 22, 2021 22:49

Add Python SDK for Kubeflow Training Operator

75734f8

Update example notebooks

e80031f

aws-kf-ci-bot added the needs-ok-to-test label Sep 22, 2021

google-oss-robot added the size/XXL label Sep 22, 2021

google-oss-robot requested review from jinchihe and terrytangyuan September 22, 2021 23:08

Jeffwan reviewed Sep 23, 2021

sdk/python/README.md (resolved)

Jeffwan reviewed Sep 23, 2021

google-oss-robot requested a review from a team September 23, 2021 07:33

alembiewski changed the title ~~Add Python SDK for Kubeflow Training Operator~~ [WIP] Add Python SDK for Kubeflow Training Operator Sep 23, 2021

google-oss-robot added the do-not-merge/work-in-progress label Sep 23, 2021

Jeffwan mentioned this pull request Sep 25, 2021

Update scripts to generate sdk for all frameworks #1389

Merged

Merge branch 'kubeflow:master' into update-sdk

0fb3a10

aws-kf-ci-bot added ok-to-test and removed needs-ok-to-test labels Sep 29, 2021

alembiewski added 3 commits September 29, 2021 15:21

Update SDK generation tooling and docs

556db10

Re-generate SDK

83c1453

Allow to specify container name in 'get_logs' methods

ac8ca57

alembiewski changed the title ~~[WIP] Add Python SDK for Kubeflow Training Operator~~ Add Python SDK for Kubeflow Training Operator Sep 29, 2021

google-oss-robot removed the do-not-merge/work-in-progress label Sep 29, 2021

Jeffwan mentioned this pull request Sep 29, 2021

Cut official release of 1.3.0 #1425

Closed

Generalize job labels

70866c5

Jeffwan reviewed Sep 30, 2021

docs/development/developer_guide.md (resolved)

Jeffwan reviewed Sep 30, 2021

hack/python-sdk/swagger.json (resolved)

google-oss-robot requested review from andreyvelich and a team September 30, 2021 02:47

Check if attribute exists

876a3e0

andreyvelich reviewed Sep 30, 2021

Address code review comments

cc4a716

google-oss-robot requested a review from a team October 3, 2021 20:21

google-oss-robot assigned andreyvelich Oct 3, 2021

google-oss-robot added the lgtm label Oct 3, 2021

google-oss-robot added the approved label Oct 3, 2021

Jeffwan approved these changes Oct 3, 2021

google-oss-robot merged commit 6523d8d into kubeflow:master Oct 3, 2021

alembiewski deleted the update-sdk branch October 4, 2021 08:10

alembiewski mentioned this pull request Mar 2, 2022

Update swagger.json schema for TFJobSpec to include RunPolicy #1278

Closed

alembiewski mentioned this pull request Jun 16, 2022

Add alembiewski to the members list kubeflow/internal-acls#553

Merged
