Releases · kubeflow/pytorch-operator

This repository has been archived by the owner on Sep 19, 2022. It is now read-only.

25 Mar 00:23

Jeffwan

v0.7.0

2aae331

v0.7.0 release Latest


Features

Support the Running value for cleanPodPolicy, same as tf-operator (#288, @jiaqianjing)
Migrate pytorch-operator to go modules (#272, @Jeffwan)
feat(init_container): Add init container image CLI argument (#265, @gaocegege)
SDK supports getting PyTorchJob training process or logs (#252, @jinchihe)
Add watch function for PyTorchJob python Client API (#246, @jinchihe)
Add more APIs for SDK (#240, @jinchihe)
feat(deletePodsAndServices):only delete master service (#233, @leileiwan)
Generate Kubeflow PyTorchJob SDK (#227, @jinchihe)
feat: Replace common with kubeflow/common (#225, @gaocegege)
feat: Use golanglint (#226, @gaocegege)
Removing v1beta2 support (#222, @johnugeorge)
feat: Support running although it is useless (#194, @gaocegege)
use init container for worker pod to wait master pod ready (#187, @zlcnju)
fix: Fix the comments (#193, @gaocegege)
Minor fix to add CoreV1 to scheme (#184, @johnugeorge)
update release script; fix post submit (#189, @johnugeorge)
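
As an illustration of the cleanPodPolicy feature listed above, the following is a minimal sketch of a PyTorchJob manifest built as a plain Python dict. It assumes the v1 API group (kubeflow.org/v1), a top-level spec.cleanPodPolicy field, and a container named "pytorch". The helper name build_pytorchjob, the job name, and the image URL are hypothetical and are not taken from the release notes.

```python
import json


def build_pytorchjob(name, image, workers=1, clean_pod_policy="Running"):
    """Return a PyTorchJob manifest as a plain dict (hypothetical helper)."""

    def replica(count):
        # One replica spec; the container name "pytorch" is an assumption.
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {"containers": [{"name": "pytorch", "image": image}]}
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            # Assumed semantics: "Running" cleans up only pods that are
            # still running once the job finishes.
            "cleanPodPolicy": clean_pod_policy,
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            },
        },
    }


if __name__ == "__main__":
    # Placeholder job name and image, for illustration only.
    job = build_pytorchjob("mnist-demo", "example.com/mnist:latest", workers=2)
    print(json.dumps(job, indent=2))
```

The printed JSON could be saved to a file and applied with kubectl, provided the cluster runs an operator version that accepts these fields.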

Bug fixes

Fix Unit Tests (#293, @andreyvelich)
Fix minor OpenShift issues - resource requests, Dockerfile (#276, @vpavlin)
Fix the link to run_e2e_workflow.py script (#266, @terrytangyuan)
fix: Add resource limits for init container (#253, @gaocegege)
fix the reconcile flow (#242, @ChanYiLin)
fix(*): remove worker service in controller_test.go (#235, @leileiwan)
fix(job_test) test case should not include worker service (#231, @leileiwan)

Chores

Change mnist example to use FashionMNIST (#327, @Jeffwan)
pytorch-operator: Consolidate manifests (#323, @yanniszark)
Temporarily disable mnist test case (#326, @Jeffwan)
PyTorch Operator: Move manifests development upstream (#320, @yanniszark)
Migrate to new test-infra (#316, @PatrickXYS)
update pytorch-operator deployment manifests file (#295, @myonlyzzy)
Add @andreyvelich to approvers (#309, @andreyvelich)
Reuse Common Scripts for Creating / Deleting EKS clusters (#308, @PatrickXYS)
Add Jeffwan@ to OWNERS (#306, @Jeffwan)
Move PyTorch Operator e2e tests to AWS Prow (#305, @Jeffwan)
Update openapi-gen to not rely on vendor (#274, @Jeffwan)
Update README.md (#290, @pingsutw)
Update CRD link (#289, @pingsutw)
Adds notes and example annotation for pytorch job (#285, @shawnzhu)
chore: Update OWNERS (#286, @gaocegege)
fix Dockerfile-mpi download miniconda.sh (#277, @jiaqianjing)
Update swagger-codegen-cli URL (#280, @jinchihe)
Pin Kubernetes client version to work around a bug (#262, @jinchihe)
Added The Pytorch GPU Docker under the appropriate folder (#255, @MATRIX4284)
Copy third party vendor source code to Docker image (#251, @johnugeorge)
Add third party license info (#250, @johnugeorge)
ConvertPyTorchJobToUnstructured uses function ToUnstructured to convert PyTorchJob to Unstructured instead of json (#241, @leileiwan)
replace gopkg.in/yaml.v2 with github.com/kubernetes-sigs/yaml repo (#238, @xrmzju)
Update tf operator branch dep (#223, @johnugeorge)
Avoiding unnecessary status update (#220, @johnugeorge)
Removing unnecessary rbac permissions (#221, @johnugeorge)
add mnist example dockerfile for ppc64le (#218, @zheddie)
Fix nslookup cannot work well in initContainerTemplate (#216, @hougangliu)
Minor change in log (#213, @johnugeorge)
Delete v1beta2 code (#212, @johnugeorge)
Add qps and burst options (#210, @ohmystack)
Set pytorchjob defaults in test utils (#208, @ohmystack)
Update codegen and verify in CI (#207, @ohmystack)
Update manifest to v0.6.0 (#200, @hougangliu)
Common label changes with K8s upgrade to 1.12.3 (#204, @johnugeorge)
Use multi-build to build pytorch-operator image (#198, @hmtai)
add total suffix in counter metrics (#201, @yeya24)
add kubeconfig flag (#192, @yeya24)
Remove unnecessary services for worker (#191, @hougangliu)


28 Jun 21:01

kunmingg

v1.0.0-rc.0

6aa39a4

v1.0.0-rc.0 Pre-release


Pre-graduation release of pytorch-operator.


13 Aug 17:45

johnugeorge

v0.6.0

6aa39a4

v0.6.0 release

Merged pull requests:

set annotation automatically when EnableGangScheduling is set to true #181 (zlcnju)
fix wrong api version when delete pytorchjob #179 (wackxu)
Moving crd to manifests #178 (johnugeorge)
Adds developer guide and sample CRD for v1 #177 (krishnadurai)
Update image base to UBI8 GA #176 (johnugeorge)
PyTorch Operator Prometheus Metrics #175 (krishnadurai)
Set start timestamp #170 (johnugeorge)
Skip condition update when succeeded #173 (johnugeorge)
Sync PodGroup fix #172 (johnugeorge)
Check pending status for pastBackoffLimitOnFailure #171 (johnugeorge)
Making ResyncPeriod configurable #169 (johnugeorge)
add uuid to id for leader election #168 (fisherxu)
Polish documentation for PyTorch V1 #167 (richardsliu)
Remove v1beta1 code #166 (johnugeorge)
Adding tests for operator v1 api #165 (johnugeorge)
Adding examples for v1 api #164 (johnugeorge)
Implementation of Pytorch operator v1 API #162 (johnugeorge)
Revise API version to v1beta2 #159 (krishnadurai)
set CompletionTime first when pytorchjob exceeds limit #158 (wackxu)
Prune owners file #157 (johnugeorge)
Set tf operator version to v0.5.0 #156 (johnugeorge)
Minor format changes #155 (johnugeorge)
Adding cleaner base image for operator builds #153 (johnugeorge)


06 Jun 14:26

johnugeorge

v0.5.1

396fb2f

v0.5.1 release Pre-release


Sync PodGroup fix (#172)


29 Mar 17:01

johnugeorge

v0.5.0

e8d4d04

v0.5.0 release

Closed issues:

Ensuring CRD requires cluster-level authority #144
Label naming style inconsistent #140
Pytorch operator v1beta2 API #134
Support gang-scheduling by kube-batch #129
Pytorch workers keep crashing if master is not up yet. #125
Support cross compile for image build. #42
Deprecate v1alpha2 API #135
Distribution across multi-gpu nodes #128
Upgrade examples to Pytorch 1.0 #123
Double gradient reduction in examples? #122

Merged pull requests:

Implement ActiveDeadlineSeconds and BackoffLimit #151 (johnugeorge)
Use podGroup instead of PDB in v1beta2 #150 (johnugeorge)
Use kube-batch as scheduler by default when gang-scheduling is enabled #149 (johnugeorge)
Remove usage of crd client #148 (johnugeorge)
Update tests to have single operator deployment for v1beta1 and v1beta2 API #147 (johnugeorge)
Renaming labels to consistent format #146 (johnugeorge)
Workers are created only when the master is in running phase #145 (johnugeorge)
Adding tests for v1beta2 #143 (johnugeorge)
Change cluster version to 1.11 #142 (andreyvelich)
Update OWNERS #141 (andreyvelich)
Adding status subresource #139 (johnugeorge)
Adding v1beta2 API implementation #138 (johnugeorge)
Upgrading k8s to 1.11 #137 (johnugeorge)
Removing v1alpha2 API #136 (johnugeorge)
Adding detailed events/messages to PyTorch Jobs #133 (johnugeorge)
Skip status reinit when job is succeeded #132 (johnugeorge)
Travis build fix #131 (johnugeorge)
Rework example and e2e test script #126 (TimZaman)
Change Distributed Data Parallel example #124 (andreyvelich)


14 Feb 06:08

johnugeorge

v0.5.0-rc.1

da7798e

v0.5.0-rc.1 release Pre-release


Adding v1beta2 API implementation (#138)

* Adding v1beta2 API implementation

* Build v1beta2


08 Jan 07:04

johnugeorge

v0.4.0

306edb5

v0.4.0 release

Closed issues:

Delete v1alpha1 API and controller from the repository #105
Create v1beta1 Pytorch operator docker image #104
Implement v1beta1 controller #96
MPI backend mnist gpu example error: "No space left on device" #91
pytorch-operator should ensure that CRD exists #87
Refactor E2E tests #86
[discussion] Refactor pytorch operator APIs #84

Merged pull requests:

Updated gcloud build related code. #121 (ltomes)
Adding Operator deployment manifests #119 (johnugeorge)
Adding distributed mnist example with summaries #118 (johnugeorge)
Add master role label for PyTorchJob #116 (johnugeorge)
Minor fixes #115 (johnugeorge)
GetCondition func fix #114 (johnugeorge)
Update k8s cluster version to 1.10 #113 (johnugeorge)
Pytorch Katib example #112 (johnugeorge)
Updating pkg version to 1.10.1 #111 (johnugeorge)
Delete v1alpha1 API and Controller #110 (johnugeorge)
Delete v1alpha1 tests #109 (johnugeorge)
gopkg: Use version instead of branch #107 (gaocegege)
Keeping default project as kubeflow-ci #106 (johnugeorge)
Adding v1beta1 binary to the docker image #103 (johnugeorge)
Fix registry override for image builds #102 (johnugeorge)
Adding e2e tests for v1beta1 #101 (johnugeorge)
Release image to kubeflow images public #100 (johnugeorge)
Ensure that PyTorch CRD exists #99 (johnugeorge)
Adding examples for v1beta1 API #98 (johnugeorge)
Adding controller for v1beta1 api #97 (johnugeorge)
Updating common JobController vendor #95 (johnugeorge)
Validation test for V1beta1 apis #94 (johnugeorge)
Pytorch operator v1beta1 APIs #93 (johnugeorge)
Fix MPI backend mnist gpu example error: "No space left on device" #92 (jwwandy)
vendor: Update to 1.10 #90 (gaocegege)
Add richardsliu to OWNERS #85 (richardsliu)


10 Dec 04:34

johnugeorge

v0.4.0-rc.1

97b3b99

v0.4.0-rc.1 release Pre-release


Updating pkg version to 1.10.1 (#111)

* Updating k8s to 1.10.1

* Update to 1.10.1


09 Oct 05:20

johnugeorge

v0.3.0

8ea9b43

v0.3.0 Release

v0.3.0 release of PyTorch operator
