This repository has been archived by the owner on Sep 19, 2022. It is now read-only.
Releases: kubeflow/pytorch-operator
v0.7.0 release
Features
- Support cleanPodPolicy set to Running, same as tf-operator (#288, @jiaqianjing)
- Migrate pytorch-operator to go modules (#272, @Jeffwan)
- feat(init_container): Add init container image CLI argument (#265, @gaocegege)
- SDK supports getting PyTorchJob training process or logs (#252, @jinchihe)
- Add watch function for PyTorchJob python Client API (#246, @jinchihe)
- Add more APIs for SDK (#240, @jinchihe)
- feat(deletePodsAndServices): only delete master service (#233, @leileiwan)
- Generate Kubeflow PyTorchJob SDK (#227, @jinchihe)
- feat: Replace common with kubeflow/common (#225, @gaocegege)
- feat: Use golanglint (#226, @gaocegege)
- Removing v1beta2 support (#222, @johnugeorge)
- feat: Support running although it is useless (#194, @gaocegege)
- Use init container in worker pods to wait for the master pod to be ready (#187, @zlcnju)
- fix: Fix the comments (#193, @gaocegege)
- Minor fix to add CoreV1 to scheme (#184, @johnugeorge)
- update release script; fix post submit (#189, @johnugeorge)
Bug fixes
- Fix Unit Tests (#293, @andreyvelich)
- Fix minor OpenShift issues - resource requests, Dockerfile (#276, @vpavlin)
- Fix the link to run_e2e_workflow.py script (#266, @terrytangyuan)
- fix: Add resource limits for init container (#253, @gaocegege)
- fix the reconcile flow (#242, @ChanYiLin)
- fix(*) rm work service in controller_test.go (#235, @leileiwan)
- fix(job_test) test case should not include worker service (#231, @leileiwan)
Chores
- Change mnist example to use FashionMNIST (#327, @Jeffwan)
- pytorch-operator: Consolidate manifests (#323, @yanniszark)
- Temporarily disable mnist test case (#326, @Jeffwan)
- PyTorch Operator: Move manifests development upstream (#320, @yanniszark)
- Migrate to new test-infra (#316, @PatrickXYS)
- update pytorch-operator deployment manifests file (#295, @myonlyzzy)
- Add @andreyvelich to approvers (#309, @andreyvelich)
- Reuse Common Scripts for Creating / Deleting EKS clusters (#308, @PatrickXYS)
- Add Jeffwan@ to OWNERS (#306, @Jeffwan)
- Move PyTorch Operator e2e tests to AWS Prow (#305, @Jeffwan)
- Update openapi-gen to not rely on vendor (#274, @Jeffwan)
- Update README.md (#290, @pingsutw)
- Update CRD link (#289, @pingsutw)
- Adds notes and example annotation for pytorch job (#285, @shawnzhu)
- chore: Update OWNERS (#286, @gaocegege)
- Fix miniconda.sh download in Dockerfile-mpi (#277, @jiaqianjing)
- Update swagger-codegen-cli URL (#280, @jinchihe)
- Pin kubernetes client version to work around a bug (#262, @jinchihe)
- Added The Pytorch GPU Docker under the appropriate folder (#255, @MATRIX4284)
- Copy third party vendor source code to Docker image (#251, @johnugeorge)
- Add third party license info (#250, @johnugeorge)
- ConvertPyTorchJobToUnstructured uses function ToUnstructured to convert PyTorchJob to Unstructured instead of json (#241, @leileiwan)
- replace gopkg.in/yaml.v2 with github.com/kubernetes-sigs/yaml repo (#238, @xrmzju)
- Update tf operator branch dep (#223, @johnugeorge)
- Avoiding unnecessary status update (#220, @johnugeorge)
- Removing unnecessary rbac permissions (#221, @johnugeorge)
- add mnist example dockerfile for ppc64le (#218, @zheddie)
- Fix nslookup not working in initContainerTemplate (#216, @hougangliu)
- Minor change in log (#213, @johnugeorge)
- Delete v1beta2 code (#212, @johnugeorge)
- Add qps and burst options (#210, @ohmystack)
- Set pytorchjob defaults in test utils (#208, @ohmystack)
- Update codegen and verify in CI (#207, @ohmystack)
- Update manifest to v0.6.0 (#200, @hougangliu)
- Common label changes with K8s upgrade to 1.12.3 (#204, @johnugeorge)
- Use multi-build to build pytorch-operator image (#198, @hmtai)
- add total suffix in counter metrics (#201, @yeya24)
- add kubeconfig flag (#192, @yeya24)
- Remove unnecessary services for worker (#191, @hougangliu)
v1.0.0-rc.0
Pre-graduation release of pytorch-operator
v0.6.0 release
Merged pull requests:
- set annotation automatically when EnableGangScheduling is set to true #181 (zlcnju)
- Fix wrong API version when deleting a PyTorchJob #179 (wackxu)
- Moving crd to manifests #178 (johnugeorge)
- Adds developer guide and sample CRD for v1 #177 (krishnadurai)
- Update image base to UBI8 GA #176 (johnugeorge)
- PyTorch Operator Prometheus Metrics #175 (krishnadurai)
- Set start timestamp #170 (johnugeorge)
- Skip condition update when succeeded #173 (johnugeorge)
- Sync PodGroup fix #172 (johnugeorge)
- Check pending status for pastBackoffLimitOnFailure #171 (johnugeorge)
- Making ResyncPeriod configurable #169 (johnugeorge)
- add uuid to id for leader election #168 (fisherxu)
- Polish documentation for PyTorch V1 #167 (richardsliu)
- Remove v1beta1 code #166 (johnugeorge)
- Adding tests for operator v1 api #165 (johnugeorge)
- Adding examples for v1 api #164 (johnugeorge)
- Implementation of Pytorch operator v1 API #162 (johnugeorge)
- Revise API version to v1beta2 #159 (krishnadurai)
- set CompletionTime first when pytorchjob exceeds limit #158 (wackxu)
- Prune owners file #157 (johnugeorge)
- Set tf operator version to v0.5.0 #156 (johnugeorge)
- Minor format changes #155 (johnugeorge)
- Adding cleaner base image for operator builds #153 (johnugeorge)
v0.5.1 release
Sync PodGroup fix (#172)
v0.5.0 release
Closed issues:
- Ensuring CRD requires cluster-level authority #144
- Label naming style inconsistent #140
- Pytorch operator v1beta2 API #134
- Support gang-scheduling by kube-batch #129
- Pytorch workers keep crashing if master is not up yet. #125
- Support cross compile for image build. #42
- Deprecate v1alpha2 API #135
- Distribution across multi-gpu nodes #128
- Upgrade examples to Pytorch 1.0 #123
- Double gradient reduction in examples? #122
Merged pull requests:
- Implement ActiveDeadlineSeconds and BackoffLimit #151 (johnugeorge)
- Use podGroup instead of PDB in v1beta2 #150 (johnugeorge)
- Use kube-batch as scheduler by default when gang-scheduling is enabled #149 (johnugeorge)
- Remove usage of crd client #148 (johnugeorge)
- Update tests to have single operator deployment for v1beta1 and v1beta2 API #147 (johnugeorge)
- Renaming labels to consistent format #146 (johnugeorge)
- Workers are created only when the master is in running phase #145 (johnugeorge)
- Adding tests for v1beta2 #143 (johnugeorge)
- Change cluster version to 1.11 #142 (andreyvelich)
- Update OWNERS #141 (andreyvelich)
- Adding status subresource #139 (johnugeorge)
- Adding v1beta2 API implementation #138 (johnugeorge)
- Upgrading k8s to 1.11 #137 (johnugeorge)
- Removing v1alpha2 API #136 (johnugeorge)
- Adding detailed events/messages to PyTorch Jobs #133 (johnugeorge)
- Skip status reinit when job is succeeded #132 (johnugeorge)
- Travis build fix #131 (johnugeorge)
- Rework example and e2e test script #126 (TimZaman)
- Change Distributed Data Parallel example #124 (andreyvelich)
v0.5.0-rc.1 release
Adding v1beta2 API implementation (#138)
- Adding v1beta2 API implementation
- Build v1beta2
v0.4.0 release
Closed issues:
- Delete v1alpha1 API and controller from the repository #105
- Create v1beta1 Pytorch operator docker image #104
- Implement v1beta1 controller #96
- MPI backend mnist gpu example error: "No space left on device" #91
- pytorch-operator should ensure that CRD exists #87
- Refactor E2E tests #86
- [discussion] Refactor pytorch operator APIs #84
Merged pull requests:
- Updated gcloud build related code. #121 (ltomes)
- Adding Operator deployment manifests #119 (johnugeorge)
- Adding distributed mnist example with summaries #118 (johnugeorge)
- Add master role label for PyTorchJob #116 (johnugeorge)
- Minor fixes #115 (johnugeorge)
- GetCondition func fix #114 (johnugeorge)
- Update k8s cluster version to 1.10 #113 (johnugeorge)
- Pytorch Katib example #112 (johnugeorge)
- Updating pkg version to 1.10.1 #111 (johnugeorge)
- Delete v1alpha1 API and Controller #110 (johnugeorge)
- Delete v1alpha1 tests #109 (johnugeorge)
- gopkg: Use version instead of branch #107 (gaocegege)
- Keeping default project as kubeflow-ci #106 (johnugeorge)
- Adding v1beta1 binary to the docker image #103 (johnugeorge)
- Fix registry override for image builds #102 (johnugeorge)
- Adding e2e tests for v1beta1 #101 (johnugeorge)
- Release image to kubeflow images public #100 (johnugeorge)
- Ensure that PyTorch CRD exists #99 (johnugeorge)
- Adding examples for v1beta1 API #98 (johnugeorge)
- Adding controller for v1beta1 api #97 (johnugeorge)
- Updating common JobController vendor #95 (johnugeorge)
- Validation test for V1beta1 apis #94 (johnugeorge)
- Pytorch operator v1beta1 APIs #93 (johnugeorge)
- Fix MPI backend mnist gpu example error: "No space left on device" #92 (jwwandy)
- vendor: Update to 1.10 #90 (gaocegege)
- Add richardsliu to OWNERS #85 (richardsliu)
v0.4.0-rc.1 release
Updating pkg version to 1.10.1 (#111)
- Updating k8s to 1.10.1
- Update to 1.10.1
v0.3.0 release
v0.3.0 release of PyTorch operator