Releases · kubeflow/training-operator

23 May 20:46

v0.5.2

4a46eba

v0.5.2

Remove deprecated field (#1007)

19 May 02:55

richardsliu

v0.5.1

63de5cb

v0.5.1

fix repeat delete service and pod (#998)

30 Mar 19:28

richardsliu

v0.5.0

aa322c7

v0.5.0

add ActiveDeadlineSeconds and BackoffLimit features (#963)

* add ActiveDeadlineSeconds and BackoffLimit features

* fix goimport and unassign variable in test

* fix test and delete the package added by dep ensure

* fix goimports

* fix ActiveDeadlineSeconds unit test

* add logger for test

* add logger for test

* add logger for test

* add BackoffForOnFailure test

* fix test

* fix test

* fix unit test

14 Feb 05:31

richardsliu

v0.4.0

c284947

v0.4.0

Upgrading k8s to 1.11.2 (#942)

28 Nov 22:27

richardsliu

v0.4.0-rc.1

bb0115f

v0.4.0-rc.1 Pre-release

Upgrade to 1.10.1 (#874)

20 Nov 22:42

richardsliu

v0.4.0-rc.0

89e6f66

v0.4.0-rc.0 Pre-release

Initial version of 0.4.0

TFJob v1beta1 API

07 Oct 22:21

jlewi

v0.3.0

fac8eff

v0.3.0 Release

The v0.3.0 release of the TFJob operator.

21 Jun 22:21

kunmingg

v0.2.0-rc1

38b886a

v0.2.0-rc1

tf-operator release v0.2.0, part of Kubeflow release v0.2.0.

Features and improvements:

[v1alpha2] Set event for tfjob when spec is not valid #620
[enhancement] Fix the gofmt support #586
[go] Use dep instead of glide to reduce the size of vendor #556
[v1alpha2] Enhance the logic about sync #547
[v1alpha2] Use structured log #537
[log] investigate zap #534
[v1alpha2] Try to not to always claim pods #533
[v1alpha2] Support customized port #532
[v1alpha2] start using kubeconfig #522
v1alpha2 integration #521
TFJob operator surface queue metrics #503
[api] Remove pending pods from active pods #484
[enhancement] Set StartTime for TFJob status #475
[Feature] Support "eval" worker in tf-operator #444
Add appropriate logging fields to the tf-operator log messages #424
[enhancement] Refactor docs #379
Deprecate TfPort and set default port for users #327
[enhancement] Add e2e test cases for recorder #317
Make the TfJob controller more event driven #314
Potential data race, maybe #302
Don't leave pods running just to get logs #128
Add hyperparameter tuning? #112
Use headless services for Training jobs #40
More validation of TfJob #25

Fixed bugs:

[v1alpha2] RealServiceControl does not set owner reference #616
TfJob operator stops working on invalid spec #561
[v1alpha2]tfjob restartPolicy for Never #555
[v1alpha2] Potential bugs when there is one worker succeeded #538
[v1alpha2][test] Avoid potential data race problem #530
Phase is wrong unexpected TfJob phase: Done #110

Closed issues:

[v1alpha2] Make restart policy a pointer #692
[v1alpha2] Need conditions Succeeded and Failed indicating when job is done #673
[v1alpha2] add pod label with job name (without namespace) #672
[v1alpha2] Pods not deleted when job finishes #671
[v1alpha2] conditions not updated #668
[v1alpha2] Move control interface to separate package #665
[v1alpha2] Move test util to separate package #664
Speedup E2E test by running build and setup cluster in parallel #659
In TFjob, when the workers Completed, i want the ps Completed too, how can i do? #657
[v1alpha2] service names are prefixed with namespace #654
[v1alpha2] Create a simple python server to be used for E2E tests of controller behavior #653
dep ensure give warning on k8s.io/apiserver #647
[v1alpha2] pod names don't include random salt #644
[v1alpha2]Unable to create pod #641
GPU tests failing; ks env doesn't exist #640
TFJob not marked as success when master exits but not workers #634
v1alpha2 - pod names don't include replica type #633
tensorflow on kubernetes how to pass in worker_host and ps_host to container if I use tf-operator #630
tf_job_client blocks forever #606
[v1alpha2] Need to add the v1alpha2 binaries to our Docker image #600
[v1alpha2] Need ksonnet package #599
Support deploying v1alpha2 and v1alpha1 controllers simultaneously #598
[v1alpha2] Remove controller_utils.go #591
[v1alpha2] Add CI test #589
[question] dist_mnist example failed to run #588
can not set labels #580
v1alpha2 should use headless services #574
TFJob operator should pass through annotations to the pod #573
[test] Test failed because of ImagePullBackOff #567
Servable not found for request: Latest(mnist) #552
[v1alpha2] The state of distributed model training. #544
[test] copy labels and annotations to pod from tfjob #543
Unable to deploy the example TfJob in the user guide #535
[v1alpha2] Do not set default to always for restartpolicy #524
E2E test steps should exit with non zero exit code if test fails #514
[v1alpha2] Sync commits with v1alpha1 #490
Use OpenAPI validation for CRDs in k8s 1.9 #437
default install of kubeflow no longer install tf-job-dashboard #435
Use DAG functionality of Argo in our E2E tests #422
Post submits are failing with Argo #370
tf-job-operator pod hangs and doesn't restart if it can't delete one of the TfJob pods #366
Refactor TFJobStatus in CRD API #333
Deprecate the TfImage field #330
[discussion] Differences between tensorflow/k8s and caicloud/kubeflow-controller #283
Does TfJob controller need to do master election? #263
Setup Prow PR Dashboard #255
API: some comments about API changes from PR #215 review #249
e2e test for the case that the chief is not master #235
Use conditions instead of phase #223
Submitted tfjobs cease to start running under unknown conditions #203
Tutorials #195
Copy chart to kubernetes/charts #93
Create a web page to list releases #70
tensorflow 1.4 and estimator support #61
Set a default value for restartPolicy #55

Merged pull requests:

*: Add cleanpod policy for v1alpha2 #691 (gaocegege)
status: Fail the TFJob if PS is failed #690 (gaocegege)
Use tf_job_name not tf_job_key as the label name. #689 (jlewi)
pkg: Delete pods an...

30 Mar 03:31

jlewi

v0.1.0

a7511ff

Initial release of the TFJob operator

gcr.io/kubeflow-images-staging/tf_operator@sha256:1a3d1a2ee90f0108fff3e29023228fc686afbfa311752e8b3bf71859d488b435

v0.1.0 (2018-03-29)

Closed issues:

[v1alpha2] Implement condition update #502
E2E tests timing out; job appears to remain in running state even though job is done. #500
[v1alpha2] TF_CONFIG should be configurable by user #499
[test] All log is 404 in argo #496
Presubmit shows succeeded, but some test actually failed. #479
Waiting pods start too long #461
[test] Add unit test for pkg/controller #455
Create a suitable OWNERS file in /dashboard #443
Tide is misconfigured for this repository. #433
CI failed to setup the cluster #420
[docs] Add dashboard readme #411
Make coverall results advisory and not report as failure #406
Presubmits failing due to lint #404
[enhancement] Fix go vet errors which not caught by the compilers #395
User facing website for Kubeflow that details how to choose a stack #371
[discussion] How to set clusterspec #369
[enhancement] Rename the cmd/tf_operator to cmd/tf-operator #363
Local releaser fails due to version_tag #360
Helm test failure not reported to gubernator #355
[discussion] Whether to create CRD in helm charts #353
Should resourcelock be in the same namespace as controller? #352
Helm test tf-job does not pass validation #351
Move tensorflow/k8s to kubeflow/tf-operator #350
Get rid of TensorBoard replica #347
Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs #346
Deprecate the ENV MY_POD_NAMESPACE and MY_POD_NAME #341
[feature] Does tfJob support setting different label/envVar for each worker(replicas >1)? #340
[Discussion] Time to start tagging releases for the TF operator? #339
[discussion] Should group name be tensorflow.org or kubeflow.io or kubeflow.org? #337
dashboard silent error during calling non-existent tfjob #335
in dashboard, silent error when nonexistent namespace is specified #334
Deprecate the IsDefaultPS field #329
[Convention] Replace Tf with TF in CRD #328
Standardise labels for issues and PRs #326
Manage Pods directly instead of using Job controllers #325
TfJobs dashboard not showing jobs #324
TfJobs dashboard doesn't work with K8s API server proxy or envoy proxy #323
Recreating a failed/successful job with same name doesn't work #322
Releaser incorrectly tags images as "dirty" #321
Reenable the releaser #320
E2E tests are not isolated #318
Need to mark prow job as failed if any tests fail #315
Remove outdated branch wbuchwalter-patch-1 #311
E2E test delete and recreate job with same name #310
TrainingJob.reconcile not called periodically #309
rename master to chief #306
Assign resource quota for TensorBoard #304
Jobs evicted for lack of memory, potentially add resource field to tf-job prototype #301
[Discussion] Operators vs. controller pattern #300
[bug] Add a default pod template for PS #297
Bunch of pylint error messages #294
Fix Head #293
Operator deployment fails post-v20180108-190394d #292
Promote last known good release #290
[bug] metadata.ownerReferences.apiVersion is not set #288
fail to run example job. invalid job spec: tfReplicaSpec.TfPort can't be nil #284
[bug] Build log 404 in https://prow.k8s.io/?repo=tensorflow%2Fk8s #282
[feature] Separate the CRD and controller #281
Gaps in test coverage #280
Regression in flag name: controller-config-file #279
[bug] glog before flag.Parse() #275
build new code to new image and find some problem #274
Fix the releaser so we can build new images #270
deploy.py gives gcloud api error '... Version "1.8.1-gke.1" is invalid.' #268
Pods terminated without waiting #267
Attach appropriate header (copyright) to go files #266
suppose i've install the tfjob in my k8s cluster #265
what's the folder pkg for? #264
Build failing because of lint issues #256
what's the main change between version 0.2 and version 0.3? #247
SetupCluster failures unexpected keyword argument 'client_configuration' #242
GPU test marked as succeeded but airflow step is failing #240
Use Kubeflow & ksonnet to install TfJob #239
tf_smoke.py distributed computing doesn't work on minikube #238
example-job can not work in private k8s cluster #233
Test failures aren't properly reported in Gubernator #229
[CRD] Request for input and output dirs in TFJobSpec #224
TfJob should be marked as failed if setup fails #218
panic: runtime error: invalid memory address or nil pointer dereference can not run in k8s 1.8.5 #212
Rethink the TFJob CRD #209
ksonnet configs for deploying the TfJob CRD & Controller #208
Make default TfImage configurable by users #207
refactor the TfJob to use Informer and Controller #206
Use Argo workflow engine for CI/CD or releases #205
Potential issue with Tensorboard / value of simple best-practices example with tboard #202
Investigate using buildah to...
