-
Notifications
You must be signed in to change notification settings - Fork 705
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate controller implementation to kubeflow/common fashion #1171
Conversation
/assign @Jeffwan @gaocegege |
BTW, since I remove the |
Travis tests have failedHey @ChanYiLin, 2nd Buildhack/verify-codegen.sh
golangci-lint run ./...
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/*.go,pkg/util/*/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go,pkg/apis/common/*/zz_generated.*.go,pkg/apis/common/*/*_generated.go"
TravisBuddy Request Identifier: c9ef6950-abcc-11ea-a7dd-93c9ffbfb007 |
Travis tests have failedHey @ChanYiLin, 2nd Buildhack/verify-codegen.sh
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/*.go,pkg/util/*/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go,pkg/apis/common/*/zz_generated.*.go,pkg/apis/common/*/*_generated.go"
TravisBuddy Request Identifier: cc3dd3e0-abd1-11ea-a7dd-93c9ffbfb007 |
If they are using older version of tf-operator code, it should be fine. They will be upgraded and switching to use kubeflow/common as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the change. Upgrade to k8s 1.16 and kubeflow/common 0.3.1 help reduce the size of PR and makes us concentrate on the migration changes. This is great.
One thing we need to leave an eye is not to bring regression issues. I will help on some tests on this.
@@ -42,5 +42,5 @@ ${GOPATH}/bin/defaulter-gen --input-dirs github.com/kubeflow/tf-operator/pkg/ap | |||
cd - > /dev/null | |||
|
|||
echo "Generating OpenAPI specification for tensorflow/v1" | |||
${GOPATH}/bin/openapi-gen --input-dirs github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1,k8s.io/api/core/v1,k8s.io/apimachinery/pkg/apis/meta/v1,k8s.io/apimachinery/pkg/api/resource,k8s.io/apimachinery/pkg/runtime,k8s.io/apimachinery/pkg/util/intstr,k8s.io/apimachinery/pkg/version --output-package github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1 --go-header-file hack/boilerplate/boilerplate.go.txt "$@" | |||
${GOPATH}/bin/openapi-gen --report-filename=hack/violation_exception.list --input-dirs github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1,k8s.io/api/core/v1,k8s.io/apimachinery/pkg/apis/meta/v1,k8s.io/apimachinery/pkg/api/resource,k8s.io/apimachinery/pkg/runtime,k8s.io/apimachinery/pkg/util/intstr,k8s.io/apimachinery/pkg/version --output-package github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1 --go-header-file hack/boilerplate/boilerplate.go.txt "$@" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
em. Seems we assume openapi-gen
is in the path. This will break for new users. I will create an issue to track to improve this later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure!
@@ -0,0 +1,42 @@ | |||
API rule violation: names_match,k8s.io/api/core/v1,AzureDiskVolumeSource,DataDiskURI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great.
CleanPodPolicy: &cleanPodPolicy, | ||
TFReplicaSpecs: map[TFReplicaType]*common.ReplicaSpec{ | ||
TFReplicaTypeWorker: &common.ReplicaSpec{ | ||
SuccessPolicy: &defaultSuccessPolicy, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if success policy is part of run policy. This is something we can consolidate together in future releases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, SuccessPolicy and FailurePolicy(we have opened an issue for it) are for distributed tensorflow only, I am not sure the definition of Success and Failure are the same for other framework.
@@ -27,8 +27,8 @@ import ( | |||
|
|||
const ( | |||
// labels for pods and servers. | |||
tfReplicaTypeLabel = "tf-replica-type" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For test, I think this is fine. We should be cautious for public facing labels. I think some users may have logics depends on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, since we now use reconcilePod
from common library, the labels are defined in common library
https://github.com/kubeflow/common/blob/master/pkg/apis/common/v1/constants.go#L5
and added to the pods here
https://github.com/kubeflow/common/blob/master/pkg/controller.v1/common/pod.go#L436
As you can see, it is hard coded now.
That means, no matter in tf-operator or mxnet-operator, the labels in the pod will always be apiv1.ReplicaIndexLabel = "replica-index"
and not the one defined in each operator.
I think we should modify the common library to use the GetReplicaTypeLabelKey
func (tc *TFController) GetReplicaTypeLabelKey() string {
return tfReplicaTypeLabel
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created separate issue to track this kubeflow/common#98
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can have an interface there, and leave replica-type
as default implementation in kubeflow/common
, different operators can still override it if there's a need.
This seems like a migration issue for tf only. Technically, we only use this label (and others) to filter pods with prefilter type. even they have same value, it's not a problem and won't affect filtering logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it affect Katib? /cc @andreyvelich
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in Katib we use labels to search for master pod.
@Jeffwan @terrytangyuan In the new common
implementation, it still applies job-role: master
for master pod ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes for job-role label, it is still job-role:master
But for replica type label, it has changed and removed the tf- prefix
I will solve the issue in common library so we can overwrite the replica type label
and then change this back in tf-operator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ChanYiLin! I believe it should work for us.
@@ -1,177 +0,0 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great. We retire pod/service control package here.
pkg/controller.v1/tensorflow/pod.go
Outdated
|
||
gangSchedulingPodGroupAnnotation = "scheduling.k8s.io/group-name" | ||
//gangSchedulerName = "volcano" | ||
//gangSchedulingPodGroupAnnotation = "scheduling.k8s.io/group-name" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These variables are not being used now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, these are originally used in reconcilePod
in tf-operator
|
||
// if master pod is present, select the master pod | ||
// if master is not present, first worker pod is selected as the master. | ||
if ContainChieforMasterSpec(tfjob) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is TF specific. Did you change to use isMasterRole
https://github.com/kubeflow/common/blob/f77924701cb5d8a2d6b7f49f7afaec3c6271782a/pkg/apis/common/v1/interface.go#L57
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have implemented it in controller.go L409
pkg/controller.v1/tensorflow/pod.go
Outdated
if spec.Template.Spec.SchedulerName != "" && spec.Template.Spec.SchedulerName != tc.Config.GangSchedulerName { | ||
return true | ||
// IsWorker0Completed return true if pod of worker0 succeeded and exited with 0 | ||
func (tc *TFController) IsWorker0Completed(tfjob *tfv1.TFJob, replicas map[commonv1.ReplicaType]*commonv1.ReplicaSpec) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can not find reference for this methods. Can you point me to the caller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use it and PodRestart
in UpdateJobStatus
.
Since in th-operator, we sometimes update status based on specific pod status (Chief, Master and worker0). But with using the framework from common library, we cannot use updateStatusSingle
(which has been removed) with flag restart
, worker0completed
to update status.
In common library, we now updateStatus after reconcile all pods. So I have to implement these two function to iterate all pods again for specific scenario.
(That is why I want to know how does mxnet-operator deal with scheduler succeeded in v1 with using common library)
Travis tests have failedHey @ChanYiLin, 2nd Buildhack/verify-codegen.sh
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/*.go,pkg/util/*/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go,pkg/apis/common/*/zz_generated.*.go,pkg/apis/common/*/*_generated.go"
TravisBuddy Request Identifier: b43032a0-ae6c-11ea-b5ce-3bd953fd8c80 |
@ChanYiLin Seems failure is from lint. Can you try to lint the code locally? |
Sure, but it is weird that the lint error came from the file that is not modified in the PR |
@ChanYiLin Maybe one of the dependencies' version got changed along the line. Try and see if you can reproduce the issue locally. |
Travis tests have failedHey @ChanYiLin, 2nd Buildhack/verify-codegen.sh
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/*.go,pkg/util/*/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go,pkg/apis/common/*/zz_generated.*.go,pkg/apis/common/*/*_generated.go"
TravisBuddy Request Identifier: dd79c960-b0a5-11ea-b9a3-b9754cbad7e4 |
Travis tests have failedHey @ChanYiLin, 2nd Buildhack/verify-codegen.sh
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/*.go,pkg/util/*/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go,pkg/apis/common/*/zz_generated.*.go,pkg/apis/common/*/*_generated.go"
TravisBuddy Request Identifier: 42c85470-b0ac-11ea-b9a3-b9754cbad7e4 |
/test kubeflow-tf-operator-presubmit |
1 similar comment
/test kubeflow-tf-operator-presubmit |
Travis tests have failedHey @ChanYiLin, 2nd Buildhack/verify-codegen.sh
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/*.go,pkg/util/*/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go,pkg/apis/common/*/zz_generated.*.go,pkg/apis/common/*/*_generated.go"
TravisBuddy Request Identifier: 85d70290-b11d-11ea-b9a3-b9754cbad7e4 |
The |
Hey @ChanYiLin, TravisCI finished with status TravisBuddy Request Identifier: 7ca64aa0-dada-11ea-bee2-571e9d389dd6 |
/retest |
@jinchihe there is still another issue in setup-kubeflow and the ci also failed. Can you help to verify the issue? Thanks |
/retest |
1 similar comment
/retest |
/test kubeflow-tf-operator-presubmit |
/retest |
/test kubeflow-tf-operator-presubmit |
1 similar comment
/test kubeflow-tf-operator-presubmit |
@gaocegege @Jeffwan This PR finally passed all the tests :) |
Signed-off-by: ChanYiLin <j5111261112@gmail.com>
let me squash the commit. |
Thanks for your contribution! 🎉 👍 |
/lgtm |
@ChanYiLin I don't have permission in this repo. I think you can self approve it now. Great work! It looks good to me |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! Appreciate all the efforts.
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ChanYiLin Thank you for this work!
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
I am going to merge this PR. Thanks for your work and reviews from other contributors.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gaocegege The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
-> "cla/google Expected — Waiting for status to be reported" There is some problem with the network traffic and CLA Assistant took some time to respond. @ChanYiLin Can you reply |
/test kubeflow-tf-operator-presubmit |
/retest |
Major changes include
reconcileJob
from common libraryreconcilePods
from common libraryreconcileServices
from common libraryupdateStatusSingle()
with new functionupdateJobStatus()
pkg/common/jobController
,pkg/control
,pkg/util
TestCleanupTFJob
since we cannot replace deleteTFJobHandler anymore with using common library