Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shared implementation of operator code #773

Merged
merged 8 commits into from
Aug 10, 2018
Merged

Shared implementation of operator code #773

merged 8 commits into from
Aug 10, 2018

Conversation

johnugeorge
Copy link
Member

@johnugeorge johnugeorge commented Aug 8, 2018

This is a continuation of TF refactoring PR #767

  1. Moved common initialization to the base controller
  2. Moved common pod and service level functions to the base

This change is Reviewable

@coveralls
Copy link

coveralls commented Aug 8, 2018

Coverage Status

Coverage decreased (-0.1%) to 58.026% when pulling b5ced61 on johnugeorge:refactor into d2509aa on kubeflow:master.

@johnugeorge
Copy link
Member Author

/cc @jlewi /cc @gaocegege

@k8s-ci-robot k8s-ci-robot requested a review from jlewi August 8, 2018 09:04
@k8s-ci-robot
Copy link

@johnugeorge: GitHub didn't allow me to request PR reviews from the following users: /cc.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @jlewi /cc @gaocegege

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -115,6 +133,75 @@ type JobController struct {
Recorder record.EventRecorder
}

func NewJobController(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name may conflicts users since Kubernetes has a job controller

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about TrainingJobController or BaseController

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it is package scoped, is it a problem? I initially thought of putting just "Controller" as the package name. But I felt that it will be confusing as we already have a "Controller" package for v1alpha1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new contributors will be confused, maybe. While it is not a big problem, just a nit.

@@ -75,7 +67,7 @@ var (
// TFJobController is the type for TFJob Controller, which manages
// the lifecycle of TFJobs.
type TFJobController struct {
jobcontroller.JobController
*jobcontroller.JobController
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why using pointer here, could you please explain more about it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No modification. Hence not necessary. I will update it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@gaocegege
Copy link
Member

Generally LGTM 😄

@gaocegege
Copy link
Member

/lgtm
/hold

Waiting for @jlewi 's review

Copy link
Member

@ScorpioCPH ScorpioCPH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after nits. Thanks!

// AdoptFunc used byControlRefManager to get the latest object if UID matches
AdoptFunc(job metav1.Object) func() (metav1.Object, error)
// Returns the Controller name
GetControllerName() string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I prefer ControllerName(), GetXxx() is Java style :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the same. But I left it because many interfaces in kubernetes haven't followed it.
https://github.com/kubernetes/apimachinery/blob/master/pkg/apis/meta/v1/meta.go#L33

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ScorpioCPH Added the changes

// orphan.
// for _, tfjob := range tc.getPodJobs(pod) {
// tc.enqueueTFJob(tfjob)
// }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need these commented out code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove them

@k8s-ci-robot k8s-ci-robot removed the lgtm label Aug 8, 2018
@gaocegege
Copy link
Member

/lgtm

@johnugeorge
Copy link
Member Author

/retest

1 similar comment
@johnugeorge
Copy link
Member Author

/retest

@k8s-ci-robot k8s-ci-robot removed the lgtm label Aug 8, 2018
@johnugeorge
Copy link
Member Author

cluster creation is timing out
/retest

@ankushagarwal ankushagarwal removed their request for review August 8, 2018 18:08
@johnugeorge
Copy link
Member Author

@jlewi Is this a transient issue?

@johnugeorge
Copy link
Member Author

/retest

@gaocegege
Copy link
Member

gaocegege commented Aug 9, 2018

Error log:

INFO:googleapiclient.discovery:URL being requested: POST https://container.googleapis.com/v1/projects/kubeflow-ci/zones/us-east1-d/clusters?alt=json
INFO:root:Creating cluster; project=kubeflow-ci, zone=us-east1-d, name=zrator-presubmit-v2-773-b5ced61-926-5207
ERROR:root:Exception occured creating cluster: <HttpError 500 when requesting https://container.googleapis.com/v1/projects/kubeflow-ci/zones/us-east
1-d/clusters?alt=json returned "Internal error encountered.">, status: 500
Traceback (most recent call last):
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v2-773-b5ced61-926-5207/src/kubeflow/testing/py/kubeflow/testing/util.py", line 168, in
create_cluster
   response = request.execute()
 File "/usr/local/lib/python2.7/dist-packages/oauth2client/_helpers.py", line 133, in positional_wrapper
   return wrapped(*args, **kwargs)
 File "/usr/local/lib/python2.7/dist-packages/googleapiclient/http.py", line 844, in execute
   raise HttpError(resp, content, uri=self.uri)
HttpError: <HttpError 500 when requesting https://container.googleapis.com/v1/projects/kubeflow-ci/zones/us-east1-d/clusters?alt=json returned "Inte
rnal error encountered.">
Traceback (most recent call last):
 File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
   "__main__", fname, loader, pkg_name)
 File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
   exec code in run_globals
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v2-773-b5ced61-926-5207/src/kubeflow/tf-operator/py/deploy.py", line 360, in <module>
   main()
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v2-773-b5ced61-926-5207/src/kubeflow/tf-operator/py/deploy.py", line 356, in main
   args.func(args)
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v2-773-b5ced61-926-5207/src/kubeflow/tf-operator/py/deploy.py", line 139, in setup_clust
er
   util.create_cluster(gke, project, zone, cluster_request)
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v2-773-b5ced61-926-5207/src/kubeflow/testing/py/kubeflow/testing/util.py", line 168, in
create_cluster
   response = request.execute()
 File "/usr/local/lib/python2.7/dist-packages/oauth2client/_helpers.py", line 133, in positional_wrapper
   return wrapped(*args, **kwargs)
 File "/usr/local/lib/python2.7/dist-packages/googleapiclient/http.py", line 844, in execute
   raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 500 when requesting https://container.googleapis.com/v1/projects/kubeflow-ci/zones/us-east1-d/clusters?
alt=json returned "Internal error encountered.">

@gaocegege
Copy link
Member

/retest

@gaocegege
Copy link
Member

gaocegege commented Aug 9, 2018

/cc @jlewi
The test cannot set up the cluster, and I think the problem is from the cluster. Could you please help us to locate it?

@jlewi
Copy link
Contributor

jlewi commented Aug 9, 2018

/assign @ankushagarwal Can you review please?

@johnugeorge
Copy link
Member Author

/retest

@gaocegege
Copy link
Member

It's cool
/lgtm
/approve

@gaocegege
Copy link
Member

@jlewi Do you want to review the PR? If not I think we can merge it.

@jlewi
Copy link
Contributor

jlewi commented Aug 10, 2018

/lgtm
/approve
/hold cancel

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaocegege, jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants