Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

controller: Refactor controller_pod #548

Merged
merged 9 commits into from
Apr 25, 2018
Merged

controller: Refactor controller_pod #548

merged 9 commits into from
Apr 25, 2018

Conversation

gaocegege
Copy link
Member

@gaocegege gaocegege commented Apr 20, 2018

Signed-off-by: Ce Gao gaoce@caicloud.io


This change is Reviewable

@coveralls
Copy link

coveralls commented Apr 20, 2018

Coverage Status

Coverage decreased (-1.0%) to 49.4% when pulling d83e7fb on gaocegege:slice into 60d25c4 on kubeflow:master.

@gaocegege gaocegege changed the title WIP: Refactor controller_pod controller: Refactor controller_pod Apr 24, 2018
@gaocegege
Copy link
Member Author

/assign @ScorpioCPH @jlewi

@ankushagarwal ankushagarwal removed their request for review April 24, 2018 03:50
Copy link
Member

@ScorpioCPH ScorpioCPH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I like this function getPodSlices() :)

}

func getPodSlices(pods []*v1.Pod, replicas int, logger *log.Entry) [][]*v1.Pod {
podSlices := make([][]*v1.Pod, 0)
logger.Infof("%v", pods)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log maybe too verbose.

@@ -212,9 +198,9 @@ func getPodSlices(pods []*v1.Pod, replicas int, logger *log.Entry) [][]*v1.Pod {
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we break here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I rebased the master and can not see it anymore, could you have another review and point it out?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has already been addressed.

}

func getPodSlices(pods []*v1.Pod, replicas int, logger *log.Entry) [][]*v1.Pod {
podSlices := make([][]*v1.Pod, 0)
logger.Infof("%v", pods)
podSlices := make([][]*v1.Pod, replicas)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about use a map for podSlices, like this:

podSlices := make(map[int64][]*v1.Pod, replicas)

}

func getPodSlices(pods []*v1.Pod, replicas int, logger *log.Entry) [][]*v1.Pod {
podSlices := make([][]*v1.Pod, 0)
logger.Infof("%v", pods)
podSlices := make([][]*v1.Pod, replicas)
for _, pod := range pods {
if _, ok := pod.Labels[tfReplicaIndexLabel]; !ok {
logger.Warning("The pod do not have the index label.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

continue here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

// TODO(CPH): Need to delete pods.
loggerForTFJob(tfjob).Infof("need to delete pod but it is not implemented yet")
// We already have one, and check if it is succeede or something else.
// pod := podSlice[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What this line used for?

Signed-off-by: Ce Gao <gaoce@caicloud.io>
Signed-off-by: Ce Gao <gaoce@caicloud.io>
Signed-off-by: Ce Gao <gaoce@caicloud.io>
Signed-off-by: Ce Gao <gaoce@caicloud.io>
Signed-off-by: Ce Gao <gaoce@caicloud.io>
@gaocegege
Copy link
Member Author

Addressed.

Signed-off-by: Ce Gao <gaoce@caicloud.io>
Signed-off-by: Ce Gao <gaoce@caicloud.io>
@jlewi
Copy link
Contributor

jlewi commented Apr 24, 2018

nit; it would be great if you could add a more descriptive PR description. If its not making substantive changes I'm happy to leave it to you @ScorpioCPH

@jlewi
Copy link
Contributor

jlewi commented Apr 24, 2018

Review status: 0 of 3 files reviewed at latest revision, 5 unresolved discussions.


pkg/controller.v2/controller_pod.go, line 189 at r2 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

This log maybe too verbose.

IME we should log sufficient information to debug what the job did so we can troubleshoot. If we use the json log entry with appropriate keys; hopefully we can filter logs pretty easily.


pkg/controller.v2/controller_pod.go, line 46 at r6 (raw file):

	pods = filterPodsForTFReplicaType(pods, rt)
	succeeded, failed := getPodStatus(pods)
	runningPods := filterRunningPods(pods)

Should you compute succeeded and running pods in the loop below because you want to take into account the fact that a replica could have more than 1 pod.

Why not use the for loop below to compute a map

replicaIndex -> status
and then pass that to update status?


pkg/controller.v2/controller_pod.go, line 53 at r6 (raw file):

	for index, podSlice := range podSlices {
		if len(podSlice) > 1 {
			loggerForTFJob(tfjob).Warning("We have to many pods for the worker %d", index)

More than 1 pod might not be an error. If there is 1 running pod and the rest are not running that might be fine. Might be worth noting that in the TODO as part of fixing this later.


pkg/controller.v2/controller_pod.go, line 57 at r6 (raw file):

		}
		if len(podSlice) == 0 {
			loggerForTFJob(tfjob).Infof("need to create new pod: %s-%d", rt, index)

Would be good to log this message with keys corresponding to replica type and index. This way we can easily filter log messages to see what we did/didn't do for a particular replica.


pkg/controller.v2/controller_pod.go, line 138 at r6 (raw file):

}

func getPodSlices(pods []*v1.Pod, replicas int, logger *log.Entry) [][]*v1.Pod {

Please add a comment for getPodSlices; even thought its a private function it well help future readers of the code.


Comments from Reviewable

@gaocegege
Copy link
Member Author

pkg/controller.v2/controller_pod.go, line 138 at r6 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Please add a comment for getPodSlices; even thought its a private function it well help future readers of the code.

OK, SGTM


Comments from Reviewable

@gaocegege
Copy link
Member Author

pkg/controller.v2/controller_pod.go, line 189 at r2 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

IME we should log sufficient information to debug what the job did so we can troubleshoot. If we use the json log entry with appropriate keys; hopefully we can filter logs pretty easily.

I added such statement for debug and removed now. I think the log in this function is enough to help us locate the errors. WDYT


Comments from Reviewable

@gaocegege
Copy link
Member Author

pkg/controller.v2/controller_pod.go, line 190 at r2 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

How about use a map for podSlices, like this:

podSlices := make(map[int64][]*v1.Pod, replicas)

I considered it once but I think the index is in a range then I think slice is easier to understand.


Comments from Reviewable

@gaocegege
Copy link
Member Author

:lgtm:


Comments from Reviewable

@gaocegege
Copy link
Member Author

pkg/controller.v2/controller_pod.go, line 53 at r6 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

More than 1 pod might not be an error. If there is 1 running pod and the rest are not running that might be fine. Might be worth noting that in the TODO as part of fixing this later.

I think it is an error since we expect that only one pod for one replica index is created.


Comments from Reviewable

@gaocegege
Copy link
Member Author

pkg/controller.v2/controller_pod.go, line 46 at r6 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Should you compute succeeded and running pods in the loop below because you want to take into account the fact that a replica could have more than 1 pod.

Why not use the for loop below to compute a map

replicaIndex -> status
and then pass that to update status?

Good idea, I will update it.


Comments from Reviewable

@gaocegege
Copy link
Member Author

pkg/controller.v2/controller_pod.go, line 57 at r6 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Would be good to log this message with keys corresponding to replica type and index. This way we can easily filter log messages to see what we did/didn't do for a particular replica.

OK, will do it.


Comments from Reviewable

@gaocegege
Copy link
Member Author

pkg/controller.v2/controller_pod.go, line 101 at r2 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

What this line used for?

Removed.


Comments from Reviewable

Copy link
Member

@ScorpioCPH ScorpioCPH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with some minor nits.

@@ -0,0 +1,66 @@
package controller
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add copyright here :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for that. I will add it soon.

}
}

// All workers are running, set StartTime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

Signed-off-by: Ce Gao <gaoce@caicloud.io>
Signed-off-by: Ce Gao <gaoce@caicloud.io>
@gaocegege
Copy link
Member Author

PTAL @ScorpioCPH

@yph152
Copy link
Contributor

yph152 commented Apr 25, 2018

 else {
 +			// We already have one, and check the status.
 +			pod := podSlice[0]
 +			updateTFJobReplicaStatuses(tfjob, rtype, pod)
  		}		  		}
...
func updateTFJobReplicaStatuses(tfjob *tfv1alpha2.TFJob, rtype tfv1alpha2.TFReplicaType, pod *v1.Pod) {
 +	switch pod.Status.Phase {
 +	case v1.PodRunning, v1.PodPending:
 +		tfjob.Status.TFReplicaStatuses[rtype].Active++
 +	case v1.PodSucceeded:
 +		tfjob.Status.TFReplicaStatuses[rtype].Succeeded++
 +	case v1.PodFailed:
 +		tfjob.Status.TFReplicaStatuses[rtype].Failed++
 +	}
 +}

@gaocegege If the state of pod continues to change, it will always be restarted, and the state of
running --> failed --> running --> failed..., These values are going to get bigger all the time.

@yph152
Copy link
Contributor

yph152 commented Apr 25, 2018

thanks,I understand!

@ScorpioCPH
Copy link
Member

/lgtm
/approve

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ScorpioCPH

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 07fd2d6 into kubeflow:master Apr 25, 2018
@gaocegege gaocegege deleted the slice branch April 25, 2018 07:50
yph152 pushed a commit to yph152/tf-operator that referenced this pull request Jun 18, 2018
* controller_pod: refactor

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* controller: Add test case

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* WIP

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* controller_pod: Separate creation and status update

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* controller: Add continue

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* controller: Remove old code

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* controller: Remove log

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* controller: Update

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* controller_status: Add copyright header

Signed-off-by: Ce Gao <gaoce@caicloud.io>
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this pull request Jul 9, 2018
* controller_pod: refactor

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* controller: Add test case

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* WIP

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* controller_pod: Separate creation and status update

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* controller: Add continue

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* controller: Remove old code

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* controller: Remove log

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* controller: Update

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* controller_status: Add copyright header

Signed-off-by: Ce Gao <gaoce@caicloud.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants