
controller: Separate ps and worker pods #481

Merged
merged 3 commits into kubeflow:v1alpha2 from gaocegege:succeeded on Mar 21, 2018
Conversation

@gaocegege (Member) commented Mar 20, 2018

/assign @ScorpioCPH

We do not separate the PS and worker pods in the controller, which causes bugs when the replicas have succeeded.

If this PR LGTM, I will file another PR for services.

Signed-off-by: Ce Gao gaoce@caicloud.io
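For context, separating the pods per replica type might look like the sketch below. The `Pod` struct, the label key `tf-replica-type`, and `filterPodsForReplicaType` are illustrative assumptions, not the actual controller code in kubeflow/tf-operator:

```go
package main

import "fmt"

// Pod is a simplified stand-in for the Kubernetes Pod type.
type Pod struct {
	Labels map[string]string
}

// filterPodsForReplicaType keeps only the pods whose replica-type label
// matches rtype, so PS and worker pods can be reconciled separately.
// The label key "tf-replica-type" is assumed for illustration.
func filterPodsForReplicaType(pods []Pod, rtype string) []Pod {
	var result []Pod
	for _, p := range pods {
		if p.Labels["tf-replica-type"] == rtype {
			result = append(result, p)
		}
	}
	return result
}

func main() {
	pods := []Pod{
		{Labels: map[string]string{"tf-replica-type": "ps"}},
		{Labels: map[string]string{"tf-replica-type": "worker"}},
		{Labels: map[string]string{"tf-replica-type": "worker"}},
	}
	fmt.Println(len(filterPodsForReplicaType(pods, "worker"))) // 2
	fmt.Println(len(filterPodsForReplicaType(pods, "ps")))     // 1
}
```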



Signed-off-by: Ce Gao <gaoce@caicloud.io>
Signed-off-by: Ce Gao <gaoce@caicloud.io>
@coveralls commented Mar 20, 2018

Coverage Status

Coverage increased (+0.2%) to 59.071% when pulling b17913b on gaocegege:succeeded into cb5b994 on kubeflow:v1alpha2.

Signed-off-by: Ce Gao <gaoce@caicloud.io>
@ScorpioCPH (Member) left a comment

Thanks so much!

@@ -135,7 +136,7 @@ func (tc *TFJobController) reconcilePods(
 	}

 	// Update the active status since we have created -diff pods during the loop.
-	tfjob.Status.TFReplicaStatuses[rtype].Active = int32(len(activePods) - diff)
+	tfjob.Status.TFReplicaStatuses[rtype].Active = expected
@ScorpioCPH (Member):

What if expected is negative?

@gaocegege (Member, Author):

I think it will never happen, since succeeded pods are always fewer than replicas. WDYT?
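A minimal sketch of the invariant being discussed, assuming `expected` is computed as replicas minus succeeded pods (the computation itself is not shown in the quoted diff, and `expectedActive` is a hypothetical name):

```go
package main

import "fmt"

// expectedActive sketches the assumed derivation of the expected count:
// as long as succeeded never exceeds replicas, the result is non-negative.
func expectedActive(replicas, succeeded int32) int32 {
	return replicas - succeeded
}

func main() {
	fmt.Println(expectedActive(3, 1)) // 2 replicas should still be active
	fmt.Println(expectedActive(3, 0)) // 3
}
```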

@ScorpioCPH (Member):

How about using activePods directly to make the code more readable?

@gaocegege (Member, Author):

activePods is outdated, since we have already created or deleted some pods during the loop.

@ScorpioCPH (Member):

But the creation may have failed, so we should only trust the active pods we got in this loop.

@gaocegege (Member, Author):

I think if the creation fails, the function returns an error and the control flow cannot reach here.

@ScorpioCPH (Member):

I mean the creation API returns OK, but the Pod does not run successfully because of a scheduling or resource-limit error; in other words, the Pod is pending, not active.

@gaocegege (Member, Author):

In our definition, pending pods are also active pods 🤔

func IsPodActive(p *v1.Pod) bool {
	return v1.PodSucceeded != p.Status.Phase &&
		v1.PodFailed != p.Status.Phase &&
		p.DeletionTimestamp == nil
}
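The quoted definition can be exercised with simplified stand-ins for the Kubernetes types (the real controller uses k8s.io/api/core/v1; the `PodPhase` constants and `Pod` struct below are reduced to the fields `IsPodActive` reads):

```go
package main

import "fmt"

// PodPhase mimics v1.PodPhase from k8s.io/api/core/v1.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// Pod keeps only the fields IsPodActive inspects.
type Pod struct {
	Phase             PodPhase
	DeletionTimestamp *string // non-nil once deletion has been requested
}

// IsPodActive mirrors the controller's definition quoted above:
// a pod is active unless it has succeeded, failed, or is being deleted.
func IsPodActive(p *Pod) bool {
	return PodSucceeded != p.Phase &&
		PodFailed != p.Phase &&
		p.DeletionTimestamp == nil
}

func main() {
	fmt.Println(IsPodActive(&Pod{Phase: PodPending}))   // true: pending counts as active
	fmt.Println(IsPodActive(&Pod{Phase: PodSucceeded})) // false
}
```

Note that a pending pod satisfies all three conditions, which is the point being made in the thread: activeness here means "not terminated and not being deleted", not "scheduled and running".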

@gaocegege (Member, Author):

I opened a new issue #484

@ScorpioCPH (Member):

SGTM.

@ankushagarwal ankushagarwal removed their request for review March 20, 2018 15:05
@ScorpioCPH (Member) left a comment

/lgtm

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ScorpioCPH

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 280f77c into kubeflow:v1alpha2 Mar 21, 2018
@gaocegege gaocegege deleted the succeeded branch March 21, 2018 05:55

4 participants