
Create Pod instead of Job #344

Merged · merged 1 commit on Mar 5, 2018

Conversation

@ScorpioCPH (Member) commented Jan 24, 2018

This PR is a part of #325:

  • rename jobName() to genName()
  • create Pod instead of Job

TODOs (in another PR):

  • use controller.PodControlInterface and CreatePodsWithControllerRef to create Pod
  • listen to Pod CRUD and update TFJob status as described in #314

@jlewi @gaocegege



@gaocegege (Member) commented Jan 24, 2018

Do we keep the trainer and just use pod instead of job in this PR?

@jlewi (Contributor) commented Jan 29, 2018

Can you explain how you are going to deal with pod terminations? For example, suppose a pod gets terminated because the machine it's running on becomes unavailable. How will this situation be handled?

@gaocegege (Member) commented:

@jlewi

The pod event callback handler should be notified by the apiserver, and we could handle the pod failure there, I think.

@gaocegege (Member) commented:

Hi, @ScorpioCPH

Any progress here?

@ScorpioCPH (Member, Author) commented:

This is dependent on the discussion in #333.

@jlewi (Contributor) commented Feb 10, 2018

Why is this blocked on #333 ?

The discussion in #333 is about tackling TFJobStatus as part of the next version of the TFJob API. That will probably take some time.

Can we do the migration to direct management of pods using the current API?

@ScorpioCPH (Member, Author) commented:

If we just want to replace Job with Pod, I think this PR is enough now.

Maybe we can add the event-driven logic in another PR.

@jlewi (Contributor) commented Feb 10, 2018

When the PR is ready please remove WIP from the title.

@coveralls commented Feb 10, 2018

Coverage Status

Coverage decreased (-0.4%) to 31.333% when pulling be20406 on ScorpioCPH:replace-job-with-pod into bde716e on tensorflow:master.

@ScorpioCPH changed the title from "[WIP] Create Pod instead of Job" to "Create Pod instead of Job" on Feb 10, 2018
@gaocegege (Member) left a comment

```diff
 }
 } else {
-	s.recorder.Eventf(s.Job.job, v1.EventTypeNormal, SuccessfulCreateReason, "Created job: %v", createdJob.Name)
+	s.recorder.Eventf(s.Job.job, v1.EventTypeNormal, SuccessfulCreateReason, "Created Pod: %v", createdPod.Name)
 }
 }
 return nil
```
Member commented:

https://github.com/tensorflow/k8s/pull/344/files#diff-d2bc8c1807fa25d2b911d2d781f48a07R236

```go
err = s.ClientSet.BatchV1().Jobs(s.Job.job.ObjectMeta.Namespace).DeleteCollection(&meta_v1.DeleteOptions{}, options)
```

I think we should update here, too.

Member Author commented:

Yes, thanks, done.

```diff
@@ -359,7 +341,7 @@ func replicaStatusFromPodList(l v1.PodList, name tfv1alpha1.ContainerName) tfv1a
 }
 
 func (s *TFReplicaSet) GetSingleReplicaStatus(index int32) tfv1alpha1.ReplicaState {
-	j, err := s.ClientSet.BatchV1().Jobs(s.Job.job.ObjectMeta.Namespace).Get(s.jobName(index), meta_v1.GetOptions{})
+	j, err := s.ClientSet.BatchV1().Jobs(s.Job.job.ObjectMeta.Namespace).Get(s.genName(index), meta_v1.GetOptions{})
```
Member commented:

I am wondering if we should replace BatchV1().Jobs() with CoreV1().Pods() here.
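
For illustration, a minimal sketch of what that swap could look like, assuming the surrounding replicas.go context (s.ClientSet, s.Job, s.genName) and the client-go signatures of that era; the helper name and the phase handling are illustrative, not code from this PR:

```go
// Sketch: read the Pod (rather than the Job) for replica `index`, mirroring the
// Get call shown in the diff above.
func (s *TFReplicaSet) getReplicaPodPhase(index int32) (v1.PodPhase, error) {
	pod, err := s.ClientSet.CoreV1().Pods(s.Job.job.ObjectMeta.Namespace).Get(s.genName(index), meta_v1.GetOptions{})
	if err != nil {
		return v1.PodUnknown, err
	}
	// Pending / Running / Succeeded / Failed can then be mapped onto a ReplicaState.
	return pod.Status.Phase, nil
}
```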

@jlewi (Contributor) commented Feb 10, 2018

We can probably just delete
https://github.com/tensorflow/k8s/blob/master/examples/gke/TF%20on%20GKE.ipynb
if it's broken.

@jlewi (Contributor) commented Feb 12, 2018

What are the failure/retry semantics?

Before we relied on the Job controller to create new pods as necessary. What happens now?

It looks like genName always creates a Pod with the same name for replica index i. But what happens if a pod is terminated (e.g. a VM restarts)? We would either need to recreate the pod with a new name or delete the current pod before recreating it.

@jlewi (Contributor) commented Feb 12, 2018

/test all

1 similar comment
@jlewi (Contributor) commented Feb 12, 2018

/test all

@gaocegege (Member) commented:

@jlewi We are celebrating Chinese New Year these days, so apologies if we cannot reply promptly.

> What are the failure/retry semantics?

I think it should be handled by the pod restartPolicy.

@jlewi (Contributor) commented Feb 13, 2018

Restart policy only applies to the containers within the pod (see the docs).

So I don't think it helps in the case where the pod itself is terminated; e.g. because the VM becomes unhealthy.
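
To make the distinction concrete, here is a hedged sketch (the pod name, image, and policy value are illustrative, not from this PR): restartPolicy lives in the PodSpec and only tells the kubelet how to restart containers on the node the pod is already bound to.

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	meta_v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podSketch builds an illustrative pod. RestartPolicy only governs how the
// kubelet restarts this pod's containers in place; if the node (VM) hosting the
// pod goes away, nothing in this spec recreates the pod elsewhere.
func podSketch() *v1.Pod {
	return &v1.Pod{
		ObjectMeta: meta_v1.ObjectMeta{Name: "master-0"},
		Spec: v1.PodSpec{
			RestartPolicy: v1.RestartPolicyOnFailure, // container-level semantics only
			Containers: []v1.Container{
				{Name: "tensorflow", Image: "tensorflow/tensorflow:1.5.0"},
			},
		},
	}
}
```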

@gaocegege (Member) commented:

@jlewi

Sorry for the misunderstanding. I think we have to handle it in the operator: register a callback for pods and, when a pod's status changes, do work similar to what the Job controller in Kubernetes does.

@ScorpioCPH (Member, Author) commented Feb 14, 2018

> But what happens if a pod is terminated (e.g. a VM restarts)?
> failure/retry semantics

I think this is a separate, event-driven topic: we should register some EventHandlerFuncs to watch for changes in the Pods' status and take the appropriate action.
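
A self-contained sketch of the kind of EventHandlerFuncs registration being described, using client-go shared informers; the resync period and the enqueueTFJob callback are assumptions for illustration, not code from this PR:

```go
package sketch

import (
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// registerPodHandlers wires pod events back into the operator. enqueueTFJob is a
// hypothetical callback that maps a pod to its owning TFJob and triggers Reconcile.
func registerPodHandlers(clientset kubernetes.Interface, enqueueTFJob func(pod *v1.Pod)) cache.SharedIndexInformer {
	// The resync period makes the informer deliver periodic Update events, so
	// Reconcile still runs even if an individual watch event is dropped.
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			if pod, ok := newObj.(*v1.Pod); ok {
				enqueueTFJob(pod) // pod phase may have changed (Failed, Succeeded, ...)
			}
		},
		DeleteFunc: func(obj interface{}) {
			if pod, ok := obj.(*v1.Pod); ok {
				enqueueTFJob(pod) // pod disappeared, e.g. its node went away
			}
		},
	})
	return podInformer
}
```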

@jlewi (Contributor) commented Feb 14, 2018

Right now because we create Job controllers, we are resilient to unwanted pod terminations. If a pod is terminated (e.g. because a VM goes away) then the job controller will create a new pod to replace it.

So don't we need to provide the same level of reliability? Otherwise it would be a regression?

Why can't we handle this in the reconcile function?
https://github.com/kubeflow/tf-operator/blob/master/pkg/trainer/training.go#L343

Reconcile should be called periodically; so in the reconcile function why can't we check that all the requisite pods are running and if not create new ones?

This seems like a good idea even if we have event handlers. If we rely on event handlers and an event gets dropped, the job would be stuck. But as long as we have logic in Reconcile to get a job back to a healthy state, we can always recover.
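
For what it's worth, a sketch of the reconcile-side check being suggested here, assuming the replica's pods carry the labels returned by s.Labels(), that a hypothetical newPodForReplica helper builds the replacement pod, and the client-go signatures of that era (labels is k8s.io/apimachinery/pkg/labels):

```go
// Sketch: make Reconcile self-healing by listing this replica's pods by label
// and creating a replacement when none is alive. The phase check is simplified.
func (s *TFReplicaSet) reconcileReplicaPod(index int32) error {
	ns := s.Job.job.ObjectMeta.Namespace
	selector := labels.Set(s.Labels()).String() // label selector built from the replica's labels
	podList, err := s.ClientSet.CoreV1().Pods(ns).List(meta_v1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	for _, p := range podList.Items {
		if p.Status.Phase == v1.PodPending || p.Status.Phase == v1.PodRunning || p.Status.Phase == v1.PodSucceeded {
			return nil // a usable pod already exists for this replica
		}
	}
	// No live pod for this replica: create a fresh one so the job can make progress.
	_, err = s.ClientSet.CoreV1().Pods(ns).Create(newPodForReplica(s, index))
	return err
}
```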

@jlewi (Contributor) commented Feb 14, 2018

See #309 for more info about how the informer periodically sends Update events that we can use to trigger reconcile.

@ScorpioCPH (Member, Author) commented:

@jlewi In fact, we currently call the Reconcile() function in the event handler; the code is here.

I can't see any difference between Reconcile() and EventHandler.

@jlewi (Contributor) commented Feb 22, 2018

> I can't see any difference between Reconcile() and EventHandler.

I'm not following.

The lines of code you linked to (https://github.com/kubeflow/tf-operator/blob/master/pkg/controller/controller.go#L123-L125) show that the informer will generate an Update event periodically, which will cause Reconcile to be called periodically.

So if Reconcile creates any missing pods, then the code should be resilient to failures.

But your Reconcile function always uses the same name for every pod it creates; i.e. TFReplicaSet.genName is a deterministic function of replica type, job type, index, and runtime id.

So master-0 would always correspond to some pod name "master-abcd-0".

So suppose we create pod "master-abcd-0" and then this pod is terminated because the VM restarts.
At this point the Reconcile function needs to create a new pod otherwise the job will never make progress. But the Reconcile function right now just calls create using the same name "master-abcd-0" which will return an already exists error.

This worked before when we were creating Job Controllers because the job controller would automatically create a new pod if a pod terminated and wasn't successful.

Now that we are creating the pods ourselves we need to do that.
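
To spell out the failure mode, a sketch under the same assumptions (newPodForReplica is a hypothetical helper; apierrors is k8s.io/apimachinery/pkg/api/errors):

```go
// Sketch: with a deterministic name, recreating a replica's pod fails while the
// old, terminated pod object still exists. The operator must either delete the
// stale pod first or create the replacement under a fresh, randomized name.
func (s *TFReplicaSet) recreateReplicaPod(index int32) error {
	ns := s.Job.job.ObjectMeta.Namespace
	pod := newPodForReplica(s, index) // always the same name for a given index, e.g. "master-abcd-0"
	_, err := s.ClientSet.CoreV1().Pods(ns).Create(pod)
	if apierrors.IsAlreadyExists(err) {
		// Option 1: remove the stale pod, then try again.
		if delErr := s.ClientSet.CoreV1().Pods(ns).Delete(pod.Name, &meta_v1.DeleteOptions{}); delErr != nil {
			return delErr
		}
		_, err = s.ClientSet.CoreV1().Pods(ns).Create(pod)
		// Option 2 (the direction this PR took): append a random suffix to the pod
		// name so each replacement gets a fresh name and never collides.
	}
	return err
}
```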

@ScorpioCPH (Member, Author) commented:

@jlewi Addressed your comments (creating pods with a random UUID & reusing SyncPods/SyncServices), PTAL :)

@jlewi (Contributor) commented Mar 4, 2018

Review status: 0 of 3 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


pkg/trainer/replicas.go, line 420 at r4 (raw file):

```go
		// Label to get all pods of this TFReplicaType + index
		labels := s.Labels()
```

Why not just call GetSingleReplicaStatus and then decide based on the result whether or not to create a new pod?



@jlewi (Contributor) commented Mar 4, 2018

Review status: 0 of 3 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


pkg/trainer/replicas.go, line 510 at r4 (raw file):

```go
func (s *TFReplicaSet) genPodName(index int32) string {
	// Generate a new pod name with random string
	return s.genName(index) + util.RandString(8)
}
```

Nit: please add a "-" between the name and the random string. Also, 5 digits of randomness is what K8s does and that's probably sufficient.
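
For reference, the nit applied to the snippet above might look like this; rand.String from k8s.io/apimachinery/pkg/util/rand is one way to get a 5-character suffix (the PR itself uses its own util.RandString helper):

```go
func (s *TFReplicaSet) genPodName(index int32) string {
	// Deterministic prefix, a "-" separator, then a short random suffix so a
	// recreated pod never collides with a terminated pod of the same replica.
	return s.genName(index) + "-" + rand.String(5)
}
```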



@jlewi (Contributor) commented Mar 4, 2018

A couple of minor comments. The biggest issue is that Travis is still failing. It would be nice to reuse GetSingleReplicaStatus if it makes sense, but I'm fine with the current code if you think it's better.

So fixing Travis is the only blocking change at this point.

@ScorpioCPH force-pushed the replace-job-with-pod branch 2 times, most recently from 61309e8 to 3e60600 on March 5, 2018 at 02:35
@k8s-ci-robot commented Mar 5, 2018

@ScorpioCPH: The following test failed, say /retest to rerun them all:

Test name          Commit    Details   Rerun command
tf-k8s-presubmit   6e1c2b6   link      /test tf-k8s-presubmit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ScorpioCPH (Member, Author) commented:

/retest

@ScorpioCPH (Member, Author) commented:

@jlewi Thanks, I will rethink GetSingleReplicaStatus in v1alpha2 :)

And I think CI is OK now.

@jlewi (Contributor) commented Mar 5, 2018

Review status: 0 of 3 files reviewed at latest revision, 3 unresolved discussions.


pkg/trainer/replicas.go, line 420 at r4 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

> Why not just call GetSingleReplicaStatus and then decide based on the result whether or not to create a new pod?

Per discussion elsewhere, will reconsider this as part of v1alpha2.



@jlewi (Contributor) commented Mar 5, 2018

Reviewed 1 of 2 files at r5.
Review status: 1 of 3 files reviewed at latest revision, 2 unresolved discussions.



@jlewi (Contributor) commented Mar 5, 2018

/lgtm
/approve

@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jlewi (Contributor) commented Mar 5, 2018

@ScorpioCPH Thank you.

@gaocegege (Member) commented:

Reviewed 1 of 2 files at r3, 2 of 2 files at r5.
Review status: all files reviewed at latest revision, 2 unresolved discussions.



@gaocegege (Member) commented:

Could we merge it now? @jlewi

@jlewi (Contributor) commented Mar 5, 2018

@gaocegege Yeah. I was hoping auto-merge would work, but I think it's being blocked by the failed Reviewable status check. I'll merge it manually.

@jlewi jlewi merged commit 6706903 into kubeflow:master Mar 5, 2018
jimexist pushed a commit to jimexist/tf-operator that referenced this pull request Mar 7, 2018
This PR is a part of kubeflow#325:

rename jobName() to genName()
create Pod instead of Job

TODOs (in another PR):

use controller.PodControlInterface and CreatePodsWithControllerRef to create Pod
Listen to Pod CRUD and update TFJob status as described in kubeflow#314
jlewi added a commit to jlewi/k8s that referenced this pull request Mar 23, 2018
* A bug was introduced with getting the replica status in kubeflow#344 which
switched to creating pods directly.

* Our presubmits/postsubmits were failing but this went unnoticed because
the git status check was improperly reported as succeeded.

* The bug is because we try to get the pod status by name but the name
doesn't include the random salt in the pod name.

* The code in question is a legacy of when we were using job controllers and
we first got the status of the job controller. We incorrectly changed that
code to get the pod. The correct thing is to just list pods by label; we
already do that in the code below so we just need to delete some code.

* Fix kubeflow#500
jlewi added a commit to jlewi/k8s that referenced this pull request Mar 25, 2018
* A bug was introduced with getting the replica status in kubeflow#344 which
switched to creating pods directly.

* Our presubmits/postsubmits were failing but this went unnoticed because
the git status check was improperly reported as succeeded.

* The bug is because we try to get the pod status by name but the name
doesn't include the random salt in the pod name.

* The code in question is a legacy of when we were using job controllers and
we first got the status of the job controller. We incorrectly changed that
code to get the pod. The correct thing is to just list pods by label; we
already do that in the code below so we just need to delete some code.

* Don't create any resources if the DeletionTimestamp is set.
  Creating resources at this point would end up blocking deletion of the object
  because the controller would create resources while we are trying to delete
  them.

* Use logrus in controller.go, trainer.go, and replicas.go to log
  with fields providing information about the job and replica.
  This makes it easy to filter logs for a particular job.

* Use logrus to log the name of the job in a field.
k8s-ci-robot pushed a commit that referenced this pull request Mar 26, 2018
* Fix bug with jobs not being marked as completed.

* A bug was introduced with getting the replica status in #344 which
switched to creating pods directly.

* Our presubmits/postsubmits were failing but this went unnoticed because
the git status check was improperly reported as succeeded.

* The bug is because we try to get the pod status by name but the name
doesn't include the random salt in the pod name.

* The code in question is a legacy of when we were using job controllers and
we first got the status of the job controller. We incorrectly changed that
code to get the pod. The correct thing is to just list pods by label; we
already do that in the code below so we just need to delete some code.

* Don't create any resources if the DeletionTimestamp is set.
  Creating resources at this point would end up blocking deletion of the object
  because the controller would create resources while we are trying to delete
  them.

* Use logrus in controller.go, trainer.go, and replicas.go to log
  with fields providing information about the job and replica.
  This makes it easy to filter logs for a particular job.

* Use logrus to log the name of the job in a field.

* Checking the deletion timestamp doesn't appear to be sufficient.

Use the Phase to determine whether we should create resources.

* Run gofmt.

* * Reset the rate limiter after every successful sync.
* Otherwise the ratelimiter will end up delaying processing subsequent
  events which isn't what we want.

* Run goimports to fix lint issues.

* * Reconcile needs to update the TFJob stored in TrainingJob. This ensures
  TrainingJob has an up to date representation of the job.

* Otherwise changes made to the spec won't be available to TrainingJob. For
  example, if the job is deleted by the user, the deletion timestamp will
  be set. But if we don't update the TFJob stored in TrainingJob this
  change won't be propagated.

* * TrainingJob.update should log the value of the job not the pointer.

* Add more comments to the code.
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this pull request Jul 9, 2018