refactor the TfJob to use Informer and Controller #206

wackxu · 2017-12-08T03:29:40Z

Kubernetes provides many useful tools that we can use to develop our project, such as sharedInformer, clientSet, lister and so on. Using these make our project more readable and good performance. and we also need versioned the API and it is easy for us to implement backwards compatible and make the code more specification.
we can refactor the project similar with example https://github.com/kubernetes/sample-controller.
I have done something work. How do you think about it? @jlewi

jlewi · 2017-12-08T18:27:18Z

In general, I'd much prefer to reuse existing well maintained libraries then custom code. So I'm all for replacing custom code with a more established library.

Are their specific changes you are thinking of making?

gaocegege · 2017-12-09T03:49:48Z

Hi we(caicloud) are implementing the controller for kubeflow which is similar to kubernetes/sample-controller. And we are glad to file a proposal for the new design soon.

wackxu · 2017-12-09T12:05:13Z

@gaocegege I am so sorry that I have almost done it and I will commit a PR ASAP

wackxu · 2017-12-09T12:39:08Z

@jlewi We can make the change step by step,The first phases main include follows:
1、refactor the spec package to the kubernetes standard API, make the CRD definition versioned and it is easy for us to maintenance and make the code backwards compatible.
2、import code-generator to help us generate sharedInformer, lister, clientSet and so on.Then we will use these to refactor the controller.
3、Use the sharedInformer to help us list and watch CRDs, It can cache the CRD data, so it is more Efficient and It can reduce the code line number and more readable.
4、replace the client to use the generate client and so on.

ddysher · 2017-12-09T13:08:26Z

@wackxu nice work!

Are you targeting at refactoring the code base only, or do you also plan to change the API?

jlewi · 2017-12-09T19:41:47Z

That's awesome. Are you planning on submitting a single PR or splitting it up into multiple PRs?

…

On Dec 9, 2017 7:08 AM, "Deyuan Deng" ***@***.***> wrote: @wackxu <https://github.com/wackxu> nice work! Are you targeting at refactoring the code base only, or do you also plan to change the API? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#206 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAvcA0foys2IdSOYNPkxd8K8i0rUHvSrks5s-oZLgaJpZM4Q6iaY> .

gaocegege · 2017-12-10T09:07:38Z

@wackxu It doesn't matter, it is awesome 😄

fix #206 1.refactor the spec package to the kubernetes standard API, all detention of our CRD is in pkg pkg/apis/tensorflow and also the validation、 set Default for the CRD and some help func. 2.for use K8s code gen, wo import the project https://github.com/kubernetes/code-generator, and some shell script in the folder hack. You can see how to use it in https://blog.openshift.com/kubernetes-deep-dive-code-generation-customresources/ 3.All the file in pkg/client are all auto generated and we should not modify it manually. 4.I do not plan to change the logical to use the sharedInformer and clientset. This PR is big enough, I think I can do it in a follow PR. 5.SharedInformer is a helpful tool with cache that we can use it to list and watch our resources. More details you can see http://borismattijssen.github.io/articles/kubernetes-informers-controllers-reflectors-stores

jlewi · 2018-01-03T05:44:12Z

Reopening this because it was autoclosed but I think there are PRs still pending related to this.

gaocegege · 2018-01-03T09:22:08Z

#234 refactors the logic but there are some other things should be done to migrate to the new controller design.

We implemented a simple controller https://github.com/caicloud/kubeflow-controller and glad to help the upstream to do the rest work.

jlewi · 2018-01-03T13:33:31Z

@gaocegege Merging the caicloud work seems like a great idea; I think the caiacloud controller has many advantages over the current implementation. A couple questions/comments

https://github.com/caicloud/kubeflow-controller seems to distinguish between a local job and non local job. What's the difference?
How does the caicloud controller determine when a job is done?
Could you explain the proposed design? Here are some questions, but I think it might better if you could write up an explanation that could eventually be included in our docs as opposed to answering the questions directly.
- It looks like its event driven. Can you explain this? For example is the controller handling events for the sub resources (e.g. pods and services) and not just TfJob events?
- It looks like events are created by comparing current status to expectations. Can you explain how this works in terms of deciding whether a pod needs to be created?
- How do you deal with the fact that that workers and pods are stateful?
Looks like the caicloud controller is directly managing pods. I think that is probably better than using job controllers (see Consider how we manage replicas (stateful sets, managing pods directly) #45).
Looks like caicloud controller is publishing Kubernetes events which is definitely something we should do

wackxu · 2018-01-04T01:48:22Z

@jlewi @gaocegege caicloud kubeflow-controller is similar with kubernetes job controller, It is a better implementation. I think we can do it in the next PR and I am very glad to help review it. The main change is so big and I think we should first merge #234, #234 is mainly change the structure of controller and cmd pkg to keep style with kubernetes controller. The next PR we can fork on the reconcile logic like caicloud kubeflow-controller .

ghost · 2018-01-04T03:14:29Z

@jlewi @wackxu

Thanks for your review, I will write a design doc ASAP and file an issue to discuss the differences between caicloud controller and the upstream controller. And I agree with @wackxu that we should focus on #234 and keep moving.

https://github.com/caicloud/kubeflow-controller seems to distinguish between a local job and non-local job. What's the difference?

There are some changes in caicloud TFJob CRD, and @DjangoPeng is summarising them and he will file an issue to discuss it.

How does the caicloud controller determine when a job is done?

Now it is simple, the controller checks the statuses of the workers, if all workers are succeeded, we think the TFJob is succeeded.

Could you explain the proposed design? Here are some questions, but I think it might better if you could write up an explanation that could eventually be included in our docs as opposed to answering the questions directly.

Yeah, I will write a design doc and post in this issue.

Looks like the caicloud controller is directly managing pods. I think that is probably better than using job controllers (see #45).

We have an internal discussion, too in https://github.com/caicloud/kubeflow-controller/issues/71. Now we have not a conclusion and welcome comments :-)

Personally, I think if most of the use cases could be handled by job, then job is better since we do not have to manage all the lifecycle of the pods, which is implemented in job controller.

Looks like caicloud controller is publishing Kubernetes events which is definitely something we should do

Yeah, and I think @zjj2wry are implementing the feature for tensorflow/k8s: #260.

Now we have limited support in caicloud controller and the recorder records the creation of pods and services.

For example, events when we create 4 workers and 2 PS:

Events:
  Type    Reason            Age                From                 Message
  ----    ------            ----               ----                 -------
  Normal  SuccessfulCreate  33s                kubeflow-controller  Created service: dist-training-job-worker-0-d88wg
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created service: dist-training-job-worker-1-hwl26
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created service: dist-training-job-worker-2-jx7g7
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created service: dist-training-job-worker-3-gdb66
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created service: dist-training-job-ps-0-nlfdp
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created service: dist-training-job-ps-1-jw7c4
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created pod: dist-training-job-2j9br
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created pod: dist-training-job-pm85l
  Normal  SuccessfulCreate  31s                kubeflow-controller  Created pod: dist-training-job-5rgvh
  Normal  SuccessfulCreate  29s (x3 over 31s)  kubeflow-controller  (combined from similar events): Created pod: dist-training-job-n48d7

And we could support more events if we need.

DjangoPeng · 2018-01-04T03:35:02Z

@jlewi Thanks for your reply. I'd like to give a supplement.

There are some changes in caicloud TFJob CRD, and @DjangoPeng is summarising them and he will file an issue to discuss it.

In caicloud Clever platform, we support not only TensorFlow jobs but also batch job, service and others. Considering the open source work, we'll eventually combine TFJob CRD of ourselves and upstream. It's necessary to file an issue to discuss the details, and I'll do it after my company work.

How does the caicloud controller determine when a job is done?

The design and lifecycle of local job is determinate and easy-to-understand, because local job is similar with single batch job. But distributed job is more complicated and the status is up to user. So, we have to discuss the strategy and logic.

Looks like the caicloud controller is directly managing pods. I think that is probably better than using job controllers (see #45).

We're discussing the selection. Personally speaking, I prefer Pod. Ref: https://github.com/caicloud/kubeflow-controller/issues/71#issuecomment-355056365

gaocegege · 2018-01-05T06:30:37Z

https://github.com/gaocegege/kubeflow-controller/blob/a54021d250c6e532f3a6e32be32e37e1249c0780/docs/design_doc.md#implementation

I added some explanation about the implementation, and as wackxu@ said it is simliar to job controller in Kubernetes.

Before #234 merged, I will fix some issues in our own implementation. And I will file a PR to contribute the implementation to the upstream after #234 if you think it is generally acceptable :-)

jlewi · 2018-01-05T15:12:09Z

In caicloud Clever platform, we support not only TensorFlow jobs but also batch job, service and others. Considering the open source work, we'll eventually combine TFJob CRD of ourselves and upstream. It's necessary to file an issue to discuss the details, and I'll do it after my company work.

Are you planning on using a single CRD or having different CRDs for different types of jobs?

@DjangoPeng so is your plan to upstream to tensorflow/k8s any changes/improvements needed by Caicloud and then just use the TfJob controller in tensorflow/k8s? Do you forsee a need to maintain a downstream fork in caicloud/kubeflow-controller long term?

gaocegege · 2018-01-10T06:29:19Z

@jlewi

Sorry for the late reply.

We hope we could use the upstream controller eventually. And we will discuss about the changes and try to merge the changes in CRD into the upstream, or we will rethink about the CRD spec design of ours if we can not come to an agreement.

As README in caicloud/kubeflow-controller said, it is just temporary. And we hope the controller could help to accelerate the development of tensorflow/k8s.

jlewi · 2018-01-10T11:29:26Z

Sounds great thanks.

jlewi · 2018-01-25T13:44:11Z

Is there more work to be done to use Informer and Controller?

gaocegege · 2018-01-25T14:01:13Z

I think the work is done, and other things are in other issues, e.g. #314

gaocegege · 2018-02-05T15:16:33Z

I am closing the issue since we have already finished it. Thanks for your awesome contribution again @wackxu

jlewi changed the title ~~refactor the k8s project~~ refactor the TfJob to use K8s libraries Dec 12, 2017

wackxu mentioned this issue Dec 12, 2017

Refactor the TfJob to use K8s libraries #215

Merged

jlewi changed the title ~~refactor the TfJob to use K8s libraries~~ refactor the TfJob to use Informer and Controller Dec 12, 2017

jlewi assigned wackxu Dec 12, 2017

jlewi mentioned this issue Dec 14, 2017

Use conditions instead of phase #223

Closed

jlewi closed this as completed in #215 Dec 29, 2017

jlewi reopened this Jan 3, 2018

gaocegege mentioned this issue Jan 11, 2018

[discussion] Differences between tensorflow/k8s and caicloud/kubeflow-controller #283

Closed

6 tasks

This was referenced Jan 12, 2018

[Discussion] Operators vs. controller pattern #300

Closed

TrainingJob.reconcile not called periodically #309

Closed

jlewi added this to the Kubecon Europe milestone Jan 25, 2018

gaocegege added kind/feature area/operator labels Feb 1, 2018

gaocegege closed this as completed Feb 5, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor the TfJob to use Informer and Controller #206

refactor the TfJob to use Informer and Controller #206

wackxu commented Dec 8, 2017

jlewi commented Dec 8, 2017

gaocegege commented Dec 9, 2017

wackxu commented Dec 9, 2017

wackxu commented Dec 9, 2017

ddysher commented Dec 9, 2017

jlewi commented Dec 9, 2017 via email

gaocegege commented Dec 10, 2017

jlewi commented Jan 3, 2018

gaocegege commented Jan 3, 2018 •

edited

Loading

jlewi commented Jan 3, 2018

wackxu commented Jan 4, 2018

ghost commented Jan 4, 2018 •

edited by ghost

Loading

DjangoPeng commented Jan 4, 2018

gaocegege commented Jan 5, 2018 •

edited

Loading

jlewi commented Jan 5, 2018 •

edited

Loading

gaocegege commented Jan 10, 2018 •

edited

Loading

jlewi commented Jan 10, 2018

jlewi commented Jan 25, 2018

gaocegege commented Jan 25, 2018

gaocegege commented Feb 5, 2018

refactor the TfJob to use Informer and Controller #206

refactor the TfJob to use Informer and Controller #206

Comments

wackxu commented Dec 8, 2017

jlewi commented Dec 8, 2017

gaocegege commented Dec 9, 2017

wackxu commented Dec 9, 2017

wackxu commented Dec 9, 2017

ddysher commented Dec 9, 2017

jlewi commented Dec 9, 2017 via email

gaocegege commented Dec 10, 2017

jlewi commented Jan 3, 2018

gaocegege commented Jan 3, 2018 • edited Loading

jlewi commented Jan 3, 2018

wackxu commented Jan 4, 2018

ghost commented Jan 4, 2018 • edited by ghost Loading

DjangoPeng commented Jan 4, 2018

gaocegege commented Jan 5, 2018 • edited Loading

jlewi commented Jan 5, 2018 • edited Loading

gaocegege commented Jan 10, 2018 • edited Loading

jlewi commented Jan 10, 2018

jlewi commented Jan 25, 2018

gaocegege commented Jan 25, 2018

gaocegege commented Feb 5, 2018

gaocegege commented Jan 3, 2018 •

edited

Loading

ghost commented Jan 4, 2018 •

edited by ghost

Loading

gaocegege commented Jan 5, 2018 •

edited

Loading

jlewi commented Jan 5, 2018 •

edited

Loading

gaocegege commented Jan 10, 2018 •

edited

Loading