[Request] MXNet operator support #136

Closed
gaocegege opened this issue May 22, 2018 · 22 comments

Comments

@gaocegege
Member

gaocegege commented May 22, 2018

We now support TensorFlow and PyTorch well. MXNet is another popular ML framework, and I think we should implement an operator for it to attract more DL practitioners.

We need to create a proposal like https://github.com/kubeflow/community/blob/master/proposals/pytorch-operator-proposal.md and create the repository for the operator.

References

/cc @brucechin

@suleisl2000
Contributor

TuSimple has just written a draft version of an MXNet operator. We'd like to contribute it to the Kubeflow community as part of the ecosystem and continue improving it in the future. In addition to better deep learning lifecycle management, we plan to integrate the operator with kube-arbitrator for better job scheduling. Is there a process I need to follow to make this happen? @jlewi

BTW, TuSimple is one of the major contributors in MXNet community.

/cc @k82cn @jzp1025 @jjmtraveller

@jlewi
Contributor

jlewi commented Aug 13, 2018

@suleisl2000 that's a fantastic offer.

Would you and TuSimple be willing to continue supporting the operator and making the changes (see below) to better integrate it into Kubeflow?

Assuming we go forward with integrating this into Kubeflow, there are a couple of things we should do:

  1. Add a ksonnet package (see here)
  2. Add an mxnet guide to our website [here](https://github.com/kubeflow/website/tree/master/content/docs/guides/components)
  3. Figure out where the code should live
  4. Figure out what changes are needed to provide an API consistent with other operators; see "v1alpha2 pytorch API should try to be consistent with TFJob" (pytorch-operator#49)

@suleisl2000 @johnugeorge is currently working on refactoring the PyTorchJob and TFJob operators so that they can share implementation; see for example kubeflow/training-operator#773. The current thinking is that we should have a different CRD for each type of job, but the underlying implementation should be shared.

As part of this there is ongoing discussion about whether we should use separate repos or move all the code into a single operator.

@gaocegege @johnugeorge what do you think? Should we put the code in its own repository or should we add it to tf-operator to make it easier to start refactoring to use a shared implementation?

@johnugeorge
Member

@suleisl2000 Great to see this effort. Tf-operator is currently being refactored to share the common code with future operators. Operators will have their own CRDs (at least for now) while sharing all common code. For now, we have decided to use tf-operator as the central repo to hold all operators, and we plan to rename the repo in the future. I will raise an initial PR for the PyTorch operator (v1alpha2) in a week. You can use that as a reference.

@jlewi
Since we already have a shared implementation now, I think we should use it; otherwise effort will be duplicated. We can plan a v1alpha2 version of https://github.com/TuSimple/mxnet-operator that uses the shared implementation in the tf-operator repo.
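
To make the "one CRD per job type, shared underlying implementation" idea from the comments above concrete, here is a minimal Go sketch. It is not the actual tf-operator code: the `Job` interface, the `ReplicaSpec` type, the `PodNames` helper, and the pod naming scheme are all hypothetical names chosen for illustration, and the replica roles are only examples.

```go
package main

import (
	"fmt"
	"sort"
)

// ReplicaSpec is a hypothetical, simplified description of one replica role.
type ReplicaSpec struct {
	Replicas int
}

// Job is a hypothetical interface that each job type (one CRD each)
// would implement so that the shared code can stay framework-agnostic.
type Job interface {
	Kind() string
	Name() string
	ReplicaSpecs() map[string]ReplicaSpec
}

// TFJob and MXJob stand in for two separate CRD types.
type TFJob struct {
	JobName string
	Specs   map[string]ReplicaSpec
}

func (j TFJob) Kind() string                         { return "TFJob" }
func (j TFJob) Name() string                         { return j.JobName }
func (j TFJob) ReplicaSpecs() map[string]ReplicaSpec { return j.Specs }

type MXJob struct {
	JobName string
	Specs   map[string]ReplicaSpec
}

func (j MXJob) Kind() string                         { return "MXJob" }
func (j MXJob) Name() string                         { return j.JobName }
func (j MXJob) ReplicaSpecs() map[string]ReplicaSpec { return j.Specs }

// PodNames is the shared piece: it works for any Job and returns the pod
// names the job would need, ordered by role and then by replica index.
func PodNames(j Job) []string {
	specs := j.ReplicaSpecs()
	roles := make([]string, 0, len(specs))
	for role := range specs {
		roles = append(roles, role)
	}
	sort.Strings(roles) // map iteration order is random, so sort for stable output

	var names []string
	for _, role := range roles {
		for i := 0; i < specs[role].Replicas; i++ {
			names = append(names, fmt.Sprintf("%s-%s-%d", j.Name(), role, i))
		}
	}
	return names
}

func main() {
	jobs := []Job{
		TFJob{JobName: "tf-mnist", Specs: map[string]ReplicaSpec{
			"ps": {Replicas: 1}, "worker": {Replicas: 2},
		}},
		MXJob{JobName: "mx-mnist", Specs: map[string]ReplicaSpec{
			"scheduler": {Replicas: 1}, "server": {Replicas: 1}, "worker": {Replicas: 2},
		}},
	}
	for _, j := range jobs {
		fmt.Println(j.Kind(), PodNames(j))
	}
}
```

In this sketch, adding a new framework means adding one more type that satisfies `Job`, while the reconciliation-style logic in `PodNames` is written once.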

@suleisl2000
Contributor

@jlewi @johnugeorge Thanks for your replies. We are glad to integrate it into Kubeflow, and we will take the following actions:

  1. Investigate v1alpha2 version to use shared implementation mentioned by @johnugeorge
  2. Finish TODO items listed by @jlewi

/cc @jzp1025 @jjmtraveller

@jlewi
Contributor

jlewi commented Aug 16, 2018

SGTM. I think if we do items 1 (ksonnet package) and 2 (docs), people can start using it and giving feedback. As long as we mark it "experimental" to indicate that it is in flux and subject to change, I don't see a strong need to block on resolving if/how best to integrate into the shared implementation.

@suleisl2000
Contributor

@jlewi That's great. We are cleaning up the code and will create a PR once it is done.

@suleisl2000
Contributor

I have created PRs for the ksonnet package and docs. However, I am not sure how to create a PR for the operator itself. Would you mind giving some guidance? @jlewi @johnugeorge

suleisl2000 added a commit to suleisl2000/kubeflow that referenced this issue Aug 22, 2018
suleisl2000 added a commit to suleisl2000/kubeflow that referenced this issue Aug 22, 2018
@jlewi
Contributor

jlewi commented Aug 22, 2018

What's the question about the operator? Is it just about where to put the code? Should we create a new repository for this?

@suleisl2000
Contributor

suleisl2000 commented Aug 22, 2018

Yes, that's my question. Currently, we have the repository at https://github.com/TuSimple/mxnet-operator. I think it is better to move the code into https://github.com/kubeflow/mxnet-operator, just like the other operators. If so, who has permission to create the repository? @jlewi

@johnugeorge
Member

johnugeorge commented Aug 22, 2018 via email

suleisl2000 added a commit to suleisl2000/kubeflow that referenced this issue Aug 23, 2018
@gaocegege
Member Author

I think we could create a new repo. A repo transfer requires ownership in both the Kubeflow and TuSimple orgs.

@jlewi
Contributor

jlewi commented Aug 23, 2018

I think a new repo is better, just so there is a PR indicating that the CLA has been signed and the code has been contributed.

jlewi added a commit to kubeflow/mxnet-operator that referenced this issue Aug 23, 2018
Create a new repository for the mxnet operator and add @suleisl2000  as an owner since he will be working on it.
Related to kubeflow/community#136
@jlewi
Contributor

jlewi commented Aug 23, 2018

@suleisl2000 I have created the repo mxnet-operator.
I also sent you an invite to the org; please accept it.

You'll need to do the following:

  1. Follow the instructions here to set up Prow
  2. Add other reviewers/approvers to the OWNERS file as necessary (they will need to request invites to the Kubeflow org for that to actually work)
  3. Please submit a PR adding yourself to members.yaml
    https://github.com/kubeflow/community/blob/master/members.yaml
    so that we know how to reach you
    • All the repo approvers should also be listed there; everyone should submit a PR for themselves so it's clear we have their permission to include them in the repo.
  4. You should be able to submit and approve PRs to the repo

Let's leave this issue open until the above items are completed and the repo is fully set up.

@idibidiart

Really eager to try the MXNet operator in Kubeflow. Is it ready in master? @suleisl2000

@suleisl2000
Contributor

There is a little bit of merge work left to do. I will try to finish it early next week. Thank you for your interest. @idibidiart

@suleisl2000
Contributor

suleisl2000 commented Aug 27, 2018

@jlewi Please invite @jzp1025 to the org as an mxnet-operator reviewer. BTW, it looks like I don't have access to do the "Add the ci-bots team to the repository with write access" step indicated in item 1.

k8s-ci-robot pushed a commit to kubeflow/website that referenced this issue Aug 27, 2018
k8s-ci-robot pushed a commit to kubeflow/mxnet-operator that referenced this issue Aug 29, 2018
* add prow setup config (kubeflow/community#136)

* copy all files from tusimple/mxnet-operator to here

* change to mxnet-operator
@jlewi
Contributor

jlewi commented Sep 3, 2018

@suleisl2000 I already added the ci-bots; sorry for not making that clear.

I sent @jzp1025 a GitHub invite.

@gaocegege
Member Author

Could we close the issue? I think the mxnet operator repository has been set up.

@idibidiart

Hi,

Has it been merged with master? Or is it a separate repo?

Also, is there a Readme for using MXNet with Kubeflow?

@gaocegege
Member Author

gaocegege commented Sep 4, 2018

@idibidiart

Fantastic. Thank you to all who helped with this.

yutongz pushed a commit to yutongz/k8s-test-infra that referenced this issue Sep 12, 2018
yutongz pushed a commit to yutongz/k8s-test-infra that referenced this issue Sep 12, 2018
@gaocegege
Member Author

I think we can close the issue since we now have the repo for the mxnet operator.

michelle192837 pushed a commit to michelle192837/testgrid that referenced this issue Aug 27, 2019
woop pushed a commit to woop/community that referenced this issue Nov 16, 2020
saffaalvi pushed a commit to StatCan/kubeflow that referenced this issue Feb 11, 2021