[Request] Repository for Arena #164
Comments
/cc @gaocegege @kkasravi |
Personally, I love the tool. I haven't tried it yet, but I looked through the code. I think it improves usability at the CLI level. We have had some discussions about a Kubeflow CLI, and arena is what I want to get from the Kubeflow CLI. Thus I think we could accept the contribution from @cheyang and make it a core project in the Kubeflow community. I have talked with @cheyang and he will continue to contribute to the project. What I am worried about is the copyright. @cheyang Will you just move the repo to the Kubeflow org, or could you transfer the code copyright and ownership to the Kubeflow community? And will your company allow the transfer? |
Thanks for the response, @gaocegege. We are in the process of getting approval for the copyright transfer. It looks good so far. Once it's approved, we can transfer it to the Kubeflow community. |
Then LGTM. |
This is really cool and has important functionality like tooling to help fetch logs. I took a look at the repo and I have a couple high level questions
Regarding #1, the overview says
But the Roadmap is much broader; for example it mentions multi-tenancy and training history management. How do these functions overlap with the idea of a CLI? Why wouldn't we tackle these problems in the broader context of Kubeflow rather than in the context of a CLI? Regarding 2. How are we going to align Arena with Kubeflow? One of the core principles of Kubeflow is that we don't introduce new patterns and tools for things already covered by Kubernetes. We also don't want to create Kubeflow specific solutions to problems that should be solved by Kubernetes. Do the Arena contributors agree with this principle? Here are some examples
Related discussions: Configurable Dev Tooling #54 Discussions of multi-tenancy and ACLs #124 |
Thanks very much for sharing your ideas and a lot of helpful information with us! As part of Kubeflow, the goal of Arena is:
The principles of Arena:
The first principle is the key. We are trying to solve the customer’s requirements and issues. Here is our answer to the questions:
For the specific question: Managing nodes e.g arena nodes seems like core Kubernetes functionality handled by kubectl; do we need to build a custom container for this?
One of the problems Arena is trying to solve is making it easy to go from source to job?
|
This is a great goal. If this is the goal, why emphasize a CLI-based approach? The request I hear most from data scientists is
What do you think about moving Arena in one of the directions listed above? My conjecture is that we can deliver a more data-scientist-friendly experience by focusing on notebooks; e.g. by making Arena
I think we could build a good Python library that provides a native notebook experience and can then be used as the basis for a CLI. |
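A minimal sketch of the library-first idea from the comment above, using only the Python standard library. Everything here is hypothetical and not taken from Arena: the `build_job` function, its parameters, and the dict layout are illustrative assumptions. The point is only that the same function can be called from a notebook cell or wrapped by a thin CLI.

```python
import argparse


def build_job(name, image, gpus=0, workers=1):
    """Return a plain dict describing a training job.

    The field names are a hypothetical schema for illustration only.
    """
    if workers < 1:
        raise ValueError("workers must be >= 1")
    if gpus < 0:
        raise ValueError("gpus must be >= 0")
    return {
        "name": name,
        "image": image,
        "workers": workers,
        "gpus_per_worker": gpus,
    }


def main(argv):
    """Thin CLI wrapper: parse flags, then delegate to the library call."""
    parser = argparse.ArgumentParser(prog="submit-sketch")
    parser.add_argument("--name", required=True)
    parser.add_argument("--image", required=True)
    parser.add_argument("--gpus", type=int, default=0)
    parser.add_argument("--workers", type=int, default=1)
    args = parser.parse_args(argv)
    return build_job(args.name, args.image, gpus=args.gpus, workers=args.workers)


# Notebook-style use: call the library function directly.
from_notebook = build_job("mnist", "example/image:latest", gpus=1)

# Terminal-style use: the same job, expressed as command-line flags.
from_cli = main(["--name", "mnist", "--image", "example/image:latest", "--gpus", "1"])
```

With this split, validation and defaults live in one place, so the notebook and the terminal cannot drift apart.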
It’s based on our experience with customer engagements. We have many AI customers from Internet companies, research institutions, banks, etc. We have found that they are familiar with, and prefer, a CLI in a Linux terminal rather than a notebook. I think it may be the habit of our customers' data scientists. We tried to promote JupyterHub/notebooks to our customers, but they preferred CLI solutions in a Linux terminal, because that's what they are doing today. That’s why we delivered Arena. I also see that Caicloud has the same feedback from their customers.
We don’t have such a plan now because most of our customers are fine with the CLI solution, but if more customers ask for a UI, we will provide one for them. |
We've gotten data scientist feedback that they also like CLI's with the ability to customize their CLI using python. We should, if possible in this discussion, qualify areas where a UI may be preferred vs iterative development such as training a model. In the latter we've been told that data scientists have wanted to automate aspects of their workflows. We've also gotten feedback from some data scientists that working in a terminal vs a notebook in a browser is preferred due to the higher latencies of typing in a browser. |
BTW we spent some time on an earlier effort that was also based on spf13/cobra but decoupled the command from its execution which was done with serverless functions that could be written in python. For this we used kubeless. We spent some time making the command set extensible, so you could add, remove commands. Looking at your codebase it looks like many similar ideas are implemented using the kubernetes clientset - which kubeless utilizes under the covers. @jlewi is @kunmingg planning on extending gcp-click-to-deploy so other commands would be sent to bootstrapper for execution? I know that there is an active effort to unify kfctl.sh and gcp-click-to-deploy but wasn't sure if this extended beyond the deployment of components into areas that arena has focused on. |
I see a lot of value in CLIs. The question I have is how will Arena evolve compared to generic CLI's in K8s e.g.
Let's take an example
It looks like this is doing two things
Isn't this the sort of workflow that tools like draft and skaffold are targeting? It looks like the chart is trying to turn the entire YAML spec into a set of parameters. Why not just publish a helm chart and use helm as the CLI? Why wrap helm in another, custom CLI? Rather than create really complex templates as in the TFJob chart, why not just create a set of example templates/charts that cover different use cases and encourage people to create templates specific to their needs? Creating generic templates is really hard. I think we'll just end up adding more and more parameters to cover more use cases and eventually rewriting the API. Looking at the helm charts it seems like that's what's happening. So instead of telling people to look at the APIs for our CRDs (TFJob, MPIJob, etc.) to figure out how to set something (e.g. an environment variable), they need to look at the chart and reverse engineer the Go templates. How is that better than just pointing folks at the Container Spec in the K8s docs? @kkasravi @kunmingg isn't working on extending bootstrapper to perform other commands. Ref: |
It would be good to discuss this at one of the community meetings. Unfortunately, I'm not sure I will be able to attend tomorrow's meeting, and in two weeks (the next meeting for the Asia Pacific time zone) I will be on vacation. |
Thanks, no rush. We can discuss when you are back. Have a good vacation. |
In fact, we have already provided some helm charts for TensorFlow and Horovod in https://github.com/helm/charts/tree/master/stable. We tried to help customers use charts to cover model development, training, and serving. Our customers used them, but found them too complicated. If there are too many choices and options, customers feel confused, and they don’t want to know too many details. They also dislike using both helm and kubectl; they want a single CLI to handle their daily work. In our experience, data scientists only care about three questions:
Our wrapper tries to answer the questions above and avoid exposing details. Arena is a CLI facade with machine learning domain knowledge: it not only submits the training job, but also manages the lifecycle of the job, so it can get the job's status and check its logs directly. Ordinary users can use arena directly without understanding charts, so it's easy for them to get started. If advanced users need to add more features, they can modify the chart directly. Similar solutions are floyd-cli from FloydHub and https://polyaxon.com/ |
@kkasravi @wbuchwalter @gaocegege thoughts? One question I have is when would we suggest users to use lower level tools (e.g. directly write YAML files and use kubectl) vs using Arena? |
What do folks think about just starting to incorporate Arena and seeing where it leads? |
@jlewi +1 I think using the clientset API within a golang program needs to be explored and the reasons are similar to why bootstrapper uses a rest API from golang. I would suggest we look for ways to make the API extensible so that new or different methods can be bound within the spf13/cobra command set. One area I had prototyped was dynamically loading .so's
but adopting something similar to kubectl plugin architecture may be more extensible. |
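As a rough illustration of the kubectl-style plugin idea mentioned above: kubectl treats any executable on `PATH` named `kubectl-<name>` as a plugin and prefers the longest matching name. The sketch below applies the same lookup to a hypothetical `arena-` prefix. This is an assumption for illustration, not something Arena is known to implement, and it uses Python only to stay consistent with the other sketches in this thread.

```python
import shutil


def find_plugin(args, prefix="arena", path=None):
    """Resolve a subcommand to an external executable, longest name first.

    For args ["top", "node", "--gpu"], this looks for "arena-top-node" and
    then "arena-top" on PATH. Returns (executable, remaining_args) or None.
    The "arena" prefix is a hypothetical choice for this sketch.
    """
    # Only leading non-flag words can be part of the plugin name.
    words = []
    for arg in args:
        if arg.startswith("-"):
            break
        words.append(arg)

    for n in range(len(words), 0, -1):
        candidate = "-".join([prefix] + words[:n])
        exe = shutil.which(candidate, path=path)
        if exe:
            return exe, list(args[n:])
    return None
```

A built-in command table would be consulted first. Only when no built-in matches would the CLI call `find_plugin` and, if it returns a match, run the executable with the remaining arguments (for example via `subprocess.call`).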
I think for newcomers or entry-level data scientists/ML engineers, we should provide a CLI/simplified API to help them run their jobs easily, because some users do not understand the concepts of Kubernetes, do not know how to use kubectl to create resources on Kubernetes, and do not want to learn. That's why I suggested building a unified API layer here: https://docs.google.com/document/d/1RkNL6XY7rR4eaW1TuM-loMuX9Dm5pFi5wpaFnnrH5LM/edit?usp=sharing As for advanced users, we should keep the Kubernetes way, so that they can apply low-level configuration to their training jobs. |
@jlewi good question of when we suggest users to use kubectl or arena. |
I went ahead and created the arena repository. I've created an initial OWNERS file with @cheyang so he can approve changes including adding additional approvers and reviewers. I created a new repo rather than transferring the existing repo, because I'd like a record of the CLA being signed as part of code submission. We can use: To move issues if desired. |
@wsxiaozhang my question is more about when we tell users to switch from submitting jobs via arena/CLI to writing YAML files. For example, are there modifications (adding volumes, setting resource requests, environment variables) which arena will explicitly not support? We had originally tried (using ksonnet prototype parameters) to make it easy for users to customize TFJob and TFServing just by setting parameters. In practice, we found that this led to very complex prototypes that were hard to understand. As a result, we've been moving more in the direction of treating "prototypes" as examples that people copy and then modify. As an example of the complexity you can look at the TFServing prototype. We wanted to make it easy for people to load their model from different object stores (e.g. GCS or S3). Each of these requires setting different environment variables and volume mounts, some of which might need to be customized by the user. This leads to an ever-growing number of parameters the user can set (e.g The complexity will increase as we try to support more ways of running Kubernetes. For example, at least in the past Azure and GCP used different names for GPU resources. Looking at the arena command above, it already has 10 parameters. At what point is it more convenient and better for reproducibility to start checking in YAML files containing the parameters for each run? |
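One way to picture the reproducibility point above is to keep each run's parameters in a checked-in file and derive the command-line flags from it. The sketch below is hypothetical throughout: the parameter names are made up and are not Arena flags, and it uses JSON because the Python standard library has no YAML parser (reading YAML would need a third-party package such as PyYAML).

```python
import json


def params_to_flags(params):
    """Turn a parameter mapping into a deterministic list of --key=value flags.

    Booleans become bare switches, lists repeat the flag once per item.
    """
    flags = []
    for key in sorted(params):
        value = params[key]
        if isinstance(value, bool):
            if value:
                flags.append(f"--{key}")
        elif isinstance(value, (list, tuple)):
            flags.extend(f"--{key}={item}" for item in value)
        else:
            flags.append(f"--{key}={value}")
    return flags


def dump_params(params):
    """Serialize parameters in a stable, diff-friendly form for version control."""
    return json.dumps(params, indent=2, sort_keys=True) + "\n"


def load_params(text):
    """Read parameters back from the checked-in text."""
    return json.loads(text)


# Hypothetical parameters for one training run.
run = {"name": "mnist", "gpus": 1, "env": ["A=1", "B=2"], "tensorboard": True}
flags = params_to_flags(load_params(dump_params(run)))
```

Once a run is described this way, the file rather than the shell history is the record of what was submitted, which is close to checking in the YAML itself.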
I think a good model for a CLI to submit jobs is
I don't think of If we find the CLI moving in the direction of defining a substantially different Job API than the underlying operators, we should pause and think about the path forward. |
@jlewi Is it possible to transfer the existing repo? Because we'd love to keep existing PR, Forks and stars. We can require new PRs with CLA signed. Thanks. |
Transfer is complete: |
Move repository from https://github.com/AliyunContainerService/arena to KubeFlow community.
/assign @jlewi