Clarify scope of tensorflow/k8s #150
Comments
a list of questions in my mind that can be better answered if the scope is clearly defined:
I hope this repo will be part of a broader effort to create a loosely coupled stack of components covering the full ML lifecycle on K8s. @aronchick has suggested calling this broader effort KubeFlow. This will be just one repo in that effort. The focus of this repo will be on TensorFlow on K8s. TensorFlow is itself a growing ecosystem of components aimed at the full lifecycle of ML, so this repo will likely focus on closing any gaps in making those components run well on K8s. In most cases, I expect there will be more than one reasonable location to host particular code. As an example, if we wanted to add tooling to make TensorFlow Serving easy to spin up on K8s, that could live in this repo or in tensorflow/serving.
General data pipelining isn't in scope. However, tooling relevant to TensorFlow could live in this repo. For example, an Airflow operator to launch TfJobs could live in this repo (or Airflow contrib).
I suspect the UI will ultimately live in its own repo. It's not clear which component (or components) will provide the functionality to compare across experiments. One option is for TensorBoard to add this functionality; see tensorflow/tensorboard#92.
The hope is to use K8s as an abstraction layer so that any storage system that works with K8s can work with these components. The TfJob CRD doesn't make any assumptions about the underlying storage system because it uses the K8s storage layer (volumes) to hide the details. Relying on K8s breaks down when K8s doesn't have an appropriate abstraction and you have to expose the details of the underlying cluster. An example is logging (#128). Right now K8s doesn't provide a logging API that fetches logs from durable storage (e.g. StackDriver). As a result, when using TfJob I don't think there's a cloud-agnostic way to fetch logs after the job finishes. This isn't just an issue for TensorFlow. Any system running batch jobs (e.g. Spark, Airflow, etc.) has this problem. My hope is that K8s will evolve APIs to solve this.
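To illustrate the storage point, here is a minimal sketch of how a pod template can stay identical across storage backends when only the K8s volume source changes. The pod-level fields (`volumes`, `volumeMounts`, `persistentVolumeClaim`, `nfs`) are standard K8s pod spec fields. The image name, claim name, NFS server, job name, and the exact TfJob wrapper fields (`apiVersion`, `replicaSpecs`, `tfReplicaType`) are assumptions for illustration and may not match the CRD version you are running.

```python
import json


def pod_template(volume_source):
    """Build a pod template whose training container reads from /data.

    volume_source is the only storage-specific part; the container is
    unaware of which backend sits behind the mount.
    """
    volume = {"name": "training-data"}
    volume.update(volume_source)
    return {
        "spec": {
            "containers": [{
                "name": "tensorflow",
                # Hypothetical image name, for illustration only.
                "image": "example.com/my-training-image:latest",
                "args": ["--data_dir=/data"],
                "volumeMounts": [
                    {"name": "training-data", "mountPath": "/data"},
                ],
            }],
            "volumes": [volume],
            "restartPolicy": "OnFailure",
        }
    }


def tfjob(template):
    """Wrap a pod template in a TfJob-style manifest (field names assumed)."""
    return {
        "apiVersion": "tensorflow.org/v1alpha1",  # assumed CRD group/version
        "kind": "TfJob",
        "metadata": {"name": "example-job"},  # hypothetical job name
        "spec": {
            "replicaSpecs": [{
                "replicas": 1,
                "tfReplicaType": "MASTER",
                "template": template,
            }],
        },
    }


# Same container, two different storage backends (names are hypothetical).
pvc_backed = pod_template(
    {"persistentVolumeClaim": {"claimName": "training-data-pvc"}})
nfs_backed = pod_template(
    {"nfs": {"server": "nfs.example.com", "path": "/exports/data"}})

if __name__ == "__main__":
    print(json.dumps(tfjob(pvc_backed), indent=2))
```

Only the entry under `volumes` differs between the two templates; the container definition, including its mount path and arguments, is unchanged. Nothing comparable exists for logs, which is the gap described above.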
I'm open to client libraries in other languages if people think they are useful and are willing to contribute. Perhaps we can reuse the K8s client generation code so we can auto-generate them. /cc @aronchick @foxish @vishh
@jimexist did that help? Happy to drop into mail and explain our overall efforts if you'd like.
The UI part is a little bit obscure. The upstream ticket seems quite stalled, and the last two comments have gone unanswered.
Also, I think an overview of this Google Gradient co-financed effort would be useful to outline the perimeter of this project.
@bhack Algorithmia isn't currently involved, although it would be great to work with them if they are interested.
It's my understanding that @wbuchwalter is working on a minimal UI just to make TensorFlow on K8s more accessible to folks who feel more comfortable with a UI. I think the long-term direction for the UI is unclear. Feel free to chime in with your opinions on the issue or on Slack at kubeflow.slack.com.
@jlewi To make TensorFlow tasks run well on Kubernetes, I think we have to implement some other tools, such as a monitor (real-time processing and monitoring of the status of all TensorFlow tasks). An easy-to-use command-line tool is also more or less necessary.
@DjangoPeng I agree with you. Do you want to open up issues for those items?
Sure thing. Maybe next Monday or Tuesday, I'll open an issue to explain and clarify our proposal. At the same time, I'll provide a development schedule including features and due dates.
We should update README.md to clarify the scope of tensorflow/k8s.