Consider adding support for running TFX w/Beam in the components #2958
Actually, you have the option to author the pipeline in TFX and run it on KubeflowDagRunner. You can select the Dataflow runner by specifying Beam args; see the example here: https://github.com/tensorflow/tfx/blob/848063c2c84e60f368994edf9ace01d017120dbc/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_kubeflow_gcp.py#L100
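For illustration, here is a minimal sketch of what specifying Beam args for Dataflow can look like. The flags are standard Apache Beam pipeline options; the helper name `dataflow_beam_args` and the project, bucket, and region values are placeholders and are not taken from the linked example.

```python
from typing import List


def dataflow_beam_args(project: str, temp_location: str, region: str) -> List[str]:
    """Build Beam pipeline options that select the Dataflow runner.

    Hypothetical helper for illustration; all arguments are placeholders.
    """
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--temp_location={temp_location}",
        f"--region={region}",
    ]


beam_args = dataflow_beam_args(
    project="my-gcp-project",            # placeholder GCP project ID
    temp_location="gs://my-bucket/tmp",  # placeholder GCS scratch path
    region="us-central1",                # placeholder Dataflow region
)
# In a TFX pipeline definition, this list would be passed as
# pipeline.Pipeline(..., beam_pipeline_args=beam_args) before the pipeline
# is compiled with KubeflowDagRunner.
```

The list only carries runner configuration, so the same pipeline definition can be pointed at a different Beam runner by changing these flags.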
/assign @numerology
So that’s only if I author a TFX pipeline. If I build a Kubeflow pipeline with TFX components, it doesn’t let me specify the runner, as far as I can see.
Jiaxiao, this matches the email we discussed before.
Actually, in KubeflowDagRunner all TFX components are converted to KFP ContainerOp under the hood. I'm hesitant to support a code path like KFP-TFX-KFP-Argo; it looks unnecessarily redundant.
So let’s say I have a mixed pipeline, for example one that needs to do some non-TFX data munging first. How do I do my TFX work on Beam as part of that? Or is that just not in scope? (Also, if we don’t want folks to use the KFP TFX components, should we delete them? I know KF has a lot of dead ends, and it might be good to clean some of them up if this is one of those cases.)
Well, we had Dataflow components, but someone cleaned them up.
KFP uses Argo and runs on Kubernetes. Beam is a different orchestrator. Can you please tell us about scenarios that require using Beam on Kubernetes? When using the TFX framework (not the components), you can choose a runner: either Beam or Kubeflow+Argo+Kubernetes. Of course, KFP allows you to launch code on Dataflow, but that would be an opaque task: a single block that executes some Beam pipeline. The same goes for any other pair of orchestrators.
So my scenario is that I fetch data from the world outside of TFX and store it on GCS as part 1 of my pipeline. Then in part 2 I want to train a TensorFlow model, ideally with TFT for data transformations. If you don’t imagine TFX being used inside of KF pipelines, that’s fine; I’m just a little confused.
Just following up: do we not expect folks to use TFX as part of a Kubeflow pipeline? I want to make sure I capture this correctly in the book I'm working on.
See also kubeflow/kubeflow#1583. My understanding is that KFP will be the recommended way to run TFX pipelines, especially on K8s. To achieve that, my understanding is that at some point the TFX implementation on KFP will support scaling out the processing (e.g. TF.Transform) using Beam. I believe using Dataflow is already supported, and we have at least one issue (kubeflow/kubeflow#1583) related to supporting other runners like Flink. @rmgogogo @jessiezcc @Bobgy Can anyone from pipelines provide some guidance in terms of the roadmap for TFX and KFP? Could we get the relevant information added to
/cc @katsiapis
Good idea.
@Ark-kun
Thanks for the pointer. It looks like it is very easy to add support for
Hi @holdenk, I recommend using the official TFX components on KFP instead [1] [2]. Those components are updated, maintained, and have been tested with Dataflow. This is the supported path; we'll be happy to help you with any issues here. Thanks! [1] https://www.tensorflow.org/tfx/tutorials/tfx/cloud-ai-platform-pipelines
+1.
OK, but if I have non-TFX tasks, that doesn’t work.
That's true. Longer-term solution:
That’s great news! Thank you for working on these :)
@holdenk
Looking at the current TFX components, it seems like they only run in local mode. I think it would be useful to consider adding support for running on top of Beam w/Dataflow (and later Flink & Spark).