Add How To Guide for Dataflow #13461
Conversation
Force-pushed from 43646ae to d45fb37
Ways to run a data pipeline
^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are multiple options to execute a Dataflow pipeline on Airflow. If looking to execute the pipeline
I would like to expand on this, because it is very problematic for users.
I would suggest the following text, though it can probably be improved and extended.
There are several ways to run a Dataflow pipeline depending on your environment and source files:

- **Non-templated pipeline**: The developer can run the pipeline as a local process on the worker if you have a ``*.jar`` file for Java or a ``*.py`` file for Python. This also means that the necessary system dependencies must be installed on the worker. For Java, the worker must have the JRE installed; for Python, the Python interpreter. The runtime versions must be compatible with the pipeline versions. This is the fastest way to start a pipeline, but because of its frequent problems with system dependencies, it often causes problems.
- **Templated pipeline**: The programmer can make the pipeline independent of the environment by preparing a template that will then be run on a machine managed by Google. This way, changes to the environment won't affect your pipeline. There are two types of templates:

  - **Classic templates**. Developers run the pipeline and create a template. The Apache Beam SDK stages files in Cloud Storage, creates a template file (similar to a job request), and saves the template file in Cloud Storage.
  - **Flex Templates**. Developers package the pipeline into a Docker image and then use the ``gcloud`` command-line tool to build and save the Flex Template spec file in Cloud Storage.
- **SQL pipeline**: The developer can write the pipeline as an SQL statement and then execute it in Dataflow.

It is a good idea to test your pipeline using the non-templated mode first, and then run the pipeline in production using the templates.

For details on the differences between the pipeline types, see `Dataflow templates <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>`__ in the Google Cloud documentation.
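To make the non-templated option above more concrete, here is a minimal sketch of how it might be launched from a DAG, assuming the `DataflowCreatePythonJobOperator` from the Google provider; the DAG id, bucket, file paths, Beam version, and pipeline options are all hypothetical:

```python
from airflow import models
from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator
from airflow.utils.dates import days_ago

with models.DAG(
    "example_dataflow_non_templated",  # hypothetical DAG id
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    # The *.py file is fetched (here from GCS) and run as a local process on the
    # Airflow worker, so the worker needs a compatible Python interpreter; the
    # Apache Beam SDK can be installed into a temporary virtualenv via py_requirements.
    start_python_pipeline = DataflowCreatePythonJobOperator(
        task_id="start_python_pipeline",
        py_file="gs://my-bucket/pipelines/wordcount.py",  # hypothetical path
        job_name="example-non-templated-job",
        options={"output": "gs://my-bucket/output"},  # hypothetical pipeline option
        py_requirements=["apache-beam[gcp]==2.25.0"],
        py_interpreter="python3",
        location="europe-west3",
    )
```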
I like the expanded version.
"This is the fastest way to start a pipeline, but because of its frequent problems with system dependencies, it often causes problems. " is very strongly worded. We could maybe add more explanation on how to manage dependency problems instead. Otherwise this will prevent most users from trying this option. And I think the airflow operator has been developed significantly overtime and allows managing dependencies. So I am guessing this is more of a documentation problem than a problem with the operator itself.
Very often these problems are not easy to solve because one common Docker image is used for many environments. For example, in Cloud Composer, you cannot install any system dependencies. The only thing you can do is install new libraries via pip. I agree that this is a fairly strongly worded sentence, and we can think about improving it.
I slightly rephrased it from "it often causes problems." to "it may cause problems."
batch asynchronously (fire and forget), batch blocking (wait until completion), or streaming (run indefinitely).
In Airflow it is best practice to use asynchronous batch pipelines or streams and use sensors to listen for expected job state.

By default :class:`~airflow.providers.google.cloud.operators.dataflow.DataflowCreateJavaJobOperator`
And what is the behavior of the DataflowStartFlexTemplateOperator? It seems to me that these sections need to be generalized a little in order to describe the general assumptions and only then describe the specific cases. I suspect that a well-written section will allow us to delete subsections.
I refactored it and added a description of how it works with the template operators.
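For reference, a minimal sketch of how launching a Flex Template might look with `DataflowStartFlexTemplateOperator`; the template spec path, job name, parameters, and project are hypothetical:

```python
from airflow import models
from airflow.providers.google.cloud.operators.dataflow import DataflowStartFlexTemplateOperator
from airflow.utils.dates import days_ago

with models.DAG(
    "example_dataflow_flex_template",  # hypothetical DAG id
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    # The pipeline runs from a Docker image referenced by the Flex Template spec
    # file in Cloud Storage, so no Beam SDK is required on the Airflow worker.
    start_flex_template = DataflowStartFlexTemplateOperator(
        task_id="start_flex_template",
        body={
            "launchParameter": {
                "containerSpecGcsPath": "gs://my-bucket/templates/my-flex-template.json",  # hypothetical
                "jobName": "example-flex-template-job",
                "parameters": {
                    "input": "gs://my-bucket/input",    # hypothetical
                    "output": "gs://my-bucket/output",  # hypothetical
                },
            }
        },
        location="europe-west3",
        project_id="my-project",  # hypothetical
    )
```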
docs/apache-airflow-providers-google/operators/cloud/dataflow.rst (outdated review comment, resolved)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Dataflow has multiple options of executing pipelines. It can be done in the following modes:
batch asynchronously (fire and forget), batch blocking (wait until completion), or streaming (run indefinitely).
Personally, I wouldn't consider streaming as another execution model.
It is based on the Dataflow documentation:
The content looks good to me.
I think this review would benefit from input from someone with a tech writing background. Does the Airflow community have a reviewer to help with consistent documentation? Alternatively, we could ask @rosetn to review the changes to dataflow.rst if she can.
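As context for the "use sensors to listen for expected job state" recommendation in the quoted text, a minimal sketch of the asynchronous pattern using `DataflowJobStatusSensor` from the Google provider; the upstream task id, project, and location are hypothetical:

```python
from airflow import models
from airflow.providers.google.cloud.hooks.dataflow import DataflowJobStatus
from airflow.providers.google.cloud.sensors.dataflow import DataflowJobStatusSensor
from airflow.utils.dates import days_ago

with models.DAG(
    "example_dataflow_async_wait",  # hypothetical DAG id
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    # Assumes an upstream task (e.g. a templated job start) pushed the created
    # Dataflow job to XCom; the sensor polls until the job reaches JOB_STATE_DONE.
    wait_for_job_done = DataflowJobStatusSensor(
        task_id="wait_for_job_done",
        job_id="{{ task_instance.xcom_pull('start_template_job')['id'] }}",  # hypothetical upstream task
        expected_statuses={DataflowJobStatus.JOB_STATE_DONE},
        project_id="my-project",  # hypothetical
        location="europe-west3",
    )
```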
docs/apache-airflow-providers-google/operators/cloud/dataflow.rst (outdated review comment, resolved)
The community doesn't have any technical writers. We rely only on the contributions of other people, and they are mostly developers.
Force-pushed from d45fb37 to c5899cf
The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.
There are several ways to run a Dataflow pipeline depending on your environment, source files:

- **Non-templated pipeline**: Developer can run the pipeline as a local process on the worker
to clarify, worker is an "airflow worker" not a "dataflow worker" in this case right?
You are right. I changed it to "airflow worker".
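To make the "airflow worker" point concrete, a minimal sketch of the non-templated Java case with `DataflowCreateJavaJobOperator`; the jar path, main class, and options are hypothetical:

```python
from airflow import models
from airflow.providers.google.cloud.operators.dataflow import DataflowCreateJavaJobOperator
from airflow.utils.dates import days_ago

with models.DAG(
    "example_dataflow_java_non_templated",  # hypothetical DAG id
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    # The jar is downloaded and executed as a local java process on the *Airflow*
    # worker, so that worker must have a compatible JRE installed; only the
    # resulting Dataflow job runs on Google-managed workers.
    start_java_pipeline = DataflowCreateJavaJobOperator(
        task_id="start_java_pipeline",
        jar="gs://my-bucket/pipelines/word-count-bundled.jar",  # hypothetical path
        job_name="example-java-non-templated-job",
        job_class="org.example.WordCount",  # hypothetical main class
        options={"output": "gs://my-bucket/output"},  # hypothetical pipeline option
        location="europe-west3",
    )
```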
- **Classic templates**. Developers run the pipeline and create a template. The Apache Beam SDK stages
  files in Cloud Storage, creates a template file (similar to job request),
  and saves the template file in Cloud Storage. See: :ref:`howto/operator:DataflowTemplatedJobStartOperator`
- **Flex Templates**. Developers package the pipeline into a Docker image and then use the ``gcloud``
Would this not require "gcloud" as a dependency pre-installed on the airflow worker nodes? (similar to the JRE or Python requirements above)
It seems to me that only the SQL operator requires the Google Cloud CLI to be installed.
@mik-laj is right. Only DataflowStartSqlJobOperator requires gcloud. I added a warning in the DataflowStartSqlJobOperator section and in the operator itself about the required gcloud SDK.
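For completeness, a minimal sketch of the SQL case that needs gcloud installed on the Airflow worker, using `DataflowStartSqlJobOperator`; the query, project, dataset, and table names are hypothetical:

```python
from airflow import models
from airflow.providers.google.cloud.operators.dataflow import DataflowStartSqlJobOperator
from airflow.utils.dates import days_ago

with models.DAG(
    "example_dataflow_sql",  # hypothetical DAG id
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    # The operator shells out to `gcloud dataflow sql query ...`, so the Google
    # Cloud SDK must be available on the Airflow worker for this task to run.
    start_sql_job = DataflowStartSqlJobOperator(
        task_id="start_sql_job",
        job_name="example-sql-job",
        query="""
            SELECT sales_region, COUNT(*) AS num_orders
            FROM bigquery.table.`my-project`.my_dataset.orders
            GROUP BY sales_region
        """,  # hypothetical query over a hypothetical BigQuery table
        options={
            "bigquery-project": "my-project",       # hypothetical output settings
            "bigquery-dataset": "my_dataset",
            "bigquery-table": "orders_by_region",
        },
        location="us-west1",
        do_xcom_push=True,
    )
```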
Force-pushed from 5eed586 to 59251a3
The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.
Force-pushed from 59251a3 to 9beba59
Force-pushed from 9beba59 to d3d9d9b
Retaken from #8809