
Add How To Guide for Dataflow #13461

Merged: 3 commits merged into apache:master on Jan 21, 2021
Conversation

TobKed (Contributor) commented Jan 4, 2021

Retaken from #8809


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@boring-cyborg boring-cyborg bot added kind:documentation provider:google Google (including GCP) related issues labels Jan 4, 2021
@TobKed TobKed force-pushed the howto-for-dataflow branch from 43646ae to d45fb37 Compare January 5, 2021 13:14
TobKed (Contributor, Author) commented Jan 7, 2021

cc @tanjinP @mik-laj @aaltay

Ways to run a data pipeline
^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are multiple options to execute a Dataflow pipeline on Airflow. If looking to execute the pipeline
mik-laj (Member) commented Jan 7, 2021:

I would like to expand this, because it is very problematic for users.

I would suggest the following text, though it can probably be improved and expanded.

There are several ways to run a Dataflow pipeline depending on your environment and source files:

- **Non-templated pipeline**: Developers can run the pipeline as a local process on the worker if they have a ``*.jar`` file for Java or a ``*.py`` file for Python. This also means that the necessary system dependencies must be installed on the worker. For Java, the worker must have the JRE installed; for Python, the Python interpreter. The runtime versions must be compatible with the pipeline versions. This is the fastest way to start a pipeline, but because of its frequent problems with system dependencies, it often causes problems.
- **Templated pipeline**: The programmer can make the pipeline independent of the environment by preparing a template that will then be run on a machine managed by Google. This way, changes to the environment won't affect your pipeline. There are two types of templates:
     - **Classic templates**. Developers run the pipeline and create a template. The Apache Beam SDK stages files in Cloud Storage, creates a template file (similar to a job request), and saves the template file in Cloud Storage.
     - **Flex Templates**. Developers package the pipeline into a Docker image and then use the ``gcloud`` command-line tool to build and save the Flex Template spec file in Cloud Storage.
- **SQL pipeline**: Developers can write a pipeline as an SQL statement and then execute it in Dataflow.

It is a good idea to test your pipeline using the non-templated pipeline, and then run the pipeline in production using the templates.

For details on the differences between the pipeline types, see `Dataflow templates <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>`__ in the Google Cloud documentation.
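To make the non-templated option concrete, here is a minimal sketch of a DAG that runs a Python pipeline as a local process on the worker via DataflowCreatePythonJobOperator. The bucket paths, job name, and region are hypothetical placeholders:

```python
from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowCreatePythonJobOperator,
)
from airflow.utils.dates import days_ago

with DAG(
    "example_dataflow_non_templated",
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    # The pipeline source runs as a local process on the worker, so the
    # worker needs a compatible Python interpreter; py_requirements is
    # installed on the worker before the pipeline starts.
    start_python_job = DataflowCreatePythonJobOperator(
        task_id="start_python_job",
        py_file="gs://my-bucket/pipelines/wordcount.py",  # hypothetical source file
        job_name="example-non-templated-job",
        py_requirements=["apache-beam[gcp]"],
        options={
            "output": "gs://my-bucket/output/results",  # hypothetical paths
            "temp_location": "gs://my-bucket/tmp/",
        },
        location="europe-west3",
    )
```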

aaltay (Member) commented:

I like the expanded version.

"This is the fastest way to start a pipeline, but because of its frequent problems with system dependencies, it often causes problems." is very strongly worded. We could maybe add more explanation of how to manage dependency problems instead; otherwise this will prevent most users from trying this option. And I think the Airflow operator has been developed significantly over time and allows managing dependencies, so I am guessing this is more of a documentation problem than a problem with the operator itself.

mik-laj (Member) replied:

Very often these problems are not easy to solve, because one common Docker image is used for many environments. For example, in Cloud Composer you cannot install any system dependencies; the only thing you can do is install new libraries via pip. I agree that this is a fairly strongly-worded sentence, and we can think about improving it.

TobKed (Contributor, Author) replied:

I slightly rephrased it from "it often causes problems." to "it may cause problems."

batch asynchronously (fire and forget), batch blocking (wait until completion), or streaming (run indefinitely).
In Airflow it is best practice to use asynchronous batch pipelines or streams and use sensors to listen for expected job state.
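The sensor pattern described above can be sketched roughly as follows, using the provider's DataflowJobStatusSensor. The template path, region, and the XCom key used to recover the job ID are assumptions and may differ between provider versions:

```python
from airflow import DAG
from airflow.providers.google.cloud.hooks.dataflow import DataflowJobStatus
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)
from airflow.providers.google.cloud.sensors.dataflow import DataflowJobStatusSensor
from airflow.utils.dates import days_ago

with DAG(
    "example_dataflow_async_with_sensor",
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    start_job = DataflowTemplatedJobStartOperator(
        task_id="start_job",
        template="gs://dataflow-templates/latest/Word_Count",  # a public classic template
        parameters={
            "inputFile": "gs://my-bucket/input.txt",  # hypothetical bucket
            "output": "gs://my-bucket/output/results",
        },
        job_name="example-async-wordcount",
        location="europe-west3",
        wait_until_finished=False,  # fire and forget: do not block the worker slot
    )

    # The start task pushes the created job to XCom; the exact layout can
    # differ between provider versions, so the 'id' key is an assumption here.
    wait_for_job = DataflowJobStatusSensor(
        task_id="wait_for_job",
        job_id="{{ task_instance.xcom_pull('start_job')['id'] }}",
        expected_statuses={DataflowJobStatus.JOB_STATE_DONE},
        location="europe-west3",
    )

    start_job >> wait_for_job
```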

By default :class:`~airflow.providers.google.cloud.operators.dataflow.DataflowCreateJavaJobOperator`
Member commented:

And what is the behavior of the DataflowStartFlexTemplateOperator? It seems to me that these sections need to be generalized a little in order to describe the general assumptions and only then describe the specific cases. I suspect that a well-written section will allow us to delete subsections.

TobKed (Contributor, Author) replied Jan 11, 2021:

I refactored it and added a description of how it works with the template operators.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Dataflow has multiple options for executing pipelines. It can be done in the following modes:
batch asynchronously (fire and forget), batch blocking (wait until completion), or streaming (run indefinitely).
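As a rough sketch of the first two modes, assuming the operators' wait_until_finished argument; template paths and job names are placeholders:

```python
from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)
from airflow.utils.dates import days_ago

with DAG(
    "example_dataflow_modes",
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    # Batch blocking: the task stays running until the Dataflow job completes.
    blocking_job = DataflowTemplatedJobStartOperator(
        task_id="blocking_job",
        template="gs://dataflow-templates/latest/Word_Count",
        job_name="example-blocking-job",
        location="europe-west3",
        wait_until_finished=True,
    )

    # Batch asynchronous (fire and forget): the task succeeds as soon as
    # the job is submitted.
    fire_and_forget_job = DataflowTemplatedJobStartOperator(
        task_id="fire_and_forget_job",
        template="gs://dataflow-templates/latest/Word_Count",
        job_name="example-async-job",
        location="europe-west3",
        wait_until_finished=False,
    )
```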
Member commented:

Personally, I wouldn't consider streaming as another execution model.

aaltay (Member) left a comment:

The content looks good to me.

I think this review would benefit from a review by someone with a tech-writing background. Does the Airflow community have a reviewer to help with consistent documentation? Alternatively, we could ask @rosetn to review the changes to dataflow.rst if she can.


mik-laj (Member) commented Jan 8, 2021:

> Does the Airflow community have a reviewer to help with consistent documentation?

The community doesn't have a technical writer. We rely only on contributions from other people, and they are mostly developers.

@TobKed TobKed force-pushed the howto-for-dataflow branch from d45fb37 to c5899cf Compare January 11, 2021 13:28
TobKed (Contributor, Author) commented Jan 11, 2021:

I made some changes. PTAL @mik-laj @aaltay

github-actions (bot) commented:

The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.


There are several ways to run a Dataflow pipeline depending on your environment and source files:

- **Non-templated pipeline**: Developer can run the pipeline as a local process on the worker
Member commented:

To clarify: "worker" here is an Airflow worker, not a Dataflow worker, right?

TobKed (Contributor, Author) replied:

You are right. I changed it to "Airflow worker".

- **Classic templates**. Developers run the pipeline and create a template. The Apache Beam SDK stages
files in Cloud Storage, creates a template file (similar to a job request),
and saves the template file in Cloud Storage. See: :ref:`howto/operator:DataflowTemplatedJobStartOperator`
- **Flex Templates**. Developers package the pipeline into a Docker image and then use the ``gcloud``
Member commented:

Would this not require ``gcloud`` as a dependency pre-installed on the Airflow worker nodes? (Similar to the JRE or Python requirements above.)

mik-laj (Member) replied:

It seems to me that only the SQL operator requires the Google Cloud CLI to be installed.

TobKed (Contributor, Author) replied:

@mik-laj is right. Only DataflowStartSqlJobOperator requires gcloud.

I added a warning about the required gcloud SDK to the DataflowStartSqlJobOperator section and to the operator itself.
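For reference, a minimal sketch of DataflowStartSqlJobOperator along the lines of the provider's example DAG; the project, dataset, and table names are hypothetical placeholders:

```python
from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowStartSqlJobOperator,
)
from airflow.utils.dates import days_ago

with DAG(
    "example_dataflow_sql",
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    # This operator shells out to the gcloud CLI, so the gcloud SDK must be
    # pre-installed on the Airflow worker (the warning added in this PR).
    start_sql_job = DataflowStartSqlJobOperator(
        task_id="start_sql_job",
        job_name="example-sql-job",
        query="""
            SELECT sales_region, COUNT(*) AS total
            FROM bigquery.table.`my-project`.my_dataset.my_table
            GROUP BY sales_region;
        """,
        options={
            "bigquery-project": "my-project",
            "bigquery-dataset": "my_dataset",
            "bigquery-table": "output_table",
        },
        location="europe-west3",
        do_xcom_push=True,
    )
```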

@TobKed TobKed force-pushed the howto-for-dataflow branch from 5eed586 to 59251a3 Compare January 12, 2021 11:51
github-actions (bot) commented:

The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.

@TobKed TobKed force-pushed the howto-for-dataflow branch from 59251a3 to 9beba59 Compare January 13, 2021 09:01
@TobKed TobKed force-pushed the howto-for-dataflow branch from 9beba59 to d3d9d9b Compare January 19, 2021 07:38
@TobKed TobKed requested a review from mik-laj January 19, 2021 17:50
@mik-laj mik-laj merged commit 70bf307 into apache:master Jan 21, 2021
@mik-laj mik-laj deleted the howto-for-dataflow branch January 21, 2021 10:41