
Add How To Guide for Dataflow #13461

Merged: 3 commits merged into apache:master on Jan 21, 2021
Conversation

TobKed (Contributor) commented Jan 4, 2021

Retaken from #8809


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@boring-cyborg boring-cyborg bot added kind:documentation provider:google Google (including GCP) related issues labels Jan 4, 2021
@TobKed TobKed force-pushed the howto-for-dataflow branch from 43646ae to d45fb37 Compare January 5, 2021 13:14
TobKed (Contributor, Author) commented Jan 7, 2021

cc @tanjinP @mik-laj @aaltay

Ways to run a data pipeline
^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are multiple options to execute a Dataflow pipeline on Airflow. If looking to execute the pipeline
mik-laj (Member) commented Jan 7, 2021:

I would like to expand this, because it is very problematic for users.

I would suggest the following text, though it can probably be improved and expanded.

There are several ways to run a Dataflow pipeline depending on your environment and source files:

- **Non-templated pipeline**: Developers can run the pipeline as a local process on the worker if they have a ``*.jar`` file for Java or a ``*.py`` file for Python. This also means that the necessary system dependencies must be installed on the worker. For Java, the worker must have the JRE installed; for Python, the Python interpreter. The runtime versions must be compatible with the pipeline versions. This is the fastest way to start a pipeline, but because of its frequent problems with system dependencies, it often causes problems.
- **Templated pipeline**: The programmer can make the pipeline independent of the environment by preparing a template that will then be run on a machine managed by Google. This way, changes to the environment won't affect your pipeline. There are two types of templates:
     - **Classic templates**. Developers run the pipeline and create a template. The Apache Beam SDK stages files in Cloud Storage, creates a template file (similar to a job request), and saves the template file in Cloud Storage.
     - **Flex Templates**. Developers package the pipeline into a Docker image and then use the ``gcloud`` command-line tool to build and save the Flex Template spec file in Cloud Storage.
- **SQL pipeline**: Developers can write a pipeline as an SQL statement and then execute it in Dataflow.

It is a good idea to test your pipeline using the non-templated pipeline, and then run the pipeline in production using the templates.

For details on the differences between the pipeline types, see `Dataflow templates <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>`__ in the Google Cloud documentation.
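To make the non-templated option concrete, here is a minimal sketch of a DAG that runs a Python pipeline as a local process on the worker via DataflowCreatePythonJobOperator. The bucket paths, job name, and region are hypothetical placeholders:

```python
from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowCreatePythonJobOperator,
)
from airflow.utils.dates import days_ago

with DAG(
    "example_dataflow_non_templated",
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    # The pipeline source runs as a local process on the worker, so the
    # worker needs a compatible Python interpreter; py_requirements is
    # installed on the worker before the pipeline starts.
    start_python_job = DataflowCreatePythonJobOperator(
        task_id="start_python_job",
        py_file="gs://my-bucket/pipelines/wordcount.py",  # hypothetical source file
        job_name="example-non-templated-job",
        py_requirements=["apache-beam[gcp]"],
        options={
            "output": "gs://my-bucket/output/results",  # hypothetical paths
            "temp_location": "gs://my-bucket/tmp/",
        },
        location="europe-west3",
    )
```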

aaltay (Member) commented:

I like the expanded version.

"This is the fastest way to start a pipeline, but because of its frequent problems with system dependencies, it often causes problems." is very strongly worded. We could maybe add more explanation of how to manage dependency problems instead; otherwise this will prevent most users from trying this option. And I think the Airflow operator has been developed significantly over time and allows managing dependencies, so I am guessing this is more of a documentation problem than a problem with the operator itself.

mik-laj (Member) replied:

Very often these problems are not easy to solve, because one common Docker image is used for many environments. For example, in Cloud Composer you cannot install any system dependencies; the only thing you can do is install new libraries via pip. I agree that this is a fairly strongly-worded sentence, and we can think about improving it.

TobKed (Contributor, Author) replied:

I slightly rephrased it from "it often causes problems." to "it may cause problems."

batch asynchronously (fire and forget), batch blocking (wait until completion), or streaming (run indefinitely).
In Airflow it is best practice to use asynchronous batch pipelines or streams and use sensors to listen for expected job state.
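The sensor pattern described above can be sketched roughly as follows, using the provider's DataflowJobStatusSensor. The template path, region, and the XCom key used to recover the job ID are assumptions and may differ between provider versions:

```python
from airflow import DAG
from airflow.providers.google.cloud.hooks.dataflow import DataflowJobStatus
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)
from airflow.providers.google.cloud.sensors.dataflow import DataflowJobStatusSensor
from airflow.utils.dates import days_ago

with DAG(
    "example_dataflow_async_with_sensor",
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    start_job = DataflowTemplatedJobStartOperator(
        task_id="start_job",
        template="gs://dataflow-templates/latest/Word_Count",  # a public classic template
        parameters={
            "inputFile": "gs://my-bucket/input.txt",  # hypothetical bucket
            "output": "gs://my-bucket/output/results",
        },
        job_name="example-async-wordcount",
        location="europe-west3",
        wait_until_finished=False,  # fire and forget: do not block the worker slot
    )

    # The start task pushes the created job to XCom; the exact layout can
    # differ between provider versions, so the 'id' key is an assumption here.
    wait_for_job = DataflowJobStatusSensor(
        task_id="wait_for_job",
        job_id="{{ task_instance.xcom_pull('start_job')['id'] }}",
        expected_statuses={DataflowJobStatus.JOB_STATE_DONE},
        location="europe-west3",
    )

    start_job >> wait_for_job
```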

By default :class:`~airflow.providers.google.cloud.operators.dataflow.DataflowCreateJavaJobOperator`
Member commented:

And what is the behavior of the DataflowStartFlexTemplateOperator? It seems to me that these sections need to be generalized a little in order to describe the general assumptions and only then describe the specific cases. I suspect that a well-written section will allow us to delete subsections.

TobKed (Contributor, Author) replied Jan 11, 2021:

I refactored it and added a description of how it works with the template operators.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Dataflow has multiple options for executing pipelines. It can be done in the following modes:
batch asynchronously (fire and forget), batch blocking (wait until completion), or streaming (run indefinitely).
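As a rough sketch of the first two modes, assuming the operators' wait_until_finished argument; template paths and job names are placeholders:

```python
from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)
from airflow.utils.dates import days_ago

with DAG(
    "example_dataflow_modes",
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    # Batch blocking: the task stays running until the Dataflow job completes.
    blocking_job = DataflowTemplatedJobStartOperator(
        task_id="blocking_job",
        template="gs://dataflow-templates/latest/Word_Count",
        job_name="example-blocking-job",
        location="europe-west3",
        wait_until_finished=True,
    )

    # Batch asynchronous (fire and forget): the task succeeds as soon as
    # the job is submitted.
    fire_and_forget_job = DataflowTemplatedJobStartOperator(
        task_id="fire_and_forget_job",
        template="gs://dataflow-templates/latest/Word_Count",
        job_name="example-async-job",
        location="europe-west3",
        wait_until_finished=False,
    )
```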
Member commented:

Personally, I wouldn't consider streaming as another execution model.

aaltay (Member) left a comment:

The content looks good to me.

I think this review would benefit from a review by someone with a tech-writing background. Does the Airflow community have a reviewer to help with consistent documentation? Alternatively, we could ask @rosetn to review the changes to dataflow.rst if she can.


mik-laj (Member) commented Jan 8, 2021:

> Does the Airflow community have a reviewer to help with consistent documentation?

The community doesn't have a technical writer. We rely only on contributions from other people, and they are mostly developers.

@TobKed TobKed force-pushed the howto-for-dataflow branch from d45fb37 to c5899cf Compare January 11, 2021 13:28
TobKed (Contributor, Author) commented Jan 11, 2021:

I made some changes. PTAL @mik-laj @aaltay

github-actions (bot) commented:

The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.


There are several ways to run a Dataflow pipeline depending on your environment and source files:

- **Non-templated pipeline**: Developer can run the pipeline as a local process on the worker
Member commented:

To clarify: "worker" here is an Airflow worker, not a Dataflow worker, right?

TobKed (Contributor, Author) replied:

You are right. I changed it to "Airflow worker".

- **Classic templates**. Developers run the pipeline and create a template. The Apache Beam SDK stages
files in Cloud Storage, creates a template file (similar to a job request),
and saves the template file in Cloud Storage. See: :ref:`howto/operator:DataflowTemplatedJobStartOperator`
- **Flex Templates**. Developers package the pipeline into a Docker image and then use the ``gcloud``
Member commented:

Would this not require ``gcloud`` as a dependency pre-installed on the Airflow worker nodes? (Similar to the JRE or Python requirements above.)

mik-laj (Member) replied:

It seems to me that only the SQL operator requires the Google Cloud CLI to be installed.

TobKed (Contributor, Author) replied:

@mik-laj is right. Only DataflowStartSqlJobOperator requires gcloud.

I added a warning about the required gcloud SDK to the DataflowStartSqlJobOperator section and to the operator itself.
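For reference, a minimal sketch of DataflowStartSqlJobOperator along the lines of the provider's example DAG; the project, dataset, and table names are hypothetical placeholders:

```python
from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowStartSqlJobOperator,
)
from airflow.utils.dates import days_ago

with DAG(
    "example_dataflow_sql",
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    # This operator shells out to the gcloud CLI, so the gcloud SDK must be
    # pre-installed on the Airflow worker (the warning added in this PR).
    start_sql_job = DataflowStartSqlJobOperator(
        task_id="start_sql_job",
        job_name="example-sql-job",
        query="""
            SELECT sales_region, COUNT(*) AS total
            FROM bigquery.table.`my-project`.my_dataset.my_table
            GROUP BY sales_region;
        """,
        options={
            "bigquery-project": "my-project",
            "bigquery-dataset": "my_dataset",
            "bigquery-table": "output_table",
        },
        location="europe-west3",
        do_xcom_push=True,
    )
```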

@TobKed TobKed force-pushed the howto-for-dataflow branch from 5eed586 to 59251a3 Compare January 12, 2021 11:51
github-actions (bot) commented:

The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.

@TobKed TobKed force-pushed the howto-for-dataflow branch from 59251a3 to 9beba59 Compare January 13, 2021 09:01
@TobKed TobKed force-pushed the howto-for-dataflow branch from 9beba59 to d3d9d9b Compare January 19, 2021 07:38
@TobKed TobKed requested a review from mik-laj January 19, 2021 17:50
@mik-laj mik-laj merged commit 70bf307 into apache:master Jan 21, 2021
@mik-laj mik-laj deleted the howto-for-dataflow branch January 21, 2021 10:41