Google Cloud Dataflow Template Pipelines

These Dataflow templates aim to solve simple but large in-Cloud data tasks, including data import/export/backup/restore and bulk API operations, without a development environment. Under the hood, these operations are made possible by the Google Cloud Dataflow service combined with a set of templated Apache Beam SDK pipelines.

Google provides this collection of pre-implemented Dataflow templates as a reference and as a starting point that developers can customize to extend their functionality.

Note on Default Branch

As of November 18, 2021, our default branch is now named "main". This does not affect forks. If you would like your fork and its local clone to reflect these changes you can follow GitHub's branch renaming guide.

Building

Maven commands should be run on the unified-templates.xml aggregator POM. An example would be:

mvn clean install -f unified-templates.xml -pl v2/pubsub-binary-to-bigquery -am

Template Pipelines

* Supports user-defined functions (UDFs).

For documentation on each template's usage and parameters, please see the official docs.

Getting Started

Requirements

Java 11
Maven 3

Building the Project

Build the entire project using the Maven compile command.

mvn clean compile

Building/Testing from IntelliJ

IntelliJ, by default, will often skip necessary Maven goals, leading to build failures. You can fix this in the Maven view by going to Module_Name > Plugins > Plugin_Name, where Module_Name and Plugin_Name are the names of the module and plugin that own the goal. From there, right-click the goal and select "Execute Before Build".

The goals known to require this are:

common > Plugins > protobuf > protobuf:compile
common > Plugins > protobuf > protobuf:test-compile

Formatting Code

From either the root directory or v2/ directory, run:

mvn spotless:apply

This will format the code and add a license header. To verify that the code is formatted correctly, run:

mvn spotless:check

Run the commands from the v2/ directory if your changes are under v2/; otherwise, run them from the root directory.

Creating a Template File

Dataflow templates can be created using a Maven command which builds the project and stages the template file on Google Cloud Storage. Any parameters passed at template build time cannot be overridden at execution time.

mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.<template-class> \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=<project-id> \
--stagingLocation=gs://<bucket-name>/staging \
--tempLocation=gs://<bucket-name>/temp \
--templateLocation=gs://<bucket-name>/templates/<template-name>.json \
--runner=DataflowRunner"

Executing a Template File

Once the template is staged on Google Cloud Storage, it can be executed using the gcloud CLI tool. The runtime parameters required by the template are passed in the parameters field as a comma-separated list of paramName=Value pairs.

gcloud dataflow jobs run <job-name> \
--gcs-location=<template-location> \
--zone=<zone> \
--parameters <parameters>
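If the runtime parameters are kept in a script, the comma-separated paramName=Value list can be assembled programmatically. The following is a minimal sketch, not part of the templates: buildParameters is a hypothetical helper, and the parameter names and values in the usage lines are illustrative only. Because the list is comma-separated, the sketch rejects values that themselves contain a comma rather than guessing how to escape them.

```javascript
/**
 * Hypothetical helper: joins an object of runtime parameters into the
 * comma-separated paramName=Value form described above.
 * @param {Object} params map of parameter name to value
 * @return {string} the comma-separated parameter list
 */
function buildParameters(params) {
  var pairs = [];
  Object.keys(params).forEach(function (name) {
    var value = String(params[name]);
    // A comma inside a value would be read as a separator, so fail early.
    if (value.indexOf(',') !== -1) {
      throw new Error('Value for ' + name + ' contains a comma');
    }
    pairs.push(name + '=' + value);
  });
  return pairs.join(',');
}

// Illustrative parameter names and values (assumptions, not from a specific template).
var parameterList = buildParameters({
  inputTopic: 'projects/my-project/topics/my-topic',
  outputTableSpec: 'my-project:my_dataset.my_table'
});
console.log(parameterList);
```

The resulting string is what would be supplied in place of <parameters> in the command above.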

Using UDFs

User-defined functions (UDFs) allow you to customize a template's functionality by providing a short JavaScript function without having to maintain the entire codebase. This is useful in situations where you'd like to rename fields, filter values, or even transform data formats before output to the destination. All UDFs are executed by providing the payload of the element as a string to the JavaScript function. You can then use JavaScript's built-in JSON parser or other system functions to transform the data prior to the pipeline's output. The value returned by a UDF is the payload passed forward in the pipeline, and it should always be a string. If no value is returned or the function returns undefined, the incoming record will be filtered from the output.
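One way to check a UDF against this contract before deploying it is to call it locally with a string payload. The sketch below assumes Node.js is available on your machine; checkUdf is a hypothetical helper that is not part of the templates, and the sample transform exists only to exercise it.

```javascript
// Sample UDF (illustrative only): drops records that have no "id" field.
function transform(inJson) {
  var obj = JSON.parse(inJson);
  if (!obj.hasOwnProperty('id')) {
    return undefined; // the record is filtered from the output
  }
  return JSON.stringify(obj);
}

// Hypothetical local helper: runs a UDF on one payload and reports
// whether the record would be kept or filtered.
function checkUdf(udf, payload) {
  var result = udf(payload);
  if (result === undefined) {
    return { kept: false, output: null };
  }
  if (typeof result !== 'string') {
    throw new TypeError('UDF must return a string, got ' + typeof result);
  }
  return { kept: true, output: result };
}

console.log(checkUdf(transform, '{"id":1}'));
console.log(checkUdf(transform, '{"name":"x"}'));
```

This only checks the string-in, string-or-undefined-out shape described above; it does not reproduce the JavaScript engine or runtime environment that the Dataflow service uses.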

UDF Function Specification

Template	UDF Input Type	Input Description	UDF Output Type	Output Description
Datastore Bulk Delete	String	A JSON string of the entity	String	A JSON string of the entity to delete; filter entities by returning undefined
Datastore to Pub/Sub	String	A JSON string of the entity	String	The payload to publish to Pub/Sub
Datastore to GCS Text	String	A JSON string of the entity	String	A single line within the output file
GCS Text to BigQuery	String	A single line within the input file	String	A JSON string which matches the destination table's schema
Pub/Sub to BigQuery	String	A string representation of the incoming payload	String	A JSON string which matches the destination table's schema
Pub/Sub to Datastore	String	A string representation of the incoming payload	String	A JSON string of the entity to write to Datastore
Pub/Sub to Splunk	String	A string representation of the incoming payload	String	The event data to be sent to Splunk HEC events endpoint. Must be a string or a stringified JSON object
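To illustrate the GCS Text to BigQuery row, here is a minimal sketch of a UDF that turns one line of input into a JSON string. It assumes a hypothetical destination table with the columns id (INTEGER), name (STRING), and amount (FLOAT), and comma-separated input lines that contain no quoted or escaped commas; neither assumption comes from the templates themselves.

```javascript
/**
 * Converts one comma-separated line into a JSON string.
 * Assumes a hypothetical table schema: id (INTEGER), name (STRING),
 * amount (FLOAT), and lines without quoted or escaped commas.
 * @param {string} line a single line from the input file
 * @return {string|undefined} JSON string, or undefined to filter the line
 */
function transform(line) {
  var values = line.split(',');
  if (values.length !== 3) {
    return undefined; // malformed line: filter it from the output
  }
  var obj = {
    id: parseInt(values[0], 10),
    name: values[1].trim(),
    amount: parseFloat(values[2])
  };
  return JSON.stringify(obj);
}
```

Filtering malformed lines by returning undefined is a choice made for this sketch; a pipeline that must account for every input line would need a different way of handling them.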

UDF Examples

Adding fields

/**
 * A transform which adds a field to the incoming data.
 * @param {string} inJson
 * @return {string} outJson
 */
function transform(inJson) {
  var obj = JSON.parse(inJson);
  obj.dataFeed = "Real-time Transactions";
  obj.dataSource = "POS";
  return JSON.stringify(obj);
}

Filtering records

/**
 * A transform function which only accepts 42 as the answer to life.
 * @param {string} inJson
 * @return {string} outJson
 */
function transform(inJson) {
  var obj = JSON.parse(inJson);
  // only output objects which have an answer to life of 42.
  if (obj.hasOwnProperty('answerToLife') && obj.answerToLife === 42) {
    return JSON.stringify(obj);
  }
}
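Renaming a field, one of the use cases mentioned earlier, follows the same parse, modify, and stringify pattern. The sketch below is illustrative: the field names usr_id and userId are hypothetical and not taken from any template.

```javascript
/**
 * A transform which renames a field in the incoming data.
 * The field names usr_id and userId are hypothetical.
 * @param {string} inJson
 * @return {string} outJson
 */
function transform(inJson) {
  var obj = JSON.parse(inJson);
  if (obj.hasOwnProperty('usr_id')) {
    // Copy the value under the new name, then remove the old field.
    obj.userId = obj.usr_id;
    delete obj.usr_id;
  }
  return JSON.stringify(obj);
}
```

Records that do not contain the old field pass through unchanged.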
