From 3df711f15c948a9b13a3ac2df3da49b916a3c5d0 Mon Sep 17 00:00:00 2001
From: Anthony Miyaguchi
Date: Fri, 26 Apr 2019 16:23:10 -0700
Subject: [PATCH] Create documentation on testing BigQuery from ingestion-beam.

---
 docs/architecture/diagram.svg      | 12 ++---
 docs/diagrams/workflow.mmd         | 27 ++++++++++
 docs/diagrams/workflow.svg         |  4 ++
 docs/ingestion_testing_workflow.md | 85 ++++++++++++++++++++++++++++++
 ingestion-beam/README.md           |  3 ++
 5 files changed, 125 insertions(+), 6 deletions(-)
 create mode 100644 docs/diagrams/workflow.mmd
 create mode 100644 docs/diagrams/workflow.svg
 create mode 100644 docs/ingestion_testing_workflow.md

diff --git a/docs/architecture/diagram.svg b/docs/architecture/diagram.svg
index 4006fd861d..1a48990bbb 100644
--- a/docs/architecture/diagram.svg
+++ b/docs/architecture/diagram.svg
@@ -1,8 +1,8 @@
-
[SVG markup for docs/architecture/diagram.svg not recoverable; surviving text labels. Legend ("Colors"): Dataflow jobs are green; Kubernetes services are magenta; Producers are orange; PubSub topics are cyan; Google Cloud services are purple. Nodes: Producers, Ingestion Edge, Raw Topics, Landfill Sink, Cloud Storage, Decoder, Cloud Memorystore, Decoded Topics, BigQuery Sink, BigQuery, Dataset Sink, Cloud Storage, Republisher, Per DocType Topics, Monitoring Sample Topics.]
\ No newline at end of file
diff --git a/docs/diagrams/workflow.mmd b/docs/diagrams/workflow.mmd
new file mode 100644
index 0000000000..1bdb522498
--- /dev/null
+++ b/docs/diagrams/workflow.mmd
@@ -0,0 +1,27 @@
+graph TD
+
+subgraph dataops/sandbox/my-project
+    dataflow
+    bigquery
+    pubsub
+    subscription
+
+    pubsub --> |gcloud pubsub subscriptions create| subscription
+    subscription --> dataflow
+    dataflow --> bigquery
+end
+
+subgraph mozilla-pipeline-schemas
+    mps[repository archive]
+end
+
+subgraph ingestion-beam
+    src[src/]
+    schemas[schemas.tar.gz]
+    bq-schemas[bq-schemas/]
+
+    src --> |mvn compile exec:java| dataflow
+    mps --> |download-schemas| schemas
+    schemas --> |generate-bq-schemas| bq-schemas
+    bq-schemas --> |update-bq-table| bigquery
+end
diff --git a/docs/diagrams/workflow.svg b/docs/diagrams/workflow.svg
new file mode 100644
index 0000000000..eb17819d96
--- /dev/null
+++ b/docs/diagrams/workflow.svg
@@ -0,0 +1,4 @@
+
[SVG markup for docs/diagrams/workflow.svg not recoverable; surviving text labels. Subgraphs: ingestion-beam, mozilla-pipeline-schemas, dataops/sandbox/my-project. Edge labels: gcloud pubsub subscriptions create, mvn compile exec:java, download-schemas, generate-bq-schemas, update-bq-table. Nodes: src/, schemas.tar.gz, bq-schemas/, dataflow, repository archive, bigquery, pubsub, subscription.]
\ No newline at end of file
diff --git a/docs/ingestion_testing_workflow.md b/docs/ingestion_testing_workflow.md
new file mode 100644
index 0000000000..e96be09342
--- /dev/null
+++ b/docs/ingestion_testing_workflow.md
@@ -0,0 +1,85 @@
+
+
+
+
+- [Ingestion Testing Workflow](#ingestion-testing-workflow)
+  - [Setting up the GCP project](#setting-up-the-gcp-project)
+  - [Bootstrapping schemas from `mozilla-pipeline-schemas`](#bootstrapping-schemas-from-mozilla-pipeline-schemas)
+  - [Building the project](#building-the-project)
+
+
+
+# Ingestion Testing Workflow
+
+The `ingestion-beam` project handles the flow of documents from the edge into
+various sinks. You may want to stand up a small testing instance to validate
+the integration of the various components.
+
+![diagrams/workflow.mmd](diagrams/workflow.svg)
+__Figure__: _An overview of the various components necessary to query BigQuery
+against data from a pubsub subscription._
+
+## Setting up the GCP project
+
+Read through [`whd/gcp-quickstart`](https://github.com/whd/gcp-quickstart) for details
+about the sandboxing environment that is provided by data operations.
+
+* Install the [Google Cloud SDK](https://cloud.google.com/sdk/)
+* Navigate to the [Google Cloud Console](https://console.cloud.google.com/)
+* Create a new project under `firefox.gcp.mozilla.com/dataops/sandbox`
+    - `gcloud config set project `
+* Create a PubSub subscription (see `gcp-quickstart/pubsub.sh`)
+* Create a GCS bucket
+    - `gsutil mb gs://`
+* Enable the [Dataflow API](https://console.cloud.google.com/marketplace/details/google/dataflow.googleapis.com)
+* Create a service account and store the key locally
+
+
+## Bootstrapping schemas from `mozilla-pipeline-schemas`
+
+* Download the latest schemas from `mozilla-pipeline-schemas` using `bin/download-schemas`.
+    - This script may also inject testing resources into the resulting archive.
+    - A `schemas.tar.gz` will appear at the project root.
+* Generate BigQuery schemas using `bin/generate-bq-schemas`.
+    - Schemas will be written to `bq-schemas/`.
+    ```
+    bq-schemas/
+    ├── activity-stream.impression-stats.1.bigquery.json
+    ├── coverage.coverage.1.bigquery.json
+    ├── edge-validator.error-report.1.bigquery.json
+    ├── eng-workflow.bmobugs.1.bigquery.json
+    ....
+    ```
+* Update the BigQuery tables in the current project using `bin/update-bq-table`.
+    - This may take several minutes. Read the script for usage information.
+* Verify that the tables have been updated by viewing them in the BigQuery console.
+
+
+## Building the project
+
+Follow the instructions in the project README. Here is a quick reference for running a job from a set of files in GCS.
+
+```bash
+export GOOGLE_APPLICATION_CREDENTIALS=keys.json
+PROJECT=$(gcloud config get-value project)
+BUCKET="gs://$PROJECT"
+
+path="$BUCKET/data/*.ndjson"
+mvn compile exec:java -Dexec.args="\
+    --runner=Dataflow \
+    --project=$PROJECT \
+    --autoscalingAlgorithm=NONE \
+    --workerMachineType=n1-standard-1 \
+    --numWorkers=1 \
+    --gcpTempLocation=$BUCKET/tmp \
+    --inputFileFormat=json \
+    --inputType=file \
+    --input=$path \
+    --outputType=bigquery \
+    --output=$PROJECT:\${document_namespace}.\${document_type}_v\${document_version} \
+    --bqWriteMethod=file_loads \
+    --tempLocation=$BUCKET/temp/bq-loads \
+    --errorOutputType=file \
+    --errorOutput=$BUCKET/error/ \
+"
+```
diff --git a/ingestion-beam/README.md b/ingestion-beam/README.md
index 2cf153b48c..8e6720bdd0 100644
--- a/ingestion-beam/README.md
+++ b/ingestion-beam/README.md
@@ -562,6 +562,9 @@ use the `bin/mvn` executable to run maven in docker:
 ./bin/mvn clean test
 ```
 
+To run the project in a sandbox against production data, see this document on
+[configuring an integration testing workflow](../docs/ingestion_testing_workflow.md).
+
 # License
 
 This Source Code Form is subject to the terms of the Mozilla Public