Add documentation on testing BigQuery from ingestion-beam #543

Merged
merged 4 commits into from
Apr 29, 2019
3 changes: 3 additions & 0 deletions .mermaid
@@ -0,0 +1,3 @@
{
"themeCSS": ".label foreignObject { overflow: visible; }"
}
3 changes: 3 additions & 0 deletions .spelling
@@ -29,6 +29,7 @@ failsafe
featureful
filesystem
GCP
GCS
GeoIP
GKE
GroupByKey
@@ -58,11 +59,13 @@ protobuf
PubSub
PubsubMessage
Q4
readme
Redis
Republisher
runtime
S3
schemas
SDK
sharding
SQLite
stderr
1 change: 1 addition & 0 deletions bin/update-diagrams
@@ -12,5 +12,6 @@ for f in $(find . -name "*.mmd"); do
--volume $PWD:/root/project \
--workdir /root/project \
$IMAGE \
-c .mermaid \
-i ${f} -o ${f/.mmd/.svg}
done
13 changes: 7 additions & 6 deletions docs/architecture/diagram.svg
27 changes: 27 additions & 0 deletions docs/diagrams/workflow.mmd
@@ -0,0 +1,27 @@
graph TD

subgraph dataops/sandbox/my-project
Review comment (Contributor): Is it possible to decrease the font sizes? This is rendering strangely for me: most of the text looks clipped inside the boxes.

Reply (Contributor Author): It looks like this will fix it -- mermaid-js/mermaid#790 (comment). The online editor does the right thing. I think the mermaid CLI tool and the online tool are out of sync.

Reply (Contributor): It looks better for me now too.

dataflow
bigquery
pubsub
subscription

pubsub --> |gcloud pubsub subscriptions| subscription
subscription --> dataflow
dataflow --> bigquery
end

subgraph mozilla-pipeline-schemas
mps[repository archive]
end

subgraph ingestion-beam
src[src/]
schemas[schemas.tar.gz]
bq-schemas[bq-schemas/]

src --> |mvn compile| dataflow
mps --> |download-schemas| schemas
schemas --> |generate-bq-schemas| bq-schemas
bq-schemas --> |update-bq-table| bigquery
end
5 changes: 5 additions & 0 deletions docs/diagrams/workflow.svg
85 changes: 85 additions & 0 deletions docs/ingestion_testing_workflow.md
@@ -0,0 +1,85 @@
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->


- [Ingestion Testing Workflow](#ingestion-testing-workflow)
  - [Setting up the GCS project](#setting-up-the-gcs-project)
  - [Bootstrapping schemas from `mozilla-pipeline-schemas`](#bootstrapping-schemas-from-mozilla-pipeline-schemas)
  - [Building the project](#building-the-project)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

# Ingestion Testing Workflow

The ingestion-beam project handles the flow of documents from the edge into
various sinks. You may want to stand up a small testing instance to validate
the integration of its components.

![diagrams/workflow.mmd](diagrams/workflow.svg)
__Figure__: _An overview of the various components necessary to query BigQuery
against data from a PubSub subscription._

## Setting up the GCS project

Read through [`whd/gcp-quickstart`](https://github.com/whd/gcp-quickstart) for details
about the sandbox environment that is provided by data operations.

* Install the [Google Cloud SDK](https://cloud.google.com/sdk/)
* Navigate to the [Google Cloud Console](https://console.cloud.google.com/)
* Create a new project under `firefox.gcp.mozilla.com/dataops/sandbox`
  - `gcloud config set project <PROJECT>`
* Create a PubSub subscription (see `gcp-quickstart/pubsub.sh`)
* Create a GCS bucket
  - `gsutil mb gs://<PROJECT>`
* Enable the [Dataflow API](https://console.cloud.google.com/marketplace/details/google/dataflow.googleapis.com)
* Create a service account and store the key locally
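
The steps above can be sketched as a dry-run script. This is a minimal sketch, not the procedure used by data operations: the subscription name `ingestion-test`, the service account name `ingestion-sandbox`, and the `setup_sandbox` and `run` helpers are all hypothetical, and the topic to subscribe to is passed in as an argument because the source does not name one. Commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch of the sandbox setup; prints commands instead of running them
# unless DRY_RUN=0. Resource names below are placeholders (assumptions).
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

setup_sandbox() {
  project="$1"   # sandbox project id
  topic="$2"     # PubSub topic to subscribe to (assumed to exist already)
  sa="ingestion-sandbox@${project}.iam.gserviceaccount.com"

  run gcloud config set project "$project"
  run gcloud pubsub subscriptions create ingestion-test --topic="$topic"
  run gsutil mb "gs://${project}"
  run gcloud services enable dataflow.googleapis.com
  run gcloud iam service-accounts create ingestion-sandbox
  # keys.json matches the GOOGLE_APPLICATION_CREDENTIALS value used later
  run gcloud iam service-accounts keys create keys.json --iam-account="$sa"
}

setup_sandbox "my-project" "my-topic"
```

The service account will likely also need roles granted before a Dataflow job can run; which roles depends on the sandbox, so that step is left out here.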


## Bootstrapping schemas from `mozilla-pipeline-schemas`

* Download the latest schemas from `mozilla-pipeline-schemas` using `bin/download-schemas`.
  - This script may also inject testing resources into the resulting archive.
  - A `schemas.tar.gz` will appear at the project root.
* Generate BigQuery schemas using `bin/generate-bq-schemas`.
  - Schemas will be written to `bq-schemas/`.
    ```
    bq-schemas/
    ├── activity-stream.impression-stats.1.bigquery.json
    ├── coverage.coverage.1.bigquery.json
    ├── edge-validator.error-report.1.bigquery.json
    ├── eng-workflow.bmobugs.1.bigquery.json
    ....
    ```
* Update the BigQuery table in the current project using `bin/update-bq-table`.
  - This may take several minutes. Read the script for usage information.
* Verify that tables have been updated by viewing the BigQuery console.
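
When checking the BigQuery console, it can help to know which table each schema file should correspond to. The sketch below derives a table name from a schema file name, following the `namespace.doctype.version.bigquery.json` layout shown in the listing and the `namespace.doctype_vVERSION` pattern used for job output. The `schema_to_table` helper is hypothetical, and replacing hyphens with underscores is an assumption about how names are normalized, not something the scripts are confirmed to do.

```shell
#!/bin/sh
# Hypothetical helper: map a bq-schemas/ file name to PROJECT:dataset.table.
schema_to_table() {
  # $1: GCP project id, $2: path to a *.bigquery.json schema file
  name=$(basename "$2" .bigquery.json)   # e.g. activity-stream.impression-stats.1
  rest=${name#*.}                        # e.g. impression-stats.1
  # Assumption: hyphens are normalized to underscores in BigQuery names.
  namespace=$(printf '%s' "${name%%.*}" | tr '-' '_')
  doctype=$(printf '%s' "${rest%.*}" | tr '-' '_')
  version=${rest##*.}
  printf '%s:%s.%s_v%s\n' "$1" "$namespace" "$doctype" "$version"
}

schema_to_table my-project bq-schemas/activity-stream.impression-stats.1.bigquery.json
# prints: my-project:activity_stream.impression_stats_v1
```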


## Building the project

Follow the instructions in the project readme. Here is a quick reference for running a job from a set of files in GCS.

```bash
# Service account key created during project setup
export GOOGLE_APPLICATION_CREDENTIALS=keys.json
PROJECT=$(gcloud config get-value project)
BUCKET="gs://$PROJECT"

# The glob is quoted, so the shell passes it to the job unexpanded
path="$BUCKET/data/*.ndjson"
# The \${...} placeholders are escaped so they reach the job literally
mvn compile exec:java -Dexec.args="\
    --runner=Dataflow \
    --project=$PROJECT \
    --autoscalingAlgorithm=NONE \
    --workerMachineType=n1-standard-1 \
    --numWorkers=1 \
    --gcpTempLocation=$BUCKET/tmp \
    --inputFileFormat=json \
    --inputType=file \
    --input=$path \
    --outputType=bigquery \
    --output=$PROJECT:\${document_namespace}.\${document_type}_v\${document_version} \
    --bqWriteMethod=file_loads \
    --tempLocation=$BUCKET/temp/bq-loads \
    --errorOutputType=file \
    --errorOutput=$BUCKET/error/ \
"
```
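
The backslashes in the `--output` argument matter. Inside double quotes the shell expands `$PROJECT` immediately, while `\$` yields a literal `$`, so the `${document_namespace}`-style placeholders are handed to the job as text rather than being replaced by (most likely unset) shell variables. A small sketch of that quoting behavior, using a placeholder project id rather than a real one:

```shell
#!/bin/sh
# "my-project" is a placeholder; in the command above it comes from gcloud.
PROJECT=my-project

# $PROJECT is expanded by the shell; the escaped \${...} parts are kept as-is.
output="--output=$PROJECT:\${document_namespace}.\${document_type}_v\${document_version}"

printf '%s\n' "$output"
# prints: --output=my-project:${document_namespace}.${document_type}_v${document_version}
```

Printing the argument string this way before launching a job is a cheap check that the placeholders survived quoting.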
3 changes: 3 additions & 0 deletions ingestion-beam/README.md
@@ -562,6 +562,9 @@ use the `bin/mvn` executable to run maven in docker:
./bin/mvn clean test
```

To run the project in a sandbox against production data, see the document on
[configuring an integration testing workflow](../docs/ingestion_testing_workflow.md).

# License

This Source Code Form is subject to the terms of the Mozilla Public