Operationalized Data Lake/Warehouse - Ability to provide a highly available and highly scalable Operational Data Store (ODS) while keeping an Enterprise Data Warehouse (EDW) up to date in near real time, in order to obtain broader insight by cross-referencing against other sources of data in the enterprise.
SA Maintainer: Sig Narváez
SA Contributor: Paresh Saraf
SA Contributor: Munish Kapoor
SA Contributor: Chris Grabosky
Time to setup: 2+ hrs
Time to execute: 20 mins
Estimated GCP cost: ~$50
Estimated Atlas cost: <$5
This proof shows how MongoDB Atlas on Google Cloud can act as an Operational Data Layer and effectively co-exist with Google BigQuery acting as the Enterprise Data Warehouse.
The proof uses the sample movie dataset sample_mflix loaded into MongoDB Atlas. Google Cloud Dataflow is then used to transform the MongoDB collection into relational tables using the MongoDbIO connector from Apache Beam. Finally, the BigQueryIO connector is used to write the transformed data into Google BigQuery. We then join this data with a public dataset provided by Google BigQuery to show the power of consolidation across disparate data sources.
*** NOTE ***: The readme for this proof (this file) needs an overhaul given pull request ODS-EDW complete refactor #47. For now, please follow demo_steps.sh and demo_steps_ml.sh.
- A MongoDB Atlas account
- A Google Cloud Platform account
- Log on to your Atlas account.
- If you do not have an account, click Sign Up, then sign up with Google or enter your details.
- Click the Database Access link in the left column. On the new page, choose Add New User using the green button on the left. In the modal dialog, give the user the name appUser and the password appUser123. We will use some built-in roles, so click Add Default Privileges and, in the Default Privileges section, add the roles readWriteAnyDatabase and clusterMonitor, then press the green Add User button to create the user.
- Return to the clusters page by clicking Clusters on the left navigation column.
- Press the giant Build a Cluster button and then choose to create a single region cluster.
- Deploy a 3 node replica-set in the Google Cloud region of your choice (here us-west2) with default settings, no backup, MongoDB version 4.0, and give it a name of your choosing (here we will use MDB-Dataflow). The size must be M10 or larger. Here we used an M30. Click the Create Cluster button and wait for the cluster to complete deployment.
If this is a free account, you will be required to add your credit card to create a cluster other than M0. To add your credit card details, select Billing in the left navigation menu and add a payment method.
- Whitelist the IPs. For the demo we will allow access from 0.0.0.0/0. This is not recommended for a production setup, where the recommendation is to use VPC Peering and private IPs.
- Copy the Connection String and replace the password with the password of the appUser created earlier. This will be used in the Execution later.
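For reference, once the password is substituted, the SRV connection string used in this proof looks roughly like the line below (a sketch; the cluster hostname is illustrative, so copy the real one from the Atlas Connect dialog):
# Illustrative format only -- the hostname segment will differ for your cluster
mongodb+srv://appUser:appUser123@<your-cluster-hostname>.gcp.mongodb.net/sample_mflix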
MongoDB Atlas provides a sample dataset that you can use to quickly get started and begin experimenting with various tools like CRUD operations in Atlas and visualizations in Charts. To load the dataset:
- Navigate to your Clusters view: in the left navigation pane in Atlas, click Clusters.
- Open the Load Sample Dataset dialog: locate the cluster where you want to load sample data, click the Ellipses (…) button for that cluster, then click Load Sample Dataset.
- In the confirmation dialog, click Load Sample Dataset. The dialog closes and Atlas begins loading the sample dataset into your cluster.
- View your sample data by clicking your cluster's Collections button. You should see the following databases in your cluster:
S. No. | Database Name |
---|---|
1. | sample_airbnb |
2. | sample_geospatial |
3. | sample_mflix |
4. | sample_supplies |
5. | sample_training |
6. | sample_weatherdata |
We will be using the sample_mflix data for this exercise.
- Log on to your personal GCP account.
- In the Cloud Console, on the project selector page, create a new Cloud project named MongoDBAtlasODS-EDW.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the following Google Cloud services via the API by opening this URL in your browser (a gcloud equivalent is sketched below). You will need to select the MongoDBAtlasODS-EDW project.
https://console.cloud.google.com/flows/enableapi?apiid=dataflow,compute_component,logging,storage_component,storage_api,bigquery,pubsub,datastore.googleapis.com,cloudresourcemanager.googleapis.com
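If you prefer the command line, the same APIs can be enabled with gcloud (a sketch; it assumes the Cloud SDK is available and that you substitute your actual project id):
# Select the project created above
gcloud config set project <your GCP project id>
# Enable the same services the URL above enables
gcloud services enable dataflow.googleapis.com compute.googleapis.com \
  logging.googleapis.com storage-component.googleapis.com storage-api.googleapis.com \
  bigquery.googleapis.com pubsub.googleapis.com datastore.googleapis.com \
  cloudresourcemanager.googleapis.com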
- Create a VM Instance using GCP Compute Engine
- Use the following values for creating the VM instance. Ensure that you deploy in the same Google Cloud region as your MongoDB Atlas cluster (us-west2 in this example). An equivalent gcloud command is sketched after this list.
- Name: client-vm
- Region: us-west2
- Zone: a (here us-west2-a)
- Machine configuration: leave as General purpose and choose n1-standard-1
- Boot disk: Change to Ubuntu 18.04 LTS (you can choose something else but all instructions will be based on this)
- Identity and API access: Allow full access to all Cloud APIs
- Leave the other values as default.
- Then click Create
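Alternatively, roughly the same VM can be created and reached with gcloud (a sketch; the zone, machine type, and image mirror the console values above, and the scopes flag grants the same full Cloud API access):
# Create the client VM with Ubuntu 18.04 LTS in the same region as the Atlas cluster
gcloud compute instances create client-vm \
  --zone=us-west2-a \
  --machine-type=n1-standard-1 \
  --image-family=ubuntu-1804-lts \
  --image-project=ubuntu-os-cloud \
  --scopes=https://www.googleapis.com/auth/cloud-platform
# Then connect over SSH
gcloud compute ssh client-vm --zone=us-west2-a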
- Connect to the newly created VM using ssh
- Prepare the prerequisites for the machine by either running the commands in the following table one-by-one, or executing the prepMachine.sh script located in this directory (a consolidated sketch of these commands follows the notes below).
Purpose | Command |
---|---|
Confirm version is Ubuntu 18.04 LTS | lsb_release -a |
Update apt cache | sudo apt update |
Install git | sudo apt install -y git |
Install Java | sudo apt install -y default-jre |
Check Java version | java --version |
Make sure only one Java is installed | sudo update-alternatives --config java |
Set JAVA_HOME environment variable | export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 |
Make sure correct java is in path | export PATH=$PATH:$JAVA_HOME/bin |
Check Java version to make sure all is good | java --version |
Install Apache Maven | sudo apt install -y maven |
Make sure Maven is in path | mvn -v |
IF THE OUTPUT OF THE lsb_release COMMAND IS NOT THAT OF UBUNTU 18.04, YOU HAVE DEPLOYED THE WRONG VM FOR THESE INSTRUCTIONS. PLEASE DESTROY THE VM AND START OVER WITH THE CORRECT OPERATING SYSTEM.
ON THE export JAVA_HOME COMMAND, THE PATH LISTED SHOULD BE THE PATH RETURNED BY THE sudo update-alternatives --config java COMMAND, OMITTING THE TRAILING /bin/java. AN EXAMPLE HAS BEEN PROVIDED.
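Collected together, the commands above look roughly like the following (a sketch of what prepMachine.sh automates; use the actual script in this directory, run the exports in your SSH session so they persist, and set JAVA_HOME to the path reported by update-alternatives on your VM):
#!/bin/bash
# Prepare the client VM: git, Java (JRE) and Maven
lsb_release -a                          # confirm Ubuntu 18.04 LTS
sudo apt update                         # refresh the apt cache
sudo apt install -y git default-jre maven
sudo update-alternatives --config java  # note the installed Java path
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64   # the path above, without the trailing /bin/java
export PATH=$PATH:$JAVA_HOME/bin
java --version && mvn -v                # sanity check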
- Clone the git repo using the git clone command. NOTE: Cloning private GitHub repos from the command line requires a personal access token, e.g.:
git clone https://YOUR_USER:YOUR_PERSONAL_ACCESS_TOKEN@github.com/10gen/atlas-google-pov.git
- Create a Cloud Storage bucket:
  - In the Cloud Console, go to the Cloud Storage Browser page, then click Create Bucket.
  - Complete the form as follows:
    - Name: anything you choose, but do not include sensitive information in the bucket name because the bucket namespace is global and publicly visible
    - Location type: change to Region and in the drop-down choose the same region you deployed your compute and Atlas cluster in (here us-west2)
    - Storage class: Standard
    - Access control: fine-grained
  - Then click the blue Create button (an equivalent gsutil command is sketched below).
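The equivalent gsutil command is roughly the following (a sketch; the bucket name is a placeholder you must replace with a globally unique name of your own):
# Create a regional, standard-class bucket co-located with the VM and Atlas cluster;
# -b off keeps fine-grained (per-object) access control
gsutil mb -l us-west2 -c standard -b off gs://<your-unique-bucket-name>/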
- Configure the BigQuery environment:
  - Search for BigQuery in the search bar at the top and select the first result.
  - Create an empty table movies (inside the mflix dataset) as shown below, accepting the remaining defaults. A bq equivalent is sketched after this list.
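From the client VM the same setup can be done with the bq tool (a sketch; it assumes the dataset is named mflix as above and that the Dataflow job supplies the table schema and rows later):
# Create the mflix dataset and an empty movies table with no schema
bq mk --dataset <your GCP project id>:mflix
bq mk --table <your GCP project id>:mflix.movies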
Make sure you are inside the proof folder, then execute the script below:
cd ~/atlas-google-pov/proofs/02-ODS-AND-EDW/
mvn compile exec:java \
-Dexec.mainClass=dataflowdemonew.MongoDataFlowBigQuery \
-Dexec.args="--project=<your GCP project id> \
--mongouri=<mongodb atlas cluster connection string> \
--gcpTempLocation=<temp location for dataflow (within the cloud storage bucket created before)> \
--tempLocation=<temp location for bigquery (within the cloud storage bucket created before)> \
--runner=DataflowRunner \
--jobName=<job name> \
--region=<region name>" \
-Pdataflow-runner
Example:
mvn compile exec:java \
-Dexec.mainClass=dataflowdemonew.MongoDataFlowBigQuery \
-Dexec.args="--project=mongodbatlasods-edw-268604 \
--mongouri=mongodb+srv://appUser:appUser123@mdb-dataflow-htq0y.gcp.mongodb.net/sample_mflix \
--gcpTempLocation=gs://mkbucket001/tmp/ \
--tempLocation=gs://mkbucket001/tmp/ \
--runner=DataflowRunner \
--jobName=dataflow-intro \
--region=us-west1" \
-Pdataflow-runner
HINT: If you get a runtime error, check which Dataflow regional endpoint is closest and available to your VM and MongoDB Atlas cluster.
After a few minutes, navigate to GCP Dataflow (select it from the left tab in the GCP console). You will see the list of jobs in progress and completed:
Click on the relevant job. You should be able to see the job details as below:
Navigate to BigQuery from the GCP console. You should see the previously empty table now populated with rows.
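You can also confirm this from the client VM (a sketch using the gcloud and bq CLIs; substitute the region and job name you passed to the pipeline):
# List Dataflow jobs and their status in the chosen region
gcloud dataflow jobs list --region=<region name>
# Spot-check that rows landed in BigQuery
bq query --use_legacy_sql=false 'SELECT COUNT(*) AS movie_count FROM mflix.movies'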
An Operational Data Lake/Data Warehouse brings in data from various Systems of Record (SORs), and users execute queries across these data sources. Let's join two distinct datasets: the mflix.movies data, which holds the movie listings brought over from MongoDB Atlas, and Google BigQuery's public dataset on film locations in San Francisco where movies have been shot.
Let's query this data to see the list of movies ordered by the number of shooting locations in San Francisco.
Execute the following query. You may see that the movie Blue Jasmine has been shot in over 50 locations in San Francisco, followed by Time After Time and San Andreas.
WITH movieLocations AS (
  SELECT b.title AS title,
         COUNT(*) AS numLoc
  FROM `bigquery-public-data.san_francisco_film_locations.film_locations` b
  GROUP BY b.title
),
movies AS (
  SELECT movie_id,
         title,
         plot
  FROM mflix.movies
)
SELECT movie_id,
       m.title,
       numLoc,
       plot
FROM movies m
INNER JOIN movieLocations
  ON m.title = movieLocations.title
ORDER BY numLoc DESC;
You should see an output like this:
- Terminate the MongoDB Atlas Cluster
- Delete BigQuery Dataset
- Delete Compute VM
- Delete Storage
- Shut down the project, if it isn't used for other purposes.
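The GCP side of the cleanup can also be scripted (a sketch using the example resource names from above; terminate the Atlas cluster from the Atlas UI):
# Remove the BigQuery dataset and all tables in it
bq rm -r -f mflix
# Delete the client VM
gcloud compute instances delete client-vm --zone=us-west2-a --quiet
# Remove the Cloud Storage bucket and its contents
gsutil rm -r gs://<your-unique-bucket-name>/
# Shut down the whole project if it is no longer needed
gcloud projects delete <your GCP project id>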