Add shell scripts to automate much of the ETL process. #180

Merged · 4 commits · Sep 13, 2022
102 changes: 94 additions & 8 deletions README.md
@@ -150,16 +150,102 @@ The table below demonstrates, at a high level, the information that is being c
1.6. Repeat step 1.3 for all Oracle databases that you want to assess.

## Step 2 - Importing the data collected into Google BigQuery for analysis
Much of the data import and report generation has been automated. Follow section 2.1 to use the automated process; section 2.2 provides instructions for the manual process if you prefer that. Both processes assume you have rights to create datasets in a BigQuery project and access to Data Studio.


2.1 Automated load process

These instructions are written for running in a Cloud Shell environment.
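Before starting, it can help to confirm that Cloud Shell is authenticated and pointed at the project you intend to load into (the project id below is a placeholder; these are the same commands used in section 2.2.1):

```
gcloud auth list

gcloud config set project <project id>
```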

2.1.1 Clone the Optimus Prime codebase to a working directory.

Create a working directory for the code base, then clone the repository from Github.

Ex:
```
mkdir -p ~/code/op
cd ~/code/op
git clone https://github.com/GoogleCloudPlatform/oracle-database-assessment
```

2.1.2 Create a data directory and upload files from the client

Create a directory to hold the output files for processing, then upload the files to that location and uncompress.

Ex:
```
mkdir ~/data
<upload files to data>
cd ~/data
<uncompress files>
```
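For example, if the collection output was delivered as zip archives staged in a Cloud Storage bucket, the upload and uncompress steps might look like the sketch below (the bucket and archive names are hypothetical; you can also upload files directly through the Cloud Shell file-upload menu):

```
cd ~/data
# Copy the collection archives from a bucket (hypothetical bucket name)
gsutil cp gs://your-bucket/opdb*.zip .
# Uncompress each archive into the data directory
for f in opdb*.zip; do unzip "$f"; done
```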

2.1.3 Configure automation

The automated process is configured via the file `<workingdirectory>/oracle-database-assessment/db_assessment/0_configure_op_env.sh`. Edit this file and set values for the following variables:

```
# This is the name of the project into which you want to load data
export PROJECTNAME=yourProjectNameHere

# This is the name of the data set into which you want to load.
# The dataset will be created if it does not exist.
# If the dataset already exists, this data will be appended to it.
# Use only alphanumeric characters, - (dash) or _ (underscore)
# This name must be filesystem- and HTML-compatible
export DSNAME=yourDatasetNameHere

# This is the location in which the dataset should be created.
export DSLOC=yourRegionNameHere

# This is the full path into which the customer's files have been extracted.
export OP_LOG_DIR=/full/Path/To/LogFiles

# This is the name of the report you want to create in Data Studio upon load completion.
# Use only alphanumeric characters, or URL-encode special characters (for example %20 for a space).
export REPORTNAME="OptimusPrime%20Dashboard%20${DSNAME}"
```
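For example, a filled-in configuration might look like the sketch below (all values are placeholders; substitute your own project, dataset, region and path):

```
export PROJECTNAME=my-migration-project
export DSNAME=acme_assessment_2022_09
export DSLOC=US
export OP_LOG_DIR=/home/user/data
export REPORTNAME="OptimusPrime%20Dashboard%20${DSNAME}"
```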

2.1.4 Execute the load scripts

The load scripts expect to be run from the `<workingdirectory>/oracle-database-assessment/db_assessment` directory. Change to this directory and run the following commands in numeric order; note that each script is sourced (the leading `.`), so the environment variables it sets persist in your shell. Check the output of each for errors before continuing to the next.

```
. ./0_configure_op_env.sh
. ./1_activate_op.sh
. ./2_load_op.sh
. ./3_run_op_etl.sh
. ./4_gen_op_report_url.sh
```

The function of each script is as follows.
```
0_configure_op_env.sh - Defines environment variables that are used in the other scripts.
1_activate_op.sh - Installs necessary Python support modules and activates the Python virtual environment for Optimus Prime.
2_load_op.sh - Loads the client data files into the base Optimus Prime tables in the requested data set.
3_run_op_etl.sh - Installs and runs BigQuery procedures that create additional views and tables to support the Optimus Prime dashboard.
4_gen_op_report_url.sh - Generates the URL to view the newly loaded data using a report template.
```
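After steps 2 and 3, you can scan the logs they produce for problems before moving on. A simple sketch (the log names follow the patterns used by the scripts above):

```
grep -iE 'error|exception' opload-${DSNAME}-*.log op_etl_${DSNAME}.log
```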

2.1.5 View the data in Optimus Prime Dashboard report

Click the link displayed by 4_gen_op_report_url.sh to view the report. Note that following the link does not save the report.
To save the report for future use, click the "Edit and Share" button, then "Acknowledge and Save", then "Add to Report". The report will then appear in Data Studio under "Reports owned by me" and can be shared with others.

Skip to step 3 to perform additional analysis for anything not contained in the dashboard report.

2.2 Manual load process

2.2.1 Set up Environment variables (From Google Cloud Shell ONLY).

```
gcloud auth list

gcloud config set project <project id>
```

2.2.2 Export Environment variables. (The working directory was created in step 1.2.)

```
export OP_WORKING_DIR=<path for working directory>
mkdir $OP_OUTPUT_DIR/log
export OP_LOG_DIR=$OP_OUTPUT_DIR/log
```
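Part of the block above is elided. A minimal sketch of the full variable set, assuming the output directory sits under the working directory (matching the path created in step 2.2.5 below), is:

```
export OP_WORKING_DIR=<path for working directory>
export OP_OUTPUT_DIR=$OP_WORKING_DIR/oracle-database-assessment-output
mkdir -p $OP_OUTPUT_DIR/log
export OP_LOG_DIR=$OP_OUTPUT_DIR/log
```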

2.2.3 Create working directory (skip if you followed step 1.2 on the same server)

```
mkdir $OP_WORKING_DIR
```

2.2.4 Clone GitHub repository (skip if you followed step 1.2 on the same server)

```
cd <work-directory>
git clone https://github.com/GoogleCloudPlatform/oracle-database-assessment
```

2.2.5 Create assessment output directory

```
mkdir -p /<work-directory>/oracle-database-assessment-output
cd /<work-directory>/oracle-database-assessment-output
```

2.2.6 Move zip files to assessment output directory and unzip

```
mv <zip files> /<work-directory>/oracle-database-assessment-output
unzip <zip files>
```

2.2.7 [Create a service account and download the key](https://cloud.google.com/iam/docs/creating-managing-service-accounts#before-you-begin).

* Set GOOGLE_APPLICATION_CREDENTIALS to point to the downloaded key (a sketch follows this list). Make sure the service account has the BigQuery Admin privilege.
* NOTE: This step can be skipped if using [Cloud Shell](https://ssh.cloud.google.com/cloudshell/)
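If you are running outside Cloud Shell, a minimal sketch of pointing your environment at the downloaded key (the key path is a placeholder) is:

```
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-service-account-key.json
```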

2.2.8 Create a Python virtual environment to install dependencies and execute the `optimusprime.py` script

```
python3 -m venv $OP_WORKING_DIR/op-venv
```
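The rest of this block is elided above. A minimal sketch of the remaining commands, assuming the same flags that `2_load_op.sh` uses (angle-bracket values are placeholders), is:

```
source $OP_WORKING_DIR/op-venv/bin/activate
cd <work-directory>/oracle-database-assessment
pip3 install pip --upgrade
pip3 install .
# -sep is the column separator used in the collection files (pipe for newer extracts, semicolon for older ones)
python3 db_assessment/optimusprime.py -sep '|' -dataset <dataset name> -fileslocation $OP_LOG_DIR -projectname <project name> -collectionid <collection id>
```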
40 changes: 40 additions & 0 deletions db_assessment/0_configure_op_env.sh
@@ -0,0 +1,40 @@
# This file configures the environment for loading data files from the client
# Edit this file and set the project name, data set name, and data set location to
# where you want the data loaded.
# Ensure you have proper access to the project and rights to create a data set.

# This is the name of the project into which you want to load data
export PROJECTNAME=yourProjectNameHere

# This is the name of the data set into which you want to load.
# The dataset will be created if it does not exist.
# If the dataset already exists, this data will be appended to it.
# Use only alphanumeric characters, - (dash) or _ (underscore)
# This name must be filesystem- and HTML-compatible
export DSNAME=yourDatasetNameHere

# This is the location in which the dataset should be created.
export DSLOC=yourRegionNameHere

# This is the full path into which the customer's files have been extracted.
export OP_LOG_DIR=fullPathToLogFiles

# This is the name of the report you want to create in Data Studio upon load completion.
# Use only alphanumeric characters, or URL-encode special characters (for example %20 for a space).
export REPORTNAME="OptimusPrime%20Dashboard%20${DSNAME}"

# This is the column separator used in the customer's files. Older versions of
# the extract use a semicolon; newer versions use a pipe.
export COLSEP='|'


export OP_WORKING_DIR=$(pwd)

echo
echo Environment set to load from ${OP_LOG_DIR} into ${PROJECTNAME}.${DSNAME}

# Globs are not expanded inside [[ -s ... ]], so check each matching error log individually.
for ERRFILE in ${OP_LOG_DIR}/errors*.log
do
    if [[ -s ${ERRFILE} ]]
    then
        echo Errors found in data to be loaded. Please review before continuing.
        cat ${ERRFILE}
    fi
done
8 changes: 8 additions & 0 deletions db_assessment/1_activate_op.sh
@@ -0,0 +1,8 @@
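# Create and activate a Python virtual environment in the repository root
# (one level above db_assessment), then install Optimus Prime and its dependencies.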
THISDIR=$(pwd)
python3 -m venv ${OP_WORKING_DIR}/../op-venv
source ${OP_WORKING_DIR}/../op-venv/bin/activate
cd ${OP_WORKING_DIR}/..

pip3 install pip --upgrade
pip3 install .
cd ${THISDIR}
13 changes: 13 additions & 0 deletions db_assessment/2_load_op.sh
@@ -0,0 +1,13 @@
THISD=$(pwd)
bq mk -d --data_location=${DSLOC} ${DSNAME}
cd ${OP_WORKING_DIR}/..
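# Derive each collection id from the opdb* file names (the second-to-last dot-separated field)
# and load that collection's files into BigQuery.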
for COLID in $(ls -1 ${OP_LOG_DIR}/opdb*| rev | cut -d '.' -f 2 | rev | sort | uniq)
do
python3 ./db_assessment/optimusprime.py -sep "${COLSEP}" -dataset ${DSNAME} -fileslocation ${OP_LOG_DIR} -projectname ${PROJECTNAME} -collectionid ${COLID} | tee ${THISD}/opload-${DSNAME}-${COLID}.log
done
echo
echo Logs of this upload are available at:
echo
ls -l ${THISD}/opload-${DSNAME}-*.log
echo
cd ${THISD}
4 changes: 4 additions & 0 deletions db_assessment/3_run_op_etl.sh
@@ -0,0 +1,4 @@
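# Substitute the target project and dataset into the ETL SQL template,
# then run it in BigQuery and capture the output in a log file.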
sed "s/projectID.dataset/${PROJECTNAME}.${DSNAME}/g" op_etl_template.sql > op_etl_${DSNAME}.sql
bq query --use_legacy_sql=false <op_etl_${DSNAME}.sql | tee op_etl_${DSNAME}.log
echo
echo A log of this process is available at op_etl_${DSNAME}.log
56 changes: 56 additions & 0 deletions db_assessment/4_gen_op_report_url.sh
@@ -0,0 +1,56 @@
# ReportID is taken from the DataStudio template upon which the new report will be created.
REPORTID=ed2d87f1-e037-4e65-8ef0-4439a3e62aa3

# REPORTNAME and DSNAME are set in another script.
# The URL template is formatted for editability and readability.
# Line feeds, carriage returns and spaces will be filtered out when generated.
# Any new data sources added to the template will need to be modified here.
URL_TEMPLATE="https://datastudio.google.com/reporting/create?c.
reportId=${REPORTID}
&r.reportName=${REPORTNAME}
&ds.ds106.connector=bigQuery
&ds.ds106.datasourceName=T_DS_Database_Metrics
&ds.ds106.projectId=optimusprime-migrations
&ds.ds106.type=TABLE
&ds.ds106.datasetId=${DSNAME}
&ds.ds106.tableId=T_DS_Database_Metrics
&ds.ds96.connector=bigQuery
&ds.ds96.datasourceName=T_DS_BMS_sizing
&ds.ds96.projectId=optimusprime-migrations
&ds.ds96.type=TABLE
&ds.ds96.datasetId=${DSNAME}
&ds.ds96.tableId=T_DS_BMS_sizing
&ds.ds103.connector=bigQuery
&ds.ds103.datasourceName=V_DS_BMS_BOM
&ds.ds103.projectId=optimusprime-migrations
&ds.ds103.type=TABLE
&ds.ds103.datasetId=${DSNAME}
&ds.ds103.tableId=V_DS_BMS_BOM
&ds.ds169.connector=bigQuery
&ds.ds169.datasourceName=V_DS_HostDetails
&ds.ds169.projectId=optimusprime-migrations
&ds.ds169.type=TABLE
&ds.ds169.datasetId=${DSNAME}
&ds.ds169.tableId=V_DS_HostDetails
&ds.ds68.connector=bigQuery
&ds.ds68.datasourceName=V_DS_dbfeatures
&ds.ds68.projectId=optimusprime-migrations
&ds.ds68.type=TABLE
&ds.ds68.datasetId=${DSNAME}
&ds.ds68.tableId=V_DS_dbfeatures
&ds.ds12.connector=bigQuery
&ds.ds12.datasourceName=V_DS_dbsummary
&ds.ds12.projectId=optimusprime-migrations
&ds.ds12.type=TABLE
&ds.ds12.datasetId=${DSNAME}
&ds.ds12.tableId=V_DS_dbsummary"

echo
echo The Optimus Prime dashboard report \"${REPORTNAME}\" is available at the link below
echo
echo ${URL_TEMPLATE} | sed 's/\r//g;s/\n//g;s/ //g'
echo
echo Click the link to view the report.
echo To create a persistent copy of this report:
echo Click the '"Edit and Share"' button, then '"Acknowledge and Save"', then '"Add to Report"'.
echo It will then show up in Data Studio in '"Reports owned by me"' and can be shared with others.