Microsoft Cloud Workshop

Big data and visualization
Hands-on lab step-by-step
November 2019

Information in this document, including URL and other Internet Web site references, is subject to change without notice. Unless otherwise noted, the example companies, organizations, products, domain names, e-mail addresses, logos, people, places, and events depicted herein are fictitious, and no association with any real company, organization, product, domain name, e-mail address, logo, person, place or event is intended or should be inferred. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

The names of manufacturers, products, or URLs are provided for informational purposes only and Microsoft makes no representations and warranties, either expressed, implied, or statutory, regarding these manufacturers or the use of the products with any Microsoft technologies. The inclusion of a manufacturer or product does not imply endorsement of Microsoft of the manufacturer or product. Links may be provided to third party sites. Such sites are not under the control of Microsoft and Microsoft is not responsible for the contents of any linked site or any link contained in a linked site, or any changes or updates to such sites. Microsoft is not responsible for webcasting or any other form of transmission received from any linked site. Microsoft is providing these links to you only as a convenience, and the inclusion of any link does not imply endorsement of Microsoft of the site or the products contained therein.

© 2019 Microsoft Corporation. All rights reserved.

Microsoft and the trademarks listed at https://www.microsoft.com/en-us/legal/intellectualproperty/Trademarks/Usage/General.aspx are trademarks of the Microsoft group of companies. All other trademarks are property of their respective owners.

Contents

Big data and visualization hands-on lab step-by-step

Abstract and learning objectives

This hands-on lab is designed to provide exposure to many of Microsoft's transformative line-of-business applications built using Microsoft big data and advanced analytics services.

By the end of the lab, you will be able to demonstrate an end-to-end solution that leverages many of these technologies, though not necessarily doing work in every possible component.

Solution architecture

Below is a diagram of the solution architecture you will build in this lab. Please study this carefully so you understand the whole of the solution as you are working on the various components.

This is the high-level overview diagram of the end-to-end solution.

Requirements

  1. Microsoft Azure subscription must be pay-as-you-go or MSDN.

    • Trial subscriptions will not work.
  2. Follow all the steps provided in Before the Hands-on Lab.

Exercise 1: Retrieve Azure Storage account information and Subscription Id

Task 1: Retrieve Azure Storage account information and Subscription Id

You will need to have the Azure Storage account name and access key when you create your Azure Databricks cluster during the lab. You will also need to create storage containers in which you will store your data files.

  1. From the side menu in the Azure portal, choose Resource groups, then enter your resource group name into the filter box, and select it from the list.

  2. Next, select your lab Azure Storage account from the list.

    Select the lab Azure Storage account from within your lab resource group.

  3. On the Overview blade, locate and copy your Azure Subscription Id and save it to a text editor such as Notepad for later.

    Copy the Azure Subscription Id on the Overview blade.

  4. Select Access keys (1) from the menu. Copy the storage account name (2) and the key1 key (3), and save the values to a text editor such as Notepad for later.

    Select Access keys from menu - copy storage account name - copy key.
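If you prefer scripting to the portal, the same values can be retrieved programmatically. The following is a minimal sketch using the azure-identity and azure-mgmt-storage Python packages (not part of the lab files); the subscription id, resource group, and account name are placeholders you must replace with your own values.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

# Placeholders - replace with your own values from the Azure portal.
subscription_id = "<subscription-id>"
resource_group = "<resource-group-name>"
account_name = "<storage-account-name>"

# DefaultAzureCredential picks up an Azure CLI login, environment variables, etc.
client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

keys = client.storage_accounts.list_keys(resource_group, account_name)
print("Storage account:", account_name)
print("key1:", keys.keys[0].value)  # save this value for the Databricks notebooks
```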

Exercise 2: Create folders within DataLake Gen 2

In the previous lab, you created a container called Analytics. In this exercise, you will add folders to store the raw data from Shopify and Magento.

Using Azure Storage Explorer, navigate to the Analytics container and create the following folders:

a. raw/system/magento/

b. raw/system/shopify/

c. raw/system/postgress/

Creating folders to store the raw data
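If you prefer not to use Azure Storage Explorer, a minimal sketch using the azure-storage-file-datalake Python package (not part of the lab files) can create the same folders. It assumes the container is named analytics and uses the account name and key you recorded in Exercise 1.

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Values recorded in Exercise 1 (placeholders).
account_name = "<storage-account-name>"
account_key = "<storage-account-key>"

service = DataLakeServiceClient(
    account_url=f"https://{account_name}.dfs.core.windows.net",
    credential=account_key,
)
file_system = service.get_file_system_client("analytics")

# Create the raw-zone folders used later by the ADF pipelines and Databricks notebooks.
for path in ("raw/system/magento", "raw/system/shopify", "raw/system/postgress"):
    file_system.create_directory(path)
    print("created", path)
```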

Exercise 3: Set up Azure Data Factory

Duration: 20 minutes

In this exercise, you will create a baseline environment for Azure Data Factory development for further operationalization of data movement and processing. You will work with the Data Factory service, create Linked Services, and set up copy activities with the Magento and Shopify data as the sources.

Task 1: Configure Azure Data Factory: Create a copy pipeline using the Copy Data Wizard

  1. Launch a new browser window, and navigate to the Azure portal (https://portal.azure.com). Once prompted, log in with your Microsoft Azure credentials. If prompted, choose whether your account is an organization account or a Microsoft account. This will be based on which account was used to provision your Azure subscription that is being used for this lab.

  2. From the side menu in the Azure portal, choose Resource groups, then enter your resource group name into the filter box, and select it from the list.

  3. Next, select your Azure Data Factory service from the list.

  4. On the Data Factory blade, select Author & Monitor under Actions.

    In the Azure Data Factory blade, under Actions, the Author & Monitor option is selected.

  5. A new page will open in another tab or new window. Within the Azure Data Factory site, select Copy Data on the menu.

    Select Author from the menu.

  6. Select the Copy Data activity. Enter the task name as shopify_copy and select Run Once. Click Next, then click + Create New Connection.

    Select Copy Data, then select the Services and apps tab, and select Create New Connection.

  7. In the Linked Service blade that appears, select REST.

    Select REST for the Linked Service Type

  8. Fill in the details on the New Linked Service blade. You can refer to this link for details on each field: https://docs.microsoft.com/en-us/azure/data-factory/connector-rest

    Fill in details and create the Linked Service

  9. Enter the following details and click Create:

    a. Name: ShopifyRest

    b. Connect via integration runtime: AutoResolveIntegrationRuntime

    c. Base URL: <Base Shopify URL>, similar to https://<shop>.myshopify.com/admin/api/2020-01/

    d. Authentication Type: Anonymous

After specifying the above, test the connection and click Create. The next screen requires you to enter the REST dataset properties.

  10. Specify the REST dataset properties:

    a. Enter the Relative URL as orders.json?limit=5&created_at_min=2020-02-13T00:00:00&created_at_max=2020-02-14T00:00:00&financial_status=any&status=any

    b. Request Method: GET

    c. Click New under Additional Headers:

    i. Name: Content-Type, Value: application/json

    ii. Name: X-Shopify-Access-Token, Value: <your Shopify access token>

    d. Click Next

  11. On the Destination screen, select + Create new connection.

  12. Select Azure Data Lake Gen 2 within the New Linked Service blade, then select Continue.

    Select Azure Data Lake Gen 2 for the Linked Service Type

  13. On the New Linked Service (Azure Data Lake Gen2) account screen, enter the following and then select Create.

    • Name: DataLakeStorageOutput

    • Connect via integration runtime: Select your Integration Runtime.

    • Authentication method: Select Account key.

    • Account selection method: From Azure subscription

    • Storage account name: Select the data lake gen 2 storage account you provisioned in the before-the-lab section.

  14. Choose the output file or folder: browse to analytics/raw/system/shopify and enter the filename as orders.json.

  15. On the Destination data store page, select Next.

  16. On the File format settings page, set the File format to JSON and the File pattern to Set of objects.

  17. On the Summary page, click Next.

Summary for ADF

Publish and monitor the ADF Pipeline

Click on Next and publish the pipeline.

Publish the pipeline

Monitor the pipeline

Monitor the pipeline

Set up Magento as the Source

Repeat the above process for Magento with the following settings (an optional verification sketch follows this list):

a. Base URL: https://www.<shop>.com.sg/rest/

b. Relative URL: /V1/orders?searchCriteria[currentPage]=1&searchCriteria[pageSize]=500&searchCriteria[sortOrders][0][field]=entity_id&searchCriteria[sortOrders][0][direction]=ASC&searchCriteria[filter_groups][0][filters][0][field]=updated_at&searchCriteria[filter_groups][0][filters][0][value]=2020-01-31 00:00:00&searchCriteria[filter_groups][0][filters][0][condition_type]=gteq&searchCriteria[filter_groups][0][filters][1][field]=updated_at&searchCriteria[filter_groups][0][filters][1][value]=2020-02-06 00:00:00&searchCriteria[filter_groups][0][filters][1][condition_type]=lteq

c. Request Method: GET

d. Click New under Additional Headers:

  i. Name: Content-Type, Value: application/json

  ii. Name: Authorization, Value: Bearer <APIToken>

e. In Choose output file or folder, select analytics/raw/system/magento/
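If you want to confirm that the REST endpoints, headers, and tokens are valid before (or while) configuring the ADF datasets, the following sketch issues equivalent requests with the Python requests package. It is not part of the lab files; the base URLs, tokens, and query values are placeholders that mirror the settings above.

```python
import requests

# Shopify: same endpoint and headers as the ShopifyRest linked service and dataset.
shopify_base = "https://<shop>.myshopify.com/admin/api/2020-01/"  # placeholder
shopify_resp = requests.get(
    shopify_base + "orders.json",
    params={
        "limit": 5,
        "created_at_min": "2020-02-13T00:00:00",
        "created_at_max": "2020-02-14T00:00:00",
        "financial_status": "any",
        "status": "any",
    },
    headers={
        "Content-Type": "application/json",
        "X-Shopify-Access-Token": "<shopify-access-token>",  # placeholder
    },
)
print("Shopify:", shopify_resp.status_code, len(shopify_resp.json().get("orders", [])), "orders")

# Magento: same endpoint and Bearer token header as the Magento linked service.
magento_base = "https://www.<shop>.com.sg/rest/"  # placeholder
magento_resp = requests.get(
    magento_base + "V1/orders",
    params={"searchCriteria[currentPage]": 1, "searchCriteria[pageSize]": 5},
    headers={"Authorization": "Bearer <APIToken>"},  # placeholder
)
print("Magento:", magento_resp.status_code)
```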

Parameterize the ADF Path

In this section, you will parameterize the URL and the output dataset location. The use case is to run the pipeline every hour, with a time range of the current time minus one hour, and retrieve the orders from Magento and Shopify for that window.

The workshop covers two ways to achieve this:

  1. Parameterizing the dates as part of the query string.
  2. Parameterizing the dates as part of running the pipeline.

  1. Go to the pipeline that was created. Click the Author icon to view the pipelines.

  2. Click the pipeline whose name starts with CopyPipeline_.

Parameterize the pipeline

  3. Click the Copy data activity on the right pane and go to the Source tab. Click Open next to Source dataset.

Parameterize the copy

  4. This opens the source dataset. Go to the Parameters tab and add two new parameters; these will hold the dynamic dates.

Parameterize the copy update the parameters

updatedFromtimestamp = @substring(subtractFromTime(pipeline().TriggerTime,1,'hour'), 0, 19)

updatedTotimestamp = @substring(pipeline().TriggerTime, 0, 19)

  5. Go to the Connections tab. In Relative URL, click Add dynamic content and add the following; this constructs the query string using the dynamic dates above.

@concat('orders.json?created_at_min=',dataset().updatedFromtimestamp,'&created_at_max=',dataset().updatedTotimestamp,'&financial_status=any&status=any')

Parameterize the connection

  6. The alternate option is to pass the From and To parameters as part of the pipeline run. Click the CopyPipeline tab; there will be two parameters on the Source tab.

Parameterize the connection

Provide the values as:

updatedFromtimestamp = @substring(subtractFromTime(pipeline().TriggerTime,1,'hour'), 0, 19)

updatedTotimestamp = @substring(pipeline().TriggerTime, 0, 19)

  7. Click Validate All to save the changes.

  8. Next, go to the Sink tab and click Open next to Sink dataset.

  9. Go to the Parameters tab and add the following parameters:

Parameters for triggering the pipeline

date = @substring(pipeline().TriggerTime, 0, 10)

timestamp = @substring(pipeline().TriggerTime, 0, 19)

  10. Go to the Connections tab, set the directory (first box) to the container name (analytics), click Add dynamic content below the directory box in File path, and add the following:

@concat('raw/system/shopify/',dataset().type_of_data,'/',dataset().date)

  11. Similarly, add dynamic content for the File name as @concat(dataset().type_of_data,'_',dataset().timestamp,'.json')

Dynamic Content for file name

  12. Click the CopyPipeline tab; there will be three parameters on the Sink tab:

type_of_data = orders

date = @substring(pipeline().TriggerTime, 0, 10)

timestamp = @substring(pipeline().TriggerTime, 0, 19)

Sink parameterization

This dynamically creates the folder and file structure shown below (a Python sketch of the values these expressions produce follows these steps).

Sink Folder parameterization

Sink Folder parameterization

  13. Click Validate All to save the changes. Click Publish All to publish the changes, then run the pipeline (Add Trigger -> Trigger Now).

  14. Go to the Monitor tab to view the pipeline execution.
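The ADF expressions above are evaluated by Data Factory at trigger time. For reference only, the following Python sketch (not part of the lab files) mirrors what they evaluate to, so you can check the query string, folder, and file name an hourly run would produce.

```python
from datetime import datetime, timedelta

# pipeline().TriggerTime is the UTC time the trigger fired;
# substring(..., 0, 19) keeps the yyyy-MM-ddTHH:mm:ss portion.
trigger_time = datetime.utcnow()

updated_from = (trigger_time - timedelta(hours=1)).isoformat()[:19]  # subtractFromTime(..., 1, 'hour')
updated_to = trigger_time.isoformat()[:19]

relative_url = (
    "orders.json?created_at_min=" + updated_from
    + "&created_at_max=" + updated_to
    + "&financial_status=any&status=any"
)
folder = "raw/system/shopify/orders/" + trigger_time.isoformat()[:10]  # dataset().date
file_name = "orders_" + updated_to + ".json"                           # dataset().timestamp

print(relative_url)
print(folder + "/" + file_name)
```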

Scheduling the Pipeline

  1. From the pipeline, click Add Trigger -> New/Add Trigger.

Schedule Pipeline01

  2. From Choose Trigger, click + New:

    a. Name: Hourly_Trigger

    b. Type: Schedule

    c. Start Date: as needed

    d. Recurrence: Every 1 Hour

    Then click OK.

Schedule Pipeline02

  3. Go to the Monitor tab to view the pipeline execution.

Exercise 4: Create a Databricks cluster

Duration: 10 minutes

In this exercise, you will create a new Azure Databricks cluster. The Azure Storage account name, access key, and Azure Subscription Id that you recorded in Exercise 1 will be used later within the lab.

Task 1: Create an Azure Databricks cluster

You have provisioned an Azure Databricks workspace, and now you need to create a new cluster within the workspace. Part of the cluster configuration includes setting up an account access key to your Azure Storage account, using the Spark Config within the new cluster form. This will allow your cluster to access the lab files.

  1. From the side menu in the Azure portal, select Resource groups, then enter your resource group name into the filter box, and select it from the list.

  2. Next, select your Azure Databricks service from the list.

    Select the Azure Databricks service from within your lab resource group.

  3. In the Overview pane of the Azure Databricks service, select Launch Workspace.

    Select Launch Workspace within the Azure Databricks service overview pane.

    Azure Databricks will automatically log you in using Azure Active Directory Single Sign On.

    Azure Databricks Azure Active Directory Single Sign On.

  4. Select Clusters (1) from the menu, then select Create Cluster (2).

    Select Clusters from menu then select Create Cluster.

  5. On the Create New Cluster form, provide the following:

    • Cluster Name: lab

    • Cluster Type: Standard

    • Databricks Runtime Version: Runtime: 6.2 or 6.3 (Note: the runtime version may have LTS after the version. This is also a valid selection.)

    • Python Version: 3

    • Enable Autoscaling: Uncheck this option.

    • Auto Termination: Check the box and enter 120 (minutes)

    • Worker Type: Standard_F4s

    • Driver Type: Same as worker

    • Workers: 1

      Complete the form using the options as outlined above.

      Note: Spark Config: for this workshop, the Spark configuration is set within the Databricks notebooks, so this step can be skipped (a sketch of the notebook-level configuration appears after these steps).

  6. Select Create Cluster.
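For reference, the note above refers to configuration like the following, which the notebooks run at the top of the notebook instead of in the cluster's Spark Config. This is a minimal sketch assuming it runs inside a Databricks notebook (where spark is already defined); the storage account name and key placeholders are the values from Exercise 1.

```python
# Set the ADLS Gen 2 account key for this Spark session (done inside the notebooks,
# so the cluster-level Spark Config can be left empty for this workshop).
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
    "<storage-account-key>",
)
```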

Exercise 5: Load Sample Data and Databricks Notebooks

In this exercise, you will load data from Data Lake Gen 2, perform transformations, and write the transformed data to Snowflake.

  1. Open your Azure Databricks workspace. Before continuing to the next step, verify that your new cluster is running. Do this by navigating to Clusters on the left-hand menu and ensuring that the state of your cluster is Running.

    The cluster is shown in the Running state.

Task 1: Open Azure Databricks and complete lab notebooks

  1. Download the following file:

    • Magento.html

      Magento Notebook: Performs the following

      1. Read the Magento JSON data ingested into Raw Zone by ADF
      2. Convert the semi-structured JSON data into structured data
      3. Extract the required fields
      4. Extract the Shipped Date, Delivered Date, Refund Date and Captured Date depending on the Comments column
      5. If initial load, load the data directly into Snowflake
      6. In case of incremental load, perform Merge operation into Snowflake to perform de-duplication
    • Shopify.html

      Shopify Notebook: Performs the following

      1. Read the Shopify JSON data ingested into Raw Zone by ADF
      2. Convert the semi-structured JSON data into structured data
      3. Extract the required fields
      4. Extract the corresponding refund quantity and amount for items from refunds array if any
      5. If initial load, load the data directly into Snowflake
      6. In case of incremental load, perform Merge operation into Snowflake to perform de-duplication
  2. Within Azure Databricks, select Workspace on the menu, then Users, select your user, then select the down arrow on the top of your user workspace. Select Import.

    Screenshot showing selecting import within the user workspace.

  3. Within the Import Notebooks dialog, select Import from: file, then drag-and-drop the file or browse to upload it.

    Select import from file.

  4. Before you begin, make sure you attach your cluster to the notebooks, using the dropdown. You will need to do this for each notebook you open.

    Select your cluster to attach it to the notebook.

  5. Click on the Shopify Notebook to open it on the Databricks cluster.

    i. Read from ADLS: In the Read from ADLS section, replace containername, accountname, and folder in Cell 5 with the actual values.

    ii. Also fill in the account key in spark.conf.set (a sketch of this configuration and of the Snowflake write appears after these steps).

  6. To add the Snowflake Python connector, from the Workspace pane, click the down arrow and select Import.

    Snowflake Connector.

  7. Click the Click here link.

Click here Library.

  8. Click PyPI and enter snowflake-connector-python. Click Create and select Install on all Clusters.

PyPI Library.

  9. Snowflake Processing: Provide the Snowflake details by replacing xxxxx with the actual values.

    Run the cells one by one, or use Run All, to finish all steps.

  10. Click on the Magento Notebook to open it on the Databricks cluster.

    i. Read from ADLS: In the Read from ADLS section, replace containername, accountname, and folder in Cell 5 with the actual values.

    ii. Also fill in the account key in spark.conf.set.

    iii. Snowflake Processing: Provide the Snowflake details by replacing xxxxx with the actual values.

    Run the cells one by one, or use Run All, to finish all steps.
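The imported notebooks contain the full logic, but as a rough sketch of the flow they implement (read the raw JSON from ADLS, flatten it, and write to Snowflake), the following illustrates the pattern. All column names and connection values are illustrative placeholders, and it assumes it runs inside a Databricks notebook where spark is already defined.

```python
from pyspark.sql.functions import col, explode

# 1) Point Spark at the Data Lake Gen 2 account (same as the cell in the notebooks).
account_name = "<storage-account-name>"
spark.conf.set(
    f"fs.azure.account.key.{account_name}.dfs.core.windows.net",
    "<storage-account-key>",
)

# 2) Read the raw Shopify JSON landed by ADF in the raw zone.
raw_path = f"abfss://analytics@{account_name}.dfs.core.windows.net/raw/system/shopify/orders/*"
raw = spark.read.option("multiline", "true").json(raw_path)

# 3) Flatten the semi-structured payload into the columns you need (illustrative fields only).
orders = raw.select(explode(col("orders")).alias("o")).select(
    col("o.id").alias("order_id"),
    col("o.created_at").alias("created_at"),
    col("o.total_price").alias("total_price"),
)

# 4) Write to Snowflake with the Databricks-bundled Snowflake connector (placeholder options).
sf_options = {
    "sfUrl": "<account>.azure.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "WORKSHOP",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}
(orders.write.format("snowflake")
    .options(**sf_options)
    .option("dbtable", "WORKSHOP_CHANNEL_SALES")
    .mode("append")
    .save())
```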

Exercise 6: Operationalize Azure Databricks and Data Factory (Optional Task)

Duration: 20 minutes

In this exercise, you will extend the Data Factory to operationalize the Databricks Notebooks.

Task 1: Create Azure Databricks Linked Service

  1. Return to, or reopen, the Author & Monitor page for your Azure Data Factory in a web browser, navigate to the Author view, and select the pipeline.

    Select the ADF pipeline created in the previous exercise.

  2. Once there, expand Databricks under Activities.

    Expand the Databricks activity after selecting your pipeline.

  3. Drag the Notebook activity onto the design surface to the side of the Copy activity.

    Drag the Notebook onto the design surface.

  4. Select the Notebook activity on the design surface to display tabs containing its properties and settings at the bottom of the screen. On the General tab, enter "BatchScore" into the Name field.

    Type BatchScore as the Name under the General tab.

  5. Select the Azure Databricks tab, and select + New next to the Databricks Linked service drop down. Here, you will configure a new linked service which will serve as the connection to your Databricks cluster.

    Screenshot of the Settings tab.

  6. On the New Linked Service dialog, enter the following:

    • Name: enter a name, such as AzureDatabricks.
    • Connect via integration runtime: Leave set to Default.
    • Account selection method: Select From Azure subscription.
    • Choose your Azure Subscription.
    • Pick your Databricks workspace to populate the Domain automatically.
    • Select cluster: choose Existing interactive cluster.

    Screenshot showing filled out form with defined parameters.

  7. Leave the form open and open your Azure Databricks workspace in another browser tab. You will retrieve the Access token and cluster id here.

  8. In Azure Databricks, select the Account icon in the top corner of the window, then select User Settings.

    Select account icon, then user settings.

  9. Select Generate New Token under the Access Tokens tab. Enter ADF access for the comment and leave the lifetime at 90 days. Select Generate.

    Generate a new token.

  10. Copy the generated token and paste it into a text editor such as Notepad for a later step (an optional sketch for verifying the token appears after this task).

    Copy the generated token.

  11. Switch back to your Azure Data Factory screen and paste the generated token into the Access token field within the form. After a moment, select your cluster underneath Choose from existing clusters. Select Create.

    Paste the generated access token.

  12. Switch back to Azure Databricks. Select Workspace in the menu. Select the Exercise 5 folder then open notebook 01 Deploy for Batch Scoring. Examine the content but don't run any of the cells yet. You need to replace STORAGE-ACCOUNT-NAME with the name of the blob storage account you copied in Exercise 1.

    Select workspace, select the Exercise 5 folder, then select 01 Deploy for Batch Score.

  13. Switch back to your Azure Data Factory screen. Select the Settings tab, then browse to your Exercise 5/01 Deploy for Batch Score notebook to populate the Notebook path field.

    browse to 04 Deploy for Batch score into the notebook path.

  14. The final step is to connect the Copy activities with the Notebook activity. Select the small green box on the side of the copy activity, and drag the arrow onto the Notebook activity on the design surface. What this means is that the copy activity has to complete processing and generate its files in your storage account before the Notebook activity runs, ensuring the files required by the BatchScore notebook are in place at the time of execution. Select Publish All after making the connection.

    Attach the copy activity to the notebook and then publish.
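As an optional sanity check (not part of the lab steps), you can confirm that the access token you generated works by calling the Databricks REST API from Python; the workspace URL and token below are placeholders.

```python
import requests

workspace_url = "https://<region>.azuredatabricks.net"  # e.g. https://westus2.azuredatabricks.net
token = "<personal-access-token>"                        # the token generated above

resp = requests.get(
    f"{workspace_url}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()

# A valid token lists the clusters in the workspace, including the one used by the linked service.
for cluster in resp.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["cluster_name"], cluster["state"])
```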

Task 2: Trigger workflow

  1. Switch back to Azure Data Factory. Select your pipeline if it is not already opened.

  2. Select Trigger, then Trigger Now located above the pipeline design surface.

    Manually trigger the pipeline.

  3. Enter 3/1/2017 into the windowStart parameter, then select Finish.

    Screenshot showing the Pipeline Run form.

  4. Select Monitor in the menu. You will be able to see your pipeline activity in progress as well as the status of past runs.

    View your pipeline activity.

Exercise 7: Populating Snowflake using the Databricks connector for Snowflake

Task 1: Verify the data in Snowflake

The Magento and Shopify notebooks use the Snowflake connector to connect to Snowflake and populate it with the transformed data.

Once the data is inserted, it can be verified by logging on to the Snowflake portal (or programmatically, as sketched after these steps).

  1. Log in to Snowflake at https://<account>.azure.snowflakecomputing.com/console
  2. Enter the user name and password
  3. Click on the WORKSHOP database to see the WORKSHOP_CHANNEL_SALES table

View your snowflake table
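You can also verify the load programmatically with the snowflake-connector-python package installed earlier (from Databricks or locally); this is a minimal sketch with placeholder connection values.

```python
import snowflake.connector

# Placeholder connection values - use the same account, user, and password as the portal login.
conn = snowflake.connector.connect(
    account="<account-identifier>.azure",  # the portion of the URL before .snowflakecomputing.com
    user="<user>",
    password="<password>",
    database="WORKSHOP",
    warehouse="<warehouse>",
)
try:
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM WORKSHOP_CHANNEL_SALES")
    print("Rows in WORKSHOP_CHANNEL_SALES:", cur.fetchone()[0])
finally:
    conn.close()
```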

Exercise 8: Visualizing in Power BI Desktop using Snowflake

Duration: 20 minutes

In this exercise, you will create visualizations in Power BI Desktop.

Task 1: Open Power BI Desktop and select the Get Data option as illustrated below:

Get data from Power BI

Task 2: Select Snowflake as the data source:

Select snowflake from Power BI

Task 3: Enter the Snowflake connection details:

Enter the snowflake connection details in  Power BI

Task 4: Enter the user name and password:

Enter the snowflake connection details in  Power BI

Task 5: Select the WORKSHOP_CHANNEL_SALES table to load the data:

Select the SelectSalesTablePBI in Power BI

Task 6: Create charts using the CHANNEL_SALES data in Snowflake:

Create Charts - Chart1

Create Charts - Chart2

Exercise 9: Visualizing in Power BI Desktop using Azure Databricks (Optional)

Task 1: Obtain the JDBC connection string to your Azure Databricks cluster - Optional

Before you begin, you must first obtain the JDBC connection string to your Azure Databricks cluster.

  1. In Azure Databricks, go to Clusters and select your cluster.

  2. On the cluster edit page, scroll down to the bottom of the page, expand Advanced Options, then select the JDBC/ODBC tab.

    Select the JDBC/ODBC tab.

  3. On the JDBC/ODBC tab, copy and save the first JDBC URL.

    • Construct the JDBC server address that you will use when you set up your Spark cluster connection in Power BI Desktop.

    • Take the JDBC URL that you copied and saved in step 3 and do the following:

    • Replace jdbc:spark with https.

    • Remove everything in the path between the port number and sql, retaining the components indicated by the boxes in the image below. Also remove ;AuthMech=3;UID=token;PWD=<personal-access-token> from the end of the string.

    Select the parts to create the Power BI connection string.

    • In our example, the server address would be:

    https://westus2.azuredatabricks.net:443/sql/protocolv1/o/2035768554230150/0603-160328-rage709 or https://westus2.azuredatabricks.net:443/sql/protocolv1/o/2035768554230150/lab (if you choose the aliased version)
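If you prefer to script the conversion, the following sketch derives the Power BI server address from a JDBC URL of the form described above; the sample URL is the example from this exercise, so replace it with your own.

```python
import re

jdbc_url = (
    "jdbc:spark://westus2.azuredatabricks.net:443/default;"
    "transportMode=http;ssl=1;"
    "httpPath=sql/protocolv1/o/2035768554230150/0603-160328-rage709;"
    "AuthMech=3;UID=token;PWD=<personal-access-token>"
)

host = re.match(r"jdbc:spark://([^/;]+)", jdbc_url).group(1)   # westus2.azuredatabricks.net:443
http_path = re.search(r"httpPath=([^;]+)", jdbc_url).group(1)  # sql/protocolv1/o/<org-id>/<cluster-id>

server_address = f"https://{host}/{http_path}"
print(server_address)
# https://westus2.azuredatabricks.net:443/sql/protocolv1/o/2035768554230150/0603-160328-rage709
```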

Task 2: Connect to Azure Databricks using Power BI Desktop [Optional]

  1. If you did not already do so during the before the hands-on lab setup, download Power BI Desktop from https://powerbi.microsoft.com/en-us/desktop/.

  2. When Power BI Desktop starts, you will need to enter your personal information, or Sign in if you already have an account.

    The Power BI Desktop Welcome page displays.

  3. Select Get data on the screen that is displayed next. On the Power BI Desktop Sign in page, in the pane, Get data is selected.

  4. Select Other from the side, and select Spark from the list of available data sources.

    In the pane of the Get Data page, Other is selected. In the pane, Spark is selected.

  5. Select Connect.

  6. On the next screen, you will be prompted for your Spark cluster information.

  7. Paste the JDBC connection string you constructed a few steps ago into the Server field.

  8. Select the HTTP protocol.

  9. Select DirectQuery for the Data Connectivity mode, and select OK. This option will offload query tasks to the Azure Databricks Spark cluster, providing near-real time querying.

    Configure your connection to the Spark cluster.

  10. Enter your credentials on the next screen as follows:

    • User name: token

    • Password: Use the Databricks access token you generated earlier (the one with the ADF access comment) and saved in Notepad.

    Copy the generated token.

    Enter "token" for the user name and paste user token into the password field.

  11. Select Connect.

  12. In the Navigator dialog, check the box next to flight_delays_summary, and select Load.

    In the Navigator dialog box, in the pane under Display Options, the check box for flight_delays_summary is selected. In the pane, the table of flight delays summary information displays.

  13. It will take several minutes for the data to load into the Power BI Desktop client.

Task 3: Create Power BI report

  1. Once the data finishes loading, you will see the fields appear on the far side of the Power BI Desktop client window.

    Power BI Desktop fields.

After the hands-on lab

Duration: 10 minutes

In this exercise, attendees will deprovision any Azure resources that were created in support of the lab.

Task 1: Delete resource group

  1. Using the Azure portal, navigate to the Resource group you used throughout this hands-on lab by selecting Resource groups in the menu.

  2. Search for the name of your resource group and select it from the list.

  3. Select Delete in the command bar and confirm the deletion by re-typing the Resource group name and selecting Delete.

You should follow all steps provided in this section after attending the hands-on lab.