Exercise 1: Building a Machine Learning Model

Duration: 90 mins

Synopsis: In this exercise, attendees will implement a classification experiment. They will load the training data from their local machine into a dataset, explore the data to identify the primary components to use for prediction, and use two different algorithms to predict the classification. They will evaluate the performance of both algorithms, choose the one that performs best, and expose the selected model as a web service that is integrated with the sample web app.

This exercise has 9 tasks:

Task 1: Connect to the Lab VM

  1. From the left side of the Azure portal, click on All resources.

  2. In the Filter items... box, type in lab.

  3. Select your lab VM. Keep in mind the name of the virtual machine will begin with the "app name" you provided when setting up this workshop environment (in the prerequisite deployment).

  4. At the top of the blade for your VM, click on Connect.

    Screenshot

  5. Download and open the RDP file.

  6. When the Remote Desktop Connection screen appears, check the Don't ask me again... box and click the Connect button.

    Screenshot

  7. Log in with the following credentials:

    • User name: cortana
    • Password: Password.1!!

Task 2: Navigate to Machine Learning Studio

  1. In a browser, go to https://studio.azureml.net and log in using the same account you used in the Azure portal to deploy the prerequisites for this workshop.
  2. Once you are signed in, ensure the workspace that was created as part of the prerequisites is selected from the top bar.

Task 3: Upload the Sample Datasets

  1. Before you begin creating a machine learning experiment, there are three datasets you need to load.

  2. Download the three CSV sample datasets from here: http://aka.ms/awtdata and save AdventureWorksTravelDatasets.zip to your Desktop.

    • Note: You will need to unblock the zip file before extracting its files. Do this by right clicking on it, selecting Properties, and then unblocking the file in the resulting dialog.
  3. Extract the ZIP and verify you have the following files:

    • FlightDelaysWithAirportCodes.csv
    • FlightWeatherWithAirportCodes.csv
    • AirportCodeLocationLookupClean.csv
  4. Click + NEW at the bottom, point to Dataset, and select From Local File.

    Screenshot

  5. In the dialog that appears, click Choose File, browse to the FlightDelaysWithAirportCodes.csv file, and click OK.

  6. Change the name of the dataset to FlightDelaysWithAirportCodes.

  7. Click on the check mark on the bottom right corner of the screen.

    Screenshot

  8. Repeat the previous steps for the FlightWeatherWithAirportCodes.csv and AirportCodeLocationLookupClean.csv files, setting the name for each dataset in a similar fashion.

Task 4: Start a New Experiment

  1. Click + NEW in the command bar.

  2. In the options that appear, click Blank Experiment.

    Screenshot

  3. Give your new experiment a name, such as AdventureWorks Travel by editing the label near the top of the design surface.

    Screenshot

  4. In the toolbar on the left, in the Search experiment items box, type the name of the dataset you created with flight delay data (FlightDelaysWithAirportCodes). You should see a component for it listed under Saved Datasets -> My Datasets.

    Screenshot

  5. Click and drag the FlightDelaysWithAirportCodes dataset onto the design surface.

    Screenshot

  6. Next, you will explore each of the datasets to understand what kind of cleanup (aka data munging) will be necessary.

  7. Hover over the output port of the FlightDelaysWithAirportCodes dataset.

    Screenshot

  8. Right click on the port and select Visualize.

    Screenshot

  9. A new dialog will appear showing a sample of the dataset (at most 100 rows by 100 columns). You can see at the top that the dataset has a total of 2,719,418 rows (also referred to as examples in machine learning literature) and 20 columns (also referred to as features).

    Screenshot

  10. Because all 20 columns are displayed, you can scroll the grid horizontally. Scroll until you see the DepDel15 column and click it to view statistics about the column. The DepDel15 column displays a 1 when the flight was delayed at least 15 minutes and 0 if there was no such delay. In the model you will construct, you will try to predict the value of this column for future data.

    Screenshot

  11. Notice in the Statistics panel that a value of 27444 appears for Missing Values. This means that 27,444 rows do not have a value in this column. Since this value is very important to our model, we will eliminate any rows that do not have a value for this column.
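
    For reference, this cleanup is equivalent to a one-line filter in plain R (a sketch only, assuming the data frame is named ds.flights as in the script used later in this task):

    # Keep only the rows where the DepDel15 label is present
    ds.flights <- ds.flights[!is.na(ds.flights$DepDel15), ]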

  12. To eliminate these problem rows, close the dialog and go back to the design surface. From the toolbar, search for Clean Missing Data.

    Screenshot

  13. Drag this module on to the design surface beneath your FlightDelaysWithAirportCodes dataset. Click the small circle at the bottom of the FlightDelaysWithAirportCodes dataset, drag and release when your mouse is over the circle found in the top center of the Clean Missing Data module. These circles are referred to as ports, and by taking this action you have connected the output port of the dataset with the input port of the Clean Missing Data module, which means the data from the dataset will flow along this path.

    Screenshot

  14. Click Save on the command bar at the bottom to save your in-progress experiment.

    Screenshot

  15. Click Run in the command bar at the bottom to run the experiment.

    Screenshot

  16. When the experiment is finished running, you will see a finished message in the top right corner of the design surface, and green check marks over all modules that ran.

    Screenshot

  17. You should run your experiment whenever you need to update the metadata describing what data is flowing through the modules, so that newly added modules can be aware of the shape of your data (most modules have dialogs that can suggest columns, but before they can make suggestions you need to have run your experiment).

  18. Click the Clean Missing Data module to select it. The property panel on the right will display the settings appropriate to the selected module.

  19. In this case, we want to remove rows that have no value for the DepDel15 column. Begin by clicking Launch Column Selector.

    Screenshot

  20. Ensuring With Rules is selected on the left side of the dialog, under the Begin With section, select No Columns. In the row of controls that appears, change the second drop down to Column Names. Then, in the text box that appears, begin to type DepDel15 and select that item from the type-ahead list.

    Screenshot

  21. Click the checkmark to apply the settings. You have now indicated to the Clean Missing Data module that DepDel15 is the only column it should act on.

    Screenshot

  22. In the Properties panel for Clean Missing Data, click the Cleaning mode drop down and select Remove entire row. Now your Clean Missing Data module is fully configured to remove any rows that are missing values for DepDel15.

    Screenshot

  23. To verify the result, run your experiment again. After it is finished, click the leftmost output port of the Clean Missing Data module and select Visualize.

  24. In the dialog that appears, scroll over to DepDel15 and click the column. In the statistics you should see that Missing Values reads 0.

    Screenshot

  25. Our model will approximate departure times to the nearest hour, but departure time is captured as an integer. For example, 8:37 am is captured as 837. Therefore, we will need to process the CRSDepTime column and round it down to the nearest hour.

  26. To perform this rounding, two steps are required. First, you will divide the value by 100 (so that 837 becomes 8.37). Second, you will round this value down to the nearest hour (so that 8.37 becomes 8).
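
    A minimal sketch of these two steps in plain R (assuming the data frame and column names used elsewhere in this exercise):

    # 837 -> 8.37 (divide by 100), then 8.37 -> 8 (round down to the hour)
    ds.flights$CRSDepHour <- floor(ds.flights$CRSDepTime / 100)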

  27. Begin by adding an Apply Math Operation module beneath the Clean Missing Data module and connect the leftmost output port of the Clean Missing Data module to the input port of the Apply Math Operation.

    Screenshot

  28. In the properties of the Apply Math Operation, set the Category to Operations, Basic operation to Divide, Operation argument type to Constant, Constant operation argument to 100, Selected columns to CRSDepTime (see screenshot below), and Output mode to Append.

    Screenshot

    Screenshot

  29. Run the experiment to update the metadata.

  30. This module will add a new column to the dataset called Divide(CRSDepTime_$100), but we want to rename it to CRSDepHour. To do so, add an Edit Metadata module and connect its input port to the output port of the Apply Math Operation.

    Screenshot

  31. For the properties of the Edit Metadata, set the Selected columns to Divide(CRSDepTime_$100) and New column names to CRSDepHour.

    Screenshot

    Screenshot

  32. Run the experiment to update the metadata.

  33. Add another Apply Math Operation module to round the time down to the nearest hour and connect it to the Edit Metadata module.

  34. Set the Category to Rounding, Selected columns to CRSDepHour (see screenshot for how to select), and Output mode to Inplace.

    Screenshot

    Screenshot

  35. Run the experiment to update the metadata.

  36. We do not need all of the columns present in the FlightDelaysWithAirportCodes dataset. There are several ways to pare down the columns, but in this case we will use an Execute R Script module that selects only the columns of interest using R code.

  37. Add an Execute R Script module beneath the last Apply Math Operation, and connect the output of the Apply Math Operation to the first input port (leftmost) of the Execute R Script.

    Screenshot

  38. In the Properties panel for Execute R Script, click the "double window" icon to maximize the script editor.

    Screenshot

  39. Replace the default script with the following and click the checkmark to save it.

    ds.flights <- maml.mapInputPort(1)
    # Trim the columns to only those we will use for the predictive model
    ds.flights = ds.flights[, c("OriginAirportCode","OriginLatitude", "OriginLongitude", "Month", "DayofMonth", "CRSDepHour", "DayOfWeek", "Carrier", "DestAirportCode", "DestLatitude", "DestLongitude", "DepDel15")]
    maml.mapOutputPort("ds.flights");
  40. Run the experiment to update the metadata (this may take a minute or two to complete).

  41. Right click on the leftmost output port of your Execute R Script module and select Visualize.

  42. Verify that the dataset only contains the 12 columns referenced in the R script.

    Screenshot

  43. At this point the Flight Delay Data is prepared, and we turn to preparing the historical weather data.

Task 5: Prepare the Weather Data

  1. To the right of the FlightDelaysWithAirportCodes dataset, add the FlightWeatherWithAirportCodes dataset.

    Screenshot

  2. Right click the output port of the FlightWeatherWithAirportCodes dataset and select Visualize.

    Screenshot

  3. Observe that this data set has 406,516 rows and 29 columns. For this model, we are going to focus on predicting delays using WindSpeed (in MPH), SeaLevelPressure (in inches of Hg), and HourlyPrecip (in inches). We will focus on preparing the data for those features.

  4. In the dialog, click the WindSpeed column and review the statistics. Observe that the Feature Type was inferred as String and that there are 32 Missing Values. Below that, examine the histogram to see that, even though the type was inferred as string, the values are all actually numbers (e.g. the x-axis values are 0, 6, 5, 7, 3, 8, 9, 10, 11, 13). We will need to ensure that we remove any missing values and convert WindSpeed to its proper type as a numeric feature.

    Screenshot

  5. Next, click the SeaLevelPressure column. Observe that the Feature Type was inferred as String and there are 0 Missing Values. Scroll down to the histogram and observe that many of the values are numeric (e.g., 29.96, 30.01, etc.), but many rows have the string value M, for Missing. We will need to replace this value of M with a suitable numeric value so that we can convert this feature to a numeric feature.

    Screenshot

  6. Finally, examine the HourlyPrecip feature. Observe that it too was inferred to have a Feature Type of String and is missing values for 374,503 rows. Looking at the histogram, observe that besides the numeric values there is a value of T (for Trace amounts of rain). We will need to replace the T with a suitable numeric value and convert this feature to a numeric feature.

    Screenshot

  7. Let us begin by cleaning up the missing values for both WindSpeed and HourlyPrecip.

  8. Below the FlightWeatherWithAirportCodes dataset, drop a Clean Missing Data module and connect the output of the dataset to the input of the module.

    Screenshot

  9. Run the experiment to update the metadata available to the Clean Missing Data module.

  10. In the Properties panel for Clean Missing Data, set the Selected columns to HourlyPrecip and WindSpeed (see screenshot for help selecting if needed), set the Cleaning mode to Custom substitution value and set the Replacement value to 0.0.

    Screenshot

    Screenshot

  11. Next, add an Execute R Script module below the Clean Missing Data module and connect the first output port of the former to the first input port of the latter.

    Screenshot

  12. In the Properties panel for the Execute R Script, click the "double window" icon to open the script editor.

  13. Paste in the following script and click the checkmark. This script replaces HourlyPrecip values of T with 0.005, WindSpeed values of M with 0.005, and SeaLevelPressure values of M with the global average pressure of 29.92. It also narrows the dataset to just the few feature columns we want to use with our model.

    ds.weather <- maml.mapInputPort(1)
    
    # Round weather time up to the next hour since
    # that's the hour for which we want to use flight data
    ds.weather$Hour = ceiling(ds.weather$Time / 100)
    
    # Replace any WindSpeed values of "M" with 0.005 and make the feature numeric
    speed.num = ds.weather$WindSpeed
    speed.num[speed.num == "M"] = 0.005
    speed.num = as.numeric(speed.num)
    ds.weather$WindSpeed = speed.num 
    
    # Replace any SeaLevelPressure values of "M" with 29.92 (the average pressure) and make the feature numeric
    pressure.num = ds.weather$SeaLevelPressure
    pressure.num[pressure.num == "M"] = 29.92
    pressure.num = as.numeric(pressure.num)
    ds.weather$SeaLevelPressure = pressure.num 
    
    # Adjust the HourlyPrecip variable (convert "T" (trace) to 0.005)
    rain = ds.weather$HourlyPrecip
    rain[rain %in% c("T")] = "0.005"
    ds.weather$HourlyPrecip = as.numeric(rain)
    
    # Pare down the variables in the Weather dataset
    ds.weather = ds.weather[, c("AirportCode", "Month", "Day", "Hour", "WindSpeed", "SeaLevelPressure", "HourlyPrecip")]
    
    maml.mapOutputPort("ds.weather");
  14. Run the experiment. Currently it should appear as follows:

    Screenshot

  15. Click the first output port of the Execute R Script module and select Visualize.

  16. In the statistics, verify that WindSpeed, SeaLevelPressure, and HourlyPrecip are now all Numeric Feature types and that they have no missing values.

Task 6: Join the Flight and Weather Datasets

  1. With both datasets ready, we want to join them together so that we can associate the historical flight delays with the weather data at departure time.

  2. Drag the Join Data module onto the design surface, beneath and centered between both Execute R Script modules. Connect the leftmost output port of the left Execute R Script module to the leftmost input port of the Join Data module, and the leftmost output port of the right Execute R Script module to the rightmost input port of the Join Data module.

    Screenshot

  3. In the Properties panel of the Join Data module, relate the rows of data between the two sets L (the flight delays) and R (the weather). Set the Join key columns for L to include OriginAirportCode, Month, DayofMonth, and CRSDepHour.

    Screenshot

  4. Set the Join key columns for R to include AirportCode, Month, Day, and Hour.

    Screenshot

  5. Leave the Join Type at inner join and uncheck Keep right key columns in joined table (so that we do not include the redundant values of AirportCode, Month, Day, and Hour).

    Screenshot
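
    Conceptually, this configuration performs the same inner join you could express in R with merge (a sketch only; the frame names follow the earlier scripts, and, like the module configured above, merge emits the key columns only once):

    # Match each flight to the weather at its origin airport for the scheduled departure hour
    ds.joined <- merge(ds.flights, ds.weather,
                       by.x = c("OriginAirportCode", "Month", "DayofMonth", "CRSDepHour"),
                       by.y = c("AirportCode", "Month", "Day", "Hour"))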

  6. Next, add an Edit Metadata module and connect its input port to the output port of the Join Data module. We will use this module to convert the fields that are currently unbounded String feature types into enumeration-like Categorical features. On the Properties panel, set the Selected columns to DayOfWeek, Carrier, DestAirportCode, and OriginAirportCode. Set the Categorical drop down to Make categorical.

    Screenshot

    Screenshot

  7. Run the experiment to update the metadata.

  8. Add a Select Columns in Dataset module, connect the output of the previous Edit Metadata module to the input of the Select Columns in Dataset module, then set the selected columns to exclude (see the screenshot for how to do this): OriginLatitude, OriginLongitude, DestLatitude, and DestLongitude.

    Screenshot

  9. Save your experiment.

Task 7: Train the Model

AdventureWorks Travel wants to build a model to predict if a departing flight will have a delay of 15 minutes or more. In the historical data they have provided, the indicator for such a delay is found within DepDel15 (where a value of 1 means delay, 0 means no delay). To create a model that predicts such a binary outcome, we can choose from the various Two-Class modules that Azure ML offers. For our purposes, we begin with a Two-Class Logistic Regression. This type of classification module must first be trained on sample data that includes the features important to making a prediction as well as the actual historical outcomes for those features.

The typical pattern is to split the historical data so a portion is shown to the model for training purposes, and another portion is reserved to test just how well the trained model performs against examples it has not seen before.
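
Outside of the Studio designer, the same split/train/score pattern could be sketched in plain R as follows (illustrative only; ds.model is a hypothetical data frame holding the joined, prepared data, and 0.5 is an assumed decision threshold):

    # 70/30 split with a fixed seed, mirroring the Split Data settings below
    set.seed(7634)
    train.rows <- sample(nrow(ds.model), floor(0.7 * nrow(ds.model)))
    train <- ds.model[train.rows, ]
    test <- ds.model[-train.rows, ]

    # Two-class logistic regression trained on the 70% portion
    fit <- glm(DepDel15 ~ ., data = train, family = binomial)

    # Score the held-out 30%: probabilities, then 0/1 labels
    scored.probabilities <- predict(fit, newdata = test, type = "response")
    scored.labels <- ifelse(scored.probabilities >= 0.5, 1, 0)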

  1. Drag a Split Data module beneath Select Columns in Dataset and connect them.

    Screenshot

  2. On the Properties panel for the Split Data module, set the Fraction of rows in the first output dataset to 0.7 (so 70% of the historical data will flow to output port 1). Set the Random seed to 7634.

    Screenshot

  3. Next, add a Train Model module and connect it to the leftmost output of the Split Data module.

    Screenshot

  4. On the Properties panel for the Train Model module, set the Selected columns to DepDel15.

    Screenshot

  5. Drag a Two-Class Logistic Regression module above and to the left of the Train Model module and connect its output to the leftmost input of the Train Model module.

    Screenshot

  6. Below the Train Model module, drop a Score Model module. Connect the output of the Train Model module to the leftmost input port of the Score Model module and connect the rightmost output of the Split Data module to the rightmost input of the Score Model module.

    Screenshot

  7. Run the experiment.

  8. When the experiment is finished running (which may take a few minutes), right click on the output port of the Score Model module and select Visualize to see the results of its predictions. You should have a total of 13 columns.

    Screenshot

  9. If you scroll to the right so that you can see the last two columns, observe there is a Scored Labels column and a Scored Probabilities column. The former is the prediction (1 for predicting delay, 0 for predicting no delay) and the latter is the probability of the prediction. In the following screenshot, for example, the last row shows a delay prediction with a 53.1% probability.

    Screenshot

  10. While this view enables you to see the prediction results for the first 100 rows, if you want more detailed statistics across the prediction results to evaluate your model's performance, you can use the Evaluate Model module.

  11. Drag an Evaluate Model module on to the design surface beneath the Score Model module. Connect the output of the Score Model module to the leftmost input of the Evaluate Model module.

    Screenshot

  12. Run the experiment.

  13. When the experiment is finished running, right-click the output of the Evaluate Model module and select Visualize. In this dialog box, you are presented with various ways to understand how your model is performing in the aggregate. While we will not cover how to interpret these results in detail, we can examine the ROC chart that tells us that at least our model (the blue curve) is performing better than random (the light gray straight line going from 0,0 to 1,1). A good start for our first model!

    Screenshot

Task 8: Operationalize the Experiment

  1. Now that we have a functioning model, let's package it up into a predictive experiment that can be called as a web service.

  2. In the command bar at the bottom, click Set Up Web Service and then select Predictive Web Service. If the Set Up Web Service option is grayed out, you may need to run the experiment again by clicking the RUN button.

    Screenshot

  3. A copy of your training experiment is created that contains the trained model wrapped between a Web service input module (the web service action you invoke with parameters) and a Web service output module (how the result of scoring the parameters is returned).

  4. We will make some adjustments to the web service input and output modules to control the parameters we require and the results we return.

  5. When packaging the Predictive Web Service, Azure ML added two Apply Transformation modules which are not needed. Delete both of the Apply Transformation Modules.

    Screenshot

  6. The Apply Transformation modules were added to support the Clean Missing Data modules. We will not be using these steps in our flow, so delete both Clean Missing Data Modules.

    Screenshot

  7. Reconnect the FlightDelaysWithAirportCodes dataset to the input of the Apply Math Operation module that is directly beneath it.

    Screenshot

  8. Reconnect the FlightWeatherWithAirportCodes dataset to the leftmost input port of the Execute R Script module beneath it.

    Screenshot

  9. Now move the Web service input module down so it is to the right of the Join Data module. Connect the output of the Web service input to the input of the Edit Metadata module.

    Screenshot

  10. Right click the line connecting the Join Data module and the Edit Metadata module and select Delete.

    Screenshot

  11. In between the Join Data and Edit Metadata modules, drop a Select Columns in Dataset module and connect Join Data to this new Select Columns in Dataset module. In the Properties panel for the Select Columns in Dataset module, set the Selected columns to All columns, then exclude (notice this is an exclude operation) the columns DepDel15, OriginLatitude, OriginLongitude, DestLatitude, and DestLongitude. This configuration updates the web service metadata so that these columns do not appear as required input parameters for the web service.

    Screenshot

  12. Connect the Select Columns in Dataset output to the input of the Edit Metadata module.

    Screenshot

  13. Select the Select Columns in Dataset module that comes after the Edit Metadata module and delete it.

  14. Connect the output of the Edit Metadata module directly to the right input of the Score Model module.

    Screenshot

  15. Because we removed the latitude and longitude columns from the dataset to exclude them as inputs to the web service, we have to add them back in before we return the result, so that the results can be easily visualized on a map.

  16. To add these fields back, begin by deleting the line between the Score Model and Web service output.

  17. Drag the AirportCodeLocationLookupClean dataset on to the design surface, positioning it below the Score Model module.

    Screenshot

  18. Add a Join Data module. Connect the output of the Score Model module to the leftmost input of the Join Data module and the output of the dataset to the rightmost input of the Join Data module.

    Screenshot

  19. In the Properties panel for the Join Data module, for the Join key columns for L property, set the selected columns to OriginAirportCode. For the Join key columns for R property, set the selected columns to AIRPORT. Uncheck Keep right key columns in joined table.

    Screenshot

  20. Add a Select Columns in Dataset module beneath the Join Data module. Connect the Join Data output to the input of the Select Columns in Dataset module.

    Screenshot

  21. In the Properties panel, set the Selected columns to exclude (not include) the columns: AIRPORT_ID and DISPLAY_AIRPORT_NAME.

    Screenshot

    Screenshot

  22. Add an Edit Metadata module. Connect the output of the Select Columns in Dataset module to the input of the Edit Metadata module.

    Screenshot

  23. In the Properties panel for the Edit Metadata module, set the Selected columns to LATITUDE and LONGITUDE. In the New column names enter: OriginLatitude, OriginLongitude.

    Screenshot

  24. Connect the output of the Edit Metadata module to the input of the Web service output module.

    Screenshot

  25. Run the experiment. This should take 5-7 minutes.

Task 9: Deploy Web Service and Note API Information

  1. When the experiment is finished running, click Deploy Web Service [New]. This will launch the web service deployment wizard.

  2. You can leave the default name, select Create new... for Price Plan and then provide a Plan Name value. Finally, under Monthly Plan Options select Standard DevTest.

    Screenshot

  3. Scroll down and click the Deploy button. After deployment completes, you will be taken to the Quick Start page for your new web service.

    Screenshot

  4. From the Quick Start page, click the Use Web Service link.

  5. Click the Copy button for the Primary key, open a copy of Notepad, and paste the value in the editor.

  6. Click the Copy button for the Request-Response link. The URL will look something like the following:
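
    • Note: the region in the hostname varies by workspace; the Request-Response URL follows a pattern like the following (placeholders shown instead of the actual GUIDs):

      https://<region>.services.azureml.net/workspaces/<workspace-id>/services/<service-id>/execute?api-version=2.0&details=true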

  7. The first GUID after workspaces is your Workspace ID. The second GUID after services is your Service ID.

  8. Copy each of these values into Notepad as well. Make sure you note which GUID is which because you will need these in a later step.

  9. Finally, copy the Batch Requests URL to Notepad as well, but make sure to remove the '?' character and everything after it. You should be left with a URL that looks something like the pattern below. Again, make sure to label this as your batch service in your Notepad instance.
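
    • Note: assuming the same illustrative URL scheme as above, the trimmed Batch Requests URL typically ends in /jobs:

      https://<region>.services.azureml.net/workspaces/<workspace-id>/services/<service-id>/jobs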

    Screenshot

Next Exercise: [Exercise 2 - Setup Azure Data Factory](02 Exercise 2 - Setup Azure Data Factory.md)