Problem Statement: To build a classification methodology to predict the type of forest cover based on the given training data.
Data Ingestion: The client will send data in multiple sets of files in batches at a given location. The data contains different indicators used to classify the records into the given forest cover types.
Data description:

| Name | Data Type | Measurement | Description |
| --- | --- | --- | --- |
| Elevation | quantitative | meters | Elevation in meters |
| Aspect | quantitative | azimuth | Aspect in degrees azimuth |
| Slope | quantitative | degrees | Slope in degrees |
| Horizontal_Distance_To_Hydrology | quantitative | meters | Horz Dist to nearest surface water features |
| Vertical_Distance_To_Hydrology | quantitative | meters | Vert Dist to nearest surface water features |
| Horizontal_Distance_To_Roadways | quantitative | meters | Horz Dist to nearest roadway |
| Horizontal_Distance_To_Fire_Points | quantitative | meters | Horz Dist to nearest wildfire ignition points |
| Wilderness_Area (4 binary columns) | qualitative | 0 (absence) or 1 (presence) | Wilderness area designation |
| Soil_Type (40 binary columns) | qualitative | 0 (absence) or 1 (presence) | Soil type designation |
| Cover_Type (7 types) | integer | 1 to 7 | Forest cover type designation |
Step 1) Data Validation
In this step, we perform different sets of validation on the given set of training files.
- Name Validation - We validate the name of the files based on the given name in the schema file. We have created a regex pattern as per the name given in the schema file to use for validation. After validating the pattern in the name, we check the length of the date in the file name as well as the length of the time in the file name. If all the values are as per requirement, we move such files to "Good_Data_Folder"; else we move them to "Bad_Data_Folder".
- Number of Columns - We validate the number of columns present in the files, and if it doesn't match the value given in the schema file, then the file is moved to "Bad_Data_Folder".
- Name of Columns - The name of the columns is validated and should be the same as given in the schema file. If not, then the file is moved to "Bad_Data_Folder".
- Datatype of columns - The datatype of columns is given in the schema file. This is validated when we insert the files into the database. If the datatype is wrong, then the file is moved to "Bad_Data_Folder".
- Null values in columns - If all the values in any column of a file are NULL or missing, we discard such a file and move it to "Bad_Data_Folder".
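As a rough illustration of the name and column checks above, the sketch below moves each batch file to "Good_Data_Folder" or "Bad_Data_Folder"; the file prefix, regex pattern and function name are assumptions, not the exact project code, and the schema values are passed in as parameters.

```python
import os
import re
import shutil

import pandas as pd

def validate_batch_files(batch_dir, length_of_date, length_of_time,
                         expected_columns,
                         good_dir="Good_Data_Folder", bad_dir="Bad_Data_Folder"):
    """Move each batch file to Good_Data_Folder or Bad_Data_Folder."""
    # Assumed naming convention: forest_cover_<date>_<time>.csv
    pattern = re.compile(r"^forest_cover_(\d+)_(\d+)\.csv$")
    os.makedirs(good_dir, exist_ok=True)
    os.makedirs(bad_dir, exist_ok=True)

    for file_name in os.listdir(batch_dir):
        src = os.path.join(batch_dir, file_name)
        match = pattern.match(file_name)
        try:
            # Name validation: regex match plus date/time length checks
            name_ok = (match is not None
                       and len(match.group(1)) == length_of_date
                       and len(match.group(2)) == length_of_time)
            # Column-count validation (column names and datatypes are
            # checked in a similar way against the schema file)
            columns_ok = name_ok and pd.read_csv(src).shape[1] == expected_columns
        except Exception:
            name_ok = columns_ok = False
        target = good_dir if (name_ok and columns_ok) else bad_dir
        shutil.copy(src, os.path.join(target, file_name))
```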
Step 2) Data Insertion in Database
- Database Creation and connection - A database is created with the given name provided. If the database already exists, the connection to it is opened.
- Table creation in the database - A table named "Good_Data" is created in the database for inserting the files from the "Good_Data_Folder", based on the column names and datatypes given in the schema file. If the table is already present, a new table is not created, and new files are inserted into the existing table, as we want training to be done on new as well as old training files.
- Insertion of files in the table - All the files in the "Good_Data_Folder" are inserted in the above-created table. If any file has invalid data type in any of the columns, the file is not loaded in the table and is moved to "Bad_Data_Folder".
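A minimal sketch of this step, assuming SQLite as the backing store (the actual project may use a different database) and a column_spec dictionary read from the schema file:

```python
import csv
import os
import sqlite3

def insert_good_files(db_name, column_spec,
                      good_dir="Good_Data_Folder", bad_dir="Bad_Data_Folder"):
    """Create/open the database, create Good_Data if needed, insert all files."""
    conn = sqlite3.connect(f"{db_name}.db")        # creates the DB if missing
    columns_sql = ", ".join(f'"{name}" {dtype}' for name, dtype in column_spec.items())
    conn.execute(f"CREATE TABLE IF NOT EXISTS Good_Data ({columns_sql})")
    os.makedirs(bad_dir, exist_ok=True)

    placeholders = ", ".join("?" for _ in column_spec)
    for file_name in os.listdir(good_dir):
        path = os.path.join(good_dir, file_name)
        try:
            with open(path, newline="") as handle:
                reader = csv.reader(handle)
                next(reader)                       # skip the header row
                conn.executemany(
                    f"INSERT INTO Good_Data VALUES ({placeholders})", reader)
            conn.commit()
        except Exception:
            # Invalid rows: roll back and move the file to Bad_Data_Folder
            conn.rollback()
            os.replace(path, os.path.join(bad_dir, file_name))
    conn.close()
```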
Step 3) Model Training
- Data Export from Db - The data stored in the database is exported as a CSV file to be used for model training.
- Data Preprocessing
a) Check for null values in the columns. If present, impute the null values using the KNN imputer.
b) Encode the categorical values in the class column.
c) Scale the numerical values in the given dataset after we split it into train and test sets.
- Clustering - The KMeans algorithm is used to create clusters in the preprocessed data. The optimum number of clusters is selected by plotting the elbow plot, and for dynamic selection of the number of clusters we use the "KneeLocator" function. The idea behind clustering is to train a separate model on each cluster of the data. The KMeans model is trained over the preprocessed data, and the model is saved for further use in prediction. (A sketch of the preprocessing and clustering flow follows this list.)
- Model Selection - After the clusters are created, we find the best model for each cluster. We are using two algorithms, "Random Forest" and "XGBoost". For each cluster, both algorithms are trained with the best parameters derived from GridSearch. We calculate the AUC score for both models and select the model with the better score. In this way, a model is selected for each cluster, and all the models are saved for use in prediction.
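The preprocessing and clustering flow above could look roughly like the sketch below; the exported file name, imputer settings and cluster search range are assumptions, and scaling is shown on the full feature matrix for brevity rather than after the split.

```python
import pandas as pd
from kneed import KneeLocator
from sklearn.cluster import KMeans
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Assumed name of the CSV exported in the "Data Export from Db" step
data = pd.read_csv("training_data_from_db.csv")
X = data.drop(columns=["Cover_Type"])

# a) Impute any missing values with the KNN imputer
X = pd.DataFrame(KNNImputer(n_neighbors=3).fit_transform(X), columns=X.columns)

# b) Encode the class column (Cover_Type 1-7 becomes 0-6)
encoder = LabelEncoder()
y = encoder.fit_transform(data["Cover_Type"])

# c) Scale the numerical features
X_scaled = StandardScaler().fit_transform(X)

# Elbow plot data: within-cluster sum of squares for k = 1..10,
# with KneeLocator picking the elbow dynamically
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled).inertia_
        for k in range(1, 11)]
best_k = KneeLocator(range(1, 11), wcss, curve="convex", direction="decreasing").knee

kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X_scaled)
clusters = kmeans.labels_
```

Continuing the same sketch, per-cluster model selection with a small, assumed parameter grid and one-vs-rest AUC might look like the following; it assumes each cluster's train and test splits contain the same set of classes.

```python
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

# Assumed (small) parameter grids for GridSearch
candidates = {
    "random_forest": (RandomForestClassifier(random_state=42),
                      {"n_estimators": [100, 200], "max_depth": [None, 10]}),
    "xgboost": (XGBClassifier(eval_metric="mlogloss"),
                {"n_estimators": [100, 200], "learning_rate": [0.1, 0.3]}),
}

for cluster_id in np.unique(clusters):
    mask = clusters == cluster_id
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled[mask], y[mask], test_size=0.2, random_state=42)

    best_model, best_auc = None, -np.inf
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(estimator, grid, cv=3).fit(X_train, y_train)
        proba = search.best_estimator_.predict_proba(X_test)
        auc = roc_auc_score(y_test, proba, multi_class="ovr",
                            labels=search.best_estimator_.classes_)
        if auc > best_auc:
            best_model, best_auc = search.best_estimator_, auc

    # One best model saved per cluster, loaded later by cluster number
    joblib.dump(best_model, f"model_cluster_{cluster_id}.pkl")

# Save the KMeans model and label encoder for the prediction workflow
joblib.dump(kmeans, "kmeans_model.pkl")
joblib.dump(encoder, "label_encoder.pkl")
```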
Step 4) Prediction Data Description
The client will send data in multiple sets of files in batches at a given location. Apart from the prediction files, we also require a "schema" file from the client which contains all the relevant information about the prediction files, such as: the name of the files, the length of the date value in the file name, the length of the time value in the file name, the number of columns, and the names of the columns and their datatypes.
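For illustration only, a schema file capturing those fields might be structured as below; the key names and values are assumptions, not the client's actual schema.

```python
import json

# Hypothetical schema for the prediction batch files; every value is a
# placeholder that would come from the client.
schema = {
    "SampleFileName": "forest_cover_08012020_120000.csv",
    "LengthOfDateStampInFile": 8,
    "LengthOfTimeStampInFile": 6,
    "NumberofColumns": 51,  # placeholder feature-column count
    "ColName": {
        "Elevation": "INTEGER",
        "Aspect": "INTEGER",
        "Slope": "INTEGER",
        # ...remaining columns follow the data description table above
    },
}

with open("schema_prediction.json", "w") as handle:
    json.dump(schema, handle, indent=4)
```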
Step 5) Data Validation
In this step, we perform different sets of validation on the given set of prediction files.
- Name Validation - We validate the name of the files based on the given name in the schema file. We have created a regex pattern as per the name given in the schema file to use for validation. After validating the pattern in the name, we check the length of the date in the file name as well as the length of the time in the file name. If all the values are as per requirement, we move such files to "Good_Data_Folder"; else we move them to "Bad_Data_Folder".
- Number of Columns - We validate the number of columns present in the files; if it doesn't match the value given in the schema file, then the file is moved to "Bad_Data_Folder".
- Name of Columns - The name of the columns is validated and should be the same as given in the schema file. If not, then the file is moved to "Bad_Data_Folder".
- Datatype of columns - The datatype of columns is given in the schema file. This is validated when we insert the files into the database. If the datatype is wrong, then the file is moved to "Bad_Data_Folder".
- Null values in columns - If all the values in any column of a file are NULL or missing, we discard such a file and move it to "Bad_Data_Folder".
Step 6) Data Insertion in Database
- Database Creation and connection - A database is created with the given name provided. If the database already exists, the connection to it is opened.
- Table creation in the database - A table named "Good_Data" is created in the database for inserting the files from the "Good_Data_Folder", based on the column names and datatypes given in the schema file. If the table is already present, a new table is not created, and new files are inserted into the existing table, as we want prediction to be done on new as well as old files.
- Insertion of files in the table - All the files in the "Good_Data_Folder" are inserted in the above-created table. If any file has invalid data type in any of the columns, the file is not loaded in the table and is moved to "Bad_Data_Folder".
Step 7) Prediction
- Data Export from Db - The data stored in the database is exported as a CSV file to be used for prediction.
- Data Preprocessing
a) Check for null values in the columns. If present, impute the null values using the KNN imputer.
b) Encode the categorical values in the class column.
c) Scale the numerical values in the given dataset.
- Prediction - Based on the cluster number, the respective model is loaded and used to predict the results for that cluster.
- Once the prediction is made for all the clusters, the predictions are saved in a CSV file at a given location, with the forest cover labels as given in the training set before encoding, and the location is returned to the client.
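A rough sketch of this prediction flow, reusing the (assumed) artefact names saved in the training sketch earlier; the exported file is assumed to be preprocessed (imputed and scaled) the same way as the training data before this point.

```python
import joblib
import pandas as pd

# Assumed export of the preprocessed prediction data from the database
data = pd.read_csv("prediction_data_from_db.csv")

kmeans = joblib.load("kmeans_model.pkl")
encoder = joblib.load("label_encoder.pkl")
clusters = kmeans.predict(data.values)

results = []
for cluster_id in sorted(set(clusters)):
    subset = data[clusters == cluster_id]
    model = joblib.load(f"model_cluster_{cluster_id}.pkl")
    # Map encoded predictions back to the original Cover_Type labels
    predictions = encoder.inverse_transform(model.predict(subset.values))
    results.append(pd.Series(predictions, index=subset.index, name="Cover_Type"))

# Collect predictions from all clusters and save them at the given location
output = pd.concat(results).sort_index().to_frame()
output.to_csv("forest_cover_predictions.csv", index=False)
```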
Step 8) Deployment
We will be deploying the model to the Google Cloud Platform (App Engine). The accompanying workflow diagram shows prediction using the trained model.
The requirements.txt file lists all the packages needed to deploy the app to the cloud.
main.py is the entry point of our application, where the Flask server starts and incoming prediction requests are handled (a minimal sketch of this entry point is shown after the deployment steps below).
• Go to https://cloud.google.com/ and create an account if you haven't already created one. Then go to the console of your account.
• Go to IAM & admin and click Manage resources.
• Click CREATE PROJECT to create a new project for deployment.
• Once the project is created, select App Engine and open the Dashboard.
• Download the Google Cloud SDK to your machine from https://dl.google.com/dl/cloudsdk/channels/rapid/GoogleCloudSDKInstaller.exe.
• Click Start Tutorial on the screen, select Python app and click Start.
• Check whether the correct project name is displayed and then click Next.
• Create a file 'app.yaml' and put 'runtime: python37' in that file.
• Create a 'requirements.txt' file by opening the command prompt/Anaconda prompt, navigating to the project folder and entering the command 'pip freeze > requirements.txt'. It is recommended to use separate environments for different projects.
• Your Python application file should be called 'main.py'; this is a GCP-specific requirement.
• Open the gcloud shell, navigate to the project folder and enter the command 'gcloud init' to initialise the gcloud context.
• It asks you to select from the list of available projects.
• Once the project name is selected, enter the command 'gcloud app deploy app.yaml --project' followed by your project name.
• After executing the above command, GCP will ask you to enter the region for your application. Choose the appropriate one.
• GCP will ask for the services to be deployed. Enter 'y' to deploy the services.
• It will then give you the link to your app.
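A minimal sketch of what the main.py entry point could look like, assuming a single prediction route; the route name and the run_prediction_pipeline helper are hypothetical, not the project's exact code.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_prediction_pipeline(batch_path):
    # Hypothetical wrapper around Steps 5-7 (validation, database insertion,
    # prediction); returns the location of the prediction CSV.
    return "forest_cover_predictions.csv"

@app.route("/predict", methods=["POST"])
def predict():
    # The client posts the folder path holding the prediction batch files
    batch_path = request.json.get("filepath")
    output_location = run_prediction_pipeline(batch_path)
    return jsonify({"prediction_file": output_location})

if __name__ == "__main__":
    # Local run; on App Engine the app is served through its own entrypoint
    app.run(host="0.0.0.0", port=8080)
```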