Hello!

Below you can find an outline of how to reproduce my 4th place solution for the Microsoft Malware Prediction Challenge competition.

CONTENTS

Directories and Files:
data/ - The directory will store the raw competition data, along with another folder for the processed data
data/clean/ - The processed data used for the final submission
logs/ - The training logs, and feature importances of the models
models/ - The 5 LightGBM models that were used to produce the submission
submissions/ - The final submission.csv file
predict.py - Makes model predictions; uses data from data/clean/ and the models in models/, and stores predictions in submissions/
prepare_data.py - Processes the raw data and saves it in data/clean/
SETTINGS.json - Paths to all directories and file locations referenced in the code
train.py - Trains the models; uses data from data/clean/, saves training logs and feature importances in logs/, and saves the models in models/
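
As an illustration of how a script might consume SETTINGS.json, here is a minimal sketch. It assumes the file is a flat JSON object mapping names to paths; the key names used with it (for example `MODEL_DIR`) are hypothetical and may not match the keys in the actual file.

```python
import json
from pathlib import Path


def load_settings(path="SETTINGS.json"):
    """Read the settings file and return its entries as pathlib.Path objects.

    Assumes a flat JSON object of name -> path strings (an assumption,
    not confirmed by the repository).
    """
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    return {key: Path(value) for key, value in raw.items()}


def missing_directories(settings, dir_keys):
    """Return the keys in dir_keys whose configured directory does not exist."""
    return [key for key in dir_keys if not settings[key].is_dir()]
```

Calling `missing_directories(load_settings(), ["MODEL_DIR"])` before a long run is one way to catch a misconfigured path early; replace the key list with whatever the real file defines.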

Hardware: (The following specs were used to create the original solution)

  1. AWS c4.8xlarge (36 vCPUs, 60 GB memory)
  2. Ubuntu 16.04 LTS (100 GB boot disk)

Software: (Python packages are detailed separately in requirements.txt)

Python 3.7.1

Data setup

The following commands will download the raw train and test files from the competition into the data/ directory. They assume the Kaggle API is installed.

cd data
kaggle competitions download microsoft-malware-prediction -f test.csv
kaggle competitions download microsoft-malware-prediction -f train.csv
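
Depending on the Kaggle API version, the downloaded files may arrive as `.zip` archives rather than plain CSVs. The sketch below is one way to check that both raw files are in place before processing; the helper name `ensure_raw_files` is hypothetical, and it assumes any archive is named `<file>.zip` and contains the CSV at its top level.

```python
import zipfile
from pathlib import Path


def ensure_raw_files(data_dir="data", names=("train.csv", "test.csv")):
    """Unzip any <name>.zip archives found and return the names still missing."""
    data_dir = Path(data_dir)
    missing = []
    for name in names:
        target = data_dir / name
        archive = data_dir / (name + ".zip")
        # Extract only when the CSV is absent but its archive is present.
        if not target.exists() and archive.exists():
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(data_dir)
        if not target.exists():
            missing.append(name)
    return missing
```

An empty list from `ensure_raw_files()` means both files are ready for the processing step.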

Process the data

The following code will process the raw competition data stored in the data/ directory, and save the processed data in the data/clean/ directory. (NOTE: Running this code will overwrite the files data/clean/test-clean.pkl and data/clean/train-clean.pkl.)
python prepare_data.py
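
Because this step overwrites existing processed files, it can be worth copying them aside first. The following is a minimal sketch of such a backup; the helper `backup_files` and the `data/backup` location are hypothetical and not part of the repository.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_files(paths, backup_root="data/backup"):
    """Copy each existing file in paths into a timestamped folder.

    Returns the list of copies made; paths that do not exist are skipped.
    """
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest_dir = Path(backup_root) / stamp
    copied = []
    for path in map(Path, paths):
        if path.is_file():
            dest_dir.mkdir(parents=True, exist_ok=True)
            copied.append(Path(shutil.copy2(path, dest_dir / path.name)))
    return copied
```

The same helper could be pointed at the contents of logs/, models/, or submissions/ before the later steps, which also overwrite their outputs.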

Model training

The following code will retrain the models using the data in the data/clean/train-clean.pkl file. Training will take over 9 hours. (NOTE: This will overwrite all files in logs/ and models/.)
python train.py

Model prediction

The following code will use the data/clean/test-clean.pkl data to produce a new submission file. This will take about 15 minutes. (NOTE: this will overwrite the submissions/submission.csv file.)
python predict.py
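
This README does not say how the predictions of the 5 models are combined into one submission. A common choice is a simple average, and the sketch below assumes that; it is not necessarily what predict.py does. The function names are hypothetical. The column names `MachineIdentifier` and `HasDetections` follow the competition's submission format.

```python
import csv


def average_predictions(per_model_preds):
    """Average predictions row-wise; each inner list holds one model's scores."""
    n_models = len(per_model_preds)
    return [sum(row) / n_models for row in zip(*per_model_preds)]


def write_submission(path, machine_ids, scores):
    """Write one (MachineIdentifier, HasDetections) row per test machine."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["MachineIdentifier", "HasDetections"])
        writer.writerows(zip(machine_ids, scores))
```

In practice each inner list would come from one trained model's predictions on the processed test data, with `machine_ids` taken from the same rows in the same order.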