Hello!
Below you can find an outline of how to reproduce my 4th place solution for the Microsoft Malware Prediction Challenge competition.
Directories and Files:
data/ - Stores the raw competition data, along with a subfolder for the processed data
data/clean/ - The processed data used for the final submission
logs/ - The training logs, and feature importances of the models
models/ - The 5 LightGBM models that were used to produce the submission
submissions/ - The final submission.csv file
predict.py - Makes model predictions using the data in data/clean/ and the models in models/; stores the predictions in submissions/
prepare_data.py - Processes the raw data and saves it in data/clean/
SETTINGS.json - Paths to all directories and file locations referenced in the code
train.py - Trains the models using the data in data/clean/; saves the training logs and feature importances in logs/ and the models in models/
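As an illustration of how the scripts might read SETTINGS.json, here is a minimal sketch. It assumes the file is a flat JSON object that maps names to path strings; the function name load_settings and the key CLEAN_DATA_DIR are hypothetical and may not match the actual file.

```python
import json
from pathlib import Path


def load_settings(path="SETTINGS.json"):
    """Read a flat JSON object of path strings and return them as Path objects.

    Assumption: every value in the file is a string holding a path.
    """
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    return {key: Path(value) for key, value in raw.items()}


# Hypothetical usage (the key name is an assumption, not taken from the repo):
# settings = load_settings()
# clean_dir = settings["CLEAN_DATA_DIR"]
```

Keeping every path in one file means the directory layout can change without editing the three scripts.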
Hardware and Software:
- AWS c4.8xlarge (36 vCPUs, 60 GB memory)
- Ubuntu 16.04 LTS (100 GB boot disk)
- Python 3.7.1
The following commands download the raw train and test files from the competition. They assume the Kaggle API is installed and configured with your credentials.
cd data
kaggle competitions download microsoft-malware-prediction -f test.csv
kaggle competitions download microsoft-malware-prediction -f train.csv
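Before moving on, it can help to confirm that both files actually landed in data/. The following is a small sketch, not part of the original solution; the function name missing_files is hypothetical. Depending on the Kaggle API version, the downloads may arrive as .zip archives, in which case they would need to be extracted first.

```python
from pathlib import Path


def missing_files(data_dir="data", names=("train.csv", "test.csv")):
    """Return the expected file names that are not present in data_dir."""
    folder = Path(data_dir)
    return [name for name in names if not (folder / name).is_file()]


# Example usage, run from the repository root:
# missing = missing_files()
# if missing:
#     print("Missing raw files:", ", ".join(missing))
```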
The following command processes the raw competition data stored in the data/ directory and saves the processed data in the data/clean/ directory. (NOTE: Running this will overwrite the files data/clean/test-clean.pkl and data/clean/train-clean.pkl.)
python prepare_data.py
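Since this step overwrites existing processed files, you may want to copy them somewhere safe before running it. Below is a minimal sketch of one way to do that; the function name backup_files and the backups/ folder are hypothetical and not part of the original solution.

```python
import shutil
import time
from pathlib import Path


def backup_files(paths, backup_root="backups"):
    """Copy each existing file in `paths` into a new timestamped folder.

    Returns the backup folder, or None if none of the files exist yet.
    """
    existing = [Path(p) for p in paths if Path(p).is_file()]
    if not existing:
        return None
    target = Path(backup_root) / time.strftime("%Y%m%d-%H%M%S")
    target.mkdir(parents=True, exist_ok=True)
    for src in existing:
        # copy2 also preserves file timestamps
        shutil.copy2(src, target / src.name)
    return target


# Example usage, run from the repository root before prepare_data.py:
# backup_files(Path("data/clean").glob("*.pkl"))
```

The same helper could be pointed at logs/ and models/ before retraining, which overwrites those folders as well.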
The following command retrains the models using the data in the data/clean/train-clean.pkl file. Training takes over 9 hours. (NOTE: This will overwrite all files in logs/ and models/.)
python train.py
The following command uses the data/clean/test-clean.pkl data to produce a new submission file. This takes about 15 minutes. (NOTE: This will overwrite the submissions/submission.csv file.)
python predict.py
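To run all three steps in one go, a small wrapper along these lines could be used. This is only a sketch: the names STEPS, build_commands, and run_pipeline are hypothetical, and it assumes the scripts take no command-line arguments, as in the commands above.

```python
import subprocess
import sys

# The three scripts, in the order they are run in this guide.
STEPS = ("prepare_data.py", "train.py", "predict.py")


def build_commands(steps=STEPS, python=sys.executable):
    """Return one argument list per script, using the given interpreter."""
    return [[python, step] for step in steps]


def run_pipeline(steps=STEPS):
    """Run each script in order, stopping at the first failure."""
    for cmd in build_commands(steps):
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit code
        subprocess.run(cmd, check=True)


# Example usage, run from the repository root:
# run_pipeline()
```

Stopping at the first failure avoids spending hours on training when data preparation did not finish cleanly.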