4th Place Submission to Kaggle's Microsoft Malware Prediction Challenge

24mlight/Microsoft-Malware-Prediction

 
 

Hello!

Below you can find an outline of how to reproduce my 4th place solution for the Microsoft Malware Prediction Challenge.

CONTENTS

Directories and Files:
data/ - Stores the raw competition data, along with a subfolder for the processed data
data/clean/ - The processed data used for the final submission
logs/ - The training logs, and feature importances of the models
models/ - The 5 LightGBM models that were used to produce the submission
submissions/ - The final submission.csv file
predict.py - Makes model predictions; uses data from data/clean/ and the models in models/, stores predictions in submissions/
prepare_data.py - Processes the raw data and saves it in data/clean/
SETTINGS.json - Paths to all directories and file locations referenced in the code
train.py - Trains the models; uses data from data/clean/, saves training logs and feature importances in logs/, saves models in models/
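
As a rough illustration of how a script might read SETTINGS.json, the sketch below loads the file and turns each entry into a path. It assumes every value in the file is a path string, and the helper name load_settings is hypothetical; the actual keys and loading code are whatever the repository defines.

```python
import json
from pathlib import Path


def load_settings(path="SETTINGS.json"):
    """Read the JSON settings file and return its entries as Path objects."""
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    # Assumption: every value in the file is a path string.
    return {key: Path(value) for key, value in raw.items()}
```

A script could then call load_settings() once at startup and look up its input and output locations by key instead of hard-coding them.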

Hardware: (The following specs were used to create the original solution)

  1. AWS c4.8xlarge (36 vCPUs, 60 GB memory)
  2. Ubuntu 16.04 LTS (100 GB boot disk)

Software: (Python packages are detailed separately in requirements.txt)

Python 3.7.1

Data setup

The following code will download the raw train and test files from the competition. It assumes the Kaggle API is installed.

cd data
kaggle competitions download microsoft-malware-prediction -f test.csv
kaggle competitions download microsoft-malware-prediction -f train.csv

Process the data

The following code will process the raw competition data stored in the data/ directory, and save the processed data in the data/clean/ directory. (NOTE: Running this code will overwrite the files data/clean/test-clean.pkl and data/clean/train-clean.pkl.)
python prepare_data.py

Model training

The following code will retrain the models using the data in the data/clean/train-clean.pkl file. Training takes over 9 hours. (NOTE: This will overwrite all files in logs/ and models/.)
python train.py
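
Because retraining overwrites logs/ and models/, it may be worth copying those directories aside first. The sketch below is one way to do that; it is not part of the repository, and the helper name backup_dir and the backups/ location are hypothetical.

```python
import shutil
import time
from pathlib import Path


def backup_dir(src, backup_root="backups"):
    """Copy src into backup_root/<name>-<timestamp> and return the new path.

    Returns None when src does not exist, so a first run is a no-op.
    """
    src = Path(src)
    if not src.is_dir():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{src.name}-{stamp}"
    # copytree creates dest (and any missing parents) and fails if dest exists.
    shutil.copytree(src, dest)
    return dest
```

Calling backup_dir("logs") and backup_dir("models") before python train.py would keep the previous run's artifacts available for comparison.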

Model prediction

The following code will use the data/clean/test-clean.pkl data to produce a new submission file. This will take about 15 minutes. (NOTE: this will overwrite the submissions/submission.csv file.)
python predict.py
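
This README does not say how predict.py combines the 5 models. A common approach is a plain average of the per-model probabilities, and the sketch below shows that approach only as an assumption, not as the author's method. The function names are hypothetical, and the column names MachineIdentifier and HasDetections are assumed to match the competition's sample submission.

```python
import csv


def average_predictions(per_model_preds):
    """Average equal-length prediction lists element-wise."""
    n_models = len(per_model_preds)
    if n_models == 0:
        raise ValueError("need at least one model's predictions")
    n_rows = len(per_model_preds[0])
    if any(len(p) != n_rows for p in per_model_preds):
        raise ValueError("all models must predict the same number of rows")
    # zip(*...) groups the i-th prediction of every model together.
    return [sum(col) / n_models for col in zip(*per_model_preds)]


def write_submission(ids, preds, path):
    """Write one (id, prediction) row per machine under a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        # Assumed column names; check them against the sample submission.
        writer.writerow(["MachineIdentifier", "HasDetections"])
        writer.writerows(zip(ids, preds))
```

With real models, each inner list would hold one model's predicted probabilities for the rows of data/clean/test-clean.pkl, in the same row order.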
