- handle missing values and encode categorical variables (using one-hot encoding)
- check for correlated variables
- analyse distributions of attributes
- visualise the data using two or three principal components
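
The exploration steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy `df`, its column names, the median imputation and the `"missing"` label are all assumptions made for the example, and pandas plus scikit-learn are assumed to be installed.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the real dataset.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 38.0, 29.0],
    "income": [30.0, 45.0, 52.0, np.nan, 61.0, 39.0],
    "city": ["A", "B", "A", None, "C", "B"],
})

# Missing values: median for numeric columns, an explicit label for categorical ones
# (one possible strategy among many).
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.columns.difference(num_cols)
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
df[cat_cols] = df[cat_cols].fillna("missing")

# One-hot encode the categorical columns.
encoded = pd.get_dummies(df, columns=list(cat_cols), dtype=float)

# Correlated variables: pairwise Pearson correlation of the numeric attributes.
corr = encoded[num_cols].corr()

# Distributions: summary statistics per numeric attribute.
summary = encoded[num_cols].describe()

# Project onto two principal components (after scaling) for a 2D scatter plot.
scaled = StandardScaler().fit_transform(encoded)
components = PCA(n_components=2).fit_transform(scaled)

print(corr)
print(summary)
print(components[:3])
```

The two columns of `components` can be passed straight to a scatter plot, coloured by the target class if one is available.
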
For each model:
- standardize the dataset
- use the same cross-validation scheme
- inspect feature importance
- evaluate the model with log loss, ROC AUC and F1 score
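
One way to apply the per-model checklist above is sketched below, assuming scikit-learn. The synthetic data from `make_classification` and the two example models are placeholders, not the models or dataset used in this project. Standardization sits inside a pipeline so that it is fitted on the training folds only, and a single `StratifiedKFold` object is shared so every model sees the same splits.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (placeholder for the real dataset).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# The same cross-validation scheme is reused for every model.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["neg_log_loss", "roc_auc", "f1"]

# Example models only; swap in whichever models are being compared.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # Scaling is part of the pipeline, so it is refitted on each training fold.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_validate(pipeline, X, y, cv=cv, scoring=scoring)
    results[name] = {
        "log_loss": -scores["test_neg_log_loss"].mean(),  # scorer is negated
        "roc_auc": scores["test_roc_auc"].mean(),
        "f1": scores["test_f1"].mean(),
    }

# Feature importance, here from a tree ensemble fitted on all the data.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_

for name, metrics in results.items():
    print(name, metrics)
print(importances)
```

`feature_importances_` is specific to tree-based estimators; for linear models the coefficients of the standardized features, or `sklearn.inspection.permutation_importance`, are possible alternatives.
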
- dataset
- paper
- Python Machine Learning by Sebastian Raschka
- interesting talks:
  - Machine Learning 101 by Kyle Kastner (+ GitHub repo)
  - Classification using Pandas and Scikit-Learn by Skipper Seabold (+ GitHub repo)
  - Machine Learning with Scikit-Learn by Jake VanderPlas (+ GitHub repo)
  - Neural Nets for Newbies by Melanie Warrick (+ GitHub repo)
- dimensionality reduction
- cross-validation
- supervised vs unsupervised learning
- regression vs classification
- confidence intervals and p-values