In the remaining weeks, we will do some machine learning exercises in Python. The main package we will use is scikit-learn, the de facto standard machine learning framework for Python. Because there are already so many excellent scikit-learn tutorials, I decided to build on a few of them rather than reinvent the wheel. I will emphasize the important points as we gain hands-on experience.
We will start with the scikit-learn basic tutorial.
Scikit-learn is famous for its excellent documentation, which is also a great resource for machine learning in general.
This simple tutorial walks us through the organization of scikit-learn and introduces the generic methods fit() and predict() implemented for various classifiers.
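As a minimal sketch of what "generic" means here, the snippet below trains two unrelated classifiers through the same fit() and predict() calls. The iris dataset and the two classifiers are my own picks for illustration and are not necessarily the ones used in the tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data: the iris dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)

# Two very different models that expose the same interface.
classifiers = [
    KNeighborsClassifier(n_neighbors=5),
    LogisticRegression(max_iter=1000),
]

predictions = {}
for clf in classifiers:
    clf.fit(X, y)                      # learn from features X and labels y
    name = type(clf).__name__
    predictions[name] = clf.predict(X[:3])  # predict labels for three samples
    print(name, predictions[name])
```

Swapping in another classifier only changes the line that constructs it; the fit() and predict() calls stay the same.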
We will then move on to a realistic example written up by an experienced data scientist. This time we will walk through the process of obtaining, cleaning up, and normalizing less-than-perfect data, and then comparing various classifiers using stratified cross-validation.
A copy of the churn dataset used in the write-up is included in this repo for your convenience.
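The comparison step can be sketched roughly as follows. This is only an outline under stated assumptions: it uses a synthetic, imbalanced dataset from make_classification as a stand-in for the churn data, and the three classifiers are my own choices, not necessarily those in the write-up.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the churn data: roughly 15% positive class.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=0
)

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical selection of classifiers to compare.
models = {
    "SVM": SVC(),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "k-NN": KNeighborsClassifier(),
}

mean_accuracy = {}
for name, model in models.items():
    # Putting the scaler in the pipeline fits it on the training folds only,
    # so no information from the held-out fold leaks into normalization.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=cv)  # default scoring: accuracy
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With imbalanced classes, accuracy alone can be misleading, which is one reason to look at other evaluation measures as well.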
During these two weeks, I will also go over some basic concepts, such as:
- Cross validation
- ROC and area under ROC
- Confusion matrix
- Some other concepts regarding evaluation of classifiers
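To give a rough idea of how these evaluation concepts look in code, here is a small sketch. The synthetic dataset and the logistic regression model are assumptions made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Illustrative binary classification data.
X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)

# ROC analysis needs a continuous score, not hard labels;
# here we use the predicted probability of the positive class.
scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print(f"Area under ROC: {auc:.3f}")
```

The arrays fpr and tpr hold one point per threshold, so plotting tpr against fpr gives the ROC curve.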
Finally, in the third week, you will train your own classifiers using scikit-learn and the p53 Mutants Data Set, and I will be walking around to answer your questions.