Programming projects for the Coursera Machine Learning course taught by Stanford Prof. Andrew Ng, completed in octave/matlab. I have also implemented some of these projects in Python using the scikit-learn library for my own research and development purposes.
- Implemented a Linear Regression model to make predictions in several scenarios:
- with one variable to predict profits for a food truck
- with multiple variables to predict housing prices in Portland, Oregon.
- Both were implemented in octave/matlab for the course work and in Python using sklearn.
- Files completed:
  - in ex1/: `warmUpExercise.m`, `plotData.m`, `gradientDescent.m`, `computeCost.m`
  - in ex1_sklearn/: `food_truck_profits.py`, `housing_prices.py`
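The sklearn versions replace the hand-written `gradientDescent.m`/`computeCost.m` loop with `LinearRegression`'s closed-form least-squares fit. A minimal sketch; the data values below are made up for illustration, not the course dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data standing in for the food-truck exercise:
# city population (in 10,000s) vs. profit (in $10,000s).
X = np.array([[6.1], [5.5], [8.5], [7.0], [5.8]])
y = np.array([17.6, 9.1, 13.7, 11.9, 6.8])

# LinearRegression solves ordinary least squares directly,
# so no learning rate or iteration count is needed.
model = LinearRegression().fit(X, y)
pred = model.predict(np.array([[7.0]]))  # predicted profit for a city of 70,000
```

The multi-variable housing case is identical except that `X` has one column per feature.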
- Implemented a Logistic Regression model in two scenarios:
- To model student acceptance rates at a university (linear model/decision boundary)
- To model whether microchips from a fabrication plant pass quality assurance (non-linear model/decision boundary; completed only in octave/matlab, not in sklearn)
- Files completed:
  - in ex2/: `plotData.m`, `sigmoid.m`, `costFunction.m`, `predict.m`, `costFunctionReg.m`
  - in ex2_sklearn/: `university_admissions.py`
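In sklearn the same model is a `LogisticRegression` fit; it minimizes the same sigmoid/cross-entropy cost as `costFunction.m`, with L2 regularization (strength `1/C`) playing the role of lambda in `costFunctionReg.m`. A sketch with made-up exam-score data, not the course dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up admissions-style data: two exam scores per applicant.
X = np.array([[34.6, 78.0], [30.3, 43.9], [35.8, 72.9],
              [60.2, 86.3], [79.0, 75.3], [75.0, 90.6]])
y = np.array([0, 0, 0, 1, 1, 1])  # 1 = admitted

clf = LogisticRegression(C=1.0).fit(X, y)
proba = clf.predict_proba(np.array([[45.0, 85.0]]))[0, 1]  # P(admitted)
```

For the non-linear microchip boundary one would first expand the inputs (e.g. with `sklearn.preprocessing.PolynomialFeatures`), mirroring the polynomial feature mapping used in the octave version.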
- Implementations: one-vs-all logistic regression and a neural network to classify handwritten digit examples.
- One-vs-all logistic regression implemented in octave/matlab and in Python with sklearn.
- Neural network feedforward propagation was performed on a model that already had the network parameters trained for us.
- Training set: 5000 examples of handwritten digits, which is a subset of the MNIST handwritten digit dataset (http://yann.lecun.com/exdb/mnist/).
- Files completed:
  - in ex3/: `lrCostFunction.m`, `oneVsAll.m`, `predictOneVsAll.m`, `predict.m`
  - in ex3_sklearn/: `lrDigitRecog.py`
  - TO DO: `nnDigitRecog.py` (figure out which Python framework to use for neural network classification; probably the Pylearn2 library)
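The one-vs-all strategy of `oneVsAll.m` (one regularized logistic classifier per digit class) can be made explicit in sklearn with `OneVsRestClassifier`. This sketch uses sklearn's bundled 8x8 digits dataset as a stand-in for the course's 20x20 MNIST subset:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# sklearn's bundled 8x8 digit images stand in for the course data.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# One binary logistic regression per digit class, as in oneVsAll.m;
# prediction picks the class whose classifier scores highest.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```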
- Dataset and learned hidden parameters (`Theta1` and `Theta2`) are the same as in Project 3.
- In this project the following were implemented:
- regularized cost function
- backpropagation algorithm to compute gradient of nn cost function
- A number of additional features were implemented as well, including randomizing the initial weights and checking the gradient
- Files completed (octave/matlab only):
  - in ex4/: `sigmoidGradient.m`, `randInitializeWeights.m`, `nnCostFunction.m` (bulk of the work here)
- TO DO: extend this assignment in Python by performing neural network training and prediction using Pylearn2 or similar.
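The two helper pieces translate directly to numpy. A sketch of the sigmoid gradient and the symmetry-breaking random initialization; the epsilon of 0.12 and the 400/25 layer sizes follow the course setup:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_gradient(z):
    # g'(z) = g(z) * (1 - g(z)), as computed in sigmoidGradient.m.
    g = sigmoid(z)
    return g * (1.0 - g)

def rand_initialize_weights(l_in, l_out, epsilon=0.12):
    # Uniform weights in [-epsilon, epsilon] break symmetry between hidden
    # units; the extra column holds the bias term (randInitializeWeights.m).
    return np.random.uniform(-epsilon, epsilon, size=(l_out, l_in + 1))

Theta1 = rand_initialize_weights(400, 25)  # input layer -> hidden layer
```

Gradient checking then compares the backpropagated gradient against finite differences of the cost at perturbed weights.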
- Regularized linear regression to predict the amount of water flowing into/out of a dam using the change in water level in a reservoir.
- Diagnostics for debugging learning algorithms and the effects of bias vs. variance.
- Dataset division into training set, cross validation set, and test set is introduced and implemented.
- Learning curves are generated
- Files completed:
  - in ex5/: `linearRegCostFunction.m`, `learningCurve.m`, `polyFeatures.m`, `validationCurve.m`
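The same diagnostic is available in sklearn: `Ridge` serves as the regularized linear regression (its `alpha` plays the role of lambda in `linearRegCostFunction.m`) and `learning_curve` computes the train/CV curves that `learningCurve.m` builds by hand. A sketch with synthetic data standing in for the water-level measurements:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the water-level data: change in level -> outflow.
rng = np.random.default_rng(0)
X = rng.uniform(-40, 40, size=(60, 1))
y = 0.5 * X.ravel() + rng.normal(scale=2.0, size=60)

# Score the model at growing training-set sizes with 5-fold CV;
# plotting train vs. validation scores against `sizes` gives the curve.
sizes, train_scores, val_scores = learning_curve(
    Ridge(alpha=1.0), X, y, train_sizes=np.linspace(0.2, 1.0, 5), cv=5)
```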
- Trained SVMs on data sets with both linear and non-linear decision boundaries
- Also used cross validation to choose the regularization parameter for a Gaussian (RBF) kernel on a sample data set
- Trained an SVM on a subset of the SpamAssassin Public Corpus (http://spamassassin.apache.org/publiccorpus/) to create a high-accuracy spam filter
- Uses a bag of words model to perform feature extraction, and SVM for classification.
- Files completed:
  - in ex6/: `gaussianKernel.m`, `dataset3Params.m`, `processEmail.m`, `emailFeatures.m`
  - in ex6_sklearn/: `svmDecBoundary.py`
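The cross-validation step of `dataset3Params.m` (try each C/sigma pair, keep the best) maps to a sklearn grid search over `C` and `gamma` for the RBF kernel. A sketch on made-up two-ring data with a non-linear boundary, not the course dataset:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Made-up two-ring data: inner ring is class 0, outer ring is class 1.
rng = np.random.default_rng(1)
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
radius = np.where(np.arange(200) < 100, 1.0, 3.0)
radius = radius + rng.normal(scale=0.2, size=200)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
y = (np.arange(200) >= 100).astype(int)

# Evaluate each (C, gamma) pair by 5-fold CV accuracy and keep the best,
# mirroring the parameter sweep in dataset3Params.m.
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1.0, 10.0], "gamma": [0.1, 1.0, 10.0]},
                    cv=5)
grid.fit(X, y)
```

Note sklearn parameterizes the kernel by `gamma = 1 / (2 * sigma**2)` rather than sigma directly.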
- Used the K-means algorithm to compress a 24-bit color representation of an image to a 16-color representation.
- Each pixel was clustered into one of 16 groups, and the RGB color replaced with the closest of these 16 colors.
- Compression factor: ~ 6x.
- Used PCA to reduce the dimension of facial images from 1024 (32x32 pixels) to 100. This is useful primarily for training time-consuming algorithms (like neural networks) on a reduced dataset that may work just as well.
- Files completed:
  - in ex7/: `pca.m`, `projectData.m`, `recoverData.m`, `computeCentroids.m`, `findClosestCentroids.m`, `kMeansInitCentroids.m`
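The compression amounts to clustering pixel colors and storing a 16-entry palette plus a 4-bit index per pixel, hence roughly 24/4 ≈ 6x. The loop of `findClosestCentroids.m` and `computeCentroids.m` is what `sklearn.cluster.KMeans` runs internally; a sketch with random pixel data standing in for the exercise image:

```python
import numpy as np
from sklearn.cluster import KMeans

# Random RGB pixels stand in for the exercise's image.
rng = np.random.default_rng(2)
image = rng.integers(0, 256, size=(64, 64, 3)).astype(float)
flat = image.reshape(-1, 3)  # one row per pixel

# Cluster pixels into 16 colors, then replace each pixel's RGB
# with its assigned centroid color.
kmeans = KMeans(n_clusters=16, n_init=4, random_state=0).fit(flat)
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)
```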
- Anomaly detection to detect failing servers on a network
- Estimated normal operation from throughput and latency data by fitting a Gaussian distribution
- Threshold selected from F1 score on CV set.
- Collaborative filtering to build recommender system for movies
- Used a subset of the MovieLens 100k Dataset
- Computed the cost function for the collaborative filtering algorithm
- Files completed:
  - in ex8/: `estimateGaussian.m`, `selectThreshold.m`, `cofiCostFunc.m`
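The Gaussian estimation step translates to a few lines of numpy; a sketch with synthetic throughput/latency-style data. The threshold epsilon would then be chosen by F1 score on a labeled CV set, as in `selectThreshold.m`:

```python
import numpy as np

def estimate_gaussian(X):
    # Per-feature mean and variance, as in estimateGaussian.m.
    return X.mean(axis=0), X.var(axis=0)

def gaussian_density(X, mu, sigma2):
    # Product of independent univariate Gaussian densities per example.
    coef = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    expo = np.exp(-((X - mu) ** 2) / (2.0 * sigma2))
    return np.prod(coef * expo, axis=1)

# Synthetic stand-in for the server throughput/latency measurements.
rng = np.random.default_rng(3)
X = rng.normal(loc=[5.0, 10.0], scale=[1.0, 2.0], size=(300, 2))
mu, sigma2 = estimate_gaussian(X)
p = gaussian_density(X, mu, sigma2)  # examples with low p are flagged as anomalies
```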