6/14/2016 to 8/18/2016
Instructor: Hamed Hasheminia
Tuesdays | Thursdays |
---|---|
6/14: Data Science - Introduction Part I | 6/16: Data Science - Introduction Part II |
6/21: Linear Regression Lines Part I | 6/23: Linear Regression Lines Part II |
6/28: Model Selection | 6/30: Missing Data and Imputation |
7/5: K-Nearest Neighbors | 7/7: Logistic Regression Part I |
7/12: Logistic Regression Part II | 7/14: In Class Project |
7/19: Tree-Based Models Part I | 7/21: Tree-Based Models Part II |
7/26: Natural Language Processing | 7/28: Time Series Models |
8/2: Principal Component Analysis | 8/4: Data Visualization |
8/9: Naive Bayes | 8/11: Course Review |
8/16: Final Project Presentations I | 8/18: Final Project Presentations II |
## Lecture 1 Summary (Data Science - Introduction Part I)
- Data Science - meaning
- Continuous, Discrete and Qualitative Data
- Supervised vs Unsupervised Learning
- Classification vs Regression
- Time series vs cross-sectional data
- Numpy
- Pandas
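The NumPy and Pandas topics above can be sketched in a few lines. This is a toy illustration (the DataFrame below is made up, not from the class notebooks), showing continuous, discrete, and qualitative columns side by side:

```python
import numpy as np
import pandas as pd

# A tiny DataFrame with the three data types from Lecture 1:
# continuous (height), discrete (siblings), qualitative (city)
df = pd.DataFrame({
    "height": [162.0, 175.5, 180.2, 168.9],   # continuous
    "siblings": [0, 2, 1, 3],                 # discrete
    "city": ["SF", "NY", "SF", "LA"],         # qualitative
})

# Pandas columns are backed by NumPy arrays
heights = df["height"].to_numpy()
print(heights.mean())

# Pandas selection and grouping
print(df[df["city"] == "SF"]["height"].mean())
print(df.groupby("city")["siblings"].sum())
```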
Resources
- Lecture 1 - Introduction - Slides
- Intro Numpy - Code
- Intro Numpy - Code - Solutions
- Intro Pandas - Code
- InClass Practice Code - Pandas
- InClass Practice Code - Solutions
Set up GitHub - Self-study guide
- Lecture 0 - GitHub - Slides
- Excellent videos for setting up GitHub. Students who have not used GitHub before must watch these videos.
- A hands-on introduction to Git and GitHub, and how to make them work together! More Git resources for beginners here
Pre-work for second lecture
- Review all lecture notes, including the Lecture Slides, NumPy notebook, and Pandas notebook
- Finish the self-study GitHub guidelines listed above
- Finish the In-Class Practice Code
- Review the final project requirements. You can find the final project timeline on slide 11 of the Lecture 1 PowerPoint Slides
Additional Resources
- Official Pandas Tutorials. Wes & Company's selection of tutorials and lectures
- Julia Evans Pandas Cookbook. Great resource with examples from weather, bikes and 311 calls
- Learn Pandas Tutorials. A great series of Pandas tutorials from Dave Rojas
- Research Computing Python Data PYNBs. A super awesome set of python notebooks from a meetup-based course exclusively devoted to pandas
## Lecture 2 Summary (Data Science - Introduction Part II)
- Measures of central tendency (Mean, Median, Mode, Quartiles, Percentiles)
- Measures of Variability (IQR, Standard Deviation, Variance)
- Skewness Coefficient
- Boxplots, Histograms, Scatterplots
- Central Limit Theorem
- Class/Dummy Variables
- Walkthrough describing and visualizing data in Pandas
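The descriptive statistics above map directly onto Pandas methods. A minimal sketch on a made-up sample (the values are chosen only so the results are easy to check by hand):

```python
import pandas as pd

# Hypothetical sample to illustrate the Lecture 2 summary statistics
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

mean = s.mean()                        # 5.0
median = s.median()                    # 4.5
mode = s.mode()[0]                     # 4
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1                          # interquartile range: 1.5
std = s.std(ddof=0)                    # population standard deviation: 2.0
skew = s.skew()                        # skewness coefficient

# s.describe() bundles most of these; s.hist() and s.plot.box()
# draw the histograms and boxplots covered in class.
```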
Resources
- Lecture 2 - Slides
- Basic Statistics - Part 2 - Lab Codes
- Basic Statistics - Part 2 - Practice Code
- Basic Statistics - Part 2 - Practice Code - Solutions
HW 1 is Assigned
- Please read and follow the instructions in the readme
- This homework is due on June 23rd, 2016 at 6:30PM
Additional Resources
- Here you can find valuable resources for matplotlib
- A good video on the Central Limit Theorem
## Lecture 3 Summary (Linear Regression Lines Part I)
- Linear Regression lines
- Single Variable and Multi-Variable Regression Lines
- Capturing non-linearity using Linear Regression lines
- Interpreting regression coefficients
- Dealing with dummy variables in regression lines
- Intro to the sklearn and seaborn libraries
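A short sklearn sketch of the ideas above: fitting a multi-variable regression line where one predictor is a dummy-coded category. The data is fabricated so the recovered coefficients are exact:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: y depends linearly on x plus a category shift
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "group": ["a", "a", "a", "b", "b", "b"],
})
df["y"] = 2.0 * df["x"] + 5.0 * (df["group"] == "b")

# get_dummies creates the 0/1 indicator column for the category
X = pd.get_dummies(df[["x", "group"]], columns=["group"], drop_first=True)
model = LinearRegression().fit(X, df["y"])

print(model.coef_)       # slope on x, then the shift for group "b"
print(model.intercept_)
```

Interpreting the fit: the first coefficient is the change in y per unit of x holding group fixed; the dummy coefficient is the vertical shift for group "b" relative to the baseline group "a".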
Resources
- Lecture 3 - Slides
- Linear Regression - Part I - Lab Codes
- Linear Regression - Part I - Practice Code
- Linear Regression - Part I - Practice Solutions
Additional Resources
- My videos on regression lines. Video 1, Video 2
- This is an excellent book. In Lecture 3 and Lecture 4, we are going to cover Chapter 3 of this textbook.
- Seaborn
- Weighted Least Square Method (WLS)
- Good resource for heteroskedasticity
- Here Contours are elegantly introduced.
## Lecture 4 Summary (Linear Regression Lines Part II)
- Hypothesis tests - tests of significance on regression coefficients
- p-values
- Capturing non-linearity using Linear Regression lines
- R-squared
- Interaction Effects
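To make R-squared and the coefficient significance test concrete, here is a manual OLS computation on simulated data (statsmodels' `ols()` reports all of this automatically; the numbers below are illustrative, not from the lecture):

```python
import numpy as np

# Simulated data: y = 3 + 2x + noise
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])           # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # [intercept, slope]

# R-squared: share of variance in y explained by the fit
resid = y - X @ beta
ss_res = resid @ resid
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

# t-statistic for the slope: estimate divided by its standard error;
# a value far above ~2 means the slope is statistically significant
sigma2 = ss_res / (n - 2)
se_slope = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_slope = beta[1] / se_slope
```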
Resources
- Lecture 4 - Slides
- Linear Regression - Part II - Lab Codes
- Linear Regression - Part II - Practice Code
- Linear Regression - Part II - Practice Solution
Additional Resources
- My videos on regression lines. Video 1, Video 2
- This is an excellent book. In Lecture 3 and Lecture 4, we covered Chapter 3 of this textbook.
- statsmodels.formula.api
HW 2 is Assigned
- Please read and follow the instructions in the readme
- Here you can find the iPython notebook of your 2nd assignment.
- This homework is due on June 30th, 2016 at 6:30PM
## Lecture 5 Summary (Model Selection)
- Bias-Variance Trade-off
- Validation (Test vs Train set)
- Cross-Validation
- Ridge and Lasso Regression
- (Optional) Backward Selection, Forward Selection, All Subset Selection (if you want to use these methods, you need to use R)
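A sketch of the cross-validation and shrinkage workflow above, on simulated data with a few useless features (the dataset and alpha values are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Simulated data: only the first two of ten features matter
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = X[:, 0] * 3.0 + X[:, 1] * -2.0 + rng.normal(scale=0.5, size=100)

# 5-fold cross-validation estimates out-of-sample R-squared
ridge_scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5)
print(ridge_scores.mean())

# Lasso shrinks irrelevant coefficients all the way to zero
lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)
```

In practice you would pick alpha itself by cross-validation (e.g. `RidgeCV` / `LassoCV`) rather than fixing it as above.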
Resources
- Lecture 5 - Slides
- Model Selection - Lab Codes
- Model Selection - Practice Code
- Model Selection - Practice Solutions
- HW 1 - Key
Additional Resources
- Preprocessing Library
- Cross-Validation Library
- This is an excellent book. You can find the theory of Cross-Validation in Chapter 5. You can also learn about Lasso and Ridge regression in Chapter 6 of the mentioned textbook.
- Here you can find my video on Cross-Validation
- Here you can find my video on Ridge and Lasso Regression
- Here you can find my video on Best subset selection.
## Lecture 6 Summary (Missing Data and Imputation)
- Types of missing data (MCAR, MAR, NMAR)
- Single imputation methods and their limitations
- Imputation using regression lines and error
- Hot deck imputation
- Multiple imputation
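Two of the simpler strategies above, sketched on a toy column with missing values (the data is made up; multiple imputation needs a dedicated library and is not shown):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, np.nan, 6.0]})

# Single (mean) imputation: easy, but shrinks the variance of the column
mean_imputed = df["x"].fillna(df["x"].mean())

# Hot deck imputation: fill each NaN with a randomly drawn observed value
rng = np.random.default_rng(0)
observed = df["x"].dropna().to_numpy()
hot_deck = df["x"].copy()
mask = hot_deck.isna()
hot_deck[mask] = rng.choice(observed, size=mask.sum())

print(mean_imputed.tolist())
```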
Resources
- Lecture 6 - Slides
- Missing Data and Imputation - Lab Codes
- Missing Data and Imputation - Practice Code and HW 3
- Missing Data and Imputation - Solution Code
Additional Resources
- Great Video by Dr. Elizabeth A. Stuart from Johns Hopkins University
Announcements
- HW 3 is assigned (Due at 6:30PM - July 7th)
- Please read this before starting your assignment.
- HW3 starter code can be found here
## Lecture 7 Summary (K-Nearest Neighbors)
- Classification Problems
- Misclassification Error
- KNN algorithm for Classification
- Cross-Validation for the KNN Algorithm
- Limitations of the KNN Algorithm
- KNN algorithm for Regression
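Choosing the number of neighbors by cross-validation, as discussed above, can be sketched as follows (iris is a stand-in dataset, not the one used in class):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate k with 5-fold cross-validation
scores = {}
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

# Pick the k with the best average out-of-sample accuracy
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

Small k gives a flexible, high-variance fit; large k gives a smoother, high-bias fit, which is exactly the bias-variance trade-off from Lecture 5.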
Resources
- Lecture 7 - Slides
- K-Nearest Neighbors - Lab Codes
- K-Nearest Neighbors - Practice Code
- K-Nearest Neighbors - Practice Solution
Announcements
- HW 2 Solutions are posted.
## Lecture 8 Summary (Logistic Regression Part I)
- Logistic Regression - Intro
- Odds vs Probability
- Using Logistic Regression to make predictions
- How to interpret the coefficients of a Logistic Regression model
- Strengths and weaknesses of the Logistic Regression model
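A minimal sketch of the above on hand-made data, including the odds-vs-probability interpretation of the coefficient (the data is illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: outcome flips from 0 to 1 as x grows
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# Predicted probability P(y = 1 | x = 3.0)
proba = clf.predict_proba([[3.0]])[0, 1]

# Coefficients are on the log-odds scale, so exp(coef) is the
# multiplicative change in the odds for a one-unit increase in x
odds_ratio = np.exp(clf.coef_[0, 0])
print(proba, odds_ratio)
```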
Resources
- Lecture 8 - Slides
- Logistic Regression Part I - Lab Codes
- Logistic Regression Part I - Practice Code
- Logistic Regression Part I - Practice Solutions
Additional Resources
- Logistic Regression video
HW 3 Solutions Posted
## Lecture 9 Summary (Logistic Regression Part II)
- Unbalanced observations and Logistic Regression
- FP/FN/TP/TN/FPR/TPR
- The effect of changing the Threshold
- ROC Curves
- Area Under the Curve
- How to compare classification algorithms
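The evaluation ideas above, sketched on hand-made scores (the labels and scores are fabricated so the counts are easy to verify):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.8, 0.4, 0.6, 0.7, 0.9])

# Threshold at 0.5 to get hard predictions, then count TN/FP/FN/TP
y_pred = (y_score >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Sweeping the threshold traces out the ROC curve (FPR vs TPR);
# the Area Under the Curve summarizes it in a single number
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(tn, fp, fn, tp, auc)
```

Because AUC is threshold-free, it is a common way to compare classification algorithms, especially on unbalanced data where raw accuracy is misleading.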
Resources
- Lecture 9 - Slides
- Logistic Regression Part II - Lab Codes
- Logistic Regression Part II - Practice Code
- Logistic Regression Part II - Practice - Solutions
## Lecture 10 Summary (In Class Project)
- Breast Cancer Project
- Breast Cancer - Group Notebook
- Energy Efficiency Project
- Energy Efficiency - Group Notebook
- Income Prediction Project
- Wine Quality Project
- Wine Quality - Group Notebook
## Lecture 11 Summary (Tree-Based Models Part I)
- Decision Trees for Regression
- Greedy Approach
- Decision Trees for Classification
- Gini Index and Entropy
- Limitations of Simple Decision Trees
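A single classification tree, greedily grown with the Gini criterion as described above (iris is a stand-in dataset; pass `criterion="entropy"` to split on entropy instead):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth limits the greedy growth; an unpruned tree (max_depth=None)
# memorizes the training data, which is the overfitting limitation
# of simple decision trees discussed in class
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
score = cross_val_score(tree, X, y, cv=5).mean()
print(score)
```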
Resources
Additional Resources
- Tree-Based Models - Video 1
- Tree-Based Models - Video 2
- If you are among the ones who hate dealing with dummy variables, enjoy working with this dummify function
## Lecture 12 Summary (Tree-Based Models Part II)
- Bagging
- Random Forest
- Boosting
- Tuning parameters for Boosting and Random Forest
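A random forest sketch showing the two tuning parameters mentioned above (breast cancer is a stand-in dataset; the parameter values are illustrative defaults, not recommendations):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# n_estimators: how many bootstrapped trees to average (bagging);
# max_features: how many features each split may consider, which
# decorrelates the trees and is what makes the forest "random"
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
score = cross_val_score(rf, X, y, cv=5).mean()
print(score)
```

Averaging many decorrelated trees typically outperforms a single pruned tree; boosting (e.g. `GradientBoostingClassifier`) instead grows trees sequentially, each correcting the previous ones' errors.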
Resources
Additional Resources
Announcement
- HW 4 is assigned and is due on July 28th 2016 at 6:30PM.
- Please read the ReadMe file before working on your project.
## Lecture 13 Summary (Natural Language Processing)
- Definition of Natural Language Processing
- NLP applications
- Basic NLP practice
- Stop words, bag-of-words, TF-IDF
Resources
- Lecture 13 - Slides
- Natural Language Processing - Lab Codes
- Natural Language Processing - Practice Code
- Natural Language Processing - Practice Solutions
Additional Resources
- If you want to learn a lot more NLP, check out the excellent video lectures and slides from this Coursera course (which is no longer being offered).
- Natural Language Processing with Python is the most popular book for going in-depth with the Natural Language Toolkit (NLTK).
- A Smattering of NLP in Python provides a nice overview of NLTK, as does this notebook from DAT5.
- spaCy is a newer Python library for text processing that is focused on performance (unlike NLTK).
- If you want to get serious about NLP, Stanford CoreNLP is a suite of tools (written in Java) that is highly regarded.
- When working with a large text corpus in scikit-learn, HashingVectorizer is a useful alternative to CountVectorizer.
- Automatically Categorizing Yelp Businesses discusses how Yelp uses NLP and scikit-learn to solve the problem of uncategorized businesses.
- Modern Methods for Sentiment Analysis shows how "word vectors" can be used for more accurate sentiment analysis.
- Identifying Humorous Cartoon Captions is a readable paper about identifying funny captions submitted to the New Yorker Caption Contest.
## Lecture 14 Summary (Principal Component Analysis)
- Principal Component Analysis
- Computation of PCAs
- Geometry of PCAs
- Proportion of Variance Explained
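The proportion-of-variance-explained idea above, sketched on simulated 2-D data where the two features are strongly correlated (the data is fabricated for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Two correlated features: the second is the first plus small noise
rng = np.random.default_rng(0)
x = rng.normal(size=300)
X = np.column_stack([x, x + rng.normal(scale=0.3, size=300)])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)   # data projected onto the components

# Proportion of variance explained by each principal component;
# the first entry is close to 1 because the data is nearly 1-D
print(pca.explained_variance_ratio_)
```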
Resources
- Lecture 14 - Slides
- Principal Component Analysis - Lab Codes
- Principal Component Analysis - Practice Code
- Principal Component Analysis - Practice Solutions
Additional Resources
- This tutorial on Principal Components Analysis (PCA) includes good refreshers on covariance and linear algebra
- To go deeper on Singular Value Decomposition, read Kirk Baker's excellent tutorial.
- Chapter 10 of Statistical Learning with applications in R
## Lecture 15 Summary (Time Series Models)
- AutoRegressive (AR) Models
- Moving Averages (MA)
- ARMA
- ARIMA
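To make the AR idea concrete: simulate an AR(1) process and recover its coefficient by regressing each value on the previous one. This is a NumPy-only sketch of the fitting idea; in practice statsmodels' ARIMA handles AR, MA, and differencing for you:

```python
import numpy as np

# Simulate y_t = phi * y_{t-1} + noise
rng = np.random.default_rng(0)
phi = 0.7
n = 2000
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + rng.normal()

# Least-squares slope of y_t against y_{t-1} recovers phi
phi_hat = (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])
print(phi_hat)   # close to 0.7

# A 3-period moving average smooths the same series
ma = np.convolve(y, np.ones(3) / 3, mode="valid")
```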
Resources
Additional Resources
- This is a good resource for AR models
- A relatively easy-to-read book on time series.