Course materials for General Assembly's Data Science course in Washington, DC (3/18/15 - 6/3/15).
Instructors: Kevin Markham and Brandon Burroughs
Monday | Wednesday |
---|---|
 | 3/18: Introduction and Python |
3/23: Git and Command Line | 3/25: Exploratory Data Analysis |
3/30: Visualization and APIs | 4/1: Machine Learning and KNN |
4/6: Bias-Variance and Model Evaluation | 4/8: Kaggle Titanic |
4/13: Web Scraping, Tidy Data, Reproducibility | 4/15: Linear Regression |
4/20: Logistic Regression and Confusion Matrices | 4/22: ROC and Cross-Validation |
4/27: Project Presentation #1 | 4/29: Naive Bayes |
5/4: Natural Language Processing | 5/6: Kaggle Stack Overflow |
5/11: Decision Trees | 5/13: Ensembles |
5/18: Clustering and Regularization | 5/20: Advanced scikit-learn and Regex |
5/25: No Class | 5/27: Databases and SQL |
6/1: Course Review | 6/3: Project Presentation #2 |
- 3/30: Deadline for discussing your project idea(s) with an instructor
- 4/6: Project question and dataset (write-up)
- 4/27: Project presentation #1 (slides, code, visualizations)
- 5/18: First draft due (draft of project paper, code, visualizations)
- 5/25: Peer review due
- 6/3: Project presentation #2 (project paper, slides, code, visualizations, data, data dictionary)
- Course project requirements
- Public data sources
- Kaggle competitions
- Examples of student projects
- Peer review guidelines
- Office hours will take place every Saturday and Sunday.
- Homework will be assigned every Wednesday and due on Monday, and you'll receive feedback by Wednesday.
- Our primary tool for out-of-class communication will be a private chat room through Slack.
- Homework submission form (also for project submissions)
- Gist is an easy way to put your homework online
- Feedback submission form (at the end of every class)
- Install the Anaconda distribution of Python 2.7.x.
- Install Git and create a GitHub account.
- Once you receive an email invitation from Slack, join our "DAT5 team" and add your photo.
- Choose a Python workshop to attend, depending upon your current skill level:
- Beginner: Saturday 3/7 10am-2pm or Thursday 3/12 6:30pm-9pm
- Intermediate: Saturday 3/14 10am-2pm
- Practice your Python using the resources below.
- Codecademy's Python course: Good beginner material, including tons of in-browser exercises.
- DataQuest: Similar interface to Codecademy, but focused on teaching Python in the context of data science.
- Google's Python Class: Slightly more advanced, including hours of useful lecture videos and downloadable exercises (with solutions).
- A Crash Course in Python for Scientists: Read through the Overview section for a quick introduction to Python.
- Python for Informatics: A very beginner-oriented book, with associated slides and videos.
- Code from our beginner and intermediate workshops: Useful for review and reference.
- Introduction to General Assembly
- Course overview (slides)
- Brief tour of Slack
- Checking the setup of your laptop
- Python lesson with airline safety data (code)
Homework:
- Python exercises with Chipotle order data (listed at bottom of code file) (solution)
- Work through GA's excellent introductory command line tutorial and then take this brief quiz.
- Read through the course project requirements and start thinking about your own project!
Optional:
- If we discovered any setup issues with your laptop, please resolve them before Monday.
- If you're not feeling comfortable in Python, keep practicing using the resources above!
Homework:
- Command line exercises with SMS Spam Data (listed at the bottom of Introduction to the Command Line) (solution)
- Note: This homework is not due until Monday. You might want to create a GitHub repo for your homework instead of using Gist!
Optional:
- Browse through some example student projects to stimulate your thinking and give you a sense of project scope.
Resources:
- This Command Line Primer goes a bit more into command line scripting.
- Read the first two chapters of Pro Git to gain a much deeper understanding of version control and basic Git commands.
- Watch Introduction to Git and GitHub (36 minutes) for a quick review of a lot of today's material.
- GitRef is an excellent reference guide for Git commands, and Git quick reference for beginners is a shorter guide with commands grouped by workflow.
- The Markdown Cheatsheet covers standard Markdown and a bit of "GitHub Flavored Markdown."
- Pandas for data exploration, analysis, and visualization (code)
- Split-Apply-Combine pattern
- Simple examples of joins in Pandas
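A minimal sketch of the split-apply-combine pattern and a join in pandas (the toy DataFrames below are invented for illustration):

```python
import pandas as pd

# toy data: orders placed by users (invented for illustration)
orders = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2, 3],
    'amount':  [10.0, 20.0, 5.0, 5.0, 15.0, 8.0],
})
users = pd.DataFrame({'user_id': [1, 2, 3], 'city': ['DC', 'NYC', 'DC']})

# split-apply-combine: split by user, apply an aggregation, combine results
totals = orders.groupby('user_id')['amount'].sum().reset_index(name='total')

# join the aggregated totals back onto the user table
merged = pd.merge(users, totals, on='user_id')
print(merged)
```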
Homework:
- Pandas practice with Automobile MPG Data (listed at the bottom of Exploratory Analysis in Pandas) (solution)
- Talk to an instructor about your project
- Don't forget about the Command line exercises (listed at the bottom of Introduction to the Command Line)
Optional:
- To learn more Pandas, review this three-part tutorial, or review these two excellent (but extremely long) notebooks on Pandas: introduction and data wrangling.
- Read How Software in Half of NYC Cabs Generates $5.2 Million a Year in Extra Tips for an excellent example of exploratory data analysis.
Homework:
- Visualization practice with Automobile MPG Data (listed at the bottom of the visualization code) (solution)
- Note: This homework isn't due until Monday.
Optional:
- Watch Look at Your Data (18 minutes) for an excellent example of why visualization is useful for understanding your data.
Resources:
- For more on Pandas plotting, read this notebook or the visualization page from the official Pandas documentation.
- To learn how to customize your plots further, browse through this notebook on matplotlib or this similar notebook.
- To explore different types of visualizations and when to use them, Choosing a Good Chart and The Graphic Continuum are handy one-page references, or check out the R Graph Catalog.
- For a more in-depth introduction to visualization, browse through these PowerPoint slides from Columbia's Data Mining class.
- Mashape and Apigee allow you to explore tons of different APIs. Alternatively, a Python API wrapper is available for many popular APIs.
- Iris dataset
- What does an iris look like?
- Data hosted by the UCI Machine Learning Repository
- "Human learning" exercise (solution)
- Introduction to data science (slides)
- Quora: What is data science?
- Data science Venn diagram
- Quora: What is the workflow of a data scientist?
- Example student project: MetroMetric
- Machine learning and KNN (slides)
- Introduction to scikit-learn (code)
- Documentation: user guide, module reference, class documentation
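The scikit-learn workflow covered above (load data, instantiate a model, fit, predict) can be sketched on the iris data like this; the new measurement passed to `predict` is just an illustrative example:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# load the iris data: X is the feature matrix, y the response vector
iris = load_iris()
X, y = iris.data, iris.target

# the standard scikit-learn pattern: instantiate, fit, predict
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# predict the species of a new (made-up) flower measurement
pred = knn.predict([[3, 5, 4, 2]])
print(iris.target_names[pred])
```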
Homework:
- Complete your visualization homework assigned in class 4
- Reading assignment on the bias-variance tradeoff
- A write-up about your project question and dataset is due on Monday! (example one, example two)
Optional:
- For a useful look at the different types of data scientists, read Analyzing the Analyzers (32 pages).
- For some thoughts on what it's like to be a data scientist, read these short posts from Win-Vector and Datascope Analytics.
- For a fun (yet enlightening) look at the data science workflow, read What I do when I get a new data set as told through tweets.
- For a more in-depth introduction to data science, browse through these PowerPoint slides from Columbia's Data Mining class.
- For a more in-depth introduction to machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, An Introduction to Statistical Learning. (It's a free PDF download!)
- For a really nice comparison of supervised versus unsupervised learning, plus an introduction to reinforcement learning, watch this video (13 minutes) from Caltech's Learning From Data course.
Resources:
- Quora has a data science topic FAQ with lots of interesting Q&A.
- Keep up with local data-related events through the Data Community DC event calendar or weekly newsletter.
- Brief introduction to the IPython Notebook
- Exploring the bias-variance tradeoff (notebook)
- Discussion of the assigned reading on the bias-variance tradeoff
- Model evaluation procedures (notebook)
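As a quick sketch of the train/test split procedure from the model evaluation notebook (written against the current scikit-learn API, which may differ slightly from the class notebook):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# hold out 25% of the data to estimate out-of-sample accuracy
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)                      # train on the training set only
acc = accuracy_score(y_test, knn.predict(X_test))  # evaluate on unseen data
print(acc)
```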
Resources:
- If you would like to learn the IPython Notebook, the official Notebook tutorials are useful.
- To get started with Seaborn for visualization, the official website has a series of tutorials and an example gallery.
- Hastie and Tibshirani have an excellent video (12 minutes, starting at 2:34) that covers training error versus testing error, the bias-variance tradeoff, and train/test split (which they call the "validation set approach").
- Caltech's Learning From Data course includes a fantastic video (15 minutes) that may help you to visualize bias and variance.
- Guest instructor: Josiah Davis
- Participate in Kaggle's Titanic competition
- Work in pairs, but the goal is for every person to make at least one submission by the end of the class period!
Homework:
- Option 1 is to do the Glass identification homework. This is a good option if you are still getting comfortable with what we have learned so far, and prefer a very structured assignment. (solution)
- Option 2 is to keep working on the Titanic competition, and see if you can make some additional progress! This is a good assignment if you are feeling comfortable with the material and want to learn a bit more on your own.
- In either case, please submit your code as usual, and include lots of code comments!
- Web scraping (slides and code)
- Tidy data:
- Introduction
- Example datasets: Bob Ross, NFL ticket prices, airline safety, Jets ticket prices, Chipotle orders
- Reproducibility:
Resources:
- This web scraping tutorial from Stanford provides an example of getting a list of items.
- If you want to learn more about tidy data, Hadley Wickham's paper has a lot of nice examples.
- If your co-workers tend to create spreadsheets that are unreadable by computers, perhaps they would benefit from reading this list of tips for releasing data in spreadsheets. (There are some additional suggestions in this answer from Cross Validated.)
- Here's Colbert on reproducibility (8 minutes).
- Linear regression (notebook)
- Simple linear regression
- Estimating and interpreting model coefficients
- Confidence intervals
- Hypothesis testing and p-values
- R-squared
- Multiple linear regression
- Feature selection
- Model evaluation metrics for regression
- Handling categorical predictors
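A minimal linear regression sketch showing coefficient estimation and an evaluation metric (the data is generated for illustration, with a true slope of 2 and intercept of 1 plus noise):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# simulated data: y is roughly 2*x + 1 plus noise (invented for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2 * X.ravel() + 1 + rng.normal(0, 0.5, size=50)

model = LinearRegression()
model.fit(X, y)

print(model.intercept_, model.coef_)  # estimated intercept and slope

# RMSE: root mean squared error, in the same units as the response
rmse = np.sqrt(mean_squared_error(y, model.predict(X)))
print(rmse)
```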
Homework:
- If you're behind on homework, use this time to catch up.
- Keep working on your project... your first presentation is in less than two weeks!!
Resources:
- To go much more in-depth on linear regression, read Chapter 3 of An Introduction to Statistical Learning, from which this lesson was adapted. Alternatively, watch the related videos or read my quick reference guide to the key points in that chapter.
- To learn more about Statsmodels and how to interpret the output, DataRobot has some decent posts on simple linear regression and multiple linear regression.
- This introduction to linear regression is much more detailed and mathematically thorough, and includes lots of good advice.
- This is a relatively quick post on the assumptions of linear regression.
Homework:
- Video assignment on ROC Curves and Area Under the Curve
- Review the notebook from class 6 on model evaluation procedures
Resources:
- For more on logistic regression, watch the first three videos (30 minutes total) from Chapter 4 of An Introduction to Statistical Learning.
- UCLA's IDRE has a handy table to help you remember the relationship between probability, odds, and log-odds.
- Better Explained has a very friendly introduction (with lots of examples) to the intuition behind "e".
- Here are some useful lecture notes on interpreting logistic regression coefficients.
- Kevin wrote a simple guide to confusion matrix terminology that you can use as a reference guide.
- ROC curves and Area Under the Curve
- Discuss the video assignment
- Exercise: drawing an ROC curve
- Calculating AUC and plotting an ROC curve (notebook)
- Cross-validation (notebook)
- Discuss this article on Smart Autofill for Google Sheets
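The AUC and cross-validation pieces above can be sketched together like this (a sketch using the breast cancer dataset bundled with scikit-learn, not the class notebook):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

logreg = LogisticRegression(max_iter=5000)
logreg.fit(X_train, y_train)

# AUC is computed from predicted probabilities, not predicted classes
probs = logreg.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, probs)
print(auc)

# 5-fold cross-validated AUC gives a more reliable estimate
cv_auc = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc').mean()
print(cv_auc)
```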
Homework:
- Your first project presentation is on Monday! Please submit a link to your project repository (with slides, code, and visualizations) before class using the homework submission form.
Optional:
- Titanic exercise (notebook)
Resources:
- scikit-learn has extensive documentation on model evaluation.
- For more on cross-validation, read section 5.1 of An Introduction to Statistical Learning (11 pages) or watch the related videos: K-fold and leave-one-out cross-validation (14 minutes), cross-validation the right and wrong ways (10 minutes).
- Project presentations!
Homework:
- Read these Introduction to Probability slides (from the OpenIntro Statistics textbook) and try the included quizzes. Pay specific attention to the following terms: probability, sample space, mutually exclusive, independent.
- Reading assignment on spam filtering.
- Conditional probability and Bayes' theorem
- Slides (adapted from Visualizing Bayes' theorem)
- Visualization of conditional probability
- Applying Bayes' theorem to iris classification (notebook)
- Naive Bayes classification
- Slides
- Example with spam email (notebook)
- Discuss the reading assignment on spam filtering
- Airport security example
- Classifying SMS messages (code)
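The text classification pipeline above (vectorize, then fit a Naive Bayes model) can be sketched on a tiny invented corpus; the messages and labels here are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# tiny invented corpus (illustrative only): 1 = spam, 0 = ham
texts = ['win cash now', 'free prize win', 'meeting at noon',
         'lunch tomorrow', 'cash prize free', 'see you at lunch']
labels = [1, 1, 0, 0, 1, 0]

# convert the text to a document-term matrix of token counts
vect = CountVectorizer()
X = vect.fit_transform(texts)

nb = MultinomialNB()
nb.fit(X, labels)

# new messages must be transformed with the same fitted vocabulary
new = vect.transform(['free cash prize', 'lunch meeting'])
print(nb.predict(new))  # → [1 0]
```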
Homework:
- Please download/install the following for the NLP class on Monday:
- In Spyder, run `import nltk` and then `nltk.download('all')`. This downloads all of the necessary resources for the Natural Language Toolkit (NLTK).
- We'll be using two new packages/modules for this class: textblob and lda. Please install them. Hint: In the Terminal (Mac) or Git Bash (Windows), run `pip install textblob` and `pip install lda`.
Resources:
- For other intuitive introductions to Bayes' theorem, here are two good blog posts that use ducks and legos.
- For more on conditional probability, these slides may be useful.
- For more details on Naive Bayes classification, Wikipedia has two excellent articles (Naive Bayes classifier and Naive Bayes spam filtering), and Cross Validated has a good Q&A.
- If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004.
- If you're planning on using text features in your project, it's worth exploring the different types of Naive Bayes and the many options for CountVectorizer.
- Natural Language Processing (notebook)
- NLTK: tokenization, stemming, lemmatization, part of speech tagging, stopwords, Named Entity Recognition, LDA
- Alternative: TextBlob
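The first preprocessing steps (tokenization and stopword removal) can be sketched in plain Python before reaching for NLTK or TextBlob; the stopword list below is a tiny illustrative subset of the much longer one NLTK ships via `nltk.corpus.stopwords`:

```python
import re

text = "NLTK and TextBlob both build on the same basic preprocessing steps."

# tokenization: lowercase the text and extract word-like runs of characters
tokens = re.findall(r"[a-z']+", text.lower())

# stopword removal: drop very common words that carry little meaning
# (tiny illustrative list only)
stopwords = {'and', 'both', 'on', 'the', 'same'}
content = [t for t in tokens if t not in stopwords]
print(content)
```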
Resources:
- Natural Language Processing with Python: free online book to go in-depth with NLTK
- NLP online course: no sessions are available, but video lectures and slides are still accessible
- Brief slides on the major task areas of NLP
- Detailed slides on a lot of NLP terminology
- A visual survey of text visualization techniques: for exploration and inspiration
- DC Natural Language Processing: active Meetup group
- Stanford CoreNLP: suite of tools if you want to get serious about NLP
- Getting started with regex: Python introductory lesson and reference guide, real-time regex tester, in-depth tutorials
- A good explanation of LDA
- Textblob documentation
- SpaCy: a new NLP package
- Overview of how Kaggle works (slides)
- Kaggle In-Class competition: Predict whether a Stack Overflow question will be closed (code)
Optional:
- Keep working on this competition! You can make up to 5 submissions per day, and the competition doesn't close until 6:30pm ET on Wednesday, May 27 (class 20).
Resources:
- For a great overview of the diversity of problems tackled by Kaggle competitions, watch Kaggle Transforms Data Science Into Competitive Sport (28 minutes) by Jeremy Howard (past president of Kaggle).
- Getting in Shape for the Sport of Data Science (74 minutes), also by Jeremy Howard, contains a lot of tips for competitive machine learning.
- Learning from the best is an excellent blog post covering top tips from Kaggle Masters on how to do well on Kaggle.
- Feature Engineering Without Domain Expertise (17 minutes), a talk by Kaggle Master Nick Kridler, provides some simple advice about how to iterate quickly and where to spend your time during a Kaggle competition.
- Kevin's project presentation video (16 minutes) gives a nice tour of the end-to-end machine learning process for a Kaggle competition. (Or, just check out the slides.)
- Decision trees (notebook)
Resources:
- scikit-learn documentation: Decision Trees
Installing Graphviz (optional):
- Mac:
- Windows:
- Download and install MSI file
- Add it to your Path: Go to Control Panel, System, Advanced System Settings, Environment Variables. Under system variables, edit "Path" to include the path to the "bin" folder, such as: `C:\Program Files (x86)\Graphviz2.38\bin`
- Ensembles and random forests (notebook)
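A minimal random forest sketch on the iris data (not the class notebook), showing cross-validated accuracy and feature importances:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# a random forest is an ensemble of decision trees grown on
# bootstrapped samples, each split considering a random feature subset
rf = RandomForestClassifier(n_estimators=100, random_state=1)
cv_acc = cross_val_score(rf, X, y, cv=5).mean()
print(cv_acc)

rf.fit(X, y)
print(rf.feature_importances_)  # which features the forest relied on most
```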
Homework:
- Your project draft is due on Monday! Please submit a link to your project repository (with paper, code, and visualizations) before class using the homework submission form.
- Your peers and your instructors will be giving you feedback on your project draft.
- Here's an example of a great final project paper from a past student.
- Make at least one new submission to our Kaggle competition! We suggest trying Random Forests or building your own ensemble of models. For assistance, you could use this framework code, or refer to the complete code from class 15. You can optionally submit your code to us if you want feedback.
Resources:
- scikit-learn documentation: Ensembles
- Quora: How do random forests work in layman's terms?
Homework:
- You will be assigned to review the project drafts of two of your peers. You have until next Monday to provide them with feedback, according to these guidelines.
Resources:
- Introduction to Data Mining has a thorough chapter on cluster analysis.
- The scikit-learn user guide has a nice section on clustering.
- Wikipedia article on determining the number of clusters.
- This K-means clustering visualization allows you to set different numbers of clusters for the iris data, and this other visualization allows you to see the effects of different initial positions for the centroids.
- Fun examples of clustering: A Statistical Analysis of the Work of Bob Ross (with data and Python code), How a Math Genius Hacked OkCupid to Find True Love, and characteristics of your zip code.
- An Introduction to Statistical Learning has useful videos on K-means clustering (17 minutes), ridge regression (13 minutes), and lasso regression (15 minutes).
- Caltech's Learning From Data course has a great video introducing regularization (8 minutes) that builds upon their video about the bias-variance tradeoff.
- Here is a longer example of feature scaling in scikit-learn, with additional discussion of the types of scaling you can use.
- Clever Methods of Overfitting is a classic post by John Langford.
- Advanced scikit-learn (code)
- Searching for optimal parameters: GridSearchCV
- Standardization of features: StandardScaler
- Chaining steps: Pipeline
- Regular expressions ("regex")
Optional:
- Use regular expressions to create a list of causes from the homicide data. Your list should look like this: `['shooting', 'shooting', 'blunt force', ...]`. If the cause is not listed for a particular homicide, include it in the list as `'unknown'`.
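Not a solution to the exercise above, but a general sketch of the search-and-extract idiom it calls for (the sample records below are invented, not the actual homicide data):

```python
import re

# invented sample records in a similar spirit (not the real dataset)
records = [
    'victim age 34, Cause: shooting, found on scene',
    'victim age 52, Cause: blunt force, found at home',
    'victim age 41, found on scene',          # no cause listed
]

causes = []
for record in records:
    # capture the text between 'Cause: ' and the next comma, if present
    match = re.search(r'Cause: (.+?),', record)
    causes.append(match.group(1) if match else 'unknown')

print(causes)  # → ['shooting', 'blunt force', 'unknown']
```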
Resources:
- scikit-learn has an incredibly active mailing list that is often much more useful than Stack Overflow for researching a particular function.
- The scikit-learn documentation includes a machine learning map that may help you to choose the "best" model for your task.
- If you want to build upon the regex material presented in today's class, Google's Python Class includes an excellent lesson (with an associated video).
- regex101 is an online tool for testing your regular expressions in real time.
- If you want to go really deep with regular expressions, RexEgg includes endless articles and tutorials.
- Exploring Expressions of Emotions in GitHub Commit Messages is a fun example of how regular expressions can be used for data analysis.
Homework:
- Read this classic paper, which may help you to connect many of the topics we have studied throughout the course: A Few Useful Things to Know about Machine Learning.
- Your final project is due next Wednesday!
- Please submit a link to your project repository before Wednesday's class using the homework submission form.
- Your presentation should start with a recap of the key information from the previous presentation, but you should spend most of your presentation discussing what has happened since then.
- Don't forget to practice your presentation and time yourself!
Resources:
- SQLZOO, Mode Analytics, and Code School all have online SQL tutorials that look promising.
- w3schools has a sample database that allows you to practice your SQL.
- 10 Easy Steps to a Complete Understanding of SQL is a good article for those who have some SQL experience and want to understand it at a deeper level.
- A Comparison Of Relational Database Management Systems gives the pros and cons of SQLite, MySQL, and PostgreSQL.
- If you want to go deeper into databases and SQL, Stanford has a well-respected series of 14 mini-courses.
Resources:
- Data science review: A summary of key concepts from the Data Science course.
- Comparing supervised learning algorithms: Kevin's table comparing the machine learning models we studied in the course.
- Choosing a Machine Learning Classifier: Edwin Chen's short and highly readable guide.
- Machine Learning Done Wrong and Common Pitfalls in Machine Learning: Thoughtful advice on common mistakes to avoid in machine learning.
- Practical machine learning tricks from the KDD 2011 best industry paper: More advanced advice than the resources above.
- An Empirical Comparison of Supervised Learning Algorithms: Research paper from 2006.
- Many more resources for continued learning!
- Presentations!
Class is over! What should I do now?
- Take a break!
- Go back through class notes/code/videos to make sure you feel comfortable with what we've learned.
- Take a look at the Resources for each class to get a deeper understanding of what we've learned. Start with the Resources from Class 21 and move to topics you are most interested in.
- You might not realize it, but you are at a point where you can continue learning on your own. You have all of the skills necessary to read papers, blogs, documentation, etc.
- GA Data Guild
- 8/24/2015
- 9/21/2015
- 10/19/2015
- 11/9/2015
- Follow data scientists on Twitter. This will help you stay up on the latest news/models/applications/tools.
- Participate in Data Community DC events. They sponsor meetups, workshops, etc., notably the Data Science DC Meetup. Sign up for their newsletter also!
- Read blogs to keep learning. I really like District Data Labs.
- Do Kaggle competitions! This is a good way to keep practicing and hone your skill set. Plus, you'll learn a ton along the way.
And finally, don't forget about graduation!