This cohort is an expedited ramp-up on the skills needed for Data Science. The intent is to help someone break into the field. The material is covered in crash-course fashion and is by no means comprehensive; think of it as a "best of".
Data Science has been evolving over the last decade and can be defined differently even within one company. Some define it as AI and ML, others as Data Engineering with Predictive Modeling, and still others as a grab bag of Technical Project Management and Data Wrangling.
The content in this class is meant to give bootcamp-style foundational knowledge and hands-on application to begin a journey into the field of Data Science.
The goal everyone should have in this class is to move the needle at least 1 or 2 points on the self-evaluation below. Teaching and mentoring are proven ways to grow or reinforce knowledge. Study groups are highly encouraged, as is volunteering to share on a subject of your interest that is related or complementary to the course content. Please reach out if you are interested in sharing. If you feel you are at a 4 or 5, volunteered presentations are welcome as a way to fortify your own knowledge while sharing.
Self Evaluation
- 0 - Do not know
- 1 - Acquiring knowledge (studied or attempted in the last 30 days)
- 2 - Can and have applied ~3x with use of reference material (notes, Google, Stack Overflow) < 80% of the time
- 3 - Can and have applied with little use of reference material (< 20% of the time)
- 4 - Can teach or mentor
- 5 - Design, Optimize, Code Review, Improve
Class times: Tuesday 4-7 PST, Thursday 4-7 PST
The 4 units will be accented with work-style meetings.
- 12/21 Chipotle Obtain and Understand data update
- 1/16 Chipotle EDA with Python update & Final Project + Dataset proposal
- 2/1 Project EDA Brief, Linear Regression Lunch and Learn (demo and knowledge share)
- 2/22 Proof of concept with tech team and executive sponsors
- Update: updating your team on what you understand about the data.
- Proposal: making a use case and seeking feedback from your team before moving forward.
- Brief: a write-up with stats about progress on the project, blockers, challenges, and refined scope.
- Lunch and Learn: an informal setting to share what you know, hear something a new way, or actively learn as a teacher.
- Technical report: the final brief, with more detail on how it was built, where there are issues with either the data or the code, what is still needed, why the approach was taken, what other options were considered, and next steps. It includes stats about the data that was excluded and a profile of what was included.
- Executive Presentation: a document of 1 page or less, in common terms, that tells a narrative answering the business questions, allows for detailed business-acumen questions, and offers next steps to the relevant audience. It answers the question, "So what?"
Office Hours
Slack
- Get a head start
Course outline may adjust depending on time. There is a lot of content to cover in a short period of time.
SESSION | CLASS | SESSION | CLASS |
---|---|---|---|
01 | Orientation and Review home, slides | 02 | Development Environment home, slides |
03 | Jupyter Numpy Pandas home, slides | 04 | Lab Presentations Catchup slides |
05 | Intro to Exploratory Data Analysis (EDA) in Pandas | 06 | Statistics in Python |
07 | More EDA: Data Visualization in Python | 08 | Experiments and Hypothesis Testing |
09 | Presentations | 10 | KNN / Classification |
11 | Train-Test Split & Bias-Variance | 12 | Linear Regression |
13 | Logistic Regression | 14 | Presentations |
15 | Working with Data APIs | 16 | Intro to Natural Language Processing (NLP) |
17 | Intro to Time Series | 18 | Flex subject and Class time |
19 | Flex day, review, catchup, workshop | 20 | PRESENTATION |
Download Anaconda with Python 3.6 or 3.7, plus PyCharm or another code editor for the exercises.
Name | Description |
---|---|
Learn Python the Hard Way walkthrough | A great way to dig deep into basic syntax with a guide! |
Learn Python the Hard Way | Remember that walkthrough video? Try it without the video. It gets a bit more real after about exercise 15. |
Codecademy | Repeat, repeat, repeat: just another avenue to reinforce everything you're learning |
Automate the Boring Stuff | Review and then a lot more |
Python Language Reference | Good as reference |
Python Standard Library | Library reference |
Python Tutorial Point | Good Navigation + additional links |
W3 Schools | Good tutorial and reference |
LearningPython | Review of the above but then begins to progress into NumPy and Pandas |
Pandas
Name | Description |
---|---|
10 minutes to Pandas | Excellent starter into Pandas |
Pandas tutorial Data Frames in Python | Data Frames explained |
Pandas getting started | Fundamentals at a deeper level |
Data Munging in Python with Pandas | Pandas as the SQL of Python |
How to clean data with Pandas | See the bottom of the page for how to clean data with Pandas |
Cast object to specified Pandas datatype | Good code examples |
Pandas Top 10 | Useful and hard to find features |
Essential Basics | Build fluency and understanding |
Summarizing, Aggregating, Grouping in Pandas | Nice write-up on the subject |
Missing data | Good for troubleshooting |
Official Pandas Tutorials | Wes & Company's selection of tutorials and lectures |
Julia Evans Pandas Cookbook | Great resource with examples from weather, bikes and 311 calls |
Learn Pandas Tutorials | A great series of Pandas tutorials from Dave Rojas |
Research Computing Python Data PYNBs | A super awesome set of python notebooks from a meetup-based course exclusively devoted to pandas |
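Several of the topics these resources cover (casting columns to a datatype, handling missing data, and grouping and aggregating) can be sketched in a few lines. This is only a minimal illustration and is not taken from any of the linked tutorials. The `orders` table, its column names, and its values are invented for the example, and the choice to treat a missing quantity as 1 is an assumption made purely for demonstration.

```python
import pandas as pd

# Hypothetical order data; names and values are invented for illustration.
orders = pd.DataFrame(
    {
        "order_id": [1, 1, 2, 3],
        "item": ["Burrito", "Chips", "Bowl", "Burrito"],
        "quantity": ["1", "2", "1", None],  # stored as text, one value missing
        "price": ["$8.50", "$2.00", "$9.25", "$8.50"],
    }
)

# Cast text columns to numeric dtypes.
orders["price"] = orders["price"].str.replace("$", "", regex=False).astype(float)
orders["quantity"] = pd.to_numeric(orders["quantity"])

# Count missing values per column, then fill them.
# Assumption for this sketch only: a missing quantity means 1.
missing_counts = orders.isna().sum()
orders["quantity"] = orders["quantity"].fillna(1).astype(int)

# Group and aggregate: total spend per order.
orders["line_total"] = orders["quantity"] * orders["price"]
order_totals = orders.groupby("order_id")["line_total"].sum()

print(missing_counts)
print(order_totals)
```

In real data, whether to fill, drop, or flag missing values depends on what the gap means, so it is worth checking `missing_counts` before choosing.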
more resources
- Review each concept and each line of code in these files of python code:
- Introduction to Python does a great job explaining Python essentials and includes example code.
- If you like learning from a book, Python for Informatics has useful chapters on strings, lists, and dictionaries.
- If you prefer interactive exercises, try these lessons from Codecademy: "Python Lists and Dictionaries" and "A Day at the Supermarket".
- If you have more time, try missions 2 and 3 from DataQuest's Learning Python course.
Resources:
- For a useful look at the different types of data scientists, read Analyzing the Analyzers (32 pages).
- For some thoughts on what it's like to be a data scientist, read these short posts from Win-Vector and Datascope Analytics.
- Quora has a data science topic FAQ with lots of interesting Q&A.
MORE DATA
- Seattle Pronto Cycle Share data, Released Oct 2015:
- Open data catalogs from various governments and NGOs:
  - NYC Open Data
  - DC Open Data Catalog / OpenDataDC
  - DataLA
  - data.gov (see also: Project Open Data Dashboard)
  - data.gov.uk
  - US Census Bureau
  - World Bank Open Data
  - Humanitarian Data Exchange
  - Sunlight Foundation: government-focused data
  - ProPublica Data Store
- Datasets hosted by academic institutions:
  - UC Irvine Machine Learning Repository: datasets specifically designed for machine learning
  - Stanford Large Network Dataset Collection: graph data
  - Inter-university Consortium for Political and Social Research
  - Pittsburgh Science of Learning Center's DataShop
  - Academic Torrents: distributed network for sharing large research datasets
  - Dataverse Project: searchable archive of research data
- Datasets hosted by private companies:
  - Quandl: over 10 million financial, economic, and social datasets
  - Amazon Web Services Public Data Sets
  - Kaggle provides datasets with their challenges, but each competition has its own rules as to whether the data can be used outside of the scope of the competition.
- Big lists of datasets:
  - Awesome Public Datasets: well-organized and frequently updated
  - Rdatasets: collection of 700+ datasets originally distributed with R packages
  - RDataMining.com
  - KDnuggets
  - inside-R
  - 100+ Interesting Data Sets for Statistics
  - 20 Free Big Data Sources
  - Sebastian Raschka: datasets categorized by format and topic
- APIs:
  - Apigee: explore dozens of popular APIs
  - Mashape: explore hundreds of APIs
  - Python APIs: Python wrappers for many APIs
- Other interesting datasets:
  - FiveThirtyEight: data and code related to their articles
  - The Upshot: data related to their articles
  - Yelp Dataset Challenge: Yelp reviews, business attributes, users, and more from 10 cities
  - Donors Choose: data related to their projects
  - 200,000+ Jeopardy questions
  - CrowdFlower: interesting datasets created or enhanced by their contributors
  - UFO reports: geolocated and time-standardized UFO reports for close to a century
  - Reddit Top 2.5 Million: all-time top 1,000 posts from each of the top 2,500 subreddits
- Other resources:
  - Datasets subreddit: ask for help finding a specific data set, or post your own
  - Center for Data Innovation: blog posts about interesting, recently-released data sets.