Important links:
- Data Mining vs Machine Learning vs Artificial Intelligence vs Statistics
- What do data scientists get paid?
Name | Mike Izbicki (call me Mike) |
---|---|
Email | mizbicki@cmc.edu |
Office | Adams 216 |
Office Hours | MW 3:45-4:00 or by appointment (see my schedule) |
Webpage | https://izbicki.me |
Research | Machine Learning (see izbicki.me/research.html for some past projects) |
Fun Facts | grew up in San Clemente, CA (1 hr south of Claremont); spent 7 years in the Navy, working on nuclear submarines and at the NSA; left the Navy as a conscientious objector; PhD/postdoc at UC Riverside; taught in the DPRK |
General Information:
- This is the theory course for CMC's Data Science major
- Combines linear algebra, statistics, and computation
- Prepares you for industry or graduate school
Learning Objectives:
- Exposure to research-level data mining
- Understand the latest algorithms
    - But algorithms get outdated fast, and data mining practitioners must be able to read the math behind new ones
- Major algorithms
    - Eigen-methods for data mining
    - Logistic regression
- Major concepts
    - Bias/variance trade-off
    - Regularization
- Major theorems
    - The VC dimension theorem
    - The SGD convergence theorem
    - (maybe) The Johnson-Lindenstrauss lemma
    - (probably not) The Cramer-Rao bound and Fisher information
- Feature generation methods
    - Text (English, non-English)
    - Social media
    - Kernels
- Ethical implications of data mining
- Apply data mining libraries (PyTorch, scikit-learn, GenSim, spaCy, etc.)
    - Teaching you how to use these libraries is NOT the primary goal of the course
    - Approximately 1/3 of the homeworks are programming related, but these assignments are designed to help you understand the math (see the short sketch after this list for the flavor)
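For a taste of the programming component, here is a minimal sketch (an illustration of the kind of library usage involved, not an actual assignment) that uses scikit-learn to fit the logistic regression model we study in weeks 5-6:

```python
# Illustrative sketch only, not an actual assignment: fit logistic
# regression (weeks 5-6) with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# a small synthetic binary classification problem
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit the model and report held-out accuracy
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

The homeworks focus on what happens inside calls like `fit`, not on memorizing the library interface.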
Prerequisite knowledge:
- linear algebra
    - eigenvectors
- statistics
    - linear/logistic regression
    - (no specific class is listed as a prereq in the catalog because the colleges offer more than 20 stats classes)
- computation
    - big-O analysis
    - git
    - using Python libraries
    - generating plots
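To calibrate the computing prerequisites: you should be able to read and write a short script like the following without difficulty (purely illustrative):

```python
# Purely illustrative: the level of Python and plotting assumed.
import numpy as np
import matplotlib.pyplot as plt

# plot sin(x) over one period and save the figure
x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x), label="sin(x)")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.savefig("sin.png")
```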
Textbook:
All resources are freely available online
- Understanding Machine Learning: From Theory to Algorithms (freely available here)
- a selection of research papers (5-10)
Grades:
Category | Percent |
---|---|
Homework | 80 |
Project | 20 |
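For example, a homework average of 90 and a project score of 70 give a final grade of 0.8 × 90 + 0.2 × 70 = 86.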
This will be a hard class, but a low-stress class.
- The material is intrinsically hard
    - Very few people find linear algebra, statistics, and computing to ALL be easy subjects
    - There's a reason people who understand this material get paid $200k+ salaries at FAANG
- The course is low-stress because you have full control over what your grade will be:
    - You will grade all homeworks yourself
        - I will spot check your homeworks
        - If you want detailed feedback, ask and I will provide it
        - You should know when a proof/coding assignment is right/wrong
    - The project:
        - To get an A, you must somehow advance the state of human knowledge
        - You may work individually or in a small team
        - Options:
            - Write an analysis of 2-3 research papers
            - Perform an interesting experiment
        - Publish your writeup online
        - Your grade is determined by how many people read/share your writeup
        - This will be part of your "portfolio"
            - No one cares about your grades
Late Work Policy:
You lose 20% on the assignment for each week late.
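Concretely, a sketch of the arithmetic (assuming the deduction is 20 percentage points per full week late, and scores never go below zero):

```python
# Sketch of the late-work arithmetic; assumes the penalty is 20
# percentage points per full week late, floored at zero.
def late_score(raw_score: float, weeks_late: int) -> float:
    return max(0.0, raw_score - 20.0 * weeks_late)

print(late_score(90, 2))  # a 90% assignment, two weeks late -> 50.0
```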
Collaboration Policy:
There are no restrictions on collaboration in this class, and collaboration is highly encouraged.
WARNING: All material in this class is cumulative. If you work "too closely" with another student on an assignment, you won't understand how to complete subsequent assignments, and you will quickly fall behind. You should view collaboration as a way to improve your understanding, not as a way to do less work.
You are ultimately responsible for ensuring you learn the material!
Week | Date | Topic |
---|---|---|
1 | Mon, Aug 24 | Course intro |
1 | Wed, Aug 26 | Computational Linear Algebra |
2 | Mon, Aug 31 | Pagerank |
2 | Wed, Sep 2 | Pagerank |
3 | Mon, Sep 7 | Statistical Learning Theory |
3 | Wed, Sep 9 | Statistical Learning Theory |
4 | Mon, Sep 14 | Statistical Learning Theory |
4 | Wed, Sep 16 | Statistical Learning Theory |
5 | Mon, Sep 21 | Logistic Regression |
5 | Wed, Sep 23 | Logistic Regression |
6 | Mon, Sep 28 | Kernels / neural networks / k-nearest neighbor / decision trees |
6 | Wed, Sep 30 | Kernels / neural networks / k-nearest neighbor / decision trees |
7 | Mon, Oct 5 | Stochastic gradient descent |
7 | Wed, Oct 7 | Stochastic gradient descent |
8 | Mon, Oct 12 | Regularization |
8 | Wed, Oct 14 | Regularization |
9 | Mon, Oct 19 | Hashing trick / random projections |
9 | Wed, Oct 21 | Hashing trick / random projections |
10 | Mon, Oct 26 | Word2Vec |
10 | Wed, Oct 28 | Word2Vec |
11 | Mon, Nov 2 | Word2Vec: FastText |
11 | Wed, Nov 4 | Word2Vec: translation |
12 | Mon, Nov 9 | Word2Vec: bias |
12 | Wed, Nov 11 | Word2Vec: history |
13 | Mon, Nov 16 | Other Applications |
13 | Wed, Nov 18 | Other Applications |
14 | Mon, Nov 23 | Other Applications |
I've tried to design the course to be as accessible as possible for people with disabilities. If you need any further accommodations, please ask.
I want you to succeed and I'll make every effort to ensure that you can.