CSE 158: Recommender Systems and Web Mining, taken at UC San Diego during Fall Quarter 2023.
Note: DSC 148 Introduction to Data Mining is considered an equivalent course for the Data Science major requirement. It is taught by Prof. Jingbo Shang during Winter Quarter each year.
To uphold academic integrity, please do not submit any of these solutions as your own. MOSS will detect it, and you will face serious academic integrity consequences.
Like previous offerings, the course covers basic Machine Learning concepts and various approaches to Recommender Systems. Topics include:
- Regression (Least-Squares Regression, ML basics)
- Classification (Naïve Bayes Classifier, Logistic Regression, Support Vector Machines, Model Evaluation)
- Recommender Systems (Jaccard/Cosine/Pearson Similarity Functions, Collaborative Filtering, Latent Factor Models, One-Class Recommendation, Bayesian Personalized Ranking, Evaluation Metrics - Precision/Recall, AUC, Mean Reciprocal Rank, Cumulative Gain and NDCG, Feature-Based Recommendation)
- Text Mining (Sentiment Analysis, Bags-of-Words, TF-IDF, Stopwords, Stemming, Low-Dimensional Representations of Text)
- Content and Structure in Recommender Systems (Factorization Machines, Group and Socially Aware Recommendation, Online Advertising)
- Modeling Temporal and Sequence Data (Sliding Windows and Autoregression, Temporal Dynamics in Recommender Systems)
- Visual Recommendation (Complementary Item Recommendation, Fashion and Outfit Recommendation, Fit Prediction)
- Ethics and Fairness (Filter Bubbles and Recommendation Diversity, Calibration, Serendipity, and Other "Beyond Accuracy" Measures, Algorithm Fairness)
- Basic Logistic Regression and Classification Tasks on the GoodReads Fantasy Reviews and Beer Reviews datasets (a minimal sketch of this kind of pipeline appears below).
- Score: 8.0/8.0
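As a rough illustration (not the actual homework code), here is a minimal logistic-regression sketch in the spirit of Homework 1, using made-up review data and a single review-length feature; the real features, labels, and field names differ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the real dataset: (review text, overall rating) pairs.
# The text and ratings here are placeholders, not the actual homework data.
reviews = [
    ("crisp and hoppy, would buy again", 4.5),
    ("flat and watery", 2.0),
    ("rich malt flavor with a smooth finish", 4.0),
    ("skunky smell, could not finish it", 1.5),
] * 25  # repeat so the classifier has something to fit

# One simple feature: review length (plus a constant offset term)
X = np.array([[1.0, len(text)] for text, _ in reviews])
y = np.array([rating >= 4.0 for _, rating in reviews])  # binary label

model = LogisticRegression(C=1.0, class_weight="balanced")
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```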
- Implemented Logistic Regression with One-Hot Encoding, Precision@k and BER (Balanced Error Rate), and Similarity-based Rating Predictions on the Beer Reviews and Amazon Music Instruments datasets (see the metric sketch below).
- Score: 8.0/8.0
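For reference, a small sketch of how BER and Precision@k can be computed from labels and classifier scores; the data and threshold here are illustrative, not the homework's:

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """BER = 0.5 * (false positive rate + false negative rate)."""
    y_true, y_pred = np.asarray(y_true, bool), np.asarray(y_pred, bool)
    fpr = np.sum(~y_true & y_pred) / max(np.sum(~y_true), 1)
    fnr = np.sum(y_true & ~y_pred) / max(np.sum(y_true), 1)
    return 0.5 * (fpr + fnr)

def precision_at_k(y_true, scores, k):
    """Fraction of the k highest-scored items whose label is positive."""
    order = np.argsort(scores)[::-1][:k]  # indices of the top-k scores
    return np.mean(np.asarray(y_true)[order])

# Tiny illustrative example
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1]
y_pred = [s > 0.5 for s in scores]
print(balanced_error_rate(y_true, y_pred))  # 0.5 * (2/3 FPR + 1/3 FNR) = 0.5
print(precision_at_k(y_true, scores, k=3))  # 2 of the top 3 are positive -> 0.667
```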
- Implemented Similarity-based Recommendation and trained a regressor to predict game playtime on the Steam dataset (a head start on Assignment 1); a Jaccard-based sketch follows below.
- Score: 8.0/8.0
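A minimal sketch of Jaccard-based item-to-item similarity of the kind used for the similarity-based recommendation, over a toy interaction set rather than the actual Steam data:

```python
from collections import defaultdict

def jaccard(a, b):
    """Jaccard similarity between two sets: |A ∩ B| / |A ∪ B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

# Toy (user, game) interactions; the real data comes from the Steam dataset
interactions = [("u1", "g1"), ("u1", "g2"), ("u2", "g1"),
                ("u2", "g3"), ("u3", "g2"), ("u3", "g3")]

users_per_item = defaultdict(set)
for u, g in interactions:
    users_per_item[g].add(u)

def most_similar_items(game, n=5):
    """Rank other games by Jaccard similarity of their user sets."""
    sims = [(jaccard(users_per_item[game], users_per_item[other]), other)
            for other in users_per_item if other != game]
    return sorted(sims, reverse=True)[:n]

print(most_similar_items("g1"))
```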
- Implemented a basic Bag-of-Words model to find the most common words, then used that word set as features for a logistic regression to predict the genre category on the Steam Category data.
- Implemented TF-IDF and used the scores of the 1,000 most common words to train a logistic regression classifier to predict the genre category (a sketch of this pipeline appears after this list).
- Implemented an item2vec model on the GoodReads Young Adult Reviews dataset to find similar books based on Cosine similarity scores.
- Score: 8.0/8.0
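The sketch below illustrates the TF-IDF pipeline from the second bullet above, with scikit-learn's TfidfVectorizer standing in for the hand-rolled TF-IDF and placeholder documents in place of the Steam Category data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder documents and genre labels; the homework uses Steam review text.
docs = [
    "fast paced shooter with great multiplayer maps",
    "relaxing farming sim with seasons and crafting",
    "turn based strategy with deep tech trees",
    "co-op shooter campaign with boss fights",
    "cozy crafting and farming village life",
    "grand strategy with diplomacy and tech research",
] * 10
labels = ["shooter", "sim", "strategy", "shooter", "sim", "strategy"] * 10

# max_features keeps only the most frequent words (1000 in the homework)
vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(docs)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["open world shooter with crafting"])))
```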
- A continuation of Homework 3: optimizing and fine-tuning model parameters on the Steam dataset.
- To predict whether a user would play a game, given a userID and gameID, I implemented a Bayesian Personalized Ranking (BPR) model, fine-tuned the Adam optimizer's learning rate and the model's regularization constant, and ensembled it with a popularity-based recommendation method (see the sketch after the leaderboard table).
- To predict a user's playtime on a given game, I implemented a Latent Factor Model with bias terms only, fine-tuned the regularization constant, and applied early stopping (when validation MSE starts to increase) to avoid overfitting.
- I eventually achieved reasonably good performance with both models, particularly on the play prediction task. My performance on the course Leaderboard is as follows:
| Task | Private Leaderboard Rank | Public Leaderboard Rank |
|---|---|---|
| Play Prediction | 11/603 (Top 2% of class) | 20/603 |
| Time Played Prediction | 44/603 (Top 8% of class) | 68/603 |
Note: If graduate students (both Master's and PhD) in CSE 258/MGTA 461 are included, my ranks are 34/1209 (Top 3%) for Play Prediction task and 154/1209 for Time Played Prediction task.
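For a rough sense of the two Assignment 1 models, here is a compact Bayesian Personalized Ranking sketch trained with Adam. It is written in PyTorch as an assumption (the assignment may have used a different framework), with toy interactions and untuned hyperparameters:

```python
import torch
import torch.nn as nn

class BPR(nn.Module):
    """Bayesian Personalized Ranking with user/item latent factors."""
    def __init__(self, n_users, n_items, dim=8, reg=1e-5):
        super().__init__()
        self.user = nn.Embedding(n_users, dim)
        self.item = nn.Embedding(n_items, dim)
        self.reg = reg

    def score(self, u, i):
        return (self.user(u) * self.item(i)).sum(dim=-1)

    def loss(self, u, i_pos, i_neg):
        # Maximize the score gap between played (positive) and sampled negative games
        x = self.score(u, i_pos) - self.score(u, i_neg)
        reg = self.reg * (self.user(u).norm(2) ** 2 +
                          self.item(i_pos).norm(2) ** 2 +
                          self.item(i_neg).norm(2) ** 2)
        return -torch.nn.functional.logsigmoid(x).sum() + reg

# Toy interaction data: (user index, played game index)
interactions = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 0), (2, 2)]
n_users, n_items = 3, 4

model = BPR(n_users, n_items)
opt = torch.optim.Adam(model.parameters(), lr=0.01)  # learning rate was tuned in the assignment

for epoch in range(100):
    u = torch.tensor([u for u, _ in interactions])
    i_pos = torch.tensor([i for _, i in interactions])
    i_neg = torch.randint(0, n_items, i_pos.shape)  # crude negative sampling
    opt.zero_grad()
    loss = model.loss(u, i_pos, i_neg)
    loss.backward()
    opt.step()
```

And a bias-only latent factor model of the kind used for time-played prediction, fit by alternating coordinate updates on made-up data; the actual assignment tunes the regularization constant and stops early based on validation MSE:

```python
from collections import defaultdict

# Toy (user, game, transformed-hours) triples; the assignment uses Steam playtime
# and a train/validation split for early stopping.
ratings = [("u1", "g1", 3.0), ("u1", "g2", 1.0), ("u2", "g1", 4.0),
           ("u2", "g3", 2.0), ("u3", "g2", 0.5), ("u3", "g3", 2.5)]
lam = 1.0  # regularization constant (tuned on validation MSE in the assignment)

alpha = sum(r for _, _, r in ratings) / len(ratings)
beta_u, beta_i = defaultdict(float), defaultdict(float)
by_user, by_item = defaultdict(list), defaultdict(list)
for u, i, r in ratings:
    by_user[u].append((i, r))
    by_item[i].append((u, r))

for _ in range(50):  # the assignment stops early when validation MSE starts increasing
    alpha = sum(r - beta_u[u] - beta_i[i] for u, i, r in ratings) / len(ratings)
    for u, items in by_user.items():
        beta_u[u] = sum(r - alpha - beta_i[i] for i, r in items) / (lam + len(items))
    for i, users in by_item.items():
        beta_i[i] = sum(r - alpha - beta_u[u] for u, r in users) / (lam + len(users))

predict = lambda u, i: alpha + beta_u[u] + beta_i[i]  # bias-only prediction
print(predict("u1", "g3"))
```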
- This is an open-ended project: a sentiment analysis task on the Google Local dataset using a bag-of-words model (a minimal sketch appears at the end of this section).
- Our project repository is linked here.
Note: there are major errors in our report that are inconsistent with our model training statistics. We'll fix these errors at some point.
My project partners are Nate del Rosario, Chuong Nguyen, and Trevan Nguyen. The solution for this assignment is the collective effort of all of us.
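As a minimal illustration of the project's setup (the real pipeline lives in the linked repository), here is a bag-of-words sentiment sketch on placeholder reviews rather than the Google Local data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder reviews with a binary sentiment label derived from star ratings;
# the actual project uses Google Local reviews and the code in the linked repo.
reviews = [("great food and friendly staff", 5), ("cold fries and rude service", 1),
           ("amazing coffee, will return", 5), ("dirty tables and long wait", 2)] * 20
texts = [t for t, _ in reviews]
labels = [rating >= 4 for _, rating in reviews]  # positive vs. negative sentiment

bow = CountVectorizer()  # bag-of-words counts
X = bow.fit_transform(texts)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)
print(clf.predict(bow.transform(["friendly staff but long wait"])))
```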
Special thanks to Prof. Julian McAuley for his dedication to teaching the course and answering questions on Piazza. I also appreciate the TAs' efforts in holding office hours and answering Piazza posts.