Building on Kaggle's recommendation system tutorial, this is an overview of a Python recommendation system based on news sharing and user interactions.
Three types of recommender techniques were built and tested:
- Content-based filtering
- Collaborative filtering
- Hybrid approach
A baseline was created using a popularity recommender, but this has no personalization.
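A minimal sketch of what such a popularity baseline could look like (the interactions DataFrame and its column names are illustrative assumptions, not the tutorial's exact schema):

```python
import pandas as pd

# Assumed schema: one row per (user, item) interaction event.
interactions = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "item_id": ["a", "b", "a", "a", "c", "b"],
})

# Popularity model: rank items by total interaction count and
# recommend the same top-N list to every user.
popularity = (
    interactions.groupby("item_id")["user_id"]
    .count()
    .sort_values(ascending=False)
)

def recommend_popular(n=2):
    # Identical output for every user -- no personalization.
    return popularity.head(n).index.tolist()

print(recommend_popular())  # ['a', 'b']
```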
Common recommender evaluation criteria include (a minimal metric sketch follows this list):
- Top-N accuracy metrics (e.g., Recall@N)
- NDCG@N
- MAP@N
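As a reference for these metrics, here is a minimal NDCG@N implementation on toy data (this is a sketch of the standard definition, not the tutorial's evaluation code):

```python
import math

def dcg_at_n(relevances, n):
    # DCG@N: graded relevance discounted by log2 of the rank position.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:n]))

def ndcg_at_n(recommended, relevant, n):
    # Binary relevance: 1 if a recommended item is in the user's test set.
    gains = [1.0 if item in relevant else 0.0 for item in recommended]
    ideal = [1.0] * min(len(relevant), n)  # best possible ordering
    idcg = dcg_at_n(ideal, n)
    return dcg_at_n(gains, n) / idcg if idcg > 0 else 0.0

print(ndcg_at_n(["a", "b", "c"], relevant={"a", "c"}, n=3))  # ~0.92
```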
The tutorial covers a simple 80/20 split between train and test. However, to model what the recommender would produce in production, a timestamp-based split should be used, so the evaluation reflects what the recommender would have produced on a particular date (see the sketch below).
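A sketch of the idea, assuming the interactions carry a timestamp column (names are illustrative):

```python
import pandas as pd

interactions = pd.DataFrame({
    "user_id":   [1, 1, 2, 2, 3],
    "item_id":   ["a", "b", "a", "c", "b"],
    "timestamp": pd.to_datetime(
        ["2016-01-05", "2016-02-10", "2016-01-20", "2016-03-01", "2016-02-25"]
    ),
})

# A random 80/20 split leaks future interactions into training.
# A timestamp split trains strictly on the past and tests on the future,
# simulating what the recommender would have seen as of the cutoff date.
cutoff = pd.Timestamp("2016-02-15")
train = interactions[interactions["timestamp"] < cutoff]
test = interactions[interactions["timestamp"] >= cutoff]
```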
Content-based filtering:
- Build user profiles
- Recommend based on user/item profiles
Collaborative filtering (a memory-based sketch follows this list):
- Memory-based: using past interaction activity, compute item-item similarity based on the users who interacted with the items, or user-user similarity based on the items the users interacted with
- Model-based: SVD, deep recommenders, reinforcement learning
Evaluation: Top-N accuracy scores
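A minimal sketch of the memory-based idea, using item-item cosine similarity over a toy user-item matrix (illustrative, not the tutorial's code):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = items; entries are interaction strengths.
ratings = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
])

# Two items are similar when the same users interacted with both,
# i.e. when their column vectors point in similar directions.
item_sim = cosine_similarity(ratings.T)

def score_items_for(user_idx):
    # Score every item by its similarity to the items this user touched.
    return ratings[user_idx] @ item_sim

print(score_items_for(0).round(2))
```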
Open questions to dig into (sketches for several of these follow the list):
- How exactly does content-based filtering work (how user profiles are created, how item profiles are created, how a recommendation is generated)? Look at the
- How does the TF-IDF technique work for information retrieval? The tutorial notes that it transforms unstructured text into vectorized form.
- How does collaborative filtering work? Look in more detail at what the implementation looks like.
- How exactly does the hybrid approach combine content-based filtering and collaborative filtering?
- Check out how the vector space model works and why it matters to content-based filtering.
- What are the scipy sparse matrix and the vectorizer function, and what do they do? In the code, they were used to compute the TF-IDF scores from the item details prior to creating the item profiles.
- What exactly does the item profile look like? It is at the per-item level, where each word in the item is assigned a TF-IDF score.
- What does the TfidfVectorizer function do? What does its output look like? Where does the corpus for the IDF calculation come from?
- What do the item profile and user profile look like?
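As a starting point for the TF-IDF questions: TF-IDF scores each word in a document by its term frequency, down-weighted by how many documents in the corpus contain that word (inverse document frequency). A sketch with scikit-learn's TfidfVectorizer on a toy corpus; the corpus for the IDF calculation is simply whatever collection of documents the vectorizer is fit on:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy "item details": some text per news item.
items = [
    "machine learning for news recommendation",
    "deep learning for image recognition",
    "news about the stock market",
]

vectorizer = TfidfVectorizer()
# fit() learns the vocabulary and IDF weights from this corpus;
# transform() turns each document into a TF-IDF weighted vector.
item_profiles = vectorizer.fit_transform(items)

# The output is a scipy sparse CSR matrix (items x vocabulary terms):
# most entries are zero, so only the nonzeros are stored.
print(type(item_profiles))
print(item_profiles.shape)  # (3, vocabulary size)

# Row i is item i's profile: a TF-IDF score per word in that item.
print(dict(zip(vectorizer.get_feature_names_out(),
               item_profiles[0].toarray().ravel().round(2))))
```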
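Following that, a sketch of how the profiles can drive a content-based recommender in the vector space model: each item profile is a row of the TF-IDF matrix, a user profile is an interaction-weighted average of the profiles of the items the user touched, and recommendations are ranked by cosine similarity between the user vector and all item vectors. The weighting scheme below is an illustrative assumption, not the tutorial's exact code:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = [
    "machine learning for news recommendation",
    "deep learning for image recognition",
    "news about the stock market",
]
item_profiles = TfidfVectorizer().fit_transform(items)  # items x terms

interacted = [0, 2]                # item indices this user touched
strengths = np.array([1.0, 3.0])   # assumed weights, e.g. view=1, like=3

# User profile: interaction-weighted average of the TF-IDF vectors of
# the items the user touched -- same vector space as the item profiles.
user_profile = (strengths @ item_profiles[interacted].toarray()) / strengths.sum()

# Vector space model: rank items by cosine similarity between the
# user vector and every item vector, then drop already-seen items.
scores = cosine_similarity(user_profile.reshape(1, -1), item_profiles).ravel()
recs = [i for i in np.argsort(-scores) if i not in interacted]
print(recs)  # best unseen match first
```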
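For the collaborative filtering question, one model-based implementation factorizes the user-item matrix with truncated SVD and treats the reconstruction as predicted interaction strengths, which can then be ranked per user. A minimal sketch (toy matrix, illustrative factor count k):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Sparse user-item interaction matrix (rows = users, cols = items).
ratings = csr_matrix(np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
]))

# Factorize into k latent dimensions; U and Vt are the dense factors.
k = 2
U, sigma, Vt = svds(ratings, k=k)

# Reconstruct a dense prediction matrix: unobserved cells now hold
# predicted interaction strengths we can rank per user.
predicted = U @ np.diag(sigma) @ Vt
print(predicted.round(2))
```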
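And for the hybrid question, one common combination (a weighted sum of the two models' scores per candidate item, which is roughly how ensemble hybrids work) can be sketched as follows; the normalization and weights here are illustrative assumptions:

```python
import numpy as np

# Per-item scores from the two recommenders (toy values).
cb_scores = np.array([0.10, 0.80, 0.30])  # content-based
cf_scores = np.array([4.20, 1.10, 3.70])  # collaborative

def normalize(x):
    # Put both score ranges on a comparable 0..1 scale before mixing.
    return (x - x.min()) / (x.max() - x.min())

# Hybrid score: weighted sum; the weights control how much each signal counts.
cb_weight, cf_weight = 1.0, 1.0
hybrid = cb_weight * normalize(cb_scores) + cf_weight * normalize(cf_scores)
print(np.argsort(-hybrid))  # item indices, best first
```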