A two-layer hybrid recommender system for retail. Layer 1 uses the implicit library for sparse data (KNN and ALS approaches). Layer 2 is a ranking model built with CatBoost (gradient boosting). This doubled the result compared to the baseline, evaluated by a custom precision metric.
Stack:
- 1st layer: implicit (ItemItemRecommender, ALS), sklearn, pandas, numpy, matplotlib
- 2nd layer: CatBoost, LightGBM
Data: from Retail X5 Hero Competition
Steps:
- Prepare data: prefiltering
- Matching model (initialize the MainRecommender 1st-layer model as baseline)
- Evaluate Recall@k
- Ranking model (choose 2-nd layer model)
- Feature engineering for ranking
Please open the train.ipynb Jupyter notebook to explore how to create the recommender system step by step.
The project consists of the following steps:
First, we look at the datasets and prefilter the data.
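A minimal sketch of what prefiltering can look like, assuming transaction columns named `user_id`, `item_id`, and `quantity`; the cutoff of 5000 popular items and the `999999` placeholder are illustrative choices, not necessarily the project's exact values:

```python
import pandas as pd

def prefilter_items(data: pd.DataFrame, take_n_popular: int = 5000) -> pd.DataFrame:
    """Keep only the N most popular items; map the rest to one placeholder id."""
    popularity = data.groupby('item_id')['quantity'].sum().reset_index()
    top = popularity.sort_values('quantity', ascending=False).head(take_n_popular)['item_id']
    data = data.copy()
    # Everything outside the top-N collapses into a single "other" item,
    # which sharply reduces the item dimension for the first-layer model
    data.loc[~data['item_id'].isin(top), 'item_id'] = 999999
    return data
```

Collapsing rare items instead of dropping rows keeps each user's transaction history intact while shrinking the item catalogue.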
Then we train the first-layer model as a baseline. The MainRecommender class wraps two base models from the implicit library: ItemItemRecommender and AlternatingLeastSquares. ALS is used to find similar users, similar items, and ALS recommendations; ItemItemRecommender is used to find a user's own-item recommendations among their past purchases.
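Both implicit models consume a sparse user-item interaction matrix. A sketch of building one with scipy, again assuming `user_id`/`item_id`/`quantity` column names:

```python
import pandas as pd
from scipy.sparse import csr_matrix

def build_user_item_matrix(data: pd.DataFrame):
    """Pivot transactions into a sparse users x items matrix.
    Returns the matrix plus the user/item id lists for decoding rows/columns."""
    users = data['user_id'].astype('category')
    items = data['item_id'].astype('category')
    matrix = csr_matrix(
        (data['quantity'].astype(float),
         (users.cat.codes, items.cat.codes)),
        shape=(users.cat.categories.size, items.cat.categories.size),
    )
    return matrix, list(users.cat.categories), list(items.cat.categories)
```

Note that the expected matrix orientation depends on the implicit version: older releases fit on an item-user matrix, newer ones on user-item, so check the version you install.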
For the first-layer model we use the Recall@k metric because it shows the proportion of real purchases covered by the recommended candidates. With this approach we significantly cut the dataset size for the second-layer model.
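The metric itself is simple; a plain-Python sketch of the standard Recall@k definition:

```python
def recall_at_k(recommended: list, bought: list, k: int = 50) -> float:
    """Share of actually bought items that appear in the top-k recommendations."""
    rec_k = set(recommended[:k])
    bought = set(bought)
    if not bought:
        return 0.0
    return len(rec_k & bought) / len(bought)
```

A high Recall@k at the first layer means the true purchases survive into the candidate set that the second-layer ranker will reorder.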
Here we evaluate different types of recommendations:
And select the variant with the best Recall:
In this step we build a new X_train dataset with a target based on actual purchases:
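One common way to build this dataset is to label each first-layer candidate pair by whether the user actually bought the item in the target period; a hedged sketch with assumed column names:

```python
import pandas as pd

def make_ranking_dataset(candidates: pd.DataFrame, purchases: pd.DataFrame) -> pd.DataFrame:
    """candidates: (user_id, item_id) pairs produced by the first layer.
    purchases: actual (user_id, item_id) transactions in the target window.
    Returns candidates with a binary target column."""
    purchases = purchases[['user_id', 'item_id']].drop_duplicates()
    purchases['target'] = 1
    df = candidates.merge(purchases, on=['user_id', 'item_id'], how='left')
    # Pairs the user never bought get target = 0
    df['target'] = df['target'].fillna(0).astype(int)
    return df
```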
Here we choose a classifier from LightGBM and CatBoost and evaluate it by Precision@k on test data. At this step the result is not yet impressive.
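For reference, the standard Precision@k definition (the project's custom precision metric may weight items differently, e.g. by price):

```python
def precision_at_k(recommended: list, bought: list, k: int = 5) -> float:
    """Share of the top-k recommendations that were actually bought."""
    rec_k = recommended[:k]
    if not rec_k:
        return 0.0
    bought = set(bought)
    return sum(item in bought for item in rec_k) / len(rec_k)
```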
Next we add new features for the ranking model based on user, item, and paired user-item data.
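An illustrative sketch of the three feature families; the specific features and column names here are assumptions, not the project's exact feature set:

```python
import pandas as pd

def add_features(df: pd.DataFrame, transactions: pd.DataFrame) -> pd.DataFrame:
    """Attach user, item, and paired user-item features to candidate pairs."""
    # User feature: average quantity per transaction row
    user_feats = (transactions.groupby('user_id')['quantity']
                  .mean().rename('user_avg_qty').reset_index())
    # Item feature: overall item popularity
    item_feats = (transactions.groupby('item_id')['quantity']
                  .sum().rename('item_total_qty').reset_index())
    # Paired feature: how often this user bought this exact item before
    pair_feats = (transactions.groupby(['user_id', 'item_id'])
                  .size().rename('user_item_purchases').reset_index())
    df = (df.merge(user_feats, on='user_id', how='left')
            .merge(item_feats, on='item_id', how='left')
            .merge(pair_feats, on=['user_id', 'item_id'], how='left'))
    return df.fillna(0)
```

The paired features encode each user's personal history with each candidate item, which is why they tend to dominate feature importance.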
We control overfitting for CatBoost and cut extra estimators:
The ranking model doubled the result compared to the baseline.
As we can see, the paired user-item features have the highest feature importance: