Exploratory project for the Introduction to RecSys and User Preferences class. It analyzes how the different kinds of content-based (CB) information provided by selected datasets affect recommendation quality. Each dataset offers only a limited set of information that can be used for recommendation purposes. The project evaluates content-based recommender systems on subsets of the available information from each dataset and compares key offline evaluation metrics across those subsets.
-
MovieLens-100K (also ML-Latest for feature extraction and analysis)
- CB Information:
- Genres
- Tags
- Release Years
-
RetailRocket
- Using pre-filtered version with only 3 event types
- This dataset consists of:
- 92,490 interactions from 3,431 users on 8,885 items.
- History: 3,423 users and 8,878 items (78,372 accesses)
- Purchase: 824 users and 3,077 items (5,089 interactions)
- Add to cart: 1,557 users and 4,447 items (9,029 interactions)
- Content: Implicit feedback by Event type (View, Add to cart, Purchase)
-
Book Crossing Dataset
- Using pre-filtered version
- 272,679 interactions (explicit / implicit) from 2,946 users on 17,384 books.
- Ratings: 1,295 users and 14,684 books (62,657 ratings applied)
- Ratings are between 1 - 10.
- CB Information: Simple demographic info for the users (age, gender, occupation, zip)
Of all the evaluations, the ML-100K one has the most varied information and is analysed in the most depth. (More evaluations are in the respective dataset folders.)
For this evaluation, precomputed item-to-item similarities were used. To generate the required data, run the "movielens-cb.ipynb" notebook.
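The notebook that builds the similarity files is not reproduced here. As a rough illustration only, the sketch below shows one way an item-to-item genre similarity could be computed and laid out as `item1`, `item2`, `rel` triples, the column layout read later in this notebook. The Jaccard measure, the helper name `genre_jaccard_pairs` and the toy data are assumptions, not necessarily what "movielens-cb.ipynb" does.

```python
from itertools import combinations

import pandas as pd


def genre_jaccard_pairs(genres: pd.Series) -> pd.DataFrame:
    """Return one row per unordered item pair with a non-zero Jaccard similarity.

    Hypothetical helper: `genres` maps an item id to a pipe-separated genre string.
    """
    genre_sets = {item: set(g.split("|")) for item, g in genres.items()}
    rows = []
    for a, b in combinations(genre_sets, 2):
        shared = genre_sets[a] & genre_sets[b]
        if shared:  # skip pairs with no genre in common
            union = genre_sets[a] | genre_sets[b]
            rows.append((a, b, len(shared) / len(union)))
    return pd.DataFrame(rows, columns=["item1", "item2", "rel"])


# Toy data (made up for illustration)
toy_genres = pd.Series(
    {1: "adventure|children|fantasy", 2: "comedy|romance", 3: "comedy|drama|romance"}
)
pairs = genre_jaccard_pairs(toy_genres)
print(pairs)

# Writing the triples in a tab-separated layout (file name is illustrative):
# pairs.to_csv("genre_sim.dat", sep="\t", index=False, header=False)
```

Only items 2 and 3 share genres here, so a single pair is produced. Whether the recommender expects each pair once or in both directions is not covered by this sketch.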
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from pandas.errors import SettingWithCopyWarning
from sklearn.model_selection import train_test_split
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=SettingWithCopyWarning)
df = pd.read_csv("ml-latest-small/ratings.csv", sep=",")
df
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1 | 1 | 4.0 | 964982703 |
1 | 1 | 3 | 4.0 | 964981247 |
2 | 1 | 6 | 4.0 | 964982224 |
3 | 1 | 47 | 5.0 | 964983815 |
4 | 1 | 50 | 5.0 | 964982931 |
... | ... | ... | ... | ... |
100831 | 610 | 166534 | 4.0 | 1493848402 |
100832 | 610 | 168248 | 5.0 | 1493850091 |
100833 | 610 | 168250 | 5.0 | 1494273047 |
100834 | 610 | 168252 | 5.0 | 1493846352 |
100835 | 610 | 170875 | 3.0 | 1493846415 |
100836 rows × 4 columns
moviesDF = pd.read_csv("ml-latest-small/movies.csv", sep=",")
moviesDF.movieId = moviesDF.movieId.astype(int)
moviesDF.set_index("movieId", inplace=True)
# Extract years
moviesDF["year"] = moviesDF.title.str.extract(r'\(([0-9]+)\)')
moviesDF["year"] = moviesDF.year.astype("float")
# Clean title
moviesDF["title"] = moviesDF.title.str.replace(r'\(([0-9]+)\)$', "", regex=True).str.strip()
# Clean genres
moviesDF["genres"] = moviesDF.genres.str.lower()
moviesDF
movieId | title | genres | year |
---|---|---|---|
1 | Toy Story | adventure|animation|children|comedy|fantasy | 1995.0 |
2 | Jumanji | adventure|children|fantasy | 1995.0 |
3 | Grumpier Old Men | comedy|romance | 1995.0 |
4 | Waiting to Exhale | comedy|drama|romance | 1995.0 |
5 | Father of the Bride Part II | comedy | 1995.0 |
... | ... | ... | ... |
193581 | Black Butler: Book of the Atlantic | action|animation|comedy|fantasy | 2017.0 |
193583 | No Game No Life: Zero | animation|comedy|fantasy | 2017.0 |
193585 | Flint | drama | 2017.0 |
193587 | Bungo Stray Dogs: Dead Apple | action|animation | 2018.0 |
193609 | Andrew Dice Clay: Dice Rules | comedy | 1991.0 |
9742 rows × 3 columns
df["title"] = moviesDF.title.loc[df.movieId].values
df.drop(columns=["timestamp"], inplace=True)
Let's keep only the movies that have at least 20 ratings in total for the evaluation.
ratingCounts = df.groupby("movieId")["userId"].count()
valid_items = ratingCounts.loc[ratingCounts >= 20].index.values
df = df.loc[df.movieId.isin(valid_items)].copy()
df
userId | movieId | rating | title | |
---|---|---|---|---|
0 | 1 | 1 | 4.0 | Toy Story |
1 | 1 | 3 | 4.0 | Grumpier Old Men |
2 | 1 | 6 | 4.0 | Heat |
3 | 1 | 47 | 5.0 | Seven (a.k.a. Se7en) |
4 | 1 | 50 | 5.0 | Usual Suspects, The |
... | ... | ... | ... | ... |
100803 | 610 | 148626 | 4.0 | Big Short, The |
100808 | 610 | 152081 | 4.0 | Zootopia |
100829 | 610 | 164179 | 5.0 | Arrival |
100830 | 610 | 166528 | 4.0 | Rogue One: A Star Wars Story |
100834 | 610 | 168252 | 5.0 | Logan |
67898 rows × 4 columns
df.rating.value_counts().plot(kind = "bar")
<AxesSubplot: >
map_users = {user: u_id for u_id, user in enumerate(df.userId.unique())}
map_items = {item: i_id for i_id, item in enumerate(df.movieId.unique())}
df["movieId"] = df["movieId"].map(map_items)
df["userId"] = df["userId"].map(map_users)
df.head()
userId | movieId | rating | title | |
---|---|---|---|---|
0 | 0 | 0 | 4.0 | Toy Story |
1 | 0 | 1 | 4.0 | Grumpier Old Men |
2 | 0 | 2 | 4.0 | Heat |
3 | 0 | 3 | 5.0 | Seven (a.k.a. Se7en) |
4 | 0 | 4 | 5.0 | Usual Suspects, The |
# Map each (re-indexed) movieId to its title
movieId_title = dict(zip(df["movieId"], df["title"]))
np.save("mappings/map_title.npy", movieId_title)
movieId_genres = pd.Series(moviesDF.genres, index = moviesDF.index)
np.save("mappings/map_genres.npy", movieId_genres)
movieId_tags = pd.Series(np.load("mappings/map_tags.npy", allow_pickle=True))
# Random split: 33% of the ratings for testing, the remaining 67% for training
train, test = train_test_split(df, test_size = 0.33, random_state = 56)
train.to_csv("train.dat", index=False, header=False, sep="\t", encoding="CP1250", errors="ignore")
test.to_csv("test.dat", index=False, header=False, sep="\t", encoding="CP1250", errors="ignore")
unique_movie_ids = pd.concat([train, test]).movieId.unique()
unique_movie_ids.shape
(1297,)
# Filter the precomputed similarity files so they only contain the movies we are evaluating
def preprocess_metadata(metadata_file):
    metadata = pd.read_csv(metadata_file, sep="\t", encoding="CP1250", header=None, names=["item1", "item2", "rel"])
    # Keep a pair only if both of its items are in the evaluated set
    filtered = metadata.loc[metadata["item1"].isin(unique_movie_ids) & metadata["item2"].isin(unique_movie_ids)]
    filtered.to_csv(f"processed/{metadata_file}", index=False, header=False, sep="\t", encoding="CP1250", errors="ignore")
preprocess_metadata("title_sim.dat")
preprocess_metadata("tag_sim.dat")
preprocess_metadata("genre_sim.dat")
preprocess_metadata("title_tag_sim.dat")
preprocess_metadata("title_genre_sim.dat")
Using the Case Recommender (caserec) library, we evaluate the CB recommenders against the ratings users actually made in the testing data. A recommended item counts as relevant if the user has interacted with it in the testing data.
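To make the relevance rule concrete, here is a minimal sketch of precision and recall at rank k for a single user under that definition. The function name and the toy item ids are assumptions for illustration; the library's own metric implementations may differ in details such as averaging over users or tie handling.

```python
def precision_recall_at_k(ranked_items, test_items, k):
    """Precision@k and recall@k for one user.

    An item is relevant if it appears among the user's test-set items.
    """
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in test_items)
    precision = hits / k
    recall = hits / len(test_items) if test_items else 0.0
    return precision, recall


# Toy example: 5 recommended items, 3 items the user interacted with in the test data
ranked = [4, 11, 52, 58, 94]
seen_in_test = {11, 94, 300}
prec, rec = precision_recall_at_k(ranked, seen_in_test, k=5)
print(f"PREC@5={prec:.3f} RECALL@5={rec:.3f}")
```

Two of the five recommended items appear in the test set, so precision is 2/5 and recall is 2/3.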
from caserec.recommenders.item_recommendation.content_based import ContentBased
# Top-N recommendations
RANK_LENGTH = 50
# Optional metrics used for evaluation: ["PREC", "RECALL", "NDCG", "MAP"]
METRICS = ["PREC", "RECALL", "NDCG", "MAP"]
# For which ranks we evaluate metrics
RANK_EVAL = [10, 20, 50]
Title similarity
title_model = ContentBased("train.dat", "test.dat", similarity_file="processed/title_sim.dat",
output_file="output/rank_title.dat", as_binary=False, rank_length=RANK_LENGTH)
title_model.compute(as_table=True, table_sep="\t", metrics=METRICS, n_ranks=RANK_EVAL)
[Case Recommender: Item Recommendation > Content Based Algorithm]
train data:: 610 users and 1297 items (45491 interactions) | sparsity:: 94.25%
test data:: 610 users and 1297 items (22407 interactions) | sparsity:: 97.17%
training_time:: 0.007982 sec
prediction_time:: 22.567967 sec
PREC@10 PREC@20 PREC@50 RECALL@10 RECALL@20 RECALL@50 NDCG@10 NDCG@20 NDCG@50 MAP@10 MAP@20 MAP@50
0.06623 0.085656 0.081607 0.023463 0.063036 0.128316 0.216975 0.258815 0.291939 0.132644 0.136731 0.133862
Tags similarity
tag_model = ContentBased("train.dat", "test.dat", similarity_file="processed/tag_sim.dat",
output_file="output/rank_tag.dat", as_binary=False, rank_length=RANK_LENGTH)
tag_model.compute(as_table=True, table_sep="\t", metrics=METRICS, n_ranks=RANK_EVAL)
[Case Recommender: Item Recommendation > Content Based Algorithm]
train data:: 610 users and 1297 items (45491 interactions) | sparsity:: 94.25%
test data:: 610 users and 1297 items (22407 interactions) | sparsity:: 97.17%
training_time:: 0.010037 sec
prediction_time:: 22.910093 sec
PREC@10 PREC@20 PREC@50 RECALL@10 RECALL@20 RECALL@50 NDCG@10 NDCG@20 NDCG@50 MAP@10 MAP@20 MAP@50
0.047869 0.047951 0.051934 0.013491 0.029481 0.094047 0.174812 0.213229 0.248201 0.117574 0.120592 0.100002
Genre similarity
genre_model = ContentBased("train.dat", "test.dat", similarity_file="processed/genre_sim.dat",
output_file="output/rank_genre.dat", as_binary=False, rank_length=RANK_LENGTH)
genre_model.compute(as_table=True, table_sep="\t", metrics=METRICS, n_ranks=RANK_EVAL)
[Case Recommender: Item Recommendation > Content Based Algorithm]
train data:: 610 users and 1297 items (45491 interactions) | sparsity:: 94.25%
test data:: 610 users and 1297 items (22407 interactions) | sparsity:: 97.17%
training_time:: 1.218894 sec
prediction_time:: 23.614394 sec
PREC@10 PREC@20 PREC@50 RECALL@10 RECALL@20 RECALL@50 NDCG@10 NDCG@20 NDCG@50 MAP@10 MAP@20 MAP@50
0.055738 0.051885 0.041213 0.015075 0.027308 0.05433 0.216125 0.228107 0.245012 0.149846 0.140758 0.1212
Combination of Title and Tag similarities
title_tag_model = ContentBased("train.dat", "test.dat", similarity_file="processed/title_tag_sim.dat",
output_file="output/rank_title_tag.dat", as_binary=False, rank_length=RANK_LENGTH)
title_tag_model.compute(as_table=True, table_sep="\t", metrics=METRICS, n_ranks=RANK_EVAL)
[Case Recommender: Item Recommendation > Content Based Algorithm]
train data:: 610 users and 1297 items (45491 interactions) | sparsity:: 94.25%
test data:: 610 users and 1297 items (22407 interactions) | sparsity:: 97.17%
training_time:: 0.010617 sec
prediction_time:: 22.795777 sec
PREC@10 PREC@20 PREC@50 RECALL@10 RECALL@20 RECALL@50 NDCG@10 NDCG@20 NDCG@50 MAP@10 MAP@20 MAP@50
0.047541 0.047295 0.050328 0.013271 0.028799 0.09214 0.174127 0.211382 0.246048 0.117886 0.11984 0.099411
Combination of Title and Genre similarities
title_genre_model = ContentBased("train.dat", "test.dat", similarity_file="processed/title_genre_sim.dat",
output_file="output/rank_title_genre.dat", as_binary=False, rank_length=RANK_LENGTH)
title_genre_model.compute(as_table=True, table_sep="\t", metrics=METRICS, n_ranks=RANK_EVAL)
[Case Recommender: Item Recommendation > Content Based Algorithm]
train data:: 610 users and 1297 items (45491 interactions) | sparsity:: 94.25%
test data:: 610 users and 1297 items (22407 interactions) | sparsity:: 97.17%
training_time:: 1.182623 sec
prediction_time:: 23.658360 sec
PREC@10 PREC@20 PREC@50 RECALL@10 RECALL@20 RECALL@50 NDCG@10 NDCG@20 NDCG@50 MAP@10 MAP@20 MAP@50
0.055738 0.051885 0.04118 0.015075 0.027308 0.054002 0.216125 0.228098 0.244707 0.149027 0.139932 0.12034
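How the combined similarity files (title+tag, title+genre) were produced is not shown in this notebook. Purely as an assumption, one common option is a weighted average of the two per-pair scores, treating a pair missing from one file as similarity 0. The function name, the weight and the toy frames below are hypothetical.

```python
import pandas as pd


def combine_similarities(sim_a: pd.DataFrame, sim_b: pd.DataFrame, weight_a: float = 0.5) -> pd.DataFrame:
    """Weighted average of two item-item similarity tables (columns: item1, item2, rel)."""
    keys = ["item1", "item2"]
    # Outer merge keeps pairs that appear in only one table; their missing score becomes 0
    merged = sim_a.merge(sim_b, on=keys, how="outer", suffixes=("_a", "_b")).fillna(0.0)
    merged["rel"] = weight_a * merged["rel_a"] + (1 - weight_a) * merged["rel_b"]
    return merged[keys + ["rel"]]


# Toy similarity tables (made up for illustration)
title_sim = pd.DataFrame({"item1": [0, 0], "item2": [1, 2], "rel": [0.8, 0.2]})
genre_sim = pd.DataFrame({"item1": [0, 1], "item2": [1, 2], "rel": [0.4, 1.0]})
combined = combine_similarities(title_sim, genre_sim)
print(combined)
```

With equal weights, a source whose scores are systematically larger dominates the combined ranking, so rescaling both sources to a comparable range before averaging may be worth considering.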
Example recommendations from the title+genre variant:
ranking = pd.read_csv('output/rank_title_genre.dat', sep='\t', names=["userId", "movieId", "rating"])
ranking["title"] = ranking.movieId.map(movieId_title)
ranking["tags"] = ranking.movieId.map(movieId_tags)
ranking["genres"] = ranking.movieId.map(movieId_genres)
ranking.sort_values(by="rating", ascending=False).head(5)
userId | movieId | rating | title | tags | genres | |
---|---|---|---|---|---|---|
24900 | 498 | 917 | 0.585525 | Entrapment | children|drama | |
9696 | 193 | 412 | 0.582343 | Multiplicity | assassination | drama |
9684 | 193 | 334 | 0.582343 | Sense and Sensibility | bus | drama |
9670 | 193 | 254 | 0.582343 | Strictly Ballroom | assassin jean reno hit men action assassin ass... | drama |
9671 | 193 | 263 | 0.582343 | Mars Attacks! | drama |
# Top-10 recommendation for User 1
ranking.loc[ranking.userId == 1].head(10)
userId | movieId | rating | title | tags | genres | |
---|---|---|---|---|---|---|
50 | 1 | 4 | 0.399869 | Usual Suspects, The | pregnancy remake | comedy|drama|romance |
51 | 1 | 11 | 0.399869 | Clerks | comedy|drama|romance | |
52 | 1 | 52 | 0.399869 | Reservoir Dogs | writing | comedy|drama|romance |
53 | 1 | 58 | 0.399869 | Star Wars: Episode V - The Empire Strikes Back | comedy|drama|romance | |
54 | 1 | 94 | 0.399869 | Wild Things | comedy|drama|romance | |
55 | 1 | 195 | 0.399869 | Girl with the Dragon Tattoo, The | comedy|drama|romance | |
56 | 1 | 224 | 0.399869 | Heavenly Creatures | classic space action action sci fi epic great ... | comedy|drama|romance |
57 | 1 | 232 | 0.399869 | In the Name of the Father | comedy|drama|romance | |
58 | 1 | 281 | 0.399869 | Waking Ned Devine (a.k.a. Waking Ned) | comedy|drama|romance | |
59 | 1 | 351 | 0.399869 | Crimson Tide | comedy|drama|romance |
The five variants above (title, tag, genre, title+tag and title+genre) give the following results for content-based recommendation. The chart compares the three single-feature variants at a rank length of 20:
# x = ["PREC@10","PREC@20","PREC@50","RECALL@10","RECALL@20","RECALL@50","NDCG@10","NDCG@20","NDCG@50","MAP@10","MAP@20","MAP@50"]
labels = ["PREC@20","RECALL@20","NDCG@20","MAP@20"]
y1 = [0.085656,0.063036,0.258815,0.136731]
y2 = [0.047951,0.029481,0.213229,0.120592]
y3 = [0.051885,0.027308,0.228107,0.140758]
x = np.arange(len(labels))
width = 0.25
# Create a single axes that fills the figure (avoids a second, overlapping axes)
fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
rects1 = ax.bar(x - width, y1, width, label="Title")
rects2 = ax.bar(x, y2, width, label="Tag")
rects3 = ax.bar(x + width, y3, width, label="Genre")
ax.set_ylabel("Metric value")
ax.set_title("Metric evaluation at TOP 20")
ax.set_xticks(x, labels)
ax.legend()
ax.bar_label(rects1, padding=3, fmt='%.2f')
ax.bar_label(rects2, padding=3, fmt='%.2f')
ax.bar_label(rects3, padding=3, fmt='%.2f')
plt.show()
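Title-based similarity performs best on PREC@20, RECALL@20 and NDCG@20, while Genre is marginally ahead on MAP@20 (0.141 vs. 0.137). Genre outperforms Tags on every metric except RECALL@20.
The tags in the ML-100K dataset are extremely sparse, so the ML 27M dataset could be used for a better evaluation.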
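The chart above covers only the three single-feature variants. As a small sketch, the rank-20 metrics of all five variants can be collected into one table so the combined variants are compared as well. The values are copied from the evaluation outputs above; the variable names are illustrative.

```python
import pandas as pd

# Rank-20 metrics copied from the five evaluation outputs above
results_at_20 = pd.DataFrame(
    {
        "PREC@20": [0.085656, 0.047951, 0.051885, 0.047295, 0.051885],
        "RECALL@20": [0.063036, 0.029481, 0.027308, 0.028799, 0.027308],
        "NDCG@20": [0.258815, 0.213229, 0.228107, 0.211382, 0.228098],
        "MAP@20": [0.136731, 0.120592, 0.140758, 0.11984, 0.139932],
    },
    index=["Title", "Tag", "Genre", "Title+Tag", "Title+Genre"],
)

# Best-scoring variant for each metric
best = results_at_20.idxmax()
print(results_at_20.round(3))
print(best)
```

In this table the combined variants score almost the same as their Tag and Genre counterparts.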