
###################################################################
############################# README #############################
###################################################################
################ PROJECT 2 - RECOMMENDER SYSTEM ################
###################################################################
########################## TEAM #NoChill ###########################
###################################################################

Explanation of abbreviations (see report for details):
ALS = alternating least squares
SGD = stochastic gradient descent
AVG = smart average
GLBAVG = global average
USRAVG = user average
MVIAVG = item average
CFI = item-based collaborative filtering
CFU = user-based collaborative filtering

1) To reproduce the result of the final submission, just run the command 'python run.py'.
It combines the submissions generated with ALS, SGD, AVG, GLBAVG, USRAVG,
MVIAVG, CFI and CFU, finds the best weights for blending them, and then creates a single submission from all of them.
Each individual submission can be generated by running the appropriate script 'run_[method].py'.
If a submission does not exist, 'run.py' automatically runs the corresponding script to generate it before combining.
To keep the generation of the final submission fast, we have provided a precomputed
submission for CFU, as it takes quite a long time to compute without the precomputed
similarity matrix (which would have been too large to submit). For CFI we use a
precomputed similarity matrix ('movieSim.obj'). For ALS and SGD we provide precomputed
user and item features. The code will still work without these files, but it will take
much longer, since it will have to retrain everything.
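
For orientation, the blending step amounts to a weighted sum of the individual method
predictions. The sketch below only illustrates the idea; the submission file names, the
column layout and the exact format of 'coefs.obj' are assumptions, not the actual code in 'run.py':

    # Minimal sketch of blending per-method predictions with precomputed weights.
    # Assumes one CSV per method with a 'Prediction' column and a pickled weight
    # vector in 'coefs.obj' (one coefficient per method) -- both are assumptions.
    import pickle
    import numpy as np
    import pandas as pd

    methods = ["ALS", "SGD", "AVG", "GLBAVG", "USRAVG", "MVIAVG", "CFI", "CFU"]
    preds = [pd.read_csv("submission_%s.csv" % m)["Prediction"].to_numpy()
             for m in methods]                      # hypothetical file names
    with open("coefs.obj", "rb") as f:
        weights = pickle.load(f)

    blended = np.column_stack(preds) @ np.asarray(weights)
    blended = np.clip(blended, 1, 5)                # assumed 1-5 rating scale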

2) All the submissions needed for 'run.py' are created automatically by the code if they are not present in the
folder. To create them manually, the following scripts should be run:
	- run_ALS.py (~ 10 min)
	- run_SGD.py (~ 1 h)
	- run_AVG.py (fast)
	- run_GLBAVG.py (fast)
	- run_MVIAVG.py (fast)
	- run_USRAVG.py (fast)
	- run_CFI.py (~ 2-3 min)
	- run_CFU.py (~ 1-2 h with no precomputed similarity matrix)

3) Since the final submission is generated by blending the submissions from multiple methods,
we need to find the weight with which each method contributes. This is done by 'blend.py'.
It creates a joint dataset from the method outputs in the 'data/methods' folder combined with
the true values. It then runs least squares on that dataset to find the blending weights.
These weights are saved in 'coefs.obj' and are used by 'run.py'. If they do not exist,
'run.py' will run 'blend.py' first.


4) The files in 'data/methods/' were generated with the following procedure:
	- split the rating matrix into 5 pairs of train and test submatrices, containing 90% and 10%
	of the data respectively. The script to do this is 'create_train_test.py'.
	- for each of the 5 pairs, run all 8 methods on the train set, generating
	submissions for the corresponding test set.

This takes a very long time for some of the methods,
so we have provided the already generated files.
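
For reference, one such 90/10 split of the observed ratings could look like the sketch
below (the real 'create_train_test.py' may store and split the data differently; the
sparse-matrix representation is an assumption):

    # Sketch: split the observed entries of a sparse rating matrix into
    # 90% train / 10% test, repeated 5 times with different seeds.
    import numpy as np
    import scipy.sparse as sp

    def split_ratings(ratings, test_fraction=0.1, seed=0):
        """ratings: scipy.sparse CSR matrix of observed ratings (0 = unobserved)."""
        rng = np.random.default_rng(seed)
        rows, cols = ratings.nonzero()
        in_test = rng.random(len(rows)) < test_fraction

        train = ratings.tolil(copy=True)
        test = sp.lil_matrix(ratings.shape)
        vals = np.asarray(ratings[rows[in_test], cols[in_test]]).ravel()
        test[rows[in_test], cols[in_test]] = vals
        train[rows[in_test], cols[in_test]] = 0
        return train.tocsr(), test.tocsr()

    # pairs = [split_ratings(ratings, seed=s) for s in range(5)]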
	
	

Here is a guide to the files contained in this folder.
- 'run.py' creates the final submission. Needs the precomputed submissions from the
different methods and the precomputed blending weights.
- 'train_ALS.py' factorizes a matrix with Alternating Least Squares. Needs the original
data and creates the files 'item_features_ALS.py' and 'user_features_ALS.py'.
- 'train_SGD.py' factorizes a matrix with Stochastic Gradient Descent. Needs the
original data and creates the files 'item_features_SGD.py' and 'user_features_SGD.py'.
- 'run_ALS.py' creates predictions for the whole matrix, using its factorization
(a minimal sketch of this step appears after this list).
- 'run_SGD.py' creates predictions for the whole matrix, using its factorization.
- 'run_AVG.py' creates predictions for the whole matrix, with a modified average
method (see report).
- 'run_GLBAVG.py' creates predictions for the whole matrix, with the global average.
- 'run_USRAVG.py' creates predictions for the whole matrix, with the user average.
- 'run_MVIAVG.py' creates predictions for the whole matrix, with the item average.
- 'run_CFI.py' creates predictions for the whole matrix, with item-based collaborative filtering
(a sketch of this step also appears after this list).
- 'run_CFU.py' creates predictions for the whole matrix, with user-based collaborative filtering.
- 'blend.py' computes the weights for the blending, using the precomputed values in
the 'data/methods' folder.
- 'create_train_test.py' creates the 5 train/test pairs used for the very expensive runs that generate
the files in the 'data/methods' folder.
- 'collaborative.py' contains the function used to compute similarity matrices.
- 'helpers.py' contains various functions used in the scripts concerning averages, collaborative
filtering and blending.
- 'helpers_MF.py' contains various functions used for matrix factorization.
- folder 'data' contains the precomputed files used by the previously explained algorithms, as well as
the 'submissions' folder (which will be filled with all the generated submissions) and the 'methods' folder
with the very-long-to-compute matrices used to find the best weights for the different methods.
The train/test splits are stored in a subfolder of 'data' named 'train_test'.
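
As referenced in the list above, forming predictions from the saved factors (as in
'run_ALS.py' and 'run_SGD.py') essentially reduces to multiplying the two feature
matrices. The loading mechanism and feature shapes below are assumptions:

    # Sketch: full-matrix predictions from a low-rank factorization.
    # Assumed shapes: user_features (num_users, k), item_features (num_items, k).
    import pickle
    import numpy as np

    with open("user_features_ALS.py", "rb") as f:    # file name as listed above
        user_features = pickle.load(f)               # pickle format is an assumption
    with open("item_features_ALS.py", "rb") as f:
        item_features = pickle.load(f)

    predictions = user_features @ item_features.T    # (num_users, num_items)
    predictions = np.clip(predictions, 1, 5)         # assumed 1-5 rating scale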
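
Likewise, a textbook item-based collaborative-filtering prediction using a precomputed
item-item similarity matrix (such as the one stored in 'movieSim.obj') looks roughly as
follows; the exact weighting used by 'run_CFI.py' may differ:

    # Sketch: item-based CF prediction for one (user, item) pair.
    # sim: (num_items, num_items) item-item similarity matrix,
    # user_ratings: dense vector of one user's ratings (0 = unrated).
    import numpy as np

    def predict_item_cf(sim, user_ratings, item, fallback=3.0):
        rated = np.nonzero(user_ratings)[0]               # items this user has rated
        weights = sim[item, rated]
        if np.sum(np.abs(weights)) == 0:
            return fallback                               # fallback value is an assumption
        return np.dot(weights, user_ratings[rated]) / np.sum(np.abs(weights))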
