The approach chosen was to turn the challenge task into a standard ML problem: independent models are built for each track in the second half of the user session. Such training does not directly optimise the competition metric, but aims to make good predictions for each individual track.
The main issue was the size of the dataset, which is far too large to fit into memory on a local machine.
For preprocessing, dask was used to allow pandas-like processing while performing partitioned (out-of-core) operations.
For modelling, no good solution to the memory problem was found within the limited timeframe.
There are several frameworks that allow online learning, like:
- vowpal wabbit;
- sklearn has several models that implement the `partial_fit()` method;
- dask-ml has an `Incremental` wrapper around the sklearn models with `partial_fit` to run on a `dask.Array`;
- XGBoost allows for processing with external memory;
- H2O has a checkpointing mechanism that allows continuing the training of any H2O model on new data. This is slightly different from the other options, as it allows adjusting the model to changing patterns in the data, instead of finding a pattern in a larger volume of data.
However, all of those have downsides:
- XGBoost and vowpal wabbit read files directly in a specific format, which would mean dumping the processed features to disk and would require a large amount of additional disk space;
- most models that implement `partial_fit` are quite simple, which affects both dask-ml and sklearn.
Due to lack of time, none of those options was tried. Therefore, modelling was restricted to in-memory training with the usual sklearn API in Python. This restricts the amount of data that one can use: on my private laptop, with the full set of features, I was not able to go beyond 6 files.
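For illustration only, a minimal sketch of how such incremental training could look with sklearn's `partial_fit` (the file pattern, column layout and model choice are hypothetical; none of this was part of the actual solution):

```python
# Hypothetical sketch: chunk-wise training with sklearn's partial_fit.
# Assumes preprocessed files containing only numeric features plus the skip_2 target.
import glob

import pandas as pd
from sklearn.linear_model import SGDClassifier

# loss="log_loss" in sklearn >= 1.1 (called "log" in older versions)
clf = SGDClassifier(loss="log_loss", random_state=42)

for path in sorted(glob.glob("data/processed/train_part*.csv.gz")):  # placeholder file pattern
    chunk = pd.read_csv(path)                     # pandas handles the gzip transparently
    X = chunk.drop(columns=["skip_2"]).to_numpy()
    y = chunk["skip_2"].to_numpy()
    # classes must be given on the first call so the model knows all labels up front
    clf.partial_fit(X, y, classes=[0, 1])
```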
```bash
git clone https://github.com/mlisovyi/spotify_skip_prediction
cd spotify_skip_prediction
pip install -r requirements.txt
```
Please download the dataset from https://www.crowdai.org/challenges/spotify-sequential-skip-prediction-challenge/dataset_files and extract the files to the `data/` folder. Untar them (this might take some time) to have the following directory structure:
```
|-- data/
|   |-- training_set/    (training sessions)
|   |-- test_set/        (leaderboard partial sessions)
|   |-- track_features/  (track metadata and audio features)
|   |-- submissions/     (submissions folder - contains sample submissions)
```
In particular, I've used the complete test set (14 GB) and the first part of the training dataset (10 GB), together with the track features (1.2 GB).
All those data were unpacked locally and each file was gzipped to reduce the disk footprint (tools like pandas and dask can read directly from gzipped csv files).
For preprocessing of the original data I've used dask, and in particular dask.DataFrame, to allow pandas-like data manipulation in a multi-threaded, out-of-core fashion (since my local laptop could not read the whole dataset into RAM).
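As a rough sketch of this kind of out-of-core processing (the file pattern and the exact aggregations are assumptions; column names are assumed to match the public dataset):

```python
import dask.dataframe as dd

# The gzipped CSV logs are read lazily; blocksize=None because gzip files cannot be split.
logs = dd.read_csv("data/training_set/*.csv.gz", blocksize=None)

# pandas-like, partition-wise aggregation: per-session statistics over the first half.
first_half = logs[logs["session_position"] <= logs["session_length"] // 2]
session_stats = (
    first_half.groupby("session_id")["skip_2"]
    .agg(["mean", "min", "max"])
    .compute()  # only the aggregated result is materialised in memory
)
```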
The tools to do the preprocessing are Preprocess_Dask.ipynb and Preprocess_Dask_AcousticVector.ipynb. There is one output file produced by each of those per input file. The two outputs can be directly concatenated together.
- The first extracts all features except `acoustic_vector`, adjusts feature types and stores the output in HDF (`h5`) format. The format choice was made to optimise readout speed (more than a factor of 5 faster than `csv.gz`) at the price of slightly higher disk space (50% more than `csv.gz`, but much smaller than plain `csv`).
- The second extracts only the acoustic vector. This is an artifact of the original pre-processing, which did not include these data, and a separate production was faster than a complete re-run.
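For reference, a minimal example of how such an HDF dump could look with pandas (file names and the HDF key are placeholders, not the ones used in the notebooks):

```python
import pandas as pd

# Placeholder file names; one output file is produced per input file.
df = pd.read_csv("data/training_set/session_logs_part0.csv.gz")

# ... feature extraction and dtype adjustments would happen here ...

# HDF (h5) readout is noticeably faster than csv.gz, at the price of a larger file on disk.
df.to_hdf("data/processed/session_logs_part0.h5", key="data", mode="w")

# Reading it back later:
df = pd.read_hdf("data/processed/session_logs_part0.h5", key="data")
```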
There is also Preprocess_Pandas.ipynb, which is similar to the dask implementation but does the pre-processing iteratively over the input files. However, pandas runs in a single thread only, so this version is much slower and does not scale well to a high-performance cluster.
Get_y_truth.ipynb extracts the competition target for the second halves of the sessions in the training data.
The format is a `pandas.Series` of lists of `skip_2` values.
There is one file produced per input file.
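A rough sketch of what this extraction amounts to (the file names are placeholders and the exact notebook code may differ):

```python
import pandas as pd

logs = pd.read_csv("data/training_set/session_logs_part0.csv.gz")  # placeholder file name

# Second half of each session (here taken as positions beyond session_length // 2)
second_half = logs[logs["session_position"] > logs["session_length"] // 2]

# One list of skip_2 labels per session, ordered by position and indexed by session_id
y_truth = (
    second_half.sort_values(["session_id", "session_position"])
    .groupby("session_id")["skip_2"]
    .apply(list)
)

# One output file per input file
y_truth.to_pickle("data/processed/y_truth_part0.pkl")
```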
For modelling, the GBM implementation in lightgbm was used. Individual models were built for the 1st, 2nd, ..., 10th track in the second half of the session. The following features were used by all of those models:
- Full information about the last track from the first half of the session;
- Mean, min, max aggregates for all features over the first half;
- For each track feature of a track in the second half, the following quantities are computed (a condensed sketch of this computation follows the list):
  - the difference calculated with respect to the mean of that feature over the `skip_2==0` and `skip_2==1` tracks in the first half;
  - the significance of that difference, calculated by dividing the difference by the standard deviation of the feature within the same selection;
  - the difference between the `skip==0` and `skip==1` variants is calculated for all features from the previous two steps;
  - the motivation for these features was to evaluate how similar a new track is to those that the user skipped and did not skip.
- For each track, the predicted probabilities (= confidences) of the models for the previous tracks in the session were calculated and added as features. These were obtained by predicting directly on the whole dataset of second halves, i.e. the OOF (out-of-fold) method was not used. The reason for this compromise was that OOF would have been too slow, and model performance on the training and validation data was very similar, so no significant bias is expected.
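The sketch referenced above, condensing the track-similarity features into a single helper (function name, column conventions and edge-case handling are my own, not taken from the notebooks):

```python
import pandas as pd

def similarity_features(first_half: pd.DataFrame, candidate: pd.Series, feature: str) -> dict:
    """Difference / significance of one feature of a candidate (second-half) track
    w.r.t. the tracks the user skipped (skip_2==1) and did not skip (skip_2==0)
    in the first half of the session. Naming is illustrative only."""
    out = {}
    for flag in (0, 1):
        ref = first_half.loc[first_half["skip_2"] == flag, feature]
        diff = candidate[feature] - ref.mean()
        std = ref.std()
        out[f"{feature}_diff_skip{flag}"] = diff
        # significance: the difference expressed in units of the spread within that group
        out[f"{feature}_sig_skip{flag}"] = diff / std if std and std > 0 else 0.0
    # difference between the skip==0 and skip==1 variants of the features above
    out[f"{feature}_diff_skip0_minus_skip1"] = (
        out[f"{feature}_diff_skip0"] - out[f"{feature}_diff_skip1"]
    )
    out[f"{feature}_sig_skip0_minus_skip1"] = (
        out[f"{feature}_sig_skip0"] - out[f"{feature}_sig_skip1"]
    )
    return out
```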
At the end, a classifier was built on the features described above to predict `skip_2` for each track in the second half of the session.
The logloss objective function was used, with the binary error rate as the metric for the early-stopping criterion.
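A hedged sketch of this setup with the lightgbm sklearn API (the data here are synthetic stand-ins and the hyper-parameters are illustrative, not the tuned values):

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features and the skip_2 target of one track position.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))
y = (X[:, 0] + rng.normal(scale=0.5, size=5000) > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# One such model per track position (1st ... 10th) in the second half of the session.
model = lgb.LGBMClassifier(objective="binary", n_estimators=1000, learning_rate=0.05)
model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    eval_metric="binary_error",                          # early stopping watches the binary error rate
    callbacks=[lgb.early_stopping(stopping_rounds=50)],  # modern lightgbm callback API
)
```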
Model_Building.ipynb has the primary model-building procedure. Model_Building_Iterations.ipynb contains a streamlined version to build several independent models on independent training subsets. The predictions of such models are averaged to improve performance.
Model evaluation is performed locally on an independent subset of files that were not used in training; it was found to give a reliable estimate of the score on the leaderboard. Local_evaluation.ipynb contains the evaluation procedure outlined above.
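For completeness, a sketch of the challenge's mean average accuracy metric as I understand it (function names are mine; inputs follow the Series-of-lists format produced by Get_y_truth.ipynb):

```python
import numpy as np

def average_accuracy(y_true, y_pred):
    """Average accuracy of one session's second half, to the best of my understanding
    of the challenge definition: a position contributes its running accuracy
    only if the prediction at that position is correct."""
    correct, score = 0, 0.0
    for i, (t, p) in enumerate(zip(y_true, y_pred), start=1):
        if t == p:
            correct += 1
            score += correct / i
    return score / len(y_true)

def mean_average_accuracy(truth, preds):
    """Mean over sessions; both arguments are pandas.Series of lists, aligned by session_id."""
    return float(np.mean([average_accuracy(t, p) for t, p in zip(truth, preds)]))
```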
A set of helper functions used in different notebooks is collected in helpers.py.
We would like to thank our co-organizers from WSDM and CrowdAI for making this challenge possible.