GitHub - almightyGOSU/TheDatasetsDilemma: Code for our WSDM 2022 paper titled "The Datasets Dilemma: How Much Do We Really Know About Recommendation Datasets?"

Code for the WSDM 2022 paper

The Datasets Dilemma: How Much Do We Really Know About Recommendation Datasets?

Hello! :)

This repository contains the source code, as well as other useful information, for the paper "The Datasets Dilemma: How Much Do We Really Know About Recommendation Datasets?" in WSDM 2022.

The paper is available here: Paper (Best Paper Award Runner-up)

For a quick overview of the paper, you can refer to these slides: The Datasets Dilemma Slides

Reference

Please consider citing our work if you find it useful, thank you!

@inproceedings{10.1145/3488560.3498519,
  author = {Chin, Jin Yao and Chen, Yile and Cong, Gao},
  title = {The Datasets Dilemma: How Much Do We Really Know About Recommendation Datasets?},
  year = {2022},
  isbn = {9781450391320},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3488560.3498519},
  doi = {10.1145/3488560.3498519},
  booktitle = {Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining},
  pages = {141–149},
  numpages = {9},
  keywords = {datasets, item recommendation, evaluation, data characteristics},
  location = {Virtual Event, AZ, USA},
  series = {WSDM '22}
}

Outline

In our paper, we try to address the "datasets dilemma" using 3 main steps.

1. How are different datasets being utilised in recent papers?
   - Are there any patterns?
   - Code can be found in the ./Step 1/ folder (please refer to its README file)
2. What are the similarities and differences between various datasets?
   - Can we define them using objective measures?
   - Code can be found in the ./Step 2/ folder (please refer to its README file)
3. Could the choice of datasets influence the observations and/or conclusions obtained?
   - Empirical study using a variety of item recommendation algorithms
   - Code can be found in the ./Step 3/ folder (please refer to its README file)
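To give a feel for what an "objective measure" of a dataset can look like (the second step), here is a minimal sketch that computes a few simple statistics from a list of user-item interactions. The function name `interaction_stats` and the chosen statistics are illustrative assumptions and are not taken from the code in ./Step 2/; please refer to that folder's README for the characteristics actually used in the paper.

```python
def interaction_stats(interactions):
    """Compute simple statistics for a list of (user, item) pairs.

    Hypothetical helper for illustration only; not part of this repository.
    """
    pairs = set(interactions)  # drop duplicate (user, item) pairs
    users = {user for user, _ in pairs}
    items = {item for _, item in pairs}

    n_users, n_items, n_inter = len(users), len(items), len(pairs)
    # Density: fraction of the user-item matrix that is observed.
    density = n_inter / (n_users * n_items) if pairs else 0.0
    avg_per_user = n_inter / n_users if n_users else 0.0

    return {
        "users": n_users,
        "items": n_items,
        "interactions": n_inter,
        "density": density,
        "interactions_per_user": avg_per_user,
    }


if __name__ == "__main__":
    toy = [(1, "a"), (1, "b"), (2, "a"), (2, "a")]
    print(interaction_stats(toy))
```

Computing the same statistics for two datasets gives a like-for-like basis for comparing them, independent of their names or domains.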

The ./Datasets/ folder

./Datasets/Source/ contains the raw datasets
./Datasets/Preprocessed/ contains the preprocessed datasets
The dataset characteristics (as well as other information) for all 51 datasets: characteristics_all.txt
- Basic Dataset Information (in a table format): characteristics_table_basic_detailed.txt
- Dataset Characteristics (in a table format): characteristics_table_basic_advanced.txt

Environment Setup

Python 3.6.8
PyTorch 1.4.0
TensorFlow 2.3.0
numpy 1.17.2
pandas 0.25.3
matplotlib 3.3.2
scikit-learn 0.23.2
scipy 1.3.0
scikit-optimize 0.8.1
mlxtend 0.18.0 (for frequent itemset mining)
implicit 0.4.4 (for Weighted Matrix Factorization (WMF))

Analyses and experiments were conducted on an Ubuntu 16.04.6 LTS server, with conda 4.8.4.
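If you want to confirm that your environment matches the versions listed above, a small check along these lines may help. This is a sketch and not a script shipped with this repository: the names `EXPECTED`, `installed_version` and `check_versions` are hypothetical, and the distribution names (for example `torch` for PyTorch) are assumed to be the usual pip names.

```python
# Versions copied from the list above; keys are assumed pip distribution names.
EXPECTED = {
    "torch": "1.4.0",
    "tensorflow": "2.3.0",
    "numpy": "1.17.2",
    "pandas": "0.25.3",
    "matplotlib": "3.3.2",
    "scikit-learn": "0.23.2",
    "scipy": "1.3.0",
    "scikit-optimize": "0.8.1",
    "mlxtend": "0.18.0",
    "implicit": "0.4.4",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if missing."""
    try:
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Python 3.6/3.7 has no importlib.metadata; fall back to setuptools.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def check_versions(expected, lookup=installed_version):
    """Map each package name to (expected, installed, versions_match)."""
    report = {}
    for name, wanted in expected.items():
        found = lookup(name)
        report[name] = (wanted, found, found == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in check_versions(EXPECTED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name:16s} expected {wanted:8s} installed {found}  {status}")
```

The comparison is an exact string match, so a build-tagged version (for instance a PyTorch version with a CUDA suffix) would be reported as a mismatch even if the base version is the same.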

