This repository contains the source code and datasets for Language Models as Semantic Indexers, ICML 2024.
The code is written in Python 3.8. Before running it, install the required packages with the following command (using a virtual environment is recommended):
pip3 install -r requirements.txt
LMIndexer is a self-supervised framework that learns to tokenize documents into semantic IDs.
LMIndexer can be applied to various downstream tasks, including recommendation and retrieval.
Download processed data. To reproduce the results in our paper, first download the processed datasets. Then put the dataset folders under data/rec-data/{data_name} (data_name=Beauty, Sports, Toys) and data/retrieval-data/{data_name} (data_name=NQ_aug, macro), respectively.
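Before launching training, it can help to confirm that the dataset folders are where the scripts expect them. The following is a minimal sketch, not part of the repository: the folder names come from the layout described above, while the helper names `expected_dirs` and `missing_dirs` and the default `data` root are assumptions made for illustration.

```python
from pathlib import Path

# Folder names as listed in this README; the dict and helper names
# below are hypothetical and not part of the LMIndexer codebase.
EXPECTED = {
    "rec-data": ["Beauty", "Sports", "Toys"],
    "retrieval-data": ["NQ_aug", "macro"],
}


def expected_dirs(root="data"):
    """Return every dataset directory the README asks for."""
    return [
        Path(root) / group / name
        for group, names in EXPECTED.items()
        for name in names
    ]


def missing_dirs(root="data"):
    """Return the expected dataset directories that do not exist yet."""
    return [path for path in expected_dirs(root) if not path.is_dir()]


if __name__ == "__main__":
    # Run from the repository root (assumed to contain the data/ folder).
    for path in missing_dirs():
        print(f"missing: {path}")
```

Running the script from the repository root prints one line per absent folder and prints nothing when all five are in place.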
Raw data & data processing. Raw data can be downloaded directly from Amazon-Recommendation, Amazon-Retrieval, NQ and MS MARCO. More details about the data processing for recommendation, product retrieval and document retrieval can be found here.
The code is in SemanticID/. Please refer to the README.md here.
The code is in downstream/. Please refer to the README.md here.
Please cite the following paper if you find the code helpful for your research.
@article{jin2023language,
title={Language Models As Semantic Indexers},
author={Jin, Bowen and Zeng, Hansi and Wang, Guoyin and Chen, Xiusi and Wei, Tianxin and Li, Ruirui and Wang, Zhengyang and Li, Zheng and Li, Yang and Lu, Hanqing and others},
journal={arXiv preprint arXiv:2310.07815},
year={2023}
}