Language Models as Semantic Indexers

This repository contains the source code and datasets for Language Models as Semantic Indexers, ICML 2024.

Links

Requirements
Overview
Data Preparation
Learn Semantic IDs
Downstream Tasks
Citations

Requirements

The code is written in Python 3.8. Before running it, install the required packages with the following command (using a virtual environment is recommended):

pip3 install -r requirements.txt

Overview

LMIndexer is a self-supervised framework that learns to tokenize documents into semantic IDs.

LMIndexer can be applied to various downstream tasks, including recommendation and retrieval.

Data Preparation

Download processed data. To reproduce the results in our paper, first download the processed datasets. Then put the recommendation dataset folders under data/rec-data/{data_name} (data_name=Beauty, Sports, Toys) and the retrieval dataset folders under data/retrieval-data/{data_name} (data_name=NQ_aug, macro).
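Before launching any training, it can help to confirm that the downloaded folders landed in the layout described above. The following is a minimal sketch, not part of the repository: the function name `missing_dataset_dirs` and the constant `EXPECTED_LAYOUT` are hypothetical, and the script only checks that the folders exist, not what they contain.

```python
from pathlib import Path
from typing import Dict, List

# Folder names copied from the data preparation instructions.
# This mapping is an illustrative assumption, not a file shipped with the repo.
EXPECTED_LAYOUT: Dict[str, List[str]] = {
    "data/rec-data": ["Beauty", "Sports", "Toys"],
    "data/retrieval-data": ["NQ_aug", "macro"],
}


def missing_dataset_dirs(repo_root: str = ".") -> List[Path]:
    """Return the expected dataset folders that do not exist under repo_root."""
    root = Path(repo_root)
    missing = []
    for parent, names in EXPECTED_LAYOUT.items():
        for name in names:
            folder = root / parent / name
            if not folder.is_dir():
                missing.append(folder)
    return missing


if __name__ == "__main__":
    # Run from the repository root.
    missing = missing_dataset_dirs(".")
    if missing:
        print("Missing dataset folders:")
        for folder in missing:
            print(f"  {folder}")
    else:
        print("All expected dataset folders are present.")
```

Run from the repository root, the script lists any dataset folder that is still missing.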

Raw data & data processing. Raw data can be downloaded directly from Amazon-Recommendation, Amazon-Retrieval, NQ and MS MARCO. More details about the data processing for recommendation, product retrieval and document retrieval can be found here.

Learn Semantic IDs

The code is in SemanticID/. Please refer to the README.md in that folder.

Downstream Tasks

The code is in downstream/. Please refer to the README.md in that folder.

Citations

Please cite the following paper if you find the code helpful for your research.

@article{jin2023language,
  title={Language Models As Semantic Indexers},
  author={Jin, Bowen and Zeng, Hansi and Wang, Guoyin and Chen, Xiusi and Wei, Tianxin and Li, Ruirui and Wang, Zhengyang and Li, Zheng and Li, Yang and Lu, Hanqing and others},
  journal={arXiv preprint arXiv:2310.07815},
  year={2023}
}

PeterGriffinJin/LMIndexer

Folders and files

Latest commit

History

Repository files navigation

Language Models as Semantic Indexers

Links

Requirements

Overview

Data Preparation

Learn Semantic IDs

Downstream Tasks

Citations

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages