This repository contains the source code and datasets for Patton: Language Model Pretraining on Text-rich Networks, published in ACL 2023.
The code is written in Python 3.8. Before running it, install the required packages with the following command (using a virtual environment is recommended):
pip3 install -r requirements.txt
Patton is a framework to pretrain language models on text-rich networks, with two strategies: network-contextualized masked language modeling and masked node prediction.
Download processed data. To reproduce the results in our paper, first download the processed datasets. Then extract the data files with
tar -xf data.tar.gz
Create a new ckpt/ folder for saving checkpoints and a new logs/ folder for saving logs.
mkdir ckpt
mkdir logs
Raw data & data processing. Raw data can be downloaded from MAG and Amazon directly. You can also find our data processing code here. It may be useful if you want to obtain processed datasets (for both pretraining and finetuning) for other networks in MAG and Amazon.
Use your own dataset. To pretrain Patton on your own data, you need to prepare the pretraining files: train.tsv, val.tsv, test.tsv. In the three files, each row represents a linked node pair:
{
"q_text": (str) node_1 associated text,
"k_text": (str) node_2 associated text,
"q_n_text": (List(str)) node_1 neighbors' associated text,
"k_n_text": (List(str)) node_2 neighbors' associated text,
}
Please refer to the files in our processed dataset for detailed format information.
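As a rough illustration of the record layout described above, the following sketch builds and checks one linked node pair. Only the four field names come from this README. The helper names (make_pair, validate_pair, write_pairs) are hypothetical, and writing one JSON object per row is an assumption, so check the files in the processed dataset for the actual serialization before relying on it.

```python
import json

# Field names come from the format description above. The helper names and
# the one-JSON-object-per-row serialization are assumptions for illustration.
STR_FIELDS = ("q_text", "k_text")
LIST_FIELDS = ("q_n_text", "k_n_text")


def make_pair(q_text, k_text, q_neighbors, k_neighbors):
    """Build one linked-node-pair record with the four documented fields."""
    return {
        "q_text": q_text,
        "k_text": k_text,
        "q_n_text": list(q_neighbors),
        "k_n_text": list(k_neighbors),
    }


def validate_pair(record):
    """Return a list of problems found in a record (empty list means OK)."""
    problems = []
    for name in STR_FIELDS:
        if not isinstance(record.get(name), str):
            problems.append(f"{name} must be a str")
    for name in LIST_FIELDS:
        value = record.get(name)
        if not isinstance(value, list) or not all(isinstance(t, str) for t in value):
            problems.append(f"{name} must be a list of str")
    return problems


def write_pairs(path, records):
    """Write one JSON object per row (assumed serialization, see note above)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            problems = validate_pair(record)
            if problems:
                raise ValueError(f"invalid record: {problems}")
            f.write(json.dumps(record) + "\n")
```

Calling validate_pair on every record before writing catches missing or mistyped fields early, which is cheaper than discovering them partway through a pretraining run.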
We also provide pre-tokenization code here to improve pretraining/finetuning efficiency.
Pretraining Patton starting from bert-base-uncased.
bash run_pretrain.sh
Pretraining SciPatton starting from scibert-base-uncased.
bash run_pretrain_sci.sh
Change $PROJ_DIR to your project directory. We support both single-GPU and multi-GPU training.
You can directly download our pretrained checkpoints here. Then extract the checkpoint files by
tar -xf pretrained_ckpt.tar.gz
bash nc_class_train.sh
bash nc_class_test.sh
Change $STEP to the step with the highest validation set performance.
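Choosing the value for $STEP can be sketched as follows. This assumes you have already collected a mapping from checkpoint step to validation metric (higher is better), for example by reading your training logs. The function name best_step and the numbers are hypothetical, and the log format is not specified here, so parsing is left out.

```python
def best_step(val_scores):
    """Return the step with the highest validation score.

    val_scores maps checkpoint step -> validation metric (higher is better).
    Ties are broken in favor of the earliest step.
    """
    if not val_scores:
        raise ValueError("no validation scores given")
    # max() returns the first maximal element, so iterating over the sorted
    # steps makes the earliest step win when scores are tied.
    return max(sorted(val_scores), key=lambda step: val_scores[step])


# Hypothetical numbers, for illustration only.
scores = {1000: 0.71, 2000: 0.78, 3000: 0.76}
print(f"STEP={best_step(scores)}")
```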
Run bm25 to prepare hard negatives.
cd bm25/
bash bm25.sh
Prepare data for retrieval.
cd src/
bash nc_retrieve_gen_bm25neg.sh
bash build_train.sh
Run retrieval train.
bash nc_retrieve_train.sh
Run retrieval test.
bash nc_infer.sh
bash nc_retrieval.sh
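To make the idea of BM25 hard negatives in the first retrieval step more concrete, here is a minimal sketch of standard Okapi BM25 scoring, followed by selecting the top-scoring documents that are not known positives. It is not the code behind bm25.sh: the function names, whitespace tokenization, and the parameters k1=1.5 and b=0.75 are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each whitespace-tokenized document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for t in tokenized for term in set(t))
    query_terms = set(query.lower().split())
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def hard_negatives(query, docs, positive_ids, top_k=2):
    """Return indices of the top-scoring documents that are not known positives."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: -scores[i])
    return [i for i in ranked if i not in positive_ids and scores[i] > 0][:top_k]
```

Documents that share many terms with the query but are not true matches are "hard" because a lexical scorer cannot tell them apart from positives, which is what makes them useful as training negatives for a dense retriever.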
Prepare data for reranking.
bash scripts/match.sh
Run reranking train.
bash nc_rerank_train.sh
Run reranking test.
bash nc_rerank_test.sh
Run link prediction train.
bash lp_train.sh
Run link prediction test.
bash lp_test.sh
Please cite the following paper if you find the code helpful for your research.
@inproceedings{jin2023patton,
title={Patton: Language Model Pretraining on Text-Rich Networks},
author={Jin, Bowen and Zhang, Wentao and Zhang, Yu and Meng, Yu and Zhang, Xinyang and Zhu, Qi and Han, Jiawei},
booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics},
year={2023}
}
Some parts of our code are adapted from the tevatron repository. Huge thanks to the contributors of the amazing repository!
$CODE_DIR
├── ckpt
├── data
│   ├── amazon
│   │   ├── cloth
│   │   ├── home
│   │   └── sports
│   └── MAG
│       ├── CS
│       ├── Geology
│       └── Mathematics
├── src
│   ├── OpenLP
│   │   ├── __init__.py
│   │   ├── __pycache__
│   │   ├── arguments.py
│   │   ├── dataset
│   │   ├── driver
│   │   ├── loss.py
│   │   ├── models
│   │   ├── modeling.py
│   │   ├── retriever
│   │   ├── trainer
│   │   └── utils.py
│   └── scripts
│       ├── build_train.py
│       ├── build_train_ncc.py
│       ├── build_train_neg.py
│       └── bm25_neg.py
└── logs
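After extracting the data and creating the ckpt and logs folders, a quick check that the layout matches the tree above can save a failed run. The sketch below lists the expected folders that are missing. The directory names are taken from the tree, reading the dataset folders as nested under amazon and MAG; the function name missing_dirs is hypothetical.

```python
from pathlib import Path

# Folders as listed in the directory tree above, relative to $CODE_DIR.
EXPECTED_DIRS = [
    "ckpt",
    "logs",
    "data/amazon/cloth",
    "data/amazon/home",
    "data/amazon/sports",
    "data/MAG/CS",
    "data/MAG/Geology",
    "data/MAG/Mathematics",
    "src/OpenLP",
    "src/scripts",
]


def missing_dirs(code_dir):
    """Return the expected folders that do not exist under code_dir."""
    root = Path(code_dir)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # "." assumes the script is run from the project root.
    for d in missing_dirs("."):
        print(f"missing: {d}")
```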