This repository contains the code and models for the paper "AugTriever: Unsupervised Dense Retrieval by Scalable Data Augmentation".
Our code is based on the following repositories:
- SimCSE released with the SimCSE paper.
- Contriever released with the Contriever paper.
- MoCo released with the MoCo paper.
- DPR released with the DPR paper.
- SPIDER released with the SPIDER paper.
- MTEB released with the MTEB paper.
- BEIR released with the BEIR paper.
- SentEval released with the SentEval paper.
- E5 released with the E5 paper.
- rank_bm25
AugQ-Wiki and AugQ-CC can be downloaded from the Hugging Face Hub.
Naming corresponds to Table 1 in the paper.
Aug method | Model | MM | BEIR (14 tasks) | Download Link |
---|---|---|---|---|
Hybrid-TQGen+ | MoCo | 24.6 | 41.1 | [download] |
Hybrid-All | MoCo | 23.5 | 39.4 | [download] |
Hybrid-TQGen | MoCo | 23.3 | 39.4 | [download] |
Doc-Title | MoCo | 21.8 | 38.7 | [download] |
QExt-PLM | MoCo | 20.6 | 38.2 | [download] |
TQGen-Topic | MoCo | 21.2 | 38.9 | [download] |
TQGen-Title | MoCo | 21.8 | 39.3 | [download] |
TQGen-AbSum | MoCo | 23.2 | 39.6 | [download] |
TQGen-ExSum | MoCo | 23.0 | 39.4 | [download] |
TQGen-Topic | InBatch | 20.7 | 39.0 | [download] |
A few scripts for starting training are placed in the folder examples/training. For example:
cd $PATH_TO_REPO
sh examples/training/cc.moco.topic50.bs2048.gpu8.sh
Please refer to BEIR for data download.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_addr=127.0.0.1 --master_port=2255 eval_beir.py --model_name_or_path output_dir/augtriever-release/cc.T03b_title50.moco-2e14.contriever256-special50.bert-base-uncased.avg.dot.q128d256.step100k.bs1024.lr5e5/ --dataset fiqa --metric dot --pooling average --per_gpu_batch_size 128 --beir_data_path data/beir/ --output_dir eval_dir/beir
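The flags `--pooling average` and `--metric dot` in the command above select how token embeddings become a single vector and how query and document vectors are compared. The following is a minimal pure-Python sketch of that idea, not the code in `eval_beir.py`. It assumes that "average" means a mean over non-padding tokens, as is common for Contriever-style encoders. The function names `average_pool`, `dot_score`, and `rank` are hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def average_pool(token_embeddings: Sequence[Sequence[float]],
                 attention_mask: Sequence[int]) -> List[float]:
    """Mean of the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        # No real tokens: return the zero vector rather than dividing by zero.
        return total
    return [t / count for t in total]


def dot_score(query_vec: Sequence[float], doc_vec: Sequence[float]) -> float:
    """Unnormalized dot-product relevance score."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))


def rank(query_vec: Sequence[float],
         doc_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Document indices sorted from highest to lowest dot-product score."""
    scores = [dot_score(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy example: three token vectors, the last one is padding.
query = average_pool([[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]], [1, 1, 0])
order = rank(query, [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
```

Because the dot product is not length-normalized, vectors with larger norms can score higher than they would under cosine similarity, so the metric used at evaluation should match the one used in training.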
Please refer to Spider for details about QA data download and processing.
export EXP_DIR="output_dir/cc-hybrid.RC20+T0gen80.seed477.moco-2e14.contriever256-special50.bert-base-uncased.avg.dot.q128d256.step200k.bs2048.lr5e5/"
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port=31133 --max_restarts=0 generate_passage_embeddings.py --model_name_or_path $EXP_DIR --output_dir $EXP_DIR/embeddings --passages data/nq/psgs_w100.tsv --per_gpu_batch_size 512
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python eval_qa.py --model_name_or_path facebook/contriever --passages data/nq/psgs_w100.tsv --passages_embeddings "$EXP_DIR/embeddings/*" --qa_file "data/nq/qas/*-test.csv,data/nq/qas/entityqs/test/P*.test.json" --output_dir $EXP_DIR/qa_output --save_or_load_index
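The `--passages_embeddings` and `--qa_file` arguments above are given as glob patterns, with `--qa_file` joining several patterns by commas. Quoting such a value keeps the shell from expanding it, so the script receives the patterns as written. The sketch below shows one way a script could expand a comma-separated list of patterns using the standard `glob` module. It is an assumption about the general approach, not a copy of what `eval_qa.py` does, and `expand_file_spec` is a hypothetical helper name.

```python
import glob
from typing import List


def expand_file_spec(spec: str) -> List[str]:
    """Expand a comma-separated list of glob patterns into file paths.

    Patterns are expanded in the order given; matches of each pattern are
    sorted so the result is deterministic. Patterns with no match add nothing.
    """
    paths: List[str] = []
    for pattern in spec.split(","):
        pattern = pattern.strip()
        if not pattern:
            continue
        paths.extend(sorted(glob.glob(pattern)))
    return paths


# Same pattern string as in the command above; returns [] if the data is absent.
qa_files = expand_file_spec(
    "data/nq/qas/*-test.csv,data/nq/qas/entityqs/test/P*.test.json"
)
```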
Convert to Hugging Face BERT model
python convert_checkpoint_to_hf_bert.py --ckpt_path output_dir/cc.T03b_topic.inbatch.contriever256-special.bert-base-uncased.avg.dot.q128d256.step100k.bs1024.lr5e5 --output_dir output_dir/cc.T03b_topic.inbatch.contriever256-special.bert-base-uncased.avg.dot.q128d256.step100k.bs1024.lr5e5/hf_ckpt_bert --model_type shared
Convert to Hugging Face DPR model
python convert_checkpoint_to_hf_dpr.py --ckpt_path output_dir/cc-hybrid.RC20+T0gen80.seed477.moco-2e14.contriever256-special50.bert-base-uncased.avg.dot.q128d256.step200k.bs2048.lr5e5 --output_dir output_dir/cc-hybrid.RC20+T0gen80.seed477.moco-2e14.contriever256-special50.bert-base-uncased.avg.dot.q128d256.step200k.bs2048.lr5e5/hf_ckpt_dpr --model_type shared
Replace the experiment path in gather_score_beir.py, gather_score_qa.py, or gather_score_senteval.py and run the script. For example:
python gather_score_beir.py
AugTriever is licensed under the BSD 3-Clause License.
Evaluation code forked from external repositories is placed in subfolders (e.g. src/beir, src/beireval, src/mteb, src/mtebeval, src/qa, src/senteval). Please refer to the LICENSE in each subfolder for its copyright information.
If you find the AugTriever code or models useful, please cite it by using the following BibTeX entry.
@article{meng2022augtriever,
title={AugTriever: Unsupervised Dense Retrieval by Scalable Data Augmentation},
author={Meng, Rui and Liu, Ye and Yavuz, Semih and Agarwal, Divyansh and Tu, Lifu and Yu, Ning and Zhang, Jianguo and Bhat, Meghana and Zhou, Yingbo},
journal={arXiv preprint arXiv:2212.08841},
year={2022}
}