Construct the KILT++ dataset from the Wikipedia knowledge source and the original KILT dataset.
python utils/construct_dataset.py
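Each line of the KILT knowledge source is a standalone JSON object, so the inputs and constructed splits can be inspected directly. A quick look at the input format (the path is hypothetical; point it at your local copy):

```python
import json

# Peek at the first few documents of the KILT knowledge source.
# Each line is a JSON object with wikipedia_id, wikipedia_title, and text fields.
with open("data/kilt_knowledgesource.json") as f:
    for i, line in enumerate(f):
        doc = json.loads(line)
        print(doc["wikipedia_id"], doc["wikipedia_title"])
        if i == 2:
            break
```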
Construct the prefix tree for D0 to D4.
python utils/construct_trie.py
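The prefix tree constrains beam search so that the model can only emit valid document identifiers, in the style of GENRE. A minimal sketch of such a trie over token-id sequences; the script builds (and typically pickles) an equivalent structure, though the details may differ:

```python
class Trie:
    """Prefix tree over token-id sequences, used to restrict decoding
    to valid document identifiers during constrained beam search."""

    def __init__(self, sequences=()):
        self.children = {}
        for seq in sequences:
            self.add(seq)

    def add(self, seq):
        node = self.children
        for tok in seq:
            node = node.setdefault(tok, {})

    def get(self, prefix):
        """Return the token ids allowed after `prefix` (empty if none)."""
        node = self.children
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return list(node.keys())
```

Storing children as plain dicts keeps lookups O(1) per token and makes the structure straightforward to pickle.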
Train the backbone model in the initial phase with D0 and R0. For this procedure, please refer to the CorpusBrain repository.
Note that the backbone is trained with fairseq for efficiency; use the following script to convert the fairseq checkpoint to the Hugging Face format.
python utils/convert_fairseq_huggingface.py [fairseq_path]
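Once converted, the checkpoint should load with the standard Hugging Face API. A minimal sanity check, assuming a BART-large backbone (the architecture used by CorpusBrain) and a hypothetical output directory:

```python
from transformers import BartForConditionalGeneration, BartTokenizer

# Hypothetical output directory of the conversion script.
model = BartForConditionalGeneration.from_pretrained("checkpoints/hf_backbone")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
print(sum(p.numel() for p in model.parameters()), "parameters loaded")
```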
Select replay instances via k-means clustering.
python replay/kmeans.py
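One plausible selection scheme, sketched under the assumption that replay instances are the examples nearest each k-means centroid in an embedding space (the actual script may use a different criterion):

```python
import numpy as np
from sklearn.cluster import KMeans

def select_replay(embeddings: np.ndarray, k: int) -> list[int]:
    """Return indices of k exemplars, one nearest each cluster centroid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    picks = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        picks.append(int(members[np.argmin(dists)]))
    return picks
```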
Generate query-document pairs for each specific task.
python tasks/[specific_task]/generate.py
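CorpusBrain-style pre-training builds pseudo query-document pairs directly from the documents, e.g. treating a sampled passage as a query for its page title. A rough sketch of one such pair generator, using KILT paragraphs as a stand-in for proper sentence splitting; the per-task scripts may construct pairs differently:

```python
import random

def pseudo_query_pairs(doc: dict, n: int = 2) -> list[tuple[str, str]]:
    """Pair sampled body passages (pseudo-queries) with the page title."""
    passages = [p.strip() for p in doc["text"] if p.strip()]
    random.shuffle(passages)
    return [(p, doc["wikipedia_title"]) for p in passages[:n]]
```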
Continually pre-train the adapters with the backbone parameters frozen.
python train_adapter.py --task [task] --batch_size [batch_size] --config_file [config_file] --save_name [save_name] --lr [learning_rate] --max_steps [max_steps] --grad_acc [grad_acc] --eval_steps [eval_steps] --load_adapter_path [load_adapter_path]
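During continual pre-training, only the adapter parameters receive gradients. A minimal sketch of the freezing step; the substring "adapter" in parameter names is an assumption about the module naming, and the checkpoint path is hypothetical:

```python
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("checkpoints/hf_backbone")

# Freeze the backbone; leave only adapter parameters trainable
# (assumes adapter modules carry "adapter" in their parameter names).
for name, param in model.named_parameters():
    param.requires_grad = "adapter" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable}/{total} ({100 * trainable / total:.2f}%)")
```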
Evaluate the continually pre-trained model on all tasks.
bash scripts/eval_all.sh
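Evaluation restricts decoding to valid identifiers via the prefix tree. A sketch of constrained generation with the Hugging Face prefix_allowed_tokens_fn hook, assuming the pickled trie from utils/construct_trie.py exposes a get(prefix) -> allowed-token-ids interface (paths are hypothetical):

```python
import pickle
from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained("checkpoints/hf_backbone")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
with open("data/trie.pkl", "rb") as f:
    trie = pickle.load(f)

inputs = tokenizer("Who wrote Hamlet?", return_tensors="pt")
out = model.generate(
    **inputs,
    num_beams=5,
    max_length=32,
    # Only token ids stored in the trie may follow the current prefix.
    prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```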
@article{guo2024corpusbrain++,
  title={CorpusBrain++: A Continual Generative Pre-Training Framework for Knowledge-Intensive Language Tasks},
  author={Guo, Jiafeng and Zhou, Changjiang and Zhang, Ruqing and Chen, Jiangui and de Rijke, Maarten and Fan, Yixing and Cheng, Xueqi},
  journal={arXiv preprint arXiv:2402.16767},
  year={2024}
}