CorpusBrain++: A Continual Generative Pre-Training Framework for Knowledge-Intensive Language Tasks

(Figures: framework illustration and model architecture.)

Data construction

Construct KILT++ from the Wikipedia knowledge source and the original KILT dataset.

python utils/construct_dataset.py
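
The construction starts from standard KILT jsonl files. As a rough sketch of that format (illustrative reader code, not the repo's; the filename is a placeholder), each line carries an "input" query and "output" provenance listing the gold Wikipedia pages:

```python
import json

def load_kilt_pairs(path):
    """Yield (query, gold Wikipedia page id) pairs from a KILT-format
    jsonl file, following the public KILT schema."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            query = record["input"]
            for output in record.get("output", []):
                for prov in output.get("provenance", []):
                    yield query, prov["wikipedia_id"]

# Placeholder filename; any KILT task split uses the same schema.
for query, page_id in list(load_kilt_pairs("nq-train-kilt.jsonl"))[:3]:
    print(page_id, query)
```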

Construct prefix tree

Construct a prefix tree for each of the document collections D0 to D4.

python utils/construct_trie.py
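
The resulting trie constrains beam search so the model can only emit valid document identifiers. A minimal sketch of the idea, in the style of GENRE-like constrained decoding (the `Trie` class and toy title list are illustrative; the repo's structure is built by `utils/construct_trie.py`):

```python
from transformers import BartTokenizer

class Trie:
    """Minimal prefix tree over token-id sequences (illustrative)."""
    def __init__(self):
        self.children = {}

    def add(self, token_ids):
        node = self.children
        for tok in token_ids:
            node = node.setdefault(tok, {})

    def allowed(self, prefix_ids):
        """Token ids that may follow the given prefix; [] if invalid."""
        node = self.children
        for tok in prefix_ids:
            if tok not in node:
                return []
            node = node[tok]
        return list(node.keys())

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
trie = Trie()
for title in ["Albert Einstein", "Alan Turing"]:  # toy identifier set
    trie.add(tokenizer(title).input_ids)

# Hugging Face generate() can then restrict decoding to valid titles:
# model.generate(**inputs,
#                prefix_allowed_tokens_fn=lambda _, ids: trie.allowed(ids.tolist()))
```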

Train backbone model

Train the backbone model in the initial phase with D0 and R0.

For this procedure, please refer to the CorpusBrain repository.

Note that the backbone is trained with fairseq for efficiency; use the following script to convert the fairseq checkpoint to a Hugging Face version.

python utils/convert_fairseq_huggingface.py [fairseq_path]
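
For orientation (this describes standard fairseq checkpoints, not this repo's specifics): a fairseq checkpoint is a pickled dict whose `model` entry holds the state dict, and the conversion script essentially remaps those parameter names onto the Hugging Face BART layout. A quick way to inspect one, with a placeholder path:

```python
import torch

# Standard fairseq checkpoints store the weights under the "model" key.
ckpt = torch.load("checkpoint_best.pt", map_location="cpu")
state_dict = ckpt["model"]
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```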

Revisit old documents

python replay/kmeans.py
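
This replay step revisits representative old documents so earlier sessions are not forgotten as new ones arrive. A minimal sketch of k-means exemplar selection, assuming document embeddings are already computed (`select_exemplars` is an illustrative helper, not the repo's API):

```python
import numpy as np
from sklearn.cluster import KMeans

def select_exemplars(doc_embeddings, n_clusters=10):
    """Return the index of the document closest to each k-means centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(doc_embeddings)
    exemplars = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(
            doc_embeddings[members] - km.cluster_centers_[c], axis=1)
        exemplars.append(int(members[np.argmin(dists)]))
    return exemplars

# Usage with random placeholder embeddings:
emb = np.random.randn(1000, 768).astype("float32")
print(select_exemplars(emb, n_clusters=5))
```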

Pre-training tasks

Generate query-document pairs for each specific task.

python tasks/[specific_task]/generate.py
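
The pre-training tasks derive pseudo query-document pairs directly from the corpus, with no human labels. As one illustrative example in that spirit (the helper and its sampling are assumptions, not the repo's implementation): sample body sentences as pseudo-queries whose target is the page title.

```python
import random

def inner_sentence_pairs(title, sentences, k=3):
    """Sample up to k sentences from a page body as pseudo-queries
    whose generation target is the page title (illustrative)."""
    k = min(k, len(sentences))
    return [(s, title) for s in random.sample(sentences, k)]

pairs = inner_sentence_pairs(
    "Alan Turing",
    ["Turing was a pioneering computer scientist.",
     "He formalised the concepts of algorithm and computation.",
     "Turing worked at Bletchley Park during World War II."],
    k=2,
)
for query, target in pairs:
    print(f"{query!r} -> {target!r}")
```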

Continual learning

Continually pre-train the adapters with the backbone parameters frozen.

python train_adapter.py --task [task] --batch_size [batch_size] --config_file [config_file] --save_name [save_name] --lr [learning_rate] --max_steps [max_steps] --grad_acc [grad_acc] --eval_steps [eval_steps] --load_adapter_path [load_adapter_path]
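
A minimal sketch of the freezing idiom behind this step, paired with an illustrative bottleneck adapter (the repo's actual adapter setup comes from `--config_file`; the module and names here are hypothetical):

```python
import torch.nn as nn
from transformers import BartForConditionalGeneration

class BottleneckAdapter(nn.Module):
    """Residual down-project / up-project adapter (illustrative)."""
    def __init__(self, hidden_size=1024, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.ReLU()

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
for p in model.parameters():      # freeze the backbone
    p.requires_grad = False
adapter = BottleneckAdapter()     # only these weights receive gradients
trainable = [p for p in adapter.parameters() if p.requires_grad]
```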

Evaluation

bash scripts/eval_all.sh
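
KILT scores page-level retrieval with R-precision. A self-contained sketch of the metric (illustrative, not the repo's evaluation code):

```python
def r_precision(retrieved, gold):
    """Fraction of the top-R retrieved documents that are gold,
    where R is the number of gold documents."""
    r = len(gold)
    if r == 0:
        return 0.0
    return len(set(retrieved[:r]) & set(gold)) / r

print(r_precision(["doc1", "doc7", "doc3"], {"doc1", "doc3"}))  # 0.5
```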

Citation

@article{guo2024corpusbrain++,
  title={CorpusBrain++: A Continual Generative Pre-Training Framework for Knowledge-Intensive Language Tasks},
  author={Guo, Jiafeng and Zhou, Changjiang and Zhang, Ruqing and Chen, Jiangui and de Rijke, Maarten and Fan, Yixing and Cheng, Xueqi},
  journal={arXiv preprint arXiv:2402.16767},
  year={2024}
}
