
BIRDIE: Natural Language-Driven Table Discovery Using Differentiable Search Index

BIRDIE is an effective NL-driven table discovery framework built on a differentiable search index. BIRDIE first assigns each table a prefix-aware identifier and leverages a large language model-based query generator to create synthetic queries for each table. It then encodes the mapping between synthetic queries/tables and their corresponding table identifiers into the parameters of an encoder-decoder language model, enabling deep query-table interactions. During search, the trained model directly generates table identifiers for a given query. To accommodate continual indexing of dynamic tables, we introduce an index update strategy via parameter isolation, which mitigates catastrophic forgetting.
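As a concrete illustration of the indexing and search idea, here is a minimal sketch assuming a T5-style encoder-decoder from HuggingFace Transformers; the (query, table identifier) pair is hypothetical, and BIRDIE's actual prefix-aware identifier assignment and LLM-based query generation are implemented in this repository's code.

# Minimal sketch of the indexing/search idea; not the repository's actual code.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Indexing: learn to map a synthetic query (or serialized table) to its table identifier.
query, table_id = "average attendance of premier league 2019", "3 1 4"  # hypothetical pair
inputs = tokenizer(query, return_tensors="pt")
labels = tokenizer(table_id, return_tensors="pt").input_ids
loss = model(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, labels=labels).loss
loss.backward()  # one step of the indexing objective

# Search: the trained model decodes table identifiers directly from the user query.
with torch.no_grad():
    out = model.generate(**tokenizer("which league had the highest attendance?", return_tensors="pt"),
                         num_beams=5, num_return_sequences=5, max_length=16)
print(tokenizer.batch_decode(out, skip_special_tokens=True))  # top-5 candidate table identifiers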

Requirements

  • Python 3.7
  • PyTorch 1.10.1
  • CUDA 11.5
  • NVIDIA RTX 4090 GPUs

Please refer to the source code to install all required Python packages.
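To verify the environment before launching training, a small check like the one below can help; it only assumes the versions listed above and is not part of the repository.

import sys
import torch

print(sys.version.split()[0])     # expect 3.7.x
print(torch.__version__)          # expect 1.10.1 (a CUDA 11.5 build, e.g. 1.10.1+cu115)
print(torch.cuda.is_available())  # True if the CUDA driver and GPUs are visible
print(torch.cuda.device_count())  # number of visible GPUs (e.g. 6 for the multi-GPU commands below)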

Datasets

We use two benchmark datasets, NQ-Tables and FetaQA, which were used in previous studies.

Run Experimental Case

Scenario I: Indexing from scratch

  • Train the model to index the tables in the repository
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python3 -m torch.distributed.launch --nproc_per_node=6 run.py --task "Index" --train_file "./dataset/fetaqa/train.json" --valid_file "./dataset/fetaqa/test.json" --gradient_accumulation_steps 6 --max_steps 8000 --run_name "feta" --output_dir "./model/feta"
  • Search using the trained model (a sketch of constrained identifier decoding follows this list)
CUDA_VISIBLE_DEVICES=0 python3 -m torch.distributed.launch --nproc_per_node=1 run.py --task "Search" --train_file "./dataset/fetaqa/train.json" --valid_file "./dataset/fetaqa/test.json" --base_model_path "./model/feta/checkpoint-8000" --output_dir "./model/feta"

Scenario II: Index Update

  • Train the model M0 on the repository D0
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python3 -m torch.distributed.launch --nproc_per_node=6 run_cont.py --task "Index" --train_file "./dataset/fetaqa_inc/train_0.json" --valid_file "./dataset/fetaqa_inc/test_0.json" --gradient_accumulation_steps 6 --max_steps 7000 --run_name "feta_inc0" --output_dir "./model/feta_inc0" 
  • Train a memory unit L1 to index D1 on top of the model M0 using LoRA (a minimal LoRA parameter-isolation sketch follows this list)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python3 -m torch.distributed.launch --nproc_per_node=6 run_cont.py --task "Index" --base_model_path "./model/feta_inc0/checkpoint-7000" --train_file "./dataset/fetaqa_inc/train_1.json" --valid_file "./dataset/fetaqa_inc/test_1.json" --peft True --gradient_accumulation_steps 6 --max_steps 4000 --run_name "feta_LoRA_d1" --output_dir "./model/feta_LoRA_d1"
  • Search tables using the model M0 and the plug-and-play LoRA memory L1
CUDA_VISIBLE_DEVICES=0 python3 -m torch.distributed.launch --nproc_per_node=1 run_cont.py --task "Search" --train_file  --valid_file "./dataset/fetaqa_inc/test_0+1.json" --LoRA_1 "./model/feta_LoRA_d1/checkpoint-4000" --partition_0 "./dataset/fetaqa_inc/train_0.json" --partition_1 "./dataset/fetaqa_inc/train_1.json" --output_dir "./model/feta_LoRA_d1"

Acknowledgments

The original datasets are from NQ-Tables and FetaQA.

We thank the previous studies working in this direction, such as Solo and DSI-QG.
