update README & clean code

weizhepei · weizhepei · commit ac8a2873e36c · 2024-06-24T06:49:48.000-04:00
diff --git a/.gitignore b/.gitignore
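@@ -1,5 +1,10 @@
 # Data Files
-dataset/
+dataset/*
+!dataset/README.md
+# Re-include each dataset subdirectory; git cannot un-ignore a file whose parent directory is ignored
+!dataset/*/
+dataset/*/*
+!dataset/*/demos.json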
 saved_checkpoints/
 wandb/
 qa_results/
diff --git a/README.md b/README.md
@@ -1 +1,94 @@
-# Coming soon...
+<h1 align="center">
+InstructRAG 
+</h1>
+
+<h3 align="center">
+Instructing Retrieval-Augmented Generation with Explicit Denoising <br>
+[<a href="https://arxiv.org/abs/2406.13629">arXiv</a>] [<a href="https://arxiv.org/abs/2406.13629">Website</a>] [<a href="https://huggingface.co/meng-lab/TriviaQA-InstructRAG-FT">Model</a>] [<a href="https://huggingface.co/datasets/meng-lab/InstructRAG">Dataset</a>] [<a href="https://x.com/weizhepei/status/1803992285899620837">X Summary</a>]
+</h3>
+
+InstructRAG is a simple yet effective RAG framework that allows LMs to explicitly denoise retrieved contents by generating rationales for better verifiability and trustworthiness. 
+
+![](https://weizhepei.com/instruct-rag-page/static/images/instructrag.pdf)
+
+## **InstructRAG Key Features:**
+
+- 🤖 **Self-Synthesis**: InstructRAG leverages instruction-tuned LMs to generate their OWN supervision for denoising.
+- 🔌 **Easy-to-Use**: InstructRAG can be applied in both in-context learning (ICL) and supervised fine-tuning (SFT).
+- 🚀 **Effectiveness**: Up to 8.3% better results across 5 benchmarks (Table [5](https://arxiv.org/html/2406.13629v1#S3.T5)).
+- 💪 **Noise Robustness**: InstructRAG is robust to increased noise ratios in both training-free and trainable scenarios (Figure [3](https://arxiv.org/html/2406.13629v1#S3.F3)).
+- 🔁 **Task Transferability**: InstructRAG can solve out-of-domain unseen tasks (Figure [4](https://arxiv.org/html/2406.13629v1#S3.F4)).
+
+Please see also our [paper](https://arxiv.org/abs/2406.13629) and [X summary](https://x.com/weizhepei/status/1803992285899620837) for more details.
+
+## 🔗 Quick Links
+- [InstructRAG: Instructing Retrieval-Augmented Generation with Explicit Denoising](#instructrag-key-features)
+    - [Installation](#installation)
+    - [Training Script](#training-script)
+    - [Evaluation](#evaluation)
+    - [Generation Example](#generation-example)
+    - [Model Checkpoints](#model-checkpoints)
+
+## Installation
+The following script creates a conda environment named `instrag` and installs the required packages.
+```shell
+bash setup.sh
+```
+
+Alternatively, create the conda environment directly from the provided configuration file.
+
+```shell
+conda env create -f environment.yml
+```
+
+## Training Script
+To train the model (i.e., InstructRAG-FT), activate the environment and run the training script below. The training config targets 4xH100 80GB GPUs. Adjust `NUM_DEVICE` and `PER_DEVICE_BATCH_SIZE` in `train.sh` to match your hardware; the gradient accumulation steps are derived from them so that the effective batch size stays at `TOTAL_BATCH_SIZE`.
+
+```shell
+conda activate instrag
+bash train.sh
+```
+## Evaluation
+There are two instantiations of our framework:
+- InstructRAG-ICL: training-free & easy-to-adapt
+- InstructRAG-FT: trainable & better performance
+
+Use the following script to evaluate InstructRAG in both training-free and trainable settings. You can specify the task and model by adjusting DATASET and MODEL in `eval.sh`.
+
+```shell
+conda activate instrag
+bash eval.sh
+```
+
+
+## Generation Example
+
+The following case study shows that InstructRAG can effectively identify relevant information from noisy input and leverage its own knowledge to correctly answer questions when required. The red texts denote irrelevant or inaccurate model generations, while the green texts denote contents relevant to the question. 
+
+![](https://weizhepei.com/instruct-rag-page/static/images/case_study.pdf)
+
+
+## Model Checkpoints
+Below is the full list of InstructRAG models fine-tuned on each dataset in our work.
+
+| Dataset | HF Model Repo | Retriever |
+|------------------------------|-----------------------------------------------------------------------------------------------------------|:------:|
+| PopQA | [meng-lab/PopQA-InstructRAG-FT](https://huggingface.co/meng-lab/PopQA-InstructRAG-FT) | Contriever |
+| TriviaQA | [meng-lab/TriviaQA-InstructRAG-FT](https://huggingface.co/meng-lab/TriviaQA-InstructRAG-FT) | Contriever |
+| Natural Questions | [meng-lab/NaturalQuestions-InstructRAG-FT](https://huggingface.co/meng-lab/NaturalQuestions-InstructRAG-FT) | DPR |
+| ASQA | [meng-lab/ASQA-InstructRAG-FT](https://huggingface.co/meng-lab/ASQA-InstructRAG-FT) | GTR |
+| 2WikiMultiHopQA | [meng-lab/2WikiMultiHopQA-InstructRAG-FT](https://huggingface.co/meng-lab/2WikiMultiHopQA-InstructRAG-FT) | BM25 |
+
+## Bugs or Questions?
+If you have any questions about the code or the paper, feel free to email Zhepei (zhepei.wei@virginia.edu). If you run into problems using the code or want to report a bug, please open an issue and describe the problem in detail so we can help you quickly.
+
+## Citation
+Please cite our paper if you find the repo helpful in your work:
+
+```bibtex
+@article{wei2024instructrag,
+  title={{InstructRAG}: Instructing Retrieval-Augmented Generation with Explicit Denoising},
+  author={Wei, Zhepei and Chen, Wei-Lin and Meng, Yu},
+  year={2024}
+}
+```
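The checkpoint table in the README above can also be kept as a small lookup, which may be handy when scripting evaluation over several datasets. This is a minimal sketch: the repo IDs and retrievers are copied from the table, while the names `CHECKPOINTS` and `checkpoint_url` are hypothetical, and the keys follow the table's dataset labels, which may differ from the `DATASET` values the scripts expect.

```python
# Dataset label -> (Hugging Face model repo, retriever), copied from the README table.
CHECKPOINTS = {
    "PopQA": ("meng-lab/PopQA-InstructRAG-FT", "Contriever"),
    "TriviaQA": ("meng-lab/TriviaQA-InstructRAG-FT", "Contriever"),
    "Natural Questions": ("meng-lab/NaturalQuestions-InstructRAG-FT", "DPR"),
    "ASQA": ("meng-lab/ASQA-InstructRAG-FT", "GTR"),
    "2WikiMultiHopQA": ("meng-lab/2WikiMultiHopQA-InstructRAG-FT", "BM25"),
}


def checkpoint_url(dataset: str) -> str:
    """Return the Hugging Face URL of the fine-tuned checkpoint for a dataset label."""
    if dataset not in CHECKPOINTS:
        raise KeyError(f"unknown dataset {dataset!r}; expected one of {sorted(CHECKPOINTS)}")
    repo, _retriever = CHECKPOINTS[dataset]
    return f"https://huggingface.co/{repo}"


if __name__ == "__main__":
    for name, (repo, retriever) in CHECKPOINTS.items():
        print(f"{name}: {repo} (retriever: {retriever})")
```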
diff --git a/dataset/README.md b/dataset/README.md
@@ -0,0 +1,5 @@
+## Dataset
+The datasets (augmented with retrieved documents) used in our work can be downloaded from our HF dataset repo: [meng-lab/InstructRAG](https://huggingface.co/datasets/meng-lab/InstructRAG).
+
+
+Please refer to the [generate_rationale.sh](../generate_rationale.sh) script for detailed instructions on preparing data with your own corpus.
diff --git a/environment.yml b/environment.yml
@@ -0,0 +1,128 @@
+name: instrag
+channels:
+  - defaults
+dependencies:
+  - _libgcc_mutex=0.1=main
+  - _openmp_mutex=5.1=1_gnu
+  - bzip2=1.0.8=h5eee18b_6
+  - ca-certificates=2024.3.11=h06a4308_0
+  - ld_impl_linux-64=2.38=h1181459_1
+  - libffi=3.4.4=h6a678d5_1
+  - libgcc-ng=11.2.0=h1234567_1
+  - libgomp=11.2.0=h1234567_1
+  - libstdcxx-ng=11.2.0=h1234567_1
+  - libuuid=1.41.5=h5eee18b_0
+  - ncurses=6.4=h6a678d5_0
+  - openssl=3.0.14=h5eee18b_0
+  - python=3.10.14=h955ad1f_1
+  - readline=8.2=h5eee18b_0
+  - sqlite=3.45.3=h5eee18b_0
+  - tk=8.6.14=h39e8969_0
+  - tzdata=2024a=h04d1e81_0
+  - xz=5.4.6=h5eee18b_1
+  - zlib=1.2.13=h5eee18b_1
+  - pip:
+    - accelerate==0.31.0
+    - aiosignal==1.3.1
+    - annotated-types==0.7.0
+    - anyio==4.4.0
+    - attrs==23.2.0
+    - certifi==2024.6.2
+    - charset-normalizer==3.3.2
+    - click==8.1.7
+    - cloudpickle==3.0.0
+    - cmake==3.29.6
+    - diskcache==5.6.3
+    - dnspython==2.6.1
+    - einops==0.8.0
+    - email-validator==2.2.0
+    - exceptiongroup==1.2.1
+    - fastapi==0.111.0
+    - fastapi-cli==0.0.4
+    - filelock==3.15.4
+    - flash-attn==2.5.6
+    - frozenlist==1.4.1
+    - fsspec==2024.6.0
+    - h11==0.14.0
+    - httpcore==1.0.5
+    - httptools==0.6.1
+    - httpx==0.27.0
+    - huggingface-hub==0.23.4
+    - idna==3.7
+    - interegular==0.3.3
+    - jinja2==3.1.4
+    - joblib==1.4.2
+    - jsonschema==4.22.0
+    - jsonschema-specifications==2023.12.1
+    - lark==1.1.9
+    - llvmlite==0.43.0
+    - lm-format-enforcer==0.9.8
+    - markdown-it-py==3.0.0
+    - markupsafe==2.1.5
+    - mdurl==0.1.2
+    - mpmath==1.3.0
+    - msgpack==1.0.8
+    - nest-asyncio==1.6.0
+    - networkx==3.3
+    - ninja==1.11.1.1
+    - numba==0.60.0
+    - numpy==1.26.4
+    - nvidia-cublas-cu12==12.1.3.1
+    - nvidia-cuda-cupti-cu12==12.1.105
+    - nvidia-cuda-nvrtc-cu12==12.1.105
+    - nvidia-cuda-runtime-cu12==12.1.105
+    - nvidia-cudnn-cu12==8.9.2.26
+    - nvidia-cufft-cu12==11.0.2.54
+    - nvidia-curand-cu12==10.3.2.106
+    - nvidia-cusolver-cu12==11.4.5.107
+    - nvidia-cusparse-cu12==12.1.0.106
+    - nvidia-ml-py==12.555.43
+    - nvidia-nccl-cu12==2.19.3
+    - nvidia-nvjitlink-cu12==12.5.40
+    - nvidia-nvtx-cu12==12.1.105
+    - orjson==3.10.5
+    - outlines==0.0.34
+    - packaging==24.1
+    - pip==24.0
+    - prometheus-client==0.20.0
+    - protobuf==5.27.1
+    - psutil==6.0.0
+    - py-cpuinfo==9.0.0
+    - pydantic==2.7.4
+    - pydantic-core==2.18.4
+    - pygments==2.18.0
+    - python-dotenv==1.0.1
+    - python-multipart==0.0.9
+    - pyyaml==6.0.1
+    - ray==2.30.0
+    - referencing==0.35.1
+    - regex==2024.5.15
+    - requests==2.32.3
+    - rich==13.7.1
+    - rpds-py==0.18.1
+    - safetensors==0.4.3
+    - scipy==1.13.1
+    - sentencepiece==0.2.0
+    - setuptools==69.5.1
+    - shellingham==1.5.4
+    - sniffio==1.3.1
+    - starlette==0.37.2
+    - sympy==1.12.1
+    - tiktoken==0.6.0
+    - tokenizers==0.19.1
+    - torch==2.2.1
+    - tqdm==4.66.4
+    - transformers==4.41.2
+    - triton==2.2.0
+    - typer==0.12.3
+    - typing-extensions==4.12.2
+    - ujson==5.10.0
+    - urllib3==2.2.2
+    - uvicorn==0.30.1
+    - uvloop==0.19.0
+    - vllm==0.4.1
+    - vllm-nccl-cu12==2.18.1.0.4.0
+    - watchfiles==0.22.0
+    - websockets==12.0
+    - wheel==0.43.0
+    - xformers==0.0.25
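After creating the environment, it can be useful to confirm that the key pins from `environment.yml` were actually installed. The sketch below uses only the standard library; the `PINS` subset and the `check_pins` helper are hypothetical names chosen for illustration, and the comparison is an exact string match, so a build with a local version suffix (for example a CUDA tag appended to the torch version) would be reported as a mismatch.

```python
from importlib import metadata

# A few pins copied from environment.yml; extend as needed.
PINS = {
    "torch": "2.2.1",
    "transformers": "4.41.2",
    "vllm": "0.4.1",
    "flash-attn": "2.5.6",
    "accelerate": "0.31.0",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every pin that does not match."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

Running the file inside the activated `instrag` environment prints one line per mismatched package and nothing when every pin matches.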
diff --git a/eval.sh b/eval.sh
@@ -1,13 +1,9 @@
-# conda activate instructrag
-
-export DATASET=ASQA
-export CACHE_DIR=/p/llmresearch/huggingface/hub
-MODEL=InstructRAG-ICL # [InstructRAG-FT, InstructRAG-ICL]
+DATASET=PopQA
+MODEL=InstructRAG-FT # [InstructRAG-FT, InstructRAG-ICL]
 
 CUDA_VISIBLE_DEVICES=0 python src/inference.py \
   --dataset_name $DATASET \
   --rag_model $MODEL \
   --n_docs 5 \
   --output_dir qa_results/${MODEL}/${DATASET} \
-  --cache_dir $CACHE_DIR \
   --load_local_model \
diff --git a/generate_rationale.sh b/generate_rationale.sh
@@ -1,7 +1,4 @@
-# conda activate instructrag
-
-export DATASET=ASQA
-export CACHE_DIR=/p/llmresearch/huggingface/hub
+DATASET=PopQA
 
 CUDA_VISIBLE_DEVICES=0 python src/inference.py \
   --dataset_name $DATASET \
diff --git a/requirement.txt b/requirement.txt
diff --git a/setup.sh b/setup.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+
+# Create a new conda environment with Python 3.10
+conda create -n instrag python=3.10 -y
+
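+# Load conda's shell functions (required in non-interactive scripts), then activate the environment
+source "$(conda info --base)/etc/profile.d/conda.sh" && conda activate instrag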
+
+# Install numpy, vllm, and accelerate
+pip install numpy==1.26.4 vllm==0.4.1 accelerate
+
+# Install flash-attn
+pip install flash-attn==2.5.6 --no-build-isolation
diff --git a/src/finetune.py b/src/finetune.py
@@ -94,7 +94,8 @@ class TrainingArguments(transformers.TrainingArguments):
 def main():
     parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
-
+    training_args.fsdp_config = dict(fsdp_transformer_layer_cls_to_wrap=["LlamaDecoderLayer"])
+    TrainingArguments.fsdp_config = training_args.fsdp_config
     ctx_mgr = common_utils.staggered_object_creation(
         local_rank=training_args.local_rank, world_size=training_args.world_size
     )
@@ -120,7 +121,7 @@ def main():
         truncation_side="left",
         use_fast=training_args.use_fast_tokenizer,
     )
-    
+
     tokenizer.padding = training_args.padding
     if tokenizer.pad_token is None:
         tokenizer.pad_token = tokenizer.eos_token
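The tokenizer in the hunk above is created with `truncation_side="left"`. Assuming the usual Hugging Face meaning of that option, an over-long input loses tokens from its beginning and keeps its end. The plain-Python sketch below illustrates only that behavior on a list of token IDs; `truncate_left` is a hypothetical helper, not a function from this repo or from `transformers`.

```python
def truncate_left(token_ids, max_length):
    """Keep at most the last `max_length` tokens, dropping the earliest ones first."""
    if max_length <= 0:
        return []
    # A negative slice start larger than the list length is clamped,
    # so short inputs are returned unchanged.
    return list(token_ids[-max_length:])


if __name__ == "__main__":
    print(truncate_left([1, 2, 3, 4, 5], 3))  # [3, 4, 5]
    print(truncate_left([1, 2], 5))           # [1, 2]
```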
diff --git a/train.sh b/train.sh
@@ -1,18 +1,15 @@
-DATASET='ASQA'
+DATASET=PopQA
 PER_DEVICE_BATCH_SIZE=1
 NUM_DEVICE=4
 TOTAL_BATCH_SIZE=128
 GRADIENT_ACC_STEPS=$(($TOTAL_BATCH_SIZE/$NUM_DEVICE/$PER_DEVICE_BATCH_SIZE))
 
-export WANDB_MODE=offline
-
 CUDA_VISIBLE_DEVICES="0,1,2,3" torchrun --nproc_per_node=$NUM_DEVICE src/finetune.py \
   --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
   --dataset_name $DATASET \
   --output_dir saved_checkpoints/InstructRAG-FT/${DATASET} \
   --per_device_train_batch_size $PER_DEVICE_BATCH_SIZE \
   --gradient_accumulation_steps $GRADIENT_ACC_STEPS \
-  --cache_dir "/p/llmresearch/huggingface/hub" \
   --num_train_epochs 2 \
   --n_docs 5 \
   --learning_rate 2.5e-5 \
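The batch-size arithmetic at the top of `train.sh` (`GRADIENT_ACC_STEPS = TOTAL_BATCH_SIZE / NUM_DEVICE / PER_DEVICE_BATCH_SIZE`) uses shell integer division, which silently rounds down when the total is not divisible. The sketch below mirrors the same computation in Python and adds a divisibility check. The function name `gradient_accumulation_steps` is hypothetical and the check is an addition for illustration, not part of the script.

```python
def gradient_accumulation_steps(total_batch_size, num_devices, per_device_batch_size):
    """Compute accumulation steps so that
    num_devices * per_device_batch_size * steps == total_batch_size."""
    if num_devices <= 0 or per_device_batch_size <= 0:
        raise ValueError("num_devices and per_device_batch_size must be positive")
    samples_per_step = num_devices * per_device_batch_size
    if total_batch_size % samples_per_step != 0:
        raise ValueError(
            f"total batch size {total_batch_size} is not divisible by "
            f"{num_devices} devices x {per_device_batch_size} per-device batch"
        )
    return total_batch_size // samples_per_step


if __name__ == "__main__":
    # Values from train.sh: TOTAL_BATCH_SIZE=128, NUM_DEVICE=4, PER_DEVICE_BATCH_SIZE=1
    print(gradient_accumulation_steps(128, 4, 1))  # 32
```

With the defaults in `train.sh` this gives 32 accumulation steps. Moving to 2 GPUs with a per-device batch of 4, for example, gives 16 steps and the same effective batch size of 128.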