| 🏆 Leaderboard | 📚 Dataset | 📑 arXiv | 🐦 Twitter/X |
This is the official code repo of our paper "FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation". This repository contains:
- FactBench: A new dynamic factuality benchmark grounded in the real-world usage of LMs. All code for constructing the benchmark is under the `./FactBench` folder.
- VERIFY: A factuality evaluation pipeline that considers the verifiability of generated content and categorizes content units as supported, unsupported, or undecidable based on retrieval results. Code is available under the `./VERIFY` folder.
- Baselines (FActScore, SAFE, Factcheck-GPT): Prior factuality evaluation methods that serve as our baselines. All baselines are accelerated and adapted to our framework. Code is available under the `./baselines` folder.
- Human Annotations: Our annotations on 4,467 content units are available in `./annotations.zip` (a loading sketch follows this list).
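For a quick look at the released annotations, a minimal sketch like the one below can help. The file names and format inside `annotations.zip` are assumptions (JSON is assumed here); adjust it to whatever the archive actually contains.

```python
# Minimal sketch for peeking inside annotations.zip.
# Assumption: the archive holds JSON files; adapt if the format differs.
import json
import zipfile

with zipfile.ZipFile("annotations.zip") as zf:
    names = zf.namelist()
    print("Files in archive:", names)
    # Load the first JSON file found (format is an assumption).
    json_names = [n for n in names if n.endswith(".json")]
    if json_names:
        with zf.open(json_names[0]) as f:
            records = json.load(f)
        print(f"Loaded {len(records)} records from {json_names[0]}")
```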
First, clone our GitHub repository and navigate to the newly created folder:
git clone https://github.com/launchnlp/FactBench.git
cd FactBench
- Install all requirements & dependencies:
pip install -r requirements.txt
- Put the FactBench data under `./VERIFY/data/lmsys_data/final_dataset/` (a download sketch follows this list).
- Run the VERIFY pipeline:
cd VERIFY
python factuality_evaluation.py --backbone_llm "Llama-3-70B-Instruct" --cache_dir "./cache/" --tier_number 1 --model_name "gpt4-o"
- You should be able to find the evaluation results under `./VERIFY/data/lmsys_data/benchmarking/BenchCurator` (an inspection sketch follows this list).
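For reference, here is a minimal sketch of fetching the benchmark data into the expected directory. The Hugging Face dataset id used below is a placeholder assumption, not confirmed by this README; use the id linked from the Dataset badge above.

```python
# Hypothetical download helper: the dataset repo id below is an assumption;
# replace it with the id from the Dataset link at the top of this README.
from pathlib import Path
from huggingface_hub import snapshot_download

target = Path("./VERIFY/data/lmsys_data/final_dataset/")
target.mkdir(parents=True, exist_ok=True)
snapshot_download(
    repo_id="launchnlp/FactBench",  # assumption: the actual dataset id may differ
    repo_type="dataset",
    local_dir=str(target),
)
```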
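And a short sketch for inspecting whatever the pipeline writes to the output directory. The file layout and format under `BenchCurator` are assumptions (JSON is assumed); adjust to the actual outputs.

```python
# Sketch only: assumes the pipeline writes JSON files under the output folder.
import json
from pathlib import Path

results_dir = Path("./VERIFY/data/lmsys_data/benchmarking/BenchCurator")
for path in sorted(results_dir.rglob("*.json")):  # assumption: JSON outputs
    with path.open() as f:
        data = json.load(f)
    size = len(data) if hasattr(data, "__len__") else "n/a"
    print(f"{path.relative_to(results_dir)}: {type(data).__name__} ({size} items)")
```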
Please consider raising an issue here and mentioning the names of your new models!
If you find our work useful for your research, please cite our paper:
@misc{bayat2024factbenchdynamicbenchmarkinthewild,
title={FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation},
author={Farima Fatahi Bayat and Lechen Zhang and Sheza Munir and Lu Wang},
year={2024},
eprint={2410.22257},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.22257},
}