
🔎 FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation

| 🏆 Leaderboard | 📚 Dataset | 📑 arXiv | 🐦 Twitter/X |

This is the official code repository for our paper "FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation". It contains:

  1. FactBench: A new dynamic factuality benchmark grounded in the real-world usage of LMs. All code for constructing the benchmark is under the ./FactBench folder.
  2. VERIFY: A factuality evaluation pipeline that considers the verifiability of generated content and categorizes units as supported, unsupported, or undecidable according to the retrieval results (see the sketch after this list). Code is available under the ./VERIFY folder.
  3. Baselines (FActScore, SAFE, Factcheck-GPT): Prior methods that serve as our baselines. All baselines are accelerated and adapted to our framework. Code is available under the ./baselines folder.
  4. Human Annotations: Our annotations of 4,467 content units are available in ./annotations.zip.
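
To make the three-way categorization in item 2 concrete, here is a minimal sketch of how a content unit could be labeled from retrieved evidence. The Verdict enum and label_unit function are illustrative assumptions for this README, not the actual VERIFY implementation.

# Illustrative sketch only -- not the actual VERIFY code.
# A content unit is labeled from retrieved evidence: "supported" if the
# evidence backs it, "unsupported" if the evidence contradicts it, and
# "undecidable" if the retrieved evidence is insufficient to judge.
from enum import Enum
from typing import Optional

class Verdict(Enum):
    SUPPORTED = "supported"
    UNSUPPORTED = "unsupported"
    UNDECIDABLE = "undecidable"

def label_unit(entailed: Optional[bool]) -> Verdict:
    # `entailed` is True if the evidence supports the unit, False if it
    # contradicts it, and None if the evidence is inconclusive
    # (hypothetical interface, for illustration only).
    if entailed is None:
        return Verdict.UNDECIDABLE
    return Verdict.SUPPORTED if entailed else Verdict.UNSUPPORTED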

[Pipeline Diagram]

Accessing the Repository

First, clone our GitHub repository and navigate to the newly created folder:

git clone https://github.com/launchnlp/FactBench.git
cd FactBench

Environment Setup and Factuality Evaluation

If running VERIFY (our factuality evaluation pipeline):

  1. Install all requirements and dependencies:
pip install -r requirements.txt
  2. Put the FactBench data under:
./VERIFY/data/lmsys_data/final_dataset/
  3. Run the VERIFY pipeline:
cd VERIFY
python factuality_evaluation.py --backbone_llm "Llama-3-70B-Instruct" --cache_dir "./cache/" --tier_number 1 --model_name "gpt4-o"
  4. You should be able to find the evaluation results under:
./VERIFY/data/lmsys_data/benchmarking/BenchCurator
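
Once the run finishes, the outputs in that folder can be inspected programmatically. The sketch below is a convenience for this README and assumes each output file holds a JSON list of content units, each carrying a "label" field; the exact file names and schema are assumptions, not guaranteed by the repository.

# Minimal sketch for inspecting VERIFY outputs (file schema is an assumption).
import json
from collections import Counter
from pathlib import Path

results_dir = Path("./VERIFY/data/lmsys_data/benchmarking/BenchCurator")
label_counts = Counter()

for path in results_dir.glob("**/*.json"):
    with path.open() as f:
        units = json.load(f)
    # Count the verification label assigned to each content unit.
    for unit in units:
        label_counts[unit.get("label", "unknown")] += 1

print(label_counts)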

Add your favorite models to the Leaderboard

Please consider raising an issue here and mentioning the names of the new models you would like to see evaluated!

Citation

If you find our work useful for your research, please cite our paper:

@misc{bayat2024factbenchdynamicbenchmarkinthewild,
      title={FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation}, 
      author={Farima Fatahi Bayat and Lechen Zhang and Sheza Munir and Lu Wang},
      year={2024},
      eprint={2410.22257},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.22257}, 
}
