| 🏆 Leaderboard | 📚 Dataset | 📑 arXiv | 🐦 Twitter/X |
This is the official code repo of our paper "FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation". This repository contains:
- FactBench: A new dynamic factuality benchmark grounded in the real-world usage of LMs. All code for constructing the benchmark is under the `./FactBench` folder.
- VERIFY: A factuality evaluation pipeline that considers the verifiability of generated content and categorizes content units as supported, unsupported, or undecidable based on retrieval results. Code is available under the `./VERIFY` folder.
- Baselines (FActScore, SAFE, Factcheck-GPT): Prior factuality evaluation methods that serve as our baselines. All baselines are accelerated and adapted to our framework. Code is available under the `./baselines` folder.
- Human Annotations: Our annotations on 4,467 content units are available in `./annotations.zip` (a loading sketch follows this list).
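For a quick look at the released annotations, a minimal sketch like the one below can help. The file names and format inside `annotations.zip` are assumptions (JSON is assumed here); adjust it to whatever the archive actually contains.

```python
# Minimal sketch for peeking inside annotations.zip.
# Assumption: the archive holds JSON files; adapt if the format differs.
import json
import zipfile

with zipfile.ZipFile("annotations.zip") as zf:
    names = zf.namelist()
    print("Files in archive:", names)
    # Load the first JSON file found (format is an assumption).
    json_names = [n for n in names if n.endswith(".json")]
    if json_names:
        with zf.open(json_names[0]) as f:
            records = json.load(f)
        print(f"Loaded {len(records)} records from {json_names[0]}")
```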
First, clone our GitHub repository and navigate to the newly created folder:
git clone https://github.com/launchnlp/FactBench.git
cd FactBench
- Install all requirements & dependencies:
pip install -r requirements.txt
- Put the FactBench data under `./VERIFY/data/lmsys_data/final_dataset/` (a download sketch follows this list).
- Run the VERIFY pipeline:
cd VERIFY
python factuality_evaluation.py --backbone_llm "Llama-3-70B-Instruct" --cache_dir "./cache/" --tier_number 1 --model_name "gpt4-o"
- You should be able to find the evaluation results under `./VERIFY/data/lmsys_data/benchmarking/BenchCurator` (an inspection sketch follows this list).
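For reference, here is a minimal sketch of fetching the benchmark data into the expected directory. The Hugging Face dataset id used below is a placeholder assumption, not confirmed by this README; use the id linked from the Dataset badge above.

```python
# Hypothetical download helper: the dataset repo id below is an assumption;
# replace it with the id from the Dataset link at the top of this README.
from pathlib import Path
from huggingface_hub import snapshot_download

target = Path("./VERIFY/data/lmsys_data/final_dataset/")
target.mkdir(parents=True, exist_ok=True)
snapshot_download(
    repo_id="launchnlp/FactBench",  # assumption: the actual dataset id may differ
    repo_type="dataset",
    local_dir=str(target),
)
```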
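And a short sketch for inspecting whatever the pipeline writes to the output directory. The file layout and format under `BenchCurator` are assumptions (JSON is assumed); adjust to the actual outputs.

```python
# Sketch only: assumes the pipeline writes JSON files under the output folder.
import json
from pathlib import Path

results_dir = Path("./VERIFY/data/lmsys_data/benchmarking/BenchCurator")
for path in sorted(results_dir.rglob("*.json")):  # assumption: JSON outputs
    with path.open() as f:
        data = json.load(f)
    size = len(data) if hasattr(data, "__len__") else "n/a"
    print(f"{path.relative_to(results_dir)}: {type(data).__name__} ({size} items)")
```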
Please consider raising an issue here and mentioning the names of your new models!
If you find our work useful for your research, please cite our paper:
@misc{bayat2024factbenchdynamicbenchmarkinthewild,
title={FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation},
author={Farima Fatahi Bayat and Lechen Zhang and Sheza Munir and Lu Wang},
year={2024},
eprint={2410.22257},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.22257},
}