Official implementation of "Accurate and Nuanced Open-QA Evaluation Through Textual Entailment"
If you find this code useful in your research, please consider citing:
@inproceedings{2024yao_qaeval,
title = "Accurate and Nuanced Open-{QA} Evaluation Through Textual Entailment",
author = "Peiran Yao and
Barbosa, Denilson",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
url = "https://arxiv.org/abs/2405.16702",
}
A valid OpenAI API key is required.
export OPENAI_API_KEY=your_openai_api_key
pip install openai backoff pandas tqdm
# If you are interested in finetuning
pip install torch transformers datasets peft trl
Two steps are needed to evaluate the correctness of question-answer pairs:
# Convert QA pairs to statements
python3 qa2s-gpt.py --dataset [NQ|TQ]
# Run entailment test between system answer and reference answer
python3 nli-gpt.py --dataset [NQ|TQ]
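For intuition, here is a minimal sketch of what the two steps do, using the OpenAI chat API with backoff-based retries. The prompts, model choice, and function names are illustrative assumptions, not the exact ones used by qa2s-gpt.py and nli-gpt.py:

```python
# Hedged sketch of the two-step pipeline; prompts and model are assumptions.
import backoff
import openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


@backoff.on_exception(backoff.expo, openai.RateLimitError)
def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()


def qa_to_statement(question: str, answer: str) -> str:
    # Step 1: rewrite a question-answer pair as a declarative statement.
    return chat(
        "Combine the question and answer into a single declarative statement.\n"
        f"Question: {question}\nAnswer: {answer}\nStatement:"
    )


def entails(premise: str, hypothesis: str) -> bool:
    # Step 2: textual entailment between the reference statement (premise)
    # and the system statement (hypothesis) as the correctness test.
    verdict = chat(
        f"Premise: {premise}\nHypothesis: {hypothesis}\n"
        "Does the premise entail the hypothesis? Answer yes or no."
    )
    return verdict.lower().startswith("yes")


question = "Where was ACL 2024 held?"
reference = qa_to_statement(question, "Bangkok, Thailand")
system = qa_to_statement(question, "Bangkok")
print(entails(reference, system))
```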
The notebook eval-eval.ipynb evaluates the quality of the QA evaluation results.
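In essence, this means comparing the automatic judgments against human labels, roughly as in this sketch (the file name and column names are assumptions, not the notebook's actual ones):

```python
# Hedged sketch: agreement between automatic judgments and human labels.
import pandas as pd

df = pd.read_csv("nli_results.csv")  # hypothetical output of nli-gpt.py
agreement = (df["entailment_judgment"] == df["human_label"]).mean()
print(f"Agreement with human judgments: {agreement:.3f}")
```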
Scripts for training baselines that involve finetuning are in the trained-eval and rev folders.
Run python3 cot.py to use GPT-3.5 to verbalize the entailment reasoning. The notebook cot_vs_pr.ipynb computes a score from the verbalized reasoning and evaluates that score by AUROC.
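The scoring-plus-AUROC step might look like the following sketch; the scoring heuristic, file name, and column names are assumptions, and scikit-learn is an extra dependency not listed above:

```python
# Hedged sketch: score verbalized reasoning, then measure AUROC
# against human labels. The heuristic below is a toy assumption.
import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.read_csv("cot_output.csv")  # hypothetical output of cot.py


def reasoning_score(text: str) -> float:
    # Toy heuristic: fraction of reasoning sentences that sound affirmative.
    sentences = [s for s in text.split(".") if s.strip()]
    positive = sum("yes" in s.lower() or "entail" in s.lower() for s in sentences)
    return positive / max(len(sentences), 1)


scores = df["reasoning"].map(reasoning_score)
print("AUROC:", roc_auc_score(df["human_label"], scores))
```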