kieval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models #830
Labels
Automation
Automate the things
github
GitHub tools like the CLI, Actions, Issues, Pages
llm
Large Language Models
llm-evaluation
Evaluating Large Language Models' performance and behavior through human-written evaluation sets
New-Label
Choose this option if the existing labels are insufficient to describe the content accurately
software-engineering
Best practice for software engineering
WisdomShell/kieval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
Snippet
"This is the official repository for KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models.
Automatic evaluation methods for large language models (LLMs) are hindered by data contamination, leading to inflated assessments of their effectiveness. Existing strategies, which aim to detect contaminated texts, focus on quantifying contamination status instead of accurately gauging model performance. In this paper, we introduce KIEval, a Knowledge-grounded Interactive Evaluation framework, which incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation. Starting with a question in a conventional LLM benchmark involving domain-specific knowledge, KIEval utilizes dynamically generated, multi-round, and knowledge-focused dialogues to determine whether a model's response is merely a recall of benchmark answers or demonstrates a deep comprehension to apply knowledge in more complex conversations. Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization. We also reveal that data contamination brings no contribution or even negative effect to models' real-world applicability and understanding, and existing contamination detection methods for LLMs can only identify contamination in pre-training but not during supervised fine-tuning."
Quick Start
To get started, first clone the repository and set up the environment:
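A minimal sketch of this step, assuming a standard pip-based setup; the repository may ship its own requirements file or environment manager, so check the project README for the exact commands:

```bash
# Clone the repository (URL assumed from the project name in this issue)
git clone https://github.com/WisdomShell/kieval.git
cd kieval

# Create an isolated environment and install dependencies
# (requirements.txt is an assumption; the repo may use a different setup)
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```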
We provide a modular implementation of our method. Currently, we support evaluating models locally with Hugging Face Transformers, and remote models with text-generation-inference or other APIs.
To reproduce KIEval results, we recommend starting a text-generation-inference instance with your model:
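As a sketch, one common way to start a text-generation-inference server is the official Docker image; the model ID, port, and volume path below are placeholders to adjust for your model and hardware:

```bash
# Launch a text-generation-inference server via the official Docker image
# --model-id, the port mapping, and the volume path are placeholders
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-2-7b-chat-hf
```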
Then, generate an evaluation config file with our script:
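The repository's actual script and flag names are not included in this snippet; the call below is a hypothetical illustration of the step (script name, flags, and paths are placeholders, not the project's real interface):

```bash
# Hypothetical config-generation call; the actual script and options
# live in the repository (see config/templates for supported methods)
python generate_config.py \
  --template config/templates/kieval.yaml \
  --endpoint http://localhost:8080 \
  --output config/my_eval.yaml
```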
Finally, run the evaluation process:
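Again as a hypothetical sketch, the runner script name and flag are assumptions; substitute the repository's actual entry point:

```bash
# Hypothetical entry point; pass the config file generated in the previous step
python run.py --config config/my_eval.yaml
```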
This repository provides all settings necessary for researchers to reproduce the results of KIEval; it also facilitates the reproduction of all metrics (from previous works) discussed in our paper. Please refer to config/templates for all supported evaluation methods.

Suggested labels
{'label-name': 'knowledge-grounded-evaluation', 'label-description': 'Evaluation framework incorporating domain-specific knowledge for large language models', 'confidence': 69.52}