kieval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models #830

ShellLM opened this issue May 9, 2024 · 1 comment
Labels
- Automation: Automate the things
- github: gh tools like cli, Actions, Issues, Pages
- llm: Large Language Models
- llm-evaluation: Evaluating Large Language Models performance and behavior through human-written evaluation sets
- New-Label: Choose this option if the existing labels are insufficient to describe the content accurately
- software-engineering: Best practice for software engineering

Comments

ShellLM commented May 9, 2024

WisdomShell/kieval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

Snippet

"This is the official repository for KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models.

Automatic evaluation methods for large language models (LLMs) are hindered by data contamination, leading to inflated assessments of their effectiveness. Existing strategies, which aim to detect contaminated texts, focus on quantifying contamination status instead of accurately gauging model performance. In this paper, we introduce KIEval, a Knowledge-grounded Interactive Evaluation framework, which incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic, contamination-resilient evaluation. Starting with a question from a conventional LLM benchmark involving domain-specific knowledge, KIEval uses dynamically generated, multi-round, knowledge-focused dialogues to determine whether a model's response is merely a recall of benchmark answers or demonstrates a deep comprehension and the ability to apply knowledge in more complex conversations. Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization. We also find that data contamination brings no benefit, and can even be detrimental, to models' real-world applicability and understanding, and that existing contamination detection methods for LLMs can only identify contamination introduced during pre-training, not during supervised fine-tuning."

Quick Start

To get started, first clone the repository and set up the environment:

git clone https://github.com/zhuohaoyu/KIEval.git
cd KIEval
pip install -r requirements.txt
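
If you prefer to keep the dependencies isolated, the requirements can be installed into a virtual environment first (a plain venv sketch; the repository itself does not mandate this):

# Optional: create and activate an isolated environment before installing
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt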

We provide a modular implementation of our method. Currently, we support evaluating models locally with Hugging Face Transformers, and remote models with text-generation-inference or other APIs.

To reproduce KIEval results, we recommend starting a text-generation-inference instance with your model:

model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
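
Before generating a config, it may help to confirm the server is reachable. The sketch below assumes the port mapping from the docker command above (host port 8080 on localhost) and uses text-generation-inference's standard /generate endpoint; the prompt text is just a placeholder:

# Quick sanity check that the text-generation-inference server is serving
curl http://127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What causes the seasons on Earth?", "parameters": {"max_new_tokens": 64}}'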

Then, generate an evaluation config file with our script:

python scripts/generate-basic.py \
    --template ./config/template-basic.json \
    --dataset arc_challenge \
    --base_url http://your-host-url:8080 \
    --model_name llama-2-7b-chat-hf \
    --model_path meta-llama/Llama-2-7b-chat-hf \
    --openai_api_base https://api.openai.com/v1/ \
    --openai_key your_openai_key \
    --openai_model gpt-4-1106-preview \
    --output_path ./result \
    --generate_path ./config/generated.json
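
Here --base_url should point at the text-generation-inference instance started above, while the --openai_* flags configure the OpenAI model used by the framework (presumably the LLM-powered interactor role described in the paper). Before running, the generated config can be inspected with the Python standard-library JSON formatter; the exact schema depends on the chosen template, so this is only a quick sanity check:

# Pretty-print the first part of the generated evaluation config
python -m json.tool ./config/generated.json | head -n 40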

Finally, run the evaluation process:

python run.py -c ./config/generated.json
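
Evaluation runs can take a while; plain shell redirection (not a feature of run.py itself) keeps a log alongside the results:

# Run the evaluation and capture stdout/stderr to a log file
python run.py -c ./config/generated.json 2>&1 | tee ./result/kieval-run.log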

This repository provides all settings necessary for researchers to reproduce the results of KIEval; it also facilitates the reproduction of all metrics (from previous works) discussed in our paper. Please refer to config/templates for all supported evaluation methods.

Suggested labels

{'label-name': 'knowledge-grounded-evaluation', 'label-description': 'Evaluation framework incorporating domain-specific knowledge for large language models', 'confidence': 69.52}


ShellLM commented May 9, 2024

Related content

#750 similarity score: 0.9
#762 similarity score: 0.89
#811 similarity score: 0.89
#684 similarity score: 0.89
#309 similarity score: 0.88
#552 similarity score: 0.88
