AlpacaEval: Revolutionizing Model Evaluation with LLM-Based Automatic Tools #813

Open

ShellLM opened this issue Apr 22, 2024 · 1 comment

Labels: llm-evaluation, openai, Papers

ShellLM commented Apr 22, 2024


Snippet: "Evaluation of instruction-following models (e.g., ChatGPT) typically requires human interactions. This is time-consuming, expensive, and hard to replicate. AlpacaEval in an LLM-based automatic evaluation that is fast, cheap, replicable, and validated against 20K human annotations. It is particularly useful for model development. Although we improved over prior automatic evaluation pipelines, there are still fundamental limitations like the preference for longer outputs. AlpacaEval provides the following:

  • Leaderboard: a leaderboard of common models on the AlpacaEval evaluation set. Caution: Automatic evaluators (e.g. GPT-4) may be biased towards models that generate longer outputs and/or that were fine-tuned on the model underlying the evaluator (e.g. GPT-4).
  • Automatic evaluator: an automatic evaluator that has high agreement with humans (validated on 20K annotations). We evaluate a model by measuring the fraction of times a powerful LLM (e.g. GPT-4) prefers the outputs from that model over outputs from a reference model. Our evaluators enable caching and output randomization by default. (A minimal sketch of this win-rate computation follows the snippet.)
  • Toolkit for building automatic evaluators: a simple interface for building advanced automatic evaluators (e.g. with caching, batching, or multi-annotators) and analyzing them (quality, price, speed, statistical power, bias, variance, etc.).
  • Human evaluation data: 20K human preferences between a given and reference model on the AlpacaFarm evaluation set. 2.5K of these are cross-annotations (4 humans annotating the same 650 examples).
  • AlpacaEval dataset: a simplification of AlpacaFarm's evaluation set, where "instructions" and "inputs" are merged into one field, and reference outputs are longer. Details here.

When to use and not use AlpacaEval?

  • When to use AlpacaEval? Our automatic evaluator is a quick and cheap proxy for human evaluation of simple instruction-following tasks. It is useful if you have to run many evaluations quickly, e.g., during model development.

  • When not to use AlpacaEval? Like any other automatic evaluator, AlpacaEval should not replace human evaluation in high-stakes decision-making, e.g., deciding on a model release. In particular, AlpacaEval is limited by the fact that (1) the instructions in the eval set might not be representative of advanced usage of LLMs; (2) automatic evaluators may have biases, such as favoring style over factuality of the answer; and (3) AlpacaEval does not measure the risks that a model could cause. Details in limitations."
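
A rough illustration of the win-rate computation described in the snippet (a judge LLM choosing between a candidate model's output and a reference model's output, with the presentation order randomized). This is a minimal sketch, not AlpacaEval's actual implementation; `judge_prefers_first` is a hypothetical stand-in for a call to a judge model such as GPT-4.

```python
import random
from typing import Callable, Sequence

def win_rate(
    instructions: Sequence[str],
    candidate_outputs: Sequence[str],
    reference_outputs: Sequence[str],
    judge_prefers_first: Callable[[str, str, str], bool],
    seed: int = 0,
) -> float:
    """Fraction of instructions on which the judge prefers the candidate's output."""
    rng = random.Random(seed)
    wins = 0
    for instruction, cand, ref in zip(instructions, candidate_outputs, reference_outputs):
        # Randomize presentation order so any position bias in the judge
        # does not systematically favor either model.
        if rng.random() < 0.5:
            first, second, candidate_is_first = cand, ref, True
        else:
            first, second, candidate_is_first = ref, cand, False
        prefers_first = judge_prefers_first(instruction, first, second)
        # The candidate wins when the judge prefers whichever position it occupied.
        if prefers_first == candidate_is_first:
            wins += 1
    return wins / len(instructions)
```

AlpacaEval's actual evaluators layer caching, batching, and multi-annotator support on top of this basic comparison loop, as noted in the toolkit bullet above.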

Suggested labels

None

ShellLM commented Apr 22, 2024

Related content

#750 similarity score: 0.88
#431 similarity score: 0.87
#459 similarity score: 0.87
#811 similarity score: 0.87
#389 similarity score: 0.87
#628 similarity score: 0.87
