
🚀 Awesome-LLMs-as-Judges


🌟 About This Repo

With the rapid development of LLMs, LLM-as-a-Judge has garnered widespread attention in both academia and industry. LLM judges can serve as flexible evaluators in fields such as text generation, question answering, and dialogue systems, and they can also drive the self-evolution and performance improvement of models. This repository aims to be a one-stop resource that helps developers, researchers, and practitioners explore how to leverage LLMs-as-Judges effectively.
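
To make the idea concrete, here is a minimal sketch of a pointwise LLM judge: the model receives a rubric plus a candidate answer and returns a 1–5 score. This is only an illustration, not the method of any particular paper; `call_llm` is a hypothetical placeholder for whatever chat-completion client you use, and the prompt wording is an assumption.

```python
# Minimal LLM-as-a-Judge sketch (illustrative only).
import re

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for an actual chat-completion API call."""
    raise NotImplementedError

JUDGE_PROMPT = """You are an impartial evaluator.

Question:
{question}

Candidate answer:
{answer}

Rate the answer for helpfulness, correctness, and clarity.
Reply with a single integer from 1 (worst) to 5 (best)."""

def judge_response(question: str, answer: str) -> int:
    """Ask the judge LLM for a 1-5 score and parse the first digit it returns."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Could not parse a score from the judge reply: {reply!r}")
    return int(match.group())
```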

This repo includes the papers discussed in our latest survey paper:

📝 LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods.

We will continuously track the latest developments in LLMs-as-Judges and regularly update the repository with the newest related papers. If you find this repository helpful, please give us a ⭐!

If you notice any work we've missed, please feel free to submit a pull request or contact us by email at liht22@mails.tsinghua.edu.cn.

We will keep updating both the repository and our paper. Discussions and contributions are welcome!

📚 Daily Papers on LLMs-as-Judges

Daily Papers on LLMs-as-Judges includes the latest paper titles and abstracts related to LLMs-as-Judges on arXiv, with information available in both English and Chinese.
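
For readers who want to build a similar feed, the sketch below pulls recent papers from the public arXiv API (`http://export.arxiv.org/api/query`). The search string, result count, and output fields are illustrative assumptions; this is not the actual script behind the Daily Papers page.

```python
# Illustrative sketch: fetch recent LLMs-as-Judges papers from the public arXiv API.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

def fetch_recent_judge_papers(max_results: int = 20):
    query = urllib.parse.urlencode({
        "search_query": 'all:"LLM-as-a-judge" OR all:"LLMs as judges"',  # assumed query
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    })
    url = f"http://export.arxiv.org/api/query?{query}"
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    papers = []
    for entry in feed.findall("atom:entry", ATOM_NS):
        papers.append({
            "title": " ".join(entry.findtext("atom:title", "", ATOM_NS).split()),
            "link": entry.findtext("atom:id", "", ATOM_NS),
            "published": entry.findtext("atom:published", "", ATOM_NS),
            "abstract": " ".join(entry.findtext("atom:summary", "", ATOM_NS).split()),
        })
    return papers

if __name__ == "__main__":
    for paper in fetch_recent_judge_papers(5):
        print(paper["published"][:10], paper["title"])
```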

⚡️ Update

🔥🔥 News: 2024/12/20: We have updated Daily Papers on LLMs-as-Judges, which automatically retrieves and updates daily papers from arXiv related to LLMs-as-Judges.

🔥🔥 News: 2024/12/14: We compiled papers related to LLMs-as-Judges presented at NeurIPS 2024.

🔥🔥 News: 2024/12/10: We released the first version of the full paper.

🔥🔥 News: 2024/11/10: We completed the foundational work for the project and structured the framework.

📖 Cite Our Work

If you find our work useful, please give us a star and cite our work:

@misc{li2024llmsasjudgescomprehensivesurveyllmbased,
      title={LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods}, 
      author={Haitao Li and Qian Dong and Junjie Chen and Huixue Su and Yujia Zhou and Qingyao Ai and Ziyi Ye and Yiqun Liu},
      year={2024},
      eprint={2412.05579},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.05579}, 
}

📚 Overview of Awesome-LLMs-as-Judges

[Figures: overview, limitations, and framework of the survey]

📑 PaperList

1. Functionality

1.1 Performance Evaluation

1.1.1 Response Evaluation

  • Llm-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models

    ACL 2023. [Paper]

  • Automated Genre-Aware Article Scoring and Feedback Using Large Language Models

    arXiv 2024. [Paper]

  • Is LLM a Reliable Reviewer? A Comprehensive Evaluation of LLM on Automatic Paper Reviewing Tasks

    LREC-COLING 2024. [Paper]

  • Ares: An automated evaluation framework for retrieval-augmented generation systems

    NAACL 2024. [Paper]

  • Self-rag: Learning to retrieve, generate, and critique through self-reflection

    ICLR 2024. [Paper]

  • RecExplainer: Aligning Large Language Models for Explaining Recommendation Models

    KDD 2024. [Paper]

1.1.2 Model Evaluation

  • Judging llm-as-a-judge with mt-bench and chatbot arena

    NeurIPS 2023. [Paper]

  • Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions

    arXiv 2024. [Paper]

  • VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

    arXiv 2024. [Paper]

  • Benchmarking foundation models with language-model-as-an-examiner

    NeurIPS 2023. [Paper]

  • Kieval: A knowledge-grounded interactive evaluation framework for large language models

    ACL 2024. [Paper]

1.2 Model Enhancement

1.2.1 Reward Modeling During Training

  • Self-rewarding language models

    ICML 2024. [Paper]

  • Direct language model alignment from online ai feedback

    arXiv 2024. [Paper]

  • Rlaif: Scaling reinforcement learning from human feedback with ai feedback

    arXiv 2024. [Paper]

  • Enhancing Reinforcement Learning with Dense Rewards from Language Model Critic

    EMNLP 2024. [Paper]

  • Cream: Consistency regularized self-rewarding language models

    arXiv 2024. [Paper]

  • The perfect blend: Redefining RLHF with mixture of judges

    arXiv 2024. [Paper]

  • Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs

    EMNLP (findings) 2023. [Paper]

1.2.2 Acting as Verifier During Inference

  • Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment

    arXiv 2024. [Paper]

  • Fast Best-of-N Decoding via Speculative Rejection

    NeurIPS 2024. [Paper]

  • Tree of thoughts: Deliberate problem solving with large language models

    NeurIPS 2023. [Paper]

  • Graph of thoughts: Solving elaborate problems with large language models

    AAAI 2024. [Paper]

  • Let’s verify step by step

    ICLR 2024. [Paper]

  • Self-evaluation guided beam search for reasoning

    NeurIPS 2024. [Paper]

  • Rationale-Aware Answer Verification by Pairwise Self-Evaluation

    arXiv 2024. [Paper]

  • Creative Beam Search: LLM-as-a-Judge for Improving Response Generation.

    ICCC 2024. [Paper]

1.2.3 Feedback for Refinement

  • Self-refine: Iterative refinement with self-feedback

    NeurIPS 2023. [Paper]

  • Teaching large language models to self-debug

    arXiv 2023. [Paper]

  • Refiner: Reasoning feedback on intermediate representations

    EACL 2024. [Paper]

  • Towards reasoning in large language models via multi-agent peer review collaboration

    arXiv 2023. [Paper]

  • Large language models cannot self-correct reasoning yet

    ICLR 2024. [Paper]

  • LLMs cannot find reasoning errors, but can correct them!

    ACL (findings) 2024. [Paper]

  • Can large language models really improve by self-critiquing their own plans?

    NeurIPS (Workshop) 2023. [Paper]

1.3 Data Collection

1.3.1 Data Annotation

  • If in a Crowdsourced Data Annotation Pipeline, a GPT-4

    CHI 2024. [Paper]

  • ChatGPT outperforms crowd workers for text-annotation tasks

    PNAS 2023. [Paper]

  • ChatGPT-4 outperforms experts and crowd workers in annotating political Twitter messages with zero-shot learning

    arXiv 2023. [Paper]

  • Fullanno: A data engine for enhancing image comprehension of MLLMs

    arXiv 2024. [Paper]

  • Can large language models aid in annotating speech emotional data? Uncovering new frontiers

    IEEE 2024. [Paper]

  • Annollm: Making large language models to be better crowdsourced annotators

    NAACL 2024. [Paper]

  • LLMAAA: Making large language models as active annotators

    EMNLP (findings) 2023. [Paper]

1.3.2 Data Synthesis

  • Selfee: Iterative self-revising LLM empowered by self-feedback generation

    Blog post 2023. [Blog]

  • Self-Boosting Large Language Models with Synthetic Preference Data

    arXiv 2024. [Paper]

  • The fellowship of the LLMs: Multi-agent workflows for synthetic preference optimization dataset generation

    arXiv 2024. [Paper]

  • Self-consistency improves chain of thought reasoning in language models

    ICLR 2023. [Paper]

  • WizardLM: Empowering large language models to follow complex instructions

    ICLR 2024. [Paper]

  • Automatic Instruction Evolving for Large Language Models

    EMNLP 2024. [Paper]

  • STaR: Self-taught reasoner bootstrapping reasoning with reasoning

    NeurIPS 2022. [Paper]

  • Beyond human data: Scaling self-training for problem-solving with language models

    arXiv 2023. [Paper]

  • SAFER-INSTRUCT: Aligning Language Models with Automated Preference Data

    NAACL 2024. [Paper]

  • soda-eval: open-domain dialogue evaluation in the age of llms

    EMNLP (findings) 2024. [Paper]

2. METHODOLOGY

2.1 Single-LLM System

2.1.1 Prompt-based

2.1.1.1 In-Context Learning

  • A systematic survey of prompt engineering in large language models: Techniques and applications

    arXiv 2024. [Paper]

  • A survey on in-context learning

    arXiv 2022. [Paper]

  • Gptscore: Evaluate as you desire

    arXiv 2023. [Paper]

  • Llm-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models

    NLP4ConvAI 2023. [Paper]

  • TALEC: Teach Your LLM to Evaluate in Specific Domain with In-house Criteria by Criteria Division and Zero-shot Plus Few-shot

    arXiv 2024. [Paper]

  • Multi-dimensional evaluation of text summarization with in-context learning

    ACL 2023 (Findings). [Paper]

  • Calibrate before use: Improving few-shot performance of language models

    ICML 2021. [Paper]

  • Prototypical calibration for few-shot learning of language models

    arXiv 2022. [Paper]

  • Mitigating label biases for in-context learning

    ACL 2023. [Paper]

  • ALLURE: auditing and improving llm-based evaluation of text using iterative in-context-learning

    arXiv 2023. [Paper]

  • Can Many-Shot In-Context Learning Help Long-Context LLM Judges? See More, Judge Better!

    arXiv 2024. [Paper]

2.1.1.2 Step-by-step

  • Chain-of-thought prompting elicits reasoning in large language models

    NeurIPS 2022. [Paper]

  • Little giants: Exploring the potential of small llms as evaluation metrics in summarization in the eval4nlp 2023 shared task

    Eval4NLP 2023. [Paper]

  • G-eval: Nlg evaluation using gpt-4 with better human alignment

    arXiv 2023. [Paper]

  • ICE-Score: Instructing Large Language Models to Evaluate Code

    EACL 2024 (findings). [Paper]

  • ProtocoLLM: Automatic Evaluation Framework of LLMs on Domain-Specific Scientific Protocol Formulation Tasks

    arXiv 2024. [Paper]

  • A closer look into automatic evaluation using large language models

    EMNLP 2023 (findings). [Paper]

  • FineSurE: Fine-grained summarization evaluation using LLMs

    ACL 2024. [Paper]

  • Split and merge: Aligning position biases in large language model based evaluators

    arXiv 2023. [Paper]

2.1.1.3 Definition Augmentation

  • Can LLM be a Personalized Judge?

    arXiv 2024. [Paper]

  • Biasalert: A plug-and-play tool for social bias detection in llms

    arXiv 2024. [Paper]

  • LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation

    arXiv 2024. [Paper]

  • Unveiling Context-Aware Criteria in Self-Assessing LLMs

    arXiv 2024. [Paper]

  • Calibrating llm-based evaluator

    arXiv 2023. [Paper]

2.1.1.4 Multi-turn Optimization

  • Large Language Models Are Active Critics in NLG Evaluation

    arXiv 2024. [Paper]

  • Kieval: A knowledge-grounded interactive evaluation framework for large language models

    ACL 2024. [Paper]

  • Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions

    arXiv 2024. [Paper]

  • Benchmarking foundation models with language-model-as-an-examiner

    NeurIPS 2023 (Datasets and Benchmarks). [Paper]

  • VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

    arXiv 2024. [Paper]

2.1.2 Tuning-based

  • On the limitations of fine-tuned judge models for llm evaluation

    arXiv 2024. [Paper]

2.1.2.1 Score-based Tuning

  • Adaptation with self-evaluation to improve selective prediction in llms

    EMNLP 2023 (Findings). [Paper]

  • Learning personalized story evaluation

    arXiv 2023. [Paper]

  • Improving Model Factuality with Fine-grained Critique-based Evaluator

    arXiv 2024. [Paper]

  • Ares: An automated evaluation framework for retrieval-augmented generation systems

    NAACL 2024. [Paper]

  • PHUDGE: Phi-3 as Scalable Judge

    arXiv 2024. [Paper]

  • Self-Judge: Selective Instruction Following with Alignment Self-Evaluation

    arXiv 2024. [Paper]

  • Automatic evaluation of attribution by large language models

    EMNLP 2023 (Findings). [Paper]

  • Sorry-bench: Systematically evaluating large language model safety refusal behaviors

    arXiv 2024. [Paper]

  • Tigerscore: Towards building explainable metric for all text generation tasks

    TMLR 2024. [Paper]

2.1.2.2 Preference-based Learning

  • Beyond Scalar Reward Model: Learning Generative Judge from Preference Data

    arXiv 2024. [Paper]

  • Prometheus: Inducing fine-grained evaluation capability in language models

    ICLR 2024. [Paper]

  • Prometheus 2: An open source language model specialized in evaluating other language models

    arXiv 2024. [Paper]

  • FedEval-LLM: Federated Evaluation of Large Language Models on Downstream Tasks with Collective Wisdom

    arXiv 2024. [Paper]

  • Self-rationalization improves LLM as a fine-grained judge

    arXiv 2024. [Paper]

  • Foundational autoraters: Taming large language models for better automatic evaluation

    arXiv 2024. [Paper]

  • Self-taught evaluators

    arXiv 2024. [Paper]

  • Pandalm: An automatic evaluation benchmark for llm instruction tuning optimization

    ICLR 2024. [Paper]

  • CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution

    arXiv 2024. [Paper]

  • Direct preference optimization: Your language model is secretly a reward model

    NeurIPS 2023. [Paper]

  • Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge

    arXiv 2024. [Paper]

  • Judgelm: Fine-tuned large language models are scalable judges

    arXiv 2023. [Paper]

  • INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback

    EMNLP 2023. [Paper]

  • Generative judge for evaluating alignment

    arXiv 2023. [Paper]

  • Shepherd: A critic for language model generation

    arXiv 2023. [Paper]

  • X-eval: Generalizable multi-aspect text evaluation via augmented instruction tuning with auxiliary evaluation aspects

    NAACL 2024. [Paper]

  • Themis: A reference-free nlg evaluation language model with flexibility and interpretability

    EMNLP 2024. [Paper]

  • Critiquellm: Towards an informative critique generation model for evaluation of large language model generation

    ACL 2024. [Paper]

  • Mitigating the Bias of Large Language Model Evaluation

    arXiv 2024. [Paper]

  • Halu-j: Critique-based hallucination judge

    arXiv 2024. [Paper]

  • Prometheusvision: Vision-language model as a judge for fine-grained evaluation

    ICLR 2024 (Workshop). [Paper]

  • Llava-critic: Learning to evaluate multimodal models

    arXiv 2024. [Paper]

2.1.3 Post-processing

2.1.3.1 Probability Calibration

  • Efficient LLM Comparative Assessment: a Product of Experts Framework for Pairwise Comparisons

    arXiv 2024. [Paper]

  • Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments

    arXiv 2024. [Paper]

  • Language Models can Evaluate Themselves via Probability Discrepancy

    ACL 2024 (Findings). [Paper]

  • Mitigating biases for instruction-following language models via bias neurons elimination

    ACL 2024. [Paper]

2.1.3.2 Text Reprocessing

  • Evaluation metrics in the era of GPT-4: reliably evaluating large language models on sequence to sequence tasks

    EMNLP 2023. [Paper]

  • Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena

    arXiv 2024. [Paper]

  • Consolidating Ranking and Relevance Predictions of Large Language Models through Post-Processing

    arXiv 2024. [Paper]

  • RevisEval: Improving LLM-as-a-Judge via Response-Adapted References

    arXiv 2024. [Paper]

  • Generative judge for evaluating alignment

    arXiv 2023. [Paper]

  • Self-evaluation improves selective generation in large language models

    NeurIPS 2023 (Workshops). [Paper]

  • AI can help humans find common ground in democratic deliberation

    Science 2024. [Paper]

2.2 Multi-LLM System

2.2.1 Communication

2.2.1.1 Cooperation

  • Towards reasoning in large language models via multi-agent peer review collaboration

    arXiv 2023. [Paper]

  • Wider and deeper llm networks are fairer llm evaluators

    arXiv 2023. [Paper]

  • ABSEval: An Agent-based Framework for Script Evaluation

    ACL 2024. [Paper]

2.2.1.2 Competition

  • Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debates

    arXiv 2024. [Paper]

  • A multi-llm debiasing framework

    arXiv 2024. [Paper]

  • Prd: Peer rank and discussion improve large language model based evaluations

    TMLR 2024. [Paper]

  • Chateval: Towards better llm-based evaluators through multi-agent debate

    arXiv 2023. [Paper]

  • Evaluating the Performance of Large Language Models via Debates

    arXiv 2024. [Paper]

  • Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions

    arXiv 2024. [Paper]

2.2.2 Aggregation

  • Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

    arXiv 2024. [Paper]

  • Benchmarking foundation models with language-model-as-an-examiner

    NeurIPS 2023 (Datasets and Benchmarks). [Paper]

  • Pre: A peer review based large language model evaluator

    arXiv 2024. [Paper]

  • An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation

    arXiv 2024. [Paper]

  • Large language models as evaluators for recommendation explanations

    RecSys 2024. [Paper]

  • Bayesian Calibration of Win Rate Estimation with LLM Evaluators

    EMNLP 2024. [Paper]

  • Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

    arXiv 2024. [Paper]

  • Fusion-Eval: Integrating Assistant Evaluators with LLMs

    EMNLP 2024 (Industry Track). [Paper]

  • AIME: AI System Optimization via Multiple LLM Evaluators

    arXiv 2024. [Paper]

  • HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition

    arXiv 2024. [Paper]

  • An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge models are task-specific classifiers

    arXiv 2024. [Paper]

  • Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation

    EMNLP 2024. [Paper]

  • Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text

    arXiv 2024. [Paper]

  • PiCO: Peer Review in LLMs based on the Consistency Optimization

    arXiv 2024. [Paper]

  • Language Model Preference Evaluation with Multiple Weak Evaluators

    arXiv 2024. [Paper]

2.3 Human-AI Collaboration System

  • Collaborative Evaluation: Exploring the Synergy of Large Language Models and Humans for Open-ended Generation Evaluation

    arXiv 2023. [Paper]

  • Large language models are not fair evaluators

    arXiv 2023. [Paper]

  • Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course

    EMNLP 2024. [Paper]

  • Human-Centered Design Recommendations for LLM-as-a-judge

    arXiv 2024. [Paper]

  • Who validates the validators? aligning llm-assisted evaluation of llm outputs with human preferences

    UIST 2024. [Paper]

3. APPLICATION

3.1 General

  • DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset

    IJCNLP 2017. [Poster]

  • Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization

    EMNLP 2018. [Paper]

  • Improving LLM-based machine translation with systematic self-correction

    arXiv 2024. [Paper]

  • Fusion-Eval: Integrating Assistant Evaluators with LLMs

    EMNLP 2024. [Poster]

3.2 Multimodal

  • Llava-critic: Learning to evaluate multimodal models

    arXiv 2024. [Paper]

  • Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark

    ICML 2024. [Paper]

  • Can large language models aid in annotating speech emotional data? uncovering new frontiers

    IEEE 2024. [Paper]

  • Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach

    arXiv 2024. [Paper]

  • Calibrated self-rewarding vision language models

    NeurIPS 2024. [Paper]

  • Automated evaluation of large vision-language models on self-driving corner cases

    arXiv 2024. [Paper]

3.3 Medical

  • DOCLENS: Multi-aspect fine-grained evaluation for medical text generation

    ACL 2024. [Paper]

  • Comparing Two Model Designs for Clinical Note Generation; Is an LLM a Useful Evaluator of Consistency?

    NAACL (findings) 2024. [Paper]

  • Towards Leveraging Large Language Models for Automated Medical Q&A Evaluation

    arXiv 2024. [Paper]

  • Automatic evaluation for mental health counseling using LLMs

    arXiv 2024. [Paper]

  • Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models

    Bioinformatics 2024. [Paper]

3.4 Legal

  • Disc-lawllm: Fine-tuning large language models for intelligent legal services

    arXiv 2023. [Paper]

  • Retrieval-based Evaluation for LLMs: A Case Study in Korean Legal QA

    NLLP (Workshop) 2023. [Paper]

  • Constructing domain-specific evaluation sets for llm-as-a-judge

    customnlp4u (Workshop) 2024. [Paper]

  • Leveraging Large Language Models for Relevance Judgments in Legal Case Retrieval

    arXiv 2024. [Paper]

3.5 Financial

  • Pixiu: A large language model, instruction data and evaluation benchmark for finance

    NeurIPS 2023. [Paper]

  • GPT classifications, with application to credit lending

    Machine Learning with Applications 2024. [Paper]

  • KRX Bench: Automating Financial Benchmark Creation via Large Language Models

    FinNLP 2024. [Paper]

3.6 Education

  • Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course

    EMNLP 2024. [Paper]

  • Automated Genre-Aware Article Scoring and Feedback Using Large Language Models

    arXiv 2024. [Paper]

  • Automated Essay Scoring and Revising Based on Open-Source Large Language Models

    IEEE 2024. [Paper]

  • Is LLM a Reliable Reviewer? A Comprehensive Evaluation of LLM on Automatic Paper Reviewing Tasks.

    LREC-COLING 2024. [Paper]

  • Evaluating Mathematical Reasoning Beyond Accuracy

    COLM 2024. [Paper]

  • Debatrix: Multi-dimensional Debate Judge with Iterative Chronological Analysis Based on LLM

    ACL (findings) 2024. [Paper]

3.7 Information Retrieval

  • LLMJudge: LLMs for Relevance Judgments

    LLM4Eval 2024. [Paper]

  • Don’t Use LLMs to Make Relevance Judgments

    arXiv 2024. [Paper]

  • JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking

    arXiv 2024. [Paper]

  • Large language models as evaluators for recommendation explanations

    RecSys 2024. [Paper]

  • Ares: An automated evaluation framework for retrieval-augmented generation systems

    NAACL 2024. [Paper]

3.8 Others

  • AIME: AI System Optimization via Multiple LLM Evaluators

    arXiv 2024. [Paper]

  • CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences

    arXiv 2024. [Paper]

  • LLMs as Evaluators: A Novel Approach to Evaluate Bug Report Summarization

    arXiv 2024. [Paper]

  • Using Large Language Models to Evaluate Biomedical Query-Focused Summarisation

    Biomedical NLP (Workshop) 2024. [Paper]

  • AI can help humans find common ground in democratic deliberation

    Science 2024. [Paper]

  • Sotopia: Interactive evaluation for social intelligence in language agents

    ICLR (spotlight) 2024. [Paper]

4. META-EVALUATION

4.1 Benchmarks

4.1.1 Code Generation

  • SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    ICLR 2024 [Paper]

  • CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences

    arXiv 2024 [Paper]

  • Evaluating Large Language Models Trained on Code

    arXiv 2021 [Paper]

  • Agent-as-a-Judge: Evaluate Agents with Agents

    arXiv 2024 [Paper]

  • CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

    NeurIPS 2023 (Datasets and Benchmarks Track) [Paper]

4.1.2 Machine Translation

  • Experts, errors, and context: A large-scale study of human evaluation for machine translation

    TACL 2021 [Paper]

  • Results of the WMT21 Metrics Shared Task: Evaluating Metrics with Expert-Based Human Evaluations on TED and News Domain

    WMT 2021 [Paper]

  • Large Language Models Effectively Leverage Document-Level Context for Literary Translation, but Critical Errors Persist

    WMT 2023 [Paper]

4.1.3 Text Summarization

  • Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics

    NAACL 2021 [Paper]

  • SummEval: Re-evaluating Summarization Evaluation

    Transactions of the Association for Computational Linguistics (TACL), 2021 [Paper]

  • Opinsummeval: Revisiting automated evaluation for opinion summarization

    EMNLP 2023 [Paper]

4.1.4 Dialogue Generation

  • Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations

    INTERSPEECH 2019 [Paper]

  • Automatic evaluation and moderation of open-domain dialogue systems

    DSTC10 [Paper]

  • Personalizing Dialogue Agents: I Have a Dog, Do You Have Pets Too?

    ACL 2018 [Paper]

  • USR: An Unsupervised and Reference-Free Evaluation Metric for Dialog Generation

    ACL 2020 [Paper]

  • Overview of the Tenth Dialog System Technology Challenge: DSTC10

    IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023 [Paper]

4.1.5 Automatic Story Generation

  • OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics

    ACL 2021 [Paper]

  • Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation

    COLING 2022 [Paper]

  • Learning Personalized Story Evaluation

    ICLR 2024 [Paper]

  • Hierarchical Neural Story Generation

    ACL 2018 [Paper]

  • A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories

    NAACL 2016 [Paper]

  • StoryER: Automatic Story Evaluation via Ranking, Rating and Reasoning

    EMNLP 2022 [Paper]

4.1.6 Values Alignment

  • A general language assistant as a laboratory for alignment

    arXiv 2021 [Paper]

  • PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference

    arXiv 2024 [Paper]

  • Cvalues: Measuring the values of Chinese large language models from safety to responsibility

    arXiv 2023 [Paper]

4.1.7 Recommendation

  • Large Language Models as Evaluators for Recommendation Explanations

    RecSys 2024 [Paper]

  • Yelp Dataset Challenge: Review Rating Prediction

    arXiv 2016 [Paper]

  • The movielens datasets: History and context

    TiiS 2016 [Paper]

4.1.8 Search

  • Lecardv2: A large-scale Chinese legal case retrieval dataset

    SIGIR 2024 [Paper]

  • Overview of the TREC 2021 Deep Learning Track

    TREC 2021 [Paper]

  • Overview of the TREC 2023 NeuCLIR Track

    TREC 2023 [Paper]

  • Ms MARCO: A human generated machine reading comprehension dataset

    ICLR 2017 [Paper]

  • Overview of the TREC 2022 Deep Learning Track

    TREC 2022 [Paper]

4.1.9 Comprehensive Data

  • Length-controlled alpacaeval: A simple way to debias automatic evaluators

    COLM 2024 [Paper]

  • Helpsteer: Multi-attribute helpfulness dataset for steerlm

    NAACL 2024 [Paper]

  • ULTRAFEEDBACK: Boosting Language Models with Scaled AI Feedback

    ICML 2024 [Paper]

  • Helpsteer2-preference: Complementing ratings with preferences

    CoRR 2024 [Paper]

  • Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

    EMNLP 2023 [Paper]

  • RewardBench: Evaluating Reward Models for Language Modeling

    arXiv preprint, March 2024 [Paper]

  • FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets

    ICLR 2024 [Paper]

  • RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

    ICLR 2025 [Paper]

  • MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

    ICML 2024 [Paper]

  • MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models

    arXiv 2024 [Paper]

  • TruthfulQA: Measuring How Models Mimic Human Falsehoods

    arXiv 2021 [Paper]

  • CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution

    arXiv 2024 [Paper]

  • Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    NeurIPS 2023 [Paper]

  • WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

    arXiv 2024 [Paper]

  • JudgeBench: A Benchmark for Evaluating LLM-based Judges

    arXiv 2024 [Paper]

  • Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

    LMSYS Org 2023 [Paper]

4.2 Metric

  • Pearson correlation coefficient

    Philosophical Transactions of the Royal Society of London, 1895 [Paper]

  • Spearman’s rank correlation coefficient

    The American Journal of Psychology, 1904 [Paper]

  • Estimates of the regression coefficient based on Kendall's tau

    Journal of the American Statistical Association, 1968 [Paper]

  • The Intraclass Correlation Coefficient as a Measure of Reliability

    Psychological reports, 1966 [Paper]

  • Five ways to look at Cohen's kappa

    Journal of Psychology & Psychotherapy, 2015 [Paper]
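
For convenience, the snippet below shows how the agreement metrics listed above are typically computed between LLM-judge scores and human ratings, using scipy and scikit-learn. The score arrays are made-up placeholder data; the intraclass correlation coefficient is omitted because it needs a dedicated package (e.g., pingouin).

```python
# Computing the agreement metrics above between LLM-judge scores and human ratings.
from scipy.stats import pearsonr, spearmanr, kendalltau
from sklearn.metrics import cohen_kappa_score

human_scores = [4, 2, 5, 3, 1, 4, 2, 5]   # e.g., 1-5 human ratings (placeholder data)
judge_scores = [5, 2, 4, 3, 1, 4, 3, 5]   # e.g., 1-5 LLM-judge ratings (placeholder data)

pearson_r, _ = pearsonr(human_scores, judge_scores)      # linear correlation
spearman_rho, _ = spearmanr(human_scores, judge_scores)  # rank correlation
kendall_tau, _ = kendalltau(human_scores, judge_scores)  # pairwise-order agreement
kappa = cohen_kappa_score(human_scores, judge_scores)    # chance-corrected label agreement

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
print(f"Kendall tau:  {kendall_tau:.3f}")
print(f"Cohen kappa:  {kappa:.3f}")
```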

5. LIMITATION

5.1 Biases

5.1.1 Presentation-Related Biases

  • Large language models are not robust multiple choice selectors

    ICLR 2024 [Paper]

  • Look at the first sentence: Position bias in question answering

    EMNLP 2020 [Paper]

  • Batch calibration: Rethinking calibration for in-context learning and prompt engineering

    ICLR 2024 [Paper]

  • Large Language Models Are Zero-Shot Rankers for Recommender Systems

    ECIR 2024 [Paper]

  • Position bias in multiple-choice questions

    Journal of Marketing Research, 1984 [Paper]

  • JurEE not Judges: safeguarding llm interactions with small, specialised Encoder Ensembles

    arXiv 2024 [Paper]

  • Split and merge: Aligning position biases in large language model based evaluators

    EMNLP 2024 [Paper]

  • Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments by LLMs

    arXiv 2024 [Paper]

  • Large Language Models are not Fair Evaluators

    ACL 2024 [Paper]

  • Reducing Selection Bias in Large Language Models

    arXiv 2024 [Paper]

  • CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges

    arXiv 2024 [Paper]

  • Debating with more persuasive LLMs leads to more truthful answers

    ICML 2024 [Paper]

  • Position bias estimation for unbiased learning to rank in personal search

    WSDM 2018 [Paper]

  • Humans or LLMs as the judge? A study on judgement biases

    EMNLP 2024 [Paper]

  • Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment

    EMNLP 2024 [Paper]

  • Generative judge for evaluating alignment

    ICLR 2024 [Paper]

  • Justice or prejudice? quantifying biases in llm-as-a-judge

    SafeGenAI Workshop @ NeurIPS 2024 [Paper]
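
One mitigation that recurs in the position-bias papers above (e.g., the "fair evaluators" line of work) is to query a pairwise judge twice with the answer order swapped and keep only verdicts that agree across both orders. The sketch below illustrates that idea; `pairwise_judge` is a hypothetical placeholder for a judge call that returns "A" or "B".

```python
# Position-swap consistency check for a pairwise LLM judge (illustrative sketch).
from typing import Optional

def pairwise_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Hypothetical placeholder: ask the judge which answer is better; return 'A' or 'B'."""
    raise NotImplementedError

def swap_consistent_verdict(question: str, answer_1: str, answer_2: str) -> Optional[str]:
    """Return 'answer_1', 'answer_2', or None when the verdict flips with position."""
    first = pairwise_judge(question, answer_1, answer_2)   # answer_1 shown in slot A
    second = pairwise_judge(question, answer_2, answer_1)  # order swapped
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return None  # inconsistent across orders: treat as a tie or re-query
```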

5.1.2 Social-Related Biases

  • Benchmarking Cognitive Biases in Large Language Models as Evaluators

    Findings of ACL 2024 [Paper]

  • Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge

    SFLLM Workshop @ NeurIPS 2024 [Paper]

  • Humans or LLMs as the Judge? A Study on Judgement Bias

    EMNLP 2024 [Paper]

  • mind vs. mouth: on measuring re-judge inconsistency of social bias in large language models

    arXiv 2024 [Paper]

5.1.3 Content-Related Biases

  • Calibrate Before Use: Improving Few-Shot Performance of Language Models

    ICML 2021 [Paper]

  • Mitigating Label Biases for In-Context Learning

    ACL 2023 [Paper]

  • Prototypical Calibration for Few-Shot Learning of Language Models

    ICLR 2023 [Paper]

  • Bias Patterns in the Application of LLMs for Clinical Decision Support: A Comprehensive Study

    arXiv preprint 2024 [Paper]

  • Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering

    ICLR 2024 [Paper]

5.1.4 Cognitive-Related Biases

  • Large Language Models Can Be Easily Distracted by Irrelevant Context

    ICML 2023 [Paper]

  • Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text

    ACL ARR 2024 [Paper]

  • Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

    arXiv 2024 [Paper]

  • Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement

    ACL 2024 [Paper]

  • Humans or LLMs as the Judge? A Study on Judgement Biases

    arXiv preprint 2024 [Paper]

  • Evaluations of Self and Others: Self-Enhancement Biases in Social Judgments

    Social Cognition 1986 [Paper]

  • Benchmarking Cognitive Biases in Large Language Models as Evaluators

    Findings of ACL 2024 [Paper]

  • Debating with More Persuasive LLMs Leads to More Truthful Answers

    ICML 2024 [Paper]

5.2 Adversarial Attacks

5.2.1 Adversarial Attacks on LLMs

  • HotFlip: White-Box Adversarial Examples for Text Classification

    ACL 2018 [Paper]

  • Query-Efficient and Scalable Black-Box Adversarial Attacks on Discrete Sequential Data via Bayesian Optimization

    ICML 2022 [Paper]

  • Adv-BERT: BERT is Not Robust on Misspellings! Generating Natural Adversarial Samples on BERT

    arXiv 2020 [Paper]

  • An LLM Can Fool Itself: A Prompt-Based Adversarial Attack

    ICLR 2024 [Paper]

  • Natural Backdoor Attack on Text Data

    arXiv 2020 [Paper]

  • Ignore Previous Prompt: Attack Techniques for Language Models

    arXiv 2022 [Paper]

  • Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks

    arXiv 2023 [Paper]

  • Evaluating the Susceptibility of Pre-Trained Language Models via Handcrafted Adversarial Examples

    arXiv 2022 [Paper]

5.2.2 Adversarial Attacks on LLMs-as-Judges

  • Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

    SafeGenAI @ NeurIPS 2024 [Paper]

  • Optimization-based Prompt Injection Attack to LLM-as-a-Judge

    arXiv 2024 [Paper]

  • Finding Blind Spots in Evaluator LLMs with Interpretable Checklists

    EMNLP 2024 [Paper]

  • Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    NeurIPS 2023 Datasets and Benchmarks Track [Paper]

  • Scaling Instruction-Finetuned Language Models

    Journal of Machine Learning Research 2024 [Paper]

  • Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment

    EMNLP 2024 [Paper]

5.3 Inherent Weaknesses

5.3.1 Knowledge Recency

  • Retrieval-Augmented Generation for Large Language Models: A Survey

    arXiv 2023 [Paper]

  • Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    NeurIPS 2020 [Paper]

  • An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-Tuning

    arXiv 2023 [Paper]

  • Continual Learning for Large Language Models: A Survey

    arXiv 2024 [Paper]

  • Striking the Balance in Using LLMs for Fact-Checking: A Narrative Literature Review

    MISDOOM 2024 [Paper]

5.3.2 Hallucination

  • Striking the Balance in Using LLMs for Fact-Checking: A Narrative Literature Review

    MISDOOM 2024 [Paper]

  • Survey of Hallucination in Natural Language Generation

    ACM COMPUTING SURVEYS 2024 [Paper]

  • A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models

    arXiv 2024 [Paper]

5.3.3 Domain-Specific Knowledge Gaps

  • Knowledge Solver: Teaching LLMs to Search for Domain Knowledge from Knowledge Graphs

    arXiv 2023 [Paper]

  • Unifying Large Language Models and Knowledge Graphs: A Roadmap

    arXiv 2023 [Paper]

  • Retrieval-Augmented Generation for Large Language Models: A Survey

    arXiv 2023 [Paper]

  • Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks

    arXiv 2024 [Paper]

  • Limits to scalable evaluation at the frontier: LLM as Judge won’t beat twice the data

    arXiv 2024 [Paper]

  • Measuring the Inconsistency of Large Language Models in Ordinal Preference Formation

    KnowLLM (Workshop) 2024 [Paper]

👏 Welcome to discussion

We welcome anyone interested in this topic to reach out and join the discussion!
