
🚀 Awesome-LLMs-as-Judges


🌟 About This Repo

With the rapid development of LLMs, LLM-as-a-Judge has garnered widespread attention in both academia and industry. LLM judges can serve as flexible evaluators in fields such as text generation, question answering, and dialogue systems, and they can also drive the self-evolution and performance improvement of models. This repository aims to be a one-stop resource that helps developers, researchers, and practitioners explore how to leverage LLMs-as-Judges effectively.
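
To make the idea concrete, here is a minimal sketch of a pointwise LLM judge: the model receives a rubric plus a candidate answer and returns a 1–5 score. This is only an illustration, not the method of any particular paper; `call_llm` is a hypothetical placeholder for whatever chat-completion client you use, and the prompt wording is an assumption.

```python
# Minimal LLM-as-a-Judge sketch (illustrative only).
import re

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for an actual chat-completion API call."""
    raise NotImplementedError

JUDGE_PROMPT = """You are an impartial evaluator.

Question:
{question}

Candidate answer:
{answer}

Rate the answer for helpfulness, correctness, and clarity.
Reply with a single integer from 1 (worst) to 5 (best)."""

def judge_response(question: str, answer: str) -> int:
    """Ask the judge LLM for a 1-5 score and parse the first digit it returns."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Could not parse a score from the judge reply: {reply!r}")
    return int(match.group())
```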

This repo includes the papers discussed in our latest survey paper:

📝 LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods.

We will continuously track the latest developments in LLMs-as-Judges and regularly update the repository with the newest related papers. If you find this repository helpful, please give us a ⭐!

If you notice any work we've missed, please feel free to submit a pull request or contact us by email at liht22@mails.tsinghua.edu.cn.

We will keep updating both the repository and our paper. Discussions and contributions are welcome!

📚 Daily Papers on LLMs-as-Judges

Daily Papers on LLMs-as-Judges includes the latest paper titles and abstracts related to LLMs-as-Judges on arXiv, with information available in both English and Chinese.
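
For readers who want to build a similar feed, the sketch below pulls recent papers from the public arXiv API (`http://export.arxiv.org/api/query`). The search string, result count, and output fields are illustrative assumptions; this is not the actual script behind the Daily Papers page.

```python
# Illustrative sketch: fetch recent LLMs-as-Judges papers from the public arXiv API.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

def fetch_recent_judge_papers(max_results: int = 20):
    query = urllib.parse.urlencode({
        "search_query": 'all:"LLM-as-a-judge" OR all:"LLMs as judges"',  # assumed query
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    })
    url = f"http://export.arxiv.org/api/query?{query}"
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    papers = []
    for entry in feed.findall("atom:entry", ATOM_NS):
        papers.append({
            "title": " ".join(entry.findtext("atom:title", "", ATOM_NS).split()),
            "link": entry.findtext("atom:id", "", ATOM_NS),
            "published": entry.findtext("atom:published", "", ATOM_NS),
            "abstract": " ".join(entry.findtext("atom:summary", "", ATOM_NS).split()),
        })
    return papers

if __name__ == "__main__":
    for paper in fetch_recent_judge_papers(5):
        print(paper["published"][:10], paper["title"])
```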

⚡️ Update

🔥🔥 News: 2024/12/20: We have updated Daily Papers on LLMs-as-Judges, which automatically retrieves and updates daily papers from arXiv related to LLMs-as-Judges.

🔥🔥 News: 2024/12/14: We compiled papers related to LLMs-as-Judges presented at NeurIPS 2024.

🔥🔥 News: 2024/12/10: We released the first version of the full paper.

🔥🔥 News: 2024/11/10: We completed the foundational work for the project and structured the framework.

📖 Cite Our Work

If you find our work useful, please give us a star and cite our work:

@misc{li2024llmsasjudgescomprehensivesurveyllmbased,
      title={LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods}, 
      author={Haitao Li and Qian Dong and Junjie Chen and Huixue Su and Yujia Zhou and Qingyao Ai and Ziyi Ye and Yiqun Liu},
      year={2024},
      eprint={2412.05579},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.05579}, 
}

📚 Overview of Awesome-LLMs-as-Judges

[Figures: overview, limitations, and framework of the survey]

📑 PaperList

1. Functionality

1.1 Performance Evaluation

1.1.1 Response Evaluation

  • Llm-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models

    ACL 2023. [Paper]

  • Automated Genre-Aware Article Scoring and Feedback Using Large Language Models

    arXiv 2024. [Paper]

  • Is LLM a Reliable Reviewer? A Comprehensive Evaluation of LLM on Automatic Paper Reviewing Tasks

    LREC-COLING 2024. [Paper]

  • Ares: An automated evaluation framework for retrieval-augmented generation systems

    NAACL 2024. [Paper]

  • Self-rag: Learning to retrieve, generate, and critique through self-reflection

    ICLR 2024. [Paper]

  • RecExplainer: Aligning Large Language Models for Explaining Recommendation Models

    KDD 2024. [Paper]

1.1.2 Model Evaluation

  • Judging llm-as-a-judge with mt-bench and chatbot arena

    NeurIPS 2023. [Paper]

  • Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions

    arXiv 2024. [Paper]

  • VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

    arXiv 2024. [Paper]

  • Benchmarking foundation models with language-model-as-an-examiner

    NeurIPS 2023. [Paper]

  • Kieval: A knowledge-grounded interactive evaluation framework for large language models

    ACL 2024. [Paper]

1.2 Model Enhancement

1.2.1 Reward Modeling During Training

  • Self-rewarding language models

    ICML 2024. [Paper]

  • Direct language model alignment from online ai feedback

    arXiv 2024. [Paper]

  • Rlaif: Scaling reinforcement learning from human feedback with ai feedback

    arXiv 2024. [Paper]

  • Enhancing Reinforcement Learning with Dense Rewards from Language Model Critic

    EMNLP 2024. [Paper]

  • Cream: Consistency regularized self-rewarding language models

    arXiv 2024. [Paper]

  • The perfect blend: Redefining RLHF with mixture of judges

    arXiv 2024. [Paper]

  • Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs

    EMNLP (findings) 2023. [Paper]

1.2.2 Acting as Verifier During Inference

  • Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment

    arXiv 2024. [Paper]

  • Fast Best-of-N Decoding via Speculative Rejection

    NeurIPS 2024. [Paper]

  • Tree of thoughts: Deliberate problem solving with large language models

    NeurIPS 2023. [Paper]

  • Graph of thoughts: Solving elaborate problems with large language models

    AAAI 2024. [Paper]

  • Let’s verify step by step

    ICLR 2024. [Paper]

  • Self-evaluation guided beam search for reasoning

    NeurIPS 2024. [Paper]

  • Rationale-Aware Answer Verification by Pairwise Self-Evaluation

    arXiv 2024. [Paper]

  • Creative Beam Search: LLM-as-a-Judge for Improving Response Generation.

    ICCC 2024. [Paper]

1.2.3 Feedback for Refinement

  • Self-refine: Iterative refinement with self-feedback

    NeurIPS 2023. [Paper]

  • Teaching large language models to self-debug

    arXiv 2023. [Paper]

  • Refiner: Reasoning feedback on intermediate representations

    EACL 2024. [Paper]

  • Towards reasoning in large language models via multi-agent peer review collaboration

    arXiv 2023. [Paper]

  • Large language models cannot self-correct reasoning yet

    ICLR 2024. [Paper]

  • LLMs cannot find reasoning errors, but can correct them!

    ACL (findings) 2024. [Paper]

  • Can large language models really improve by self-critiquing their own plans?

    NeurIPS (Workshop) 2023. [Paper]

1.3 Data Collection

1.3.1 Data Annotation

  • If in a Crowdsourced Data Annotation Pipeline, a GPT-4

    CHI 2024. [Paper]

  • ChatGPT outperforms crowd workers for text-annotation tasks

    PNAS 2023. [Paper]

  • ChatGPT-4 outperforms experts and crowd workers in annotating political Twitter messages with zero-shot learning

    arXiv 2023. [Paper]

  • Fullanno: A data engine for enhancing image comprehension of MLLMs

    arXiv 2024. [Paper]

  • Can large language models aid in annotating speech emotional data? Uncovering new frontiers

    IEEE 2024. [Paper]

  • Annollm: Making large language models to be better crowdsourced annotators

    NAACL 2024. [Paper]

  • LLMAAA: Making large language models as active annotators

    EMNLP (findings) 2023. [Paper]

1.3.2 Data Synthesis

  • Selfee: Iterative self-revising LLM empowered by self-feedback generation

    Blog post 2023. [Blog]

  • Self-Boosting Large Language Models with Synthetic Preference Data

    arXiv 2024. [Paper]

  • The fellowship of the LLMs: Multi-agent workflows for synthetic preference optimization dataset generation

    arXiv 2024. [Paper]

  • Self-consistency improves chain of thought reasoning in language models

    ICLR 2023. [Paper]

  • WizardLM: Empowering large language models to follow complex instructions

    ICLR 2024. [Paper]

  • Automatic Instruction Evolving for Large Language Models

    EMNLP 2024. [Paper]

  • STaR: Self-taught reasoner bootstrapping reasoning with reasoning

    NeurIPS 2022. [Paper]

  • Beyond human data: Scaling self-training for problem-solving with language models

    arXiv 2023. [Paper]

  • SAFER-INSTRUCT: Aligning Language Models with Automated Preference Data

    NAACL 2024. [Paper]

  • soda-eval: open-domain dialogue evaluation in the age of llms

    EMNLP (findings) 2024. [Paper]

2. METHODOLOGY

2.1 Single-LLM System

2.1.1 Prompt-based

2.1.1.1 In-Context Learning

  • A systematic survey of prompt engineering in large language models: Techniques and applications

    arXiv 2024. [Paper]

  • A survey on in-context learning

    arXiv 2022. [Paper]

  • Gptscore: Evaluate as you desire

    arXiv 2023. [Paper]

  • Llm-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models

    NLP4ConvAI 2023. [Paper]

  • TALEC: Teach Your LLM to Evaluate in Specific Domain with In-house Criteria by Criteria Division and Zero-shot Plus Few-shot

    arXiv 2024. [Paper]

  • Multi-dimensional evaluation of text summarization with in-context learning

    ACL 2023 (Findings). [Paper]

  • Calibrate before use: Improving few-shot performance of language models

    ICML 2021. [Paper]

  • Prototypical calibration for few-shot learning of language models

    arXiv 2022. [Paper]

  • Mitigating label biases for in-context learning

    ACL 2023. [Paper]

  • ALLURE: auditing and improving llm-based evaluation of text using iterative in-context-learning

    arXiv 2023. [Paper]

  • Can Many-Shot In-Context Learning Help Long-Context LLM Judges? See More, Judge Better!

    arXiv 2024. [Paper]

2.1.1.2 Step-by-step

  • Chain-of-thought prompting elicits reasoning in large language models

    NeurIPS 2022. [Paper]

  • Little giants: Exploring the potential of small llms as evaluation metrics in summarization in the eval4nlp 2023 shared task

    Eval4NLP 2023. [Paper]

  • G-eval: Nlg evaluation using gpt-4 with better human alignment

    arXiv 2023. [Paper]

  • ICE-Score: Instructing Large Language Models to Evaluate Code

    EACL 2024 (findings). [Paper]

  • ProtocoLLM: Automatic Evaluation Framework of LLMs on Domain-Specific Scientific Protocol Formulation Tasks

    arXiv 2024. [Paper]

  • A closer look into automatic evaluation using large language models

    EMNLP 2023 (findings). [Paper]

  • FineSurE: Fine-grained summarization evaluation using LLMs

    ACL 2024. [Paper]

  • Split and merge: Aligning position biases in large language model based evaluators

    arXiv 2023. [Paper]

2.1.1.3 Definition Augmentation

  • Can LLM be a Personalized Judge?

    arXiv 2024. [Paper]

  • Biasalert: A plug-and-play tool for social bias detection in llms

    arXiv 2024. [Paper]

  • LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation

    arXiv 2024. [Paper]

  • Unveiling Context-Aware Criteria in Self-Assessing LLMs

    arXiv 2024. [Paper]

  • Calibrating llm-based evaluator

    arXiv 2023. [Paper]

2.1.1.4 Multi-turn Optimization

  • Large Language Models Are Active Critics in NLG Evaluation

    arXiv 2024. [Paper]

  • Kieval: A knowledge-grounded interactive evaluation framework for large language models

    ACL 2024. [Paper]

  • Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions

    arXiv 2024. [Paper]

  • Benchmarking foundation models with language-model-as-an-examiner

    NeurIPS 2023 (Datasets and Benchmarks). [Paper]

  • VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

    arXiv 2024. [Paper]

2.1.2 Tuning-based

  • On the limitations of fine-tuned judge models for llm evaluation

    arXiv 2024. [Paper]

2.1.2.1 Score-based Tuning

  • Adaptation with self-evaluation to improve selective prediction in llms

    EMNLP 2023 (Findings). [Paper]

  • Learning personalized story evaluation

    arXiv 2023. [Paper]

  • Improving Model Factuality with Fine-grained Critique-based Evaluator

    arXiv 2024. [Paper]

  • Ares: An automated evaluation framework for retrieval-augmented generation systems

    NAACL 2024. [Paper]

  • PHUDGE: Phi-3 as Scalable Judge

    arXiv 2024. [Paper]

  • Self-Judge: Selective Instruction Following with Alignment Self-Evaluation

    arXiv 2024. [Paper]

  • Automatic evaluation of attribution by large language models

    EMNLP 2023 (Findings). [Paper]

  • Sorry-bench: Systematically evaluating large language model safety refusal behaviors

    arXiv 2024. [Paper]

  • Tigerscore: Towards building explainable metric for all text generation tasks

    TMLR 2024. [Paper]

2.1.2.2 Preference-based Learning

  • Beyond Scalar Reward Model: Learning Generative Judge from Preference Data

    arXiv 2024. [Paper]

  • Prometheus: Inducing fine-grained evaluation capability in language models

    ICLR 2024. [Paper]

  • Prometheus 2: An open source language model specialized in evaluating other language models

    arXiv 2024. [Paper]

  • FedEval-LLM: Federated Evaluation of Large Language Models on Downstream Tasks with Collective Wisdom

    arXiv 2024. [Paper]

  • Self-rationalization improves LLM as a fine-grained judge

    arXiv 2024. [Paper]

  • Foundational autoraters: Taming large language models for better automatic evaluation

    arXiv 2024. [Paper]

  • Self-taught evaluators

    arXiv 2024. [Paper]

  • Pandalm: An automatic evaluation benchmark for llm instruction tuning optimization

    ICLR 2024. [Paper]

  • CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution

    arXiv 2024. [Paper]

  • Direct preference optimization: Your language model is secretly a reward model

    NeurIPS 2023. [Paper]

  • Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge

    arXiv 2024. [Paper]

  • Judgelm: Fine-tuned large language models are scalable judges

    arXiv 2023. [Paper]

  • INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback

    EMNLP 2023. [Paper]

  • Generative judge for evaluating alignment

    arXiv 2023. [Paper]

  • Shepherd: A critic for language model generation

    arXiv 2023. [Paper]

  • X-eval: Generalizable multi-aspect text evaluation via augmented instruction tuning with auxiliary evaluation aspects

    NAACL 2024. [Paper]

  • Themis: A reference-free nlg evaluation language model with flexibility and interpretability

    EMNLP 2024. [Paper]

  • Critiquellm: Towards an informative critique generation model for evaluation of large language model generation

    ACL 2024. [Paper]

  • Mitigating the Bias of Large Language Model Evaluation

    arXiv 2024. [Paper]

  • Halu-j: Critique-based hallucination judge

    arXiv 2024. [Paper]

  • Prometheusvision: Vision-language model as a judge for fine-grained evaluation

    ICLR 2024 (Workshop). [Paper]

  • Llava-critic: Learning to evaluate multimodal models

    arXiv 2024. [Paper]

2.1.3 Post-processing

2.1.3.1 Probability Calibration

  • Efficient LLM Comparative Assessment: a Product of Experts Framework for Pairwise Comparisons

    arXiv 2024. [Paper]

  • Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments

    arXiv 2024. [Paper]

  • Language Models can Evaluate Themselves via Probability Discrepancy

    ACL 2024 (Findings). [Paper]

  • Mitigating biases for instruction-following language models via bias neurons elimination

    ACL 2024. [Paper]

2.1.3.2 Text Reprocessing

  • Evaluation metrics in the era of GPT-4: reliably evaluating large language models on sequence to sequence tasks

    EMNLP 2023. [Paper]

  • Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena

    arXiv 2024. [Paper]

  • Consolidating Ranking and Relevance Predictions of Large Language Models through Post-Processing

    arXiv 2024. [Paper]

  • RevisEval: Improving LLM-as-a-Judge via Response-Adapted References

    arXiv 2024. [Paper]

  • Generative judge for evaluating alignment

    arXiv 2023. [Paper]

  • Self-evaluation improves selective generation in large language models

    NeurIPS 2023 (Workshops). [Paper]

  • AI can help humans find common ground in democratic deliberation

    Science 2024. [Paper]

2.2 Multi-LLM System

2.2.1 Communication

2.2.1.1 Cooperation

  • Towards reasoning in large language models via multi-agent peer review collaboration

    arXiv 2023. [Paper]

  • Wider and deeper llm networks are fairer llm evaluators

    arXiv 2023. [Paper]

  • ABSEval: An Agent-based Framework for Script Evaluation

    ACL 2024. [Paper]

2.2.1.2 Competition

  • Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debates

    arXiv 2024. [Paper]

  • A multi-llm debiasing framework

    arXiv 2024. [Paper]

  • Prd: Peer rank and discussion improve large language model based evaluations

    TMLR 2024. [Paper]

  • Chateval: Towards better llm-based evaluators through multi-agent debate

    arXiv 2023. [Paper]

  • Evaluating the Performance of Large Language Models via Debates

    arXiv 2024. [Paper]

  • Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions

    arXiv 2024. [Paper]

2.2.2 Aggregation

  • Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

    arXiv 2024. [Paper]

  • Benchmarking foundation models with language-model-as-an-examiner

    NeurIPS 2023 (Datasets and Benchmarks). [Paper]

  • Pre: A peer review based large language model evaluator

    arXiv 2024. [Paper]

  • An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation

    arXiv 2024. [Paper]

  • Large language models as evaluators for recommendation explanations

    RecSys 2024. [Paper]

  • Bayesian Calibration of Win Rate Estimation with LLM Evaluators

    EMNLP 2024. [Paper]

  • Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

    arXiv 2024. [Paper]

  • Fusion-Eval: Integrating Assistant Evaluators with LLMs

    EMNLP 2024 (Industry Track). [Paper]

  • AIME: AI System Optimization via Multiple LLM Evaluators

    arXiv 2024. [Paper]

  • HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition

    arXiv 2024. [Paper]

  • An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge models are task-specific classifiers

    arXiv 2024. [Paper]

  • Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation

    EMNLP 2024. [Paper]

  • Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text

    arXiv 2024. [Paper]

  • PiCO: Peer Review in LLMs based on the Consistency Optimization

    arXiv 2024. [Paper]

  • Language Model Preference Evaluation with Multiple Weak Evaluators

    arXiv 2024. [Paper]

2.3 Human-AI Collaboration System

  • Collaborative Evaluation: Exploring the Synergy of Large Language Models and Humans for Open-ended Generation Evaluation

    arXiv 2023. [Paper]

  • Large language models are not fair evaluators

    arXiv 2023. [Paper]

  • Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course

    EMNLP 2024. [Paper]

  • Human-Centered Design Recommendations for LLM-as-a-judge

    arXiv 2024. [Paper]

  • Who validates the validators? aligning llm-assisted evaluation of llm outputs with human preferences

    UIST 2024. [Paper]

3. APPLICATION

3.1 General

  • DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset

    IJCNLP 2017. [Poster]

  • Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization

    EMNLP 2018. [Paper]

  • Improving LLM-based machine translation with systematic self-correction

    arXiv 2024. [Paper]

  • Fusion-Eval: Integrating Assistant Evaluators with LLMs

    EMNLP 2024. [Poster]

3.2 Multimodal

  • Llava-critic: Learning to evaluate multimodal models

    arXiv 2024. [Paper]

  • Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark

    ICML 2024. [Paper]

  • Can large language models aid in annotating speech emotional data? uncovering new frontiers

    IEEE 2024. [Paper]

  • Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach

    arXiv 2024. [Paper]

  • Calibrated self-rewarding vision language models

    NeurIPS 2024. [Paper]

  • Automated evaluation of large vision-language models on self-driving corner cases

    arXiv 2024. [Paper]

3.3 Medical

  • DOCLENS: Multi-aspect fine-grained evaluation for medical text generation

    ACL 2024. [Paper]

  • Comparing Two Model Designs for Clinical Note Generation; Is an LLM a Useful Evaluator of Consistency?

    NAACL (findings) 2024. [Paper]

  • Towards Leveraging Large Language Models for Automated Medical Q&A Evaluation

    arXiv 2024. [Paper]

  • Automatic evaluation for mental health counseling using LLMs

    arXiv 2024. [Paper]

  • Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models

    Bioinformatics 2024. [Paper]

3.4 Legal

  • Disc-lawllm: Fine-tuning large language models for intelligent legal services

    arXiv 2023. [Paper]

  • Retrieval-based Evaluation for LLMs: A Case Study in Korean Legal QA

    NLLP (Workshop) 2023. [Paper]

  • Constructing domain-specific evaluation sets for llm-as-a-judge

    customnlp4u (Workshop) 2024. [Paper]

  • Leveraging Large Language Models for Relevance Judgments in Legal Case Retrieval

    arXiv 2024. [Paper]

3.5 Financial

  • Pixiu: A large language model, instruction data and evaluation benchmark for finance

    NeurIPS 2023. [Paper]

  • GPT classifications, with application to credit lending

    Machine Learning with Applications 2024. [Paper]

  • KRX Bench: Automating Financial Benchmark Creation via Large Language Models

    FinNLP 2024. [Paper]

3.6 Education

  • Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course

    EMNLP 2024. [Paper]

  • Automated Genre-Aware Article Scoring and Feedback Using Large Language Models

    arXiv 2024. [Paper]

  • Automated Essay Scoring and Revising Based on Open-Source Large Language Models

    IEEE 2024. [Paper]

  • Is LLM a Reliable Reviewer? A Comprehensive Evaluation of LLM on Automatic Paper Reviewing Tasks.

    LREC-COLING 2024. [Paper]

  • Evaluating Mathematical Reasoning Beyond Accuracy

    COLM 2024. [Paper]

  • Debatrix: Multi-dimensional Debate Judge with Iterative Chronological Analysis Based on LLM

    ACL (findings) 2024. [Paper]

3.7 Information Retrieval

  • LLMJudge: LLMs for Relevance Judgments

    LLM4Eval 2024. [Paper]

  • Don’t Use LLMs to Make Relevance Judgments

    arXiv 2024. [Paper]

  • JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking

    arXiv 2024. [Paper]

  • Large language models as evaluators for recommendation explanations

    RecSys 2024. [Paper]

  • Ares: An automated evaluation framework for retrieval-augmented generation systems

    NAACL 2024. [Paper]

3.8 Others

  • AIME: AI System Optimization via Multiple LLM Evaluators

    arXiv 2024. [Paper]

  • CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences

    arXiv 2024. [Paper]

  • LLMs as Evaluators: A Novel Approach to Evaluate Bug Report Summarization

    arXiv 2024. [Paper]

  • Using Large Language Models to Evaluate Biomedical Query-Focused Summarisation

    Biomedical NLP (Workshop) 2024. [Paper]

  • AI can help humans find common ground in democratic deliberation

    Science 2024. [Paper]

  • Sotopia: Interactive evaluation for social intelligence in language agents

    ICLR (spotlight) 2024. [Paper]

4. META-EVALUATION

4.1 Benchmarks

4.1.1 Code Generation

  • SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    ICLR 2024 [Paper]

  • CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences

    arXiv 2024 [Paper]

  • Evaluating Large Language Models Trained on Code

    arXiv 2021 [Paper]

  • Agent-as-a-Judge: Evaluate Agents with Agents

    arXiv 2024 [Paper]

  • CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

    NeurIPS 2023 (Datasets and Benchmarks Track) [Paper]

4.1.2 Machine Translation

  • Experts, errors, and context: A large-scale study of human evaluation for machine translation

    TACL 2021 [Paper]

  • Results of the WMT21 Metrics Shared Task: Evaluating Metrics with Expert-Based Human Evaluations on TED and News Domain

    WMT 2021 [Paper]

  • Large Language Models Effectively Leverage Document-Level Context for Literary Translation, but Critical Errors Persist

    WMT 2023 [Paper]

4.1.3 Text Summarization

  • Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics

    NAACL 2021 [Paper]

  • SummEval: Re-evaluating Summarization Evaluation

    Transactions of the Association for Computational Linguistics (TACL), 2021 [Paper]

  • Opinsummeval: Revisiting automated evaluation for opinion summarization

    EMNLP 2023 [Paper]

4.1.4 Dialogue Generation

  • Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations

    INTERSPEECH 2019 [Paper]

  • Automatic evaluation and moderation of open-domain dialogue systems

    DSTC10 [Paper]

  • Personalizing Dialogue Agents: I Have a Dog, Do You Have Pets Too?

    ACL 2018 [Paper]

  • USR: An Unsupervised and Reference-Free Evaluation Metric for Dialog Generation

    ACL 2020 [Paper]

  • Overview of the Tenth Dialog System Technology Challenge: DSTC10

    IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023 [Paper]

4.1.5 Automatic Story Generation

  • OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics

    ACL 2021 [Paper]

  • Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation

    COLING 2022 [Paper]

  • Learning Personalized Story Evaluation

    ICLR 2024 [Paper]

  • Hierarchical Neural Story Generation

    ACL 2018 [Paper]

  • A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories

    NAACL 2016 [Paper]

  • StoryER: Automatic Story Evaluation via Ranking, Rating and Reasoning

    EMNLP 2022 [Paper]

4.1.6 Values Alignment

  • A general language assistant as a laboratory for alignment

    arXiv 2021 [Paper]

  • PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference

    arXiv 2024 [Paper]

  • Cvalues: Measuring the values of Chinese large language models from safety to responsibility

    arXiv 2023 [Paper]

4.1.7 Recommendation

  • Large Language Models as Evaluators for Recommendation Explanations

    RecSys 2024 [Paper]

  • Yelp Dataset Challenge: Review Rating Prediction

    arXiv 2016 [Paper]

  • The movielens datasets: History and context

    TiiS 2016 [Paper]

4.1.8 Search

  • Lecardv2: A large-scale Chinese legal case retrieval dataset

    SIGIR 2024 [Paper]

  • Overview of the TREC 2021 Deep Learning Track

    TREC 2021 [Paper]

  • Overview of the TREC 2023 NeuCLIR Track

    TREC 2023 [Paper]

  • Ms MARCO: A human generated machine reading comprehension dataset

    ICLR 2017 [Paper]

  • Overview of the TREC 2022 Deep Learning Track

    TREC 2022 [Paper]

4.1.9 Comprehensive Data

  • Length-controlled alpacaeval: A simple way to debias automatic evaluators

    COLM 2024 [Paper]

  • Helpsteer: Multi-attribute helpfulness dataset for steerlm

    NAACL 2024 [Paper]

  • ULTRAFEEDBACK: Boosting Language Models with Scaled AI Feedback

    ICML 2024 [Paper]

  • Helpsteer2-preference: Complementing ratings with preferences

    CoRR 2024 [Paper]

  • Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

    EMNLP 2023 [Paper]

  • RewardBench: Evaluating Reward Models for Language Modeling

    arXiv preprint, March 2024 [Paper]

  • FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets

    ICLR 2024 [Paper]

  • RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

    ICLR 2025 [Paper]

  • MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

    ICML 2024 [Paper]

  • MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models

    arXiv 2024 [Paper]

  • TruthfulQA: Measuring How Models Mimic Human Falsehoods

    arXiv 2021 [Paper]

  • CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution

    arXiv 2024 [Paper]

  • Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    NeurIPS 2023 [Paper]

  • WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

    arXiv 2024 [Paper]

  • JudgeBench: A Benchmark for Evaluating LLM-based Judges

    arXiv 2024 [Paper]

  • Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

    LMSYS Org 2023 [Paper]

4.2 Metric

  • Pearson correlation coefficient

    Philosophical Transactions of the Royal Society of London, 1895 [Paper]

  • Spearman’s rank correlation coefficient

    The American Journal of Psychology, 1904 [Paper]

  • Estimates of the regression coefficient based on Kendall's tau

    Journal of the American Statistical Association, 1968 [Paper]

  • The Intraclass Correlation Coefficient as a Measure of Reliability

    Psychological reports, 1966 [Paper]

  • Five ways to look at Cohen's kappa

    Journal of Psychology & Psychotherapy, 2015 [Paper]
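
For convenience, the snippet below shows how the agreement metrics listed above are typically computed between LLM-judge scores and human ratings, using scipy and scikit-learn. The score arrays are made-up placeholder data; the intraclass correlation coefficient is omitted because it needs a dedicated package (e.g., pingouin).

```python
# Computing the agreement metrics above between LLM-judge scores and human ratings.
from scipy.stats import pearsonr, spearmanr, kendalltau
from sklearn.metrics import cohen_kappa_score

human_scores = [4, 2, 5, 3, 1, 4, 2, 5]   # e.g., 1-5 human ratings (placeholder data)
judge_scores = [5, 2, 4, 3, 1, 4, 3, 5]   # e.g., 1-5 LLM-judge ratings (placeholder data)

pearson_r, _ = pearsonr(human_scores, judge_scores)      # linear correlation
spearman_rho, _ = spearmanr(human_scores, judge_scores)  # rank correlation
kendall_tau, _ = kendalltau(human_scores, judge_scores)  # pairwise-order agreement
kappa = cohen_kappa_score(human_scores, judge_scores)    # chance-corrected label agreement

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
print(f"Kendall tau:  {kendall_tau:.3f}")
print(f"Cohen kappa:  {kappa:.3f}")
```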

5. LIMITATION

5.1 Biases

5.1.1 Presentation-Related Biases

  • Large language models are not robust multiple choice selectors

    ICLR 2024 [Paper]

  • Look at the first sentence: Position bias in question answering

    EMNLP 2020 [Paper]

  • Batch calibration: Rethinking calibration for in-context learning and prompt engineering

    ICLR 2024 [Paper]

  • Large Language Models Are Zero-Shot Rankers for Recommender Systems

    ECIR 2024 [Paper]

  • Position bias in multiple-choice questions

    Journal of Marketing Research, 1984 [Paper]

  • JurEE not Judges: safeguarding llm interactions with small, specialised Encoder Ensembles

    arXiv 2024 [Paper]

  • Split and merge: Aligning position biases in large language model based evaluators

    EMNLP 2024 [Paper]

  • Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments by LLMs

    arXiv 2024 [Paper]

  • Large Language Models are not Fair Evaluators

    ACL 2024 [Paper]

  • Reducing Selection Bias in Large Language Models

    arXiv 2024 [Paper]

  • CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges

    arXiv 2024 [Paper]

  • Debating with more persuasive LLMs leads to more truthful answers

    ICML 2024 [Paper]

  • Position bias estimation for unbiased learning to rank in personal search

    WSDM 2018 [Paper]

  • Humans or LLMs as the judge? A study on judgement biases

    EMNLP 2024 [Paper]

  • Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment

    EMNLP 2024 [Paper]

  • Generative judge for evaluating alignment

    ICLR 2024 [Paper]

  • Justice or prejudice? quantifying biases in llm-as-a-judge

    SafeGenAI Workshop @ NeurIPS 2024 [Paper]
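
One mitigation that recurs in the position-bias papers above (e.g., the "fair evaluators" line of work) is to query a pairwise judge twice with the answer order swapped and keep only verdicts that agree across both orders. The sketch below illustrates that idea; `pairwise_judge` is a hypothetical placeholder for a judge call that returns "A" or "B".

```python
# Position-swap consistency check for a pairwise LLM judge (illustrative sketch).
from typing import Optional

def pairwise_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Hypothetical placeholder: ask the judge which answer is better; return 'A' or 'B'."""
    raise NotImplementedError

def swap_consistent_verdict(question: str, answer_1: str, answer_2: str) -> Optional[str]:
    """Return 'answer_1', 'answer_2', or None when the verdict flips with position."""
    first = pairwise_judge(question, answer_1, answer_2)   # answer_1 shown in slot A
    second = pairwise_judge(question, answer_2, answer_1)  # order swapped
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return None  # inconsistent across orders: treat as a tie or re-query
```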

5.1.2 Social-Related Biases

  • Benchmarking Cognitive Biases in Large Language Models as Evaluators

    Findings of ACL 2024 [Paper]

  • Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge

    SFLLM Workshop @ NeurIPS 2024 [Paper]

  • Humans or LLMs as the Judge? A Study on Judgement Bias

    EMNLP 2024 [Paper]

  • mind vs. mouth: on measuring re-judge inconsistency of social bias in large language models

    arXiv 2024 [Paper]

5.1.3 Content-Related Biases

  • Calibrate Before Use: Improving Few-Shot Performance of Language Models

    ICML 2021 [Paper]

  • Mitigating Label Biases for In-Context Learning

    ACL 2023 [Paper]

  • Prototypical Calibration for Few-Shot Learning of Language Models

    ICLR 2023 [Paper]

  • Bias Patterns in the Application of LLMs for Clinical Decision Support: A Comprehensive Study

    arXiv preprint 2024 [Paper]

  • Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering

    ICLR 2024 [Paper]

5.1.4 Cognitive-Related Biases

  • Large Language Models Can Be Easily Distracted by Irrelevant Context

    ICML 2023 [Paper]

  • Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text

    ACL ARR 2024 [Paper]

  • Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

    arXiv 2024 [Paper]

  • Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement

    ACL 2024 [Paper]

  • Humans or LLMs as the Judge? A Study on Judgement Biases

    arXiv preprint 2024 [Paper]

  • Evaluations of Self and Others: Self-Enhancement Biases in Social Judgments

    Social Cognition 1986 [Paper]

  • Benchmarking Cognitive Biases in Large Language Models as Evaluators

    Findings of ACL 2024 [Paper]

  • Debating with More Persuasive LLMs Leads to More Truthful Answers

    ICML 2024 [Paper]

5.2 Adversarial Attacks

5.2.1 Adversarial Attacks on LLMs

  • HotFlip: White-Box Adversarial Examples for Text Classification

    ACL 2018 [Paper]

  • Query-Efficient and Scalable Black-Box Adversarial Attacks on Discrete Sequential Data via Bayesian Optimization

    ICML 2022 [Paper]

  • Adv-BERT: BERT is Not Robust on Misspellings! Generating Natural Adversarial Samples on BERT

    arXiv 2020 [Paper]

  • An LLM Can Fool Itself: A Prompt-Based Adversarial Attack

    ICLR 2024 [Paper]

  • Natural Backdoor Attack on Text Data

    arXiv 2020 [Paper]

  • Ignore Previous Prompt: Attack Techniques for Language Models

    arXiv 2022 [Paper]

  • Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks

    arXiv 2023 [Paper]

  • Evaluating the Susceptibility of Pre-Trained Language Models via Handcrafted Adversarial Examples

    arXiv 2022 [Paper]

5.2.2 Adversarial Attacks on LLMs-as-Judges

  • Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

    SafeGenAI @ NeurIPS 2024 [Paper]

  • Optimization-based Prompt Injection Attack to LLM-as-a-Judge

    arXiv 2024 [Paper]

  • Finding Blind Spots in Evaluator LLMs with Interpretable Checklists

    EMNLP 2024 [Paper]

  • Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    NeurIPS 2023 Datasets and Benchmarks Track [Paper]

  • Scaling Instruction-Finetuned Language Models

    Journal of Machine Learning Research 2024 [Paper]

  • Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment

    EMNLP 2024 [Paper]

5.3 Inherent Weaknesses

5.3.1 Knowledge Recency

  • Retrieval-Augmented Generation for Large Language Models: A Survey

    arXiv 2023 [Paper]

  • Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    NeurIPS 2020 [Paper]

  • An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-Tuning

    arXiv 2023 [Paper]

  • Continual Learning for Large Language Models: A Survey

    arXiv 2024 [Paper]

  • Striking the Balance in Using LLMs for Fact-Checking: A Narrative Literature Review

    MISDOOM 2024 [Paper]

5.3.2 Hallucination

  • Striking the Balance in Using LLMs for Fact-Checking: A Narrative Literature Review

    MISDOOM 2024 [Paper]

  • Survey of Hallucination in Natural Language Generation

    ACM COMPUTING SURVEYS 2024 [Paper]

  • A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models

    arXiv 2024 [Paper]

5.3.3 Domain-Specific Knowledge Gaps

  • Knowledge Solver: Teaching LLMs to Search for Domain Knowledge from Knowledge Graphs

    arXiv 2023 [Paper]

  • Unifying Large Language Models and Knowledge Graphs: A Roadmap

    arXiv 2023 [Paper]

  • Retrieval-Augmented Generation for Large Language Models: A Survey

    arXiv 2023 [Paper]

  • Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks

    arXiv 2024 [Paper]

  • Limits to scalable evaluation at the frontier: LLM as Judge won’t beat twice the data

    arXiv 2024 [Paper]

  • Measuring the Inconsistency of Large Language Models in Ordinal Preference Formation

    KnowLLM (Workshop) 2024 [Paper]

👏 Welcome to discussion

We welcome anyone interested in this topic to reach out and join the discussion!
